How can I strip HTML in a string using Perl?
Assuming the code is valid HTML (no stray < or > operators):
$htmlCode =~ s|<.+?>||g;
If you need to remove only the b, h1 and br tags:
$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##g;
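To make the effect of that pattern concrete, here is a small sketch using only core Perl. The sub name strip_some_tags and the sample string are made up for illustration. It shows that the \b keeps the pattern from touching tags that merely start with "b", such as blockquote.

```perl
use strict;
use warnings;

# Hypothetical helper: remove only <b>, <h1> and <br> tags (opening or closing),
# leaving every other tag in place.
sub strip_some_tags {
    my ($html) = @_;
    $html =~ s#</?(?:b|h1|br)\b.*?>##g;
    return $html;
}

my $html = '<h1 class="t">Title</h1><p>Some <b>bold</b> text<br/>and a <blockquote>quote</blockquote></p>';
print strip_some_tags($html), "\n";
# prints: Title<p>Some bold textand a <blockquote>quote</blockquote></p>
```

Note that without the /s modifier the dot does not match a newline, so a tag whose attributes wrap onto the next line is left alone by this pattern.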
You might also want to consider the HTML::Strip module.
From perlfaq9: How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText, which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
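As a rough illustration of the HTML::Parser route, the sketch below collects only the text events and throws everything else away. It assumes HTML::Parser is installed from CPAN. The sub name strip_html and the sample input are invented for this example and are not part of the FAQ.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text content of an HTML fragment.
sub strip_html {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities such as &lt; decoded.
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    # Skip the contents of these elements as well as their tags.
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;
    return $text;
}

print strip_html('<p>1 &lt; 2 and <b>bold</b></p><!-- <b>hidden</b> -->'), "\n";
# prints: 1 < 2 and bold
```

Because only a text handler is registered, tags and comments produce no output at all, which is what makes this approach more robust than pattern matching on angle brackets.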
Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.
Here's one "simple-minded" approach, that works for most files:
#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz .
Here are some tricky cases that you should think about when picking a solution:
<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
If HTML comments include other tags, those solutions would also break on text like this:
<!-- This section commented out.
<B>You can't see me!</B>
-->
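A quick way to see how the regex approaches behave on two of the tricky cases above is to run them side by side. This is only a sketch in core Perl. The sub names strip_naive and strip_quote_aware are made up, and the trailing "text" in the first sample was added so the leftover debris is visible.

```perl
use strict;
use warnings;

# Naive pattern: everything from "<" up to the nearest ">".
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.+?>//gs;
    return $html;
}

# The perlfaq9 pattern quoted above: it steps over quoted attribute values.
sub strip_quote_aware {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my $img    = q{<IMG SRC = "foo.gif" ALT = "A > B">text};
my $script = q{<script>if (a<b && a>c)</script>};

print '[', strip_naive($img), "]\n";           # prints: [ B">text]
print '[', strip_quote_aware($img), "]\n";     # prints: [text]
print '[', strip_quote_aware($script), "]\n";  # prints: [if (ac)]
```

The quote-aware pattern fixes the ALT attribute case, but the script body is still mangled because "<b && a>" looks like a tag to any regex of this kind.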
You should definitely have a look at HTML::Restrict, which allows you to strip away or restrict the HTML tags allowed. A minimal example that strips away all HTML tags:
use HTML::Restrict;
my $hr = HTML::Restrict->new();
my $processed = $hr->process('<b>i am bold</b>'); # returns 'i am bold'
I would recommend staying away from HTML::Strip because it breaks utf8 encoding.