Regex to match all HTML tags except <p> and </p> (Perl)
If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
< # opening angled bracket
(?>/?) # ratchet past optional /
(?:
[^pP] # non-p tag
| # ...or...
[pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
)
[^>]* # everything until closing angled bracket
> # closing angled bracket
}{}gx; # replace with nothing, globally
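A quick way to see what this substitution keeps and removes is to run it on a small sample. This is only a sketch: the `strip_non_p` helper name and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution shown above.
sub strip_non_p {
    my ($html) = @_;
    $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
    return $html;
}

my $sample = '<div><p class="a">Hi <b>there</b></p><pre>x</pre></div>';
print strip_non_p($sample), "\n";   # prints: <p class="a">Hi there</p>x
```

Because the pattern stops at the first `>`, a comment or attribute value that contains `>` is cut short. That is one of the cases behind "most cases" above.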
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
# Skip start or end tags that are not "p" tags
next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
# Print everything else normally (see HTML::TokeParser docs for explanation)
if($t->[0] eq 'T')
{
print $t->[1];
}
else
{
print $t->[-1];
}
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
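As a rough sketch of that wrapping, the loop can be moved into a subroutine that hands each piece of output to a callback instead of printing it. The `strip_non_p_tags` name and the callback argument are assumptions made for this example, not part of HTML::TokeParser. A string is passed to the constructor as a scalar reference.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical wrapper. $source is a file name, an open file handle, or a
# reference to a string; $emit is a callback that receives each output piece.
sub strip_non_p_tags {
    my ($source, $emit) = @_;
    my $parser = HTML::TokeParser->new($source)
        or die "Could not create parser - $!";
    while (my $t = $parser->get_token) {
        # Skip start or end tags that are not "p" tags
        next if ($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p';
        # Text tokens carry their text in slot 1; every other token type
        # carries its original source text in the last slot.
        $emit->($t->[0] eq 'T' ? $t->[1] : $t->[-1]);
    }
    return;
}

my $html = '<div><p class="a">Hi <b>there</b></p></div>';
my $out  = '';
strip_non_p_tags(\$html, sub { $out .= $_[0] });
print "$out\n";
```

Collecting into a string, writing to a file handle, or printing are then just different callbacks.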
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
<HTML /
<HEAD /
<TITLE / > /
<P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
<html>
<head>
<title>
>
</title>
</head>
<body>
<p>
>
</p>
</body>
</html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
I came up with this:
<(?!\/?p(?=>|\s.*>))\/?.*?>
x/
< # Match open angle bracket
(?! # Negative lookahead (Not matching and not consuming)
\/? # 0 or 1 /
p # p
(?= # Positive lookahead (Matching and not consuming)
> # > - No attributes
| # or
\s # whitespace
.* # anything up to
> # close angle brackets - with attributes
) # close positive lookahead
) # close negative lookahead
# if we have got this far then we don't match
# a p tag or closing p tag
# with or without attributes
\/? # optional close tag symbol (/)
.*? # and anything up to
> # first closing tag
/
This will now deal with p tags with or without attributes and the closing p tags (it leaves them alone), but will match pre and similar tags, with or without attributes.
It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
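To check that behaviour, the pattern can be applied as a global substitution to a short sample. This is an illustrative sketch only: the sample string is invented, and the pattern is used exactly as written above, without the /i or /s modifiers.

```perl
use strict;
use warnings;

# The pattern from above, compiled once.
my $non_p = qr{<(?!\/?p(?=>|\s.*>))\/?.*?>};

my $text = '<div><p>one</p><pre>two</pre><p class="x">three</p></div>';
(my $stripped = $text) =~ s/$non_p//g;
print "$stripped\n";   # prints: <p>one</p>two<p class="x">three</p>
```

As written the pattern is case-sensitive, so an uppercase `<P>` is treated like any other tag and removed. Adding /i changes that.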
Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:
(<[^pP].*?>|</[^pP]>)
Verbose:
(
< # < opening tag
[^pP].*? # a non-p character, then non-greedy anything
> # > closing tag
| # ....or....
</ # </
[^pP] # a non-p tag
> # >
)
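Before relying on this pattern, it is worth checking it against a few sample tags. The sketch below (the tag list is made up for illustration) records which ones it matches.

```perl
use strict;
use warnings;

my $re   = qr{(<[^pP].*?>|</[^pP]>)};
my @tags = ('<b>', '</b>', '<p>', '</p>', '<pre>');

# Record which sample tags the pattern matches.
my %matched;
$matched{$_} = ($_ =~ $re) ? 1 : 0 for @tags;

for my $tag (@tags) {
    printf "%-6s %s\n", $tag, $matched{$tag} ? 'matched' : 'not matched';
}
```

Two results stand out. `</p>` is matched, because the `/` counts as the "non-p character" in the first alternative. `<pre>` is not matched, because it begins with p. Used for stripping, this pattern would therefore remove closing p tags and leave pre tags in place.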
I used Xetius's regex and it works fine. Except for some flex generated tags which can be :
with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working :
<(?!\/?p(?=>|\s?.*>))\/?.*?>
I'm using it to clear tags from flex generated html text so I also added more excepted tags :
<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
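One thing to watch with the optional `\s?`: the inner lookahead now succeeds for any tag whose name merely starts with one of the listed names, so tags such as `<body>` or `<pre>` are also left in place. The sketch below compares the pattern above with one possible tightening that ends the tag name with `\b`. The `$tight` pattern and the sample string are assumptions added for this example, not part of the original answer.

```perl
use strict;
use warnings;

my $loose = qr{<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>};
# Assumption: \b marks the end of the tag name, so "p" does not also cover "pre".
my $tight = qr{<(?!\/?(?:p|a|b|i|u|br)\b)[^>]*>};

my $text = '<body><p>x</p><br/><pre>y</pre><span>z</span></body>';
(my $with_loose = $text) =~ s/$loose//g;
(my $with_tight = $text) =~ s/$tight//g;

print "$with_loose\n";   # prints: <body><p>x</p><br/><pre>y</pre>z</body>
print "$with_tight\n";   # prints: <p>x</p><br/>yz
```

Both patterns are case-sensitive as written.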
Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";
Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure perl must have some off-the-shelf libraries for manipulating HTML.
Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.
So as I say, I don't really think regexps are the right tool for the job.
Since HTML is not a regular language
HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.
Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:
/<\/?[^p][^>]*>/
EDIT:
But that won't match a <pre> or <param> tag, unfortunately.
This, perhaps?
/<\/?(?!p>|p )[^>]+>/
That should cover <p> tags that have attributes, too.
You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.
The original regex can be made to work with very little effort:
<(?>/?)(?!p).+?>
The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
(That said I agree that generally parsing HTML with regexes is not the way to go).
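The difference is easiest to see side by side. In the sketch below, `$plain` is the same pattern without the atomic group. This is assumed here to be the form the question started from. The sample tags are illustrative.

```perl
use strict;
use warnings;

my $plain  = qr{</?(?!p).+?>};       # /? may give the slash back on backtracking
my $atomic = qr{<(?>/?)(?!p).+?>};   # the slash, once matched, is never released

for my $tag ('<p>', '</p>', '</b>') {
    printf "%-5s plain: %-3s atomic: %s\n", $tag,
        ($tag =~ $plain  ? 'yes' : 'no'),
        ($tag =~ $atomic ? 'yes' : 'no');
}
```

For `</p>` the plain pattern still matches: after (?!p) fails, the engine retries with `/?` matching nothing, and the assertion then sees `/` rather than `p`. The atomic version rejects `</p>` outright. Note that (?!p) also rejects every other tag whose name begins with p, such as `<pre>`.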
Try this, it should work:
/<\/?([^p](\s.+?)?|..+?)>/
Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).
/EDIT: I've added the ability to handle attributes in p tags.
This works for me because all the solutions above failed for other html tags starting with p, such as param, pre, progress, etc. It also takes care of the html attributes.
~(<\/?[^>]*(?<!<\/p|p)>)~ig