Regex to match all HTML tags except <p> and </p> (Perl)

Accepted answer
Score: 38

If you insist on using a regex, something like this will work in most cases:

# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

Explanation:

s{
  <             # opening angled bracket
  (?>/?)        # ratchet past optional / 
  (?:
    [^pP]       # non-p tag
    |           # ...or...
    [pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
  )
  [^>]*         # everything until closing angled bracket
  >             # closing angled bracket
 }{}gx; # replace with nothing, globally
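
A quick way to sanity-check the pattern is to run it over a few representative tags. This is only a sketch: the helper name `strip_non_p` and the sample string are made up for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): applies the regex above
# to a copy of the input and returns the result.
sub strip_non_p {
    my ($html) = @_;
    $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
    return $html;
}

# Invented sample: mixes p tags (with and without attributes),
# a tag that merely starts with "p", and upper-case P tags.
my $sample = '<div><p class="x">Hi <b>there</b></p><pre>code</pre><P>ok</P></div>';
print strip_non_p($sample), "\n";
```

The p tags survive with their attributes, while `<pre>` is removed along with the other tags; its text content stays, since only tags are matched.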

But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:

use strict;

use HTML::TokeParser;

my $parser = HTML::TokeParser->new('/some/file.html')
  or die "Could not open /some/file.html - $!";

while(my $t = $parser->get_token)
{
  # Skip start or end tags that are not "p" tags
  next  if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');

  # Print everything else normally (see HTML::TokeParser docs for explanation)
  if($t->[0] eq 'T')
  {
    print $t->[1];
  }
  else
  {
    print $t->[-1];
  }
}

HTML::Parser accepts input in the form of a file name, an open file handle, or a string (passed by reference). Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the example above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.

Score: 16

In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).

For example, this:

<HTML /
  <HEAD /
    <TITLE / > /
    <P / >

is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)

It is semantically equivalent to

<html>
  <head>
    <title>
      &gt;
    </title>
  </head>
  <body>
    <p>
      &gt;
    </p>
  </body>
</html>

But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.

Score: 14

I came up with this:

<(?!\/?p(?=>|\s.*>))\/?.*?>

x/
<           # Match open angle bracket
(?!         # Negative lookahead (Not matching and not consuming)
    \/?     # 0 or 1 /
    p           # p
    (?=     # Positive lookahead (Matching and not consuming)
    >       # > - No attributes
        |       # or
    \s      # whitespace
    .*      # anything up to 
    >       # close angle brackets - with attributes
    )           # close positive lookahead
)           # close negative lookahead
            # if we have got this far then we don't match
            # a p tag or closing p tag
            # with or without attributes
\/?         # optional close tag symbol (/)
.*?         # and anything up to
>           # first closing tag
/

This now deals with p tags, with or without attributes, and with closing p tags, but still matches pre and similar tags, with or without attributes.

It doesn't strip out attributes, but my source data does not put them in. I may change this later to do so, but this will suffice for now.
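
To see the behaviour described here, the pattern can be run over a small sample. This is a minimal sketch; the variable names and the input string are invented for illustration.

```perl
use strict;
use warnings;

# The pattern from this answer, compiled once.
my $non_p_tag = qr{<(?!\/?p(?=>|\s.*>))\/?.*?>};

# Invented sample input: pre and b tags should go, p tags should stay.
my $text = '<pre>x</pre><p>one</p><p class="a">two</p><b>three</b>';

# Substitute on a copy so the original string is left untouched.
(my $stripped = $text) =~ s/$non_p_tag//g;
print "$stripped\n";
```

The p tags are left alone, attributes included, while `<pre>` and `<b>` are matched and removed.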

Score: 4

Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:

(<[^pP].*?>|</[^pP]>)

Verbose:

(
    <               # < opening tag
        [^pP].*?    # a non-p character, then non-greedy anything
    >               # > closing tag
|                   #   ....or....
    </              # </
        [^pP]       # a non-p tag
    >               # >
)
Score: 4

I used Xetius's regex and it works fine, except for some Flex-generated tags with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working:

<(?!\/?p(?=>|\s?.*>))\/?.*?>

I'm using it to clear tags from Flex-generated HTML text, so I also added more excepted tags:

<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
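
One thing worth checking with this variant: because `\s?` may match nothing, the inner lookahead `\s?.*>` appears to succeed for any tag whose name merely starts with one of the listed names, so tags such as `<pre>` or `<img>` are kept as well. The sketch below compares it with an assumed alternative that uses a word boundary (`\b`) instead. That alternative is not from the answer, and the sample input is invented.

```perl
use strict;
use warnings;

# Pattern from the answer above.
my $original = qr{<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>};

# Assumed alternative (not from the answer): \b requires the tag name
# to end right after one of the listed names.
my $bounded = qr{<(?!\/?(?:p|a|b|i|u|br)\b)[^>]*>}i;

# Invented sample input.
my $text = '<div><p>one<br/>two</p><pre>x</pre><img src="a.png"></div>';

(my $with_original = $text) =~ s/$original//g;
(my $with_bounded  = $text) =~ s/$bounded//g;

print "$with_original\n";   # only <div> and </div> are removed
print "$with_bounded\n";    # <pre> and <img> are removed as well
```

The word-boundary version still keeps `<br/>` and `<p>`, which is the case the `\s?` change was meant to cover.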
Score: 3

Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

With all the disclaimers about using regex to parse HTML, here is a simple way to do it.

#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";

See this live demo

Reference

How to match pattern except in situations s1, s2, s3

How to match a pattern unless...

Score: 2

Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure Perl must have some off-the-shelf libraries for manipulating HTML.

Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of Perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that, as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.

So as I say, I don't really think regexps are the right tool for the job.
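
The point about unescaped angle brackets is easy to reproduce. The sketch below uses a naive "anything between < and >" pattern, chosen only for illustration and not taken from this answer, on an invented input whose attribute value contains a ">".

```perl
use strict;
use warnings;

# A naive "any tag" pattern, used here only to illustrate the problem.
my $naive_tag = qr{<[^>]*>};

# Invented input: the quoted attribute value contains an unescaped ">".
my $html = '<a title="x > y">link</a>';

(my $stripped = $html) =~ s/$naive_tag//g;

# The match stops at the ">" inside the attribute, so the rest of the
# attribute leaks into the output.
print "[$stripped]\n";   # prints: [ y">link]
```

A parser that understands quoting treats the whole start tag as one token, which is the kind of case the regex approach has to special-case by hand.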

Score: 2

Since HTML is not a regular language

HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.

Score: 1

Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:

/<\/?[^p][^>]*>/

EDIT:

But that won't match a <pre> or <param> tag, unfortunately.

This, perhaps?

/<\/?(?!p>|p )[^>]+>/

That should cover <p> tags that have attributes, too.

Score: 1

You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.

Score: 1

The original regex can be made to work with very little effort:

 <(?>/?)(?!p).+?>

The problem was that the /? (or \/?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it ensures that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.

(That said, I agree that generally parsing HTML with regexes is not the way to go.)
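
The difference between the two forms shows up on a closing p tag. The sketch below compares a backtracking version (assumed here to be what the original regex looked like) with the atomic-group version from this answer. The variable names and the `%result` hash are illustrative only.

```perl
use strict;
use warnings;

my $backtracking = qr{<\/?(?!p).+?>};      # the slash can be given back
my $atomic       = qr{<(?>\/?)(?!p).+?>};  # the slash is kept once matched

# Record, per tag, whether each pattern matches.
my %result;
for my $tag ('<b>', '</b>', '<p>', '</p>') {
    $result{$tag} = {
        backtracking => ($tag =~ $backtracking ? 1 : 0),
        atomic       => ($tag =~ $atomic       ? 1 : 0),
    };
    printf "%-5s backtracking=%d atomic=%d\n",
        $tag, $result{$tag}{backtracking}, $result{$tag}{atomic};
}
```

With the backtracking form, `</p>` still matches: the engine gives up the slash, the lookahead then sees "/" instead of "p", and `.+?` consumes "/p". The atomic group rules out that retry, so `</p>` is left alone.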

Score: 0

Try this, it should work:

/<\/?([^p](\s.+?)?|..+?)>/

Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).

/EDIT: I've added the ability to handle attributes in p tags.

Score: 0

This works for me because all the solutions above failed for other HTML tags starting with p, such as param, pre, progress, etc. It also takes care of the HTML attributes.

~(<\/?[^>]*(?<!<\/p|p)>)~ig
