Filter out HTML tags and resolve entities in Python

Accepted answer
Score: 39

Use lxml, which is the best XML/HTML library for Python.

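import lxml.html
# "..." stands in for your HTML string
t = lxml.html.fromstring("...")
# text_content() returns the text with tags dropped and entities resolved
t.text_content()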

And if you just want to sanitize the HTML, look at the lxml.html.clean module.

Score: 16

Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the string nodes, and join them.

Score: 6

While I agree with Lucas that regular expressions are not all that scary, I still think that you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box.
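
In current Python versions the bundled parser lives in html.parser (the module was named HTMLParser in Python 2). A minimal sketch of using it to drop tags and resolve entities might look like this. The names TextExtractor and strip_tags are made up for illustration, and the sketch does not filter out the contents of script or style elements:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser resolve references such as
        # &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)


print(strip_tags("<p>Fish &amp; chips, <b>caf&eacute;</b> style</p>"))
```

Calling close() matters here: trailing text that contains an unfinished-looking "&" can stay buffered until the parser is told the input is complete.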

You should also check out the Python bindings for TidyLib, which can clean up broken HTML, making the success rate of any HTML parsing much higher.

Score: 4

How about parsing the HTML data and extracting the data with the help of the parser?

I'd try something like what the author described in chapter 8.3 of the Dive Into Python book.

Score: 1

You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:

 <div>5 < 7</div>

Stripping the tags with a regex will return the string "5 " and treat

 < 7</div>

as a single tag and strip it out.
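
The failure is easy to reproduce. The sketch below assumes the naive pattern `<[^>]*>`, which is a common tag-stripping regex and not necessarily the one anyone in this thread proposed:

```python
import re

# Naive tag-stripping pattern (an assumption for illustration): "<", then
# any run of characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")

markup = "<div>5 < 7</div>"
stripped = TAG_RE.sub("", markup)

# The bare "<" opens a bogus "tag" that runs to the next ">", so
# "< 7</div>" is removed along with the real <div> tag.
print(repr(stripped))  # '5 '
```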

I suggest looking for already-written code that does this for you. I did a search and found this, which can also resolve HTML entities: http://zesty.ca/python/scrape.html
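
If resolving entities is the only missing piece, Python 3.4 and later also ship html.unescape in the standard library. It is unrelated to the scrape.py module linked above and does not remove tags; it is shown here only as a minimal sketch with a made-up sample string:

```python
import html

# html.unescape() resolves named and numeric character references.
text = html.unescape("Fish &amp; chips &lt;caf&eacute;&gt; &#8364;5")
print(text)  # Fish & chips <café> €5
```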

Score: 0

Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.

Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
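
To make the point about the call signature concrete, here is a small sketch. The pattern and sample text are placeholders chosen for illustration, not the ones from Lucas' answer:

```python
import re

text = "a1b22c333"

# str has no .sub() method, so text.sub(...) would raise AttributeError.
print(hasattr(text, "sub"))  # False

# The module-level function takes (pattern, repl, string), in that order.
result = re.sub(r"\d+", "-", text)
print(result)  # a-b-c-
```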

Score: 0

Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.
