Filter out HTML tags and resolve entities in Python
Use lxml, which is the best XML/HTML library for Python.
    import lxml.html
    t = lxml.html.fromstring("...")
    t.text_content()
And if you just want to sanitize the HTML, look at the lxml.html.clean module.
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the string tags, and join them.
While I agree with Lucas that regular expressions are not all that scary, I still think that you should go with a specialized HTML parser. The HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box.
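As a rough illustration of the built-in parser mentioned above, here is a minimal sketch using the standard library's html.parser module. The names TextExtractor and strip_tags are hypothetical helpers made up for this example. It assumes you want every text node joined together, which also keeps the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode entities such as &amp;
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with entities resolved."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags("<p>Fish &amp; <b>chips</b> &lt;today&gt;</p>"))
# prints: Fish & chips <today>
```

Because the parser decodes entities itself, no separate unescaping step is needed in this sketch.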
You should also check out the Python bindings for TidyLib, which can clean up broken HTML, making the success rate of any HTML parsing much higher.
How about parsing the HTML data and extracting the data with the help of the parser?
I'd try something like the author described in chapter 8.3 of the Dive Into Python book.
If you use Django, you might also use http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags ;)
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with regex will return the string "5 " and treat "< 7</div>" as a single tag and strip it out.
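The failure described above can be reproduced in a few lines. This is only a sketch: the pattern below is an assumed "anything between angle brackets" regex, not necessarily the one from the earlier answer, and naive_strip is a hypothetical name.

```python
import re

# Naive pattern: a "<", then anything that is not ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")


def naive_strip(markup):
    """Remove everything that looks like a tag to the naive pattern."""
    return TAG_RE.sub("", markup)


result = naive_strip("<div>5 < 7</div>")
print(repr(result))  # prints: '5 '
```

The bare "<" in "5 < 7" opens a match that runs to the ">" of the closing tag, so "< 7</div>" is removed along with the real tags.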
I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html It can also resolve HTML entities.
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
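For reference, the call form mentioned above looks like this. The input string and pattern are made up for illustration. The snippet only shows where sub lives and is not a recommended way to strip HTML.

```python
import re

text = "<b>bold</b> text"  # hypothetical input

# Strings have replace(), but no sub() method.
print(hasattr(text, "sub"))  # prints: False

# The substitution function lives in the re module:
# re.sub(pattern, repl, string)
cleaned = re.sub(r"</?b>", "", text)
print(cleaned)  # prints: bold text
```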
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.