Filter out HTML tags and resolve entities in Python
Use lxml, which is the best XML/HTML library for Python.
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
And if you just want to sanitize the HTML, look at the lxml.html.clean module.
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the string nodes, and join them.
While I agree with Lucas that regular expressions are not all that scary, I still think that you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box.
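For what it's worth, here is a minimal sketch of the built-in route, assuming Python 3 and its standard-library html.parser module. The names TextExtractor and strip_tags are made up for this illustration; they are not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True asks the parser to decode entities
        # such as &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<p>Fish &amp; <b>chips</b></p>"))  # Fish & chips
```

Note that this keeps the contents of script and style elements too, so you would probably want to skip those for real pages.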
You should also check out the Python bindings for TidyLib, which can clean up broken HTML, making the success rate of any HTML parsing much higher.
How about parsing the HTML data and extracting the data with the help of the parser?
I'd try something like what the author describes in chapter 8.3 of the Dive Into Python book.
If you use Django, you might also use the striptags filter: http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags ;)
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with a regex will return the string "5 " and treat
< 7</div>
as a single tag and strip it out.
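To see this for yourself, here is a small sketch. The pattern below is an assumption about what a typical naive tag-stripping regex looks like; it is not taken from any answer in this thread.

```python
import re

# Naive pattern (assumed for illustration): anything from "<" up to the next ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

markup = "<div>5 < 7</div>"
stripped = TAG_PATTERN.sub("", markup)

# The bare "<" is taken as the start of a tag, so "< 7</div>" disappears too.
print(repr(stripped))  # '5 '
```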
I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html It can also resolve HTML entities.
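If all you need is the entity part, and assuming Python 3.4 or later, the standard library can do it without any third-party code. This is only a sketch of that one step and has nothing to do with how scrape.py works internally.

```python
import html

raw = "Fish &amp; chips &lt;3 &#169; 2008"

# html.unescape() converts named and numeric character references to text.
decoded = html.unescape(raw)
print(decoded)  # Fish & chips <3 © 2008
```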
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.