How can I read and parse the contents of a webpage in R

Accepted answer
Score: 34

I'm not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to use regex on HTML, so you will definitely want to parse this with the XML package.

Here's an example to get you started:

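library(RCurl)  # provides getURL()
library(XML)    # provides htmlTreeParse(), xpathSApply(), xmlValue()
webpage <- getURL("http://www.haaretz.com/")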
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)  
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]

This results in a character vector of mostly just webpage text (along with some JavaScript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"           
[4] "  Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()" 
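
If the leftover JavaScript is a nuisance, one option is to let XPath skip it instead of cleaning it up afterwards. The following is only a sketch: the inline `html` string is a made-up stand-in for the real page so it runs without network access, and it assumes the XML package is installed. It selects text nodes under any table while excluding anything inside `<script>` or `<style>` elements.

```r
library(XML)

# Hypothetical inline page (not the real site), so the sketch runs offline
html <- "<html><body>
<table><tr><td>  Subscribe to Print Edition </td><td>|</td></tr>
<tr><td><script>function chkSearch() { return true; }</script></td></tr>
<tr><td>Israel Time: 16:48</td></tr></table>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# Text nodes under any table, skipping those nested in <script> or <style>
txt <- xpathSApply(
  doc,
  "//table//text()[not(ancestor::script) and not(ancestor::style)]",
  xmlValue
)

# Trim surrounding whitespace, then drop empty strings and separator bars
txt <- trimws(txt)
txt <- txt[!(txt %in% c("", "|"))]
print(txt)
```

The same XPath expression should work on the `pagetree` object from the example above, though how much it helps depends on how the real page embeds its scripts.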
Score: 3

Your best bet may be the XML package; see, for example, this previous question.

Score: 2

I know you asked for R, but maybe Python + BeautifulSoup is the way forward here? You could then do your analysis in R once you have scraped the page with BeautifulSoup.
