R - readHTMLTable: Retrieving country names and URLs of articles related to the heads of government


I'd like to make a map of the current world presidents.

For this, I want to scrape the image of each president from Wikipedia.

The first step is getting the data from the wiki page: http://en.wikipedia.org/wiki/list_of_current_heads_of_state_and_government

I have trouble getting the country names and the president page URLs because the table has rowspans.

For the moment, my code looks like the one below, but it's not OK because of the row spanning.

    library(XML)

    u = "http://en.wikipedia.org/wiki/list_of_current_heads_of_state_and_government"
    doc = htmlParse(u)
    # Pick the third table on the page
    tb = getNodeSet(doc, "//table")[[3]]

    # Country names, read from the "State" column
    statenames <- readHTMLTable(tb)$State
    # Links to each president's article
    presidenturls <- xpathSApply(tb, "//table/tr/td[2]/a[2]/@href")

Any idea is welcome!

mat

If there is heterogeneity in the table, I don't think you can deal with the problem in a single line of code. In your case, some td cells have colspan=2, while others don't. So they can be selected and processed separately with filters like the following:

    # Rows that contain a cell spanning two columns
    nations1 <- xpathSApply(tb, "//table/tr[td[@colspan='2']]/td[1]/a/text()")
    # Rows that contain exactly three cells
    nations2 <- xpathSApply(tb, "//table/tr[count(td)=3]/td[1]/a/text()")
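To see what the two filters pick up, here is a minimal sketch on a toy table. The HTML, the country names and the links are made up for illustration and are not taken from the Wikipedia page. It also assumes two small changes to the lines above: the paths start with `.//` so they are evaluated relative to `tb`, and `xmlValue` is passed as the function so the result is a character vector rather than a list of text nodes.

```r
library(XML)

# Toy table (hypothetical data): the first row has a cell with colspan=2,
# the second row has three separate cells.
html <- '<table>
  <tr>
    <td><a href="/wiki/A">Country A</a></td>
    <td colspan="2"><a href="/wiki/H1">Head One</a></td>
  </tr>
  <tr>
    <td><a href="/wiki/B">Country B</a></td>
    <td><a href="/wiki/H2">Head Two</a></td>
    <td><a href="/wiki/H3">Head Three</a></td>
  </tr>
</table>'

doc <- htmlParse(html, asText = TRUE)
tb <- getNodeSet(doc, "//table")[[1]]

# Rows that contain a cell spanning two columns
nations1 <- xpathSApply(tb, ".//tr[td[@colspan='2']]/td[1]/a/text()", xmlValue)
# Rows that contain exactly three cells
nations2 <- xpathSApply(tb, ".//tr[count(td)=3]/td[1]/a/text()", xmlValue)

print(nations1)
print(nations2)
```

Each filter should return only the country from its own kind of row, so the two results can be processed separately and combined afterwards.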

Should you meet other types of conditions in the table, keep in mind that XPath has more predicates and functions to filter with.
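As a rough illustration of other conditions, the sketch below uses a toy table with a rowspan, which is the case the question mentions. The data is hypothetical, and the three predicates are only examples of what XPath allows, not a tested solution for the real Wikipedia table.

```r
library(XML)

# Toy table (hypothetical data): "Country A" spans two rows,
# so the second row only holds the remaining cell.
html <- '<table>
  <tr>
    <td rowspan="2"><a href="/wiki/A">Country A</a></td>
    <td><a href="/wiki/H1">Head One</a></td>
  </tr>
  <tr>
    <td><a href="/wiki/H2">Head Two</a></td>
  </tr>
  <tr>
    <td><a href="/wiki/B">Country B</a></td>
    <td><a href="/wiki/H3">Head Three</a></td>
  </tr>
</table>'

doc <- htmlParse(html, asText = TRUE)
tb <- getNodeSet(doc, "//table")[[1]]

# Attribute test: rows whose first cell carries a rowspan
spanning <- xpathSApply(tb, ".//tr[td[1][@rowspan]]/td[1]/a", xmlValue)

# count(): continuation rows, which hold a single cell
# (unname() drops the "href" names attached to attribute results)
continuation <- unname(xpathSApply(tb, ".//tr[count(td)=1]/td[1]/a/@href"))

# not(): ordinary two-cell rows without any rowspan
plain <- xpathSApply(tb, ".//tr[not(td/@rowspan)][count(td)=2]/td[1]/a", xmlValue)

print(spanning)
print(continuation)
print(plain)
```

Once the rows are split this way, a continuation row can be matched back to the country of the spanning row above it.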

