r - readHTMLTables -- Retrieving country names and URLs of articles related to the heads of governments
I'd like to make a map of the current world presidents.
For this, I want to scrape the image of each president from Wikipedia.
The first step is getting the data from this wiki page: http://en.wikipedia.org/wiki/list_of_current_heads_of_state_and_government
I have trouble getting the country names and the president page URLs because the table has rowspans.
For the moment, my code looks like the one below, but it's not OK because of the row spanning.
    library(XML)
    u = "http://en.wikipedia.org/wiki/list_of_current_heads_of_state_and_government"
    doc = htmlParse(u)
    # third table on the page
    tb = getNodeSet(doc, "//table")[[3]]
    statenames <- readHTMLTable(tb)$State
    # second link in the second cell of each row; misaligned on rows affected by rowspans
    presidenturls <- xpathSApply(tb, "//table/tr/td[2]/a[2]/@href")
Any ideas are welcome!
Mat
If there is heterogeneity in the table, I don't think you can deal with the problem in a single line of code. In your case, some td elements have colspan=2, while others don't. So the rows can be selected and processed separately with filters such as the following:
    # rows where a cell spans two columns
    nations1 <- xpathSApply(tb, "//table/tr[td[@colspan='2']]/td[1]/a/text()")
    # rows with exactly three cells
    nations2 <- xpathSApply(tb, "//table/tr[count(td)=3]/td[1]/a/text()")
Should you meet other types of conditions in the table, keep in mind that XPath has more filters than these two.
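As a rough sketch of how these two filters behave, here they are applied to a tiny made-up table instead of the live Wikipedia page. The HTML, the state names and the /wiki/ links are invented for illustration, and I use xmlValue and xmlGetAttr to get plain strings rather than text nodes. The paths start at //tr, so they do not depend on how the rows are nested inside the table.

```r
library(XML)

# Hypothetical miniature table (not the real Wikipedia markup):
# the first data row has three cells, the second merges two cells with colspan.
html <- '
<table>
  <tr><th>State</th><th>Head of state</th><th>Head of government</th></tr>
  <tr><td><a href="/wiki/Aland">Aland</a></td>
      <td><a href="/wiki/P1">President One</a></td>
      <td><a href="/wiki/M1">Minister One</a></td></tr>
  <tr><td><a href="/wiki/Borduria">Borduria</a></td>
      <td colspan="2"><a href="/wiki/P2">President Two</a></td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Rows that contain a cell with colspan=2
nations1 <- xpathSApply(doc, "//tr[td[@colspan='2']]/td[1]/a", xmlValue)
urls1    <- xpathSApply(doc, "//tr[td[@colspan='2']]/td[2]/a[1]", xmlGetAttr, "href")

# Rows that have exactly three td cells
nations2 <- xpathSApply(doc, "//tr[count(td)=3]/td[1]/a", xmlValue)
urls2    <- xpathSApply(doc, "//tr[count(td)=3]/td[2]/a[1]", xmlGetAttr, "href")

# Combine both groups into one data frame
result <- data.frame(state = c(nations1, nations2),
                     url   = c(urls1, urls2),
                     stringsAsFactors = FALSE)
print(result)
```

Because each filter is run separately, the combined rows come out grouped by filter, not in page order. If the order matters, a single predicate such as tr[count(td)=3 or td[@colspan='2']] would select both kinds of rows in document order.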