python - How to get a good result from scrapy
I'm trying to scrape details from Wikipedia using Scrapy. I'm able to scrape, but I get a messy and poor result. Since I'm new to Python and Scrapy, I'm having difficulty fixing this.

Here's my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from wikipedia.items import WikipediaItem

class WikipediaSpider(BaseSpider):
    name = "wiki"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table[@id="mp-upper"]/tr')
        items = []
        for site in sites:
            item = WikipediaItem()
            item['title'] = site.select('.//a/text()').extract()
            item['link'] = site.select('.//a/@href').extract()
            item['details'] = site.select('.//p/text()').extract()
            items.append(item)
        return items
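(For reference, the WikipediaItem imported above would presumably be declared in wikipedia/items.py along these lines; this is a sketch inferred from the three fields the spider sets, using the same Scrapy 0.x API as the imports above:)

from scrapy.item import Item, Field

class WikipediaItem(Item):
    # the three fields populated in WikipediaSpider.parse
    title = Field()
    link = Field()
    details = Field()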
And the result:
2013-04-19 02:18:48+0800 [wiki] DEBUG: Scraped from <200 http://en.wikipedia.org/wiki/Main_Page>
{'details': [u' fungal species found in moist habitats in ', u'. species produces brown ', u' ', u' of varying shapes 40 millimetres (1.6\xa0in) across, , tall, thin ', u' 62 millimetres (2.4\xa0in) long, @ base of large , well-defined "bulb". stem varies in colour, whitish, pale yellow-brown, pale red-brown, pale brown , grey-brown observed. species produces unusually shaped, irregular ', u', each few thick protrusions. feature helps differentiate other species otherwise similar in appearance , ', u'. grows in ', u' association ', u', , species named. however, particular species favoured fungus unclear , may include ', u' , ', u' taxa. mushrooms grow ground, among mosses or ', u'. species first described in 2009, , within genus ', u', part of ', u' ', u'. ', u' ', u' collected shore of lake near ', u', finland. species has been recorded in sweden and, @ least in areas, relatively common. (', u')', u'recently featured: ', u'\xa0\u2013 ', u'\xa0\u2013 ', u': ', u' ', u' ', u'more anniversaries: ', u' ', u' '],
 'link': [u'/wiki/file:inocybe_saliceticola.jpg', u'/wiki/inocybe_saliceticola', u'/wiki/nordic_countries', u'/wiki/mushrooms', u'/wiki/pileus_(mycology)', u'/wiki/stipe_(mycology)', u'/wiki/spore', u'/wiki/habit_(biology)', u'/wiki/mycorrhizal', u'/wiki/willow', u'/wiki/beech', u'/wiki/alder', u'/wiki/detritus', u'/wiki/section_(botany)', u'/wiki/holotype', u'/wiki/nurmes', u'/wiki/inocybe_saliceticola', u'/wiki/thistle,_utah', u'/wiki/be_here_now_(album)', u'/wiki/sumatran_rhinoceros', u'/wiki/wikipedia:today%27s_featured_article/april_2013', u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l', u'/wiki/wikipedia:featured_articles', u'/wiki/wikipedia:recent_additions', u'/wiki/file:ezra_meeker_1921_crop.jpg', u'/wiki/ezra_meeker', u'/wiki/oregon_trail', u'/wiki/bullock_cart', u'/wiki/italy_at_the_2009_mediterranean_games', u'/wiki/2009_mediterranean_games_medal_table', u'/wiki/cossack_hetman', u'/wiki/ivan_petrizhitsky-kulaga', u'/wiki/cossacks', u'/wiki/fokus_(magazine)', u'/wiki/amir_garrett', u'/wiki/college_basketball', u'/wiki/fastball', u'/wiki/armenian_genocide', u'/wiki/karin_dialect', u'/wiki/scottish_american', u'/wiki/daniel_pennie_house', u'/wiki/wikipedia:recent_additions', u'/wiki/wikipedia:your_first_article', u'/wiki/template_talk:did_you_know', u'/wiki/slang', u'/wiki/hammer', u'/wiki/church_(building)', u'/wiki/wikipedia:today%27s_articles_for_improvement', u'/wiki/file:2013_boston_marathon_aftermath_people.jpg', u'/wiki/west_fertilizer_plant_explosion', u'/wiki/west,_texas', u'/wiki/texas', u'/wiki/moment_magnitude_scale', u'/wiki/2013_sistan_and_baluchestan_earthquake', u'/wiki/sistan_and_baluchestan_province', u'/wiki/15_april_2013_iraq_attacks', u'/wiki/boston_marathon_bombings', u'/wiki/2013_boston_marathon', u'/wiki/death_and_state_funeral_of_hugo_ch%c3%a1vez', u'/wiki/nicol%c3%a1s_maduro', u'/wiki/venezuelan_presidential_election,_2013', u'/wiki/list_of_presidents_of_venezuela', u'/wiki/adam_scott_(golfer)', u'/wiki/2013_masters_tournament', u'/wiki/government_of_india', u'/wiki/bollywood', u'/wiki/pran', u'/wiki/dadasaheb_phalke_award', u'/wiki/deaths_in_2013', u'/wiki/colin_davis', u'/wiki/maria_tallchief', u'/wiki/jonathan_winters', u'//en.wikinews.org/wiki/main_page', u'/wiki/portal:current_events', u'/wiki/april_18', u'/wiki/file:stpetes.jpg', u'/wiki/1506', u'/wiki/st._peter%27s_basilica', u'/wiki/vatican_city', u'/wiki/old_st._peter%27s_basilica', u'/wiki/1689', u'/wiki/militia_(united_states)', u'/wiki/boston', u'/wiki/1689_boston_revolt', u'/wiki/dominion_of_new_england', u'/wiki/1923', u'/wiki/new_york_yankees', u'/wiki/major_league_baseball', u'/wiki/yankee_stadium_(1923)', u'/wiki/1938', u'/wiki/superman', u'/wiki/jerry_siegel', u'/wiki/joe_shuster', u'/wiki/action_comics_1', u'/wiki/superhero', u'/wiki/comic_book', u'/wiki/1947', u'/wiki/list_of_the_largest_artificial_non-nuclear_explosions', u'/wiki/royal_navy', u'/wiki/tonne', u'/wiki/ammunition', u'/wiki/heligoland', u'/wiki/1949', u'/wiki/republic_of_ireland', u'/wiki/commonwealth_of_nations', u'/wiki/1996', u'/wiki/1996_shelling_of_qana', u'/wiki/qana', u'/wiki/operation_grapes_of_wrath', u'/wiki/united_nations_interim_force_in_lebanon', u'/wiki/april_17', u'/wiki/april_18', u'/wiki/april_19', u'/wiki/wikipedia:selected_anniversaries/april', u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l', u'/wiki/list_of_historical_anniversaries', u'/wiki/coordinated_universal_time', u'//en.wikipedia.org/w/index.php?title=main_page&action=purge'],
 'title': [u'inocybe saliceticola', u'nordic countries', u'mushrooms', u'caps', u'stems', u'spores', u'habit', u'mycorrhizal', u'willow', u'beech', u'alder', u'detritus', u'section', u'holotype', u'nurmes', u'thistle, utah', u'be here now', u'sumatran rhinoceros', u'archive', u'list of historical anniversaries', u'utc', u'reload page']}
I can't access the same page you did, but I guess the result you obtain is erratic because Wikipedia text is full of links. When you use site.select('.//p/text()'), you select only the text nodes directly under the <p> node. That means what's inside subnodes like <a href=..>text</a> isn't scraped. The link tags split the result, so you end up with a strange list.
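To see the split concretely, here is a minimal sketch (assuming the Scrapy 0.x HtmlXPathSelector API from the question; the sample HTML is made up for illustration):

from scrapy.selector import HtmlXPathSelector

body = '<p>A <a href="/wiki/Mushrooms">mushroom</a> species.</p>'
hxs = HtmlXPathSelector(text=body)

# text() matches only text nodes that are direct children of <p>,
# so the link text is skipped and the paragraph comes back in pieces:
print hxs.select('//p/text()').extract()
# [u'A ', u' species.']

# //text() matches every descendant text node, link text included:
print hxs.select('//p//text()').extract()
# [u'A ', u'mushroom', u' species.']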
If you want to retrieve every node you can use:

contents = site.select('.//p/node()').extract()
item['details'] = ''.join(contents)
That way you'll have everything that's inside the <p> tags (including the <a> tags). If you want the text without the link tags you can use strip_html(item['details']) (actually, contents = site.select('.//p//text()').extract() might work and is more XPath-oriented).
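Putting that together, the parse method from the question could look like this (an untested sketch of the //text() variant; adjust to taste):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    for site in hxs.select('//table[@id="mp-upper"]/tr'):
        item = WikipediaItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        # grab every descendant text node so links no longer split
        # the paragraph, then join the pieces into a single string
        item['details'] = ''.join(site.select('.//p//text()').extract())
        items.append(item)
    return items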