python - How to get a good result from Scrapy


I am trying to scrape details from Wikipedia using Scrapy. I am able to scrape the page, but the result is messy and poor. Since I am new to Python and Scrapy, I am having difficulty fixing this.

Here's my code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from wikipedia.items import WikipediaItem


class WikipediaSpider(BaseSpider):
    name = "wiki"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/main_page"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One selector per row of the upper table on the main page.
        sites = hxs.select('//table[@id="mp-upper"]/tr')
        items = []
        for site in sites:
            item = WikipediaItem()
            # Each field collects every match found anywhere inside the row.
            item['title'] = site.select('.//a/text()').extract()
            item['link'] = site.select('.//a/@href').extract()
            item['details'] = site.select('.//p/text()').extract()
            items.append(item)
        return items

And this is the result:

2013-04-19 02:18:48+0800 [wiki] debug: scraped <200 http://en.wikipedia.org/wiki/main_page>  {'details': [u' fungal species found in moist habitats in ',  u'. species produces brown ',                  u' ',                   u' of varying shapes 40 millimetres (1.6\xa0in) across, , tall, thin ',                   u' 62 millimetres (2.4\xa0in) long, @ base of large , well-defined "bulb". stem varies in colour, whitish, pale yellow-brown, pale red-brown, pale brown , grey-brown observed. species produces unusually shaped, irregular ',                   u', each few thick protrusions. feature helps differentiate other species otherwise similar in appearance , ',                   u'. grows in ',                   u' association ',                   u', , species named. however, particular species favoured fungus unclear , may include ',                   u' , ',                   u' taxa. mushrooms grow ground, among mosses or ',                   u'. species first described in 2009, , within genus ',                   u', part of ',                   u' ',                   u'. ',                   u' ',                   u' collected shore of lake near ',                   u', finland. species has been recorded in sweden and, @  least in areas, relatively common. 
(',                   u')',                   u'recently featured: ',                   u'\xa0\u2013 ',                   u'\xa0\u2013 ',                   u': ',                   u' ',                   u' ',                   u'more anniversaries: ',                   u' ',                   u' '],       'link': [u'/wiki/file:inocybe_saliceticola.jpg',                u'/wiki/inocybe_saliceticola',                u'/wiki/nordic_countries',                u'/wiki/mushrooms',                u'/wiki/pileus_(mycology)',                u'/wiki/stipe_(mycology)',                u'/wiki/spore',                u'/wiki/habit_(biology)',                u'/wiki/mycorrhizal',                u'/wiki/willow',                u'/wiki/beech',                u'/wiki/alder',                u'/wiki/detritus',                u'/wiki/section_(botany)',                u'/wiki/holotype',                u'/wiki/nurmes',                u'/wiki/inocybe_saliceticola',                u'/wiki/thistle,_utah',                u'/wiki/be_here_now_(album)',                u'/wiki/sumatran_rhinoceros',                u'/wiki/wikipedia:today%27s_featured_article/april_2013',                u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l',                u'/wiki/wikipedia:featured_articles',                u'/wiki/wikipedia:recent_additions',                u'/wiki/file:ezra_meeker_1921_crop.jpg',                u'/wiki/ezra_meeker',                u'/wiki/oregon_trail',                u'/wiki/bullock_cart',                u'/wiki/italy_at_the_2009_mediterranean_games',                u'/wiki/2009_mediterranean_games_medal_table',                u'/wiki/cossack_hetman',                u'/wiki/ivan_petrizhitsky-kulaga',                u'/wiki/cossacks',                u'/wiki/fokus_(magazine)',                u'/wiki/amir_garrett',                u'/wiki/college_basketball',                 u'/wiki/fastball',                u'/wiki/armenian_genocide',                
u'/wiki/karin_dialect',                u'/wiki/scottish_american',                u'/wiki/daniel_pennie_house',                u'/wiki/wikipedia:recent_additions',                u'/wiki/wikipedia:your_first_article',                u'/wiki/template_talk:did_you_know',                u'/wiki/slang',                u'/wiki/hammer',                u'/wiki/church_(building)',                u'/wiki/wikipedia:today%27s_articles_for_improvement',                u'/wiki/file:2013_boston_marathon_aftermath_people.jpg',                u'/wiki/west_fertilizer_plant_explosion',                u'/wiki/west,_texas',                u'/wiki/texas',                u'/wiki/moment_magnitude_scale',                u'/wiki/2013_sistan_and_baluchestan_earthquake',                u'/wiki/sistan_and_baluchestan_province',                u'/wiki/15_april_2013_iraq_attacks',                u'/wiki/boston_marathon_bombings',                u'/wiki/2013_boston_marathon',                u'/wiki/death_and_state_funeral_of_hugo_ch%c3%a1vez',                u'/wiki/nicol%c3%a1s_maduro',                u'/wiki/venezuelan_presidential_election,_2013',                u'/wiki/list_of_presidents_of_venezuela',                u'/wiki/adam_scott_(golfer)',                u'/wiki/2013_masters_tournament',                u'/wiki/government_of_india',                u'/wiki/bollywood',                u'/wiki/pran',                u'/wiki/dadasaheb_phalke_award',                u'/wiki/deaths_in_2013',                u'/wiki/colin_davis',                u'/wiki/maria_tallchief',                u'/wiki/jonathan_winters',                u'//en.wikinews.org/wiki/main_page',                u'/wiki/portal:current_events',                u'/wiki/april_18',                u'/wiki/file:stpetes.jpg',                u'/wiki/1506',                u'/wiki/st._peter%27s_basilica',                u'/wiki/vatican_city',                u'/wiki/old_st._peter%27s_basilica',                u'/wiki/1689',                
u'/wiki/militia_(united_states)',                u'/wiki/boston',                u'/wiki/1689_boston_revolt',                u'/wiki/dominion_of_new_england',                u'/wiki/1923',                u'/wiki/new_york_yankees',                u'/wiki/major_league_baseball',                u'/wiki/yankee_stadium_(1923)',                u'/wiki/1938',                u'/wiki/superman',                u'/wiki/jerry_siegel',                u'/wiki/joe_shuster',                u'/wiki/action_comics_1',                u'/wiki/superhero',                u'/wiki/comic_book',                u'/wiki/1947',                u'/wiki/list_of_the_largest_artificial_non-nuclear_explosions',                u'/wiki/royal_navy',                u'/wiki/tonne',                u'/wiki/ammunition',                u'/wiki/heligoland',                u'/wiki/1949',                u'/wiki/republic_of_ireland',                u'/wiki/commonwealth_of_nations',                u'/wiki/1996',                u'/wiki/1996_shelling_of_qana',                u'/wiki/qana',                u'/wiki/operation_grapes_of_wrath',                u'/wiki/united_nations_interim_force_in_lebanon',                u'/wiki/april_17',                u'/wiki/april_18',                u'/wiki/april_19',                u'/wiki/wikipedia:selected_anniversaries/april',                u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l',                u'/wiki/list_of_historical_anniversaries',                u'/wiki/coordinated_universal_time',                u'//en.wikipedia.org/w/index.php?title=main_page&action=purge'],  'title': [u'inocybe saliceticola',   u'nordic countries',                 u'mushrooms',                 u'caps',                 u'stems',                 u'spores',                 u'habit',                 u'mycorrhizal',                 u'willow',                 u'beech',                 u'alder',                 u'detritus',                 u'section',                 u'holotype',    
             u'nurmes',                 u'thistle, utah',                 u'be here now',                 u'sumatran rhinoceros',                 u'archive'                 u'list of historical anniversaries',                 u'utc',                 u'reload page']} 

I can't access the same page you did, but the result you obtain is erratic because the Wikipedia text is full of links. When you call site.select('.//p/text()'), you select only the text nodes that sit directly under the <p> node. That means whatever is inside subnodes such as <a href=..>text</a> isn't scraped. The link tags split the paragraph into fragments, so you end up with a strange list.
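The difference between direct text nodes and all descendant text nodes can be reproduced without Scrapy. The following is only a sketch using the standard library's xml.etree.ElementTree (written for Python 3) as a stand-in for the selector; the HTML fragment and the two helper names are made up for illustration and are not part of the question's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, loosely modelled on a paragraph that contains links.
FRAGMENT = (
    '<p>A fungal species found in '
    '<a href="/wiki/Nordic_countries">Nordic countries</a>. '
    'It grows near <a href="/wiki/Willow">willow</a> trees.</p>'
)


def direct_text_nodes(p):
    """Approximate XPath ./text(): text nodes that are direct children of p."""
    nodes = [p.text] + [child.tail for child in p]
    return [t for t in nodes if t]


def all_text_nodes(p):
    """Approximate XPath .//text(): every text node beneath p, in document order."""
    return list(p.itertext())


p = ET.fromstring(FRAGMENT)
direct = direct_text_nodes(p)
everything = all_text_nodes(p)

print(direct)               # fragments with the link text missing
print("".join(everything))  # the full sentence, link text included
```

The first list has the same shape as the 'details' field in the question: pieces of a sentence with holes where the links were. Joining the second list gives readable text.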

If you want to retrieve every node, you can use:

contents = site.select('.//p/node()').extract()
item['details'] = ''.join(contents)

That way you'll have everything inside the <p> tags (including the <a> tags). If you want only the text, without the link tags, you can use strip_html(item['details']) (actually, contents = site.select('.//p//text()').extract() might work as well, and is more XPath-oriented).
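The answer does not say where strip_html comes from, so the version below is a hypothetical stand-in rather than a Scrapy function. It is a minimal sketch for Python 3 built on the standard library's html.parser: it drops the tags, keeps the text, and collapses runs of whitespace (including the non-breaking spaces that show up as \xa0 in the output above). The sample string is invented for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text content of a fragment and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Hypothetical helper: return the text of `markup` with tags removed."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join("".join(collector.parts).split())


# Invented sample, shaped like the joined result of './/p/node()'.
details = (
    'caps up to 40 millimetres (1.6&nbsp;in) across, and tall, thin '
    '<a href="/wiki/Stipe_(mycology)">stems</a>'
)
cleaned = strip_html(details)
print(cleaned)
```

In the spider this would be applied to the joined string, for example item['details'] = strip_html(''.join(contents)).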

