python - Twisted DeferredList only runs its callback half the time


I'm trying to make a simple web scraper using Twisted. I have it working, but whenever I try to scrape more than a few hundred sites, it hangs indefinitely for no discernible reason. Everything seems to work, except when it stops at the end with a couple of sites left to process.

I used the tutorial here: http://technicae.cogitat.io/2008/06/async-batching-with-twisted-walkthrough.html as a blueprint.

Here is the code:

class Spider:
    """Twisted-based HTML retrieval system."""

    def __init__(self, queue, url_list):
        self.process_queue = queue
        self.start_urls = []
        for url in url_list:
            self.start_urls.append(url)

    def crawl(self):
        """Extracts information from each website in start_urls."""
        deferreds = []
        # limit to 30 concurrent requests at a time
        sem = defer.DeferredSemaphore(30)
        for url in self.start_urls:
            d = sem.run(self._crawl, url, self.process_queue)
            deferreds.append(d)
        dl = defer.DeferredList(deferreds, consumeErrors=1)
        dl.addCallback(self.finish, self.process_queue)
        dl.addCallback(self.shutdown)
        reactor.run()

    def _crawl(self, url, queue):
        d = getPage(url, timeout=10)
        d.addCallback(self.parse, url, queue)
        d.addErrback(self.parse_error, url, queue)
        return d

    def parse(self, result, url, queue):
        print 'parsing:', url
        data = {'body': result, 'url': url}
        response = Response(data['url'], data['body'])
        queue.put(response)
        return data

    def parse_error(self, result, url, queue):
        print 'errback from:', url
        data = {'body': 'error', 'url': url}
        response = Response(data['url'], data['body'])
        queue.put(response)
        return data

    def finish(self, results, queue):
        for (valid, data) in results:
            if valid:
                print 'success:', data['url']
            else:
                print 'failed:', data['url']
        finish_signal = Response('finished', 'done')
        queue.put(finish_signal)

    def shutdown(self, ignore):
        reactor.stop()

I'm running this section of code inside a larger program, hence the queue.

Any suggestions for making the DeferredList fire? Or any ideas about why it fires half the time and fails without any exceptions the other half?

It's frustrating, since it works with small numbers of URLs (1-100) but fails when scaled up. I'm new to Twisted, so I've probably messed up the errbacks, but I can't figure out what's wrong, or how to fix it...
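For anyone debugging along: one way to surface errors that consumeErrors might otherwise be swallowing is to turn on Twisted's logging and print each failure's traceback in the errback. A minimal sketch, meant as a drop-in replacement for parse_error above (the startLogging call and the getTraceback print are illustrative additions, not part of the original program):

import sys
from twisted.python import log

log.startLogging(sys.stdout)  # mirror Twisted's internal log to stdout

    def parse_error(self, result, url, queue):
        # 'result' here is a twisted.python.failure.Failure;
        # print its traceback so nothing fails silently
        print 'errback from:', url
        print result.getTraceback()
        data = {'body': 'error', 'url': url}
        queue.put(Response(data['url'], data['body']))
        return data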

Also, before anyone answers with 'use Scrapy!': I can't use Scrapy, for reasons I won't go into here. Assume this program is my last hope and must work.

Edit:

Here is the full standalone code so people can run it directly:

import sys
from twisted.internet import defer, reactor
from twisted.web.client import getPage

class SeerSpider:
    """Twisted-based HTML retrieval system."""

    def __init__(self, queue, url_list):
        self.process_queue = queue
        self.start_urls = []
        for url in url_list:
            self.start_urls.append(url)

    def crawl(self):
        """Extracts information from each website in url_list."""
        deferreds = []
        # limit to 30 concurrent requests at a time
        sem = defer.DeferredSemaphore(30)
        for url in self.start_urls:
            d = sem.run(self._crawl, url, self.process_queue)
            deferreds.append(d)
        dl = defer.DeferredList(deferreds, consumeErrors=True)
        dl.addCallback(self.finish, self.process_queue)
        dl.addCallback(self.shutdown)
        reactor.run()

    def _crawl(self, url, queue):
        d = getPage(url, timeout=10)
        d.addCallback(self.parse, url, queue)
        d.addErrback(self.parse_error, url, queue)
        return d

    def parse(self, result, url, queue):
        data = {'body': result, 'url': url}
        response = Response(data['url'], data['body'])
        print response.url
        return data

    def parse_error(self, result, url, queue):
        data = {'body': 'error', 'url': url}
        response = Response(data['url'], data['body'])
        print response.url
        return data

    def finish(self, results, queue):
        finish_signal = Response('finished', 'done')
        print finish_signal.url

    def shutdown(self, ignore):
        reactor.stop()

class Response:
    def __init__(self, url, text):
        self.url = url
        self.body = text

url_list = ['http://google.com/', 'http://example.com', 'http://facebook.com']  # these work; make the list bigger to find the bug
spider = SeerSpider(None, url_list)
spider.crawl()
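If it helps anyone reproduce the hang: any way of inflating url_list to a few hundred entries should do, for example (these hosts are arbitrary placeholders, not the URLs I actually scrape):

# arbitrary placeholder URLs just to push the list past a few hundred entries
url_list = ['http://www.example%d.com/' % i for i in range(300)]
spider = SeerSpider(None, url_list)
spider.crawl()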

It looks like you're mixing the standard library's multiprocessing library with your use of Twisted. If you're not careful with this, random things will break. For example, perhaps the reactor is satisfying some of your I/O events in one process and the rest of them in another process.

It's hard to be sure this is the problem, though, since the sample code in the question is incomplete. You might think the rest of the program is boring, but all of those boring details taken together define the behavior of the program, so they're actually pretty important to your question.
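If the larger program really does hand the spider a multiprocessing.Queue, one shape that tends to avoid trouble is to start any child processes before the reactor ever runs, so the fork never duplicates a live reactor. A rough sketch under that assumption, reusing the queue-based Spider from the first listing (the consume function and its 'finished' sentinel are invented for illustration):

from multiprocessing import Process, Queue

def consume(queue):
    # Runs in the child process; no Twisted objects live here.
    while True:
        response = queue.get()
        if response.url == 'finished':  # sentinel pushed by finish()
            break
        print 'consumed:', response.url

if __name__ == '__main__':
    url_list = ['http://example.com/']  # stand-in; use your real list
    queue = Queue()
    worker = Process(target=consume, args=(queue,))
    worker.start()   # the fork happens before the reactor starts
    spider = Spider(queue, url_list)
    spider.crawl()   # crawl() calls reactor.run() and later reactor.stop()
    worker.join()

The point of the ordering is that the only process that ever touches the reactor is the parent; the child sees nothing but the queue.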

