python - Why doesn't downloading a text file work correctly?
I am using Python 3.3.1. I have created a function called download_file() that downloads a file and saves it to disk.
    #!/usr/bin/python3
    # -*- coding: utf8 -*-

    import datetime
    import os
    import urllib.error
    import urllib.request


    def download_file(*urls, download_location=os.getcwd(), debugging=False):
        """Downloads the files provided as multiple url arguments.

        Provide the urls of the files to be downloaded as strings. Separate
        the files to be downloaded by a comma.

        The function downloads the files and saves them in the folder
        provided as the keyword-argument download_location. If
        download_location is not provided, the file is saved in the current
        working directory. The folder for download_location is created if it
        doesn't exist. Do not worry about the trailing slash at the end of
        download_location. The code will take care of it for you.

        If the download encounters an error it will alert you and provide
        information about the error code and error reason (if received from
        the server).

        Normal usage:
        >>> download_file('http://localhost/index.html',
        ...               'http://localhost/info.php')
        >>> download_file('http://localhost/index.html',
        ...               'http://localhost/info.php',
        ...               download_location='/home/aditya/download/test')
        >>> download_file('http://localhost/index.html',
        ...               'http://localhost/info.php',
        ...               download_location='/home/aditya/download/test/')

        In debug mode, files are not downloaded, neither is there any attempt
        to establish a connection with the server. It just prints out the
        filename and url that would have been attempted to be downloaded in
        normal mode.

        By default, debug mode is inactive. In order to activate it, you need
        to supply the keyword-argument 'debugging=True', like:
        >>> download_file('http://localhost/index.html',
        ...               'http://localhost/info.php',
        ...               debugging=True)
        >>> download_file('http://localhost/index.html',
        ...               'http://localhost/info.php',
        ...               download_location='/home/aditya/download/test',
        ...               debugging=True)

        """
        # Append a trailing slash at the end of download_location if it is
        # not already present
        if download_location[-1] != '/':
            download_location = download_location + '/'

        # Create the folder for download_location if it is not already present
        os.makedirs(download_location, exist_ok=True)

        # Other variables
        time_format = '%Y-%b-%d %H:%M:%S'  # '2000-Jan-01 22:10:00'

        # "Request headers" information for the file to be downloaded
        accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        accept_encoding = 'gzip, deflate'
        accept_language = 'en-US,en;q=0.5'
        connection = 'keep-alive'
        user_agent = ('Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) '
                      'Gecko/20100101 Firefox/20.0')
        headers = {'Accept': accept,
                   'Accept-Encoding': accept_encoding,
                   'Accept-Language': accept_language,
                   'Connection': connection,
                   'User-Agent': user_agent,
                   }

        # Loop through all the files to be downloaded
        for url in urls:
            filename = os.path.basename(url)
            if not debugging:
                try:
                    request_sent = urllib.request.Request(url, None, headers)
                    response_received = urllib.request.urlopen(request_sent)
                except urllib.error.URLError as error_encountered:
                    print(datetime.datetime.now().strftime(time_format),
                          ':', filename, '- The file could not be downloaded.')
                    if hasattr(error_encountered, 'code'):
                        print(' ' * 22, 'Error code -', error_encountered.code)
                    if hasattr(error_encountered, 'reason'):
                        print(' ' * 22, 'Reason -', error_encountered.reason)
                else:
                    read_response = response_received.read()
                    output_file = download_location + filename
                    with open(output_file, 'wb') as downloaded_file:
                        downloaded_file.write(read_response)
                    print(datetime.datetime.now().strftime(time_format),
                          ':', filename, '- Downloaded successfully.')
            else:
                print(datetime.datetime.now().strftime(time_format),
                      ': Debugging :', filename, 'would be downloaded from :\n',
                      ' ' * 21, url)
This function works for downloading PDFs, images and other formats, but it is giving trouble with text documents such as HTML files. I suspect the problem has something to do with this line near the end:
    with open(output_file, 'wb') as downloaded_file:
So, I have tried opening it in 'wt' mode as well. I have also tried to work with 'w' mode only. But that doesn't solve the problem.
The other problem might have been the encoding, so I have also included this as the second line:
    # -*- coding: utf8 -*-
But this still doesn't work. What might be the problem, and how can I make it work for both text and binary files?
An example of what doesn't work:
    >>> download_file("http://docs.python.org/3/tutorial/index.html")
When I open it in gedit, it is displayed as:
Similarly, when opened in Firefox:
The file you are downloading has been sent with gzip encoding. You can see this if you run zcat index.html: the downloaded file then appears correctly. In your code, right after the read() call, you might want to add something like this (it needs import zlib at the top of the file):
    if response_received.headers.get('Content-Encoding') == 'gzip':
        read_response = zlib.decompress(read_response, 16 + zlib.MAX_WBITS)
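To make the decompression step easier to try out without a server, here is a minimal sketch of the same idea as a standalone helper. The name decode_body and the handling of 'deflate' are my own additions for illustration, not part of the original function; only the gzip branch mirrors the snippet above.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Return body decompressed according to a Content-Encoding value.

    decode_body is a hypothetical helper name used only for this sketch.
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == 'deflate':
        try:
            # Usual case: a zlib-wrapped deflate stream.
            return zlib.decompress(body)
        except zlib.error:
            # Assumption: some servers send a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No encoding, or one this sketch does not know: return bytes unchanged.
    return body


if __name__ == '__main__':
    page = b'<html><body>hello</body></html>'
    # Simulate a gzip-encoded response body and check the round trip.
    print(decode_body(gzip.compress(page), 'gzip') == page)
```

In download_file() this would presumably be called as read_response = decode_body(read_response, response_received.headers.get('Content-Encoding')) before the file is written, so that PDFs and images (normally sent without a content encoding) pass through untouched.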
Edit:
Well, I can't say why it works on Windows (and unfortunately I don't have a Windows box to test on), but if you post a dump of the response (i.e. convert the response object to a string) that might give some insight. Presumably the server chose not to send the file with gzip encoding, but given that this code is pretty explicit about the headers, I'm not sure what would be different.
It's also worth mentioning that your headers explicitly specify that gzip and deflate are allowed (see accept_encoding). If you remove that header, you shouldn't have to worry about decompressing the response in any case.
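As a rough sketch of that second option, the request can be built from the same headers with the Accept-Encoding entry left out. The helper name build_request is hypothetical and only used here for illustration; the Request call itself is the one already used in the question.

```python
import urllib.request


def build_request(url, headers):
    """Build a Request from headers, leaving out any Accept-Encoding entry.

    build_request is a hypothetical helper name used only for this sketch.
    """
    filtered = {name: value for name, value in headers.items()
                if name.lower() != 'accept-encoding'}
    # Creating the Request object does not open a network connection.
    return urllib.request.Request(url, None, filtered)


if __name__ == '__main__':
    headers = {'Accept-Encoding': 'gzip, deflate',
               'Accept-Language': 'en-US,en;q=0.5'}
    request = build_request('http://docs.python.org/3/tutorial/index.html',
                            headers)
    # urllib stores header names capitalized as 'Accept-encoding'.
    print(request.has_header('Accept-encoding'))
```

Without that header the client no longer advertises support for compressed responses, so the bytes returned by read() can normally be written to disk as they are.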