Python: Is the Scrapy start_urls list able to handle .format()?
I want to parse a list of stocks, so I am trying to format the end of each URL in my start_urls list so that I can add just the symbol instead of the entire URL.
Here is the spider class, with start_urls built inside the stock_list method:
    class MySpider(BaseSpider):
        symbols = ["scmp"]
        name = "dozen"
        allowed_domains = ["yahoo.com"]

        def stock_list(stock):
            start_urls = []
            for symb in symbols:
                start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
            return start_urls

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            revenue = hxs.select('//td[@align="right"]')
            items = []
            for rev in revenue:
                item = DozenItem()
                item["revenue"] = rev.xpath("./strong/text()").extract()
                items.append(item)
            return items[0:3]
It runs correctly if I get rid of stock_list and use a simple start_urls as normal, but as written it does not export anything more than an empty file.
Also, should I possibly try sys.argv, so that I can type the stock symbol as an argument at the command line when I run $ scrapy crawl dozen -o items.csv?
Typically the shell prints out 2015-04-25 14:50:57-0400 [dozen] DEBUG: Crawled (200) <GET http://finance.yahoo.com/q/is?s=scmp+income+statement&annual> among the log/debug printout, but now it does not include it, implying that start_urls isn't being formatted correctly.
I would use a for loop, like this:
    class MySpider(BaseSpider):
        stock = ["scmp", "appl", "goog"]
        name = "dozen"
        allowed_domains = ["yahoo.com"]

        def stock_list(stock):
            start_urls = []
            for i in stock:
                start_urls.append("http://finance.yahoo.com/q/is?s={}".format(i))
            return start_urls

        # Call the function while the class body runs, so the attribute exists.
        start_urls = stock_list(stock)
Then assign the result of the function call to start_urls, as I have at the bottom.
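To see why that assignment matters, here is a minimal sketch in plain Python with no Scrapy import. The class names WithoutAssignment and WithAssignment are made up for illustration. The point is only that defining stock_list in the class body does not create a start_urls attribute; something has to call it and bind the result.

```python
class WithoutAssignment(object):
    symbols = ["scmp"]

    def stock_list(symbols):
        return ["http://finance.yahoo.com/q/is?s={}".format(s) for s in symbols]
    # stock_list is defined but never called, so start_urls is never set.


class WithAssignment(object):
    symbols = ["scmp", "goog"]

    def stock_list(symbols):
        return ["http://finance.yahoo.com/q/is?s={}".format(s) for s in symbols]

    # Runs once, while the class body is being executed.
    start_urls = stock_list(symbols)


print(hasattr(WithoutAssignment, "start_urls"))
print(WithAssignment.start_urls)
```

In the question's spider the situation matches the first class: the list is built only inside a function that nothing calls, which would leave the spider with no URLs to request and is consistent with the empty export.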
Update
Using Scrapy 0.24:
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.selector import Selector


    class MySpider(scrapy.Spider):
        symbols = ["scmp"]
        name = "yahoo"
        allowed_domains = ["yahoo.com"]

        def stock_list(symbols):
            start_urls = []
            for symb in symbols:
                start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
            return start_urls

        # Build the URL list at class-definition time.
        start_urls = stock_list(symbols)

        def parse(self, response):
            revenue = Selector(response=response).xpath('//td[@align="right"]').extract()
            print(revenue)
You may want to tweak the XPath to get exactly what you want; it seems to be pulling back a fair amount of stuff. I've tested this, and the scraping is working as expected.
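On the sys.argv part of the question: Scrapy passes values given with -a on the command line to the spider's __init__ as keyword arguments, so reading sys.argv directly is usually not needed. The following is only a sketch under my own assumptions: the argument name symbols, the comma-separated format, and the helper build_start_urls are not from the original post. The helper is plain Python so it can be tried without Scrapy, and the commented lines show how it might be wired into the spider.

```python
URL_TEMPLATE = "http://finance.yahoo.com/q/is?s={}&annual"


def build_start_urls(symbols_arg):
    """Turn a comma-separated string such as "scmp,goog" into start URLs."""
    # Split on commas and drop empty or whitespace-only entries.
    symbols = [s.strip() for s in symbols_arg.split(",") if s.strip()]
    return [URL_TEMPLATE.format(s) for s in symbols]


# Assumed wiring inside the spider class (hypothetical, not run here):
#
#     def __init__(self, symbols="scmp", *args, **kwargs):
#         super(MySpider, self).__init__(*args, **kwargs)
#         self.start_urls = build_start_urls(symbols)

print(build_start_urls("scmp, goog"))
```

With that wiring in place, the crawl could be started as $ scrapy crawl yahoo -a symbols=scmp,goog -o items.csv.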