function - Python: Scrapy start_urls list able to handle .format()?

August 15, 2010

I want to parse a list of stocks, so I am trying to format the end of each URL in my start_urls list so that I can add just the symbol instead of the entire URL.

Spider class with start_urls built inside the stock_list method:

    class MySpider(BaseSpider):
        symbols = ["scmp"]
        name = "dozen"
        allowed_domains = ["yahoo.com"]

        def stock_list(stock):
            start_urls = []
            for symb in symbols:
                start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
            return start_urls

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            revenue = hxs.select('//td[@align="right"]')
            items = []
            for rev in revenue:
                item = DozenItem()
                item["revenue"] = rev.xpath("./strong/text()").extract()
                items.append(item)
            return items[0:3]

It all runs correctly if I get rid of stock_list and just use a simple start_urls as normal. As it currently is, it will not export more than an empty file.

Also, should I possibly try a sys.argv setup, so that I would just type the stock symbol as an argument at the command line when I run $ scrapy crawl dozen -o items.csv?

Typically the shell prints out 2015-04-25 14:50:57-0400 [dozen] DEBUG: Crawled (200) <GET http://finance.yahoo.com/q/is?s=scmp+income+statement&annual> among the log/debug printout. Currently it does not include that line, implying that the spider isn't correctly formatting the start_urls.

I would use a for loop, like this:

    class MySpider(BaseSpider):
        stock = ["scmp", "appl", "goog"]
        name = "dozen"
        allowed_domains = ["yahoo.com"]

        def stock_list(stock):
            start_urls = []
            for i in stock:
                start_urls.append("http://finance.yahoo.com/q/is?s={}".format(i))
            return start_urls

        # Runs while the class body is being executed, so the result
        # becomes the class attribute that Scrapy reads.
        start_urls = stock_list(stock)

Then assign the function call to start_urls, as I have at the bottom of the class.
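
To see why that last assignment matters, here is a minimal plain-Python sketch with no Scrapy involved. The class names WithoutAssignment and WithAssignment and the URL_TEMPLATE constant are made up for illustration. The assumption it demonstrates is ordinary Python behaviour: a class body runs top to bottom, so a function defined in it can be called further down in the same body, and only a name assigned in the class body becomes a class attribute.

```python
URL_TEMPLATE = "http://finance.yahoo.com/q/is?s={}&annual"


class WithoutAssignment(object):
    symbols = ["scmp", "goog"]

    def stock_list(symbols):
        # start_urls is a local name; it is gone once the call returns.
        start_urls = [URL_TEMPLATE.format(symb) for symb in symbols]
        return start_urls
    # stock_list is never called here, so no start_urls attribute exists.


class WithAssignment(object):
    symbols = ["scmp", "goog"]

    def stock_list(symbols):
        return [URL_TEMPLATE.format(symb) for symb in symbols]

    # Called while the class body is still executing; the returned list
    # is stored as the class attribute start_urls.
    start_urls = stock_list(symbols)


print(hasattr(WithoutAssignment, "start_urls"))
print(WithAssignment.start_urls)
```

The first class mirrors the spider in the question: the list only ever exists inside the function, so there is nothing for a crawler to pick up. The second mirrors the loop version above.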

Update

Using Scrapy 0.24:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.selector import Selector


    class MySpider(scrapy.Spider):

        symbols = ["scmp"]
        name = "yahoo"
        allowed_domains = ["yahoo.com"]

        def stock_list(symbols):
            start_urls = []
            for symb in symbols:
                start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
            return start_urls
        start_urls = stock_list(symbols)

        def parse(self, response):
            revenue = Selector(response=response).xpath('//td[@align="right"]').extract()
            print(revenue)

You may want to tweak the XPath to get exactly what you want; it seems to be pulling back a fair amount of stuff. I've tested this, though, and the scraping is working as expected.
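
On the command-line part of the question: rather than reading sys.argv directly, one option is Scrapy's -a NAME=VALUE flag for scrapy crawl, which hands the value to the spider's constructor as a keyword argument. The sketch below only covers the plain-Python half of that idea, turning a comma-separated string into URLs. The helper name urls_from_argument and the comma-separated format are assumptions made for illustration, not something Scrapy prescribes.

```python
URL_TEMPLATE = "http://finance.yahoo.com/q/is?s={}&annual"


def urls_from_argument(symbols_arg):
    """Turn a comma-separated string such as "scmp,goog" into start URLs."""
    # Split on commas and drop surrounding whitespace and empty entries.
    symbols = [s.strip() for s in symbols_arg.split(",") if s.strip()]
    return [URL_TEMPLATE.format(symb) for symb in symbols]


print(urls_from_argument("scmp, goog"))
```

A spider could then accept a symbols keyword in its __init__, call the parent constructor, and set self.start_urls = urls_from_argument(symbols). The run command would look something like scrapy crawl dozen -a symbols=scmp,goog -o items.csv. I have not tested that wiring against Scrapy 0.24 specifically, so treat it as a starting point.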
