class - Python: Scrapy exports raw data instead of text() only?

April 15, 2010

I'm exporting data with this spider class:

class MySpider(BaseSpider):
    name = "dozen"
    allowed_domains = ["yahoo.com"]
    start_urls = ["http://finance.yahoo.com/q/is?s=scmp+income+statement&annual"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Selects every right-aligned cell on the page
        revenue = hxs.select('//td[@align="right"]')
        items = []
        for rev in revenue:
            item = DozenItem()
            # Stores the selector itself rather than the extracted text
            item["revenue"] = rev.xpath("./strong/text()")
            items.append(item)
        return items[:7]

and I'm getting this:

[<HtmlXPathSelector xpath='./strong/text()' data=u'\n                            115,450\xa0\xa0\n '>]

but I want 115,450.

If I add .extract() to the end of the item["revenue"] line, it exports nothing.

Here is the section of the HTML that includes what I'm trying to grab:

<tr>
  <td colspan="2"> <strong>total revenue</strong> </td>
  <td align="right"> <strong>115,450&nbsp;&nbsp;</strong> </td>
  <td align="right"> <strong>89,594&nbsp;&nbsp;</strong> </td>
  <td align="right"> <strong>81,487&nbsp;&nbsp;</strong> </td>
</tr>

You are using too broad an XPath expression for the first selection. Try this:

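def parse(self, response):
    # Go straight to the text nodes inside <strong> in right-aligned cells
    revenue = response.xpath('//td[@align="right"]/strong/text()')
    items = []
    for rev in revenue:
        item = DozenItem()
        # .re() returns a list of the substrings matching the pattern
        item["revenue"] = rev.re(r'\d*,\d*')
        items.append(item)
    return items[:3]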
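To see what the narrower path and the regular expression each contribute, here is a minimal sketch that uses only the Python standard library rather than Scrapy. It is an illustration, not the spider itself: the `ROW` constant and the `extract_revenue` helper are hypothetical names, the row is a copy of the HTML from the question, and `&nbsp;` is written as `\xa0` so that the XML parser accepts it.

```python
import re
import xml.etree.ElementTree as ET

# The table row from the question. Assumption for this sketch: &nbsp; is
# written as the literal character \xa0 so ElementTree can parse the row.
ROW = (
    "<tr>"
    '<td colspan="2"><strong>total revenue</strong></td>'
    '<td align="right"><strong>\n  115,450\xa0\xa0\n </strong></td>'
    '<td align="right"><strong>\n  89,594\xa0\xa0\n </strong></td>'
    '<td align="right"><strong>\n  81,487\xa0\xa0\n </strong></td>'
    "</tr>"
)


def extract_revenue(row_xml):
    """Return the comma-grouped numbers found in right-aligned cells."""
    root = ET.fromstring(row_xml)
    values = []
    # Narrow path: only <strong> elements directly inside right-aligned <td>s.
    for strong in root.findall("./td[@align='right']/strong"):
        # Same pattern as in the answer; it skips the surrounding newlines,
        # spaces and non-breaking spaces and keeps only the number.
        match = re.search(r"\d*,\d*", strong.text or "")
        if match:
            values.append(match.group())
    return values


print(extract_revenue(ROW))
```

Note that the pattern requires a comma, so a value below 1,000 would not match. If that matters for your data, a pattern such as `\d+(?:,\d+)*` may be a better fit.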
