python - Scrapy - Issue with xpath on an xml crawl

August 15, 2014

I'm trying to make a simple spider that grabs some XML and spits it out in a new format, as an experiment. However, it seems there is extra markup contained within the XML that is spat out. The format I want (no markup or value tag inside each field) is along the lines of this: <body>don't forget me weekend!</body>

I think I'm using XPath wrong, but I'm not sure what I'm doing wrong.

spider

from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['www.w3schools.com']
    start_urls = ['http://www.w3schools.com/xml/note.xml']
    itertag = 'note'

    def parse_node(self, response):
        xxs = XmlXPathSelector(response)
        to = xxs.select('//to')
        who = xxs.select('//from')
        heading = xxs.select('//heading')
        body = xxs.select('//body')
        return item

input

<note>
<to>tove</to>
<from>jani</from>
<heading>reminder</heading>
<body>don't forget me weekend!</body>
</note>

current (wrong) output

<?xml version="1.0" encoding="utf-8"?>
<items>
   <item>
      <body>
         <value>&lt;body&gt;don't forget me weekend!&lt;/body&gt;</value>
      </body>
      <to>
         <value>&lt;to&gt;tove&lt;/to&gt;</value>
      </to>
      <who>
         <value>&lt;from&gt;jani&lt;/from&gt;</value>
      </who>
      <heading>
         <value>&lt;heading&gt;reminder&lt;/heading&gt;</value>
      </heading>
   </item>
</items>

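The signature of parse_node() is incorrect. It should take a selector argument, and you should call the xpath() method on that selector. Appending /text() to each expression selects the text inside the element instead of the element itself. For example: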

def parse_node(self, response, selector):
    # '/text()' selects the text node, so the surrounding tags are not included
    to = selector.xpath('//to/text()').extract()
    who = selector.xpath('//from/text()').extract()
    print to, who

prints:

[u'tove'] [u'jani']
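
To see the difference outside of Scrapy, here is a minimal sketch that uses Python 3's standard library (xml.etree.ElementTree and xml.sax.saxutils) instead of Scrapy selectors. It is only an analogy and assumes the note.xml content shown under "input": serializing the whole element keeps its tags, which roughly corresponds to selecting '//body', and escaping that markup for export likely explains the &lt;body&gt; form in the wrong output. Reading only the element's text corresponds to '//body/text()'. The variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Same content as the "input" section above.
NOTE_XML = """<note>
<to>tove</to>
<from>jani</from>
<heading>reminder</heading>
<body>don't forget me weekend!</body>
</note>"""

root = ET.fromstring(NOTE_XML)
body = root.find("body")

# Serializing the element keeps its tags (analogous to selecting '//body').
body_markup = ET.tostring(body, encoding="unicode").strip()

# Escaping that markup for an XML export gives the &lt;body&gt; form.
escaped_markup = escape(body_markup)

# Reading only the text is the analogue of '//body/text()'.
body_text = body.text

print(body_markup)
print(escaped_markup)
print(body_text)
```

The last printed line is the bare text, which is what the item fields should hold before being exported.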
