python - Scrapy - Issue with xpath on an xml crawl
I'm trying to make a simple spider that grabs some XML and spits it out in a new format, as an experiment. However, it seems there is extra markup contained within the XML that is spat out. The format I want (no extra markup or value tag) is along the lines of this: <body>don't forget me weekend!</body>
I think I'm using XPath wrong, but I'm not sure what I'm doing wrong.
Spider
from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['www.w3schools.com']
    start_urls = ['http://www.w3schools.com/xml/note.xml']
    itertag = 'note'

    def parse_node(self, response):
        xxs = XmlXPathSelector(response)
        to = xxs.select('//to')
        who = xxs.select('//from')
        heading = xxs.select('//heading')
        body = xxs.select('//body')
        return item
Input
<note>
  <to>tove</to>
  <from>jani</from>
  <heading>reminder</heading>
  <body>don't forget me weekend!</body>
</note>
Current (wrong) output
<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <body>
      <value><body>don't forget me weekend!</body></value>
    </body>
    <to>
      <value><to>tove</to></value>
    </to>
    <who>
      <value><from>jani</from></value>
    </who>
    <heading>
      <value><heading>reminder</heading></value>
    </heading>
  </item>
</items>
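The signature of parse_node() is incorrect. There should be a selector argument given, which you should call the xpath() method on. Appending /text() to each expression selects the element's text instead of the whole element, and extract() returns the matches as a list of strings. For example: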
def parse_node(self, response, selector):
    to = selector.xpath('//to/text()').extract()
    who = selector.xpath('//from/text()').extract()
    print to, who
prints:
[u'tove'] [u'jani']
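To see why the original output still contained the tags, the difference between selecting an element and selecting its text can be reproduced outside Scrapy. This is only a minimal sketch, and it makes a few assumptions: it uses Python 3 and the standard library's xml.etree.ElementTree instead of Scrapy's selectors, the XML is pasted inline rather than downloaded, and the helper names element_markup and element_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Inline copy of the feed from the question (assumption: no network access).
NOTE_XML = """<note>
<to>tove</to>
<from>jani</from>
<heading>reminder</heading>
<body>don't forget me weekend!</body>
</note>"""

root = ET.fromstring(NOTE_XML)


def element_markup(tag):
    """Serialize the whole child element, tags included.

    Roughly what you get when the element itself is selected ('//to').
    """
    return ET.tostring(root.find(tag), encoding="unicode").strip()


def element_text(tag):
    """Return only the text inside the child element.

    Roughly what you get when the text node is selected ('//to/text()').
    """
    return root.find(tag).text


print(element_markup("to"))  # <to>tove</to>
print(element_text("to"))    # tove
```

The first call mirrors the unwanted output in the question, where each field wrapped the full element, and the second mirrors the fixed version that returns plain text.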