python - Scrapy creating XML feed wraps content in "value" tags
I've had a bit of help on here and my code pretty much works. The issue is that, in the process of generating the XML, it wraps my content in "value" tags when I don't want it to. According to the docs, this is due to the following:
Unless overridden in the :meth:`serialize_field` method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.
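To make the quoted rule concrete, here is a rough stand-alone imitation of it using only the standard library. It is not Scrapy's actual implementation; `add_field` and `render` are made-up names for illustration, and the only point is that a list value gets `<value>` children while a plain string does not.

```python
import xml.etree.ElementTree as ET


def add_field(parent, name, value):
    """Append <name> to parent; wrap each entry in <value> if value is a list or tuple.

    Hypothetical helper that only mimics the documented rule.
    """
    field = ET.SubElement(parent, name)
    if isinstance(value, (list, tuple)):
        # Multi-valued field: one <value> child per entry.
        for entry in value:
            ET.SubElement(field, "value").text = str(entry)
    else:
        # Single value: written directly as the element text.
        field.text = str(value)
    return field


def render(item):
    """Serialize a dict as an <item> element and return it as a string."""
    root = ET.Element("item")
    for name, value in item.items():
        add_field(root, name, value)
    return ET.tostring(root, encoding="unicode")


wrapped = render({"to": ["Tove"]})  # list value, even with one entry
plain = render({"to": "Tove"})      # plain string value
print(wrapped)
print(plain)
```

Under this rule a one-element list still counts as multi-valued, so the shape of the field value decides whether the extra tag appears.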
This is my output:
    <?xml version="1.0" encoding="utf-8"?>
    <items>
      <item>
        <body>
          <value>Don't forget me this weekend!</value>
        </body>
        <to>
          <value>Tove</value>
        </to>
        <who>
          <value>Jani</value>
        </who>
        <heading>
          <value>Reminder</value>
        </heading>
      </item>
    </items>

What I send to the XML exporter seems to be this, so I don't know why it thinks it's multi-valued:
    {'body': [u"Don't forget me this weekend!"], 'heading': [u'Reminder'], 'to': [u'Tove'], 'who': [u'Jani']}

pipeline.py
    from scrapy import signals
    from scrapy.contrib.exporter import XmlItemExporter


    class XmlExportPipeline(object):

        def __init__(self):
            self.files = {}

        @classmethod
        def from_crawler(cls, crawler):
            # Hook the pipeline into the spider open/close signals.
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            # One output file per spider, named after the spider.
            file = open('%s_products.xml' % spider.name, 'w+b')
            self.files[spider] = file
            self.exporter = XmlItemExporter(file)
            self.exporter.start_exporting()

        def spider_closed(self, spider):
            self.exporter.finish_exporting()
            file = self.files.pop(spider)
            file.close()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

spider.py
    from scrapy.contrib.spiders import XMLFeedSpider
    from crawler.items import CrawlerItem


    class SiteSpider(XMLFeedSpider):
        name = 'site'
        allowed_domains = ['www.w3schools.com']
        start_urls = ['http://www.w3schools.com/xml/note.xml']
        itertag = 'note'

        def parse_node(self, response, selector):
            item = CrawlerItem()
            # extract() returns a list for every field.
            item['to'] = selector.xpath('//to/text()').extract()
            item['who'] = selector.xpath('//from/text()').extract()
            item['heading'] = selector.xpath('//heading/text()').extract()
            item['body'] = selector.xpath('//body/text()').extract()
            return item

Any help is appreciated. I want the same output without the redundant tags.
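The extract() method always returns a list of values, even if there is only a single value as a result, for example: [4], [3, 4, 5] or an empty list. That is why the exporter treats each field as multi-valued. To avoid this, if you know there is only one value, you can select it like:

    item['to'] = selector.xpath('//to/text()').extract()[0]

Note: be aware that this can result in an exception being thrown if extract() returns an empty list and you try to index it. In such uncertain cases, a good trick to use is: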
    item['to'] = (selector.xpath('...').extract() or [''])[0]

Or write a custom function to get the first element:
    def extract_first(selector, default=None):
        val = selector.extract()
        return val[0] if val else default

This way you can have a default value in case your desired value is not found:
    item['to'] = extract_first(selector.xpath(...))  # first or None
    item['to'] = extract_first(selector.xpath(...), 'not-found')  # first or 'not-found'
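If the item has already been built with list values, the same idea can be applied afterwards. The following is a minimal sketch that runs without Scrapy, assuming an item shaped like the one in the question; `first_or_default` and `flat_item` are names made up for this example, and the sample values are placeholders.

```python
def first_or_default(values, default=None):
    """Return the first entry of a list, or `default` when it is empty or None."""
    return values[0] if values else default


# An item shaped like the one in the question: each field is a one-element list.
item = {
    'body': ["Don't forget me this weekend!"],
    'heading': ['Reminder'],
    'to': ['Tove'],
    'who': ['Jani'],
}

# Collapse each list to its first entry so every field holds a plain string.
flat_item = {name: first_or_default(values, '') for name, values in item.items()}
print(flat_item)
```

With plain strings instead of lists, the exporter should have no reason to treat the fields as multi-valued.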