html - Python re.findall() return value is an entire string
I am writing a crawler to extract parts of an HTML file, but I cannot figure out how to use re.findall().

Here is an example. When I want to find every ... part in the file, I may write something like this:

    re.findall("<div>.*\</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result is

    ['<div> </div> <div> </div>']

That is only the entire string. This is not what I want; I am expecting the 2 divs separately. What should I do?
Quoting the documentation:

"The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
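To make the quoted behaviour concrete, here is a minimal sketch comparing the two modes. The sample string and the variable names (`text`, `greedy`, `lazy`) are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample input: two short tagged runs on one line.
text = "<b>one</b> <b>two</b>"

# Greedy: ".*" grabs as much as it can, so the match runs from the
# first "<b>" all the way to the last "</b>".
greedy = re.findall(r"<b>.*</b>", text)

# Non-greedy: ".*?" stops at the first "</b>" it can reach,
# so each tagged run is returned separately.
lazy = re.findall(r"<b>.*?</b>", text)

print(greedy)  # ['<b>one</b> <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```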
Just add the question mark:
In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']

Also, you shouldn't use regex to parse HTML, since there are HTML parsers made for that. Here is an example using BeautifulSoup 4:

In [7]: import bs4
In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page)('div')]
Out[8]: ['<div> </div>', '<div> </div>']
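As a rough illustration of why a parser is the safer choice, the sketch below shows two inputs where the non-greedy pattern stops giving the expected result. The strings `nested` and `multiline` are hypothetical examples, not taken from the question.

```python
import re

pattern = r"<div>.*?</div>"

# Nested divs: the lazy match ends at the first "</div>", so the
# outer element is cut short and its closing tag is left behind.
nested = "<div><div>inner</div></div>"
nested_result = re.findall(pattern, nested)
print(nested_result)  # ['<div><div>inner</div>']

# Content spanning several lines: "." does not match a newline by
# default, so nothing is found unless re.DOTALL is passed.
multiline = "<div>\nline\n</div>"
default_result = re.findall(pattern, multiline)
dotall_result = re.findall(pattern, multiline, re.DOTALL)
print(default_result)  # []
print(dotall_result)   # ['<div>\nline\n</div>']
```

The re.DOTALL flag handles the multi-line case, but the nested case has no simple regex fix, which is where an HTML parser helps.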