html - Python re.findall() return value is an entire string
I am writing a crawler that extracts parts of an HTML file, but I cannot figure out how to use re.findall().
Here is an example: when I want to find every <div>...</div> part in the file, I may write this:
re.findall("<div>.*\</div>", result_page)
If result_page is the string "<div> </div> <div> </div>", the result is
['<div> </div> <div> </div>']
which is just the entire string. This is not what I want; I am expecting the two divs separately. What should I do?
Quoting the documentation:
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
So just add the question mark:
In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
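To see the two behaviours side by side, here is a small self-contained sketch. It reuses result_page from the question; multiline_page and the other variable names are made up for illustration. It also shows one caveat worth knowing: by default '.' does not match a newline, so a div that spans several lines is only found when re.DOTALL is passed.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* grabs as much as possible, so the match ends at the LAST </div>.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? grabs as little as possible, so each match ends at the
# first </div> that follows its opening tag.
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']

# Hypothetical input with a div that spans several lines.
multiline_page = "<div>\nfirst\n</div><div>second</div>"

# Without re.DOTALL, '.' stops at newlines and the first div is missed.
line_bound = re.findall(r"<div>.*?</div>", multiline_page)

# With re.DOTALL, '.' also matches newlines.
dotall = re.findall(r"<div>.*?</div>", multiline_page, flags=re.DOTALL)

print(line_bound)  # ['<div>second</div>']
print(dotall)      # ['<div>\nfirst\n</div>', '<div>second</div>']
```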
Also, you shouldn't use regex to parse HTML, since there are HTML parsers made for that. Here is an example using BeautifulSoup 4:
In [7]: import bs4
In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page)('div')]
Out[8]: ['<div> </div>', '<div> </div>']
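If installing BeautifulSoup is not an option, a rough equivalent can be sketched with the standard library's html.parser module instead. This is only an illustration, not a replacement for a full parser: DivCollector and div_texts are hypothetical names, the sketch collects only the text inside each outermost div, and nested_page is a made-up input chosen to show where the non-greedy regex goes wrong.

```python
import re
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Collect the text inside every outermost <div> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <div> nesting level
        self.divs = []      # text of each completed outermost <div>
        self._buffer = []   # text pieces of the <div> being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self._buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.divs.append("".join(self._buffer))

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def div_texts(html):
    """Return the text content of each outermost <div> in html."""
    parser = DivCollector()
    parser.feed(html)
    parser.close()
    return parser.divs


# Hypothetical input with a nested div.
nested_page = "<div>a<div>b</div>c</div><div>d</div>"

# The non-greedy regex stops at the first </div>, cutting the outer div short.
print(re.findall(r"<div>.*?</div>", nested_page))
# ['<div>a<div>b</div>', '<div>d</div>']

# The parser tracks nesting, so the outer div is kept whole.
print(div_texts(nested_page))
# ['abc', 'd']
```

The nesting counter is the point here: a regular expression like the one above has no way to pair each closing tag with its own opening tag, which is why a parser is the safer tool once the HTML is not perfectly flat.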