html - Python re.findall() return value is an entire string


I am writing a crawler that extracts parts of an HTML file, but I cannot figure out how to use re.findall().

Here is an example: when I want to find every ... part in the file, I may write this:

re.findall("<div>.*\</div>", result_page) 

If result_page is the string "<div> </div> <div> </div>", the result is

['<div> </div> <div> </div>'] 

which is just the entire string. That is not what I want; I am expecting the 2 divs separately. What should I do?

Quoting the documentation:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

So just add the question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
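For a side-by-side comparison, here is a minimal sketch that runs both patterns on the sample string from the question. The last pattern is an addition that the original answer does not mention: it assumes a real page may have newlines inside a div, which "." only matches when re.DOTALL is set. The name multiline_page is made up for this illustration.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs on to the last </div>, so the whole string is one match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the first </div> it can, giving one match per div.
lazy = re.findall(r"<div>.*?</div>", result_page)

# Assumption: the real page has newlines inside a div.
# Without re.DOTALL, "." would not match "\n" and the first div would be missed.
multiline_page = "<div>\nfirst\n</div>\n<div>second</div>"
lazy_multiline = re.findall(r"<div>.*?</div>", multiline_page, flags=re.DOTALL)

print(greedy)          # ['<div> </div> <div> </div>']
print(lazy)            # ['<div> </div>', '<div> </div>']
print(lazy_multiline)  # ['<div>\nfirst\n</div>', '<div>second</div>']
```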

Also, you shouldn't use regex to parse HTML, since there are HTML parsers made just for that. Here is an example using BeautifulSoup 4:

In [7]: import bs4
In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page)('div')]
Out[8]: ['<div> </div>', '<div> </div>']
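To illustrate why a parser is the safer tool, the sketch below uses nested divs, a case the answer above does not cover. It swaps in the standard library's html.parser for BeautifulSoup so that it runs without third-party packages, and it collects the text of each outermost div rather than the tag markup. DivTextCollector and nested_page are hypothetical names invented for this example.

```python
import re
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Hypothetical helper: collects the text of each outermost <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <div> tags currently open
        self.buffer = []  # text pieces seen inside the current outermost div
        self.divs = []    # finished text, one entry per outermost div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost div just closed: store its text and reset.
                self.divs.append("".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


nested_page = "<div>a<div>b</div>c</div> <div>d</div>"

# The non-greedy regex cuts the outer div short at the inner </div>.
regex_result = re.findall(r"<div>.*?</div>", nested_page)

# The parser tracks nesting depth, so each outermost div stays whole.
collector = DivTextCollector()
collector.feed(nested_page)
collector.close()

print(regex_result)    # ['<div>a<div>b</div>', '<div>d</div>']
print(collector.divs)  # ['abc', 'd']
```

The regex has no notion of nesting, so its first match ends at the inner closing tag, while the parser counts open and close tags and keeps the outer div intact.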
