html - Python re.findall() return value is an entire string

March 15, 2010

I am writing a crawler to get certain parts of an HTML file, but I cannot figure out how to use re.findall().

Here is an example: when I want to find every <div>...</div> part in the file, I may write something like this:

re.findall("<div>.*\</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result is

['<div> </div> <div> </div>']

This is only the entire string, which is not what I want: I was expecting the 2 divs separately. What should I do?

Quoting the documentation:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add a question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
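To see the two behaviours side by side, here is a minimal sketch that runs the greedy and the non-greedy pattern on the same input. The name result_page follows the question. The multiline_page case and the use of re.DOTALL are assumptions added for illustration, since the question only covers a single-line string.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs to the last </div>, so the whole string is one match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the first </div> it can reach, one match per div.
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']

# Assumption beyond the question: when a div spans several lines, "." must
# also match newlines, which requires the re.DOTALL flag.
multiline_page = "<div>\nfirst\n</div>\n<div>second</div>"
multiline = re.findall(r"<div>.*?</div>", multiline_page, re.DOTALL)
print(multiline)  # ['<div>\nfirst\n</div>', '<div>second</div>']
```

Without re.DOTALL, the first div in multiline_page would not be matched at all, because "." does not cross line breaks by default.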

Also, you shouldn't use regex to parse HTML, since there are HTML parsers made just for that. Here is an example using BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, "html.parser")('div')]
Out[8]: ['<div> </div>', '<div> </div>']
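One case where the regex approach breaks down is nested divs. The sketch below is not from the answer: it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages, and the DivCollector class and the sample string are hypothetical names made up for illustration. It collects the text of each outermost div and compares that with what the non-greedy regex returns.

```python
import re
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Hypothetical helper: gathers the text inside each outermost <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <div> tags currently open
        self.divs = []     # text content of each outermost div
        self._buffer = []  # text pieces of the div being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self._buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.divs.append("".join(self._buffer))

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


nested_page = "<div>a<div>b</div>c</div> <div>d</div>"

# The non-greedy regex stops at the first </div>, cutting the outer div short.
regex_result = re.findall(r"<div>.*?</div>", nested_page)
print(regex_result)  # ['<div>a<div>b</div>', '<div>d</div>']

# The parser tracks nesting depth, so the outer div is kept whole.
collector = DivCollector()
collector.feed(nested_page)
collector.close()
print(collector.divs)  # ['abc', 'd']
```

A parser follows the tag structure, so it keeps working when tags are nested or carry attributes, which a pattern like "<div>.*?</div>" cannot handle.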
