html - Parsing through Python using Beautiful Soup


I'm trying to parse the poorly structured website of a restaurant and print out the menu headers, like:

"bento box", "bara chirashi set", etc.

I'm using the Python library Beautiful Soup, but I'm having trouble getting the proper output:

    import requests
    from bs4 import BeautifulSoup

    url = ('http://www.sushitaro.com/menu-lunch.html')
    r = requests.get(url, auth=('user', 'pass'))

    data = r.text

    soup = BeautifulSoup(data)
    datalist = list()

    for string in soup.findAll('b'):
        datalist.append(string)

    print(datalist)

This returns many elements, but what is returned is HTML, not text, and the textual contents are messy, with Unicode characters and excess whitespace.

I'm having trouble with this; any help would be appreciated.

It sounds like you want the names of the menu items from the website in question. Page scraping can be tricky and, more than learning a library, you have to look at the structure of the page. Here, for example, the prices are also bold, so if you want only the names of the menu items you have to find a different distinguishing feature. In this case, the site designer has incremented the font size by 1 for each menu title, so, following your code through the definition of "soup", you can grab all and only the menu titles with:

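    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.sushitaro.com/menu-lunch.html'
    r = requests.get(url, auth=('user', 'pass'))
    data = r.text
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(data, 'html.parser')

    # Menu titles are the <font> tags whose size attribute is "+1".
    menutitleshtml = soup.find_all('font', {"size": "+1"})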
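To experiment with this kind of selection without hitting the live site, here is a minimal sketch that runs against an inline snippet. The markup, item names and prices in `SAMPLE_HTML` are invented for illustration and only assume the structure described above (prices in bold, titles in `<font size="+1">`); the real page may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration; not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><td><font size="+1">Lunch Bento Box</font></td><td><b>$15.00</b></td></tr>
  <tr><td><font size="+1">Bara Chirashi Set</font></td><td><b>$17.00</b></td></tr>
  <tr><td><font size="-1">Served with miso soup</font></td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selecting on <b> picks up the prices, not the titles.
bold_texts = [tag.get_text() for tag in sample_soup.find_all("b")]

# Selecting on the distinguishing attribute picks up only the titles.
title_texts = [tag.get_text() for tag in sample_soup.find_all("font", {"size": "+1"})]

print(bold_texts)   # ['$15.00', '$17.00']
print(title_texts)  # ['Lunch Bento Box', 'Bara Chirashi Set']
```

The second argument to `find_all` is a dictionary of attribute filters, so the `size="-1"` note is skipped even though it is also a `<font>` tag.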

Now, menutitleshtml holds a lot of HTML and not just the text. I assume you are familiar with Python list comprehensions, which are handy here. If you want just the text, you can try this instead:

    menutitlesdirty = [titlehtml.text for titlehtml in menutitleshtml]

But you'll notice that the titles have a lot of excess whitespace, including in Unicode, and characters like '@s.' Since you seem to want ASCII menu titles, you can convert to ASCII, ignoring errors, to clean out the Unicode. Then you can substitute a single space for every match of a regular expression that captures the undesired characters: newline characters, spaces and @s. Finally you can apply ".strip()", removing spaces at the ends of our strings. In sum this is:

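    import re

    badchars = re.compile(r'[\s@]+')  # runs of whitespace and @s
    # .decode('ascii') turns the bytes back into str, which Python 3's re needs for a str pattern
    menutitles = [badchars.sub(" ", dirtytitle.encode('ascii', 'ignore').decode('ascii')).strip()
                  for dirtytitle in menutitlesdirty]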

This returns what you suggested you wanted:

    ['lunch bento box',
     'bara chirashi set',
     'tekka chirashi set',
     'sushi mori set',
     'sushi jo set',
     'sushi tokujo set',
     'sashimi & tempura teishoku',
     'tokujo sashimi',
     "today's lunch special",
     'saba shioyaki teishoku',
     'katsu don set',
     'tem don set',
     'cold soba or udon w/one topping',
     'hot soup udon or soba w/one topping']
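The cleanup steps (drop non-ASCII, collapse whitespace and @s, strip the ends) can also be wrapped in a small helper and tried on made-up strings. This is only a sketch: `clean_title` and the sample inputs are hypothetical names and data, not taken from the page.

```python
import re

# Runs of whitespace and "@" characters.
BAD_CHARS = re.compile(r"[\s@]+")


def clean_title(dirty_title):
    """Drop non-ASCII characters, collapse whitespace/@ runs, and strip the ends."""
    # encode(..., "ignore") discards anything outside ASCII; decode returns a str again.
    ascii_only = dirty_title.encode("ascii", "ignore").decode("ascii")
    return BAD_CHARS.sub(" ", ascii_only).strip()


# Invented inputs imitating scraped text (non-breaking and ideographic spaces).
samples = ["\n  Lunch \u00a0Bento Box @\n", "Sushi \u3000Mori   Set\n"]
print([clean_title(s) for s in samples])  # ['Lunch Bento Box', 'Sushi Mori Set']
```

One thing to watch for: a non-ASCII space that sits directly between two words with no ordinary space beside it is simply dropped by the ASCII step, which would glue the words together. If that shows up on a page, running the regular expression substitution before the ASCII conversion is one possible workaround, since `\s` in a Python 3 `str` pattern also matches Unicode whitespace.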

To sum up: page scraping is a messy and iterative process in which you want to use any discrepancy on the page to your advantage. The Python REPL is your friend here. Hopefully this gives you and others an idea of how many tools, in Python more generally and Beautiful Soup in particular, can help with this process.

