html - Parsing through Python using Beautiful Soup
I'm trying to parse through the poorly structured website of a restaurant and print out the menu headers, like:
"bento box", "bara chirashi set", etc.
I'm using the Python library Beautiful Soup, but I'm having trouble getting the proper output:
    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.sushitaro.com/menu-lunch.html'
    r = requests.get(url, auth=('user', 'pass'))
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')

    # Collect every <b> tag on the page.
    datalist = list()
    for string in soup.find_all('b'):
        datalist.append(string)
    print(datalist)

This returns too many elements, what is returned is HTML and not text, and the textual contents are messy, with Unicode characters and excess whitespace.
I'm having trouble with this, so any help would be appreciated.
It sounds like you want the names of the menu items from the website in question. Page scraping can be tricky and, more than learning the library, you have to look at the structure of the page. Here, for example, the prices are in bold, so if you want the names of the menu items you have to find a different distinguishing feature. In this case, the site designer has incremented the font size by 1 for each menu title, so, following your code through the definition of "soup", you can grab any and all menu titles with:
    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.sushitaro.com/menu-lunch.html'
    r = requests.get(url, auth=('user', 'pass'))
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')

    # Menu titles are the <font> tags whose size attribute is "+1".
    menutitleshtml = soup.find_all('font', {"size": "+1"})

Now, that will return a lot of HTML and not the text. I assume you are familiar with Python list comprehensions, which are handy here. If you want just the text, you can try this instead:
    menutitlesdirty = [titlehtml.text for titlehtml in menutitleshtml]

But you'll notice that the titles have a lot of excess whitespace, including in Unicode, and characters like '@s.' Since you seem to want ASCII menu titles, you can convert to ASCII, ignoring errors, to clean out the Unicode. Then you can substitute a single space for any match of a regular expression that captures the undesired characters: newline characters, spaces, and @s. Finally, you can apply ".strip()", removing spaces at the ends of our strings. In sum, this is:
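    import re

    # Runs of whitespace (including newlines) and "@" characters.
    badchars = re.compile(r'[\s@]+')
    # On Python 3, decode back to str after dropping the non-ASCII characters.
    menutitles = [
        badchars.sub(" ", dirtytitle.encode('ascii', 'ignore').decode('ascii')).strip()
        for dirtytitle in menutitlesdirty
    ]

This returns what you suggested you wanted: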
    ['lunch bento box', 'bara chirashi set', 'tekka chirashi set',
     'sushi mori set', 'sushi jo set', 'sushi tokujo set',
     'sashimi & tempura teishoku', 'tokujo sashimi', "today's lunch special",
     'saba shioyaki teishoku', 'katsu don set', 'tem don set',
     'cold soba or udon w/one topping', 'hot soup udon or soba w/one topping']

To sum up: page scraping is a messy and iterative process in which you want to use any discrepancy on the page to your advantage. The Python REPL is your friend here. I hope this gives you and others an idea of the many tools, in Python more generally and Beautiful Soup in particular, that can help with this process.
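
If you want to try the cleaning steps described above without fetching the page, here is a minimal sketch using only the standard library. The strings in `dirty_titles` are made up for illustration (they are not copied from the site), and `clean_title` is just a hypothetical helper name wrapping the same three steps: drop non-ASCII characters, replace runs of whitespace and @s with a single space, and strip the ends.

```python
import re

# Made-up raw strings standing in for the .text of scraped tags (assumption).
dirty_titles = [
    "\n   Lunch Bento Box \u00a0 @ ",
    "Bara\n  Chirashi   Set\u3000",
]

# One or more whitespace characters or "@" signs in a row.
bad_chars = re.compile(r"[\s@]+")


def clean_title(title):
    """Drop non-ASCII characters, collapse unwanted runs, trim the ends."""
    # encode(..., "ignore") silently discards anything outside ASCII;
    # decode turns the resulting bytes back into a str.
    ascii_only = title.encode("ascii", "ignore").decode("ascii")
    return bad_chars.sub(" ", ascii_only).strip()


clean_titles = [clean_title(t) for t in dirty_titles]
print(clean_titles)
```

Playing with a helper like this in the REPL makes it easy to adjust the character class when you spot other junk in the output.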