PDF Data and Table Scraping to Excel
I'm trying to figure out a way to increase the productivity of a data entry job.
What I am looking for is a way to scrape data from a PDF and input it into Excel.
More specifically, the data I am working with comes from grocery store flyers. As it stands now, we have to manually enter every deal in the flyer into a database. A sample of a flyer is http://weeklyspecials.safeway.com/customer_frame.jsp?drpstoreid=1551
What I am hoping for is to have columns for products, price, and predefined options (loyalty cards, coupons, select variety... that sort of thing).
Any help would be appreciated, and if I need to be more specific, let me know.
After looking at the specific PDF linked to by the OP, I have to say that it does not quite display a typical table format.
It contains many images inside the "cells", and the cells are not strictly vertically or horizontally aligned.
So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...
Having said that, I'll have to add:
Extracting even 'nice' tables from PDFs in general is extremely difficult...
Standard PDFs do not provide any hints about the semantics of what they draw on a page: the only distinction that the syntax provides is the distinction between vector elements (lines, fills,...), images and text.
Whether any character is part of a table, part of a line, or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
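To make the difficulty concrete, here is a minimal sketch of the kind of heuristic an extractor has to fall back on: grouping positioned text fragments into rows by comparing their y coordinates. The `Fragment` type, the `group_into_rows` helper, the tolerance value and the sample coordinates are all hypothetical, chosen only for illustration; this is not how any particular tool is implemented.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text as a PDF page places it: a string at an (x, y) position."""
    x: float
    y: float
    text: str


def group_into_rows(fragments, y_tolerance=2.0):
    """Cluster fragments whose y coordinates lie within y_tolerance into rows."""
    rows = []
    # Sort top to bottom (PDF y grows upward) so row candidates are adjacent.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0].y - frag.y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order the cells left to right.
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


# Hypothetical fragments, roughly as a PDF parser might report them.
fragments = [
    Fragment(200.0, 700.0, "Price"),
    Fragment(50.0, 700.5, "Product"),
    Fragment(50.0, 680.0, "Apples"),
    Fragment(200.0, 679.2, "1.99"),
    Fragment(120.0, 640.0, "x"),  # a lonely character: table cell or decoration?
]

for row in group_into_rows(fragments):
    print(row)
```

Nothing in the input says whether the lone "x" belongs to the table; the heuristic simply puts it in a row of its own, and a flyer with misaligned cells and images would defeat a fixed tolerance like this one quickly.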
For background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult (ProPublica website)
...but doing it with TabulaPDF works well!
Having said the above, let me add this:
For an amazing open source family of tools that gets better and better from week to week at extracting tabular data from PDFs (unless they are scanned pages), contradicting what I said in my introductory paragraphs, check out TabulaPDF. See these links:
Tabula-Extractor is written in Ruby. In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs. To run, Tabula-Extractor requires JRuby-1.7 to be installed.
Installing Tabula-Extractor
I'm using the 'bleeding-edge' version of Tabula-Extractor directly from its GitHub source code repository. Getting it to work was extremely easy, since on my system JRuby-1.7.4_0 is already present:
    mkdir ~/svn-stuff
    cd ~/svn-stuff
    git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
The required libraries are already included in this Git clone, so there is no need to install PDFBox. The command line tool is in the /bin/ subdirectory.
Exploring the command line options:
    ~/svn-stuff/git.tabula-extractor/bin/tabula -h

    Tabula helps you extract tables from PDFs

    Usage:
           tabula [options] <pdf_file>
    where [options] are:
      --pages, -p <s>:           Comma separated list of ranges, or all. Examples: --pages 1-3,5-7, --pages 3 or --pages all. Default is --pages 1 (default: 1)
      --area, -a <s>:            Portion of the page to analyze (top,left,bottom,right). Example: --area 269.875,12.75,790.5,561. Default is entire page
      --columns, -c <s>:         X coordinates of column boundaries. Example --columns 10.1,20.2,30.3
      --password, -s <s>:        Password to decrypt document. Default is empty (default: )
      --guess, -g:               Guess the portion of the page to analyze per page.
      --debug, -d:               Print detected table areas instead of processing.
      --format, -f <s>:          Output format (CSV,TSV,HTML,JSON) (default: CSV)
      --outfile, -o <s>:         Write output to <file> instead of STDOUT (default: -)
      --spreadsheet, -r:         Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
      --no-spreadsheet, -n:      Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
      --silent, -i:              Suppress all stderr output.
      --use-line-returns, -u:    Use embedded line returns in cells. (Only in spreadsheet mode.)
      --version, -v:             Print version and exit
      --help, -h:                Show this message
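The help text describes `--pages` as a comma separated list of ranges such as `1-3,5-7`. If the page numbers come from a script, a small helper can produce that string. The sketch below is only an illustration; `pages_argument` is a hypothetical name and not part of Tabula.

```python
def pages_argument(pages):
    """Format page numbers as a comma-separated list of ranges, e.g. '1-3,5-7'."""
    numbers = sorted(set(pages))
    if not numbers:
        raise ValueError("at least one page number is required")
    ranges = []
    start = prev = numbers[0]
    for n in numbers[1:]:
        if n == prev + 1:
            # Still inside a consecutive run: extend it.
            prev = n
            continue
        # The run ended: record it and start a new one.
        ranges.append((start, prev))
        start = prev = n
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(pages_argument([1, 2, 3, 5, 6, 7]))
print(pages_argument([651, 653, 652]))
```

The result can be passed as the value of `-p`; a single page comes out as a plain number, matching the `--pages 3` form in the help text.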
Extracting the table which the OP wants
I'm not even trying to extract this ugly table from the OP's monster PDF. I'll leave it as an exercise for those readers who are feeling adventurous enough...
Instead, I'll demo how to extract a 'nice' table. I'll take pages 651-653 from the official PDF-1.7 specification, represented here with screenshots:
I used this command:
    ~/svn-stuff/git.tabula-extractor/bin/tabula \
      -p 651,652,653 -g -n -u -f csv \
      ~/downloads/pdfs/pdf32000_2008.pdf
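For a recurring data entry job, the same invocation could be driven from a script. The following is a minimal sketch that assembles the command with exactly the flags used above and runs it through Python's `subprocess` module; the function names and the output file name are assumptions, and the paths are simply the ones from this answer.

```python
import shlex
import subprocess
from pathlib import Path


def build_tabula_command(tabula_bin, pdf_path, pages, fmt="csv"):
    """Return the argument list for the tabula call shown in this answer."""
    return [
        str(tabula_bin),
        "-p", pages,        # pages to process, e.g. "651,652,653"
        "-g",               # guess the table area on each page
        "-n",               # do not use spreadsheet-style extraction
        "-u",               # use embedded line returns in cells
        "-f", fmt,          # output format
        str(pdf_path),
    ]


def extract_to_csv(tabula_bin, pdf_path, pages, csv_path):
    """Run tabula and save what it prints on stdout to csv_path."""
    command = build_tabula_command(tabula_bin, pdf_path, pages)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    Path(csv_path).write_text(result.stdout, encoding="utf-8")


# Paths as used in this answer; adjust them to your own system.
tabula = Path.home() / "svn-stuff" / "git.tabula-extractor" / "bin" / "tabula"
pdf = Path.home() / "downloads" / "pdfs" / "pdf32000_2008.pdf"

# Only print the command here; call extract_to_csv(...) to actually run it.
print(shlex.join(build_tabula_command(tabula, pdf, "651,652,653")))
```

Calling `extract_to_csv(tabula, pdf, "651,652,653", "table.csv")` would then write the CSV to a file (a hypothetical name), assuming JRuby and the clone are set up as described.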
After importing the generated CSV into LibreOffice Calc, the spreadsheet looks like this:
To me this looks like a perfect extraction of a table which spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
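The newlines survive because CSV allows a line break inside a quoted field, and any CSV-aware reader keeps such a field as one cell. Here is a minimal sketch using Python's standard `csv` module; the sample text is a hypothetical stand-in, not the actual output of the command above.

```python
import csv
import io

# Hypothetical stand-in for extracted CSV: the last field of the second
# record contains an embedded newline inside its quotes.
sample_csv = (
    "Key,Type,Value\r\n"
    '"Type","name","(Optional) The type of\nobject described."\r\n'
)

# newline="" leaves line endings untouched so the csv module can handle them.
rows = list(csv.reader(io.StringIO(sample_csv, newline="")))

for row in rows:
    print(len(row), row)
```

The reader yields two records of three cells each, with the line break kept inside the last cell, which is the same thing a spreadsheet import does.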
Update
Here is an asciinema screencast (which you can also download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor: