PDF Data and Table Scraping to Excel


I'm trying to figure out a way to increase the productivity of a data entry job.

What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel.

More specifically, the data I am working with comes from grocery store flyers. As it stands now, we have to manually enter every deal in the flyer into a database. A sample of a flyer is http://weeklyspecials.safeway.com/customer_frame.jsp?drpstoreid=1551

What I am hoping for is to have columns for products, price, and predefined options (loyalty cards, coupons, select variety... that sort of thing).

Any help would be appreciated, and if I need to be more specific, let me know.

After looking at the specific PDF linked to by the OP, I have to say that it does not quite display a typical table format.

It contains many images inside the "cells", and the cells are not all strictly vertically or horizontally aligned:

[Image: page 6 of the PDF linked to by the OP]

So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...


Having said that, I'll have to add:

Extracting even 'nice' tables from PDFs in general is extremely difficult...

Standard PDFs do not provide any hints about the semantics of what they draw on a page: the only distinction the syntax provides is between vector elements (lines, fills,...), images and text.

Whether a character is part of a table, part of a line, or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.

For background on why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:

Why Updating Dollars for Docs Was So Difficult (ProPublica website)

...but doing it with TabulaPDF works very well!

Having said the above, let me add this:

tabula-extractor is written in Ruby. In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs. To run, tabula-extractor requires JRuby-1.7 to be installed.
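Since tabula-extractor depends on JRuby, it can save time to check for that prerequisite before cloning anything. The following is only a minimal sketch for a POSIX shell; have_cmd is a hypothetical helper name chosen for illustration, not something shipped with tabula-extractor.

```shell
# Hypothetical helper: succeed if the given command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report the JRuby version if present, otherwise print a hint to stderr.
if have_cmd jruby; then
    jruby --version
else
    echo "jruby not found; install JRuby 1.7 before running tabula-extractor" >&2
fi
```

The check only confirms that some jruby is on PATH; whether the reported version is compatible still has to be read from its output.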


Installing tabula-extractor

I'm using the 'bleeding-edge' version of tabula-extractor directly from its GitHub source code repository. Getting it to work was extremely easy, since JRuby-1.7.4_0 is already present on my system:

    mkdir ~/svn-stuff
    cd ~/svn-stuff
    git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

The required libraries are already included in this Git clone, so there is no need to install PDFBox. The command line tool is in the /bin/ subdirectory.

Exploring the command line options:

    ~/svn-stuff/git.tabula-extractor/bin/tabula -h

    Tabula helps you extract tables from PDFs

    Usage:
           tabula [options] <pdf_file>
    where [options] are:
             --pages, -p <s>:   Comma separated list of ranges, or all. Examples:
                                --pages 1-3,5-7, --pages 3 or --pages all. Default
                                is --pages 1 (default: 1)
              --area, -a <s>:   Portion of the page to analyze
                                (top,left,bottom,right). Example: --area
                                269.875,12.75,790.5,561. Default is entire page
           --columns, -c <s>:   X coordinates of column boundaries. Example
                                --columns 10.1,20.2,30.3
          --password, -s <s>:   Password to decrypt document. Default is empty
                                (default: )
                 --guess, -g:   Guess the portion of the page to analyze per page.
                 --debug, -d:   Print detected table areas instead of processing.
            --format, -f <s>:   Output format (CSV,TSV,HTML,JSON) (default: CSV)
           --outfile, -o <s>:   write output to <file> instead of STDOUT (default:
                                -)
           --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style
                                extraction (if there are ruling lines separating
                                each cell, as in a PDF of an Excel spreadsheet)
        --no-spreadsheet, -n:   Force PDF not to be extracted using
                                spreadsheet-style extraction (if there are ruling
                                lines separating each cell, as in a PDF of an Excel
                                spreadsheet)
                --silent, -i:   Suppress all stderr output.
      --use-line-returns, -u:   Use embedded line returns in cells. (Only in
                                spreadsheet mode.)
               --version, -v:   Print version and exit
                  --help, -h:   Show this message
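To illustrate how the options in the help text combine, here is a small sketch that restricts extraction to one region of selected pages and writes the result to a file. It uses only flags listed in the help output (--pages, --area, --format, --outfile). The wrapper name extract_area, the TABULA variable and the file names are hypothetical and exist only for this example.

```shell
# Path from the clone step above; override TABULA if the tool lives elsewhere.
TABULA="${TABULA:-$HOME/svn-stuff/git.tabula-extractor/bin/tabula}"

# Hypothetical wrapper: extract one page region into a CSV file.
# Arguments: <pdf_file> <pages> <top,left,bottom,right> <out.csv>
extract_area() {
    "$TABULA" --pages "$2" --area "$3" --format csv --outfile "$4" "$1"
}

# Example call (flyer.pdf and deals.csv are placeholder names; the area
# values are the ones shown as an example in the help text):
#   extract_area flyer.pdf 3 269.875,12.75,790.5,561 deals.csv
```

Keeping the tool path in a variable makes it easy to point the wrapper at a different checkout, or at a stand-in command when trying out the argument order.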

Extracting the table which the OP wants

I'm not even trying to extract this ugly table from the OP's monster PDF. I'll leave that as an exercise for those readers who are feeling adventurous enough...

Instead, I'll demo how to extract a 'nice' table. I'll take pages 651-653 from the official PDF-1.7 specification, represented here by screenshots:

[Image: pages 651-653 of the official PDF-1.7 specification]

I used this command:

    ~/svn-stuff/git.tabula-extractor/bin/tabula \
        -p 651,652,653 -g -n -u -f csv \
        ~/downloads/pdfs/pdf32000_2008.pdf

After importing the generated CSV into LibreOffice Calc, the spreadsheet looks like this:

[Image: screenshot of LibreOffice after importing the CSV]

To me this looks like a perfect extraction of a table that was spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
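Since the question asks for Excel rather than CSV, the last step could be scripted too. The sketch below assumes LibreOffice's soffice binary is on PATH and uses its headless --convert-to mode; csv_to_xlsx and xlsx_name are hypothetical helper names, and table.csv is a placeholder. How the headless import treats cells with embedded newlines has not been verified here, so the result should be checked by eye.

```shell
# Hypothetical helper: name of the .xlsx file expected for a given CSV path.
xlsx_name() {
    printf '%s.xlsx\n' "${1%.csv}"
}

# Hypothetical helper: convert <file.csv> to an .xlsx file in the same directory.
# Assumes LibreOffice's "soffice" binary is on PATH.
csv_to_xlsx() {
    if ! command -v soffice >/dev/null 2>&1; then
        echo "soffice not found; import the CSV into a spreadsheet program by hand" >&2
        return 1
    fi
    soffice --headless --convert-to xlsx --outdir "$(dirname "$1")" "$1"
}

# Example (table.csv is a placeholder for the redirected tabula output):
#   csv_to_xlsx table.csv && echo "wrote $(xlsx_name table.csv)"
```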


Update

Here is an asciinema screencast (which you can also download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:

[Screencast: asciicast]

