Build sheet decoder for CCC vehicle build sheets in PDF format.

j-phi/build_sheet_decoder

Build Sheet Decoder

Build sheet scraper for CCC vehicle build sheets in PDF format.

Uses pdf2htmlEX to convert the .pdf to .html, then parses the HTML with BeautifulSoup. See BuildSheetDecoder.py.
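
The two-step flow described above could look roughly like the sketch below. This is not the code in BuildSheetDecoder.py; the function names `build_pdf2htmlex_command` and `convert_and_parse` are hypothetical, and the sketch assumes that the `pdf2htmlEX` binary is on the PATH and that the `bs4` package is installed.

```python
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(pdf_path, dest_dir="."):
    """Return the pdf2htmlEX argument list for converting one PDF (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    html_name = pdf_path.with_suffix(".html").name
    # Invocation form: pdf2htmlEX [options] <input.pdf> [<output.html>]
    return ["pdf2htmlEX", "--dest-dir", str(dest_dir), str(pdf_path), html_name]


def convert_and_parse(pdf_path, dest_dir="."):
    """Convert a PDF with pdf2htmlEX, then load the resulting HTML into BeautifulSoup."""
    # Imported lazily so the command builder above works without bs4 installed.
    from bs4 import BeautifulSoup

    cmd = build_pdf2htmlex_command(pdf_path, dest_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if conversion fails
    html_path = Path(dest_dir) / cmd[-1]
    with open(html_path, encoding="utf-8") as fh:
        return BeautifulSoup(fh, "html.parser")
```

Keeping the command construction separate from the subprocess call makes the conversion step easy to check without running the converter.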

To-do:

  • Remove first and second column output of OEM data
  • Fix mfgInstalled() skipping first item in section
  • Fix double-escaped quotation marks in mfgInstalled()
  • Concatenate list items [2:] in the sub-lists of OEMInstalledList()
  • Add more error checking
  • Use regex to verify divs based on contents in addition to class designations assigned by pdf2htmlEX
  • Add output functions to write a .csv file or insert directly into a database
  • For the standard equipment (final) section, determine whether it is critical that same-section lines be grouped together. If so, consider a rule that joins a line to the previous one when the previous line's div is wider than 480px and the current line begins with a lowercase letter, or when a parenthesis opened earlier has not yet been closed. This will likely not be perfect, but it would work for our two sample documents. Unfortunately, no CSS style dictates the row background color; the background is a full-page image file. Alternatively, read the page background image (e.g. with scipy.misc.imread, which was removed in SciPy 1.2 in favor of imageio.imread), build a list of the points where the color changes along a vertical column of pixels, and compare those against the stylesheet {bottom} positions of the divs to determine whether they are in the same "table row".
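
A minimal sketch of the line-grouping rule from the last to-do item follows. It assumes the rule reads as "(previous div wider than 480px and current line starts lowercase) or (an earlier parenthesis is still open)", which is one possible reading. The names `has_open_paren` and `group_lines` are hypothetical, and the input is assumed to be a list of (text, div width in px) tuples already extracted from the HTML.

```python
def has_open_paren(text):
    """Return True if text contains a '(' that has not been closed yet."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
    return depth > 0


def group_lines(lines, min_width_px=480):
    """Join continuation lines onto the previous entry.

    `lines` is a list of (text, div_width_px) tuples in page order.
    """
    grouped = []
    prev_width = None
    for text, width in lines:
        stripped = text.strip()
        # The previous line filled the column and this one starts lowercase.
        wraps = (
            prev_width is not None
            and prev_width > min_width_px
            and stripped[:1].islower()
        )
        if grouped and (wraps or has_open_paren(grouped[-1])):
            grouped[-1] = grouped[-1] + " " + stripped
        else:
            grouped.append(stripped)
        prev_width = width
    return grouped
```

With made-up input such as `[("Tilt steering wheel with audio controls and", 500), ("cruise control", 90)]`, the second line is joined to the first because the first div is wider than 480px and the second line starts with a lowercase letter.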

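The to-do items on concatenating list items [2:] and on .csv output could be sketched together as below. The helpers `collapse_row` and `rows_to_csv` are hypothetical and do not exist in BuildSheetDecoder.py. The sketch assumes each sub-list is a plain list of strings and uses only the standard-library `csv` module.

```python
import csv
import io


def collapse_row(row, keep=2):
    """Keep the first `keep` items and join the remaining items into one string."""
    if len(row) <= keep:
        return list(row)
    return list(row[:keep]) + [" ".join(row[keep:])]


def rows_to_csv(rows, out):
    """Write collapsed rows as CSV to an open text file object."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(collapse_row(row))


# Usage with made-up rows; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
rows_to_csv([["A1", "B2", "Heated", "front", "seats"], ["C3", "D4"]], buffer)
```

Writing to any file-like object keeps the same function usable for a file opened with `open(path, "w", newline="")` or for an in-memory buffer.
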
buildStatDict()

[image]

submissionChecklist() & mfgInstalledList()

[image]

OEMInstalledList()

[image]

stdEquip()

[image]

Test run

[image]
