PANDAS and PDF Tables

PANDAS and PDF Tables

I discovered the tabula-py library, which is an alternative to converting PDFs externally to Python using pdftotext -layout from the poppler set of utilities. This might be a good alternative to having to install and run a seperate command line program.

import tabula
import pandas as pd

pdf="/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df=pd.concat(tabula.read_pdf(pdf,pages='all', stream=True))

df.query("STATUS=='Active'")
COUNTYELECTION DISTSTATUSDEMREPCONGREWORINDWEPREFOTHBLANKTOTALCOUNTY ELECTION DISTUnnamed: 0
0AlbanyALBANY 001001Active6719100100013101NaNNaN
3AlbanyALBANY 001002Active286203011900045374NaNNaN
6AlbanyALBANY 001003Active468329311900081613NaNNaN
9AlbanyALBANY 001004Active449256141510174576NaNNaN
12AlbanyALBANY 001005Active50000100006NaNNaN
7NaNNaNActive291594003101189476Albany WATERVLIET 004004NaN
10NaNNaNActive339213294357000202847Albany WESTERLO 000001NaN
13NaNNaNActive348119171543000160693Albany WESTERLO 000002NaN
16NaNNaNActive314137262235000196712Albany WESTERLO 000003NaN
19NaNNaNActive91,65835,8212,9915256089,612491419841,926183,402Albany County TotalNaN

There also poppler bindings for python, which you can use after installing the poppler-cpp developer library. This should work but currently it’s not parsing things correctly.

from poppler import load_from_file, PageRenderer

pdf_document = load_from_file(pdfFile)
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(page_1_text),
   skiprows=3,
   sep='\s{2,}',
   engine='python',
   error_bad_lines=False)

df
COUNTY ELECTION DISTSTATUSDEMREPCONGREWORINDWEP REF OTH BLANK TOTAL
Albany ALBANY 001001Active67.019.01.00.00.01.00.00.00.013.0101.0
Inactive5.02.00.00.00.00.00.00.00.02.09.0
Total72.021.01.00.00.01.00.00.00.015.0110.0
Albany ALBANY 001002Active286.020.03.00.01.019.00.00.00.045.0374.0
Inactive27.08.00.00.00.03.00.00.00.012.050.0
Total313.028.03.00.01.022.00.00.00.057.0424.0
Albany ALBANY 001003Active468.032.09.03.01.019.00.00.00.081.0613.0
Inactive56.05.00.00.00.02.00.00.00.019.082.0
Total524.037.09.03.01.021.00.00.00.0100.0695.0
Albany ALBANY 001004Active449.025.06.01.04.015.01.00.01.074.0576.0
Inactive64.00.01.00.00.04.00.00.00.013.082.0
Total513.025.07.01.04.019.01.00.01.087.0658.0
Albany ALBANY 001005Active5.00.00.00.00.01.00.00.00.00.06.0
Total5.00.00.00.00.01.00.00.00.00.06.0
Albany ALBANY 001006Active462.011.05.02.04.014.00.00.01.057.0556.0
Inactive63.03.00.00.00.05.00.00.01.012.084.0
Total525.014.05.02.04.019.00.00.02.069.0640.0
Albany ALBANY 001007Active435.019.01.00.02.019.00.00.00.062.0538.0
Inactive60.03.00.00.02.03.00.00.00.07.075.0
Total495.022.01.00.04.022.00.00.00.069.0613.0
Albany ALBANY 001008Active160.07.01.00.02.02.00.00.01.037.0210.0
Page 1 of 46NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

Leave a Reply

Your email address will not be published. Required fields are marked *