PANDAS and PDF Tables
PANDAS and PDF Tables
I discovered the tabula-py library, which is an alternative to converting PDFs externally to Python using pdftotext -layout from the poppler set of utilities. This might be a good alternative to having to install and run a seperate command line program.
import tabula
import pandas as pd
pdf="/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df=pd.concat(tabula.read_pdf(pdf,pages='all', stream=True))
df.query("STATUS=='Active'")
COUNTY | ELECTION DIST | STATUS | DEM | REP | CON | GRE | WOR | IND | WEP | REF | OTH | BLANK | TOTAL | COUNTY ELECTION DIST | Unnamed: 0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Albany | ALBANY 001001 | Active | 67 | 19 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 13 | 101 | NaN | NaN |
3 | Albany | ALBANY 001002 | Active | 286 | 20 | 3 | 0 | 1 | 19 | 0 | 0 | 0 | 45 | 374 | NaN | NaN |
6 | Albany | ALBANY 001003 | Active | 468 | 32 | 9 | 3 | 1 | 19 | 0 | 0 | 0 | 81 | 613 | NaN | NaN |
9 | Albany | ALBANY 001004 | Active | 449 | 25 | 6 | 1 | 4 | 15 | 1 | 0 | 1 | 74 | 576 | NaN | NaN |
12 | Albany | ALBANY 001005 | Active | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | NaN | NaN |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
7 | NaN | NaN | Active | 291 | 59 | 4 | 0 | 0 | 31 | 0 | 1 | 1 | 89 | 476 | Albany WATERVLIET 004004 | NaN |
10 | NaN | NaN | Active | 339 | 213 | 29 | 4 | 3 | 57 | 0 | 0 | 0 | 202 | 847 | Albany WESTERLO 000001 | NaN |
13 | NaN | NaN | Active | 348 | 119 | 17 | 1 | 5 | 43 | 0 | 0 | 0 | 160 | 693 | Albany WESTERLO 000002 | NaN |
16 | NaN | NaN | Active | 314 | 137 | 26 | 2 | 2 | 35 | 0 | 0 | 0 | 196 | 712 | Albany WESTERLO 000003 | NaN |
19 | NaN | NaN | Active | 91,658 | 35,821 | 2,991 | 525 | 608 | 9,612 | 49 | 14 | 198 | 41,926 | 183,402 | Albany County Total | NaN |
There also poppler bindings for python, which you can use after installing the poppler-cpp developer library. This should work but currently it’s not parsing things correctly.
from poppler import load_from_file, PageRenderer
pdf_document = load_from_file(pdfFile)
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(page_1_text),
skiprows=3,
sep='\s{2,}',
engine='python',
error_bad_lines=False)
df
COUNTY ELECTION DIST | STATUS | DEM | REP | CON | GRE | WOR | IND | WEP REF OTH BLANK TOTAL | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Albany ALBANY 001001 | Active | 67.0 | 19.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 13.0 | 101.0 |
Inactive | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 9.0 | |
Total | 72.0 | 21.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 15.0 | 110.0 | |
Albany ALBANY 001002 | Active | 286.0 | 20.0 | 3.0 | 0.0 | 1.0 | 19.0 | 0.0 | 0.0 | 0.0 | 45.0 | 374.0 |
Inactive | 27.0 | 8.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 12.0 | 50.0 | |
Total | 313.0 | 28.0 | 3.0 | 0.0 | 1.0 | 22.0 | 0.0 | 0.0 | 0.0 | 57.0 | 424.0 | |
Albany ALBANY 001003 | Active | 468.0 | 32.0 | 9.0 | 3.0 | 1.0 | 19.0 | 0.0 | 0.0 | 0.0 | 81.0 | 613.0 |
Inactive | 56.0 | 5.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 19.0 | 82.0 | |
Total | 524.0 | 37.0 | 9.0 | 3.0 | 1.0 | 21.0 | 0.0 | 0.0 | 0.0 | 100.0 | 695.0 | |
Albany ALBANY 001004 | Active | 449.0 | 25.0 | 6.0 | 1.0 | 4.0 | 15.0 | 1.0 | 0.0 | 1.0 | 74.0 | 576.0 |
Inactive | 64.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 13.0 | 82.0 | |
Total | 513.0 | 25.0 | 7.0 | 1.0 | 4.0 | 19.0 | 1.0 | 0.0 | 1.0 | 87.0 | 658.0 | |
Albany ALBANY 001005 | Active | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
Total | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | |
Albany ALBANY 001006 | Active | 462.0 | 11.0 | 5.0 | 2.0 | 4.0 | 14.0 | 0.0 | 0.0 | 1.0 | 57.0 | 556.0 |
Inactive | 63.0 | 3.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 1.0 | 12.0 | 84.0 | |
Total | 525.0 | 14.0 | 5.0 | 2.0 | 4.0 | 19.0 | 0.0 | 0.0 | 2.0 | 69.0 | 640.0 | |
Albany ALBANY 001007 | Active | 435.0 | 19.0 | 1.0 | 0.0 | 2.0 | 19.0 | 0.0 | 0.0 | 0.0 | 62.0 | 538.0 |
Inactive | 60.0 | 3.0 | 0.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 7.0 | 75.0 | |
Total | 495.0 | 22.0 | 1.0 | 0.0 | 4.0 | 22.0 | 0.0 | 0.0 | 0.0 | 69.0 | 613.0 | |
Albany ALBANY 001008 | Active | 160.0 | 7.0 | 1.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 37.0 | 210.0 |
Page 1 of 46 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |