geocompx – Geographic data analysis in R and Python: comparing code and outputs for vector data
Python 📍
📽️ Videos
PANDAS and PDF Tables
PANDAS and PDF Tables
I discovered the tabula-py library, which is an alternative to converting PDFs externally to Python using pdftotext -layout from the poppler set of utilities. This might be a good alternative to having to install and run a seperate command line program.
import tabula
import pandas as pd
pdf="/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df=pd.concat(tabula.read_pdf(pdf,pages='all', stream=True))
df.query("STATUS=='Active'")
| COUNTY | ELECTION DIST | STATUS | DEM | REP | CON | GRE | WOR | IND | WEP | REF | OTH | BLANK | TOTAL | COUNTY ELECTION DIST | Unnamed: 0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albany | ALBANY 001001 | Active | 67 | 19 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 13 | 101 | NaN | NaN |
| 3 | Albany | ALBANY 001002 | Active | 286 | 20 | 3 | 0 | 1 | 19 | 0 | 0 | 0 | 45 | 374 | NaN | NaN |
| 6 | Albany | ALBANY 001003 | Active | 468 | 32 | 9 | 3 | 1 | 19 | 0 | 0 | 0 | 81 | 613 | NaN | NaN |
| 9 | Albany | ALBANY 001004 | Active | 449 | 25 | 6 | 1 | 4 | 15 | 1 | 0 | 1 | 74 | 576 | NaN | NaN |
| 12 | Albany | ALBANY 001005 | Active | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | NaN | NaN |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 7 | NaN | NaN | Active | 291 | 59 | 4 | 0 | 0 | 31 | 0 | 1 | 1 | 89 | 476 | Albany WATERVLIET 004004 | NaN |
| 10 | NaN | NaN | Active | 339 | 213 | 29 | 4 | 3 | 57 | 0 | 0 | 0 | 202 | 847 | Albany WESTERLO 000001 | NaN |
| 13 | NaN | NaN | Active | 348 | 119 | 17 | 1 | 5 | 43 | 0 | 0 | 0 | 160 | 693 | Albany WESTERLO 000002 | NaN |
| 16 | NaN | NaN | Active | 314 | 137 | 26 | 2 | 2 | 35 | 0 | 0 | 0 | 196 | 712 | Albany WESTERLO 000003 | NaN |
| 19 | NaN | NaN | Active | 91,658 | 35,821 | 2,991 | 525 | 608 | 9,612 | 49 | 14 | 198 | 41,926 | 183,402 | Albany County Total | NaN |
There also poppler bindings for python, which you can use after installing the poppler-cpp developer library. This should work but currently it’s not parsing things correctly.
from poppler import load_from_file, PageRenderer
pdf_document = load_from_file(pdfFile)
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(page_1_text),
skiprows=3,
sep='\s{2,}',
engine='python',
error_bad_lines=False)
df
| COUNTY ELECTION DIST | STATUS | DEM | REP | CON | GRE | WOR | IND | WEP REF OTH BLANK TOTAL | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albany ALBANY 001001 | Active | 67.0 | 19.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 13.0 | 101.0 |
| Inactive | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 9.0 | |
| Total | 72.0 | 21.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 15.0 | 110.0 | |
| Albany ALBANY 001002 | Active | 286.0 | 20.0 | 3.0 | 0.0 | 1.0 | 19.0 | 0.0 | 0.0 | 0.0 | 45.0 | 374.0 |
| Inactive | 27.0 | 8.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 12.0 | 50.0 | |
| Total | 313.0 | 28.0 | 3.0 | 0.0 | 1.0 | 22.0 | 0.0 | 0.0 | 0.0 | 57.0 | 424.0 | |
| Albany ALBANY 001003 | Active | 468.0 | 32.0 | 9.0 | 3.0 | 1.0 | 19.0 | 0.0 | 0.0 | 0.0 | 81.0 | 613.0 |
| Inactive | 56.0 | 5.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 19.0 | 82.0 | |
| Total | 524.0 | 37.0 | 9.0 | 3.0 | 1.0 | 21.0 | 0.0 | 0.0 | 0.0 | 100.0 | 695.0 | |
| Albany ALBANY 001004 | Active | 449.0 | 25.0 | 6.0 | 1.0 | 4.0 | 15.0 | 1.0 | 0.0 | 1.0 | 74.0 | 576.0 |
| Inactive | 64.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 13.0 | 82.0 | |
| Total | 513.0 | 25.0 | 7.0 | 1.0 | 4.0 | 19.0 | 1.0 | 0.0 | 1.0 | 87.0 | 658.0 | |
| Albany ALBANY 001005 | Active | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| Total | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | |
| Albany ALBANY 001006 | Active | 462.0 | 11.0 | 5.0 | 2.0 | 4.0 | 14.0 | 0.0 | 0.0 | 1.0 | 57.0 | 556.0 |
| Inactive | 63.0 | 3.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 1.0 | 12.0 | 84.0 | |
| Total | 525.0 | 14.0 | 5.0 | 2.0 | 4.0 | 19.0 | 0.0 | 0.0 | 2.0 | 69.0 | 640.0 | |
| Albany ALBANY 001007 | Active | 435.0 | 19.0 | 1.0 | 0.0 | 2.0 | 19.0 | 0.0 | 0.0 | 0.0 | 62.0 | 538.0 |
| Inactive | 60.0 | 3.0 | 0.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 7.0 | 75.0 | |
| Total | 495.0 | 22.0 | 1.0 | 0.0 | 4.0 | 22.0 | 0.0 | 0.0 | 0.0 | 69.0 | 613.0 | |
| Albany ALBANY 001008 | Active | 160.0 | 7.0 | 1.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 37.0 | 210.0 |
| Page 1 of 46 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Been playing around a bit with PANDAS and ArcGIS Feature Servers for Tax Data
Been playing around a bit with PANDAS and ArcGIS Feature Servers for Tax Data … 📜
For example, to find the most valuable parcels in Albany County.
from arcgis.features import FeatureLayer
lyr_url = 'https://gisservices.its.ny.gov/arcgis/rest/services/NYS_Tax_Parcel_Centroid_Points/MapServer/0'
layer = FeatureLayer(lyr_url)
query_result1 = layer.query(where="COUNTY_NAME='Albany' AND FULL_MARKET_VAL > 100000000",
out_fields='PARCEL_ADDR,CITYTOWN_NAME,FULL_MARKET_VAL,OWNER_TYPE', out_sr='4326')
df=query_result1.sdf.sort_values(by='FULL_MARKET_VAL', ascending=False)
df['Full Market Value'] = df['FULL_MARKET_VAL'].map('${:,.0f}'.format)
df
| OBJECTID | PARCEL_ADDR | CITYTOWN_NAME | FULL_MARKET_VAL | OWNER_TYPE | SHAPE | Full Market Value | |
|---|---|---|---|---|---|---|---|
| 11 | 26652 | 64 Eagle St | Albany | 1204254925 | 2 | {“x”: -73.75980312511581, “y”: 42.650469918250… | $1,204,254,925 |
| 3 | 9150 | 1200 Washington Ave | Albany | 886298715 | 2 | {“x”: -73.81092293494828, “y”: 42.679257168282… | $886,298,715 |
| 4 | 10208 | 1400 Washington Ave | Albany | 642398287 | 2 | {“x”: -73.82369286130952, “y”: 42.685845700657… | $642,398,287 |
| 0 | 885 | 251 Fuller Rd | Albany | 440042827 | 2 | {“x”: -73.83559002316825, “y”: 42.690208093507… | $440,042,827 |
| 5 | 18164 | 632 New Scotland Ave | Albany | 377568201 | 8 | {“x”: -73.80381341626146, “y”: 42.655758957669… | $377,568,201 |
| 1 | 906 | 141 Fuller Rd | Albany | 321199143 | 2 | {“x”: -73.83323986150171, “y”: 42.693189748928… | $321,199,143 |
| 19 | 108087 | See Card 1067 | Watervliet | 280898876 | 1 | {“x”: -73.70670724174552, “y”: 42.719628647232… | $280,898,876 |
| 15 | 65380 | 737 Alb Shaker Rd | Colonie | 263916100 | 3 | {“x”: -73.80365248218001, “y”: 42.747956678125… | $263,916,100 |
| 9 | 21923 | 304 Madison Ave | Albany | 234265418 | 2 | {“x”: -73.76227373289564, “y”: 42.648000674457… | $234,265,418 |
| 2 | 907 | 201 Fuller Rd | Albany | 203426124 | 2 | {“x”: -73.83362605353057, “y”: 42.692609131686… | $203,426,124 |
| 16 | 69999 | 515 Loudon Rd | Colonie | 166065600 | 8 | {“x”: -73.74958475282632, “y”: 42.719321807666… | $166,065,600 |
| 7 | 20592 | 47 New Scotland Ave | Albany | 162276338 | 8 | {“x”: -73.77597163421673, “y”: 42.653565689693… | $162,276,338 |
| 6 | 20574 | 132 S Lake Ave | Albany | 146296360 | 2 | {“x”: -73.77970918544908, “y”: 42.654390366929… | $146,296,360 |
| 8 | 20597 | 113 Holland Ave | Albany | 143498501 | 2 | {“x”: -73.77306688593143, “y”: 42.650762742870… | $143,498,501 |
| 17 | 78203 | Mannsville | Colonie | 142570400 | 1 | {“x”: -73.71245452369443, “y”: 42.718124477080… | $142,570,400 |
| 18 | 95509 | 1 Crossgates Mall Rd | Guilderland | 130554700 | 8 | {“x”: -73.84702700595471, “y”: 42.687699053797… | $130,554,700 |
| 10 | 24521 | 86 S Swan St | Albany | 128436403 | 2 | {“x”: -73.75980563770365, “y”: 42.653931892804… | $128,436,403 |
| 13 | 46883 | 1916 US 9W | Coeymans | 110000000 | 8 | {“x”: -73.83388475575597, “y”: 42.488730743021… | $110,000,000 |
| 12 | 35152 | 380 River Rd | Bethlehem | 105263158 | 8 | {“x”: -73.76445503554325, “y”: 42.595925419330… | $105,263,158 |
| 14 | 65097 | 15 Wolf Rd | Colonie | 101967213 | 8 | {“x”: -73.81423716588279, “y”: 42.709939498581… | $101,967,213 |



