Python

Python is an interpreted high-level general-purpose programming language. Python’s design philosophy emphasizes code readability with its notable use of significant indentation.

I don’t love Python for Geospatial Projects 🐍

Last night I went back for a second a look at the world of Geospatial Technology in Python. While Python’s ArcGIS and QGIS bindings are widely touted — and are best way to automate things within ArcGIS or QGIS — you are much better off using R programming language for quick, low-code GIS tasks outside of ArcGIS or QGIS.

Python has a lot of advantages for certain things:

  1. It is a good scripting language that is widely supported in applications.
  2. Python is generally a stronger language for building applications to run on web servers
  3. Both ArcGIS and QGIS have really good Python bindings

But as a stand-alone platform, the Python Geospatial libraries rather suck, and are undeveloped. To be sure you can make maps in Python, you can preform various geospatial operations like transformations, raster math and geometric operations. But it takes a lot of work within Python to get nice looking maps using matplotlib, and but you don’t have access to wealth of Census shapefiles or Census data at your finger tip, and Python’s dot chaining method isn’t necessarily as elegant or readable.

I would argue that the R Programming Language and RStudio are superior in many ways over working directly with Python:

  1. R Programs using tigris library, which gives you instant access to the Census Bureau TIGER/Line with just a single command that can be easily joined again or queried against other data. There is nothing like tigris in Python. If you want to County or County Subdivision lines in Python, you will have to manually download the shapefile and then load it into Geopandas. I’ve looked for things like tigris in Python and it doesn’t exist. The basis of most maps in my experience comes from Census TIGER/Line at least in United States. Cartograpy in Python does have access to Natural Earth Dataset, but that isn’t as good as TIGER/Line in the United States.
  2. There are Census Libraries in Python but they aren’t nearly as up to date, have access to nearly as much Census data or Census TIGER/Line. A lot of maps that you make involve plotting Census data, and that requires both the TIGER/Line and the raw data. tidycensus joins them together as one command, no need to download the TIGER/Line separately like in Python.
  3. While you can chain commands in Python and GeoPandas, the chaining mechanism in R is much stronger and flexible. Often in R you can exchange, transform, load and output a map in a single chain of R commands using the tidyverse and ggplot.
  4. ggplot2 is vastly superior to matplotlib for making maps. ggplot2 has sensible defaults, the output is clean and easily theme-able. ggplot2 main limitation is that is best for simpler, easy to read SVG maps. ggplot2 can be a bit strict in enforcing how Hadley Wickham thinks a map should be presented.Β  matplotlib is more flexible in overlaying and designing maps. Of course, for complicated maps, it’s still best to export the data as a shapefile or geopackage and load it into a full GIS platform like QGIS or ArcGIS.
  5. In general, R has a quirky syntax with cute and weird names, but with more sensible defaults, it often gets geospatial work done with less code and work then the same thing done in Python with Rasterio or GeoPandas. A lot of complicated exchange-load-transform things can be done in one line of code with R. People say that Python is a compact syntax, but it really isn’t compared to R’s geospatial libraries.

I still use Python for some QGIS work but I don’t recommend it for work outside of QGIS or ArcGIS. Python should be seen as a good language to automate processes within QGIS or ArcGIS but the state of geospatial tools in Python is weak when you get away from those automating those graphical GIS applications. If you want your work to end up in a graphical GIS program for additional manual tweaking after automating things, then use Python. But if you are processing GIS data from start to finish, your best bet is R.

Bulk Download NYS Election District Enrollments

For some of my projects, I need access to the full set of enrollments from the State Board of Elections. While you can download them one at a time, it sure is nice to have all of them for statewide coverage to easy processing. You can get them using Beautiful Soup.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

yr = '12'
mon = 'nov'
ext = '.pdf' #.xlsx

url = "https://www.elections.ny.gov/20"+yr+"EnrollmentED.html"

#If there is no such folder, the script will create one automatically
folder_location = r''+mon+yr+'-Enrollment'
if not os.path.exists(folder_location):os.mkdir(folder_location)

response = requests.get(url, { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103197 Safari/537.36' })
soup= BeautifulSoup(response.text, "html.parser")     
for link in soup.select("a[href$='"+mon+yr+ext+"']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)

What can you do with them? If you are looking at the numbers from 2018 and later, you can load them directly in PANDAS.

import glob as g
import pandas as pd

df = pd.concat((pd.read_excel(f, header=4) for f in g.glob('/home/andy/2021-Enrollment/*xlsx')))

df = df[(df['STATUS']=='Active')]

df.insert(1,'Municipality',df['ELECTION DIST'].str[:-7].str.title())
df.insert(2,'Ward',df['ELECTION DIST'].str[-6:-3])
df.insert(3,'ED', df['ELECTION DIST'].str[-3:])

df=df.dropna()
df=df.sort_values(by=['COUNTY','ELECTION DIST'])

df

COUNTYMunicipalityWardEDELECTION DISTSTATUSDEMREPCONWOROTHBLANKTOTAL
1AlbanyAlbany001001ALBANY 001001Active74.09.00.00.00.016.099.0
5AlbanyAlbany001002ALBANY 001002Active311.016.02.01.010.047.0387.0
9AlbanyAlbany001003ALBANY 001003Active472.026.05.01.027.0121.0652.0
13AlbanyAlbany001004ALBANY 001004Active437.030.02.03.012.092.0576.0
17AlbanyAlbany001005ALBANY 001005Active13.00.00.00.00.00.013.0
53YatesMilo000006Milo 000006Active204.0409.013.03.050.0182.0861.0
57YatesPotter000001Potter 000001Active144.0460.025.01.051.0226.0907.0
61YatesStarkey000001Starkey 000001Active187.0370.015.05.057.0209.0843.0
65YatesStarkey000002Starkey 000002Active189.0433.018.04.046.0201.0891.0
69YatesTorrey000001Torrey 000001Active185.0313.09.03.039.0159.0708.0

And then do all kinds of typical PANDAS processing. For example to categorize the counties into three pots — based on a comparison of the number of Democrats to Republicans, you could run this command:

tdf=df.groupby(by=['COUNTY']).sum()
pd.cut((tdf.div(tdf.iloc[:,:-1].sum(axis=1),axis=0)*100).sort_values(by='BLANK',ascending=False)['DEM'],3,labels=['Rep','Mix','Dem']).sort_values()
COUNTY
Rensselaer      Rep
Genesee         Rep
Oswego          Rep
Livingston      Rep
Schuyler        Rep
Orleans         Rep
Seneca          Rep
Otsego          Rep
Warren          Rep
St.Lawrence     Rep
Cattaraugus     Rep
Chemung         Rep
Cayuga          Rep
Tioga           Rep
Yates           Rep
Broome          Rep
Franklin        Rep
Oneida          Rep
Fulton          Rep
Allegany        Rep
Essex           Rep
Niagara         Rep
Steuben         Rep
Herkimer        Rep
Lewis           Rep
Delaware        Rep
Wyoming         Rep
Chenango        Rep
Hamilton        Rep
Greene          Rep
Sullivan        Rep
Wayne           Rep
Jefferson       Rep
Ontario         Rep
Chautauqua      Rep
Putnam          Rep
Madison         Rep
Washington      Rep
Schoharie       Rep
Dutchess        Rep
Montgomery      Rep
Clinton         Rep
Orange          Rep
Cortland        Rep
Suffolk         Rep
Onondaga        Rep
Saratoga        Rep
Erie            Mix
Schenectady     Mix
Ulster          Mix
Rockland        Mix
Columbia        Mix
Monroe          Mix
Westchester     Mix
Richmond        Mix
Albany          Mix
Tompkins        Mix
Nassau          Mix
Queens          Dem
Bronx           Dem
New York        Dem
Kings           Dem
Name: DEM, dtype: category
Categories (3, object): ['Rep' < 'Mix' < 'Dem']

If you are using the pre-2018 data, I suggest converting the PDF to text documents using the pdftotext -layout which is available on most Linux distributions as part of the poppler-utils package. This converts the PDF tables to text files, which you can process like this:

import pandas as pd
df = pd.read_csv('/home/andy/Desktop/AlbanyED_nov16.txt',
   skiprows=3,
   sep='\s{2,}',
   engine='python',
   error_bad_lines=False)
df=df[df['STATUS']=='Active']
df=df.dropna()

Hundreds of malicious Python packages found stealing sensitive data

Hundreds of malicious Python packages found stealing sensitive data

A malicious campaign that researchers observed growing more complex over the past half year, has been planting on open-source platforms hundreds of info-stealing packages that counted about 75,000 downloads.

The campaign has been monitored since early April by analysts at Checkmarx's Supply Chain Security team, who discovered?272 packages with code for stealing sensitive data from targeted systems.

PANDAS and PDF Tables

PANDAS and PDF Tables

I discovered the tabula-py library, which is an alternative to converting PDFs externally to Python using pdftotext -layout from the poppler set of utilities. This might be a good alternative to having to install and run a seperate command line program.

import tabula
import pandas as pd

pdf="/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df=pd.concat(tabula.read_pdf(pdf,pages='all', stream=True))

df.query("STATUS=='Active'")
COUNTYELECTION DISTSTATUSDEMREPCONGREWORINDWEPREFOTHBLANKTOTALCOUNTY ELECTION DISTUnnamed: 0
0AlbanyALBANY 001001Active6719100100013101NaNNaN
3AlbanyALBANY 001002Active286203011900045374NaNNaN
6AlbanyALBANY 001003Active468329311900081613NaNNaN
9AlbanyALBANY 001004Active449256141510174576NaNNaN
12AlbanyALBANY 001005Active50000100006NaNNaN
7NaNNaNActive291594003101189476Albany WATERVLIET 004004NaN
10NaNNaNActive339213294357000202847Albany WESTERLO 000001NaN
13NaNNaNActive348119171543000160693Albany WESTERLO 000002NaN
16NaNNaNActive314137262235000196712Albany WESTERLO 000003NaN
19NaNNaNActive91,65835,8212,9915256089,612491419841,926183,402Albany County TotalNaN

There also poppler bindings for python, which you can use after installing the poppler-cpp developer library. This should work but currently it’s not parsing things correctly.

from poppler import load_from_file, PageRenderer

pdf_document = load_from_file(pdfFile)
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(page_1_text),
   skiprows=3,
   sep='\s{2,}',
   engine='python',
   error_bad_lines=False)

df
COUNTY ELECTION DISTSTATUSDEMREPCONGREWORINDWEP REF OTH BLANK TOTAL
Albany ALBANY 001001Active67.019.01.00.00.01.00.00.00.013.0101.0
Inactive5.02.00.00.00.00.00.00.00.02.09.0
Total72.021.01.00.00.01.00.00.00.015.0110.0
Albany ALBANY 001002Active286.020.03.00.01.019.00.00.00.045.0374.0
Inactive27.08.00.00.00.03.00.00.00.012.050.0
Total313.028.03.00.01.022.00.00.00.057.0424.0
Albany ALBANY 001003Active468.032.09.03.01.019.00.00.00.081.0613.0
Inactive56.05.00.00.00.02.00.00.00.019.082.0
Total524.037.09.03.01.021.00.00.00.0100.0695.0
Albany ALBANY 001004Active449.025.06.01.04.015.01.00.01.074.0576.0
Inactive64.00.01.00.00.04.00.00.00.013.082.0
Total513.025.07.01.04.019.01.00.01.087.0658.0
Albany ALBANY 001005Active5.00.00.00.00.01.00.00.00.00.06.0
Total5.00.00.00.00.01.00.00.00.00.06.0
Albany ALBANY 001006Active462.011.05.02.04.014.00.00.01.057.0556.0
Inactive63.03.00.00.00.05.00.00.01.012.084.0
Total525.014.05.02.04.019.00.00.02.069.0640.0
Albany ALBANY 001007Active435.019.01.00.02.019.00.00.00.062.0538.0
Inactive60.03.00.00.02.03.00.00.00.07.075.0
Total495.022.01.00.04.022.00.00.00.069.0613.0
Albany ALBANY 001008Active160.07.01.00.02.02.00.00.01.037.0210.0
Page 1 of 46NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

Been playing around a bit with PANDAS and ArcGIS Feature Servers for Tax Data

Been playing around a bit with PANDAS and ArcGIS Feature Servers for Tax Data … πŸ“œ

For example, to find the most valuable parcels in Albany County.

from arcgis.features import FeatureLayer
lyr_url = 'https://gisservices.its.ny.gov/arcgis/rest/services/NYS_Tax_Parcel_Centroid_Points/MapServer/0'
layer = FeatureLayer(lyr_url)
query_result1 = layer.query(where="COUNTY_NAME='Albany' AND FULL_MARKET_VAL > 100000000",
                                    out_fields='PARCEL_ADDR,CITYTOWN_NAME,FULL_MARKET_VAL,OWNER_TYPE', out_sr='4326')

df=query_result1.sdf.sort_values(by='FULL_MARKET_VAL', ascending=False)
df['Full Market Value'] = df['FULL_MARKET_VAL'].map('${:,.0f}'.format)

df
  OBJECTID PARCEL_ADDR CITYTOWN_NAME FULL_MARKET_VAL OWNER_TYPE SHAPE Full Market Value
11 26652 64 Eagle St Albany 1204254925 2 {“x”: -73.75980312511581, “y”: 42.650469918250… $1,204,254,925
3 9150 1200 Washington Ave Albany 886298715 2 {“x”: -73.81092293494828, “y”: 42.679257168282… $886,298,715
4 10208 1400 Washington Ave Albany 642398287 2 {“x”: -73.82369286130952, “y”: 42.685845700657… $642,398,287
0 885 251 Fuller Rd Albany 440042827 2 {“x”: -73.83559002316825, “y”: 42.690208093507… $440,042,827
5 18164 632 New Scotland Ave Albany 377568201 8 {“x”: -73.80381341626146, “y”: 42.655758957669… $377,568,201
1 906 141 Fuller Rd Albany 321199143 2 {“x”: -73.83323986150171, “y”: 42.693189748928… $321,199,143
19 108087 See Card 1067 Watervliet 280898876 1 {“x”: -73.70670724174552, “y”: 42.719628647232… $280,898,876
15 65380 737 Alb Shaker Rd Colonie 263916100 3 {“x”: -73.80365248218001, “y”: 42.747956678125… $263,916,100
9 21923 304 Madison Ave Albany 234265418 2 {“x”: -73.76227373289564, “y”: 42.648000674457… $234,265,418
2 907 201 Fuller Rd Albany 203426124 2 {“x”: -73.83362605353057, “y”: 42.692609131686… $203,426,124
16 69999 515 Loudon Rd Colonie 166065600 8 {“x”: -73.74958475282632, “y”: 42.719321807666… $166,065,600
7 20592 47 New Scotland Ave Albany 162276338 8 {“x”: -73.77597163421673, “y”: 42.653565689693… $162,276,338
6 20574 132 S Lake Ave Albany 146296360 2 {“x”: -73.77970918544908, “y”: 42.654390366929… $146,296,360
8 20597 113 Holland Ave Albany 143498501 2 {“x”: -73.77306688593143, “y”: 42.650762742870… $143,498,501
17 78203 Mannsville Colonie 142570400 1 {“x”: -73.71245452369443, “y”: 42.718124477080… $142,570,400
18 95509 1 Crossgates Mall Rd Guilderland 130554700 8 {“x”: -73.84702700595471, “y”: 42.687699053797… $130,554,700
10 24521 86 S Swan St Albany 128436403 2 {“x”: -73.75980563770365, “y”: 42.653931892804… $128,436,403
13 46883 1916 US 9W Coeymans 110000000 8 {“x”: -73.83388475575597, “y”: 42.488730743021… $110,000,000
12 35152 380 River Rd Bethlehem 105263158 8 {“x”: -73.76445503554325, “y”: 42.595925419330… $105,263,158
14 65097 15 Wolf Rd Colonie 101967213 8 {“x”: -73.81423716588279, “y”: 42.709939498581… $101,967,213