Python

Python is an interpreted high-level general-purpose programming language. Python’s design philosophy emphasizes code readability with its notable use of significant indentation.

Bulk Download NYS Election District Enrollments

For some of my projects, I need access to the full set of enrollments from the State Board of Elections. While you can download them one at a time, it sure is nice to have all of them for statewide coverage to easy processing. You can get them using Beautiful Soup.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

yr = '12'
mon = 'nov'
ext = '.pdf' #.xlsx

url = "https://www.elections.ny.gov/20"+yr+"EnrollmentED.html"

#If there is no such folder, the script will create one automatically
folder_location = r''+mon+yr+'-Enrollment'
if not os.path.exists(folder_location):os.mkdir(folder_location)

response = requests.get(url, { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103197 Safari/537.36' })
soup= BeautifulSoup(response.text, "html.parser")     
for link in soup.select("a[href$='"+mon+yr+ext+"']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)

What can you do with them? If you are looking at the numbers from 2018 and later, you can load them directly in PANDAS.

import glob as g
import pandas as pd

df = pd.concat((pd.read_excel(f, header=4) for f in g.glob('/home/andy/2021-Enrollment/*xlsx')))

df = df[(df['STATUS']=='Active')]

df.insert(1,'Municipality',df['ELECTION DIST'].str[:-7].str.title())
df.insert(2,'Ward',df['ELECTION DIST'].str[-6:-3])
df.insert(3,'ED', df['ELECTION DIST'].str[-3:])

df=df.dropna()
df=df.sort_values(by=['COUNTY','ELECTION DIST'])

df

COUNTYMunicipalityWardEDELECTION DISTSTATUSDEMREPCONWOROTHBLANKTOTAL
1AlbanyAlbany001001ALBANY 001001Active74.09.00.00.00.016.099.0
5AlbanyAlbany001002ALBANY 001002Active311.016.02.01.010.047.0387.0
9AlbanyAlbany001003ALBANY 001003Active472.026.05.01.027.0121.0652.0
13AlbanyAlbany001004ALBANY 001004Active437.030.02.03.012.092.0576.0
17AlbanyAlbany001005ALBANY 001005Active13.00.00.00.00.00.013.0
53YatesMilo000006Milo 000006Active204.0409.013.03.050.0182.0861.0
57YatesPotter000001Potter 000001Active144.0460.025.01.051.0226.0907.0
61YatesStarkey000001Starkey 000001Active187.0370.015.05.057.0209.0843.0
65YatesStarkey000002Starkey 000002Active189.0433.018.04.046.0201.0891.0
69YatesTorrey000001Torrey 000001Active185.0313.09.03.039.0159.0708.0

And then do all kinds of typical PANDAS processing. For example to categorize the counties into three pots — based on a comparison of the number of Democrats to Republicans, you could run this command:

tdf=df.groupby(by=['COUNTY']).sum()
pd.cut((tdf.div(tdf.iloc[:,:-1].sum(axis=1),axis=0)*100).sort_values(by='BLANK',ascending=False)['DEM'],3,labels=['Rep','Mix','Dem']).sort_values()
COUNTY
Rensselaer      Rep
Genesee         Rep
Oswego          Rep
Livingston      Rep
Schuyler        Rep
Orleans         Rep
Seneca          Rep
Otsego          Rep
Warren          Rep
St.Lawrence     Rep
Cattaraugus     Rep
Chemung         Rep
Cayuga          Rep
Tioga           Rep
Yates           Rep
Broome          Rep
Franklin        Rep
Oneida          Rep
Fulton          Rep
Allegany        Rep
Essex           Rep
Niagara         Rep
Steuben         Rep
Herkimer        Rep
Lewis           Rep
Delaware        Rep
Wyoming         Rep
Chenango        Rep
Hamilton        Rep
Greene          Rep
Sullivan        Rep
Wayne           Rep
Jefferson       Rep
Ontario         Rep
Chautauqua      Rep
Putnam          Rep
Madison         Rep
Washington      Rep
Schoharie       Rep
Dutchess        Rep
Montgomery      Rep
Clinton         Rep
Orange          Rep
Cortland        Rep
Suffolk         Rep
Onondaga        Rep
Saratoga        Rep
Erie            Mix
Schenectady     Mix
Ulster          Mix
Rockland        Mix
Columbia        Mix
Monroe          Mix
Westchester     Mix
Richmond        Mix
Albany          Mix
Tompkins        Mix
Nassau          Mix
Queens          Dem
Bronx           Dem
New York        Dem
Kings           Dem
Name: DEM, dtype: category
Categories (3, object): ['Rep' < 'Mix' < 'Dem']

If you are using the pre-2018 data, I suggest converting the PDF to text documents using the pdftotext -layout which is available on most Linux distributions as part of the poppler-utils package. This converts the PDF tables to text files, which you can process like this:

import pandas as pd
df = pd.read_csv('/home/andy/Desktop/AlbanyED_nov16.txt',
   skiprows=3,
   sep='\s{2,}',
   engine='python',
   error_bad_lines=False)
df=df[df['STATUS']=='Active']
df=df.dropna()

Hundreds of malicious Python packages found stealing sensitive data

Hundreds of malicious Python packages found stealing sensitive data

A malicious campaign that researchers observed growing more complex over the past half year, has been planting on open-source platforms hundreds of info-stealing packages that counted about 75,000 downloads.

The campaign has been monitored since early April by analysts at Checkmarx's Supply Chain Security team, who discovered?272 packages with code for stealing sensitive data from targeted systems.

PANDAS and PDF Tables

PANDAS and PDF Tables

I discovered the tabula-py library, which is an alternative to converting PDFs externally to Python using pdftotext -layout from the poppler set of utilities. This might be a good alternative to having to install and run a seperate command line program.

import tabula
import pandas as pd

pdf="/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df=pd.concat(tabula.read_pdf(pdf,pages='all', stream=True))

df.query("STATUS=='Active'")
COUNTYELECTION DISTSTATUSDEMREPCONGREWORINDWEPREFOTHBLANKTOTALCOUNTY ELECTION DISTUnnamed: 0
0AlbanyALBANY 001001Active6719100100013101NaNNaN
3AlbanyALBANY 001002Active286203011900045374NaNNaN
6AlbanyALBANY 001003Active468329311900081613NaNNaN
9AlbanyALBANY 001004Active449256141510174576NaNNaN
12AlbanyALBANY 001005Active50000100006NaNNaN
7NaNNaNActive291594003101189476Albany WATERVLIET 004004NaN
10NaNNaNActive339213294357000202847Albany WESTERLO 000001NaN
13NaNNaNActive348119171543000160693Albany WESTERLO 000002NaN
16NaNNaNActive314137262235000196712Albany WESTERLO 000003NaN
19NaNNaNActive91,65835,8212,9915256089,612491419841,926183,402Albany County TotalNaN

There also poppler bindings for python, which you can use after installing the poppler-cpp developer library. This should work but currently it’s not parsing things correctly.

from poppler import load_from_file, PageRenderer

pdf_document = load_from_file(pdfFile)
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(page_1_text),
   skiprows=3,
   sep='\s{2,}',
   engine='python',
   error_bad_lines=False)

df
COUNTY ELECTION DISTSTATUSDEMREPCONGREWORINDWEP REF OTH BLANK TOTAL
Albany ALBANY 001001Active67.019.01.00.00.01.00.00.00.013.0101.0
Inactive5.02.00.00.00.00.00.00.00.02.09.0
Total72.021.01.00.00.01.00.00.00.015.0110.0
Albany ALBANY 001002Active286.020.03.00.01.019.00.00.00.045.0374.0
Inactive27.08.00.00.00.03.00.00.00.012.050.0
Total313.028.03.00.01.022.00.00.00.057.0424.0
Albany ALBANY 001003Active468.032.09.03.01.019.00.00.00.081.0613.0
Inactive56.05.00.00.00.02.00.00.00.019.082.0
Total524.037.09.03.01.021.00.00.00.0100.0695.0
Albany ALBANY 001004Active449.025.06.01.04.015.01.00.01.074.0576.0
Inactive64.00.01.00.00.04.00.00.00.013.082.0
Total513.025.07.01.04.019.01.00.01.087.0658.0
Albany ALBANY 001005Active5.00.00.00.00.01.00.00.00.00.06.0
Total5.00.00.00.00.01.00.00.00.00.06.0
Albany ALBANY 001006Active462.011.05.02.04.014.00.00.01.057.0556.0
Inactive63.03.00.00.00.05.00.00.01.012.084.0
Total525.014.05.02.04.019.00.00.02.069.0640.0
Albany ALBANY 001007Active435.019.01.00.02.019.00.00.00.062.0538.0
Inactive60.03.00.00.02.03.00.00.00.07.075.0
Total495.022.01.00.04.022.00.00.00.069.0613.0
Albany ALBANY 001008Active160.07.01.00.02.02.00.00.01.037.0210.0
Page 1 of 46NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

Been playing around a bit with PANDAS and ArcGIS Feature Servers for Tax Data

Been playing around a bit with PANDAS and ArcGIS Feature Servers for Tax Data … πŸ“œ

For example, to find the most valuable parcels in Albany County.

from arcgis.features import FeatureLayer
lyr_url = 'https://gisservices.its.ny.gov/arcgis/rest/services/NYS_Tax_Parcel_Centroid_Points/MapServer/0'
layer = FeatureLayer(lyr_url)
query_result1 = layer.query(where="COUNTY_NAME='Albany' AND FULL_MARKET_VAL > 100000000",
                                    out_fields='PARCEL_ADDR,CITYTOWN_NAME,FULL_MARKET_VAL,OWNER_TYPE', out_sr='4326')

df=query_result1.sdf.sort_values(by='FULL_MARKET_VAL', ascending=False)
df['Full Market Value'] = df['FULL_MARKET_VAL'].map('${:,.0f}'.format)

df
  OBJECTID PARCEL_ADDR CITYTOWN_NAME FULL_MARKET_VAL OWNER_TYPE SHAPE Full Market Value
11 26652 64 Eagle St Albany 1204254925 2 {“x”: -73.75980312511581, “y”: 42.650469918250… $1,204,254,925
3 9150 1200 Washington Ave Albany 886298715 2 {“x”: -73.81092293494828, “y”: 42.679257168282… $886,298,715
4 10208 1400 Washington Ave Albany 642398287 2 {“x”: -73.82369286130952, “y”: 42.685845700657… $642,398,287
0 885 251 Fuller Rd Albany 440042827 2 {“x”: -73.83559002316825, “y”: 42.690208093507… $440,042,827
5 18164 632 New Scotland Ave Albany 377568201 8 {“x”: -73.80381341626146, “y”: 42.655758957669… $377,568,201
1 906 141 Fuller Rd Albany 321199143 2 {“x”: -73.83323986150171, “y”: 42.693189748928… $321,199,143
19 108087 See Card 1067 Watervliet 280898876 1 {“x”: -73.70670724174552, “y”: 42.719628647232… $280,898,876
15 65380 737 Alb Shaker Rd Colonie 263916100 3 {“x”: -73.80365248218001, “y”: 42.747956678125… $263,916,100
9 21923 304 Madison Ave Albany 234265418 2 {“x”: -73.76227373289564, “y”: 42.648000674457… $234,265,418
2 907 201 Fuller Rd Albany 203426124 2 {“x”: -73.83362605353057, “y”: 42.692609131686… $203,426,124
16 69999 515 Loudon Rd Colonie 166065600 8 {“x”: -73.74958475282632, “y”: 42.719321807666… $166,065,600
7 20592 47 New Scotland Ave Albany 162276338 8 {“x”: -73.77597163421673, “y”: 42.653565689693… $162,276,338
6 20574 132 S Lake Ave Albany 146296360 2 {“x”: -73.77970918544908, “y”: 42.654390366929… $146,296,360
8 20597 113 Holland Ave Albany 143498501 2 {“x”: -73.77306688593143, “y”: 42.650762742870… $143,498,501
17 78203 Mannsville Colonie 142570400 1 {“x”: -73.71245452369443, “y”: 42.718124477080… $142,570,400
18 95509 1 Crossgates Mall Rd Guilderland 130554700 8 {“x”: -73.84702700595471, “y”: 42.687699053797… $130,554,700
10 24521 86 S Swan St Albany 128436403 2 {“x”: -73.75980563770365, “y”: 42.653931892804… $128,436,403
13 46883 1916 US 9W Coeymans 110000000 8 {“x”: -73.83388475575597, “y”: 42.488730743021… $110,000,000
12 35152 380 River Rd Bethlehem 105263158 8 {“x”: -73.76445503554325, “y”: 42.595925419330… $105,263,158
14 65097 15 Wolf Rd Colonie 101967213 8 {“x”: -73.81423716588279, “y”: 42.709939498581… $101,967,213