Bulk Download NYS Election District Enrollments

For some of my projects, I need access to the full set of enrollments from the State Board of Elections. While you can download them one at a time, it sure is nice to have all of them for statewide coverage to easy processing. You can get them using Beautiful Soup.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

yr = '12'
mon = 'nov'
ext = '.pdf' #.xlsx

url = "https://www.elections.ny.gov/20"+yr+"EnrollmentED.html"

#If there is no such folder, the script will create one automatically
folder_location = r''+mon+yr+'-Enrollment'
if not os.path.exists(folder_location):os.mkdir(folder_location)

response = requests.get(url, { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103197 Safari/537.36' })
soup= BeautifulSoup(response.text, "html.parser")     
for link in soup.select("a[href$='"+mon+yr+ext+"']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)

What can you do with them? If you are looking at the numbers from 2018 and later, you can load them directly in PANDAS.

import glob as g
import pandas as pd

df = pd.concat((pd.read_excel(f, header=4) for f in g.glob('/home/andy/2021-Enrollment/*xlsx')))

df = df[(df['STATUS']=='Active')]

df.insert(1,'Municipality',df['ELECTION DIST'].str[:-7].str.title())
df.insert(2,'Ward',df['ELECTION DIST'].str[-6:-3])
df.insert(3,'ED', df['ELECTION DIST'].str[-3:])

df=df.dropna()
df=df.sort_values(by=['COUNTY','ELECTION DIST'])

df

COUNTYMunicipalityWardEDELECTION DISTSTATUSDEMREPCONWOROTHBLANKTOTAL
1AlbanyAlbany001001ALBANY 001001Active74.09.00.00.00.016.099.0
5AlbanyAlbany001002ALBANY 001002Active311.016.02.01.010.047.0387.0
9AlbanyAlbany001003ALBANY 001003Active472.026.05.01.027.0121.0652.0
13AlbanyAlbany001004ALBANY 001004Active437.030.02.03.012.092.0576.0
17AlbanyAlbany001005ALBANY 001005Active13.00.00.00.00.00.013.0
53YatesMilo000006Milo 000006Active204.0409.013.03.050.0182.0861.0
57YatesPotter000001Potter 000001Active144.0460.025.01.051.0226.0907.0
61YatesStarkey000001Starkey 000001Active187.0370.015.05.057.0209.0843.0
65YatesStarkey000002Starkey 000002Active189.0433.018.04.046.0201.0891.0
69YatesTorrey000001Torrey 000001Active185.0313.09.03.039.0159.0708.0

And then do all kinds of typical PANDAS processing. For example to categorize the counties into three pots — based on a comparison of the number of Democrats to Republicans, you could run this command:

tdf=df.groupby(by=['COUNTY']).sum()
pd.cut((tdf.div(tdf.iloc[:,:-1].sum(axis=1),axis=0)*100).sort_values(by='BLANK',ascending=False)['DEM'],3,labels=['Rep','Mix','Dem']).sort_values()
COUNTY
Rensselaer      Rep
Genesee         Rep
Oswego          Rep
Livingston      Rep
Schuyler        Rep
Orleans         Rep
Seneca          Rep
Otsego          Rep
Warren          Rep
St.Lawrence     Rep
Cattaraugus     Rep
Chemung         Rep
Cayuga          Rep
Tioga           Rep
Yates           Rep
Broome          Rep
Franklin        Rep
Oneida          Rep
Fulton          Rep
Allegany        Rep
Essex           Rep
Niagara         Rep
Steuben         Rep
Herkimer        Rep
Lewis           Rep
Delaware        Rep
Wyoming         Rep
Chenango        Rep
Hamilton        Rep
Greene          Rep
Sullivan        Rep
Wayne           Rep
Jefferson       Rep
Ontario         Rep
Chautauqua      Rep
Putnam          Rep
Madison         Rep
Washington      Rep
Schoharie       Rep
Dutchess        Rep
Montgomery      Rep
Clinton         Rep
Orange          Rep
Cortland        Rep
Suffolk         Rep
Onondaga        Rep
Saratoga        Rep
Erie            Mix
Schenectady     Mix
Ulster          Mix
Rockland        Mix
Columbia        Mix
Monroe          Mix
Westchester     Mix
Richmond        Mix
Albany          Mix
Tompkins        Mix
Nassau          Mix
Queens          Dem
Bronx           Dem
New York        Dem
Kings           Dem
Name: DEM, dtype: category
Categories (3, object): ['Rep' < 'Mix' < 'Dem']

If you are using the pre-2018 data, I suggest converting the PDF to text documents using the pdftotext -layout which is available on most Linux distributions as part of the poppler-utils package. This converts the PDF tables to text files, which you can process like this:

import pandas as pd
df = pd.read_csv('/home/andy/Desktop/AlbanyED_nov16.txt',
   skiprows=3,
   sep='\s{2,}',
   engine='python',
   error_bad_lines=False)
df=df[df['STATUS']=='Active']
df=df.dropna()

Leave a Reply

Your email address will not be published. Required fields are marked *