Method Chaining Classes in Python
That's how you do it in Python. But I still prefer the %>% operator in R or the Unix pipeline.
For some of my projects, I need access to the full set of enrollment numbers from the State Board of Elections. While you can download them one at a time, it sure is nice to have all of them at once for statewide coverage and easy processing. You can fetch them using Beautiful Soup.
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

yr = '12'
mon = 'nov'
ext = '.pdf'  # or '.xlsx'
url = 'https://www.elections.ny.gov/20' + yr + 'EnrollmentED.html'

# If there is no such folder, the script will create one automatically
folder_location = mon + yr + '-Enrollment'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

# The User-Agent must go in headers=; passing the dict positionally
# would send it as query parameters instead
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103197 Safari/537.36'})
soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.select("a[href$='" + mon + yr + ext + "']"):
    # Name each file after the last portion of its link, which is unique here
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
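The `a[href$='…']` CSS selector does the heavy lifting here: it matches only anchors whose `href` ends with the month/year/extension suffix. A minimal illustration on a fabricated snippet (the filenames below are made up):

```python
from bs4 import BeautifulSoup

# Fabricated HTML resembling the enrollment page's link list
html = """
<a href="/files/AlbanyED_nov12.pdf">Albany PDF</a>
<a href="/files/AlbanyED_nov12.xlsx">Albany XLSX</a>
<a href="/files/AlbanyED_nov08.pdf">Older PDF</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# $= is the "ends with" attribute selector, so only the nov12 PDF matches
links = [a['href'] for a in soup.select("a[href$='nov12.pdf']")]
print(links)  # ['/files/AlbanyED_nov12.pdf']
```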
What can you do with them? If you are looking at the numbers from 2018 and later, you can load the spreadsheets directly into pandas.
import glob as g
import pandas as pd

df = pd.concat(pd.read_excel(f, header=4) for f in g.glob('/home/andy/2021-Enrollment/*.xlsx'))
df = df[df['STATUS'] == 'Active']

# Split the combined ELECTION DIST string into its components
df.insert(1, 'Municipality', df['ELECTION DIST'].str[:-7].str.title())
df.insert(2, 'Ward', df['ELECTION DIST'].str[-6:-3])
df.insert(3, 'ED', df['ELECTION DIST'].str[-3:])

df = df.dropna()
df = df.sort_values(by=['COUNTY', 'ELECTION DIST'])
df
| COUNTY | Municipality | Ward | ED | ELECTION DIST | STATUS | DEM | REP | CON | WOR | OTH | BLANK | TOTAL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Albany | Albany | 001 | 001 | ALBANY 001001 | Active | 74.0 | 9.0 | 0.0 | 0.0 | 0.0 | 16.0 | 99.0 |
5 | Albany | Albany | 001 | 002 | ALBANY 001002 | Active | 311.0 | 16.0 | 2.0 | 1.0 | 10.0 | 47.0 | 387.0 |
9 | Albany | Albany | 001 | 003 | ALBANY 001003 | Active | 472.0 | 26.0 | 5.0 | 1.0 | 27.0 | 121.0 | 652.0 |
13 | Albany | Albany | 001 | 004 | ALBANY 001004 | Active | 437.0 | 30.0 | 2.0 | 3.0 | 12.0 | 92.0 | 576.0 |
17 | Albany | Albany | 001 | 005 | ALBANY 001005 | Active | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … |
53 | Yates | Milo | 000 | 006 | Milo 000006 | Active | 204.0 | 409.0 | 13.0 | 3.0 | 50.0 | 182.0 | 861.0 |
57 | Yates | Potter | 000 | 001 | Potter 000001 | Active | 144.0 | 460.0 | 25.0 | 1.0 | 51.0 | 226.0 | 907.0 |
61 | Yates | Starkey | 000 | 001 | Starkey 000001 | Active | 187.0 | 370.0 | 15.0 | 5.0 | 57.0 | 209.0 | 843.0 |
65 | Yates | Starkey | 000 | 002 | Starkey 000002 | Active | 189.0 | 433.0 | 18.0 | 4.0 | 46.0 | 201.0 | 891.0 |
69 | Yates | Torrey | 000 | 001 | Torrey 000001 | Active | 185.0 | 313.0 | 9.0 | 3.0 | 39.0 | 159.0 | 708.0 |
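The `Municipality`/`Ward`/`ED` columns above come from fixed-width slicing of the ELECTION DIST string, which always ends in a three-digit ward plus a three-digit district. A quick sketch of the same slices on two sample values:

```python
import pandas as pd

# ELECTION DIST packs municipality, ward, and district into one string
s = pd.Series(['ALBANY 001001', 'WESTERLO 000002'])

print(s.str[:-7].str.title())  # Municipality: Albany, Westerlo
print(s.str[-6:-3])            # Ward: 001, 000
print(s.str[-3:])              # ED:   001, 002
```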
And then do all kinds of typical pandas processing. For example, to sort the counties into three pots based on a comparison of the number of Democrats to Republicans, you could run:
tdf = df.groupby(by=['COUNTY']).sum()

# Convert each party column to a share of county enrollment (the iloc
# excludes the TOTAL column from the denominator), then cut the DEM
# share into three equal-width bins
pd.cut((tdf.div(tdf.iloc[:, :-1].sum(axis=1), axis=0) * 100)
       .sort_values(by='BLANK', ascending=False)['DEM'],
       3, labels=['Rep', 'Mix', 'Dem']).sort_values()
COUNTY
Rep: Rensselaer, Genesee, Oswego, Livingston, Schuyler, Orleans, Seneca, Otsego, Warren, St.Lawrence, Cattaraugus, Chemung, Cayuga, Tioga, Yates, Broome, Franklin, Oneida, Fulton, Allegany, Essex, Niagara, Steuben, Herkimer, Lewis, Delaware, Wyoming, Chenango, Hamilton, Greene, Sullivan, Wayne, Jefferson, Ontario, Chautauqua, Putnam, Madison, Washington, Schoharie, Dutchess, Montgomery, Clinton, Orange, Cortland, Suffolk, Onondaga, Saratoga
Mix: Erie, Schenectady, Ulster, Rockland, Columbia, Monroe, Westchester, Richmond, Albany, Tompkins, Nassau
Dem: Queens, Bronx, New York, Kings
Name: DEM, dtype: category
Categories (3, object): ['Rep' < 'Mix' < 'Dem']
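Note that `pd.cut` with an integer bin count makes equal-width bins over the observed range, so the Rep/Mix/Dem labels reflect where each county's Democratic share falls within that range rather than fixed thresholds. A toy example with made-up share percentages:

```python
import pandas as pd

# Hypothetical Democratic-share percentages for five made-up counties
share = pd.Series([20.0, 35.0, 50.0, 65.0, 80.0],
                  index=['A', 'B', 'C', 'D', 'E'])

# The range is 20-80, so the three equal-width bins break at 40 and 60
pots = pd.cut(share, 3, labels=['Rep', 'Mix', 'Dem'])
print(pots)
```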
If you are using the pre-2018 data, I suggest converting the PDFs to text documents with `pdftotext -layout` (e.g. `pdftotext -layout AlbanyED_nov16.pdf AlbanyED_nov16.txt`), which is available on most Linux distributions as part of the poppler-utils package. This converts the PDF tables to text files, which you can process like this:
import pandas as pd

df = pd.read_csv('/home/andy/Desktop/AlbanyED_nov16.txt',
                 skiprows=3,
                 sep=r'\s{2,}',
                 engine='python',
                 on_bad_lines='skip')  # error_bad_lines=False was removed in pandas 2.0
df = df[df['STATUS'] == 'Active']
df = df.dropna()
I also discovered the tabula-py library, which parses PDF tables directly from Python, an alternative to converting PDFs externally with `pdftotext -layout` from the poppler set of utilities. This might be a good option if you don't want to install and run a separate command-line program.
import tabula
import pandas as pd

pdf = "/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df = pd.concat(tabula.read_pdf(pdf, pages='all', stream=True))
df.query("STATUS=='Active'")
COUNTY | ELECTION DIST | STATUS | DEM | REP | CON | GRE | WOR | IND | WEP | REF | OTH | BLANK | TOTAL | COUNTY ELECTION DIST | Unnamed: 0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Albany | ALBANY 001001 | Active | 67 | 19 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 13 | 101 | NaN | NaN |
3 | Albany | ALBANY 001002 | Active | 286 | 20 | 3 | 0 | 1 | 19 | 0 | 0 | 0 | 45 | 374 | NaN | NaN |
6 | Albany | ALBANY 001003 | Active | 468 | 32 | 9 | 3 | 1 | 19 | 0 | 0 | 0 | 81 | 613 | NaN | NaN |
9 | Albany | ALBANY 001004 | Active | 449 | 25 | 6 | 1 | 4 | 15 | 1 | 0 | 1 | 74 | 576 | NaN | NaN |
12 | Albany | ALBANY 001005 | Active | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | NaN | NaN |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
7 | NaN | NaN | Active | 291 | 59 | 4 | 0 | 0 | 31 | 0 | 1 | 1 | 89 | 476 | Albany WATERVLIET 004004 | NaN |
10 | NaN | NaN | Active | 339 | 213 | 29 | 4 | 3 | 57 | 0 | 0 | 0 | 202 | 847 | Albany WESTERLO 000001 | NaN |
13 | NaN | NaN | Active | 348 | 119 | 17 | 1 | 5 | 43 | 0 | 0 | 0 | 160 | 693 | Albany WESTERLO 000002 | NaN |
16 | NaN | NaN | Active | 314 | 137 | 26 | 2 | 2 | 35 | 0 | 0 | 0 | 196 | 712 | Albany WESTERLO 000003 | NaN |
19 | NaN | NaN | Active | 91,658 | 35,821 | 2,991 | 525 | 608 | 9,612 | 49 | 14 | 198 | 41,926 | 183,402 | Albany County Total | NaN |
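One wrinkle with the tabula output: the county-total row comes through with comma-grouped number strings ('91,658'), which leaves those columns typed as objects rather than numbers. A sketch of one way to coerce them:

```python
import pandas as pd

# Comma-grouped strings, as extracted from the county-total row
s = pd.Series(['91,658', '35,821', '183,402'])

# Strip the thousands separators, then convert to integers
nums = pd.to_numeric(s.str.replace(',', '', regex=False))
print(nums.tolist())  # [91658, 35821, 183402]
```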
There are also poppler bindings for Python, which you can use after installing the poppler-cpp development library. This should work, but at the moment it is not parsing the tables correctly:
import pandas as pd
from io import StringIO
from poppler import load_from_file

pdf_file = '/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf'
pdf_document = load_from_file(pdf_file)
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()

df = pd.read_csv(StringIO(page_1_text),
                 skiprows=3,
                 sep=r'\s{2,}',
                 engine='python',
                 on_bad_lines='skip')  # error_bad_lines=False was removed in pandas 2.0
df
COUNTY ELECTION DIST | STATUS | DEM | REP | CON | GRE | WOR | IND | WEP REF OTH BLANK TOTAL | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Albany ALBANY 001001 | Active | 67.0 | 19.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 13.0 | 101.0 |
Inactive | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 9.0 | |
Total | 72.0 | 21.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 15.0 | 110.0 | |
Albany ALBANY 001002 | Active | 286.0 | 20.0 | 3.0 | 0.0 | 1.0 | 19.0 | 0.0 | 0.0 | 0.0 | 45.0 | 374.0 |
Inactive | 27.0 | 8.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 12.0 | 50.0 | |
Total | 313.0 | 28.0 | 3.0 | 0.0 | 1.0 | 22.0 | 0.0 | 0.0 | 0.0 | 57.0 | 424.0 | |
Albany ALBANY 001003 | Active | 468.0 | 32.0 | 9.0 | 3.0 | 1.0 | 19.0 | 0.0 | 0.0 | 0.0 | 81.0 | 613.0 |
Inactive | 56.0 | 5.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 19.0 | 82.0 | |
Total | 524.0 | 37.0 | 9.0 | 3.0 | 1.0 | 21.0 | 0.0 | 0.0 | 0.0 | 100.0 | 695.0 | |
Albany ALBANY 001004 | Active | 449.0 | 25.0 | 6.0 | 1.0 | 4.0 | 15.0 | 1.0 | 0.0 | 1.0 | 74.0 | 576.0 |
Inactive | 64.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 13.0 | 82.0 | |
Total | 513.0 | 25.0 | 7.0 | 1.0 | 4.0 | 19.0 | 1.0 | 0.0 | 1.0 | 87.0 | 658.0 | |
Albany ALBANY 001005 | Active | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
Total | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | |
Albany ALBANY 001006 | Active | 462.0 | 11.0 | 5.0 | 2.0 | 4.0 | 14.0 | 0.0 | 0.0 | 1.0 | 57.0 | 556.0 |
Inactive | 63.0 | 3.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 1.0 | 12.0 | 84.0 | |
Total | 525.0 | 14.0 | 5.0 | 2.0 | 4.0 | 19.0 | 0.0 | 0.0 | 2.0 | 69.0 | 640.0 | |
Albany ALBANY 001007 | Active | 435.0 | 19.0 | 1.0 | 0.0 | 2.0 | 19.0 | 0.0 | 0.0 | 0.0 | 62.0 | 538.0 |
Inactive | 60.0 | 3.0 | 0.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 7.0 | 75.0 | |
Total | 495.0 | 22.0 | 1.0 | 0.0 | 4.0 | 22.0 | 0.0 | 0.0 | 0.0 | 69.0 | 613.0 | |
Albany ALBANY 001008 | Active | 160.0 | 7.0 | 1.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 37.0 | 210.0 |
Page 1 of 46 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
You can also query New York State's ArcGIS REST services with the ArcGIS API for Python. For example, to find the most valuable parcels in Albany County:
from arcgis.features import FeatureLayer

lyr_url = 'https://gisservices.its.ny.gov/arcgis/rest/services/NYS_Tax_Parcel_Centroid_Points/MapServer/0'
layer = FeatureLayer(lyr_url)
query_result1 = layer.query(where="COUNTY_NAME='Albany' AND FULL_MARKET_VAL > 100000000",
                            out_fields='PARCEL_ADDR,CITYTOWN_NAME,FULL_MARKET_VAL,OWNER_TYPE',
                            out_sr='4326')
df = query_result1.sdf.sort_values(by='FULL_MARKET_VAL', ascending=False)
df['Full Market Value'] = df['FULL_MARKET_VAL'].map('${:,.0f}'.format)
df
OBJECTID | PARCEL_ADDR | CITYTOWN_NAME | FULL_MARKET_VAL | OWNER_TYPE | SHAPE | Full Market Value | |
---|---|---|---|---|---|---|---|
11 | 26652 | 64 Eagle St | Albany | 1204254925 | 2 | {"x": -73.75980312511581, "y": 42.650469918250… | $1,204,254,925 |
3 | 9150 | 1200 Washington Ave | Albany | 886298715 | 2 | {"x": -73.81092293494828, "y": 42.679257168282… | $886,298,715 |
4 | 10208 | 1400 Washington Ave | Albany | 642398287 | 2 | {"x": -73.82369286130952, "y": 42.685845700657… | $642,398,287 |
0 | 885 | 251 Fuller Rd | Albany | 440042827 | 2 | {"x": -73.83559002316825, "y": 42.690208093507… | $440,042,827 |
5 | 18164 | 632 New Scotland Ave | Albany | 377568201 | 8 | {"x": -73.80381341626146, "y": 42.655758957669… | $377,568,201 |
1 | 906 | 141 Fuller Rd | Albany | 321199143 | 2 | {"x": -73.83323986150171, "y": 42.693189748928… | $321,199,143 |
19 | 108087 | See Card 1067 | Watervliet | 280898876 | 1 | {"x": -73.70670724174552, "y": 42.719628647232… | $280,898,876 |
15 | 65380 | 737 Alb Shaker Rd | Colonie | 263916100 | 3 | {"x": -73.80365248218001, "y": 42.747956678125… | $263,916,100 |
9 | 21923 | 304 Madison Ave | Albany | 234265418 | 2 | {"x": -73.76227373289564, "y": 42.648000674457… | $234,265,418 |
2 | 907 | 201 Fuller Rd | Albany | 203426124 | 2 | {"x": -73.83362605353057, "y": 42.692609131686… | $203,426,124 |
16 | 69999 | 515 Loudon Rd | Colonie | 166065600 | 8 | {"x": -73.74958475282632, "y": 42.719321807666… | $166,065,600 |
7 | 20592 | 47 New Scotland Ave | Albany | 162276338 | 8 | {"x": -73.77597163421673, "y": 42.653565689693… | $162,276,338 |
6 | 20574 | 132 S Lake Ave | Albany | 146296360 | 2 | {"x": -73.77970918544908, "y": 42.654390366929… | $146,296,360 |
8 | 20597 | 113 Holland Ave | Albany | 143498501 | 2 | {"x": -73.77306688593143, "y": 42.650762742870… | $143,498,501 |
17 | 78203 | Mannsville | Colonie | 142570400 | 1 | {"x": -73.71245452369443, "y": 42.718124477080… | $142,570,400 |
18 | 95509 | 1 Crossgates Mall Rd | Guilderland | 130554700 | 8 | {"x": -73.84702700595471, "y": 42.687699053797… | $130,554,700 |
10 | 24521 | 86 S Swan St | Albany | 128436403 | 2 | {"x": -73.75980563770365, "y": 42.653931892804… | $128,436,403 |
13 | 46883 | 1916 US 9W | Coeymans | 110000000 | 8 | {"x": -73.83388475575597, "y": 42.488730743021… | $110,000,000 |
12 | 35152 | 380 River Rd | Bethlehem | 105263158 | 8 | {"x": -73.76445503554325, "y": 42.595925419330… | $105,263,158 |
14 | 65097 | 15 Wolf Rd | Colonie | 101967213 | 8 | {"x": -73.81423716588279, "y": 42.709939498581… | $101,967,213 |