Python is an interpreted, high-level, general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation.
Last night I went back for a second look at the world of geospatial technology in Python. While Python's ArcGIS and QGIS bindings are widely touted, and are the best way to automate things within ArcGIS or QGIS, you are much better off using the R programming language for quick, low-code GIS tasks outside of those applications.
Python has a lot of advantages for certain things:
It is a good scripting language that is widely supported in applications.
Python is generally a stronger language for building applications to run on web servers.
Both ArcGIS and QGIS have really good Python bindings.
But as a stand-alone platform, the Python geospatial libraries rather suck and are underdeveloped. To be sure, you can make maps in Python, and you can perform various geospatial operations like transformations, raster math and geometric operations. But it takes a lot of work within Python to get nice-looking maps using matplotlib, you don't have access to a wealth of Census shapefiles or Census data at your fingertips, and Python's dot-chaining method isn't necessarily as elegant or readable.
I would argue that the R programming language and RStudio are superior in many ways to working directly with Python:
R programs can use the tigris library, which gives you instant access to Census Bureau TIGER/Line files with a single command, and the results can easily be joined or queried against other data. There is nothing like tigris in Python. If you want County or County Subdivision lines in Python, you have to manually download the shapefile and then load it into GeoPandas, as in the sketch below. I've looked for something like tigris in Python, and it doesn't exist. In my experience, the basis of most maps, at least in the United States, is Census TIGER/Line. Cartopy in Python does have access to the Natural Earth dataset, but that isn't as good as TIGER/Line in the United States.
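For comparison, here is roughly what that manual route looks like in Python. Treat this as a minimal sketch: the URL follows the Census Bureau's usual TIGER/Line naming pattern for 2021, so check the year and path before relying on it.
import io
import zipfile
import requests
import geopandas as gpd
# The manual route that tigris spares you in R: download the TIGER/Line
# county shapefile from the Census Bureau, unzip it, and load it into GeoPandas
url = "https://www2.census.gov/geo/tiger/TIGER2021/COUNTY/tl_2021_us_county.zip"
zipfile.ZipFile(io.BytesIO(requests.get(url).content)).extractall("tl_2021_us_county")
counties = gpd.read_file("tl_2021_us_county/tl_2021_us_county.shp")
# Filter down to New York State (FIPS code 36)
ny = counties[counties['STATEFP'] == '36']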
There are Census libraries in Python, but they aren't nearly as up to date and don't have access to nearly as much Census data or Census TIGER/Line. A lot of the maps you make involve plotting Census data, which requires both the TIGER/Line geography and the raw data. tidycensus joins them together in one command, with no need to download the TIGER/Line separately as in Python.
While you can chain commands in Python and GeoPandas, the chaining mechanism in R is much stronger and more flexible. Often in R you can extract, transform, load and output a map in a single chain of commands using the tidyverse and ggplot.
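To be fair to GeoPandas, you can chain quite a bit in one expression there too; here is a sketch that reuses the counties frame from the download example above.
# Filter, reproject, dissolve and plot in a single chained statement
(counties
 .query("STATEFP == '36'")   # New York counties only
 .to_crs(epsg=26918)         # reproject to UTM zone 18N
 .dissolve(by='STATEFP')     # merge the counties into one state outline
 .plot(edgecolor='black'))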
ggplot2 is vastly superior to matplotlib for making maps. ggplot2 has sensible defaults, and its output is clean and easily theme-able. ggplot2's main limitation is that it is best for simpler, easy-to-read SVG maps, and it can be a bit strict in enforcing how Hadley Wickham thinks a map should be presented. matplotlib is more flexible for overlaying and designing maps. Of course, for complicated maps, it's still best to export the data as a shapefile or geopackage and load it into a full GIS platform like QGIS or ArcGIS.
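As an illustration of the extra work, here is the sort of manual cleanup a basic matplotlib map needs, where ggplot2's defaults would mostly handle it for you. This sketch reuses the ny frame from the earlier example; ALAND is a standard TIGER/Line attribute.
import matplotlib.pyplot as plt
# Hide the lat/long axes and set the title and figure size by hand
fig, ax = plt.subplots(figsize=(6, 6))
ny.plot(ax=ax, column='ALAND', cmap='Greens', legend=True)
ax.set_axis_off()
ax.set_title('New York counties by land area')
fig.savefig('ny_counties.svg', bbox_inches='tight')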
In general, R has a quirky syntax with cute and weird names, but with its more sensible defaults it often gets geospatial work done with less code and effort than the same task in Python with Rasterio or GeoPandas. A lot of complicated extract-transform-load jobs can be done in one line of R code. People say Python has a compact syntax, but it really isn't compact compared to R's geospatial libraries.
I still use Python for some QGIS work, but I don't recommend it for work outside of QGIS or ArcGIS. Python should be seen as a good language for automating processes within QGIS or ArcGIS; the state of Python's geospatial tools is weak once you get away from automating those graphical GIS applications. If you want your work to end up in a graphical GIS program for additional manual tweaking after automating things, then use Python. But if you are processing GIS data from start to finish, your best bet is R.
For some of my projects, I need access to the full set of enrollment reports from the State Board of Elections. While you can download them one at a time, it sure is nice to have all of them for statewide coverage and easy processing. You can grab them all using Beautiful Soup:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

yr = '12'
mon = 'nov'
ext = '.pdf'  # or '.xlsx' for the spreadsheet versions
url = "https://www.elections.ny.gov/20" + yr + "EnrollmentED.html"

# If there is no such folder, the script will create one automatically
folder_location = mon + yr + '-Enrollment'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

# Fetch the index page with a browser-like User-Agent header
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'})
soup = BeautifulSoup(response.text, "html.parser")

# Grab every link whose href ends with, e.g., 'nov12.pdf'
for link in soup.select("a[href$='" + mon + yr + ext + "']"):
    # Name the files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
What can you do with them? If you are looking at the numbers from 2018 and later, you can load them directly into pandas.
import glob as g
import pandas as pd

# Stack every county workbook into one DataFrame; the header sits on the fifth row
df = pd.concat((pd.read_excel(f, header=4) for f in g.glob('/home/andy/2021-Enrollment/*.xlsx')))

# Keep only Active voters, then split 'ELECTION DIST' (e.g. 'ALBANY 001001')
# into municipality, ward and election district columns
df = df[df['STATUS'] == 'Active']
df.insert(1, 'Municipality', df['ELECTION DIST'].str[:-7].str.title())
df.insert(2, 'Ward', df['ELECTION DIST'].str[-6:-3])
df.insert(3, 'ED', df['ELECTION DIST'].str[-3:])
df = df.dropna()
df = df.sort_values(by=['COUNTY', 'ELECTION DIST'])
df
    COUNTY Municipality Ward   ED   ELECTION DIST  STATUS    DEM    REP   CON  WOR   OTH  BLANK  TOTAL
1   Albany       Albany  001  001   ALBANY 001001  Active   74.0    9.0   0.0  0.0   0.0   16.0   99.0
5   Albany       Albany  001  002   ALBANY 001002  Active  311.0   16.0   2.0  1.0  10.0   47.0  387.0
9   Albany       Albany  001  003   ALBANY 001003  Active  472.0   26.0   5.0  1.0  27.0  121.0  652.0
13  Albany       Albany  001  004   ALBANY 001004  Active  437.0   30.0   2.0  3.0  12.0   92.0  576.0
17  Albany       Albany  001  005   ALBANY 001005  Active   13.0    0.0   0.0  0.0   0.0    0.0   13.0
…        …            …    …    …               …       …      …      …     …    …     …      …      …
53   Yates         Milo  000  006     Milo 000006  Active  204.0  409.0  13.0  3.0  50.0  182.0  861.0
57   Yates       Potter  000  001   Potter 000001  Active  144.0  460.0  25.0  1.0  51.0  226.0  907.0
61   Yates      Starkey  000  001  Starkey 000001  Active  187.0  370.0  15.0  5.0  57.0  209.0  843.0
65   Yates      Starkey  000  002  Starkey 000002  Active  189.0  433.0  18.0  4.0  46.0  201.0  891.0
69   Yates       Torrey  000  001   Torrey 000001  Active  185.0  313.0   9.0  3.0  39.0  159.0  708.0
And then you can do all kinds of typical pandas processing. For example, to categorize the counties into three pots based on a comparison of the number of Democrats to Republicans, you could run a command along the lines of the sketch below.
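This is a minimal sketch, assuming a simple DEM-to-REP enrollment ratio: below 1 counts as Rep, above 2 as Dem, and anything in between as Mix. The exact cut points are my assumption, so adjust the bins to taste.
import pandas as pd
# Sum enrollment by county, take the Democrat-to-Republican ratio,
# and bin it into three labeled pots (the cut points are an assumption)
cnty = df.groupby('COUNTY')[['DEM', 'REP']].sum()
cnty['DEM'] = cnty['DEM'] / cnty['REP']
pd.cut(cnty['DEM'], bins=[0, 1, 2, float('inf')], labels=['Rep', 'Mix', 'Dem']).sort_values()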
COUNTY
Rensselaer Rep
Genesee Rep
Oswego Rep
Livingston Rep
Schuyler Rep
Orleans Rep
Seneca Rep
Otsego Rep
Warren Rep
St.Lawrence Rep
Cattaraugus Rep
Chemung Rep
Cayuga Rep
Tioga Rep
Yates Rep
Broome Rep
Franklin Rep
Oneida Rep
Fulton Rep
Allegany Rep
Essex Rep
Niagara Rep
Steuben Rep
Herkimer Rep
Lewis Rep
Delaware Rep
Wyoming Rep
Chenango Rep
Hamilton Rep
Greene Rep
Sullivan Rep
Wayne Rep
Jefferson Rep
Ontario Rep
Chautauqua Rep
Putnam Rep
Madison Rep
Washington Rep
Schoharie Rep
Dutchess Rep
Montgomery Rep
Clinton Rep
Orange Rep
Cortland Rep
Suffolk Rep
Onondaga Rep
Saratoga Rep
Erie Mix
Schenectady Mix
Ulster Mix
Rockland Mix
Columbia Mix
Monroe Mix
Westchester Mix
Richmond Mix
Albany Mix
Tompkins Mix
Nassau Mix
Queens Dem
Bronx Dem
New York Dem
Kings Dem
Name: DEM, dtype: category
Categories (3, object): ['Rep' < 'Mix' < 'Dem']
If you are using the pre-2018 data, I suggest converting the PDFs to text documents using pdftotext -layout, which is available on most Linux distributions as part of the poppler-utils package. This converts the PDF tables to text files, which you can process like this:
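Here is a rough sketch of that processing, assuming the -layout output keeps each election district on one line. The file name and the 'Active' filter are illustrative, and the column layout varies between years, so treat this as a starting point rather than a finished parser.
import subprocess
import pandas as pd
# Convert the PDF to layout-preserving text with poppler's pdftotext
subprocess.run(['pdftotext', '-layout', 'AlbanyED_nov16.pdf', 'AlbanyED_nov16.txt'], check=True)
# Keep only the lines that look like Active-status data rows
rows = []
with open('AlbanyED_nov16.txt') as f:
    for line in f:
        parts = line.split()
        if 'Active' in parts:
            rows.append(parts)
df = pd.DataFrame(rows)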
I also discovered the tabula-py library, which extracts PDF tables directly in Python, as an alternative to converting the PDFs externally with pdftotext -layout from the poppler set of utilities. This saves you from having to install and run a separate command-line program.
import tabula
import pandas as pd

# Parse every page of the county enrollment PDF into DataFrames and stack them
pdf = "/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df = pd.concat(tabula.read_pdf(pdf, pages='all', stream=True))

# Keep only the Active-voter rows
df.query("STATUS=='Active'")
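One thing to keep in mind: tabula-py is a wrapper around the Java-based tabula engine, so it needs a Java runtime installed, which trades one external dependency for another compared to poppler-utils.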