Python is an interpreted, high-level, general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation.
Last night I went back for a second look at the world of geospatial technology in Python. While Python's ArcGIS and QGIS bindings are widely touted, and are the best way to automate things within ArcGIS or QGIS, you are much better off using the R programming language for quick, low-code GIS tasks outside of those applications.
Python has a lot of advantages for certain things:
It is a good scripting language that is widely supported in applications.
Python is generally a stronger language for building applications to run on web servers.
Both ArcGIS and QGIS have really good Python bindings.
But as a stand-alone platform, the Python geospatial libraries rather suck and are underdeveloped. To be sure, you can make maps in Python, and you can perform various geospatial operations like transformations, raster math and geometric operations. But it takes a lot of work within Python to get nice-looking maps using matplotlib, you don't have access to a wealth of Census shapefiles or Census data at your fingertips, and Python's dot-chaining method isn't necessarily as elegant or readable.
I would argue that the R programming language and RStudio are superior in many ways to working directly with Python:
R programs can use the tigris library, which gives you instant access to Census Bureau TIGER/Line files with a single command, and the results can easily be joined or queried against other data. There is nothing like tigris in Python. If you want County or County Subdivision lines in Python, you have to manually download the shapefile and then load it into GeoPandas, as in the sketch below. I've looked for something like tigris in Python, and it doesn't exist. In my experience, the basis of most maps, at least in the United States, is Census TIGER/Line. Cartopy in Python does have access to the Natural Earth dataset, but that isn't as good as TIGER/Line in the United States.
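For comparison, here is roughly what that manual route looks like in Python. Treat this as a minimal sketch: the URL follows the Census Bureau's usual TIGER/Line naming pattern for 2021, so check the year and path before relying on it.
import io
import zipfile
import requests
import geopandas as gpd
# The manual route that tigris spares you in R: download the TIGER/Line
# county shapefile from the Census Bureau, unzip it, and load it into GeoPandas
url = "https://www2.census.gov/geo/tiger/TIGER2021/COUNTY/tl_2021_us_county.zip"
zipfile.ZipFile(io.BytesIO(requests.get(url).content)).extractall("tl_2021_us_county")
counties = gpd.read_file("tl_2021_us_county/tl_2021_us_county.shp")
# Filter down to New York State (FIPS code 36)
ny = counties[counties['STATEFP'] == '36']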
There are Census libraries in Python, but they aren't nearly as up to date and don't have access to nearly as much Census data or Census TIGER/Line. A lot of the maps you make involve plotting Census data, which requires both the TIGER/Line geography and the raw data. tidycensus joins them together in one command, with no need to download the TIGER/Line separately as in Python.
While you can chain commands in Python and GeoPandas, the chaining mechanism in R is much stronger and more flexible. Often in R you can extract, transform, load and output a map in a single chain of commands using the tidyverse and ggplot.
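To be fair to GeoPandas, you can chain quite a bit in one expression there too; here is a sketch that reuses the counties frame from the download example above.
# Filter, reproject, dissolve and plot in a single chained statement
(counties
 .query("STATEFP == '36'")   # New York counties only
 .to_crs(epsg=26918)         # reproject to UTM zone 18N
 .dissolve(by='STATEFP')     # merge the counties into one state outline
 .plot(edgecolor='black'))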
ggplot2 is vastly superior to matplotlib for making maps. ggplot2 has sensible defaults, and its output is clean and easily theme-able. ggplot2's main limitation is that it is best for simpler, easy-to-read SVG maps, and it can be a bit strict in enforcing how Hadley Wickham thinks a map should be presented. matplotlib is more flexible for overlaying and designing maps. Of course, for complicated maps, it's still best to export the data as a shapefile or geopackage and load it into a full GIS platform like QGIS or ArcGIS.
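As an illustration of the extra work, here is the sort of manual cleanup a basic matplotlib map needs, where ggplot2's defaults would mostly handle it for you. This sketch reuses the ny frame from the earlier example; ALAND is a standard TIGER/Line attribute.
import matplotlib.pyplot as plt
# Hide the lat/long axes and set the title and figure size by hand
fig, ax = plt.subplots(figsize=(6, 6))
ny.plot(ax=ax, column='ALAND', cmap='Greens', legend=True)
ax.set_axis_off()
ax.set_title('New York counties by land area')
fig.savefig('ny_counties.svg', bbox_inches='tight')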
In general, R has a quirky syntax with cute and weird names, but with its more sensible defaults it often gets geospatial work done with less code and effort than the same task in Python with Rasterio or GeoPandas. A lot of complicated extract-transform-load jobs can be done in one line of R code. People say Python has a compact syntax, but it really isn't compact compared to R's geospatial libraries.
I still use Python for some QGIS work, but I don't recommend it for work outside of QGIS or ArcGIS. Python should be seen as a good language for automating processes within QGIS or ArcGIS; the state of Python's geospatial tools is weak once you get away from automating those graphical GIS applications. If you want your work to end up in a graphical GIS program for additional manual tweaking after automating things, then use Python. But if you are processing GIS data from start to finish, your best bet is R.
For some of my projects, I need access to the full set of enrollment reports from the State Board of Elections. While you can download them one at a time, it sure is nice to have all of them for statewide coverage and easy processing. You can grab them all using Beautiful Soup:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

yr = '12'
mon = 'nov'
ext = '.pdf'  # or '.xlsx' for the spreadsheet versions
url = "https://www.elections.ny.gov/20" + yr + "EnrollmentED.html"

# If there is no such folder, the script will create one automatically
folder_location = mon + yr + '-Enrollment'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

# Fetch the index page with a browser-like User-Agent header
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'})
soup = BeautifulSoup(response.text, "html.parser")

# Grab every link whose href ends with, e.g., 'nov12.pdf'
for link in soup.select("a[href$='" + mon + yr + ext + "']"):
    # Name the files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
What can you do with them? If you are looking at the numbers from 2018 and later, you can load them directly into pandas.
import glob as g
import pandas as pd

# Stack every county workbook into one DataFrame; the header sits on the fifth row
df = pd.concat((pd.read_excel(f, header=4) for f in g.glob('/home/andy/2021-Enrollment/*.xlsx')))

# Keep only Active voters, then split 'ELECTION DIST' (e.g. 'ALBANY 001001')
# into municipality, ward and election district columns
df = df[df['STATUS'] == 'Active']
df.insert(1, 'Municipality', df['ELECTION DIST'].str[:-7].str.title())
df.insert(2, 'Ward', df['ELECTION DIST'].str[-6:-3])
df.insert(3, 'ED', df['ELECTION DIST'].str[-3:])
df = df.dropna()
df = df.sort_values(by=['COUNTY', 'ELECTION DIST'])
df
    COUNTY Municipality Ward   ED   ELECTION DIST  STATUS    DEM    REP   CON  WOR   OTH  BLANK  TOTAL
1   Albany       Albany  001  001   ALBANY 001001  Active   74.0    9.0   0.0  0.0   0.0   16.0   99.0
5   Albany       Albany  001  002   ALBANY 001002  Active  311.0   16.0   2.0  1.0  10.0   47.0  387.0
9   Albany       Albany  001  003   ALBANY 001003  Active  472.0   26.0   5.0  1.0  27.0  121.0  652.0
13  Albany       Albany  001  004   ALBANY 001004  Active  437.0   30.0   2.0  3.0  12.0   92.0  576.0
17  Albany       Albany  001  005   ALBANY 001005  Active   13.0    0.0   0.0  0.0   0.0    0.0   13.0
…        …            …    …    …               …       …      …      …     …    …     …      …      …
53   Yates         Milo  000  006     Milo 000006  Active  204.0  409.0  13.0  3.0  50.0  182.0  861.0
57   Yates       Potter  000  001   Potter 000001  Active  144.0  460.0  25.0  1.0  51.0  226.0  907.0
61   Yates      Starkey  000  001  Starkey 000001  Active  187.0  370.0  15.0  5.0  57.0  209.0  843.0
65   Yates      Starkey  000  002  Starkey 000002  Active  189.0  433.0  18.0  4.0  46.0  201.0  891.0
69   Yates       Torrey  000  001   Torrey 000001  Active  185.0  313.0   9.0  3.0  39.0  159.0  708.0
And then you can do all kinds of typical pandas processing. For example, to categorize the counties into three pots based on a comparison of the number of Democrats to Republicans, you could run a command along the lines of the sketch below.
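This is a minimal sketch, assuming a simple DEM-to-REP enrollment ratio: below 1 counts as Rep, above 2 as Dem, and anything in between as Mix. The exact cut points are my assumption, so adjust the bins to taste.
import pandas as pd
# Sum enrollment by county, take the Democrat-to-Republican ratio,
# and bin it into three labeled pots (the cut points are an assumption)
cnty = df.groupby('COUNTY')[['DEM', 'REP']].sum()
cnty['DEM'] = cnty['DEM'] / cnty['REP']
pd.cut(cnty['DEM'], bins=[0, 1, 2, float('inf')], labels=['Rep', 'Mix', 'Dem']).sort_values()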
COUNTY
Rensselaer Rep
Genesee Rep
Oswego Rep
Livingston Rep
Schuyler Rep
Orleans Rep
Seneca Rep
Otsego Rep
Warren Rep
St.Lawrence Rep
Cattaraugus Rep
Chemung Rep
Cayuga Rep
Tioga Rep
Yates Rep
Broome Rep
Franklin Rep
Oneida Rep
Fulton Rep
Allegany Rep
Essex Rep
Niagara Rep
Steuben Rep
Herkimer Rep
Lewis Rep
Delaware Rep
Wyoming Rep
Chenango Rep
Hamilton Rep
Greene Rep
Sullivan Rep
Wayne Rep
Jefferson Rep
Ontario Rep
Chautauqua Rep
Putnam Rep
Madison Rep
Washington Rep
Schoharie Rep
Dutchess Rep
Montgomery Rep
Clinton Rep
Orange Rep
Cortland Rep
Suffolk Rep
Onondaga Rep
Saratoga Rep
Erie Mix
Schenectady Mix
Ulster Mix
Rockland Mix
Columbia Mix
Monroe Mix
Westchester Mix
Richmond Mix
Albany Mix
Tompkins Mix
Nassau Mix
Queens Dem
Bronx Dem
New York Dem
Kings Dem
Name: DEM, dtype: category
Categories (3, object): ['Rep' < 'Mix' < 'Dem']
If you are using the pre-2018 data, I suggest converting the PDFs to text documents using pdftotext -layout, which is available on most Linux distributions as part of the poppler-utils package. This converts the PDF tables to text files, which you can process like this:
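Here is a rough sketch of that processing, assuming the -layout output keeps each election district on one line. The file name and the 'Active' filter are illustrative, and the column layout varies between years, so treat this as a starting point rather than a finished parser.
import subprocess
import pandas as pd
# Convert the PDF to layout-preserving text with poppler's pdftotext
subprocess.run(['pdftotext', '-layout', 'AlbanyED_nov16.pdf', 'AlbanyED_nov16.txt'], check=True)
# Keep only the lines that look like Active-status data rows
rows = []
with open('AlbanyED_nov16.txt') as f:
    for line in f:
        parts = line.split()
        if 'Active' in parts:
            rows.append(parts)
df = pd.DataFrame(rows)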
I also discovered the tabula-py library, which extracts PDF tables directly in Python, as an alternative to converting the PDFs externally with pdftotext -layout from the poppler set of utilities. This saves you from having to install and run a separate command-line program.
import tabula
import pandas as pd

# Parse every page of the county enrollment PDF into DataFrames and stack them
pdf = "/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df = pd.concat(tabula.read_pdf(pdf, pages='all', stream=True))

# Keep only the Active-voter rows
df.query("STATUS=='Active'")
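One thing to keep in mind: tabula-py is a wrapper around the Java-based tabula engine, so it needs a Java runtime installed, which trades one external dependency for another compared to poppler-utils.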