US Census

Maps that look at the US Census from a macro perspective, across all counties in the United States.

Here is a list of the ten most Hispanic counties in New York State from the 2020 US Census.

County         Percent Hispanic
Bronx          54.76
Queens         27.76
Westchester    26.81
New York       23.77
Orange         22.36
Suffolk        21.82
Rockland       19.64
Richmond       19.56
Kings          18.87
Nassau         18.37

Here is how you can create this list using PANDAS. You will need to get the P.L. 94-171 redistricting data and the legacy file format header records, then expand the ZIP file and place the contents in the directory described below.

import pandas as pd

# path where 2020_PLSummaryFile_FieldNames.xlsx and the state files
# xxgeo2020.pl and xx000012020.pl through xx000032020.pl reside
# on your hard drive (xx = two-letter state code)
path='/home/andy/Desktop/2020pl-94-171/'

# state code
state='ny'

# header file; open all tabs as a dictionary of DataFrames
field_names=pd.read_excel(path+'2020_PLSummaryFile_FieldNames.xlsx', sheet_name=None)

# load the geoheader, forcing str dtype to avoid mixed types on
# certain fields and to keep GEOIDs (with leading zeros) intact
gh=pd.read_csv( path+state+'geo2020.pl',delimiter='|',
               header=None, 
               names=field_names['2020 P.L. Geoheader Fields'].columns,
               index_col='LOGRECNO',
               dtype=str )
               
# load segment 1 of the 2020 P.L. 94-171 file, which contains the race tables
segNum=1
seg=pd.read_csv( path+state+'0000'+str(segNum)+'2020.pl',delimiter='|',
               header=None, 
               names=field_names['2020 P.L. Segment '+str(segNum)+' Fields'].columns,
               index_col='LOGRECNO',
              )
# discard FILEID, STUSAB, CHARITER, CIFSN as duplicative after join
seg=seg.iloc[:,4:]

# join seg to geoheader
seg=gh.join(seg)

# filter to New York counties using the county summary level,
# SUMLEV == '050' (see the Census technical documentation)
ql="SUMLEV=='050'"

# create a DataFrame with the county name and Percent Hispanic;
# the field list is in 2020_PLSummaryFile_FieldNames.xlsx
# under the 2020 P.L. Segment 1 Definitions tab
his=pd.DataFrame({ 'County': seg.query(ql)['BASENAME'], 
              'Percent Hispanic': seg.query(ql)['P0020002'] / seg.query(ql)['P0020001'] *100})

# sort and save the ten most Hispanic counties to CSV
his.sort_values(by="Percent Hispanic", ascending=False).head(10).to_csv('/tmp/hispanics.csv')

Population Maths

2020 Population Maths!

These require Python, PANDAS, and GeoPandas. You will also need the P.L. 94-171 redistricting files, specifically the 2020 TIGER/Line shapefiles and nygeo2020.pl, which comes in a ZIP file. That geoheader file contains the population, households, and area from the 2020 census, among other things, for all census summary levels. It's really handy to have.
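
Before running the scripts below, it can help to peek at what the geoheader actually holds. Here is a minimal sketch, assuming the pipe-delimited layout the scripts below rely on, where column 2 is the summary level and column 90 is the 2020 population:

import pandas as pd

# the geoheader is pipe-delimited with no header row; low_memory=False
# avoids chunked type guessing leaving columns with mixed types
gh = pd.read_csv('/home/andy/Desktop/nygeo2020.pl', delimiter='|',
                 header=None, low_memory=False)

# rows per summary level -- 40 state, 50 county, 150 blockgroup,
# 750 tabulation block, among others
print(gh[2].value_counts())

# 2020 census population of New York State (summary level 40)
print(gh[gh[2] == 40][90])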

This document is very helpful in understanding the Census files when you load them into PANDAS: 2020 Census State (P.L. 94-171) Redistricting Summary File Technical Documentation.

For all of these scripts, you will need to adjust the path variables to the actual locations on your computer where the files are saved. The overlay shapefile can be anything, but you will need to update catField to match the field in the shapefile that you want to calculate population by.

Population of an Area

The code below calculates the population of an overlay layer, in this case an overlay shapefile with a series of rings extending out from the NYS Capitol. As this covers a large area, we sum at the blockgroup level, then take the cumulative sum going out each ring.

import pandas as pd
import geopandas as gpd

# path to overlay shapefile
overlayshp = r'/tmp/dis_to_albany.gpkg'

# summary level -- 750 is tabulation block, 150 is blockgroup
# for large areas (over about 50 miles across) blockgroups are much faster
summaryLevel = 150
#summaryLevel = 750

# path to block or blockgroup file
if summaryLevel == 150:
    blockshp = r'/home/andy/Documents/GIS.Data/census.tiger/36_New_York/tl_2020_36_bg20.shp.gpkg'
else:
    blockshp = r'/home/andy/Documents/GIS.Data/census.tiger/36_New_York/tl_2020_36_tabblock20.shp.gpkg'

# path to PL 94-171 redistricting geoheader file
pl94171File = '/home/andy/Desktop/nygeo2020.pl'

# field to categorize on (such as Ward -- required!)
catField = 'Name'

# the geoheader contains the 2020 census population in column 90
# per the P.L. 94-171 documentation; low_memory=False disables
# chunked type guessing, which otherwise leaves columns with
# mixed types; read column 9 (the GEOID) as str to match GEOID20
df=pd.read_csv(pl94171File, delimiter='|', header=None,
               low_memory=False, dtype={9: str})

# column 2 is summary level 
population=df[(df.iloc[:,2] == summaryLevel)][[9,90]]

# load overlay
overlay = gpd.read_file(overlayshp).to_crs(epsg=3857)

# shapefile of NYS 2020 blocks; IMPORTANT (!) mask by the overlay for speed
blocks = gpd.read_file(blockshp, mask=overlay).to_crs(epsg=3857)

# geoid for linking to shapefile is column 9
joinedBlocks=blocks.set_index('GEOID20').join(population.set_index(9))

# store the size of unbroken blocks
# in case overlay lines break blocks into two
joinedBlocks['area']=joinedBlocks.area

# run union
unionBlocks=gpd.overlay(overlay, joinedBlocks, how='union')

# drop blocks outside of overlay
unionBlocks=unionBlocks.dropna(subset=[catField])

# apportion population when a block crosses an overlay line --
# this avoids double counting, though it isn't perfect: we lose
# about 0.15 percent to slivers and floating point error
unionBlocks['sublock']=unionBlocks[90]*(unionBlocks.area/unionBlocks['area'])

# sum blocks in category
unionBlocks=pd.DataFrame(unionBlocks.groupby(catField)['sublock'].sum())

# rename columns
unionBlocks=unionBlocks.rename({'sublock': '2020 Census Population'},axis=1)

# calculate cumulative sum as you go out each ring
unionBlocks['millions']=unionBlocks.cumsum(axis=0)['2020 Census Population']/1000000

# each ring is 50 miles wide; this assumes the ring Name values
# are numeric indices (1, 2, 3, ...)
unionBlocks['miles']=unionBlocks.index*50

# output the result
print(unionBlocks)


Redistricting / Discrepancy from Ideal Districts

This is a variant of the above script, calculating the deviation in population from an ideal district. As this covers a small area, we use data from the block level. See below and the comments.

import pandas as pd
import geopandas as gpd

# path to overlay shapefile
overlayshp = r'/home/andy/Documents/GIS.Data/election.districts/albany wards 2015.gpkg'

# summary level -- 750 is tabulation block, 150 is blockgroup
# for large areas (over about 50 miles across) blockgroups are much faster
#summaryLevel = 150
summaryLevel = 750

# path to block or blockgroup file
if summaryLevel == 150:
    blockshp = r'/home/andy/Documents/GIS.Data/census.tiger/36_New_York/tl_2020_36_bg20.shp.gpkg'
else:
    blockshp = r'/home/andy/Documents/GIS.Data/census.tiger/36_New_York/tl_2020_36_tabblock20.shp.gpkg'

# path to PL 94-171 redistricting geoheader file
pl94171File = '/home/andy/Desktop/nygeo2020.pl'

# field to categorize on (such as Ward -- required!)
catField = 'Ward'

# the geoheader contains the 2020 census population in column 90
# per the P.L. 94-171 documentation; low_memory=False disables
# chunked type guessing, which otherwise leaves columns with
# mixed types; read column 9 (the GEOID) as str to match GEOID20
df=pd.read_csv(pl94171File, delimiter='|', header=None,
               low_memory=False, dtype={9: str})

# column 2 is summary level 
population=df[(df.iloc[:,2] == summaryLevel)][[9,90]]

# load overlay
overlay = gpd.read_file(overlayshp).to_crs(epsg=3857)

# shapefile of NYS 2020 blocks; IMPORTANT (!) mask by the overlay for speed
blocks = gpd.read_file(blockshp, mask=overlay).to_crs(epsg=3857)

# geoid for linking to shapefile is column 9
joinedBlocks=blocks.set_index('GEOID20').join(population.set_index(9))

# store the size of unbroken blocks
# in case overlay lines break blocks into two
joinedBlocks['area']=joinedBlocks.area

# run union
unionBlocks=gpd.overlay(overlay, joinedBlocks, how='union')

# drop blocks outside of overlay
unionBlocks=unionBlocks.dropna(subset=[catField])

# apportion population when a block crosses an overlay line --
# this avoids double counting, though it isn't perfect: we lose
# about 0.15 percent to slivers and floating point error
unionBlocks['sublock']=unionBlocks[90]*(unionBlocks.area/unionBlocks['area'])

# sum blocks in category
unionBlocks=pd.DataFrame(unionBlocks.groupby(catField)['sublock'].sum())

# rename columns
unionBlocks=unionBlocks.rename({'sublock': '2020 Census Population'},axis=1)

# calculate the ideal ward population based on 15 wards and
# Albany's 2020 population of 99,224 (99,224 / 15 ≈ 6,615)
unionBlocks['Ideal']=99224/15

# calculate departure from ideal
unionBlocks['Departure']=unionBlocks['2020 Census Population']-unionBlocks['Ideal']

# calculate percent departure, conventionally expressed
# relative to the ideal district population
unionBlocks['Percent Departure']=unionBlocks['Departure']/unionBlocks['Ideal']*100

# output the result
print(unionBlocks)

NPR

The 2020 Census Data For Voting Districts Will Be Available Aug. 12 : NPR

After months of delays, the 2020 census results used to redraw voting districts around the country will finally be released on Aug. 12, the U.S. Census Bureau said Thursday.

In a tweet, the federal government's largest statistical agency confirmed that the detailed demographic data will be posted on its website four days sooner than Aug. 16, the previously announced deadline the bureau had agreed to meet as part of a lawsuit by Ohio over the data's release schedule.

The coronavirus pandemic and interference by the administration of former President Donald Trump have forced the bureau to put out new redistricting data about five months later than its original schedule in order to run more quality checks.

Census

Now that I’ve learned all of the advantages of data processing with Python, I think my next adventure is going to be figuring out how to query the various Census databases directly, rather than downloading individual ZIP files and then editing and processing them in LibreOffice Calc. I’d much rather automate as much of this as possible so I can do new concepts for the blog, importing and mapping even more data.

It sure would be nice to be able to get full census tables for a county with a few lines of code and have them dumped into a CSV file for a chart or to link to a map.
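
Something like the sketch below does exactly that, using the Census Bureau's public data API at api.census.gov to recreate the county list from the top of this post. The endpoint and the variable names P2_001N (total population) and P2_002N (Hispanic or Latino) are my reading of the 2020 decennial P.L. 94-171 API; verify them against the variable list the API publishes before relying on them.

import pandas as pd
import requests

# 2020 decennial P.L. 94-171 endpoint on the public Census API
# (assumed variables: P2_001N = total, P2_002N = Hispanic or Latino)
url = 'https://api.census.gov/data/2020/dec/pl'
params = { 'get': 'NAME,P2_001N,P2_002N',
           'for': 'county:*',
           'in': 'state:36' }   # 36 = New York

# the API returns JSON: one header row, then the data rows
rows = requests.get(url, params=params).json()
df = pd.DataFrame(rows[1:], columns=rows[0])

# everything comes back as strings, so cast before doing maths
df[['P2_001N','P2_002N']] = df[['P2_001N','P2_002N']].astype(int)
df['Percent Hispanic'] = df['P2_002N'] / df['P2_001N'] * 100

# dump the ten most Hispanic counties to CSV, like the script above
df.sort_values('Percent Hispanic', ascending=False).head(10) \
    .to_csv('/tmp/hispanics_api.csv', index=False)

No ZIP files, no LibreOffice Calc, and swapping tables or geographies is a one-line change to params.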

Percent Of Houses Lacking Complete Plumbing Facilities

This map shows the percentage of housing units in each US county lacking complete plumbing facilities. This includes both occupied and unoccupied housing units, e.g. seasonal cabins. The US Census defines complete plumbing facilities as hot and cold running water, a flush toilet, and a bathtub or shower; a sink with a faucet, a stove or range, and a refrigerator belong to the separate definition of complete kitchen facilities. Rural counties with a lot of cabins are the most likely to have a large share of housing units without complete plumbing facilities.

Data Source: Complete Plumbing Facilities, 2011-2015 American Community Survey 5-Year Estimates. https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t
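
The same API approach sketched in the Census section above should also work for ACS tables like this one. A minimal sketch, assuming table B25047 (Plumbing Facilities for All Housing Units) with B25047_001E as total housing units and B25047_003E as units lacking complete plumbing; check those variable names against the ACS 5-year variable list before trusting the output.

import pandas as pd
import requests

# 2011-2015 ACS 5-year estimates endpoint on the public Census API
# (assumed variables: B25047_001E = total housing units,
#  B25047_003E = units lacking complete plumbing facilities)
url = 'https://api.census.gov/data/2015/acs/acs5'
params = { 'get': 'NAME,B25047_001E,B25047_003E',
           'for': 'county:*' }

rows = requests.get(url, params=params).json()
df = pd.DataFrame(rows[1:], columns=rows[0])

# cast the string counts to int and compute the percentage mapped above
df[['B25047_001E','B25047_003E']] = df[['B25047_001E','B25047_003E']].astype(int)
df['Percent Lacking Plumbing'] = df['B25047_003E'] / df['B25047_001E'] * 100

df.sort_values('Percent Lacking Plumbing', ascending=False) \
    .to_csv('/tmp/plumbing.csv', index=False)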

NPR

Biden Revives An Effort To Change How The U.S. Census Asks About Race : NPR

President Biden's White House is reviving a previously stalled review of proposed policy changes that could allow the Census Bureau to ask about people's race and ethnicity in a radical new way in time for the 2030 head count, NPR has learned.

First proposed in 2016, the recommendations lost steam during former President Donald Trump's administration despite years of research by the bureau that suggested a new question format would improve the accuracy of 2020 census data about Latinos and people with roots in the Middle East or North Africa.

The proposals also appear to have received the backing of other federal government experts on data about race and ethnicity, based on a redacted document that NPR obtained through a Freedom of Information Act request. The document lists headings for redacted descriptions of the group's "recommended improvements," including "Improve data quality: Allow flexibility in question format for self-reported race and ethnicity."