Working PANDAS and American Community Survey Summary File

Want to be able to work with American Community Survey data offline using your own local copy of the ACS 5-year Summary File? It’s pretty easy to do with PANDAS. If you are planning a lot of Census queries, this can be a very fast way to extract data.

Before you can use this script, you will need to download some data:

Sequence Table Lookup File (Data Headers): ACS_5yr_Seq_Table_Number_Lookup.xlsx
Geography Table Lookup File (Links Logical Record Number to Geography): 5_year_Mini_Geo.xlsx
Summary File (Actual Data): 5 Year Sequence by State Summary File – Each state has two files, one with all geographies except Blockgroups, and one with just Census Tracts and Blockgroups

01import pandas as pd
02 
03path = '/home/andy/Desktop/acs-summary-file/'
04 
05# list of geography
06geo = pd.read_excel(path+'5_year_Mini_Geo.xlsx', sheet_name='ny',index_col='Logical Record Number')
07 
08# load headers
09header = pd.read_excel(path+'ACS_5yr_Seq_Table_Number_Lookup.xlsx')
10 
11# create a column with census variable headers
12header['COL_NAME'] = header['Table ID'] + '_' + header['Line Number'].apply(lambda a: "{0:.0f}".format(a).zfill(3))
13 
14# segment id, along with ACS year and state
15segId = 135
16year = 2019
17state = 'ny'
18 
19# create a list of headers for segment file
20segHead = ['FILEID','FILETYPE','STUSAB','CHARITER','SEQUENCE','LOGRECNO'] \
21    + header.query('`Sequence Number` == '+str(segId)).dropna(subset=['Line Number'])['COL_NAME'].to_list()
22 
23# read the segment file, including column names above    
24seg = pd.read_csv(path+'e'+str(year)+'5'+state+(str(segId).zfill(4))+'000.txt',header=None, names=segHead, index_col=5)
25 
26# join the segment file to geography using Logical Record number
27seg = geo.join(seg)
28 
29# calculate percentage of households with internet subscriptions -- codes from ACS_5yr_Seq_Table_Number_Lookup.xlsx
30seg['Internet Subscription']=seg['B28011_002']/seg['B28011_001']*100
31 
32# output the percentage of households by county with internet subscriptions
33seg[seg['Geography ID'].str.startswith('050')][['Geography Name','Internet Subscription']]

	Geography Name	Internet Subscription
Logical Record Number
13	Albany County, New York	83.888889
14	Allegany County, New York	76.248050
15	Bronx County, New York	75.917821
16	Broome County, New York	82.222562
17	Cattaraugus County, New York	72.431480
…	…	…
70	Washington County, New York	80.224036
71	Wayne County, New York	81.508715
72	Westchester County, New York	86.371288
73	Wyoming County, New York	78.387887
74	Yates County, New York	75.916583

`1`	`# alternatively you can display human readable columns automatically`

`2`	`seg.rename(dict(zip(header['COL_NAME'],header['Table Title'])),axis=1)`

	State	Geography ID	Geography Name	FILEID	FILETYPE	STUSAB	CHARITER	SEQUENCE	Total:	Has one or more types of computing devices:
Logical Record Number
1	NY	04000US36	New York	ACSSF	201900000.0	ny	0.0	135.0	7343234.0	6581493.0
2	NY	04001US36	New York — Urban	ACSSF	201900000.0	ny	0.0	135.0	6433524.0	5771681.0
3	NY	04043US36	New York — Rural	ACSSF	201900000.0	ny	0.0	135.0	909710.0	809812.0
4	NY	040A0US36	New York — In metropolitan or micropolitan st…	ACSSF	201900000.0	ny	0.0	135.0	7189902.0	6449723.0
5	NY	040C0US36	New York — In metropolitan statistical area	ACSSF	201900000.0	ny	0.0	135.0	6796057.0	6109882.0
…	…	…	…	…	…	…	…	…	…	…
28400	NY	97000US3631920	Yonkers City School District, New York	ACSSF	201900000.0	ny	0.0	135.0	74897.0	65767.0
28401	NY	97000US3631950	York Central School District, New York	ACSSF	201900000.0	ny	0.0	135.0	2116.0	1964.0
28402	NY	97000US3631980	Yorktown Central School District, New York	ACSSF	201900000.0	ny	0.0	135.0	7068.0	6751.0
28403	NY	97000US3632010	Cuba-Rushford Central School District, New York	ACSSF	201900000.0	ny	0.0	135.0	2629.0	2186.0
28404	NY	97000US3699999	Remainder of New York, New York	ACSSF	201900000.0	ny	0.0	135.0	79779.0	75425.0

Too much work or don’t want to download the summary file yourself? You can query the Census API directly using PyPI’s censusdata library from PIP. For infrequent queries where you are online, for those with Internet at home, you would be much better off just querying the API directly.

01import pandas as pd
02import censusdata as cd
03 
04# attributes to load
05cdcol=['B28011_001','B28011_002']
06 
07cdf = cd.download('acs5', 2019,
08           cd.censusgeo([('state', '36'),
09                         ('county','*')]),
10          cdcol)
11 
12 
13# seperate out the geoid and geography name
14geoid=[]
15geoname=[]
16 
17for index in cdf.index.tolist():
18    geopart=''
19    for part in index.geo:
20        geopart = geopart + part[1]
21    geoid.append(geopart)
22    geoname.append(index.name)
23 
24cdf['geoid']=geoid
25cdf['geoname']=geoname
26 
27# calculate percentage with internet subscriptions
28cdf['Internet Subscription']=cdf['B28011_002']/cdf['B28011_001']*100
29 
30# output a similar table as above
31cdf

Learn how to load into PANDAS the PL 94-171 2020 Redistricting Data, a process that is similar but different then ACS data.

Also, calculate the population of an area and it’s average demographics, including areas that don’t have Census demographics such as Election Districts or County Legislative districts.

Working PANDAS and American Community Survey Summary File

Leave a Reply Cancel reply