Acquiring U.S. census data with Python and cenpy

Several online sources provide access to census data, both from the U.S. Census Bureau itself (American FactFinder) and from outside organizations. These sources, however, are not well suited to large-scale data acquisition and analysis. The Cenpy Python package allows programmatic access to this data through the Census Bureau’s API.

This tutorial outlines how to use the Cenpy package to search for and acquire specific census data. Cenpy returns this data as a pandas DataFrame, which allows for easy access and analysis within Python. To visualize the data, look into the GeoPandas package, which builds on pandas to add tools for geospatial data analysis.

Objectives

Install Cenpy package
Search for desired census data
Download and store data

Dependencies

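The Cenpy package depends on pandas and requests; this tutorial also installs pysal, which supplies the geometry objects returned with the TIGER data later on. Ensure that Python and pip are already properly installed, then use the following commands to install cenpy.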

!pip install pandas
!pip install requests
!pip install cenpy
!pip install pysal

import pandas as pd
import cenpy as cen
import pysal

Finding Data

The cenpy explorer module lets you view all of the available U.S. Census Bureau APIs.

datasets = list(cen.explorer.available(verbose=True).items())

# print first rows of the dataframe containing datasets
pd.DataFrame(datasets).head()

	0	1
0	2012acs3	2012 American Community Survey: 3-Year Estimates
1	NONEMP2013	2013 Nonemployer Statistics: Non Employer Stat...
2	BDSFirms	Time Series Business Dynamics Statistics: Firm...
3	POPESTprmagesex2013	Vintage 2013 Population Estimates: Puerto Rico...
4	POPESTcty2013	Vintage 2013 Population Estimates: County Tota...
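Because the catalog holds many entries, it can help to filter it by keyword rather than scrolling. The following is a minimal sketch, assuming the catalog behaves like a mapping from dataset ID to title (as the .items() call above suggests). find_datasets is a hypothetical helper, not part of cenpy, and sample_catalog simply reuses the five rows printed above.

```python
def find_datasets(catalog, keyword):
    """Return sorted (id, title) pairs whose ID or title contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return sorted(
        (ds_id, title)
        for ds_id, title in catalog.items()
        if needle in ds_id.lower() or needle in title.lower()
    )


# Stand-in for the mapping returned by cen.explorer.available(verbose=True);
# titles are truncated exactly as in the printed output above.
sample_catalog = {
    "2012acs3": "2012 American Community Survey: 3-Year Estimates",
    "NONEMP2013": "2013 Nonemployer Statistics: Non Employer Stat...",
    "BDSFirms": "Time Series Business Dynamics Statistics: Firm...",
    "POPESTprmagesex2013": "Vintage 2013 Population Estimates: Puerto Rico...",
    "POPESTcty2013": "Vintage 2013 Population Estimates: County Tota...",
}

matches = find_datasets(sample_catalog, "population estimates")
for ds_id, title in matches:
    print(ds_id, "->", title)
```

With a live connection, the same helper could be passed the real mapping in place of sample_catalog.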

Passing the name of a specific API to explorer.explain() will give a description of the data available. For this example, we will use the 2012 American Community Survey 1-year data (2012acs1).

dataset = '2012acs1'
cen.explorer.explain(dataset)

{'2012 American Community Survey: 1-Year Estimates': "The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years.  Questionnaires are mailed to a sample of addresses to obtain information about households -- that is, about each person and the housing unit itself.  The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. It produces estimates for small areas, including census tracts and population subgroups.  Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties.  For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units."}

The base module allows you to establish a connection with the desired API that will be used later to acquire data.

con = cen.base.Connection(dataset)
con

Connection to 2012 American Community Survey: 1-Year Estimates (ID: http://api.census.gov/data/id/2012acs1)

Acquiring Data

Geographical specification

Cenpy uses FIPS codes to specify the geographical extent of the data to be downloaded. The object con is our connection to the API, and its geographies attribute is a dictionary.

print(type(con))
print(type(con.geographies))
print(con.geographies.keys())

<class 'cenpy.remote.APIConnection'>
<class 'dict'>
dict_keys(['fips'])

# print head of data frame in the geographies dictionary
con.geographies['fips'].head()

	geoLevelId	name	optionalWithWCFor	requires
0	500	congressional district	state	[state]
1	060	county subdivision	NaN	[state, county]
2	795	public use microdata area	NaN	[state]
3	320	metropolitan statistical area/micropolitan sta...	NaN	[state]
4	310	metropolitan statistical area/micropolitan sta...	NaN	NaN
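The requires column appears to list the enclosing geographies that must be supplied before a given level can be queried. As a rough sketch under that assumption, the check below reports which of them a filter is still missing. missing_filters is a hypothetical helper (not a cenpy function), and the requires values are copied from the table above.

```python
def missing_filters(requires, geo_filter):
    """List the geographies named in `requires` that geo_filter does not supply."""
    # NaN or None in the table means no enclosing geography is listed.
    if not isinstance(requires, (list, tuple)):
        return []
    return [level for level in requires if level not in geo_filter]


# 'requires' values copied from the geographies table above.
requires_by_level = {
    "congressional district": ["state"],
    "county subdivision": ["state", "county"],
    "public use microdata area": ["state"],
}

sample_filter = {"state": "8"}
for level, requires in requires_by_level.items():
    print(level, "-> still needs:", missing_filters(requires, sample_filter))
```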

Both geo_unit and geo_filter are necessary arguments for the query() function. geo_unit specifies the geographic level at which data is returned, and geo_filter restricts the query to an enclosing area so that too much data is not downloaded. The following example downloads data for all counties in Colorado (state FIPS code 08, passed here as '8'; the Census Bureau publishes the full list of state FIPS codes).

g_unit = 'county:*'
g_filter = {'state':'8'}
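To see what these two arguments amount to, it may help to look at the request they describe. The Census Bureau's API takes get, for and in parameters; how cenpy assembles them internally is an assumption here, so treat the following as an illustrative sketch rather than the library's implementation. build_census_params is a hypothetical name.

```python
from urllib.parse import urlencode


def build_census_params(cols, geo_unit, geo_filter):
    """Build Census-API-style query parameters from cenpy-style arguments."""
    params = {"get": ",".join(cols), "for": geo_unit}
    if geo_filter:
        # Each enclosing geography becomes a 'level:code' pair in the 'in' clause.
        params["in"] = " ".join(
            f"{level}:{code}" for level, code in geo_filter.items()
        )
    return params


params = build_census_params(["NAME", "B01001A_001E"], "county:*", {"state": "08"})
print(params)
print(urlencode(params))  # the query string that would follow the dataset URL
```

The asterisk in county:* is a wildcard for every county, which is why the state filter matters: without it the request would cover every county in the country.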

Specifying variables to extract

The other argument taken by query() is cols, a list of column names drawn from the API's variables. These can be displayed with the variables attribute; however, because there are so many of them, it is easier to use the Social Explorer site to find the data you are interested in.

var = con.variables
print('Number of variables in', dataset, ':', len(var))
con.variables.head()

Number of variables in 2012acs1 : 68401

	concept	label	predicateOnly	predicateType
AIANHH	NaN	American Indian Area/Alaska Native Area/Hawaii...	NaN	NaN
AIANHHFP	NaN	American Indian Area/Alaska Native Area/Hawaii...	NaN	NaN
AIHHTLI	NaN	American Indian Trust Land/Hawaiian Home Land ...	NaN	NaN
AITS	NaN	American Indian Tribal Subdivision (FIPS)	NaN	NaN
AITSCE	NaN	American Indian Tribal Subdivision (Census)	NaN	NaN

Related columns of data always start with the same base prefix, so cenpy includes a function, varslike, that creates a list of column names matching the input pattern. It is also useful to add the NAME and GEOID columns, as these provide the name and geographic ID of each row. In this example, we will use table B01001A, which gives sex by age (for the White alone population) within the desired geography. The identifier at the end of each column name corresponds to males or females of different age groups.

cols = con.varslike('B01001A_')
cols.extend(['NAME', 'GEOID'])
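Column names in this table end in E or M, which in the Census API mark an estimate and its margin of error respectively. If only one kind is wanted, the list can be split before querying. This is a small sketch with a hypothetical helper name and a hand-written sample list, not the full output of varslike.

```python
def split_estimates_and_margins(cols):
    """Separate table columns into estimates (suffix E) and margins of error (suffix M)."""
    # Requiring an underscore skips non-table columns such as NAME and GEOID.
    estimates = [c for c in cols if "_" in c and c.endswith("E")]
    margins = [c for c in cols if "_" in c and c.endswith("M")]
    return estimates, margins


# Hand-written sample; a real list would come from con.varslike('B01001A_').
sample_cols = [
    "B01001A_030E", "B01001A_030M",
    "B01001A_031E", "B01001A_031M",
    "NAME", "GEOID",
]

estimates, margins = split_estimates_and_margins(sample_cols)
print("estimates:", estimates)
print("margins:  ", margins)
```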

With the three necessary arguments, data can be downloaded and saved as a pandas dataframe.

data = con.query(cols, geo_unit=g_unit, geo_filter=g_filter)
# prints a deprecation warning because of how cenpy calls pandas

/home/max/anaconda3/lib/python3.5/site-packages/cenpy/remote.py:167: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  df[cols] = df[cols].convert_objects(convert_numeric=convert_numeric)

It is useful to replace the default index with the data from the NAME or GEOID column, as these will give a more useful description of the data.

data.index = data.NAME

# print first five rows and last five columns
# (.iloc is used because the .ix indexer has been removed from pandas)
data.iloc[:5, -5:]

	B01001A_030M	B01001A_031E	B01001A_031M	NAME	GEOID
NAME
Adams County, Colorado	671	2483	670	Adams County, Colorado	05000US08001
Arapahoe County, Colorado	701	5125	688	Arapahoe County, Colorado	05000US08005
Boulder County, Colorado	636	2985	645	Boulder County, Colorado	05000US08013
Denver County, Colorado	654	5408	650	Denver County, Colorado	05000US08031
Douglas County, Colorado	384	1177	366	Douglas County, Colorado	05000US08035
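The GEOID values above pack several codes into one string: a three-digit summary level (050 for counties), a two-digit geographic component, the letters US, and then the state and county FIPS codes. A minimal sketch of unpacking a county-level GEOID follows; parse_county_geoid is a hypothetical helper, and it only handles the county format shown here.

```python
def parse_county_geoid(geoid):
    """Split a county-level GEOID such as '05000US08001' into its parts."""
    prefix, sep, codes = geoid.partition("US")
    if not sep or len(prefix) != 5 or len(codes) != 5:
        raise ValueError(f"not a county-level GEOID: {geoid!r}")
    return {
        "summary_level": prefix[:3],  # '050' means county
        "state": codes[:2],           # two-digit state FIPS code
        "county": codes[2:],          # three-digit county FIPS code
    }


parsed = parse_county_geoid("05000US08001")  # Adams County, Colorado
print(parsed)
```

The separate state and county codes are handy as join keys when combining this table with other county-level data.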

Topologically Integrated Geographic Encoding and Referencing (TIGER) data

The Census TIGER API provides geometries for desired geographic regions. For instance, we may want additional information on each county, such as its area.

cen.tiger.available()

[{'name': 'AIANNHA', 'type': 'MapServer'},
 {'name': 'CBSA', 'type': 'MapServer'},
 {'name': 'Hydro_LargeScale', 'type': 'MapServer'},
 {'name': 'Hydro', 'type': 'MapServer'},
 {'name': 'Labels', 'type': 'MapServer'},
 {'name': 'Legislative', 'type': 'MapServer'},
 {'name': 'Places_CouSub_ConCity_SubMCD', 'type': 'MapServer'},
 {'name': 'PUMA_TAD_TAZ_UGA_ZCTA', 'type': 'MapServer'},
 {'name': 'Region_Division', 'type': 'MapServer'},
 {'name': 'School', 'type': 'MapServer'},
 {'name': 'Special_Land_Use_Areas', 'type': 'MapServer'},
 {'name': 'State_County', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2013', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2014', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2015', 'type': 'MapServer'},
 {'name': 'tigerWMS_Census2010', 'type': 'MapServer'},
 {'name': 'tigerWMS_Current', 'type': 'MapServer'},
 {'name': 'tigerWMS_Econ2012', 'type': 'MapServer'},
 {'name': 'tigerWMS_PhysicalFeatures', 'type': 'MapServer'},
 {'name': 'Tracts_Blocks', 'type': 'MapServer'},
 {'name': 'Transportation_LargeScale', 'type': 'MapServer'},
 {'name': 'Transportation', 'type': 'MapServer'},
 {'name': 'TribalTracts', 'type': 'MapServer'},
 {'name': 'Urban', 'type': 'MapServer'},
 {'name': 'USLandmass', 'type': 'MapServer'}]

First, you must establish a connection to the TIGER API; then you can display the available layers. No TIGER data is available for ACS 2012, so we will use ACS 2013 for the sake of example; ideally, you will be able to find TIGER data that corresponds to your dataset.

con.set_mapservice('tigerWMS_ACS2013')

# print layers
con.mapservice.layers

{0: (ESRILayer) 2010 Census Public Use Microdata Areas,
 1: (ESRILayer) 2010 Census Public Use Microdata Areas Labels,
 2: (ESRILayer) 2010 Census ZIP Code Tabulation Areas,
 3: (ESRILayer) 2010 Census ZIP Code Tabulation Areas Labels,
 4: (ESRILayer) Tribal Census Tracts,
 5: (ESRILayer) Tribal Census Tracts Labels,
 6: (ESRILayer) Tribal Block Groups,
 7: (ESRILayer) Tribal Block Groups Labels,
 8: (ESRILayer) Census Tracts,
 9: (ESRILayer) Census Tracts Labels,
 10: (ESRILayer) Census Block Groups,
 11: (ESRILayer) Census Block Groups Labels,
 12: (ESRILayer) Unified School Districts,
 13: (ESRILayer) Unified School Districts Labels,
 14: (ESRILayer) Secondary School Districts,
 15: (ESRILayer) Secondary School Districts Labels,
 16: (ESRILayer) Elementary School Districts,
 17: (ESRILayer) Elementary School Districts Labels,
 18: (ESRILayer) Estates,
 19: (ESRILayer) Estates Labels,
 20: (ESRILayer) County Subdivisions,
 21: (ESRILayer) County Subdivisions Labels,
 22: (ESRILayer) Subbarrios,
 23: (ESRILayer) Subbarrios Labels,
 24: (ESRILayer) Consolidated Cities,
 25: (ESRILayer) Consolidated Cities Labels,
 26: (ESRILayer) Incorporated Places,
 27: (ESRILayer) Incorporated Places Labels,
 28: (ESRILayer) Census Designated Places,
 29: (ESRILayer) Census Designated Places Labels,
 30: (ESRILayer) Alaska Native Regional Corporations,
 31: (ESRILayer) Alaska Native Regional Corporations Labels,
 32: (ESRILayer) Tribal Subdivisions,
 33: (ESRILayer) Tribal Subdivisions Labels,
 34: (ESRILayer) Federal American Indian Reservations,
 35: (ESRILayer) Federal American Indian Reservations Labels,
 36: (ESRILayer) Off-Reservation Trust Lands,
 37: (ESRILayer) Off-Reservation Trust Lands Labels,
 38: (ESRILayer) State American Indian Reservations,
 39: (ESRILayer) State American Indian Reservations Labels,
 40: (ESRILayer) Hawaiian Home Lands,
 41: (ESRILayer) Hawaiian Home Lands Labels,
 42: (ESRILayer) Alaska Native Village Statistical Areas,
 43: (ESRILayer) Alaska Native Village Statistical Areas Labels,
 44: (ESRILayer) Oklahoma Tribal Statistical Areas,
 45: (ESRILayer) Oklahoma Tribal Statistical Areas Labels,
 46: (ESRILayer) State Designated Tribal Statistical Areas,
 47: (ESRILayer) State Designated Tribal Statistical Areas Labels,
 48: (ESRILayer) Tribal Designated Statistical Areas,
 49: (ESRILayer) Tribal Designated Statistical Areas Labels,
 50: (ESRILayer) American Indian Joint-Use Areas,
 51: (ESRILayer) American Indian Joint-Use Areas Labels,
 52: (ESRILayer) 113th Congressional Districts,
 53: (ESRILayer) 113th Congressional Districts Labels,
 54: (ESRILayer) 2013 State Legislative Districts - Upper,
 55: (ESRILayer) 2013 State Legislative Districts - Upper Labels,
 56: (ESRILayer) 2013 State Legislative Districts - Lower,
 57: (ESRILayer) 2013 State Legislative Districts - Lower Labels,
 58: (ESRILayer) Census Divisions,
 59: (ESRILayer) Census Divisions Labels,
 60: (ESRILayer) Census Regions,
 61: (ESRILayer) Census Regions Labels,
 62: (ESRILayer) 2010 Census Urbanized Areas,
 63: (ESRILayer) 2010 Census Urbanized Areas Labels,
 64: (ESRILayer) 2010 Census Urban Clusters,
 65: (ESRILayer) 2010 Census Urban Clusters Labels,
 66: (ESRILayer) Combined New England City and Town Areas,
 67: (ESRILayer) Combined New England City and Town Areas Labels,
 68: (ESRILayer) New England City and Town Area Divisions,
 69: (ESRILayer) New England City and Town Area  Divisions Labels,
 70: (ESRILayer) Metropolitan New England City and Town Areas,
 71: (ESRILayer) Metropolitan New England City and Town Areas Labels,
 72: (ESRILayer) Micropolitan New England City and Town Areas,
 73: (ESRILayer) Micropolitan New England City and Town Areas Labels,
 74: (ESRILayer) Combined Statistical Areas,
 75: (ESRILayer) Combined Statistical Areas Labels,
 76: (ESRILayer) Metropolitan Divisions,
 77: (ESRILayer) Metropolitan Divisions Labels,
 78: (ESRILayer) Metropolitan Statistical Areas,
 79: (ESRILayer) Metropolitan Statistical Areas Labels,
 80: (ESRILayer) Micropolitan Statistical Areas,
 81: (ESRILayer) Micropolitan Statistical Areas Labels,
 82: (ESRILayer) States,
 83: (ESRILayer) States Labels,
 84: (ESRILayer) Counties,
 85: (ESRILayer) Counties Labels}
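With this many layers, looking up an ID by name is less error-prone than reading it off the listing. The sketch below works on a stand-in dictionary of strings copied from the output above. Whether the text form of a real ESRILayer object matches this printed form is an assumption, and find_layer_id is a hypothetical helper.

```python
def find_layer_id(layers, name):
    """Return the ID of the single layer whose displayed name equals `name`."""
    matches = [
        layer_id
        for layer_id, layer in layers.items()
        # Assumes str(layer) looks like '(ESRILayer) Counties'.
        if str(layer).replace("(ESRILayer)", "").strip() == name
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one layer named {name!r}, found {len(matches)}")
    return matches[0]


# Stand-in for con.mapservice.layers, using entries from the listing above.
sample_layers = {
    82: "(ESRILayer) States",
    83: "(ESRILayer) States Labels",
    84: "(ESRILayer) Counties",
    85: "(ESRILayer) Counties Labels",
}

county_layer = find_layer_id(sample_layers, "Counties")
print(county_layer)
```

Matching on the exact name avoids picking up the companion "Labels" layer by mistake.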

The data retrieved earlier was at the county level, so we will use layer 84. Using the TIGER connection, query() retrieves the geometries, taking the layer and a where clause that selects the geographic location as arguments.

geodata = con.mapservice.query(layer=84, where='STATE=8')

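# preview geodata: first six rows, first five columns
# (.iloc is used because the .ix indexer has been removed from pandas)
geodata.iloc[:6, :5]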

	AREALAND	AREAWATER	BASENAME	CENTLAT	CENTLON
0	1881237983	36592000	Boulder	+40.0924502	-105.3577112
1	396290895	4208401	Denver	+39.7620189	-104.8765880
2	6179976050	30284242	Pueblo	+38.1732359	-104.5127778
3	85478497	1411781	Broomfield	+39.9541268	-105.0527108
4	2958007403	16886462	Delta	+38.8613998	-107.8631974
5	4605714129	8166134	Cheyenne	+38.8281780	-102.6034141
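The AREALAND and AREAWATER columns in TIGER data are given in square meters, so a unit conversion is usually the first step in making them readable. The following sketch uses three rows copied from the preview above, held in plain dictionaries instead of a dataframe; the helper names are hypothetical.

```python
SQ_M_PER_SQ_KM = 1_000_000

# Rows copied from the geodata preview above.
rows = [
    {"BASENAME": "Boulder", "AREALAND": 1881237983, "AREAWATER": 36592000},
    {"BASENAME": "Denver", "AREALAND": 396290895, "AREAWATER": 4208401},
    {"BASENAME": "Broomfield", "AREALAND": 85478497, "AREAWATER": 1411781},
]


def land_area_km2(row):
    """Convert a county's land area from square meters to square kilometers."""
    return row["AREALAND"] / SQ_M_PER_SQ_KM


def water_share(row):
    """Fraction of a county's total area that is water."""
    total = row["AREALAND"] + row["AREAWATER"]
    return row["AREAWATER"] / total


for row in rows:
    print(f"{row['BASENAME']}: {land_area_km2(row):.1f} km2 land, "
          f"{water_share(row):.2%} water")
```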

This data can now be merged with the original data to create one pandas dataframe containing all of the relevant data.

newdata = pd.merge(data, geodata, left_on='county', right_on='COUNTY')
# preview the merged data: first six rows, last five columns
newdata.iloc[:6, -5:]

	NAME_y	OBJECTID	OID	STATE	geometry
0	Adams County	1226	27553700234319	08	<pysal.cg.shapes.Polygon object at 0x7f6173163...
1	Arapahoe County	2980	27553703789414	08	<pysal.cg.shapes.Polygon object at 0x7f617096c...
2	Boulder County	512	27553701435070	08	<pysal.cg.shapes.Polygon object at 0x7f617448c...
3	Denver County	529	27553700234321	08	<pysal.cg.shapes.Polygon object at 0x7f617448c...
4	Douglas County	2762	27553711656416	08	<pysal.cg.shapes.Polygon object at 0x7f6173058...
5	El Paso County	2878	27553704502958	08	<pysal.cg.shapes.Polygon object at 0x7f6171448...
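A merge like this silently drops counties that appear in only one table, so it can be worth checking the match. The sketch below uses small made-up tables and assumes the county codes are stored as zero-padded strings in both; how="left", indicator and validate are standard pd.merge options. The county codes are taken from the GEOIDs shown earlier.

```python
import pandas as pd

# Toy census table; county codes taken from the GEOIDs shown earlier.
census = pd.DataFrame({
    "NAME": ["Adams County, Colorado", "Arapahoe County, Colorado",
             "Boulder County, Colorado"],
    "county": ["001", "005", "013"],
})

# Toy geometry table that deliberately lacks Arapahoe County.
shapes = pd.DataFrame({
    "BASENAME": ["Boulder", "Adams", "Denver"],
    "COUNTY": ["013", "001", "031"],
})

merged = pd.merge(
    census, shapes,
    left_on="county", right_on="COUNTY",
    how="left",              # keep every census row, matched or not
    indicator=True,          # adds a _merge column recording the match status
    validate="one_to_one",   # raise if a county code is duplicated on either side
)

unmatched = merged.loc[merged["_merge"] == "left_only", "NAME"].tolist()
print(unmatched)
```

An empty unmatched list would indicate that every census row found a geometry.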


Earth Analytics