Acquiring U.S. census data with Python and cenpy
Several useful online sources provide access to census data, both from the U.S. Census Bureau (American FactFinder) and from outside sources. These sources, however, are not conducive to large-scale data acquisition and analysis. The Cenpy Python package allows programmatic access to this data through the Census Bureau’s API.
This tutorial outlines the use of the Cenpy package to search for and acquire specific census data. Cenpy saves this data as a Pandas dataframe, which allows for easy access and analysis within Python. For easy visualization of this data, look into the GeoPandas package, which builds on the base Pandas package to add tools for geospatial data analysis.
Objectives
- Install Cenpy package
- Search for desired census data
- Download and store data
Dependencies
The Cenpy package depends on pandas and requests, and this tutorial also imports pysal. Ensure that Python and pip are already properly installed, then use the following commands to install the packages.
!pip install pandas
!pip install requests
!pip install cenpy
!pip install pysal
import pandas as pd
import cenpy as cen
import pysal
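Before going further, it can help to confirm that the installs succeeded. The sketch below uses only the standard library's importlib.metadata; the helper name installed_versions is a hypothetical one introduced for illustration and is not part of cenpy.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report on the four packages installed above.
for name, version in installed_versions(["pandas", "requests", "cenpy", "pysal"]).items():
    print(name, version or "NOT INSTALLED")
```

Any package reported as NOT INSTALLED should be installed again before running the rest of the tutorial.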
Finding Data
The cenpy explorer module allows you to view all of the available United States Census Bureau APIs.
datasets = list(cen.explorer.available(verbose=True).items())
# print first rows of the dataframe containing datasets
pd.DataFrame(datasets).head()
| | 0 | 1 |
---|---|---|
0 | 2012acs3 | 2012 American Community Survey: 3-Year Estimates |
1 | NONEMP2013 | 2013 Nonemployer Statistics: Non Employer Stat... |
2 | BDSFirms | Time Series Business Dynamics Statistics: Firm... |
3 | POPESTprmagesex2013 | Vintage 2013 Population Estimates: Puerto Rico... |
4 | POPESTcty2013 | Vintage 2013 Population Estimates: County Tota... |
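Because the full listing is long, a keyword search over the titles is often quicker than reading it. The sketch below is plain pandas and runs offline on the five rows shown above (titles are kept truncated exactly as displayed); the column names identifier and title and the helper search_catalog are assumptions made for this example.

```python
import pandas as pd

# Sample (identifier, title) pairs copied from the listing above. A live call
# to cen.explorer.available() would return many more rows.
datasets = [
    ("2012acs3", "2012 American Community Survey: 3-Year Estimates"),
    ("NONEMP2013", "2013 Nonemployer Statistics: Non Employer Stat..."),
    ("BDSFirms", "Time Series Business Dynamics Statistics: Firm..."),
    ("POPESTprmagesex2013", "Vintage 2013 Population Estimates: Puerto Rico..."),
    ("POPESTcty2013", "Vintage 2013 Population Estimates: County Tota..."),
]
catalog = pd.DataFrame(datasets, columns=["identifier", "title"])


def search_catalog(catalog, keyword):
    """Return the rows whose title contains keyword (case-insensitive, literal match)."""
    mask = catalog["title"].str.contains(keyword, case=False, regex=False)
    return catalog[mask]


acs = search_catalog(catalog, "american community survey")
print(acs)
```

The same function can be applied to the dataframe built from the real listing once its two columns are given names.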
Passing the name of a specific API to explorer.explain() returns a description of the available data. This example uses the 2012 American Community Survey 1-year data (2012acs1).
dataset = '2012acs1'
cen.explorer.explain(dataset)
{'2012 American Community Survey: 1-Year Estimates': "The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years. Questionnaires are mailed to a sample of addresses to obtain information about households -- that is, about each person and the housing unit itself. The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. It produces estimates for small areas, including census tracts and population subgroups. Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties. For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units."}
The base module allows you to establish a connection with the desired API that will be used later to acquire data.
con = cen.base.Connection(dataset)
con
Connection to 2012 American Community Survey: 1-Year Estimates (ID: http://api.census.gov/data/id/2012acs1)
Acquiring Data
Geographical specification
Cenpy uses FIPS codes to specify the geographical extent of the data to be downloaded. The object con is our connection to the API, and its geographies attribute is a dictionary.
print(type(con))
print(type(con.geographies))
print(con.geographies.keys())
<class 'cenpy.remote.APIConnection'>
<class 'dict'>
dict_keys(['fips'])
# print head of data frame in the geographies dictionary
con.geographies['fips'].head()
| | geoLevelId | name | optionalWithWCFor | requires |
---|---|---|---|---|
0 | 500 | congressional district | state | [state] |
1 | 060 | county subdivision | NaN | [state, county] |
2 | 795 | public use microdata area | NaN | [state] |
3 | 320 | metropolitan statistical area/micropolitan sta... | NaN | [state] |
4 | 310 | metropolitan statistical area/micropolitan sta... | NaN | NaN |
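The requires column appears to list the parent geographies that a query at each level must specify. As a rough sketch, the lookup below works on a hand-built copy of a few rows from the table above rather than on a live connection; required_filters is a hypothetical helper, and treating NaN as "no parent required" is an assumption.

```python
import pandas as pd

# A few rows copied from con.geographies['fips'] as displayed above.
fips = pd.DataFrame({
    "geoLevelId": ["500", "060", "795", "310"],
    "name": [
        "congressional district",
        "county subdivision",
        "public use microdata area",
        "metropolitan statistical area/micropolitan sta...",
    ],
    "requires": [["state"], ["state", "county"], ["state"], float("nan")],
})


def required_filters(fips, level_name):
    """Return the list of parent geographies listed for level_name."""
    matches = fips.loc[fips["name"] == level_name, "requires"]
    if matches.empty:
        raise KeyError(level_name)
    requires = matches.iloc[0]
    # NaN in the table is treated here as "no parent geography listed".
    return list(requires) if isinstance(requires, list) else []


print(required_filters(fips, "county subdivision"))
```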
geo_unit and geo_filter are both required arguments to the query() function. geo_unit specifies the geographic level at which data is returned, and geo_filter restricts the query to a parent geography so that no more data is downloaded than needed. The following example downloads data for all counties in Colorado (state FIPS codes are accessible here).
g_unit = 'county:*'
g_filter = {'state':'8'}
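FIPS state codes are conventionally written with two digits (Colorado is 08) and county codes with three. The helper below is a hypothetical convenience, not part of cenpy, that zero-pads the codes and returns the two values in the shape used above. Padding is a defensive choice here and not something the tutorial's query is known to require.

```python
def build_geo_args(state_fips, county="*"):
    """Return (geo_unit, geo_filter) for a county-level query within one state.

    state_fips and county may be ints or strings; '*' selects every county.
    """
    state = str(state_fips).zfill(2)
    county_part = "*" if county == "*" else str(county).zfill(3)
    return "county:" + county_part, {"state": state}


# All counties in Colorado.
g_unit, g_filter = build_geo_args(8)
print(g_unit, g_filter)
```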
Specifying variables to extract
The other argument taken by query() is cols, a list of column names drawn from the API's variables. These variables can be displayed with the variables attribute; however, because there are so many of them, it is easier to use the Social Explorer site to find the data you are interested in.
var = con.variables
print('Number of variables in', dataset, ':', len(var))
con.variables.head()
Number of variables in 2012acs1 : 68401
| | concept | label | predicateOnly | predicateType |
---|---|---|---|---|
AIANHH | NaN | American Indian Area/Alaska Native Area/Hawaii... | NaN | NaN |
AIANHHFP | NaN | American Indian Area/Alaska Native Area/Hawaii... | NaN | NaN |
AIHHTLI | NaN | American Indian Trust Land/Hawaiian Home Land ... | NaN | NaN |
AITS | NaN | American Indian Tribal Subdivision (FIPS) | NaN | NaN |
AITSCE | NaN | American Indian Tribal Subdivision (Census) | NaN | NaN |
Related columns of data always start with the same base prefix, so cenpy includes a function, varslike, that builds a list of column names matching an input pattern. It is also useful to add the NAME and GEOID columns, which provide the name and geographic ID of each row. This example uses table B01001A, which gives sex by age within the desired geography (in the Census Bureau's table naming, the A suffix denotes the White-alone population). The three-digit identifier at the end of each column name corresponds to males or females of a particular age group, and the trailing E or M marks an estimate or its margin of error.
cols = con.varslike('B01001A_')
cols.extend(['NAME', 'GEOID'])
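To make the naming scheme concrete, the sketch below selects columns by prefix from a plain list and splits a column name into its parts. It runs offline and does not call cenpy; the prefix test only approximates the effect of varslike and is not a description of how that function is implemented. The names columns_like and parse_variable are hypothetical, and the sample list is illustrative.

```python
import re

# Illustrative column names; the last three B01001A entries appear in the output below.
names = [
    "B01001A_001E", "B01001A_001M",
    "B01001A_030M", "B01001A_031E", "B01001A_031M",
    "B01001B_001E", "NAME", "GEOID",
]

# <table>_<three-digit line><E for estimate | M for margin of error>
VAR_PATTERN = re.compile(r"^(?P<table>[A-Z0-9]+)_(?P<line>\d{3})(?P<kind>[EM])$")


def columns_like(names, prefix):
    """Return the names that start with prefix, in their original order."""
    return [name for name in names if name.startswith(prefix)]


def parse_variable(name):
    """Split a name such as 'B01001A_031E' into (table, line, kind), or None."""
    match = VAR_PATTERN.match(name)
    if match is None:
        return None
    kind = {"E": "estimate", "M": "margin of error"}[match.group("kind")]
    return match.group("table"), int(match.group("line")), kind


cols = columns_like(names, "B01001A_") + ["NAME", "GEOID"]
print(cols)
print(parse_variable("B01001A_031E"))
```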
With the three necessary arguments, data can be downloaded and saved as a pandas dataframe.
data = con.query(cols, geo_unit=g_unit, geo_filter=g_filter)
# prints a deprecation warning because of how cenpy calls pandas
/home/max/anaconda3/lib/python3.5/site-packages/cenpy/remote.py:167: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
df[cols] = df[cols].convert_objects(convert_numeric=convert_numeric)
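The warning comes from inside cenpy, so nothing needs to change in your own code. If you ever need to do the same conversion yourself, pd.to_numeric is the replacement the message recommends. The sketch below applies it to a small hand-built dataframe of strings (values taken from the output shown later in this tutorial); selecting the columns by the B01001A_ prefix is an assumption made for the example.

```python
import pandas as pd

# Values as they might arrive from a web API: everything is a string.
raw = pd.DataFrame({
    "B01001A_031E": ["2483", "5125", "2985"],
    "B01001A_031M": ["670", "688", "645"],
    "NAME": ["Adams County, Colorado", "Arapahoe County, Colorado", "Boulder County, Colorado"],
})

# Convert only the data columns; NAME stays as text.
value_cols = [col for col in raw.columns if col.startswith("B01001A_")]
converted = raw.copy()
# errors="coerce" turns anything unparseable into NaN instead of raising.
converted[value_cols] = converted[value_cols].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
```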
It is useful to replace the default index with the NAME or GEOID column, as either gives a more meaningful label for each row.
data.index = data.NAME
# print first five rows and last five columns
# (.ix has been removed from pandas; .iloc selects by position)
data.iloc[:5, -5:]
NAME | B01001A_030M | B01001A_031E | B01001A_031M | NAME | GEOID |
---|---|---|---|---|---|
Adams County, Colorado | 671 | 2483 | 670 | Adams County, Colorado | 05000US08001 |
Arapahoe County, Colorado | 701 | 5125 | 688 | Arapahoe County, Colorado | 05000US08005 |
Boulder County, Colorado | 636 | 2985 | 645 | Boulder County, Colorado | 05000US08013 |
Denver County, Colorado | 654 | 5408 | 650 | Denver County, Colorado | 05000US08031 |
Douglas County, Colorado | 384 | 1177 | 366 | Douglas County, Colorado | 05000US08035 |
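Indexing by GEOID instead of NAME works the same way, and the GEOID itself can be split into FIPS codes. The sketch below uses three rows from the table above and assumes that county-level GEOIDs end in two state digits followed by three county digits, as they do in the rows shown; the column names state_fips and county_fips are introduced for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "B01001A_031E": [2483, 5125, 2985],
    "NAME": ["Adams County, Colorado", "Arapahoe County, Colorado", "Boulder County, Colorado"],
    "GEOID": ["05000US08001", "05000US08005", "05000US08013"],
})

# set_index returns a new dataframe; drop=False keeps GEOID as a column too.
indexed = data.set_index("GEOID", drop=False)

# Last five characters: two-digit state code, then three-digit county code.
indexed["state_fips"] = indexed["GEOID"].str[-5:-3]
indexed["county_fips"] = indexed["GEOID"].str[-3:]

print(indexed.loc["05000US08013", ["NAME", "state_fips", "county_fips"]])
```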
Topologically Integrated Geographic Encoding and Referencing (TIGER) data
The Census TIGER API provides geometries for desired geographic regions. For instance, we may want additional information about each county, such as its area.
cen.tiger.available()
[{'name': 'AIANNHA', 'type': 'MapServer'},
{'name': 'CBSA', 'type': 'MapServer'},
{'name': 'Hydro_LargeScale', 'type': 'MapServer'},
{'name': 'Hydro', 'type': 'MapServer'},
{'name': 'Labels', 'type': 'MapServer'},
{'name': 'Legislative', 'type': 'MapServer'},
{'name': 'Places_CouSub_ConCity_SubMCD', 'type': 'MapServer'},
{'name': 'PUMA_TAD_TAZ_UGA_ZCTA', 'type': 'MapServer'},
{'name': 'Region_Division', 'type': 'MapServer'},
{'name': 'School', 'type': 'MapServer'},
{'name': 'Special_Land_Use_Areas', 'type': 'MapServer'},
{'name': 'State_County', 'type': 'MapServer'},
{'name': 'tigerWMS_ACS2013', 'type': 'MapServer'},
{'name': 'tigerWMS_ACS2014', 'type': 'MapServer'},
{'name': 'tigerWMS_ACS2015', 'type': 'MapServer'},
{'name': 'tigerWMS_Census2010', 'type': 'MapServer'},
{'name': 'tigerWMS_Current', 'type': 'MapServer'},
{'name': 'tigerWMS_Econ2012', 'type': 'MapServer'},
{'name': 'tigerWMS_PhysicalFeatures', 'type': 'MapServer'},
{'name': 'Tracts_Blocks', 'type': 'MapServer'},
{'name': 'Transportation_LargeScale', 'type': 'MapServer'},
{'name': 'Transportation', 'type': 'MapServer'},
{'name': 'TribalTracts', 'type': 'MapServer'},
{'name': 'Urban', 'type': 'MapServer'},
{'name': 'USLandmass', 'type': 'MapServer'}]
First, establish a connection to the TIGER API; you can then display the available layers. No TIGER data is available for ACS 2012, so we will use ACS 2013 for the sake of example. Ideally, you will be able to find TIGER data corresponding to your dataset.
con.set_mapservice('tigerWMS_ACS2013')
# print layers
con.mapservice.layers
{0: (ESRILayer) 2010 Census Public Use Microdata Areas,
1: (ESRILayer) 2010 Census Public Use Microdata Areas Labels,
2: (ESRILayer) 2010 Census ZIP Code Tabulation Areas,
3: (ESRILayer) 2010 Census ZIP Code Tabulation Areas Labels,
4: (ESRILayer) Tribal Census Tracts,
5: (ESRILayer) Tribal Census Tracts Labels,
6: (ESRILayer) Tribal Block Groups,
7: (ESRILayer) Tribal Block Groups Labels,
8: (ESRILayer) Census Tracts,
9: (ESRILayer) Census Tracts Labels,
10: (ESRILayer) Census Block Groups,
11: (ESRILayer) Census Block Groups Labels,
12: (ESRILayer) Unified School Districts,
13: (ESRILayer) Unified School Districts Labels,
14: (ESRILayer) Secondary School Districts,
15: (ESRILayer) Secondary School Districts Labels,
16: (ESRILayer) Elementary School Districts,
17: (ESRILayer) Elementary School Districts Labels,
18: (ESRILayer) Estates,
19: (ESRILayer) Estates Labels,
20: (ESRILayer) County Subdivisions,
21: (ESRILayer) County Subdivisions Labels,
22: (ESRILayer) Subbarrios,
23: (ESRILayer) Subbarrios Labels,
24: (ESRILayer) Consolidated Cities,
25: (ESRILayer) Consolidated Cities Labels,
26: (ESRILayer) Incorporated Places,
27: (ESRILayer) Incorporated Places Labels,
28: (ESRILayer) Census Designated Places,
29: (ESRILayer) Census Designated Places Labels,
30: (ESRILayer) Alaska Native Regional Corporations,
31: (ESRILayer) Alaska Native Regional Corporations Labels,
32: (ESRILayer) Tribal Subdivisions,
33: (ESRILayer) Tribal Subdivisions Labels,
34: (ESRILayer) Federal American Indian Reservations,
35: (ESRILayer) Federal American Indian Reservations Labels,
36: (ESRILayer) Off-Reservation Trust Lands,
37: (ESRILayer) Off-Reservation Trust Lands Labels,
38: (ESRILayer) State American Indian Reservations,
39: (ESRILayer) State American Indian Reservations Labels,
40: (ESRILayer) Hawaiian Home Lands,
41: (ESRILayer) Hawaiian Home Lands Labels,
42: (ESRILayer) Alaska Native Village Statistical Areas,
43: (ESRILayer) Alaska Native Village Statistical Areas Labels,
44: (ESRILayer) Oklahoma Tribal Statistical Areas,
45: (ESRILayer) Oklahoma Tribal Statistical Areas Labels,
46: (ESRILayer) State Designated Tribal Statistical Areas,
47: (ESRILayer) State Designated Tribal Statistical Areas Labels,
48: (ESRILayer) Tribal Designated Statistical Areas,
49: (ESRILayer) Tribal Designated Statistical Areas Labels,
50: (ESRILayer) American Indian Joint-Use Areas,
51: (ESRILayer) American Indian Joint-Use Areas Labels,
52: (ESRILayer) 113th Congressional Districts,
53: (ESRILayer) 113th Congressional Districts Labels,
54: (ESRILayer) 2013 State Legislative Districts - Upper,
55: (ESRILayer) 2013 State Legislative Districts - Upper Labels,
56: (ESRILayer) 2013 State Legislative Districts - Lower,
57: (ESRILayer) 2013 State Legislative Districts - Lower Labels,
58: (ESRILayer) Census Divisions,
59: (ESRILayer) Census Divisions Labels,
60: (ESRILayer) Census Regions,
61: (ESRILayer) Census Regions Labels,
62: (ESRILayer) 2010 Census Urbanized Areas,
63: (ESRILayer) 2010 Census Urbanized Areas Labels,
64: (ESRILayer) 2010 Census Urban Clusters,
65: (ESRILayer) 2010 Census Urban Clusters Labels,
66: (ESRILayer) Combined New England City and Town Areas,
67: (ESRILayer) Combined New England City and Town Areas Labels,
68: (ESRILayer) New England City and Town Area Divisions,
69: (ESRILayer) New England City and Town Area Divisions Labels,
70: (ESRILayer) Metropolitan New England City and Town Areas,
71: (ESRILayer) Metropolitan New England City and Town Areas Labels,
72: (ESRILayer) Micropolitan New England City and Town Areas,
73: (ESRILayer) Micropolitan New England City and Town Areas Labels,
74: (ESRILayer) Combined Statistical Areas,
75: (ESRILayer) Combined Statistical Areas Labels,
76: (ESRILayer) Metropolitan Divisions,
77: (ESRILayer) Metropolitan Divisions Labels,
78: (ESRILayer) Metropolitan Statistical Areas,
79: (ESRILayer) Metropolitan Statistical Areas Labels,
80: (ESRILayer) Micropolitan Statistical Areas,
81: (ESRILayer) Micropolitan Statistical Areas Labels,
82: (ESRILayer) States,
83: (ESRILayer) States Labels,
84: (ESRILayer) Counties,
85: (ESRILayer) Counties Labels}
The data retrieved earlier was at the county level, so we will use layer 84. Using the TIGER connection, query() can retrieve the data, taking the layer and the geographic location as arguments.
geodata = con.mapservice.query(layer=84, where='STATE=8')
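# preview geodata: first six rows and first five columns
# (.ix has been removed from pandas; .iloc selects by position)
geodata.iloc[:6, :5]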
| | AREALAND | AREAWATER | BASENAME | CENTLAT | CENTLON |
---|---|---|---|---|---|
0 | 1881237983 | 36592000 | Boulder | +40.0924502 | -105.3577112 |
1 | 396290895 | 4208401 | Denver | +39.7620189 | -104.8765880 |
2 | 6179976050 | 30284242 | Pueblo | +38.1732359 | -104.5127778 |
3 | 85478497 | 1411781 | Broomfield | +39.9541268 | -105.0527108 |
4 | 2958007403 | 16886462 | Delta | +38.8613998 | -107.8631974 |
5 | 4605714129 | 8166134 | Cheyenne | +38.8281780 | -102.6034141 |
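The area columns are easier to read after a unit conversion. The sketch below assumes AREALAND and AREAWATER are in square meters, which is the usual TIGER convention, and works on three rows copied from the preview above. The derived column names land_km2 and water_share are introduced for illustration.

```python
import pandas as pd

geodata = pd.DataFrame({
    "BASENAME": ["Boulder", "Denver", "Pueblo"],
    "AREALAND": [1881237983, 396290895, 6179976050],
    "AREAWATER": [36592000, 4208401, 30284242],
})

M2_PER_KM2 = 1_000_000  # square meters in one square kilometer

# to_numeric guards against the values arriving as strings.
land = pd.to_numeric(geodata["AREALAND"])
water = pd.to_numeric(geodata["AREAWATER"])

geodata["land_km2"] = land / M2_PER_KM2
geodata["water_share"] = water / (land + water)

print(geodata[["BASENAME", "land_km2", "water_share"]])
```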
This data can now be merged with the original data to create one pandas dataframe containing all of the relevant data.
newdata = pd.merge(data, geodata, left_on='county', right_on='COUNTY')
# first six rows and last five columns (.iloc replaces the removed .ix)
newdata.iloc[:6, -5:]
| | NAME_y | OBJECTID | OID | STATE | geometry |
---|---|---|---|---|---|
0 | Adams County | 1226 | 27553700234319 | 08 | <pysal.cg.shapes.Polygon object at 0x7f6173163... |
1 | Arapahoe County | 2980 | 27553703789414 | 08 | <pysal.cg.shapes.Polygon object at 0x7f617096c... |
2 | Boulder County | 512 | 27553701435070 | 08 | <pysal.cg.shapes.Polygon object at 0x7f617448c... |
3 | Denver County | 529 | 27553700234321 | 08 | <pysal.cg.shapes.Polygon object at 0x7f617448c... |
4 | Douglas County | 2762 | 27553711656416 | 08 | <pysal.cg.shapes.Polygon object at 0x7f6173058... |
5 | El Paso County | 2878 | 27553704502958 | 08 | <pysal.cg.shapes.Polygon object at 0x7f6171448... |
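A merge silently drops rows whose keys do not match, so it is worth checking the result. The sketch below repeats the merge on small hand-built frames, using pandas' indicator and validate options to surface unmatched or duplicated keys. The county codes are taken from the GEOIDs shown earlier and the areas from the TIGER preview; Adams County is deliberately left out of the geometry table to show what a miss looks like. Both key columns must hold the same type (strings here) for the match to work.

```python
import pandas as pd

data = pd.DataFrame({
    "NAME": ["Adams County, Colorado", "Boulder County, Colorado", "Denver County, Colorado"],
    "county": ["001", "013", "031"],
    "B01001A_031E": [2483, 2985, 5408],
})
geodata = pd.DataFrame({
    "BASENAME": ["Boulder", "Denver"],
    "COUNTY": ["013", "031"],
    "AREALAND": [1881237983, 396290895],
})

merged = pd.merge(
    data, geodata,
    how="left",                # keep every census row, matched or not
    left_on="county", right_on="COUNTY",
    validate="one_to_one",     # raise if a county code repeats on either side
    indicator=True,            # adds a _merge column: 'both' or 'left_only'
)

unmatched = merged.loc[merged["_merge"] == "left_only", "NAME"].tolist()
print("Counties without geometry:", unmatched)
```

With the default inner join used above, such unmatched rows would simply be absent from newdata, so comparing len(newdata) with len(data) is a quick alternative check.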