Data

Organizations often do great work collecting data, but then share it in ways that are hard to access or understand, or require all users to repeat hours of cleaning to make the data usable. Sometimes a data hero comes along to share their own improved version that is cleaned and easier to access and understand. Here I share links to some of these “most-improved” datasets.

IPUMS.org is the gold standard here if you want microdata (individual-level survey responses) from any of the following:
American Community Survey (surveys ~3 million Americans per year about demographics, work, income, etc.)
Current Population Survey (surveys ~100,000 Americans per month about demographics and work, with supplements on additional topics. Some questions asked since 1962)
Medical Expenditure Panel Survey (surveys ~30,000 Americans per year about health status and health spending)

County Business Patterns Database: The Census Bureau has long collected data about the number of employees and establishments in each industry in each county. But their website makes you download each year separately, and only goes back to 1986. The authors of the County Business Patterns Database provide a harmonized panel in one file that goes from the present all the way to 1975.

Quarterly Census of Employment and Wages: The Bureau of Labor Statistics has collected data on employment, wages, and the number of establishments by state and detailed industry back to 1975. Their page is actually decent; they provide links to each year of data, and they have a good reason for not providing one file with all years: it would be well over 10GB. Still, it could take each user hours to download each year they want, delete extraneous information, and merge years together into a reasonably sized panel. That’s why it’s great that some people who already spent those hours shared their code: here’s R code and Stata code to get exactly what you want (and nothing more) out of the QCEW. The Stata code comes from Gabriel Chodorow-Reich; his page has code for several other datasets too.
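For readers who work in neither R nor Stata, the trim-and-merge step described above can be sketched in a few lines of Python. This is only an illustration, not the linked code: the column names in `KEEP` and the `industries` filter are assumptions, so check them against the header of the files you actually download. It streams each yearly file row by row, so memory use stays flat even when the inputs are large.

```python
import csv

# Assumed column names; verify against the header of your downloaded files.
KEEP = ["area_fips", "industry_code", "year", "qtr",
        "month3_emplvl", "total_qtrly_wages"]


def build_panel(year_files, out_path, keep=KEEP, industries=None):
    """Stream yearly CSVs into one trimmed panel; return the number of rows written."""
    rows_written = 0
    with open(out_path, "w", newline="") as out:
        # extrasaction="ignore" drops every column not listed in `keep`.
        writer = csv.DictWriter(out, fieldnames=keep, extrasaction="ignore")
        writer.writeheader()
        for path in sorted(year_files):
            with open(path, newline="") as src:
                for row in csv.DictReader(src):
                    # Optionally keep only the industries you care about.
                    if industries is not None and row["industry_code"] not in industries:
                        continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling `build_panel(["2020.csv", "2021.csv"], "panel.csv", industries={"10"})` (hypothetical file names) would write one file holding only the listed columns for industry code 10 across both years.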

Statistics of US Business: The SUSB is compiled by the Census Bureau, and like the QCEW it collects data on employment, payrolls, and the number of establishments by state and detailed industry. They each have slight advantages and disadvantages; the SUSB has firm counts as well as establishment counts, and has more detail at some levels (e.g. 4-digit NAICS codes by establishment size), but it’s only annual (instead of quarterly) and only goes back to 1997. The official SUSB page has the same basic issue as the QCEW page, with the additional problem that they sometimes change their file naming conventions from year to year. But because it’s not quite as big as the QCEW, it’s actually reasonable to merge all years into a single file that retains all variables; doing so comes in at just under 3GB. Here’s the Stata code I used to do so, and here’s a page with the full merged SUSB (1997-2022) and a smaller version with less detail (up to 3-digit NAICS).

Behavioral Risk Factor Surveillance System Survey: The BRFSS has been collected by the Centers for Disease Control since the 1980s. It now surveys 400,000 Americans each year on health-related topics including alcohol and drug use, health status, chronic disease, health care use, height and weight, diet, and exercise, along with demographics and geography. It’s a great survey that is underused because the CDC only offers it in XPT and ASC formats. So I offer the 1987-2023 BRFSS in Stata DTA and Excel CSV formats here.

State Life Expectancy Data 1990-2019: The CDC NCHS collects the underlying mortality data, but only makes state life expectancy easily available back to 2018. IHME extended this back to 1990, but puts it in a complex sheet that never actually gives overall life expectancy by state. I offer a simplified, easy-to-use version of state life expectancy data back to 1990 here.

State Demographics By Year 1962-2024: State-by-year averages for key demographic variables commonly used as controls in regressions: age, race, sex, marital status, income, education, health insurance. Created using IPUMS microdata from the Current Population Survey’s Annual Social and Economic Supplement. CPS data covers all US states back to 1977, and some back to 1962. Available in a neat panel in CSV and Stata formats; I share the cleaning code on OSF.
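To give a sense of what a state-by-year average involves, here is a minimal Python sketch of a weighted mean computed from person-level rows. It is not the cleaning code posted on OSF. It assumes each row carries `YEAR`, `STATEFIP`, a person weight `ASECWT`, and the variable of interest (here `AGE`); treat those names as assumptions to check against your own extract.

```python
from collections import defaultdict


def state_year_means(rows, value="AGE", weight="ASECWT"):
    """Return {(year, state): weighted mean of `value`} from person-level rows."""
    # key -> [sum of weight * value, sum of weight]
    totals = defaultdict(lambda: [0.0, 0.0])
    for row in rows:
        w = float(row[weight])
        key = (int(row["YEAR"]), int(row["STATEFIP"]))
        totals[key][0] += w * float(row[value])
        totals[key][1] += w
    # Skip cells whose weights sum to zero to avoid dividing by zero.
    return {key: wx / w for key, (wx, w) in totals.items() if w > 0}
```

The function accepts any iterable of dict-like rows, so passing a `csv.DictReader` over an extract saved as CSV should work without loading the whole file into memory.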

County Demographics By Year 1969-2023: It was shockingly difficult to find basic demographic information for each county across many years together in one neat panel. CDC SEER County Population Files seem to be the closest: they bring great info together into a single file, but the formatting is horrible. So I cleaned and reshaped it and posted it on Kaggle.
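The reshaping step amounts to collapsing many narrow rows into one row per county and year. The Python sketch below illustrates the idea only; it is not the code behind the Kaggle file, and the field names `fips`, `year`, `sex`, and `pop` are hypothetical stand-ins for whatever the parsed source rows contain.

```python
from collections import defaultdict


def to_county_panel(long_rows):
    """Collapse long rows (one per county-year-sex cell) into one dict per county-year."""
    panel = defaultdict(lambda: defaultdict(int))
    for r in long_rows:
        key = (r["fips"], int(r["year"]))
        pop = int(r["pop"])
        panel[key]["pop_total"] += pop          # running county-year total
        panel[key]["pop_" + r["sex"]] += pop    # one column per sex category
    # Emit rows sorted by county, then year, so the result reads as a panel.
    return [
        {"fips": fips, "year": year, **cols}
        for (fips, year), cols in sorted(panel.items())
    ]
```

The same pattern extends to age group or race by adding another prefixed column inside the loop.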

National Survey of Children’s Health: A great dataset on children’s health that has been taken down from government websites as of early February 2025. I put it back up, including improvements (Excel and Stata versions that merge all years 2016-2023) on my OSF page.

University of Michigan Consumer Surveys: A survey on consumer confidence, inflation expectations, and more that has been conducted since 1978. U Michigan shares the individual-level responses from the survey, but only in an unlabelled CSV file on a hard-to-find page. I share a cleaned and labelled Stata file on Kaggle.

Andrew Forrester maintains a similar page to this one here that provides cleaned versions of data on economic freedom, DOL PERM (permanent labor certification), the Home Mortgage Disclosure Act (HMDA), and the Community Reinvestment Act (CRA).

If you’re looking for data from one of my papers, see the links on my Research page or my Open Science Framework profile. Not everything is posted publicly yet, but if you ask about something I’ll move it up the priority list.

Coming soon: AHRF files in non-terrible formats

Think another dataset belongs on this page? Want me to update one of the datasets I manage to include a new release? Let me know.