Package 'usdata' reference manual

Title:	Data on the States and Counties of the United States
Description:	Demographic data on the United States at the county and state levels spanning multiple years.
Authors:	Mine Çetinkaya-Rundel [aut, cre], David Diez [aut], Leah Dorazio [aut]
Maintainer:	Mine Çetinkaya-Rundel <[email protected]>
License:	GPL-3
Version:	0.3.1
Built:	2025-03-14 05:29:47 UTC
Source:	https://github.com/openintrostat/usdata

Convert state abbreviations to names

Description

Two utility functions. One converts state names to the state abbreviations, and the second does the opposite.

Usage

abbr2state(abbr)

Arguments

abbr

A vector of state abbreviations.

Value

Returns a vector of the same length with the corresponding state names or abbreviations.

Author(s)

David Diez

Examples


abbr2state("MN")
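
The lookup described above can be sketched with base R's built-in state.abb and state.name vectors. This is only an illustration of the idea, not the package's implementation; the helper name abbr_to_name is hypothetical, and the built-in vectors cover the 50 states only.

```r
# Hypothetical helper (not part of usdata): map two-letter abbreviations
# to full state names using the vectors shipped with base R.
abbr_to_name <- function(abbr) {
  # match() returns NA for abbreviations it cannot find,
  # so unknown inputs come back as NA rather than raising an error.
  state.name[match(toupper(abbr), state.abb)]
}

abbr_to_name(c("MN", "ca", "ZZ"))
# "Minnesota" "California" NA
```

The result has the same length as the input, which mirrors the behavior documented under Value.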

Airline Delays for December 2019 and 2020.

Description

Summary counts of airline delays per carrier per US city.

Usage

airline_delay

Format

A data frame with 3351 rows and 21 variables.

year: Year data collected
month: Numeric representation of the month
carrier: Carrier.
carrier_name: Carrier Name.
airport: Airport code.
airport_name: Name of airport.
arr_flights: Number of flights arriving at airport
arr_del15: Number of flights more than 15 minutes late
carrier_ct: Number of flights delayed due to air carrier. (e.g. no crew)
weather_ct: Number of flights delayed due to weather.
nas_ct: Number of flights delayed due to National Aviation System (e.g. heavy air traffic).
security_ct: Number of flights canceled due to a security breach.
late_aircraft_ct: Number of flights delayed as a result of another flight on the same aircraft being delayed.
arr_cancelled: Number of cancelled flights
arr_diverted: Number of flights that were diverted
arr_delay: Total time (minutes) of flight delays.
carrier_delay: Total time (minutes) of delay due to air carrier
weather_delay: Total time (minutes) of delay due to inclement weather.
nas_delay: Total time (minutes) of delay due to National Aviation System.
security_delay: Total time (minutes) of delay as a result of a security issue.
late_aircraft_delay: Total time (minutes) of delay as a result of a previous flight on the same airplane being late.
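
As a rough sketch of how the count and minute columns relate, the following base R code computes the share of arriving flights that were delayed and the average delay per delayed flight for each carrier. The data frame toy and its values are made up for illustration; only the column names follow the list above.

```r
# Toy rows with made-up values; column names follow airline_delay.
toy <- data.frame(
  carrier     = c("AA", "AA", "DL"),
  arr_flights = c(100, 200, 150),
  arr_del15   = c(20, 30, 15),
  arr_delay   = c(900, 1500, 600)
)

# Sum the counts and delay minutes within each carrier.
totals <- aggregate(
  cbind(arr_flights, arr_del15, arr_delay) ~ carrier,
  data = toy,
  FUN = sum
)

# Percent of arriving flights delayed by more than 15 minutes.
totals$pct_delayed <- 100 * totals$arr_del15 / totals$arr_flights

# Average delay (minutes) per delayed flight.
totals$avg_delay_min <- totals$arr_delay / totals$arr_del15

totals
```

Summing before dividing weights each airport by its traffic; averaging per-row rates instead would give small airports the same weight as large ones.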

Source

Bureau of Transportation Statistics

Examples

library(ggplot2)
ggplot(airline_delay, aes(arr_flights, arr_del15, color = as.factor(year))) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Total Number of inbound flights",
    y = "Number of flights delayed by more than 15 mins",
    title = "Inbound vs delayed flights by year",
    color = "Year"
  )

United States Counties

Description

Data for 3142 counties in the United States. See the county_complete data set for additional variables.

Usage

county

Format

A data frame with 3142 observations on the following 15 variables.

name: County names.
state: State names.
pop2000: Population in 2000.
pop2010: Population in 2010.
pop2017: Population in 2017.
pop_change: Population change from 2010 to 2017.
poverty: Percent of population in poverty in 2017.
homeownership: Home ownership rate, 2006-2010.
multi_unit: Percent of housing units in multi-unit structures, 2006-2010.
unemployment_rate: Unemployment rate in 2017.
metro: Whether the county contains a metropolitan area.
median_edu: Median education level (2013-2017).
per_capita_income: Per capita (per person) income (2013-2017).
median_hh_income: Median household income.
smoking_ban: Describes the type of county-level smoking ban in place in 2010, taking one of the values "none", "partial", or "comprehensive".

Source

These data were collected from Census Quick Facts (no longer available as of 2020) and its accompanying pages. Smoking ban data were from a variety of sources.

Examples


library(ggplot2)

ggplot(county, aes(x = median_edu, y = median_hh_income)) +
  geom_boxplot()

American Community Survey 2019

Description

Data for 3142 counties in the United States with many variables of the 2019 American Community Survey.

Usage

county_2019

Format

A data frame with 3142 observations on the following 95 variables.

state: State.
name: County name.
fips: FIPS code.
median_individual_income: Median individual income (2019).
median_individual_income_moe: Margin of error for median_individual_income.
pop: 2019 population.
pop_moe: Margin of error for pop.
white: Percent of population that is white alone (2015-2019).
white_moe: Margin of error for white.
black: Percent of population that is black alone (2015-2019).
black_moe: Margin of error for black.
native: Percent of population that is Native American alone (2015-2019).
native_moe: Margin of error for native.
asian: Percent of population that is Asian alone (2015-2019).
asian_moe: Margin of error for asian.
pac_isl: Percent of population that is Native Hawaiian or other Pacific Islander alone (2015-2019).
pac_isl_moe: Margin of error for pac_isl.
other_single_race: Percent of population that is some other race alone (2015-2019).
other_single_race_moe: Margin of error for other_single_race.
two_plus_races: Percent of population that is two or more races (2015-2019).
two_plus_races_moe: Margin of error for two_plus_races.
hispanic: Percent of population that identifies as Hispanic or Latino (2015-2019).
hispanic_moe: Margin of error for hispanic.
white_not_hispanic: Percent of population that is white alone, not Hispanic or Latino (2015-2019).
white_not_hispanic_moe: Margin of error for white_not_hispanic.
median_age: Median age (2015-2019).
median_age_moe: Margin of error for median_age.
age_under_5: Percent of population under 5 (2015-2019).
age_under_5_moe: Margin of error for age_under_5.
age_over_85: Percent of population 85 and over (2015-2019).
age_over_85_moe: Margin of error for age_over_85.
age_over_18: Percent of population 18 and over (2015-2019).
age_over_18_moe: Margin of error for age_over_18.
age_over_65: Percent of population 65 and over (2015-2019).
age_over_65_moe: Margin of error for age_over_65.
mean_work_travel: Mean travel time to work (2015-2019).
mean_work_travel_moe: Margin of error for mean_work_travel.
persons_per_household: Persons per household (2015-2019).
persons_per_household_moe: Margin of error for persons_per_household.
avg_family_size: Average family size (2015-2019).
avg_family_size_moe: Margin of error for avg_family_size.
housing_one_unit_structures: Percent of housing units in 1-unit structures (2015-2019).
housing_one_unit_structures_moe: Margin of error for housing_one_unit_structures.
housing_two_unit_structures: Percent of housing units in multi-unit structures (2015-2019).
housing_two_unit_structures_moe: Margin of error for housing_two_unit_structures.
housing_mobile_homes: Percent of housing units in mobile homes and other types of units (2015-2019).
housing_mobile_homes_moe: Margin of error for housing_mobile_homes.
median_individual_income_age_25plus: Median individual income (2019 dollars, 2015-2019).
median_individual_income_age_25plus_moe: Margin of error for median_individual_income_age_25plus.
hs_grad: Percent of population 25 and older that is a high school graduate (2015-2019).
hs_grad_moe: Margin of error for hs_grad.
bachelors: Percent of population 25 and older that earned a Bachelor's degree or higher (2015-2019).
bachelors_moe: Margin of error for bachelors.
households: Total households (2015-2019).
households_moe: Margin of error for households.
households_speak_spanish: Percent of households speaking Spanish (2015-2019).
households_speak_spanish_moe: Margin of error for households_speak_spanish.
households_speak_other_indo_euro_lang: Percent of households speaking other Indo-European language (2015-2019).
households_speak_other_indo_euro_lang_moe: Margin of error for households_speak_other_indo_euro_lang.
households_speak_asian_or_pac_isl: Percent of households speaking Asian and Pacific Island language (2015-2019).
households_speak_asian_or_pac_isl_moe: Margin of error for households_speak_asian_or_pac_isl.
households_speak_other: Percent of households speaking a language that is neither European nor Asian/Pacific Island (2015-2019).
households_speak_other_moe: Margin of error for households_speak_other.
households_speak_limited_english: Percent of limited English-speaking households (2015-2019).
households_speak_limited_english_moe: Margin of error for households_speak_limited_english.
poverty: Percent of population below the poverty level (2015-2019).
poverty_moe: Margin of error for poverty.
poverty_under_18: Percent of population under 18 below the poverty level (2015-2019).
poverty_under_18_moe: Margin of error for poverty_under_18.
poverty_65_and_over: Percent of population 65 and over below the poverty level (2015-2019).
poverty_65_and_over_moe: Margin of error for poverty_65_and_over.
mean_household_income: Mean household income (2019 dollars, 2015-2019).
mean_household_income_moe: Margin of error for mean_household_income.
per_capita_income: Per capita money income in past 12 months (2019 dollars, 2015-2019).
per_capita_income_moe: Margin of error for per_capita_income.
median_household_income: Median household income (2015-2019).
median_household_income_moe: Margin of error for median_household_income.
veterans: Percent among civilian population 18 and over that are veterans (2015-2019).
veterans_moe: Margin of error for veterans.
unemployment_rate: Unemployment rate among those ages 20-64 (2015-2019).
unemployment_rate_moe: Margin of error for unemployment_rate.
uninsured: Percent of civilian noninstitutionalized population that is uninsured (2015-2019).
uninsured_moe: Margin of error for uninsured.
uninsured_under_6: Percent of population under 6 years that is uninsured (2015-2019).
uninsured_under_6_moe: Margin of error for uninsured_under_6.
uninsured_under_19: Percent of population under 19 that is uninsured (2015-2019).
uninsured_under_19_moe: Margin of error for uninsured_under_19.
uninsured_65_and_older: Percent of population 65 and older that is uninsured (2015-2019).
uninsured_65_and_older_moe: Margin of error for uninsured_65_and_older.
household_has_computer: Percent of households that have desktop or laptop computer (2015-2019).
household_has_computer_moe: Margin of error for household_has_computer.
household_has_smartphone: Percent of households that have smartphone (2015-2019).
household_has_smartphone_moe: Margin of error for household_has_smartphone.
household_has_broadband: Percent of households that have broadband internet subscription (2015-2019).
household_has_broadband_moe: Margin of error for household_has_broadband.
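
Each estimate above is paired with a _moe column. A minimal sketch of how such a pair might be used, assuming the margins follow the American Community Survey convention of a 90 percent confidence level: the estimate plus or minus the margin gives an interval, and dividing the margin by 1.645 gives an approximate standard error. The data frame toy and its values are made up; only the column names follow the list above.

```r
# Toy rows with made-up values; column names follow county_2019.
toy <- data.frame(
  name        = c("County A", "County B"),
  hs_grad     = c(88.0, 91.5),
  hs_grad_moe = c(1.5, 0.8)
)

# Interval bounds: estimate minus/plus its margin of error.
toy$hs_grad_lower <- toy$hs_grad - toy$hs_grad_moe
toy$hs_grad_upper <- toy$hs_grad + toy$hs_grad_moe

# Approximate standard error, assuming a 90 percent margin of error
# (z = 1.645). This is an assumption about the source data.
toy$hs_grad_se <- toy$hs_grad_moe / 1.645

toy
```

Counties with small populations tend to have wide margins, so comparing intervals rather than point estimates can avoid over-reading small differences.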

Source

The data were downloaded via the tidycensus R package.

Examples


library(ggplot2)

ggplot(
  county_2019,
  aes(
    x = hs_grad, y = median_individual_income,
    size = sqrt(pop) / 1000
  )
) +
  geom_point(alpha = 0.5) +
  scale_color_discrete(na.translate = FALSE) +
  guides(size = "none") +
  labs(
    x = "Percentage of population graduated from high school",
    y = "Median individual income"
  )

United States Counties

Description

Data for 3142 counties in the United States.

Usage

county_complete

Format

A data frame with 3142 observations on the following 188 variables.

state: State.
name: County name.
fips: FIPS code.
pop2000: 2000 population.
pop2010: 2010 population.
pop2011: 2011 population.
pop2012: 2012 population.
pop2013: 2013 population.
pop2014: 2014 population.
pop2015: 2015 population.
pop2016: 2016 population.
pop2017: 2017 population.
age_under_5_2010: Percent of population under 5 (2010).
age_under_5_2017: Percent of population under 5 (2017).
age_under_18_2010: Percent of population under 18 (2010).
age_over_65_2010: Percent of population over 65 (2010).
age_over_65_2017: Percent of population over 65 (2017).
median_age_2017: Median age (2017).
female_2010: Percent of population that is female (2010).
white_2010: Percent of population that is white (2010).
black_2010: Percent of population that is black (2010).
black_2017: Percent of population that is black (2017).
native_2010: Percent of population that is Native American (2010).
native_2017: Percent of population that is Native American (2017).
asian_2010: Percent of population that is Asian (2010).
asian_2017: Percent of population that is Asian (2017).
pac_isl_2010: Percent of population that is Hawaiian or Pacific Islander (2010).
pac_isl_2017: Percent of population that is Hawaiian or Pacific Islander (2017).
other_single_race_2017: Percent of population that identifies as another single race (2017).
two_plus_races_2010: Percent of population that identifies as two or more races (2010).
two_plus_races_2017: Percent of population that identifies as two or more races (2017).
hispanic_2010: Percent of population that is Hispanic (2010).
hispanic_2017: Percent of population that is Hispanic (2017).
white_not_hispanic_2010: Percent of population that is white and not Hispanic (2010).
white_not_hispanic_2017: Percent of population that is white and not Hispanic (2017).
speak_english_only_2017: Percent of population that speaks English only (2017).
no_move_in_one_plus_year_2010: Percent of population that has not moved in at least one year (2006-2010).
foreign_born_2010: Percent of population that is foreign-born (2006-2010).
foreign_spoken_at_home_2010: Percent of population that speaks a foreign language at home (2006-2010).
women_16_to_50_birth_rate_2017: Birth rate for women ages 16 to 50 (2017).
hs_grad_2010: Percent of population that is a high school graduate (2006-2010).
hs_grad_2016: Percent of population that is a high school graduate (2012-2016).
hs_grad_2017: Percent of population that is a high school graduate (2017).
some_college_2016: Percent of population with some college education (2012-2016).
some_college_2017: Percent of population with some college education (2017).
bachelors_2010: Percent of population that earned a bachelor's degree (2006-2010).
bachelors_2016: Percent of population that earned a bachelor's degree (2012-2016).
bachelors_2017: Percent of population that earned a bachelor's degree (2017).
veterans_2010: Percent of population that are veterans (2006-2010).
veterans_2017: Percent of population that are veterans (2017).
mean_work_travel_2010: Mean travel time to work (2006-2010).
mean_work_travel_2017: Mean travel time to work (2017).
broadband_2017: Percent of population that has access to broadband (2017).
computer_2017: Percent of population that has access to a computer (2017).
housing_units_2010: Number of housing units (2010).
homeownership_2010: Home ownership rate (2006-2010).
housing_multi_unit_2010: Housing units in multi-unit structures (2006-2010).
median_val_owner_occupied_2010: Median value of owner-occupied housing units (2006-2010).
households_2010: Households (2006-2010).
households_2017: Households (2017).
persons_per_household_2010: Persons per household (2006-2010).
persons_per_household_2017: Persons per household (2017).
per_capita_income_2010: Per capita money income in past 12 months (2010 dollars, 2006-2010).
per_capita_income_2017: Per capita money income in past 12 months (2017 dollars, 2017).
metro_2013: Whether the county contained a metropolitan area in 2013.
median_household_income_2010: Median household income (2006-2010).
median_household_income_2016: Median household income (2012-2016).
median_household_income_2017: Median household income (2017).
private_nonfarm_establishments_2009: Private nonfarm establishments (2009).
private_nonfarm_employment_2009: Private nonfarm employment (2009).
percent_change_private_nonfarm_employment_2009: Private nonfarm employment, percent change from 2000 to 2009.
nonemployment_establishments_2009: Nonemployer establishments (2009).
firms_2007: Total number of firms (2007).
black_owned_firms_2007: Black-owned firms, percent (2007).
native_owned_firms_2007: Native American-owned firms, percent (2007).
asian_owned_firms_2007: Asian-owned firms, percent (2007).
pac_isl_owned_firms_2007: Native Hawaiian and other Pacific Islander-owned firms, percent (2007).
hispanic_owned_firms_2007: Hispanic-owned firms, percent (2007).
women_owned_firms_2007: Women-owned firms, percent (2007).
manufacturer_shipments_2007: Manufacturer shipments, 2007 ($1000).
mercent_whole_sales_2007: Merchant wholesaler sales, 2007 ($1000).
sales_2007: Retail sales, 2007 ($1000).
sales_per_capita_2007: Retail sales per capita, 2007.
accommodation_food_service_2007: Accommodation and food services sales, 2007 ($1000).
building_permits_2010: Building permits (2010).
fed_spending_2009: Federal spending, in thousands of dollars (2009).
area_2010: Land area in square miles (2010).
density_2010: Persons per square mile (2010).
smoking_ban_2010: Describes the type of county-level smoking ban in place in 2010, taking one of the values "none", "partial", or "comprehensive".
poverty_2010: Percent of population below poverty level (2006-2010).
poverty_2016: Percent of population below poverty level (2012-2016).
poverty_2017: Percent of population below poverty level (2017).
poverty_age_under_5_2017: Percent of population under age 5 below poverty level (2017).
poverty_age_under_18_2017: Percent of population under age 18 below poverty level (2017).
civilian_labor_force_2007: Civilian labor force in 2007.
employed_2007: Number of civilians employed in 2007.
unemployed_2007: Number of civilians unemployed in 2007.
unemployment_rate_2007: Unemployment rate in 2007.
civilian_labor_force_2008: Civilian labor force in 2008.
employed_2008: Number of civilians employed in 2008.
unemployed_2008: Number of civilians unemployed in 2008.
unemployment_rate_2008: Unemployment rate in 2008.
civilian_labor_force_2009: Civilian labor force in 2009.
employed_2009: Number of civilians employed in 2009.
unemployed_2009: Number of civilians unemployed in 2009.
unemployment_rate_2009: Unemployment rate in 2009.
civilian_labor_force_2010: Civilian labor force in 2010.
employed_2010: Number of civilians employed in 2010.
unemployed_2010: Number of civilians unemployed in 2010.
unemployment_rate_2010: Unemployment rate in 2010.
civilian_labor_force_2011: Civilian labor force in 2011.
employed_2011: Number of civilians employed in 2011.
unemployed_2011: Number of civilians unemployed in 2011.
unemployment_rate_2011: Unemployment rate in 2011.
civilian_labor_force_2012: Civilian labor force in 2012.
employed_2012: Number of civilians employed in 2012.
unemployed_2012: Number of civilians unemployed in 2012.
unemployment_rate_2012: Unemployment rate in 2012.
civilian_labor_force_2013: Civilian labor force in 2013.
employed_2013: Number of civilians employed in 2013.
unemployed_2013: Number of civilians unemployed in 2013.
unemployment_rate_2013: Unemployment rate in 2013.
civilian_labor_force_2014: Civilian labor force in 2014.
employed_2014: Number of civilians employed in 2014.
unemployed_2014: Number of civilians unemployed in 2014.
unemployment_rate_2014: Unemployment rate in 2014.
civilian_labor_force_2015: Civilian labor force in 2015.
employed_2015: Number of civilians employed in 2015.
unemployed_2015: Number of civilians unemployed in 2015.
unemployment_rate_2015: Unemployment rate in 2015.
civilian_labor_force_2016: Civilian labor force in 2016.
employed_2016: Number of civilians employed in 2016.
unemployed_2016: Number of civilians unemployed in 2016.
unemployment_rate_2016: Unemployment rate in 2016.
uninsured_2017: Percent of population who are uninsured (2017).
uninsured_age_under_6_2017: Percent of population under 6 who are uninsured (2017).
uninsured_age_under_19_2017: Percent of population under 19 who are uninsured (2017).
uninsured_age_over_74_2017: Percent of population over 74 who are uninsured (2017).
civilian_labor_force_2017: Civilian labor force in 2017.
employed_2017: Number of civilians employed in 2017.
unemployed_2017: Number of civilians unemployed in 2017.
unemployment_rate_2017: Unemployment rate in 2017.
median_individual_income_2019: Median individual income (2019).
pop_2019: 2019 population.
white_2019: Percent of population that is white alone (2015-2019).
black_2019: Percent of population that is black alone (2015-2019).
native_2019: Percent of population that is Native American alone (2015-2019).
asian_2019: Percent of population that is Asian alone (2015-2019).
pac_isl_2019: Percent of population that is Native Hawaiian or other Pacific Islander alone (2015-2019).
other_single_race_2019: Percent of population that is some other race alone (2015-2019).
two_plus_races_2019: Percent of population that is two or more races (2015-2019).
hispanic_2019: Percent of population that identifies as Hispanic or Latino (2015-2019).
white_not_hispanic_2019: Percent of population that is white alone, not Hispanic or Latino (2015-2019).
median_age_2019: Median age (2015-2019).
age_under_5_2019: Percent of population under 5 (2015-2019).
age_over_85_2019: Percent of population 85 and over (2015-2019).
age_over_18_2019: Percent of population 18 and over (2015-2019).
age_over_65_2019: Percent of population 65 and over (2015-2019).
mean_work_travel_2019: Mean travel time to work (2015-2019).
persons_per_household_2019: Persons per household (2015-2019).
avg_family_size_2019: Average family size (2015-2019).
housing_one_unit_structures_2019: Percent of housing units in 1-unit structures (2015-2019).
housing_two_unit_structures_2019: Percent of housing units in multi-unit structures (2015-2019).
housing_mobile_homes_2019: Percent of housing units in mobile homes and other types of units (2015-2019).
median_individual_income_age_25plus_2019: Median individual income (2019 dollars, 2015-2019).
hs_grad_2019: Percent of population 25 and older that is a high school graduate (2015-2019).
bachelors_2019: Percent of population 25 and older that earned a Bachelor's degree or higher (2015-2019).
households_2019: Total households (2015-2019).
households_speak_spanish_2019: Percent of households speaking Spanish (2015-2019).
households_speak_other_indo_euro_lang_2019: Percent of households speaking other Indo-European language (2015-2019).
households_speak_asian_or_pac_isl_2019: Percent of households speaking Asian and Pacific Island language (2015-2019).
households_speak_other_2019: Percent of households speaking a language that is neither European nor Asian/Pacific Island (2015-2019).
households_speak_limited_english_2019: Percent of limited English-speaking households (2015-2019).
poverty_2019: Percent of population below the poverty level (2015-2019).
poverty_under_18_2019: Percent of population under 18 below the poverty level (2015-2019).
poverty_65_and_over_2019: Percent of population 65 and over below the poverty level (2015-2019).
mean_household_income_2019: Mean household income (2019 dollars, 2015-2019).
per_capita_income_2019: Per capita money income in past 12 months (2019 dollars, 2015-2019).
median_household_income_2019: Median household income (2015-2019).
veterans_2019: Percent among civilian population 18 and over that are veterans (2015-2019).
unemployment_rate_2019: Unemployment rate among those ages 20-64 (2015-2019).
uninsured_2019: Percent of civilian noninstitutionalized population that is uninsured (2015-2019).
uninsured_under_6_2019: Percent of population under 6 years that is uninsured (2015-2019).
uninsured_under_19_2019: Percent of population under 19 that is uninsured (2015-2019).
uninsured_65_and_older_2019: Percent of population 65 and older that is uninsured (2015-2019).
household_has_computer_2019: Percent of households that have desktop or laptop computer (2015-2019).
household_has_smartphone_2019: Percent of households that have smartphone (2015-2019).
household_has_broadband_2019: Percent of households that have broadband internet subscription (2015-2019).
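
The labor force columns above come in groups of four per year. A small sketch of how they would be expected to relate, assuming the usual definition of the unemployment rate as unemployed divided by civilian labor force: the data frame toy and its values are made up, and rate_check is a hypothetical column name used only for this illustration.

```r
# Toy rows with made-up values; column names follow county_complete.
toy <- data.frame(
  name                      = c("County A", "County B"),
  civilian_labor_force_2017 = c(10000, 25000),
  employed_2017             = c(9500, 24000),
  unemployed_2017           = c(500, 1000)
)

# Unemployment rate in percent, assuming rate = unemployed / labor force.
toy$rate_check <- 100 * toy$unemployed_2017 / toy$civilian_labor_force_2017

# In these toy rows, employed plus unemployed equals the labor force.
toy$labor_force_check <- toy$employed_2017 + toy$unemployed_2017

toy
```

On the real data, comparing a recomputed rate like this against unemployment_rate_2017 is one way to spot rounding differences or missing values.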

Source

The data prior to 2011 was from http://census.gov, though the exact page it came from is no longer available.

More recent data comes from the following sources.

Downloaded via the tidycensus R package.
Download links for spreadsheets were found on https://www.ers.usda.gov/data-products/county-level-data-sets/download-data
Unemployment - Bureau of Labor Statistics - LAUS data - https://www.bls.gov/lau/.
Median Household Income - Census Bureau - Small Area Income and Poverty Estimates (SAIPE) data.
The original data table was prepared by USDA, Economic Research Service.
Census Bureau.
2012-16 American Community Survey 5-yr average.
The original data table was prepared by USDA, Economic Research Service.
Tim Parker (tparker at ers.usda.gov) is the contact for much of the new data incorporated into this data set.

Examples


library(dplyr)
library(ggplot2)

county_complete |>
  mutate(
    pop_change = 100 * ((pop2017 / pop2013) - 1),
    metro_area = if_else(metro_2013 == 1, TRUE, FALSE)
  ) |>
  ggplot(aes(
    x = poverty_2016,
    y = pop_change,
    color = metro_area,
    size = sqrt(pop2017) / 1e3
  )) +
  geom_point(alpha = 0.5) +
  scale_color_discrete(na.translate = FALSE) +
  guides(size = "none") +
  labs(
    x = "Percentage of population in poverty (2016)",
    y = "Percentage population change from 2013 to 2017",
    color = "Metropolitan area",
    title = "Population change and poverty"
  )

# Counties with high population change
county_complete |>
  mutate(pop_change = 100 * ((pop2017 / pop2013) - 1)) |>
  filter(pop_change < -10 | pop_change > 25) |>
  select(state, name, fips, pop_change)

# Population by metro area
county_complete |>
  mutate(metro_area = if_else(metro_2013 == 1, TRUE, FALSE)) |>
  filter(!is.na(metro_area)) |>
  ggplot(aes(x = metro_area, y = log(pop2017))) +
  geom_violin() +
  labs(
    x = "Metro area",
    y = "Log of population in 2017",
    title = "Population by metro area"
  )

# Poverty and median household income
county_complete |>
  mutate(metro_area = if_else(metro_2013 == 1, TRUE, FALSE)) |>
  ggplot(aes(
    x = poverty_2016,
    y = median_household_income_2016,
    color = metro_area,
    size = sqrt(pop2017) / 1e3
  )) +
  geom_point(alpha = 0.5) +
  scale_color_discrete(na.translate = FALSE) +
  guides(size = "none") +
  labs(
    x = "Percentage of population in poverty (2016)",
    y = "Median household income (2016)",
    color = "Metropolitan area",
    title = "Poverty and median household income"
  )

# Unemployment rate and poverty
county_complete |>
  mutate(metro_area = if_else(metro_2013 == 1, TRUE, FALSE)) |>
  ggplot(aes(
    x = unemployment_rate_2017,
    y = poverty_2016,
    color = metro_area,
    size = sqrt(pop2017) / 1e3
  )) +
  geom_point(alpha = 0.5) +
  scale_color_discrete(na.translate = FALSE) +
  guides(size = "none") +
  labs(
    x = "Unemployment rate (2017)",
    y = "Percentage of population in poverty (2016)",
    color = "Metropolitan area",
    title = "Unemployment rate and poverty"
  )

Fatal Police Shootings data.

Description

A subset of the Washington Post database. Contains records of every fatal police shooting by an on-duty officer since January 1, 2015.

Usage

fatal_police_shootings

Format

A data frame with 6421 rows and 12 variables.

date: date of fatal shooting.
manner_of_death: shot or shot and Tasered.
armed: Indicates if the victim was armed with some sort of implement that a police officer believed could inflict harm.
age: the age of the victim.
gender: The gender of the victim. The Post identifies victims by the gender they identify with if reports indicate that it differs from their biological sex.
race: W White non-Hispanic; B Black non-Hispanic; A Asian; N Native American; H Hispanic; O Other; None unknown.
city: The municipality where the fatal shooting took place. Note that in some cases this field may contain a county name if a more specific municipality is unavailable or unknown.
state: two-letter postal code abbreviation.
signs_of_mental_illness: If news reports have indicated the victim had a history of mental health issues, expressed suicidal intentions or was experiencing mental distress at the time of the shooting.
threat_level: The attack category flags the highest level of threat: the most direct and immediate threats to life, including incidents where officers or others were shot at, threatened with a gun, or attacked with other weapons or physical force. The other and undetermined categories represent all remaining cases; other includes many incidents where officers or others faced significant threats.
flee: If news reports have indicated the victim was moving away from officers by Foot, by Car, or Not fleeing.
body_camera: If news reports have indicated an officer was wearing a body camera and it may have recorded some portion of the incident.

Source

Washington Post

Examples

library(dplyr)

# List race frequency and percentage
fatal_police_shootings |>
  group_by(race) |>
  summarize(n = n()) |>
  mutate(freq = n / sum(n) * 100)
# List different weapons that victims were armed with
fatal_police_shootings |>
  distinct(armed)

Gerrymander

Description

A dataset on gerrymandering and its influence on House elections. The data set was originally built by Jeff Whitmer.

Usage

gerrymander

Format

A data frame with 435 rows and 12 variables:

district: Congressional district.
last_name: Last name of 2016 election winner.
first_name: First name of 2016 election winner.
party16: Political party of 2016 election winner.
clinton16: Percent of vote received by Clinton in 2016 Presidential Election.
trump16: Percent of vote received by Trump in 2016 Presidential Election.
dem16: Did a Democrat win the 2016 House election? Levels of 1 (yes) and 0 (no).
state: State the Representative is from.
party18: Political party of the 2018 election winner.
dem18: Did a Democrat win the 2018 House election? Levels of 1 (yes) and 0 (no).
flip18: Did a Democrat flip the seat in the 2018 election? Levels of 1 (yes) and 0 (no).
gerry: Categorical variable for prevalence of gerrymandering with levels of low, mid and high.
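
The relationship among the three indicator columns can be illustrated with a small sketch. It assumes that flip18 equals 1 exactly when dem16 is 0 and dem18 is 1, which is an interpretation of the descriptions above rather than a documented rule. The toy data frame and its district labels are hypothetical and are not rows of gerrymander.

```r
# Hypothetical toy districts (not taken from the gerrymander data)
toy <- data.frame(
  district = c("AA-01", "AA-02", "BB-01", "BB-02"),
  dem16 = c(0, 1, 0, 1),
  dem18 = c(1, 1, 0, 0)
)

# Assumed rule: a Democrat flips a seat when a non-Democrat won it in 2016
# and a Democrat won it in 2018
toy$flip18 <- as.integer(toy$dem16 == 0 & toy$dem18 == 1)
toy
```

Only the first toy district counts as a flip under this assumed rule; a seat that was already Democratic in 2016 does not.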

Source

Washington Post

Examples

library(ggplot2)
library(dplyr)
ggplot(gerrymander |> filter(gerry != "mid"), aes(clinton16, dem16, color = gerry)) +
  geom_jitter(height = 0.05, size = 3, shape = 1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  scale_color_manual(values = c("purple", "orange")) +
  labs(
    title = "Logistic Regression of 2016 House Elections",
    subtitle = "by Congressional District",
    x = "Percent of Presidential Vote Won by Clinton",
    y = "Seat Won by Democrat Candidate",
    color = "Gerrymandering"
  )

Election results for 2010 Governor races in the U.S.

Description

Election results for 2010 Governor races in the U.S.

Usage

govrace10

Format

A data frame with 37 observations on the following 23 variables.

id: Unique identifier for the race, which does not overlap with other 2010 races (see houserace10 and senaterace10)
state: State name
abbr: State name abbreviation
name1: Name of the winning candidate
perc1: Percentage of vote for winning candidate (if more than one candidate)
party1: Party of winning candidate
votes1: Number of votes for winning candidate
name2: Name of candidate with second most votes
perc2: Percentage of vote for candidate who came in second
party2: Party of candidate with second most votes
votes2: Number of votes for candidate who came in second
name3: Name of candidate with third most votes
perc3: Percentage of vote for candidate who came in third
party3: Party of candidate with third most votes
votes3: Number of votes for candidate who came in third
name4: Name of candidate with fourth most votes
perc4: Percentage of vote for candidate who came in fourth
party4: Party of candidate with fourth most votes
votes4: Number of votes for candidate who came in fourth
name5: Name of candidate with fifth most votes
perc5: Percentage of vote for candidate who came in fifth
party5: Party of candidate with fifth most votes
votes5: Number of votes for candidate who came in fifth

Source

MSNBC.com, retrieved 2010-11-09.

Examples


table(govrace10$party1, govrace10$party2)

Election results for the 2010 U.S. House of Representatives races

Description

Election results for the 2010 U.S. House of Representatives races

Usage

houserace10

Format

A data frame with 435 observations on the following 24 variables.

id: Unique identifier for the race, which does not overlap with other 2010 races (see govrace10 and senaterace10)
state: State name
abbr: State name abbreviation
num: District number for the state
name1: Name of the winning candidate
perc1: Percentage of vote for winning candidate (if more than one candidate)
party1: Party of winning candidate
votes1: Number of votes for winning candidate
name2: Name of candidate with second most votes
perc2: Percentage of vote for candidate who came in second
party2: Party of candidate with second most votes
votes2: Number of votes for candidate who came in second
name3: Name of candidate with third most votes
perc3: Percentage of vote for candidate who came in third
party3: Party of candidate with third most votes
votes3: Number of votes for candidate who came in third
name4: Name of candidate with fourth most votes
perc4: Percentage of vote for candidate who came in fourth
party4: Party of candidate with fourth most votes
votes4: Number of votes for candidate who came in fourth
name5: Name of candidate with fifth most votes
perc5: Percentage of vote for candidate who came in fifth
party5: Party of candidate with fifth most votes
votes5: Number of votes for candidate who came in fifth

Details

The analysis in the Examples section was inspired by, and is similar to, Nate Silver's district-level analysis on the FiveThirtyEight blog in the New York Times: https://fivethirtyeight.com/features/2010-an-aligning-election/

Source

MSNBC.com, retrieved 2010-11-09.

Examples


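# Count House seats won by each party in each state
hr <- table(houserace10[, c("abbr", "party1")])
nr <- apply(hr, 1, sum)

# Obama's 2008 vote share by state, excluding DC
pr <- prrace08[prrace08$state != "DC", c("state", "p_obama")]
hr <- hr[as.character(pr$state), ]
# State-level logistic regression with the per-party seat counts as the response
(fit <- glm(hr ~ pr$p_obama, family = binomial))

# District-level logistic regression: 1 if a Democrat won the seat, 0 otherwise
x1 <- pr$p_obama[match(houserace10$abbr, pr$state)]
y1 <- (houserace10$party1 == "Democrat") + 0
g <- glm(y1 ~ x1, family = binomial)

# Plot each state's share of seats won by Democrats; point size grows with seats
x <- pr$p_obama[pr$state != "DC"]
nr <- apply(hr, 1, sum)
plot(x, hr[, "Democrat"] / nr,
  pch = 19, cex = sqrt(nr), col = "#22558844",
  xlim = c(20, 80), ylim = c(0, 1),
  xlab = "Percent vote for Obama in 2008",
  ylab = "Probability of Democrat winning House seat"
)
# Overlay a logistic curve from hard-coded intercept and slope values
X <- seq(0, 100, 0.1)
lo <- -5.6079 + 0.1009 * X
p <- exp(lo) / (1 + exp(lo))
lines(X, p)
abline(h = 0:1, lty = 2, col = "#888888")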

Pierce County House Sales Data for 2020

Description

Real estate sales for Pierce County, WA in 2020.

Usage

pierce_county_house_sales

Format

A data frame with 16814 rows and 19 variables.

sale_date: Date the legal document (deed) was executed.
sale_price: Dollar amount recorded for the sale.
house_square_feet: Sum of the square feet for the building.
attic_finished_square_feet: Finished living area in the attic.
basement_square_feet: Total square footage of the basement.
attached_garage_square_feet: Total square footage of the attached or built in garage(s).
detached_garage_square_feet: Total detached garage(s) square footage.
fireplaces: Total count of single, double or PreFab stoves.
hvac_description: Text description associated with the predominant heating source for the built-as structure, e.g., Forced Air, Electric Baseboard, Steam.
exterior: Predominant type of construction materials used for the exterior siding on Residential Buildings.
interior: Predominant type of materials used on the interior walls, e.g., Sheetrock or Paneling.
stories: Number of floors/building levels above grade. Stories do not include attic or basement areas.
roof_cover: Material used for the roof, e.g., Composition Shingles, Wood Shake, Concrete Tile.
year_built: Year the building was built, as stated by the building permit or a historical record.
bedrooms: Number of bedrooms listed for a residential property.
bathrooms: Number of baths listed for a residential property. The number is listed as a decimal, e.g., 2.75 = two full and one three-quarter baths. A tub/sink/toilet combination (plus any additional fixtures) is considered 1.0 bath. A shower/sink/toilet combination (plus any additional fixtures) is 0.75 bath. A sink/toilet combination is 0.5 bath.
waterfront_type: Describes the type of waterfront the property adjoins or has legal access to.
view_quality: Assigned to reflect the market appeal of the overall view available from the dwelling or property.
utility_sewer: Identifies if sewer/septic is installed, available or not available or if the property does not support an on site sewage disposal system.
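
The decimal encoding described for bathrooms can be worked through with a short sketch. The weights come from the description above; the helper name bath_count and its arguments are hypothetical and are not part of the package.

```r
# Weights for each fixture combination, as given in the bathrooms description
bath_weights <- c(full = 1.0, three_quarter = 0.75, half = 0.5)

# Hypothetical helper: convert counts of each bath type to the decimal value
bath_count <- function(full = 0, three_quarter = 0, half = 0) {
  sum(c(full, three_quarter, half) * bath_weights)
}

bath_count(full = 2, three_quarter = 1) # 2.75, matching the example in the text
bath_count(full = 1, half = 1)
```

The encoding is not uniquely reversible: a value of 1.5 could mean one full and one half bath, or two three-quarter baths.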

Source

Pierce County, Washington

Examples

library(dplyr)
library(lubridate)

# List house sales frequency and average price grouped by month
pierce_county_house_sales |>
  mutate(month_sale = month(sale_date)) |>
  group_by(month_sale) |>
  summarize(freq = n(), mean_price = mean(sale_price)) |>
  arrange(desc(freq))

# List house sales frequency and average price grouped by waterfront type
pierce_county_house_sales |>
  group_by(waterfront_type) |>
  summarize(freq = n(), mean_price = mean(sale_price)) |>
  arrange(desc(mean_price))

Population Age 2019 Data.

Description

State level data on population by age.

Usage

pop_age_2019

Format

A data frame with 2820 rows and 5 variables.

state: State as 2 letter abbreviation.
state_name: State name.
age: Age cohort for population.
population: Population of age cohort.
state_total_population: total estimated state population in 2019

Source

Centers for Disease Control and Prevention

Examples

library(dplyr)

# List age population for each state with percent of total
pop_age_2019 |>
  group_by(state_name, age) |>
  mutate(percent = population / state_total_population * 100) |>
  select(state_name, age, population, percent)

pop_age_2019 |>
  select(state_name, state_total_population) |>
  distinct() |>
  arrange(desc(state_total_population))

Population Race 2019 Data.

Description

State level data on population by race.

Usage

pop_race_2019

Format

A data frame with 2820 rows and 6 variables.

state: State as 2 letter abbreviation.
state_name: State name.
race: race cohort for population.
hispanic: indicates whether population is Hispanic or Latino
population: Population of race cohort.
state_total_population: total estimated state population in 2019

Source

Centers for Disease Control and Prevention

Examples

library(dplyr)

# List race population for each state with percent of total
pop_race_2019 |>
  group_by(state_name, race, hispanic) |>
  mutate(percent = population / state_total_population * 100) |>
  select(state_name, race, hispanic, population, percent)

pop_race_2019 |>
  select(state_name, state_total_population) |>
  distinct() |>
  arrange(desc(state_total_population))

Presidential Power.

Description

Data from a Pew Research Center poll about Presidential power/control over gas prices.

Usage

prez_pwr

Format

A data frame with 365 rows and 3 variables.

president: Sitting President at time of the poll.
party: Political party of the respondent with levels d(emocrat) and r(epublican).
has_pwr: Respondent answer to the question: "Is the price of gasoline something the president can do a lot about, or is that beyond the president's control?"

Source

Pew Research Center, May 2006 & March 2012.

Examples

library(ggplot2)
ggplot(prez_pwr, aes(has_pwr, fill = party)) +
  geom_bar() +
  labs(
    title = "Is the price of gasoline something the president can do a lot about?",
    x = "",
    y = "Number of respondents",
    fill = "Respondent Party"
  ) +
  facet_wrap(~president)

Election results for the 2008 U.S. Presidential race

Description

Election results for the 2008 U.S. Presidential race

Usage

prrace08

Format

A data frame with 51 observations on the following 7 variables.

state: State name abbreviation
state_full: Full state name
n_obama: Number of votes for Barack Obama
p_obama: Proportion of votes for Barack Obama
n_mc_cain: Number of votes for John McCain
p_mc_cain: Proportion of votes for John McCain
el_votes: Number of electoral votes for a state

Details

In Nebraska, 4 electoral votes went to McCain and 1 to Obama. Otherwise, each state's electoral votes were winner-take-all.

Source

Presidential Election of 2008, Electoral and Popular Vote Summary, retrieved 2011-04-21.

Examples


# ===> Obtain 2010 US House Election Data <===#
hr <- table(houserace10[, c("abbr", "party1")])
nr <- apply(hr, 1, sum)

# ===> Obtain 2008 President Election Data <===#
pr <- prrace08[prrace08$state != "DC", c("state", "p_obama")]
hr <- hr[as.character(pr$state), ]
(fit <- glm(hr ~ pr$p_obama, family = binomial))

# ===> Visualizing Binomial outcomes <===#
x <- pr$p_obama[pr$state != "DC"]
nr <- apply(hr, 1, sum)
plot(x, hr[, "Democrat"] / nr,
  pch = 19, cex = sqrt(nr), col = "#22558844",
  xlim = c(20, 80), ylim = c(0, 1), xlab = "Percent vote for Obama in 2008",
  ylab = "Probability of Democrat winning House seat"
)

# ===> Logistic Regression <===#
x1 <- pr$p_obama[match(houserace10$abbr, pr$state)]
y1 <- (houserace10$party1 == "Democrat") + 0
g <- glm(y1 ~ x1, family = binomial)
X <- seq(0, 100, 0.1)
lo <- -5.6079 + 0.1009 * X
p <- exp(lo) / (1 + exp(lo))
lines(X, p)
abline(h = 0:1, lty = 2, col = "#888888")
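
The last lines of the example turn a linear predictor on the log-odds scale into a probability with exp(lo) / (1 + exp(lo)). The sketch below checks that this expression agrees with base R's plogis() and finds where the plotted curve crosses 0.5. The intercept and slope are the hard-coded values from the example; the names b0, b1, inv_logit and crossover are introduced here for illustration only.

```r
# Hard-coded coefficients copied from the example above
b0 <- -5.6079
b1 <- 0.1009

# Inverse logit written out, as in the example
inv_logit <- function(lo) exp(lo) / (1 + exp(lo))

X <- seq(0, 100, 0.1)
lo <- b0 + b1 * X

# plogis() is the logistic distribution function in the stats package and
# computes the same quantity
all.equal(inv_logit(lo), plogis(lo))

# The curve reaches a probability of 0.5 where the log-odds equal zero
crossover <- -b0 / b1
crossover # roughly 55.6 on the Obama vote-share axis
```

Using plogis() avoids writing the formula by hand and behaves well for large positive or negative log-odds.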

Election results for the 2010 U.S. Senate races

Description

Election results for the 2010 U.S. Senate races

Usage

senaterace10

Format

A data frame with 38 observations on the following 23 variables.

id: Unique identifier for the race, which does not overlap with other 2010 races (see govrace10 and houserace10)
state: State name
abbr: State name abbreviation
name1: Name of the winning candidate
perc1: Percentage of vote for winning candidate (if more than one candidate)
party1: Party of winning candidate
votes1: Number of votes for winning candidate
name2: Name of candidate with second most votes
perc2: Percentage of vote for candidate who came in second
party2: Party of candidate with second most votes
votes2: Number of votes for candidate who came in second
name3: Name of candidate with third most votes
perc3: Percentage of vote for candidate who came in third
party3: Party of candidate with third most votes
votes3: Number of votes for candidate who came in third
name4: Name of candidate with fourth most votes
perc4: Percentage of vote for candidate who came in fourth
party4: Party of candidate with fourth most votes
votes4: Number of votes for candidate who came in fourth
name5: Name of candidate with fifth most votes
perc5: Percentage of vote for candidate who came in fifth
party5: Party of candidate with fifth most votes
votes5: Number of votes for candidate who came in fifth

Source

MSNBC.com, retrieved 2010-11-09.

Examples


library(ggplot2)

ggplot(senaterace10, aes(x = perc1)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Winning candidate vote percentage")

State-level data

Description

Information about each state collected from both the official US Census website and from various other sources.

Usage

state_stats

Format

A data frame with 51 observations on the following 24 variables.

state: State name.
abbr: State abbreviation (e.g. "MN").
fips: FIPS code.
pop2010: Population in 2010.
pop2000: Population in 2000.
homeownership: Home ownership rate.
multiunit: Percent of living units that are in multi-unit structures.
income: Average income per capita.
med_income: Median household income.
poverty: Poverty rate.
fed_spend: Federal spending per capita.
land_area: Land area.
smoke: Percent of population that smokes.
murder: Murders per 100,000 people.
robbery: Robberies per 100,000.
agg_assault: Aggravated assaults per 100,000.
larceny: Larcenies per 100,000.
motor_theft: Vehicle theft per 100,000.
soc_sec: Percent of individuals collecting social security.
nuclear: Percent of power coming from nuclear sources.
coal: Percent of power coming from coal sources.
tr_deaths: Traffic deaths per 100,000.
tr_deaths_no_alc: Traffic deaths per 100,000 where alcohol was not a factor.
unempl: Unemployment rate (February 2012, preliminary).

Source

Census Quick Facts (no longer available as of 2020), InfoChimps (also no longer available as of 2020), National Highway Traffic Safety Administration (tr_deaths, tr_deaths_no_alc), Bureau of Labor Statistics (unempl).

Examples


library(ggplot2)
library(dplyr)
library(maps)

states_selected <- state_stats |>
  mutate(region = tolower(state)) |>
  select(region, unempl, murder, nuclear)

states_map <- map_data("state") |>
  inner_join(states_selected)

# Unemployment map
ggplot(states_map, aes(map_id = region)) +
  geom_map(aes(fill = unempl), map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c() +
  labs(x = "", y = "", fill = "Unemployment\n(%)")

# Murder rate map
states_map |>
  filter(region != "district of columbia") |>
  ggplot(aes(map_id = region)) +
  geom_map(aes(fill = murder), map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c() +
  labs(x = "", y = "", fill = "Murders\nper 100k")

# Nuclear energy map
ggplot(states_map, aes(map_id = region)) +
  geom_map(aes(fill = nuclear), map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c() +
  labs(x = "", y = "", fill = "Nuclear energy\n(%)")

Convert state names to abbreviations

Description

Two utility functions. One converts state names to the state abbreviations, and the second does the opposite.

Usage

state2abbr(state)

Arguments

state

A vector of state names. Matching is slightly fuzzy, so some spelling and capitalization errors are tolerated.

Value

Returns a vector of the same length with the corresponding state names or abbreviations.

Author(s)

David Diez

Examples


state2abbr("Minnesota")

# Some spelling/capitalization errors okay
state2abbr("mINnesta")
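
To give a sense of how tolerant matching of this kind can work, here is a minimal sketch based on edit distance. It is not the implementation used by state2abbr(), whose matching rules are not described here, and the helper name nearest_state_abbr is hypothetical. It relies only on base R: the built-in state.name and state.abb vectors (50 states, no DC) and utils::adist().

```r
# Hypothetical helper: pick the state whose name has the smallest edit distance
nearest_state_abbr <- function(state) {
  # Rows index the inputs, columns index the 50 built-in state names
  d <- utils::adist(tolower(state), tolower(state.name))
  unname(state.abb[apply(d, 1, which.min)])
}

nearest_state_abbr(c("Minnesota", "mINnesta"))
```

This sketch always returns the closest state, even for input that resembles no state at all, so a real implementation would likely add a distance cutoff.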

Summary of many state-level variables

Description

Census data for the 50 states plus DC and Puerto Rico.

Usage

urban_owner

Format

A data frame with 52 observations on the following 28 variables.

state: State
total_housing_units_2000: Total housing units available in 2000.
total_housing_units_2010: Total housing units available in 2010.
pct_vacant: a numeric vector
occupied: Occupied.
pct_owner_occupied: a numeric vector
pop_st: a numeric vector
area_st: a numeric vector
pop_urban: a numeric vector
poppct_urban: a numeric vector
area_urban: a numeric vector
areapct_urban: a numeric vector
popden_urban: a numeric vector
pop_ua: a numeric vector
poppct_urban.1: a numeric vector
area_ua: a numeric vector
areapct_ua: a numeric vector
popden_ua: a numeric vector
pop_uc: a numeric vector
poppct_uc: a numeric vector
area_uc: a numeric vector
areapct_uc: a numeric vector
popden_uc: a numeric vector
pop_rural: a numeric vector
poppct_rural: a numeric vector
area_rural: a numeric vector
areapct_rural: a numeric vector
popden_rural: a numeric vector

Source

US Census.

Examples


urban_owner

State summary info

Description

Census info for the 50 US states plus DC.

Usage

urban_rural_pop

Format

A data frame with 51 observations on the following 5 variables.

state: US state.
urban_in: a numeric vector
urban_out: a numeric vector
rural_farm: a numeric vector
rural_nonfarm: a numeric vector

Source

US Census.

Examples


urban_rural_pop

US Crime Rates

Description

National data on the number of crimes committed in the US between 1960 and 2019.

Usage

us_crime_rates

Format

A data frame with 60 rows and 12 variables.

year: Year data was collected.
population: Population of the United States the year data was collected.
total: Total number of violent and property crimes committed.
violent: Total number of violent crimes committed.
property: Total number of property crimes committed.
murder: Number of murders committed. Counted in violent total.
forcible_rape: Number of forcible rapes committed. Counted in violent total.
robbery: Number of robberies committed. Counted in violent total.
aggravated_assault: Number of aggravated assaults committed. Counted in violent total.
burglary: Number of burglaries committed. Counted in property total.
larceny_theft: Number of larceny thefts committed. Counted in property total.
vehicle_theft: Number of vehicle thefts committed. Counted in property total.
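
The columns hold counts, and the notes above say which offenses roll up into the violent and property totals. The sketch below shows, on made-up numbers, how those totals can be checked and how a count can be converted to a rate per 100,000 residents. The toy data frame is hypothetical and its values are not taken from us_crime_rates.

```r
# Hypothetical single-year record with made-up counts
toy <- data.frame(
  year = 2000,
  population = 1000000,
  murder = 50, forcible_rape = 300, robbery = 1500, aggravated_assault = 3150,
  burglary = 7000, larceny_theft = 24000, vehicle_theft = 4000
)

# Totals implied by the "Counted in ... total" notes in the format list
toy$violent <- with(toy, murder + forcible_rape + robbery + aggravated_assault)
toy$property <- with(toy, burglary + larceny_theft + vehicle_theft)
toy$total <- toy$violent + toy$property

# Convert a count to a rate per 100,000 residents
toy$violent_rate <- toy$violent / toy$population * 100000
toy[, c("violent", "property", "total", "violent_rate")]
```

Because population grows over the 60 years covered, rates per 100,000 are usually easier to compare across years than raw counts.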

Source

Disaster Center

Examples


library(ggplot2)

ggplot(us_crime_rates, aes(x = population, y = total)) +
  geom_point() +
  labs(
    title = "Crimes V Population",
    x = "Population",
    y = "Total Number of Crimes"
  )

ggplot(us_crime_rates, aes(x = murder)) +
  geom_boxplot() +
  labs(
    title = "US Murders",
    subtitle = "1960 - 2019",
    x = "Number of Murders"
  ) +
  theme(axis.text.y = element_blank())

US Temperature Data

Description

A representative set of monitoring locations was taken from NOAA data that covered both years of interest (1950 and 2022). The locations were chosen to spread the measurements across the continental United States. Daily high and low temperatures are given for each of 24 weather stations.

Usage

us_temp

Format

A data frame with 17250 observations on the following 9 variables.

station: Station ID, measurements from 24 stations.
name: Name of the station.
latitude: Latitude of the station.
longitude: Longitude of the station.
elevation: Elevation of the station.
date: Date of observed temperature.
tmax: High temp for the observed day.
tmin: Low temp for the observed day.
year: Factor variable for year, levels: 1950 and 2022.

Details

Please keep in mind that these are two annual snapshots from a few dozen arbitrarily selected weather stations. A complete analysis would consider more than two years of data and a more precise random sample uniformly distributed across the United States.

Source

https://www.ncei.noaa.gov/cdo-web/, retrieved 2023-09-23.

Examples


library(ggplot2)
library(maps)
library(sf)
library(dplyr)

# Summarize temperature by station and year for plotting
summarized_temp <- us_temp |>
  group_by(station, year, latitude, longitude) |>
  summarize(tmax_med = median(tmax, na.rm = TRUE), .groups = "drop") |>
  mutate(plot_shift = ifelse(year == "1950", 0, 2))

# Make a map of the US as a baseline
usa <- st_as_sf(maps::map("state", fill = TRUE, plot = FALSE))

# Layer the US map with summarized temperatures
ggplot(data = usa) +
  geom_sf() +
  geom_point(
    data = summarized_temp,
    aes(x = longitude + plot_shift, y = latitude, fill = tmax_med, shape = year),
    color = "black", size = 3
  ) +
  scale_fill_gradient(high = "red", low = "yellow") +
  scale_shape_manual(values = c(21, 24)) +
  labs(
    title = "Median high temperature, 1950 and 2022",
    x = "Longitude",
    y = "Latitude",
    fill = "Median\nhigh temp",
    shape = "Year"
  )

American Time Survey 2009 - 2019

Description

Average Time Spent on Activities by Americans

Usage

us_time_survey

Format

A data frame with 11 rows and 8 variables.

year: Year data collected
household_activities: Average hours per day spent on household activities - travel included
eating_and_drinking: Average hours per day spent eating and drinking including travel.
leisure_and_sports: Average hours per day spent on leisure and sports - including travel.
sleeping: Average Hours spent sleeping.
caring_children: Average hours spent per day caring for and helping children under 18 years of age.
working_employed: Average hours spent working for those employed. (15 years and older)
working_employed_days_worked: Average hours per day spent working on days worked (15 years and older)

Source

US Bureau of Labor Statistics

Examples


library(ggplot2)
us_time_survey$year <- as.factor(us_time_survey$year)
ggplot(us_time_survey, aes(year, sleeping)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Year",
    y = "Average hours spent Sleeping",
    title = "US Average hours spent sleeping, 2009 - 2019"
  )


Predicting who would vote for NSA Mass Surveillance

Description

In 2013, the House of Representatives voted not to stop the National Security Agency's (NSA's) mass surveillance of phone behaviors. We look at two predictors for how a representative voted: their party and how much money they have received from the private defense industry.

Usage

vote_nsa

Format

A data frame with 434 observations on the following 5 variables.

name: Name of the Congressional representative.
party: The party of the representative: D for Democrat and R for Republican.
state: State for the representative.
money: Money received from the defense industry for their campaigns.
phone_spy_vote: Voting to rein in the phone dragnet or continue allowing mass surveillance.

Source

MapLight. Available at http://s3.documentcloud.org/documents/741074/amash-amendment-vote-maplight.pdf.

References

Kravets, D., 2020. Lawmakers Who Upheld NSA Phone Spying Received Double The Defense Industry Cash. WIRED. Available at https://www.wired.com/2013/07/money-nsa-vote/.

Examples


table(vote_nsa$party, vote_nsa$phone_spy_vote)
boxplot(vote_nsa$money / 1000 ~ vote_nsa$phone_spy_vote,
  ylab = "$1000s Received from Defense Industry"
)

US Voter Turnout Data.

Description

State-level data on federal elections held in November between 1980 and 2014.

Usage

voter_count

Format

A data frame with 936 rows and 7 variables.

year: Year election was held.
region: Specifies if data is state or national total.
voting_eligible_population: Number of citizens eligible to vote; does not count felons.
total_ballots_counted: Number of ballots cast.
highest_office: Number of ballots that contained a vote for the highest office of that election.
percent_total_ballots_counted: Overall voter turnout percentage.
percent_highest_office: Highest office voter turnout percentage.
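
The two percentage columns can be read as turnout rates relative to the voting-eligible population. The sketch below assumes they are computed as ballots divided by voting_eligible_population, times 100; that formula is an assumption rather than something stated by the source, and the numbers are made up.

```r
# Hypothetical record with made-up counts (not from voter_count)
toy <- data.frame(
  voting_eligible_population = 2000000,
  total_ballots_counted = 1200000,
  highest_office = 1180000
)

# Assumed definitions of the two turnout percentages
toy$percent_total_ballots_counted <-
  toy$total_ballots_counted / toy$voting_eligible_population * 100
toy$percent_highest_office <-
  toy$highest_office / toy$voting_eligible_population * 100

# Ballots without a counted vote for the highest office
toy$total_ballots_counted - toy$highest_office
toy
```

Under these assumed definitions the highest-office percentage can never exceed the total-ballots percentage, which is one way to sanity-check rows of the real data.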

Source

United States Election Project

Examples


library(ggplot2)

ggplot(voter_count, aes(x = percent_highest_office, y = percent_total_ballots_counted)) +
  geom_point() +
  labs(
    title = "Total Ballots V Highest Office",
    x = "Highest Office",
    y = "Total Ballots"
  )
