Title: | Datasets and Supplemental Functions from 'OpenIntro' Textbooks and Labs |
---|---|
Description: | Supplemental functions and data for 'OpenIntro' resources, which include open-source textbooks and resources for introductory statistics (<https://www.openintro.org/>). The package contains datasets used in our open-source textbooks along with custom plotting functions for reproducing book figures. Note that many functions and examples include color transparency; some plotting elements may not show up properly (or at all) when run in some versions of the Windows operating system. |
Authors: | Mine Çetinkaya-Rundel [aut, cre], David Diez [aut], Andrew Bray [aut], Albert Y. Kim [aut], Ben Baumer [aut], Chester Ismay [aut], Nick Paterno [aut], Christopher Barr [aut] |
Maintainer: | Mine Çetinkaya-Rundel <[email protected]> |
License: | GPL-3 |
Version: | 2.5.0 |
Built: | 2025-01-01 01:48:52 UTC |
Source: | https://github.com/openintrostat/openintro |
Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year.
absenteeism
A data frame with 146 observations on the following 5 variables.
Ethnicity, representing Aboriginal (A) or not (N).
Gender.
Age bucket.
Learner status, with average learner (AL) and slow learner (SL).
Number of days absent.
Venables WN, Ripley BD. 2002. Modern Applied Statistics with S. Fourth Edition. New York: Springer.
Data can also be found in the R MASS package under the dataset name quine.
library(ggplot2)

ggplot(absenteeism, aes(x = eth, y = days)) +
  geom_boxplot() +
  coord_flip()
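Since the source note says the same records are available in MASS as quine, one way to cross-check the dimensions is to load that copy directly. This is a minimal sketch, assuming the MASS package is installed (it is a recommended package shipped with most R installations). The capitalized column names used here (Eth, Days) are those of quine, not of absenteeism.

```r
# Load quine from MASS without attaching the whole package
data(quine, package = "MASS")

# Rows and columns: 146 students, 5 variables
dim(quine)
str(quine)

# Median number of days absent by ethnicity (A = Aboriginal, N = not),
# a numeric companion to the boxplot in the example above
med_days <- tapply(quine$Days, quine$Eth, median)
med_days
```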
Results from the US Census American Community Survey, 2012.
acs12
A data frame with 2000 observations on the following 13 variables.
Annual income.
Employment status.
Hours worked per week.
Race.
Age, in years.
Gender.
Whether the person is a U.S. citizen.
Travel time to work, in minutes.
Language spoken at home.
Whether the person is married.
Education level.
Whether the person is disabled.
The quarter of the year that the person was born, e.g. Jan thru Mar.
https://www.census.gov/programs-surveys/acs
library(dplyr)
library(ggplot2)
library(broom)

# employed only
acs12_emp <- acs12 |>
  filter(
    age >= 30, age <= 60,
    employment == "employed",
    income > 0
  )

# linear model
ggplot(acs12_emp, mapping = aes(x = age, y = income)) +
  geom_point() +
  geom_smooth(method = "lm")

lm(income ~ age, data = acs12_emp) |>
  tidy()

# log-transformed model
ggplot(acs12_emp, mapping = aes(x = age, y = log(income))) +
  geom_point() +
  geom_smooth(method = "lm")

lm(log(income) ~ age, data = acs12_emp) |>
  tidy()
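The log-transformed model in the example changes how the slope is read: a coefficient b on age is associated with multiplying income by a factor of exp(b) per additional year, i.e. a change of about 100 * (exp(b) - 1) percent. The sketch below illustrates that reading on simulated data rather than on acs12, so the sample size, the true slope of 0.01, and the noise level are assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated stand-in for an age/income sample (NOT the acs12 data)
age <- sample(30:60, 200, replace = TRUE)
income <- exp(10 + 0.01 * age + rnorm(200, sd = 0.5))

# Fit the model on the log scale
fit <- lm(log(income) ~ age)
b <- coef(fit)[["age"]]

# Convert the log-scale slope to a multiplicative and a percent effect
multiplier <- exp(b)
pct_change <- 100 * (multiplier - 1)

c(slope = b, multiplier = multiplier, pct_change = pct_change)
```

For small slopes the percent change is close to 100 * b, which is why log-scale coefficients are often read directly as approximate percentage effects.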
Age at first marriage of 5,534 US women who responded to the National Survey of Family Growth (NSFG) conducted by the CDC in the 2006-2010 cycle.
age_at_mar
A data frame with 5,534 observations and 1 variable.
Age at first marriage.
National Survey of Family Growth, 2006-2010 cycle, https://www.cdc.gov/nchs/nsfg/nsfg_2006_2010_puf.htm.
library(ggplot2)

ggplot(age_at_mar, mapping = aes(x = age)) +
  geom_histogram(binwidth = 3) +
  labs(
    x = "Age", y = "Count", title = "Age at first marriage, US Women",
    subtitle = "Source: National Survey of Family Growth Survey, 2006 - 2010"
  )
This dataset contains information from the Ames Assessor's Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010. Detailed variable descriptions are available in the documentation accompanying the source cited below.
ames
A tbl_df with 2930 rows and 82 variables:
Observation number.
Parcel identification number - can be used with city web site for parcel review.
Above grade (ground) living area square feet.
Sale price in USD.
Identifies the type of dwelling involved in the sale.
Identifies the general zoning classification of the sale.
Linear feet of street connected to property.
Lot size in square feet.
Type of road access to property.
Type of alley access to property.
General shape of property.
Flatness of the property.
Type of utilities available.
Lot configuration.
Slope of property.
Physical locations within Ames city limits (map available).
Proximity to various conditions.
Proximity to various conditions (if more than one is present).
Type of dwelling.
Style of dwelling.
Rates the overall material and finish of the house.
Rates the overall condition of the house.
Original construction date.
Remodel date (same as construction date if no remodeling or additions).
Type of roof.
Roof material.
Exterior covering on house.
Exterior covering on house (if more than one material).
Masonry veneer type.
Masonry veneer area in square feet.
Evaluates the quality of the material on the exterior.
Evaluates the present condition of the material on the exterior.
Type of foundation.
Evaluates the height of the basement.
Evaluates the general condition of the basement.
Refers to walkout or garden level walls.
Rating of basement finished area.
Type 1 finished square feet.
Rating of basement finished area (if multiple types).
Type 2 finished square feet.
Unfinished square feet of basement area.
Total square feet of basement area.
Type of heating.
Heating quality and condition.
Central air conditioning.
Electrical system.
First Floor square feet.
Second floor square feet.
Low quality finished square feet (all floors).
Basement full bathrooms.
Basement half bathrooms.
Full bathrooms above grade.
Half baths above grade.
Bedrooms above grade (does NOT include basement bedrooms).
Kitchens above grade.
Kitchen quality.
Total rooms above grade (does not include bathrooms).
Home functionality (Assume typical unless deductions are warranted).
Number of fireplaces.
Fireplace quality.
Garage location.
Year garage was built.
Interior finish of the garage.
Size of garage in car capacity.
Size of garage in square feet.
Garage quality.
Garage condition.
Paved driveway.
Wood deck area in square feet.
Open porch area in square feet.
Enclosed porch area in square feet.
Three season porch area in square feet.
Screen porch area in square feet.
Pool area in square feet.
Pool quality.
Fence quality.
Miscellaneous feature not covered in other categories.
Dollar value of miscellaneous feature.
Month Sold (MM).
Year Sold (YYYY).
Type of sale.
Condition of sale.
De Cock, Dean. "Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project." Journal of Statistics Education 19.3 (2011).
This dataset is simulated but contains realistic occurrences of acute myocardial infarction (AMI) in New York City.
ami_occurrences
A data frame with 365 observations on the following variable.
Number of daily occurrences of heart attacks in NY City.
library(ggplot2)

ggplot(ami_occurrences, mapping = aes(x = ami)) +
  geom_bar() +
  labs(
    x = "Acute Myocardial Infarction events",
    y = "Count",
    title = "Acute Myocardial Infarction events in NYC"
  )
Pre-existing medical conditions of 92 children involved in a study on the optimal duration of antibiotic use in treatment of tracheitis, which is an upper respiratory infection.
antibiotics
A data frame with 92 observations, each representing a child, on the following variable.
Pre-existing medical condition.
library(ggplot2)

ggplot(antibiotics, aes(x = condition)) +
  geom_bar() +
  labs(
    x = "Condition", y = "Count",
    title = "Pre-existing conditions of children",
    subtitle = "in antibiotic use study"
  ) +
  coord_flip()
Arbuthnot's data describes male and female christenings (births) for London from 1629-1710.
arbuthnot
A tbl_df with 82 rows and 3 variables:
year, ranging from 1629 to 1710
number of male christenings (births)
number of female christenings (births)
John Arbuthnot (1710) used these time series data to carry out the first known significance test. During every one of the 82 years, there were more male christenings than female christenings. As Arbuthnot wondered, we might also wonder if this could be due to chance, or whether it meant the birth ratio was not actually 1:1.
These data are excerpted from the Arbuthnot dataset in the HistData package.
library(ggplot2)
library(tidyr)

# All births
ggplot(arbuthnot, aes(x = year, y = boys + girls, group = 1)) +
  geom_line()

# Boys and girls
arbuthnot |>
  pivot_longer(cols = -year, names_to = "sex", values_to = "n") |>
  ggplot(aes(x = year, y = n, color = sex, group = sex)) +
  geom_line()
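The chance argument described above can be made concrete with a sign test. If boys and girls were equally likely to be in the majority in any given year, and years were independent, the probability of 82 male-majority years out of 82 would be 0.5^82. The sketch below computes that value directly and again with base R's binom.test(). Treating each year as an independent fair coin flip is the simplifying assumption here, not a claim about Arbuthnot's exact calculation.

```r
n_years <- 82

# Probability that every one of the 82 years shows more male christenings
# if each year were a fair coin flip
p_all_male <- 0.5^n_years
p_all_male

# The same quantity as a one-sided exact binomial (sign) test
sign_test <- binom.test(
  x = n_years, n = n_years,
  p = 0.5, alternative = "greater"
)
sign_test$p.value
```

A probability this small is hard to reconcile with a 1:1 ratio, which is the reasoning behind questioning the chance explanation.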
Published results used RNA-Seq to investigate how cold responsiveness differs in two populations of A. arenosa: TBG (collected from Triberg, Germany) and KA (collected from Kasparstein, Austria). Each row corresponds to a gene; the first column contains the gene name; other columns correspond to expression measured in a plant sample. Three plants of each population were exposed to cold (vernalized, denoted by v), and three were not (non-vernalized, denoted by nv). Expression was measured in gene counts (i.e. the number of RNA transcripts present in a sample); the data were then normalized to allow comparison between samples.
arenosa
A tibble with 1088 rows and 13 variables:
gene.name: a character vector
ka.nv.1: a numeric vector
ka.nv.2: a numeric vector
ka.nv.3: a numeric vector
ka.v.1: a numeric vector
ka.v.2: a numeric vector
ka.v.3: a numeric vector
tbg.nv.1: a numeric vector
tbg.nv.2: a numeric vector
tbg.nv.3: a numeric vector
tbg.v.1: a numeric vector
tbg.v.2: a numeric vector
tbg.v.3: a numeric vector
K Bomblies Harvard University lab.
Pierre Baduel, Brian Arnold, Cara M. Weisman, Ben Hunter, Kirsten Bomblies, Habitat-Associated Life History and Stress-Tolerance Variation in Arabidopsis arenosa, Plant Physiology, Volume 171, Issue 1, May 2016, Pages 437–451, https://doi.org/10.1104/pp.15.01875
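The twelve expression columns encode the experimental design in their names as population.treatment.replicate. As a sketch of how that design could be recovered for plotting or modelling, the code below splits the column names listed above into a small design table. The names design, population, vernalized, and replicate are illustrative choices, not objects provided by the package.

```r
# Sample column names as listed in the format section
sample_cols <- c(
  paste0("ka.nv.", 1:3), paste0("ka.v.", 1:3),
  paste0("tbg.nv.", 1:3), paste0("tbg.v.", 1:3)
)

# Split each name on the literal "." into its three components
parts <- do.call(rbind, strsplit(sample_cols, ".", fixed = TRUE))

# One row per plant sample (hypothetical helper table)
design <- data.frame(
  sample = sample_cols,
  population = toupper(parts[, 1]), # KA or TBG
  vernalized = parts[, 2] == "v",   # TRUE if exposed to cold
  replicate = as.integer(parts[, 3])
)

# Three plants per population and treatment
table(design$population, design$vernalized)
```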
Similar to lines, this function will include endpoints that are solid points, open points, or arrows (mix-and-match ready).
ArrowLines(
  x,
  y,
  lty = 1,
  lwd = 2.5,
  col = 1,
  length = 0.1,
  af = 3,
  cex.pch = 1.2,
  ends = c("a", "a"),
  ...
)
x |
A vector of the x-coordinates of the line to be drawn. |
y |
A vector of the y-coordinates of the line to be drawn. This vector
should have the same length as that of x. |
lty |
The line type. |
lwd |
The line width. |
col |
The line and endpoint color. |
length |
If an end point is an arrow, then this specifies the sizing of
the arrow. See the |
af |
A tuning parameter for creating the arrow. Usually the default
( |
cex.pch |
Plotting character size (if open or closed point at the end). |
ends |
A character vector of length 2, where the first value
corresponds to the start of the line and the second to the end of the line.
A value of |
... |
All additional arguments are passed to the
|
David Diez
CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- c(-2, 0, 2, 4)
y <- c(0, 3, 0, 3)
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
points(x, y, col = COL[1], pch = 19, cex = 1.2)

CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- c(-3, 0, 1, 3)
y <- c(2, 1, -2, 1)
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
points(x, y, col = COL[1], pch = 19, cex = 1.2)

CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- seq(-2, 2, 0.01)
y <- x^2 - 3
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
x <- seq(-2, 2, 1)
y <- x^2 - 3
points(x, y, col = COL[1], pch = 19, cex = 1.2)
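For readers who want to see the general idea without the package, a polyline with arrowheads at both ends can be approximated in base graphics by drawing the path with lines() and then adding arrows() on the first and last segments. The helper arrow_polyline() below is a hypothetical sketch written for this illustration. It is not how ArrowLines is implemented and it does not support open or closed point endpoints.

```r
# Hypothetical base-graphics helper: a polyline with an arrowhead at each end
arrow_polyline <- function(x, y, length = 0.1, ...) {
  stopifnot(length(x) == length(y), length(x) >= 2)
  n <- length(x)

  # Draw the full path
  lines(x, y, ...)

  # Arrowhead at the start: draw the first segment pointing back to (x[1], y[1])
  arrows(x[2], y[2], x[1], y[1], length = length, ...)

  # Arrowhead at the end: draw the last segment pointing to (x[n], y[n])
  arrows(x[n - 1], y[n - 1], x[n], y[n], length = length, ...)

  invisible(NULL)
}

# Empty plotting region, then the same path as the first example above
plot(NA, xlim = c(-6, 6), ylim = c(-6, 6), xlab = "x", ylab = "y")
arrow_polyline(c(-2, 0, 2, 4), c(0, 3, 0, 3), col = "steelblue", lwd = 2)
```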
In this experiment, each individual was asked to be a seller of an iPod (a product commonly used to store music before smartphones). The participant received $10 + 5% of the sale price for participating. The iPod they were selling had frozen twice in the past inexplicably but otherwise worked fine. The prospective buyer starts off and then asks one of three final questions, depending on the seller's treatment group.
ask
A data frame with 219 observations on the following 3 variables.
The type of question: general, pos_assumption, and neg_assumption.
The question corresponding to the question_class.
The classified response from the seller, either disclose or hide.
The three possible questions:
General: What can you tell me about it?
Positive Assumption: It doesn't have any problems, does it?
Negative Assumption: What problems does it have?
The outcome variable is whether or not the participant discloses or hides the problem with the iPod.
Minson JA, Ruedy NE, Schweitzer ME. There is such a thing as a stupid question: Question disclosure in strategic communication.
library(dplyr)
library(ggplot2)

# Distribution of responses based on question type
ask |>
  count(question_class, response)

# Visualize relative frequencies of responses based on question type
ggplot(ask, aes(x = question_class, fill = response)) +
  geom_bar(position = "fill")

# Perform chi-square test
(test <- chisq.test(table(ask$question_class, ask$response)))

# Check the test's assumption around sufficient expected observations
# per table cell.
test$expected
Simulated dataset.
association
A data frame with 121 observations on the following 15 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
library(ggplot2)

ggplot(association, aes(x = x1, y = y1)) +
  geom_point()

ggplot(association, aes(x = x2, y = y4)) +
  geom_point()

ggplot(association, aes(x = x3, y = y7)) +
  geom_point()
Eye colors of male and female partners.
assortative_mating
A data frame with 204 observations on the following 2 variables.
a factor with levels blue, brown, and green
a factor with levels blue, brown, and green
B. Laeng et al. Why do blue-eyed men prefer women with the same eye color? In: Behavioral Ecology and Sociobiology 61.3 (2007), pp. 371-384.
data(assortative_mating)
table(assortative_mating)
A comparison of cardiovascular problems for Rosiglitazone and Pioglitazone.
avandia
A data frame with 227571 observations on the following 2 variables.
a factor with levels Pioglitazone and Rosiglitazone
a factor with levels no and yes
D.J. Graham et al. Risk of acute myocardial infarction, stroke, heart failure, and death in elderly Medicare patients treated with rosiglitazone or pioglitazone. In: JAMA 304.4 (2010), p. 411. issn: 0098-7484.
table(avandia)
Convert and simplify axis labels that are in US Dollars.
AxisInDollars(side, at, include.symbol = TRUE, simplify = TRUE, ...)
side |
An integer specifying which side of the plot the axis is to be drawn on. The axis is placed as follows: 1 = below, 2 = left, 3 = above and 4 = right. |
at |
The points at which tick-marks are to be drawn. |
include.symbol |
Whether to include a dollar or percent symbol, where the symbol chosen depends on the function. |
simplify |
For dollars, simplify the amount to use abbreviations of
|
... |
Arguments passed to |
The numeric locations on the axis scale at which tick marks were drawn when the plot was first drawn.
David Diez
buildAxis
AxisInDollars
AxisInPercent
x <- sample(50e6, 100)
hist(x, axes = FALSE)
AxisInDollars(1, pretty(x))
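To make the idea of simplified dollar labels concrete, the sketch below builds abbreviated labels by hand and passes them to base R's axis(). The helper format_dollars() and its k/m/b suffixes are assumptions made for this illustration. They are not taken from the package and may differ from what AxisInDollars actually prints. The helper also assumes non-negative amounts.

```r
# Hypothetical helper: abbreviate non-negative dollar amounts with k / m / b
format_dollars <- function(x, include_symbol = TRUE) {
  suffix <- c("", "k", "m", "b")

  # Number of whole groups of three digits, capped at billions
  grp <- ifelse(x == 0, 0, pmin(floor(log10(abs(x)) / 3), 3))
  grp <- pmax(grp, 0)

  # Scale the amount and drop unneeded trailing zeros
  labels <- paste0(sprintf("%g", x / 1000^grp), suffix[grp + 1])
  if (include_symbol) paste0("$", labels) else labels
}

format_dollars(c(0, 1500, 2.5e6, 5e7))

# Use the labels on a histogram's horizontal axis
set.seed(42)
x <- sample(50e6, 100)
hist(x, axes = FALSE, main = "", xlab = "Amount")
ticks <- pretty(x)
axis(1, at = ticks, labels = format_dollars(ticks))
axis(2)
```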
Convert and simplify axis labels that are in percentages.
AxisInPercent(side, at, include.symbol = TRUE, simplify = TRUE, ...)
side |
An integer specifying which side of the plot the axis is to be drawn on. The axis is placed as follows: 1 = below, 2 = left, 3 = above and 4 = right. |
at |
The points at which tick-marks are to be drawn. |
include.symbol |
Whether to include a dollar or percent symbol, where the symbol chosen depends on the function. |
simplify |
For dollars, simplify the amount to use abbreviations of
|
... |
Arguments passed to |
The numeric locations on the axis scale at which tick marks were drawn when the plot was first drawn.
David Diez
buildAxis
AxisInDollars
AxisInPercent
x <- sample(50e6, 100)
hist(x, axes = FALSE)
AxisInDollars(1, pretty(x))
The Child Health and Development Studies investigate a range of topics. One
study, in particular, considered all pregnancies between 1960 and 1967 among
women in the Kaiser Foundation Health Plan in the San Francisco East Bay
area. We do not have ideal provenance for these data. For a better documented
and more recent dataset on a similar topic with similar variables,
see births14. Additionally, the Gestation dataset in the mosaicData package also contains similar data.
babies
A data frame with 1236 rows and 8 variables:
id number
birthweight, in ounces
length of gestation, in days
binary indicator for a first pregnancy (0 = first pregnancy)
mother's age in years
mother's height in inches
mother's weight in pounds
binary indicator for whether the mother smokes
These data come from Child Health and Development Studies.
Crawling age of babies along with the average outdoor temperature at 6 months of age.
babies_crawl
A data frame with 12 observations on the following 5 variables.
A factor with levels corresponding to months
a numeric vector
a numeric vector
a numeric vector
a numeric vector
J.B. Benson. Season of birth and onset of locomotion: Theoretical and methodological implications. In: Infant behavior and development 16.1 (1993), pp. 69-81. issn: 0163-6383.
library(ggplot2)

ggplot(babies_crawl, aes(x = temperature, y = avg_crawling_age)) +
  geom_point() +
  labs(x = "Temperature", y = "Average crawling age")
Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer.
bac
A data frame with 16 observations on the following 3 variables.
a numeric vector
a numeric vector
a numeric vector
J. Malkevitch and L.M. Lesser. For All Practical Purposes: Mathematical Literacy in Today's World. WH Freeman & Co, 2008. The data origin is given in the Electronic Encyclopedia of Statistical Examples and Exercises, 1992.
library(ggplot2)

ggplot(bac, aes(x = beers, y = bac)) +
  geom_point() +
  labs(x = "Number of beers", y = "Blood alcohol content")
A simulated dataset on lifespan of ball bearings.
ball_bearing
A data frame with 75 observations on the following variable.
Lifespan of ball bearings (in hours).
Simulated data.
library(ggplot2)

ggplot(ball_bearing, aes(x = life_span)) +
  geom_histogram(binwidth = 1)

qqnorm(ball_bearing$life_span)
Body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender, are given for 507 physically active individuals - 247 men and 260 women. These data can be used to provide statistics students practice in the art of data analysis. Such analyses range from simple descriptive displays to more complicated multivariate analyses such as multiple regression and discriminant analysis.
bdims
A data frame with 507 observations on the following 25 variables.
A numerical vector, respondent's biacromial diameter in centimeters.
A numerical vector, respondent's biiliac diameter (pelvic breadth) in centimeters.
A numerical vector, respondent's bitrochanteric diameter in centimeters.
A numerical vector, respondent's chest depth in centimeters, measured between spine and sternum at nipple level, mid-expiration.
A numerical vector, respondent's chest diameter in centimeters, measured at nipple level, mid-expiration.
A numerical vector, respondent's elbow diameter in centimeters, measured as sum of two elbows.
A numerical vector, respondent's wrist diameter in centimeters, measured as sum of two wrists.
A numerical vector, respondent's knee diameter in centimeters, measured as sum of two knees.
A numerical vector, respondent's ankle diameter in centimeters, measured as sum of two ankles.
A numerical vector, respondent's shoulder girth in centimeters, measured over deltoid muscles.
A numerical vector, respondent's chest girth in centimeters, measured at nipple line in males and just above breast tissue in females, mid-expiration.
A numerical vector, respondent's waist girth in centimeters, measured at the narrowest part of torso below the rib cage as average of contracted and relaxed position.
A numerical vector, respondent's navel (abdominal) girth in centimeters, measured at umbilicus and iliac crest using iliac crest as a landmark.
A numerical vector, respondent's hip girth in centimeters, measured at the level of bitrochanteric diameter.
A numerical vector, respondent's thigh girth in centimeters, measured below gluteal fold as the average of right and left girths.
A numerical vector, respondent's bicep girth in centimeters, measured when flexed as the average of right and left girths.
A numerical vector, respondent's forearm girth in centimeters, measured when extended, palm up as the average of right and left girths.
A numerical vector, respondent's knee girth in centimeters, measured as average of right and left girths.
A numerical vector, respondent's calf maximum girth in centimeters, measured as average of right and left girths.
A numerical vector, respondent's ankle minimum girth in centimeters, measured as average of right and left girths.
A numerical vector, respondent's wrist minimum girth in centimeters, measured as average of right and left girths.
A numerical vector, respondent's age in years.
A numerical vector, respondent's weight in kilograms.
A numerical vector, respondent's height in centimeters.
A categorical vector, 1 if the respondent is male, 0 if female.
Heinz G, Peterson LJ, Johnson RW, Kerk CJ. 2003. Exploring Relationships in Body Dimensions. Journal of Statistics Education 11(2).
library(ggplot2)

ggplot(bdims, aes(x = hgt)) +
  geom_histogram(binwidth = 5)

ggplot(bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  labs(x = "Height", y = "Weight")

ggplot(bdims, aes(x = hgt, y = sho_gi)) +
  geom_point() +
  labs(x = "Height", y = "Shoulder girth")

ggplot(bdims, aes(x = hgt, y = hip_gi)) +
  geom_point() +
  labs(x = "Height", y = "Hip girth")
Overlays a colored rectangle over the entire plotting region.
BG(col = openintro::COL[5, 9])
col |
Color to overlay. |
Test <- function(col) {
  plot(1:7,
    col = COL[1:7], pch = 19, cex = 5,
    xlim = c(0, 8),
    ylim = c(0, 9)
  )
  BG(col)
  points(2:8, col = COL[1:7], pch = 19, cex = 5)
  text(2, 6, "Correct Color")
  text(6, 2, "Affected Color")
}

# Works well since black color almost fully transparent
Test(COL[5, 9])

# Works less well since transparency isn't as significant
Test(COL[5, 6])

# Pretty ugly due to overlay
Test(COL[5, 3])

# Basically useless due to heavy color gradient
Test(COL[4, 2])
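The effect of overlaying a rectangle on the whole plotting region can be reproduced in base graphics by reading the region's limits from par("usr") and filling them with rect(). The function overlay_region() below is a hypothetical sketch of that idea, not the package's implementation of BG, and it assumes linear (non-log) axes.

```r
# Hypothetical helper: cover the current plotting region with one color
overlay_region <- function(col) {
  usr <- par("usr") # c(x_left, x_right, y_bottom, y_top) in user coordinates
  rect(usr[1], usr[3], usr[2], usr[4], col = col, border = NA)
  invisible(usr)
}

# Points drawn before the overlay are tinted, points drawn after are not
plot(1:7, pch = 19, cex = 3)
overlay_region(rgb(0, 0, 0, alpha = 0.1)) # nearly transparent black
points(2:7, 1:6, pch = 19, cex = 3)
```

As in the Test() examples above, the more opaque the overlay color, the more it distorts anything drawn underneath it.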
On March 31, 2021, Pfizer and BioNTech announced that "in a Phase 3 trial in adolescents 12 to 15 years of age with or without prior evidence of SARS-CoV-2 infection, the Pfizer-BioNTech COVID-19 vaccine BNT162b2 demonstrated 100% efficacy and robust antibody responses, exceeding those recorded earlier in vaccinated participants aged 16 to 25 years old, and was well tolerated." These results are from a Phase 3 trial in 2,260 adolescents 12 to 15 years of age in the United States. In the trial, 18 cases of COVID-19 were observed in the placebo group (n = 1,129) versus none in the vaccinated group (n = 1,131).
biontech_adolescents
A data frame with 2260 observations on the following 2 variables.
Study group: vaccine (Pfizer-BioNTech COVID-19 vaccine administered) or placebo.
Study outcome: COVID-19 or no COVID-19.
"Pfizer-Biontech Announce Positive Topline Results Of Pivotal Covid-19 Vaccine Study In Adolescents". March 31, 2021. (Retrieved April 25, 2021.)
library(dplyr)
library(ggplot2)

biontech_adolescents |>
  count(group, outcome)

ggplot(biontech_adolescents, aes(y = group, fill = outcome)) +
  geom_bar()
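The reported 100% efficacy can be reproduced from the counts in the description. A common point estimate of vaccine efficacy is one minus the ratio of the attack rate in the vaccine group to the attack rate in the placebo group. The sketch below uses only the counts quoted above. Applying that estimator and Fisher's exact test is an illustrative choice, not necessarily the analysis used in the trial.

```r
# Counts quoted in the description
cases <- c(vaccine = 0, placebo = 18)
n <- c(vaccine = 1131, placebo = 1129)

# Attack rate per group and the efficacy point estimate
risk <- cases / n
efficacy <- 1 - risk[["vaccine"]] / risk[["placebo"]]
efficacy

# 2 x 2 table of group by outcome, rebuilt from the counts
tab <- matrix(
  c(cases, n - cases),
  nrow = 2,
  dimnames = list(
    group = names(cases),
    outcome = c("COVID-19", "no COVID-19")
  )
)
tab

# Exact test of independence between group and outcome
fisher.test(tab)$p.value
```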
A collection of all collisions between aircraft and wildlife that were reported to the US Federal Aviation Administration between 1990 and 1997, with details on the circumstances of the collision.
birds
A data frame with 19302 observations on the following 17 variables.
Three letter identification code for the operator (carrier) of the aircraft.
Name of the aircraft operator.
Make and model of aircraft.
Verbal remarks regarding the collision.
Phase of the flight during which the collision occurred: Approach, Climb, Descent, En Route, Landing Roll, Parked, Take-off run, Taxi.
Mass of the aircraft classified as 2250 kg or less (1), 2251-5700 kg (2), 5701-27000 kg (3), 27001-272000 kg (4), above 272000 kg (5).
Number of engines on the aircraft.
Date of the collision (MM/DD/YYYY).
Light conditions: Dawn, Day, Dusk, Night.
Two letter abbreviation of the US state in which the collision occurred.
Feet above ground level.
Knots (indicated air speed).
Effect on flight: Aborted Take-off, Engine Shut Down, None, Other, Precautionary Landing.
Type of cloud cover, if any: No Cloud, Overcast, Some Cloud.
Common name for bird or other wildlife.
Number of birds/wildlife seen by pilot: 1, 2-10, 11-100, Over 100.
Number of birds/wildlife struck: 0, 1, 2-10, 11-100, Over 100.
The FAA National Wildlife Strike Database contains strike reports that are voluntarily reported to the FAA by pilots, airlines, airports and others. Current research indicates that only about 20% of strikes are reported. Wildlife strike reporting is not uniform, as some organizations have more robust voluntary reporting procedures. Because of variations in reporting, users are cautioned that comparisons between individual airports or airlines may be misleading.
Aircraft Wildlife Strike Data: Search Tool - FAA Wildlife Strike Database. Available at https://datahub.transportation.gov/Aviation/Aircraft-Wildlife-Strike-Data-Search-Tool-FAA-Wild/jhay-dgxy. Retrieval date: Feb 4, 2012.
library(dplyr)
library(ggplot2)
library(forcats)
library(tidyr)

# Phase of the flight during which the collision occurred, tabular
birds |>
  count(phase_of_flt, sort = TRUE)

# Phase of the flight during which the collision occurred, barplot
ggplot(birds, aes(y = fct_infreq(phase_of_flt))) +
  geom_bar() +
  labs(y = "Phase of flight")

# Height summary statistics
summary(birds$height)

# Phase of flight vs. effect of crash
birds |>
  drop_na(phase_of_flt, effect) |>
  ggplot(aes(y = phase_of_flt, fill = effect)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion", y = "Phase of flight", fill = "Effect")
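The aircraft mass classes described in the variable list (2250 kg or less, 2251-5700 kg, and so on) follow fixed cut points, so a raw mass in kilograms can be mapped to the same five codes with base R's cut(). The vector mass_kg and the result mass_class below are made-up illustrations. The dataset itself stores only the class codes.

```r
# Made-up aircraft masses in kilograms, chosen to sit on the class boundaries
mass_kg <- c(2250, 2251, 5700, 27000, 27001, 300000)

# Right-closed intervals: (-Inf, 2250], (2250, 5700], ..., (272000, Inf)
mass_class <- cut(
  mass_kg,
  breaks = c(-Inf, 2250, 5700, 27000, 272000, Inf),
  labels = 1:5
)

data.frame(mass_kg, mass_class)
```

Because cut() closes intervals on the right by default, a mass of exactly 2250 kg falls in class 1 and 2251 kg in class 2, matching the ranges given in the description.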
Data on a random sample of 100 births for babies in North Carolina where the mother was not a smoker and another 50 where the mother was a smoker.
births
A data frame with 150 observations on the following 9 variables.
Father's age.
Mother's age.
Weeks at which the mother gave birth.
Indicates whether the baby was premature or not.
Number of hospital visits.
Weight gained by mother.
Birth weight of the baby.
Gender of the baby.
Whether or not the mother was a smoker.
Birth records released by North Carolina in 2004.
We do not have ideal provenance for these data. For a better documented and more recent dataset on a similar topic with similar variables, see births14. Additionally, ncbirths also contains similar data.
library(ggplot2)

ggplot(births, aes(x = smoke, y = weight)) +
  geom_boxplot()
Every year, the US releases to the public a large dataset containing information on births recorded in the country. This dataset has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from the dataset released in 2014.
births14
A data frame with 1,000 observations on the following 13 variables.
Father's age in years.
Mother's age in years.
Maturity status of mother.
Length of pregnancy in weeks.
Whether the birth was classified as premature (premie) or full-term.
Number of hospital visits during pregnancy.
Weight gained by mother during pregnancy in pounds.
Weight of the baby at birth in pounds.
Whether baby was classified as low birthweight (low) or not (not low).
Sex of the baby, female or male.
Status of the mother as a nonsmoker or a smoker.
Whether mother is married or not married at birth.
Whether mom is white or not white.
United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. Natality Detail File, 2014 United States. Inter-university Consortium for Political and Social Research, 2016-10-07. doi:10.3886/ICPSR36461.v1.
library(ggplot2)

ggplot(births14, aes(x = habit, y = weight)) +
  geom_boxplot() +
  labs(x = "Smoking status of mother", y = "Birth weight of baby (in lbs)")

ggplot(births14, aes(x = whitemom, y = visits)) +
  geom_boxplot() +
  labs(x = "Mother's race", y = "Number of doctor visits during pregnancy")

ggplot(births14, aes(x = mature, y = gained)) +
  geom_boxplot() +
  labs(x = "Mother's age category", y = "Weight gained during pregnancy")
Employee-generated anonymous survey of salary information.
blizzard_salary
A data frame with 466 rows and 9 variables.
Time data was entered
Specifies employment status.
Current job title.
Current salary (in USD).
Frequency with levels year, hour, week.
Raise given July 2020.
Other information submitted by employee.
Current office of employment.
Most recent review performance rating.
Bloomberg - Blizzard workers share salaries in revolt over wage disparities.
library(ggplot2)
library(dplyr)

plot_data <- blizzard_salary |>
  mutate(annual_salary = case_when(
    salary_type == "week" ~ current_salary * 52,
    salary_type == "hour" ~ current_salary * 40 * 52,
    TRUE ~ current_salary
  ))

ggplot(plot_data, aes(annual_salary)) +
  geom_histogram(binwidth = 25000, color = "white") +
  labs(
    title = "Current Salary of Blizzard Employees",
    x = "Salary",
    y = "Number of Employees"
  )
Simulated dataset.
books
A data frame with 95 observations on the following 2 variables.
a factor with levels fiction and nonfiction
a factor with levels hardcover and paperback
table(books)
An alternative to boxplot. Equations are not accepted. Instead, the second argument, fact, is used to split the data.
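As a rough base-R illustration of grouping by a second vector rather than by a formula, the sketch below splits a numeric vector with split() and summarises each group with boxplot(..., plot = FALSE). The vectors num_char and spam are made-up stand-ins (not the package's email data), and this is not a description of how boxPlot is implemented internally.

```r
# Made-up stand-in data (hypothetical; not the package's email dataset).
num_char <- c(2, 5, 9, 14, 21, 3, 4, 6, 8, 40)
spam <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Split the numeric vector by the grouping vector, the way a second
# argument (rather than a formula) would define the groups.
groups <- split(num_char, spam)

# Compute the box plot statistics for each group without drawing anything.
stats <- boxplot(groups, plot = FALSE)

stats$names # group labels taken from the grouping vector
stats$stats # one column of five box plot statistics per group
```

The same grouping idea applies whether the second vector is a character vector or a factor, since split() accepts both.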
boxPlot(
  x,
  fact = NULL,
  horiz = FALSE,
  width = 2/3,
  lwd = 1,
  lcol = "black",
  medianLwd = 2,
  pch = 20,
  pchCex = 1.8,
  col = grDevices::rgb(0, 0, 0, 0.25),
  add = FALSE,
  key = NULL,
  axes = TRUE,
  xlab = "",
  ylab = "",
  xlim = NULL,
  ylim = NULL,
  na.rm = TRUE,
  ...
)
x |
A numerical vector. |
fact |
A character or factor vector defining the grouping for side-by-side box plots. |
horiz |
If |
width |
The width of the boxes in the plot. Value between |
lwd |
Width of lines used in box and whiskers. |
lcol |
Color of the box, median, and whiskers. |
medianLwd |
Width of the line marking the median. |
pch |
Plotting character of outliers. |
pchCex |
Size of outlier character. |
col |
Color of outliers. |
add |
If |
key |
The order in which to display the side-by-side boxplots. If
locations are specified in |
axes |
Whether to plot the axes. |
xlab |
Label for the x axis. |
ylab |
Label for the y axis. |
xlim |
Limits for the x axis. |
ylim |
Limits for the y axis. |
na.rm |
Indicate whether |
... |
Additional arguments to plot. |
David Diez
histPlot, dotPlot, densityPlot
# univariate
boxPlot(email$num_char, ylab = "Number of characters in emails")

# bivariate
boxPlot(email$num_char, email$spam,
  xlab = "Spam",
  ylab = "Number of characters in emails"
)

# faded outliers
boxPlot(email$num_char, email$spam,
  xlab = "Spam",
  ylab = "Number of characters in emails",
  col = fadeColor("black", 18)
)

# horizontal plots
boxPlot(email$num_char, email$spam,
  horiz = TRUE,
  xlab = "Spam",
  ylab = "Number of characters in emails",
  col = fadeColor("black", 18)
)

# bivariate relationships where categorical data have more than 2 levels
boxPlot(email$num_char, email$image,
  horiz = TRUE,
  xlab = "Number of attached images",
  ylab = "Number of characters in emails",
  col = fadeColor("black", 18)
)

# key can be used to restrict to only the desired groups
boxPlot(email$num_char, email$image,
  horiz = TRUE, key = c(0, 1, 2),
  xlab = "Number of attached images (limited to 0, 1, 2)",
  ylab = "Number of characters in emails",
  col = fadeColor("black", 18)
)

# combine boxPlot and dotPlot
boxPlot(tips$tip, tips$day,
  horiz = TRUE,
  key = c("Tuesday", "Friday")
)
dotPlot(tips$tip, tips$day,
  add = TRUE,
  at = 1:2 + 0.05,
  key = c("Tuesday", "Friday")
)

# adding a box
boxPlot(email$num_char[email$spam == 0], xlim = c(0, 3))
boxPlot(email$num_char[email$spam == 1], add = 2, axes = FALSE)
axis(1, at = 1:2, labels = c(0, 1))

boxPlot(email$num_char[email$spam == 0], ylim = c(0, 3), horiz = TRUE)
boxPlot(email$num_char[email$spam == 1], add = 2, horiz = TRUE, axes = FALSE)
axis(2, at = 1:2, labels = c(0, 1))
This function is not yet very flexible.
Braces(x, y, face.radians = 0, long = 1, short = 0.2, ...)
x |
x-coordinate of the center of the braces. |
y |
y-coordinate of the center of the braces. |
face.radians |
Radians of where the braces should face. For example,
the default with |
long |
The units for the long dimension of the braces. |
short |
The units for the short dimension of the braces. This must be less than or equal to half of the long dimension. |
... |
Arguments passed to |
David Diez
plot(0:1, 0:1, type = "n")
Braces(0.5, 0.5, face.radians = 3 * pi / 2)
The function buildAxis is built to provide more control of the number of labels on the axis. This function is still under development.
buildAxis(side, limits, n, nMin = 2, nMax = 10, extend = 2, eps = 10^-12, ...)
side |
The side of the plot where to add the axis. |
limits |
Either lower and upper limits on the axis or a dataset. |
n |
The preferred number of axis labels. |
nMin |
The minimum number of axis labels. |
nMax |
The maximum number of axis labels. |
extend |
How far the axis may extend beyond |
eps |
The smallest increment allowed. |
... |
Arguments passed to |
The primary reason behind building this function was to allow a plot to be created with similar features but with different datasets. For instance, if a set of code was written for one dataset and the function axis had been utilized with pre-specified values, the axis may not match the plot of a new set of data. The function buildAxis addresses this problem by allowing the number of axis labels to be specified and controlled.
The axis is built by assigning penalties to a variety of potential axis setups, ranking them based on these penalties and then selecting the axis with the best score.
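To make the penalty-and-rank idea concrete, here is a toy sketch in base R. It is an illustration only: the helper pick_axis, its candidate step sizes (1, 2 and 5 times powers of ten) and its single penalty (distance from the preferred number of labels) are all assumptions made for this example and are not the scoring rule used by buildAxis. It also assumes the data have a non-zero range.

```r
# Hypothetical sketch: NOT the scoring rule used by buildAxis.
pick_axis <- function(limits, n, n_min = 2, n_max = 10) {
  rng <- range(limits)
  pow <- floor(log10(diff(rng)))
  # Candidate step sizes: 1, 2 and 5 times nearby powers of ten.
  steps <- as.vector(outer(c(1, 2, 5), 10^((pow - 2):pow)))
  best <- NULL
  best_penalty <- Inf
  for (step in steps) {
    lo <- ceiling(rng[1] / step) * step
    hi <- floor(rng[2] / step) * step
    if (hi < lo) next
    ticks <- seq(lo, hi, by = step)
    # Discard setups outside the allowed number of labels.
    if (length(ticks) < n_min || length(ticks) > n_max) next
    # Penalty: distance from the preferred number of labels.
    penalty <- abs(length(ticks) - n)
    if (penalty < best_penalty) {
      best <- ticks
      best_penalty <- penalty
    }
  }
  best
}

x <- seq(0, 500, 10)
pick_axis(x, n = 5)
```

A fuller scoring scheme would combine several penalties (for example, how far the axis extends beyond the data); this sketch keeps one penalty so the ranking step is easy to follow.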
A vector of the axis plotted.
David Diez
histPlot, dotPlot, boxPlot, densityPlot
# ===> 0 <===#
limits <- rnorm(100, 605490, 10)
hist(limits, axes = FALSE)
buildAxis(1, limits, 2, nMax = 4)

# ===> 1 <===#
x <- seq(0, 500, 10)
y <- 8 * x + rnorm(length(x), mean = 6000, sd = 200)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 5)
buildAxis(2, limits = y, n = 3)

# ===> 2 <===#
x <- 9528412 + seq(0, 200, 10)
y <- 8 * x + rnorm(length(x), mean = 6000, sd = 200)
plot(x, y, axes = FALSE)
temp <- buildAxis(1, limits = x, n = 4)
buildAxis(2, y, 3)

# ===> 3 <===#
x <- seq(367, 1251, 10)
y <- 7.5 * x + rnorm(length(x), mean = 6000, sd = 800)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 3, nMax = 3)
buildAxis(2, limits = y, n = 4, nMin = 3, nMax = 5)

# ===> 4 <===#
x <- seq(367, 367.1, 0.001)
y <- 7.5 * x + rnorm(length(x), mean = 6000, sd = 0.01)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 5, nMax = 6)
buildAxis(2, limits = y, n = 2, nMin = 3, nMax = 4)

# ===> 5 <===#
x <- seq(-0.05, -0.003, 0.0001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 5, nMax = 6)
buildAxis(2, limits = y, n = 4, nMax = 5)
abline(lm(y ~ x))

# ===> 6 <===#
x <- seq(-0.0097, -0.008, 0.0001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 2, nMax = 5)
buildAxis(2, limits = y, n = 4, nMax = 5)
abline(lm(y ~ x))

# ===> 7 <===#
x <- seq(0.03, -0.003099, -0.00001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 2, nMax = 5)
buildAxis(2, limits = y, n = 4, nMax = 6)
abline(lm(y ~ x))

# ===> 8 - repeat <===#
m <- runif(1) / runif(1) +
  rgamma(1, runif(1) / runif(1), runif(1) / runif(1))
s <- rgamma(1, runif(1) / runif(1), runif(1) / runif(1))
x <- rnorm(50, m, s)
hist(x, axes = FALSE)
buildAxis(1, limits = x, n = 5, nMin = 4, nMax = 6, eps = 10^-12)
if (diff(range(x)) < 10^-12) {
  cat("too small\n")
}
Sample burger place preferences versus gender.
burger
A data frame with 500 observations on the following 2 variables.
Burger place.
a factor with levels Female and Male
SurveyUSA, Results of SurveyUSA News Poll #17718, data collected on December 2, 2010.
table(burger)
Calculate hit streaks
calc_streak(x)
x |
A character vector of hits ( |
A data frame with one column, length, containing the length of each hit streak.
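A minimal base-R sketch of one way streak lengths can be computed. It assumes hits are coded "H" and misses "M", and that a streak is the number of consecutive hits between two misses, so a miss that directly follows another miss yields a streak of length 0. The helper name streak_lengths is hypothetical, and the package's own implementation may differ.

```r
# Hypothetical helper; assumes "H" marks a hit and anything else a miss.
streak_lengths <- function(x) {
  hit <- as.integer(x == "H")
  # Pad with a miss at both ends so every streak is bounded by misses.
  padded <- c(0L, hit, 0L)
  miss_pos <- which(padded == 0L)
  # The number of hits between consecutive misses is the gap minus one.
  data.frame(length = diff(miss_pos) - 1L)
}

shots <- c("H", "M", "M", "H", "H", "M", "H")
streak_lengths(shots)$length
```

Under this definition the number of streaks is always one more than the number of misses, which is why zero-length streaks appear in the result.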
data(kobe_basket)
calc_streak(kobe_basket$shot)
A study in 1994 examined 491 dogs that had developed cancer and 945 dogs as a control group to determine whether there is an increased risk of cancer in dogs that are exposed to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D).
cancer_in_dogs
A data frame with 1436 observations on the following 2 variables.
a factor with levels 2,4-D and no 2,4-D
a factor with levels cancer and no cancer
Hayes HM, Tarone RE, Cantor KP, Jessen CR, McCurnin DM, and Richardson RC. 1991. Case-Control Study of Canine Malignant Lymphoma: Positive Association With Dog Owner's Use of 2,4-Dichlorophenoxyacetic Acid Herbicides. Journal of the National Cancer Institute 83(17):1226-1231.
table(cancer_in_dogs)
All the cards in a standard deck.
cards
A data frame with 52 observations on the following 4 variables.
a factor with levels 10, 2, 3, 4, 5, 6, 7, 8, 9, A, J, K, Q
a factor with levels black, red
a factor with levels Club, Diamond, Heart, Spade
a logical vector
table(cards$value)
table(cards$color)
table(cards$suit)
table(cards$face)
table(cards$suit, cards$face)
A data frame with 428 rows and 19 columns. This is a record of characteristics on all of the new models of cars for sale in the US in the year 2004.
cars04
A data frame with 428 observations on the following 19 variables.
The name of the vehicle including manufacturer and model.
Logical variable indicating if the vehicle is a sports car.
Logical variable indicating if the vehicle is an SUV.
Logical variable indicating if the vehicle is a wagon.
Logical variable indicating if the vehicle is a minivan.
Logical variable indicating if the vehicle is a pickup.
Logical variable indicating if the vehicle is all-wheel drive.
Logical variable indicating if the vehicle is rear-wheel drive.
Manufacturer suggested retail price of the vehicle.
Amount of money the dealer paid for the vehicle.
Displacement of the engine - the total volume of all the cylinders, measured in liters.
Number of cylinders in the engine.
Amount of horsepower produced by the engine.
Gas mileage for city driving, measured in miles per gallon.
Gas mileage for highway driving, measured in miles per gallon.
Total weight of the vehicle, measured in pounds.
Distance between the center of the front wheels and the center of the rear wheels, measured in inches.
Total length of the vehicle, measured in inches.
Total width of the vehicle, measured in inches.
library(ggplot2)

# Highway gas mileage
ggplot(cars04, aes(x = hwy_mpg)) +
  geom_histogram(
    bins = 15, color = "white",
    fill = openintro::IMSCOL["green", "full"]
  ) +
  theme_minimal() +
  labs(
    title = "Highway gas mileage for cars from 2004",
    x = "Gas Mileage (miles per gallon)",
    y = "Number of cars"
  )
A data frame with 54 rows and 6 columns. This data is a subset of the Cars93 dataset from the MASS package.
cars93
A data frame with 54 observations on the following 6 variables.
The vehicle type with levels large, midsize, and small.
Vehicle price (USD).
Vehicle mileage in city (miles per gallon).
Vehicle drive train with levels 4WD, front, and rear.
The vehicle passenger capacity.
Vehicle weight (lbs).
These cars represent a random sample for 1993 models that were in both Consumer Reports and PACE Buying Guide. Only vehicles of type small, midsize, and large were included.
Further description can be found in Lock (1993), available at http://jse.amstat.org/v1n1/datasets.lock.html.
Lock, R. H. (1993) 1993 New Car Data. Journal of Statistics Education 1(1).
library(ggplot2)

# Vehicle price by type
ggplot(cars93, aes(x = price)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~type)

# Vehicle price vs. weight
ggplot(cars93, aes(x = weight, y = price)) +
  geom_point()

# Mileage vs. weight
ggplot(cars93, aes(x = weight, y = mpg_city)) +
  geom_point() +
  geom_smooth()
These are simulated data and intended to represent housing prices of students at a community college.
cchousing
A data frame with 75 observations on the following variable.
Monthly housing price, simulated.
hist(cchousing$price)
Create a Cartesian Coordinate Plane.
CCP(
  xlim = c(-4, 4),
  ylim = c(-4, 4),
  mar = rep(0, 4),
  length = 0.1,
  tcl = 0.007,
  xylab = FALSE,
  ticks = 1,
  ticklabs = 1,
  xpos = 1,
  ypos = 2,
  cex.coord = 1,
  cex.xylab = 1.5,
  add = FALSE
)
xlim |
The x-limits for the plane (vector of length 2). |
ylim |
The y-limits for the plane (vector of length 2). |
mar |
Plotting margins. |
length |
The |
tcl |
Tick size. |
xylab |
Whether x and y should be shown next to the labels. |
ticks |
How frequently tick marks should be shown on the axes. If a vector of length 2, the first argument will correspond to the x-axis and the second to the y-axis. |
ticklabs |
How frequently tick labels should be shown on the axes. If a vector of length 2, the first argument will correspond to the x-axis and the second to the y-axis. |
xpos |
The position of the labels on the x-axis. See the |
ypos |
The position of the labels on the y-axis. See the |
cex.coord |
Inflation factor for font size of the coordinates, where
any value larger than zero is acceptable and |
cex.xylab |
Inflation factor for font size of the x and y labels, where
any value larger than zero is acceptable and |
add |
Indicate whether a new plot should be created ( |
David Diez
lsegments, dlsegments, ArrowLines
CCP()
CCP(xylab = TRUE, ylim = c(-3.5, 2), xpos = 3, cex.coord = 1)
CCP(xlim = c(-8, 8), ylim = c(-10, 6), ticklabs = c(2, 2), cex.xylab = 0.8)
A dataset from the 2000 Behavioral Risk Factors Surveillance System (BRFSS) conducted by the US Centers for Disease Control and Prevention used to illustrate inference on demographic data.
cdc
A dataframe with 20,000 rows and 9 variables:
genhlth
Factor with levels excellent, very good, good, fair, poor
exerany
Numeric vector; 1 if the respondent exercised in the past month and 0 otherwise.
hlthplan
Numeric; 1 if the respondent has some form of health coverage and 0 otherwise.
smoke100
Numeric; 1 if the respondent has smoked at least 100 cigarettes in their entire life and 0 otherwise.
height
Numeric; respondent's height in inches.
weight
Numeric; respondent's weight in pounds.
wtdesire
Numeric; respondent's desired weight in pounds.
age
Numeric; respondent's age in years.
gender
Factor with two levels, m and f
https://www.cdc.gov/brfss/index.html
A sample of 60 individuals from the 2000 Behavioral Risk Factors Surveillance System (BRFSS) conducted by the US Centers for Disease Control.
cdc.samp
A tibble with 60 rows and 9 variables:
genhlth
Factor with levels excellent, very good, good, fair, poor
exerany
Numeric vector; 1 if the respondent exercised in the past month and 0 otherwise.
hlthplan
Numeric vector; 1 if the respondent has some form of health coverage and 0 otherwise.
smoke100
Numeric; 1 if the respondent has smoked at least 100 cigarettes in their entire life and 0 otherwise.
height
Numeric; respondent's height in inches.
weight
Numeric; respondent's weight in pounds.
wtdesire
Numeric; respondent's desired weight in pounds.
age
Numeric; respondent's age in years.
gender
Factor with two levels, m and f
http://www.openintro.org/stat/data/cdc.R
A random sample of 500 observations from the 2000 U.S. Census Data.
census
A data frame with 500 observations on the following 8 variables.
Census Year.
Name of state.
Total family income (in U.S. dollars).
Age.
Sex with levels Female and Male.
Race with levels American Indian or Alaska Native, Black, Chinese, Japanese, Other Asian or Pacific Islander, Two major races, White and Other.
Marital status with levels Divorced, Married/spouse absent, Married/spouse present, Never married/single, Separated and Widowed.
Total personal income (in U.S. dollars).
https://data.census.gov/cedsci
library(dplyr)
library(ggplot2)

census |>
  filter(total_family_income > 0) |>
  ggplot(aes(x = total_family_income)) +
  geom_histogram(binwidth = 25000)
United States 2010 infant mortality and number of physicians by state, including the District of Columbia.
census.2010
A data frame with 51 rows and 3 columns.
state
Character vector, US state including the District of Columbia
inf.mort
Numeric vector, number of deaths per 1000 live births between 1 day and 1 year of age
doctors
Numeric vector, active physicians per 100,000 population
Data were abstracted from the 2010 Statistical Abstract of the United States. Due to a lag in recording state level data, the infant mortality data is from 2009 and the data on physicians from 2007. Both measurements are subject to change annually, so these data are not current and should not be used for inference about infant mortality. More current data can be found at the US Centers for Disease Control and Prevention (https://www.cdc.gov/nchs/pressroom/sosmap/infant_mortality_rates/infant_mortality.htm), and in the dataset infant_mort_2022.
https://www.census.gov/library/publications/2009/compendia/statab/129ed/births-deaths-marriages-divorces.html, https://www.census.gov/library/publications/2009/compendia/statab/129ed/health-nutrition.html
Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 trees in the Allegheny National Forest, Pennsylvania.
cherry
A data frame with 31 observations on the following 3 variables.
diameter in inches (at 54 inches above ground)
height is measured in feet
volume in cubic feet
D.J. Hand. A handbook of small data sets. Chapman & Hall/CRC, 1994.
library(ggplot2)
library(broom)

ggplot(cherry, aes(x = diam, y = volume)) +
  geom_point() +
  geom_smooth(method = "lm")

mod <- lm(volume ~ diam + height, cherry)
tidy(mod)
Stereotypes are common, but at what age do they start? This study investigates stereotypes in young children aged 5-7 years old. There are four studies reported in the paper, and all four datasets are provided here.
children_gender_stereo
This data object is more unusual than most. It is a list of 4 data frames. The four data frames correspond to the data used in Studies 1-4 of the referenced paper, and these data frames each have variables (columns) that are among the following:
Subject ID. Note that Subject 1 in the first data frame (dataset) does not correspond to Subject 1 in the second data frame.
Gender of the subject.
Age of the subject, in years.
The trait that the children were making a judgement about, which was either nice or smart.
The age group of the people the children were making judgements about (as being either nice or smart): children or adults.
The proportion of trials where the child picked a gender target that matched the trait that was the same as the gender of the child. For example, suppose we had 18 pictures, where each picture showed 2 men and 2 women (and a different set of people in each photo). Then if we asked a boy to pick the person in each picture who they believed to be really smart, this stereotype variable would report the fraction of pictures where the boy picked a man. When a girl reviews the photos, then this stereotype variable reports the fraction of photos where she picked a woman. That is, this variable differs in meaning depending on the gender of the child. (This variable design is a little confusing, but it is useful when analyzing the data.)
The proportion of trials where the child said that children of their own gender were high-achieving in school.
Average score that measured the interest of the child in the game.
A difference score between the interest of the child in the “smart” game and their interest in the “try-hard” game.
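The definition of the stereotype variable given in the list above can be sketched with a toy calculation. The data frame trials and its column names are hypothetical and exist only for this illustration; the dataset itself already stores the resulting proportion per child.

```r
# Hypothetical toy data: each row is one picture shown to one child.
trials <- data.frame(
  subject = c(1, 1, 1, 1, 2, 2, 2, 2),
  child_gender = c(
    "male", "male", "male", "male",
    "female", "female", "female", "female"
  ),
  picked_gender = c(
    "male", "male", "female", "male",
    "male", "female", "male", "male"
  )
)

# A trial counts as a match when the picked person's gender
# equals the child's own gender.
trials$match <- trials$picked_gender == trials$child_gender

# Proportion of matching trials per child.
stereotype <- tapply(trials$match, trials$subject, mean)
stereotype
```

In this toy example the boy (subject 1) picked a man in 3 of 4 pictures and the girl (subject 2) picked a woman in 1 of 4, so the same column measures own-gender picks for both children.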
The structure of the data object is a little unusual, so we recommend reviewing the Examples section before starting your analysis.
Thank you to Nicholas Horton for pointing us to this study and the data!
Most of the results in the paper can be reproduced using the data provided here.
Bian L, Leslie SJ, Cimpian A. 2017. "Gender stereotypes about intellectual ability emerge early and influence children's interests". Science 355:6323 (389-391). https://www.science.org/doi/10.1126/science.aah6524.
The original data may be found here.
# This dataset is a little funny to work with.
# If wanting to review the data for a study, we
# recommend first assigning the corresponding
# data frame to a new variable. For instance,
# below we assign the second study's data to an
# object called `d` (d is for data!).
d <- children_gender_stereo[[2]]
The China Health and Nutrition Survey aims to examine the effects of the health, nutrition, and family planning policies and programs implemented by national and local governments.
china
A data frame with 9788 observations on the following 3 variables.
a numeric vector
a numeric vector
a numeric vector
UNC Carolina Population Center, China Health and Nutrition Survey, 2006.
summary(china)
Plot a chi-square distribution and shade the upper tail.
ChiSquareTail(
  U,
  df,
  xlim = c(0, 10),
  col = fadeColor("black", "22"),
  axes = TRUE,
  ...
)
U |
Cut off for the upper tail. |
df |
Degrees of freedom. |
xlim |
Limits for the plot. |
col |
Color of the shading. |
axes |
Whether to plot an x-axis. |
... |
Currently ignored. |
Nothing is returned from the function.
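Because nothing is returned, the numeric size of the shaded region has to be computed separately. A minimal sketch using base R's pchisq(), with the same cutoff and degrees of freedom as the example for this function; the variable name upper_tail is just a label chosen for this illustration.

```r
# Upper-tail probability P(X > U) for a chi-square distribution.
U <- 11.7
df <- 7
upper_tail <- pchisq(U, df = df, lower.tail = FALSE)
upper_tail
```

Setting lower.tail = FALSE requests the upper-tail area directly, which avoids computing 1 - pchisq(U, df) by hand.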
David Diez
data(COL)
ChiSquareTail(11.7, 7, c(0, 25), col = COL[1])
Country-level statistics from the US Central Intelligence Agency (CIA).
cia_factbook
A data frame with 259 observations on the following 11 variables.
Country name.
Land area, in square kilometers. (1 square kilometer is 0.386 square miles.)
Birth rate, in births per 1,000 people.
Death rate, in deaths per 1,000 people.
Infant mortality, in deaths per 1,000 live births.
Total number of internet users.
Life expectancy at birth, in years.
Number of female deaths per 100,000 live births where the death is related to pregnancy or birth.
Net migration rate.
Total population.
Population growth rate.
CIA Factbook, Country Comparisons, 2014. https://www.cia.gov/the-world-factbook/references/guide-to-country-comparisons/
library(dplyr)
library(ggplot2)

cia_factbook_iup <- cia_factbook |>
  mutate(internet_users_percent = 100 * internet_users / population)

ggplot(cia_factbook_iup, aes(x = internet_users_percent, y = life_exp_at_birth)) +
  geom_point() +
  labs(x = "Percentage of internet users", y = "Life expectancy at birth")
These data are simulated and are meant to represent the scores of students from three different lectures who were all given the same exam.
classdata
A data frame with 164 observations on the following 2 variables.
Represents a first midterm score.
Three classes: a, b, and c.
OpenIntro Statistics, Chapter 8.
anova(lm(m1 ~ lecture, classdata))
Data on a sample of 500 people from the Cleveland, OH and Sacramento, CA metro areas.
cle_sac
A data frame with 500 observations representing people on the following 8 variables.
Year the data was collected.
State where person resides.
City.
Age.
Sex.
Race.
Marital status.
Personal income.
library(ggplot2)

ggplot(cle_sac, aes(x = personal_income)) +
  geom_histogram(binwidth = 20000) +
  facet_wrap(~city)
A random set of monitoring locations were taken from NOAA data that had both years of interest (1948 and 2018) as well as data for both summary metrics of interest (dx70 and dx90, which are described below).
climate70
A data frame with 197 observations on the following 7 variables.
Station ID.
Latitude of the station.
Longitude of the station.
Number of days above 70 degrees in 1948.
Number of days above 70 degrees in 2018.
Number of days above 90 degrees in 1948.
Number of days above 90 degrees in 2018.
Please keep in mind that these are two annual snapshots, and a complete analysis would consider much more than two years of data and much additional information for those years.
https://www.ncdc.noaa.gov/cdo-web, retrieved 2019-04-24.
# Data sampled are from the US, Europe, and Australia.
# This geographic limitation may be due to the particular
# years considered, since locations without both 1948 and
# 2018 were discarded for this (simple) dataset.
plot(climate70$longitude, climate70$latitude)
plot(climate70$dx70_1948, climate70$dx70_2018)
abline(0, 1, lty = 2)
plot(climate70$dx90_1948, climate70$dx90_2018)
abline(0, 1, lty = 2)
hist(climate70$dx70_2018 - climate70$dx70_1948)
hist(climate70$dx90_2018 - climate70$dx90_1948)
t.test(climate70$dx70_2018 - climate70$dx70_1948)
t.test(climate70$dx90_2018 - climate70$dx90_1948)
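The t.test() calls in the example run a one-sample test on the 2018 minus 1948 differences, which is the same thing as a paired t-test on the two years. The sketch below illustrates that equivalence on a small made-up set of stations; the vectors dx_1948 and dx_2018 are hypothetical values, not rows of climate70.

```r
# Hypothetical day counts for six stations (not taken from climate70).
dx_1948 <- c(120, 95, 140, 60, 110, 88)
dx_2018 <- c(131, 99, 138, 72, 118, 90)

# One-sample t-test on the station-level differences.
one_sample <- t.test(dx_2018 - dx_1948)

# Paired t-test on the same stations; the pairing is by position.
paired <- t.test(dx_2018, dx_1948, paired = TRUE)

# Both approaches give the same test statistic and p-value.
c(one_sample = unname(one_sample$statistic), paired = unname(paired$statistic))
c(one_sample = one_sample$p.value, paired = paired$p.value)
```

Working with the differences directly, as the example does, makes it explicit that each station serves as its own control.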
Anonymous data was collected from urine samples at huts along the climb of Mont Blanc. Several types of drugs were tested, and proportions were reported.
climber_drugs
climber_drugs
A data frame with 211 rows and 6 variables.
Identification number of a specific urine sample.
Location where the sample was taken.
Substance detected to be present in the urine sample.
Amount of substance found measured in ng/ml.
Indicates that the concentration was determined by screening analysis.
Indicates that this substance was always detected concomitantly with the previous one, within the same urine sample.
PLOS One - Drug Use on Mont Blanc: A Study Using Automated Urine Collection
library(dplyr)
# Calculate the average concentration of each substance and number of occurrences.
climber_drugs |>
  group_by(substance) |>
  summarize(count = n(), mean_con = mean(concentration))
# Proportion of samples in which each substance was detected.
climber_drugs |>
  group_by(substance) |>
  summarize(prop = n() / 154)
Travel times and distances.
coast_starlight
A data frame with 16 observations on the following 3 variables.
Station.
Distance.
Travel time.
library(ggplot2)
ggplot(coast_starlight, aes(x = dist, y = travel_time)) +
  geom_point()
These are the core colors used for the OpenIntro Statistics textbook. The blue, green, yellow, and red colors are also gray-scaled, meaning no changes are required when printing black and white copies.
COL
A 7-by-13 matrix of 7 colors with thirteen fading scales: blue, green, yellow, red, black, gray, and light gray.
Colors selected by OpenIntro's in-house graphic designer, Meenal Patel.
plot(1:7, 7:1,
  col = COL, pch = 19, cex = 6, xlab = "", ylab = "",
  xlim = c(0.5, 7.5), ylim = c(-2.5, 8), axes = FALSE
)
text(1:7, 7:1 + 0.7, paste("COL[", 1:7, "]", sep = ""), cex = 0.9)
points(1:7, 7:1 - 0.7, col = COL[, 2], pch = 19, cex = 6)
points(1:7, 7:1 - 1.4, col = COL[, 3], pch = 19, cex = 6)
points(1:7, 7:1 - 2.1, col = COL[, 4], pch = 19, cex = 6)
A data frame containing information about comic book characters from Marvel Comics and DC Comics.
comics
A data frame with 21821 observations on the following 11 variables.
Name of the character. May include: Real name, hero or villain name, alias(es) and/or which universe they live in (i.e. Earth-616 in Marvel's multiverse).
Status of the character's identity, with levels Secret, Public, No Dual, and Unknown.
Character's alignment, with levels Good, Bad, Neutral, and Reformed Criminals.
Character's eye color.
Character's hair color.
Character's gender.
Character's classification as a gender or sexual minority.
Is the character dead or alive?
Number of comic books the character appears in.
Date of publication for the comic book the character first appeared in.
Publisher of the comic, with levels Marvel and DC.
library(ggplot2)
library(dplyr)
# Good v Bad
plot_data <- comics |>
  filter(align == "Good" | align == "Bad")
ggplot(plot_data, aes(x = align, fill = align)) +
  geom_bar() +
  facet_wrap(~publisher) +
  scale_fill_manual(values = c(IMSCOL["red", "full"], IMSCOL["blue", "full"])) +
  theme_minimal() +
  labs(
    title = "Is there a balance of power",
    x = "",
    y = "Number of characters",
    fill = ""
  )
Input a data frame or a table, and the LaTeX output will be returned. Options exist for row and column proportions as well as for showing work.
contTable( x, prop = c("none", "row", "col"), show = FALSE, digits = 3, caption = NULL, label = NULL )
x |
A data frame (with two columns) or a table. |
prop |
Indicate whether row ( |
show |
If row or column proportions are specified, indicate whether work should be shown. |
digits |
The number of digits after the decimal that should be shown for row or column proportions. |
caption |
A string that contains the table caption. The default value is
|
label |
The latex table label. The default value is |
The contTable function makes substantial use of the cat function.
David Diez
email, cars93, possum, mariokart
data(email)
table(email[, c("spam", "sent_email")])
contTable(email[, c("spam", "sent_email")])
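To make the row and column proportion options concrete, the sketch below computes both kinds of proportions with base R's prop.table(). It does not call contTable() and makes no claim about how that function computes them internally; the data frame toy is a hypothetical stand-in for email[, c("spam", "sent_email")].

```r
# Hypothetical two-column data frame (not the email dataset).
toy <- data.frame(
  spam = c(0, 0, 0, 1, 1, 0, 1, 0),
  sent_email = c(1, 0, 1, 0, 0, 1, 0, 0)
)

# Two-way table of counts: rows are spam, columns are sent_email.
counts <- table(toy)

# Row proportions: each row sums to 1.
row_props <- prop.table(counts, margin = 1)

# Column proportions: each column sums to 1.
col_props <- prop.table(counts, margin = 2)

round(row_props, 3)
round(col_props, 3)
```

Row proportions answer "of the messages in this spam category, what share were sent_email = 1?", while column proportions condition on sent_email instead.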
Simulated data.
corr_match
A data frame with 121 observations on the following 9 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Simulated dataset.
library(ggplot2)
ggplot(corr_match, aes(x = x, y = y1)) +
  geom_point()
cor(corr_match$x, corr_match$y1)
Country International Organization for Standardization (ISO) information.
country_iso
A data frame with 249 observations on the following 4 variables.
Two-letter ISO country code.
Country name.
Year the two-letter ISO country code was assigned.
Top-level domain name.
Wikipedia, retrieved 2018-11-18. https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
country_iso
These patients were randomly divided into a treatment group where they received a blood thinner or the control group where they did not receive a blood thinner. The outcome variable of interest was whether the patients survived for at least 24 hours.
cpr
A data frame with 90 observations on the following 2 variables.
a factor with levels control and treatment
a factor with levels died and survived
Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial, by Bottiger et al., The Lancet, 2001.
table(cpr)
Data on computer processors released between 2010 and 2020.
cpu
A data frame with 875 rows and 12 variables.
Manufacturer of the CPU.
Model name of the processor.
Name given by manufacturer to all chips with this architecture.
Number of compute cores per processor.
The number of threads represents the number of simultaneous calculations that can be ongoing in the processor.
Base speed for the CPU in GHz.
Single-core max speed for the CPU in GHz.
Specifies the type of connection to the motherboard.
Size of the process node used in production in nm.
Size of the level 3 cache on the processor in MB.
Total draw power of the processor.
Date which the processor was released to the public.
library(ggplot2)
# CPU base speed
ggplot(cpu, aes(x = company, y = base_clock)) +
  geom_boxplot() +
  labs(
    x = "Company",
    y = "Base Clock (GHz)",
    title = "CPU base speed"
  )
# Process node size vs. boost speed
ggplot(cpu, aes(x = process, y = boost_clock)) +
  geom_point() +
  labs(
    x = "Process node size (nm)",
    y = "Boost Clock (GHz)",
    title = "Process node size vs. boost speed"
  )
A simulated dataset of number of credits taken by college students each semester.
credits
A data frame with 100 observations on the following variable.
Number of credits.
Simulated data.
library(ggplot2)
ggplot(credits, aes(x = credits)) +
  geom_histogram(binwidth = 1)
Take a 2D contingency table and create a data frame representing the individual cases.
CT2DF(x, rn = row.names(x), cn = colnames(x), dfn = c("row.var", "col.var"))
x |
Contingency table as a matrix. |
rn |
Character vector of the row names. |
cn |
Character vector of the column names. |
dfn |
Character vector with 2 values for the variable representing the rows and columns. |
A data frame with two columns.
David Diez
a <- matrix(
  c(459, 727, 854, 385, 99, 4198, 6245, 4821, 1634, 578),
  2,
  byrow = TRUE
)
b <- CT2DF(
  a,
  c("No", "Yes"),
  c("Excellent", "Very good", "Good", "Fair", "Poor"),
  c("coverage", "health_status")
)
table(b)
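For readers who want to see what "one row per individual case" means, the following base R sketch expands the same matrix of counts without calling CT2DF(). It is an illustration of the idea under the assumption that each cell count becomes that many repeated rows, not a description of how CT2DF() is implemented.

```r
# Same counts as in the example, with dimnames attached so that the
# resulting columns are named coverage and health_status.
a <- matrix(
  c(459, 727, 854, 385, 99, 4198, 6245, 4821, 1634, 578),
  2,
  byrow = TRUE,
  dimnames = list(
    coverage = c("No", "Yes"),
    health_status = c("Excellent", "Very good", "Good", "Fair", "Poor")
  )
)

# Long format: one row per cell, with the count in a column named Freq.
counts_long <- as.data.frame(as.table(a))

# Repeat each cell's row Freq times to get one row per case.
case_rows <- rep(seq_len(nrow(counts_long)), counts_long$Freq)
cases <- counts_long[case_rows, c("coverage", "health_status")]
rownames(cases) <- NULL

# Tabulating the cases recovers the original counts.
table(cases)
```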
Data from a Danish study on triage in an emergency department (ED)
danish.ed.primary
A tibble with 6249 rows and 21 variables:
mort30
numeric, 1 if patient died within 30 days of admission, 0 otherwise
triage
factor, triage score given at arrival to ED. Values green, yellow, orange, red, from lowest to highest priority for treatment. The value blue normally denotes severity not warranting admission to the ED, but no participants coded blue are in these data.
age
numeric, age in years, rounded to lower integer
sex
factor, values female, male
albumin
numeric, serum albumin, in g/L
creatinine
numeric, serum creatinine, in umol/L
hemaglobin
numeric, serum hemoglobin, in mmol/L
potassium
numeric, serum potassium, in mmol/L
leuk.count
blood leukocyte count, in 10E9/L
sodium
numeric, serum sodium, in mmol/L
c.react.protein
numeric, serum C-reactive protein
oxygen.sat
numeric, peripheral arterial oxygen saturation, as a percent
resp.rate
numeric, respiratory rate per minute
heart.rate
numeric, heart rate, beats/min
systolic.bp
numeric, systolic blood pressure, in mmHg
glasgow.coma.scale
numeric, extent of impaired consciousness in patients with acute medical condition or trauma, scored between 3 and 15, 3 being the worst and 15 the best. Score is based on 3 subscales, best eye, verbal and motor responses.
readmit.hosp
factor, readmitted to hospital within 30 days, values yes, no
days.in.hosp
numeric, number of days admitted to hospital
icu.time
numeric, number of days in the intensive care unit. A value of 99999 indicates that the patient was not admitted to the ICU.
icu.status
factor, patient admitted to ICU, values yes, no
Kristensen, Michael, et al. "Routine blood tests are associated with short term mortality and can improve emergency department triage: a cohort study of >12,000 patients." Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine 25 (2017): 1-8. https://sjtrem.biomedcentral.com/articles/10.1186/s13049-017-0458-x?report=reader
Data from a prospective cohort study of triage scoring for an emergency department (ED). The study examined whether the use of patient level measurements would improve an existing triage score. These data are the training data (called primary data in the original manuscript) used for model building. Some variable names have been changed for readability, but the data on 21 variables for the 6,249 participants are otherwise unchanged.
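Because icu.time uses 99999 as a placeholder for patients who were not admitted to the ICU, summaries computed on the raw column can be badly distorted. The sketch below shows one possible way to recode the placeholder as a missing value before summarizing. The data frame toy and the column icu.time.clean are hypothetical and only mimic the coding described above.

```r
# Hypothetical rows that mimic the icu.status / icu.time coding.
toy <- data.frame(
  icu.status = factor(c("no", "yes", "no", "yes")),
  icu.time = c(99999, 2, 99999, 5)
)

# Recode the 99999 placeholder as NA (new column name is an assumption).
toy$icu.time.clean <- ifelse(toy$icu.time == 99999, NA_real_, toy$icu.time)

# The raw mean is dominated by the placeholder value.
mean(toy$icu.time)

# The cleaned mean summarizes only patients who were in the ICU.
mean(toy$icu.time.clean, na.rm = TRUE)
```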
Data from a prospective cohort study of triage scoring for an emergency department (ED). The study examined whether the use of patient level measurements would improve an existing triage score. These data were used as a test set (called validation in the manuscript) to examine the performance of the model built using the training (primary) cohort. Some variable names have been changed for readability and for consistency with the primary dataset, but the data on 18 variables for the 6,383 participants are otherwise unchanged. Some variables in the primary dataset do not appear in these data.
danish.ed.validation
A tibble with 6383 rows and 18 variables:
mort30
numeric, 1 if patient died within 30 days of admission, 0 otherwise
triage
factor, triage score given at arrival to ED. Values blue, green, yellow, orange, red, from lowest to highest priority for treatment. The value blue normally denotes severity not warranting admission to the ED. Participants coded blue are in these data but not in the primary data.
age
numeric, age in years, rounded to lower integer
sex
factor, values female, male
albumin
numeric, serum albumin, in g/L
creatinine
numeric, serum creatinine, in umol/L
hemaglobin
numeric, serum hemoglobin, in mmol/L
potassium
numeric, serum potassium, in mmol/L
leuk.count
blood leukocyte count, in 10E9/L
sodium
numeric, serum sodium, in mmol/L
c.react.protein
numeric, serum C-reactive protein
oxygen.sat
numeric, peripheral arterial oxygen saturation, %
resp.rate
numeric, respiratory rate per minute
heart.rate
numeric, heart rate, beats/min
systolic.bp
numeric, systolic blood pressure, in mmHg
readmit.hosp
factor, readmitted to hospital within 30 days, with values yes, no
days.in.hosp
numeric, number of days admitted to hospital
icu.status
factor, patient admitted to ICU, with values yes, no
Kristensen, Michael, et al. "Routine blood tests are associated with short term mortality and can improve emergency department triage: a cohort study of >12,000 patients." Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine 25 (2017): 1-8. https://sjtrem.biomedcentral.com/articles/10.1186/s13049-017-0458-x?report=reader
Researchers tested the deterrence hypothesis, which predicts that introducing a penalty will reduce the occurrence of the penalized behavior, provided that the penalty leaves everything else unchanged. They did so by instituting a fine for late pickup at daycare centers. For this study, they worked with 10 volunteer daycare centers that did not originally fine parents for picking up their kids late. They randomly selected 6 of these daycare centers and instituted a monetary fine (of a considerable amount) for picking up children late, and later removed it. In the remaining 4 daycare centers no fine was introduced. The study period was divided into four: before the fine (weeks 1-4), the first 4 weeks with the fine (weeks 5-8), the entire period with the fine (weeks 5-16), and the after-fine period (weeks 17-20). Throughout the study, the number of kids who were picked up late was recorded each week for each daycare. The study found that the number of late-coming parents increased significantly when the fine was introduced, and no reduction occurred after the fine was removed.
daycare_fines
A data frame with 200 observations on the following 7 variables.
Daycare center id.
Study group: test (fine instituted) or control (no fine).
Number of children at daycare center.
Week of study.
Number of late pickups for a given week and daycare center.
Period of study, divided into 4 periods: before fine, first 4 weeks with fine, last 8 weeks with fine, after fine.
Period of study, divided into 3 periods: before fine, with fine, after fine.
Gneezy, Uri, and Aldo Rustichini. "A fine is a price." The Journal of Legal Studies 29, no. 1 (2000): 1-17.
library(dplyr)
library(tidyr)
library(ggplot2)
# The following tables roughly match results presented in Table 2 of the source article
# The results are only off by rounding for some of the weeks
daycare_fines |>
  group_by(center, study_period_4) |>
  summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
  pivot_wider(names_from = study_period_4, values_from = avg_late_pickups)
daycare_fines |>
  group_by(center, study_period_3) |>
  summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
  pivot_wider(names_from = study_period_3, values_from = avg_late_pickups)
# The following plot matches Figure 1 of the source article
daycare_fines |>
  group_by(week, group) |>
  summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
  ggplot(aes(x = week, y = avg_late_pickups, group = group, color = group)) +
  geom_point() +
  geom_line()
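The two study-period variables are coarser and finer groupings of the same 20 study weeks. As a rough illustration, the sketch below rebuilds both groupings from a plain vector of week numbers with cut(), using the week ranges given in the description. Whether the packaged columns were derived this way is an assumption; the vector week is created here and is not the column from daycare_fines.

```r
# Weeks of the study, 1 through 20 (a stand-alone vector, not the dataset).
week <- 1:20

# Four periods: weeks 1-4, 5-8, 9-16, and 17-20.
study_period_4 <- cut(
  week,
  breaks = c(0, 4, 8, 16, 20),
  labels = c(
    "before fine", "first 4 weeks with fine",
    "last 8 weeks with fine", "after fine"
  )
)

# Three periods: weeks 1-4, 5-16, and 17-20.
study_period_3 <- cut(
  week,
  breaks = c(0, 4, 16, 20),
  labels = c("before fine", "with fine", "after fine")
)

# Number of weeks that fall in each period.
table(study_period_4)
table(study_period_3)
```

With the default right-closed intervals of cut(), a break at 4 puts week 4 in the first period and week 5 in the second.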
The dataset represents a sample of 1,000 DDS consumers (out of a total population of approximately 250,000), and includes information about age, gender, ethnicity, and the amount of financial support per consumer provided by the DDS. The dataset is based on recorded attributes of consumers, but has been altered to maintain consumer privacy. From the Taylor and Mickel paper: "The data set originated from DDS’s Client Master File. In order to remain in compliance with California State Legislation, the data have been altered to protect the rights and privacy of specific individual consumers. The provided data set is based on actual attributes of consumers."
dds.discr
A dataframe with 1000 rows and 6 variables:
id
Numeric, Unique identification code for each resident
age.cohort
A factor, 0-5 years, 6-12 years, 13-17 years, 18-21 years, 22-50 years, and 51+ years
age
Numeric, Age measured in years
gender
A factor, with levels Female or Male
expenditures
Numeric, Amount of expenditures spent by the State on an individual annually, measured in USD
ethnicity
Factor, Ethnic group, recorded as American Indian, Asian, Black, Hispanic, Multi Race, Native Hawaiian, Other, White not Hispanic
www.amstat.org/publications/jse/v22n1/mickel.pdf Taylor, Stanley A., and Amy E. Mickel. Simpson's paradox: A data set and discrimination case study exercise. Journal of Statistics Education 22.1 (2014). Data contained in supplement B of Taylor and Mickel.
Compute kernel density plots, written in the same structure as boxPlot. Histograms can be automatically added for teaching purposes.
densityPlot( x, fact = NULL, bw = "nrd0", histo = c("none", "faded", "hollow"), breaks = "Sturges", fading = "0E", fadingBorder = "25", lty = NULL, lwd = 1, col = c("black", "red", "blue"), key = NULL, add = FALSE, adjust = 1, kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine"), weights = NULL, n = 512, from, to, na.rm = FALSE, xlim = NULL, ylim = NULL, main = "", ... )
x |
A numerical vector. |
fact |
A character or factor vector defining the grouping for data in
|
bw |
Bandwidth. See |
histo |
Whether to plot a faded histogram ( |
breaks |
The |
fading |
Character value of hexadecimal, e.g. |
fadingBorder |
Character value of hexadecimal, e.g. |
lty |
Numerical vector describing the line type for the density
curve(s). Each element corresponds to a different level of the
argument |
lwd |
Numerical vector describing the line width for the density
curve(s). Each element corresponds to a different level of the
argument |
col |
Numerical vector describing the line color for the density
curve(s). Each element corresponds to a different level of the
argument |
key |
An argument to specify ordering of the factor levels. |
add |
If |
adjust |
Argument passed to |
kernel |
Argument passed to |
weights |
Argument passed to |
n |
Argument passed to |
from |
Argument passed to |
to |
Argument passed to |
na.rm |
Argument passed to |
xlim |
x-axis limits. |
ylim |
y-axis limits. |
main |
Title for the plot. |
... |
If |
David Diez
# hollow histograms
histPlot(tips$tip[tips$day == "Tuesday"],
  hollow = TRUE, xlim = c(0, 30),
  lty = 1, main = "Tips by day"
)
histPlot(tips$tip[tips$day == "Friday"],
  hollow = TRUE, border = "red",
  add = TRUE, main = "Tips by day"
)
legend("topright",
  col = c("black", "red"),
  lty = 1:2, legend = c("Tuesday", "Friday")
)
# density plots
densityPlot(tips$tip, tips$day,
  col = c("black", "red"), main = "Tips by day"
)
legend("topright",
  col = c("black", "red"),
  lty = 1:2, legend = c("Tuesday", "Friday")
)
densityPlot(tips$tip,
  histo = "faded",
  breaks = 15, main = "Tips by day"
)
densityPlot(tips$tip,
  histo = "hollow",
  breaks = 30, fadingBorder = "66",
  lty = 1, main = "Tips by day"
)
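Several argument names in the usage (bw, kernel, adjust, n) mirror those of base R's density(). The sketch below therefore uses density() and hist() directly to show the idea of a kernel density curve drawn over a hollow histogram. It is an approximation under that assumption, not the code behind densityPlot(), and the vector x is simulated for illustration.

```r
# Simulated, hypothetical measurements from two overlapping groups.
set.seed(1)
x <- c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 18, sd = 3))

# Kernel density estimate with the same defaults named in the usage.
d <- density(x, bw = "nrd0", kernel = "gaussian", n = 512)

# Hollow histogram on the density scale, with the density curve on top.
hist(x,
  breaks = "Sturges", freq = FALSE,
  col = NA, border = "gray60", main = "", xlab = "x"
)
lines(d, lwd = 1)

# The bandwidth chosen by the "nrd0" rule of thumb.
d$bw
```

Plotting the histogram with freq = FALSE puts both layers on the same vertical scale, which is what makes the overlay meaningful.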
Three treatments were compared to test their relative efficacy (effectiveness) in treating Type 2 Diabetes in patients aged 10-17 who were being treated with metformin. The primary outcome was lack of glycemic control (or not); lacking glycemic control means the patient still needed insulin, which is not the preferred outcome for a patient.
diabetes2
diabetes2
A data frame with 699 observations on the following 2 variables.
The treatment the patient received.
Whether the patient still needs insulin (failure) or met a basic positive outcome bar (success).
Each of the 699 patients in the experiment was randomized to one of the following treatments: (1) continued treatment with metformin (coded as met), (2) metformin combined with rosiglitazone (coded as rosi), or (3) a lifestyle-intervention program (coded as lifestyle).
Zeitler P, et al. 2012. A Clinical Trial to Maintain Glycemic Control in Youth with Type 2 Diabetes. N Engl J Med.
lapply(diabetes2, table)
(cont.table <- table(diabetes2))
(m <- chisq.test(cont.table))
m$expected
Create a plot showing two line segments. The union or intersection of those line segments can also be generated by utilizing the type argument.
dlsegments( x1 = c(3, 7), x2 = c(5, 9), l = c("o", "o"), r = c("c", "c"), type = c("n", "u", "i"), COL = 2, lwd = 2.224, ylim = c(-0.35, 2), mar = rep(0, 4), hideOrig = FALSE )
x1 |
The endpoints of the first interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity. |
x2 |
The endpoints of the second interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity. |
l |
A vector of length 2, where the values correspond to the left end
point of each interval. A value of |
r |
A vector of length 2, where the values correspond to the right end
point of each interval. A value of |
type |
By default, no intersection or union of the two lines will be
shown (value of |
COL |
If the union or intersection is to be shown (see the |
lwd |
If the union or intersection is to be shown (see the |
ylim |
A vector of length 2 specifying the vertical plotting limits,
which may be useful for fine-tuning plots. The default is |
mar |
A vector of length 4 that represent the plotting margins. |
hideOrig |
An optional argument to specify whether the two line
segments should be shown ( |
David Diez
dlsegments(c(-3, 3), c(1, 1000),
  r = c("o", "o"), l = c("c", "o"), COL = COL[4]
)
dlsegments(c(-3, 3), c(1, 1000),
  r = c("o", "o"), l = c("c", "o"), type = "un", COL = COL[4]
)
dlsegments(c(-3, 3), c(1, 1000),
  r = c("o", "o"), l = c("c", "o"), type = "in", COL = COL[4]
)
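The union and intersection that the plots display can also be worked out numerically. The sketch below does this for the two intervals used in the example, writing the upper endpoint 1000 as Inf since values above 999 are interpreted as infinity. It ignores whether endpoints are open or closed and does not call dlsegments(); the object names are chosen for illustration.

```r
# The two intervals from the example, with 1000 written as Inf.
x1 <- c(-3, 3)
x2 <- c(1, Inf)

# Intersection: from the larger left endpoint to the smaller right endpoint.
intersection <- c(max(x1[1], x2[1]), min(x1[2], x2[2]))

# The intervals overlap only if that range is not empty.
overlap <- intersection[1] <= intersection[2]

# Union: a single interval when the two intervals overlap.
union_interval <- if (overlap) c(min(x1[1], x2[1]), max(x1[2], x2[2])) else NULL

intersection
union_interval
```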
Plot observations as dots.
dotPlot( x, fact = NULL, vertical = FALSE, at = 1, key = NULL, pch = 20, col = fadeColor("black", "66"), cex = 1.5, add = FALSE, axes = TRUE, xlim = NULL, ylim = NULL, ... )
x |
A numerical vector. |
fact |
A character or factor vector defining the grouping for data in
|
vertical |
If |
at |
The vertical coordinate of the points, or the horizontal
coordinate if |
key |
The factor levels corresponding to |
pch |
Plotting character. If |
col |
Plotting character color. If |
cex |
Plotting character size. If |
add |
If |
axes |
If |
xlim |
Limits for the x axis. |
ylim |
Limits for the y axis. |
... |
Additional arguments to be passed to |
David Diez
histPlot, densityPlot, boxPlot
library(dplyr)
# Price by type
dotPlot(cars93$price, cars93$type,
  key = c("large", "midsize", "small"),
  cex = 1:3
)
# Hours worked by educational attainment or degree
gss2010_nona <- gss2010 |>
  filter(!is.na(hrs1) & !is.na(degree))
dotPlot(gss2010_nona$hrs1, gss2010_nona$degree,
  col = fadeColor("black", "11")
)
# levels reordered
dotPlot(gss2010_nona$hrs1, gss2010_nona$degree,
  col = fadeColor("black", "11"),
  key = c("LT HIGH SCHOOL", "HIGH SCHOOL", "BACHELOR", "JUNIOR COLLEGE", "GRADUATE")
)
# with boxPlot() overlaid
dotPlot(mariokart$total_pr, mariokart$cond,
  ylim = c(0.5, 2.5), xlim = c(25, 80), cex = 1
)
boxPlot(mariokart$total_pr, mariokart$cond,
  add = 1:2 + 0.1,
  key = c("new", "used"), horiz = TRUE, axes = FALSE
)
Add a stacked dot plot to an existing plot. The locations for the points in the dot plot are returned from the function in a list.
dotPlotStack(x, radius = 1, seed = 1, addDots = TRUE, ...)
x |
A vector of numerical observations for the dot plot. |
radius |
The approximate distance that should separate each point. |
seed |
A random seed (integer). Different values will produce different variations. |
addDots |
Indicate whether the points should be added to the plot. |
... |
Additional arguments are passed to
|
Returns a list with a height that can be used as the upper bound of ylim for a plot, then also the x- and y-coordinates of the points in the stacked dot plot.
David Diez
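As a rough point of comparison, base R can draw a simple stacked dot plot with stripchart() and method = "stack". The sketch below is not dotPlotStack(): it creates its own plot rather than adding to an existing one, and it does not return point locations. The vector x is hypothetical.

```r
# Hypothetical observations; repeated values stack on top of each other.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

# Height of the tallest stack, a guide to how much vertical room is needed.
tallest_stack <- max(table(x))

# Base R stacked dot plot of the same values.
stripchart(x, method = "stack", pch = 20, offset = 0.5)

tallest_stack
```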
A SurveyUSA poll.
dream
A data frame with 910 observations on the following 2 variables.
a factor with levels Conservative, Liberal, and Moderate
a factor with levels No, Not sure, and Yes
SurveyUSA, News Poll #18927, data collected Jan 27-29, 2012.
table(dream)
Quality control dataset for quadcopter drone blades. The data were made up for use as an example.
drone_blades
A data frame with 2000 observations on the following 2 variables.
The supplier for the blade.
The inspection conclusion.
OpenIntro Statistics, Third Edition and Fourth Edition.
library(dplyr)
drone_blades |>
  count(supplier, inspection)
Summary of 445 student-parent pairs.
drug_use
A data frame with 445 observations on the following 2 variables.
a factor with levels not and uses
a factor with levels not and used
Ellis GJ and Stone LH. 1979. Marijuana Use in College: An Evaluation of a Modeling Explanation. Youth and Society 10:323-334.
table(drug_use)
Data on houses that were recently sold in the Duke Forest neighborhood of Durham, NC in November 2020.
duke_forest
A data frame with 98 rows and 13 variables.
Address of house.
Sale price, in USD.
Number of bedrooms.
Number of bathrooms.
Area of home, in square feet.
Type of home (all are Single Family).
Year the home was built.
Heating system.
Cooling system (other or central).
Type of parking available and number of parking spaces.
Area of the entire property, in acres.
If the home belongs to a Home Owners Association, the associated fee (NA otherwise).
URL of the listing.
Data were collected from Zillow in November 2020.
library(ggplot2)

# Number of bedrooms and price
ggplot(duke_forest, aes(x = as.factor(bed), y = price)) +
  geom_boxplot() +
  labs(
    x = "Number of bedrooms",
    y = "Sale price (USD)",
    title = "Homes for sale in Duke Forest, Durham, NC",
    subtitle = "Data are from November 2020"
  )

# Area and price
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point() +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Homes for sale in Duke Forest, Durham, NC",
    subtitle = "Data are from November 2020"
  )
Select set of notable earthquakes from 1900 to 1999.
earthquakes
A data frame with 123 rows and 7 variables.
Year the earthquake took place.
Month the earthquake took place.
Day the earthquake took place.
Magnitude of earthquake using the Richter Scale.
City or geographic location of earthquakes.
Country or countries if the earthquake occurred on a border.
Approximate number of deaths caused by the earthquake.
World Almanac and Book of Facts: 2011.
library(ggplot2)

ggplot(earthquakes, aes(x = richter, y = deaths)) +
  geom_point()

ggplot(earthquakes, aes(x = log(deaths))) +
  geom_histogram()
In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll asked New Yorkers whether they favored a "mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient". This poll included responses of 1,042 New York adults between October 26th and 28th, 2014.
ebola_survey
A data frame with 1042 observations on the following variable.
Indicates whether the respondent is in favor or against the mandatory quarantine.
Poll ID NY141026 on maristpoll.marist.edu.
table(ebola_survey)
Explore different plotting methods using a click interface.
edaPlot(
  dataFrame,
  Col = c("#888888", "#FF0000", "#222222", "#FFFFFF", "#CCCCCC", "#3377AA")
)
dataFrame | A data frame. |
Col | A vector containing six colors. The colors may be given in any form. |
David Diez
histPlot, densityPlot, boxPlot, dotPlot
data(mlbbat10)
bat <- mlbbat10[mlbbat10$at_bat > 200, ]
# edaPlot(bat)

data(mariokart)
mk <- mariokart[mariokart$total_pr < 100, ]
# edaPlot(mk)
Gift aid for a random sample of 50 students at Elmhurst College.
elmhurst
A data frame with 50 observations on the following 3 variables.
Family income of the student.
Gift aid, in $1000s.
Price paid by the student (tuition - gift aid).
These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College that accompanied an article titled "What Students Really Pay to Go to College", published online by The Chronicle of Higher Education: https://www.chronicle.com/article/what-students-really-pay-to-go-to-college/?sra=true.
library(ggplot2)
library(broom)

ggplot(elmhurst, aes(x = family_income, y = gift_aid)) +
  geom_point() +
  geom_smooth(method = "lm")

mod <- lm(gift_aid ~ family_income, data = elmhurst)
tidy(mod)
These data represent incoming emails for the first three months of 2012 for an email account (see Source).
The email (email_sent) data frame has 3921 (1252) observations on the following 21 variables.
Indicator for whether the email was spam.
Indicator for whether the email was addressed to more than one recipient.
Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
Number of people cc'ed.
Indicator for whether the sender had been sent an email in the last 30 days.
Time at which email was sent.
The number of images attached.
The number of attached files.
The number of times a dollar sign or the word “dollar” appeared in the email.
Indicates whether “winner” appeared in the email.
The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
The number of times “viagra” appeared in the email.
The number of times “password” appeared in the email.
The number of characters in the email, in thousands.
The number of line breaks in the email (does not count text wrapping).
Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”.
Whether there was an exclamation point in the subject.
Whether the word “urgent” was in the email subject.
The number of exclamation points in the email message.
Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
David Diez's Gmail Account, early months of 2012. All personally identifiable information has been removed.
e <- email

# ______ Variables For Logistic Regression ______#
# Variables are modified to match
# OpenIntro Statistics, Second Edition
# As Is (6): spam, to_multiple, winner, format,
#     re_subj, exclaim_subj
# Omitted (7): from, sent_email, time, image,
#     viagra, urgent_subj, number
# Become Indicators (5): cc, attach, dollar,
#     inherit, password
e$cc <- ifelse(email$cc > 0, 1, 0)
e$attach <- ifelse(email$attach > 0, 1, 0)
e$dollar <- ifelse(email$dollar > 0, 1, 0)
e$inherit <- ifelse(email$inherit > 0, 1, 0)
e$password <- ifelse(email$password > 0, 1, 0)
# Transform (3): num_char, line_breaks, exclaim_mess
# e$num_char <- cut(email$num_char, c(0,1,5,10,20,1000))
# e$line_breaks <- cut(email$line_breaks, c(0,10,100,500,10000))
# e$exclaim_mess <- cut(email$exclaim_mess, c(-1,0,1,5,10000))
g <- glm(
  spam ~ to_multiple + winner + format +
    re_subj + exclaim_subj +
    cc + attach + dollar +
    inherit + password, # +
  # num_char + line_breaks + exclaim_mess,
  data = e, family = binomial
)
summary(g)

# ______ Variable Selection Via AIC ______#
g. <- step(g)
plot(predict(g., type = "response"), e$spam)

# ______ Splitting num_char by html ______#
x <- log(email$num_char)
bw <- 0.004
R <- range(x) + c(-1, 1)
wt <- sum(email$format == 1) / nrow(email)
htmlAll <- density(x, bw = 0.4, from = R[1], to = R[2])
htmlNo <- density(x[email$format != 1],
  bw = 0.4, from = R[1], to = R[2]
)
htmlYes <- density(x[email$format == 1],
  bw = 0.4, from = R[1], to = R[2]
)
htmlNo$y <- htmlNo$y #* (1-wt)
htmlYes$y <- htmlYes$y #* wt + htmlNo$y
plot(htmlAll, xlim = c(-4, 6), ylim = c(0, 0.4))
lines(htmlNo, col = 4)
lines(htmlYes, lwd = 2, col = 2)
This is a subsample of the email dataset.
email50
A data frame with 50 observations on the following 21 variables.
Indicator for whether the email was spam.
Indicator for whether the email was addressed to more than one recipient.
Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
Number of people cc'ed.
Indicator for whether the sender had been sent an email in the last 30 days.
Time at which email was sent.
The number of images attached.
The number of attached files.
The number of times a dollar sign or the word “dollar” appeared in the email.
Indicates whether “winner” appeared in the email.
The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
The number of times “viagra” appeared in the email.
The number of times “password” appeared in the email.
The number of characters in the email, in thousands.
The number of line breaks in the email (does not count text wrapping).
Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”.
Whether there was an exclamation point in the subject.
Whether the word “urgent” was in the email subject.
The number of exclamation points in the email message.
Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
David Diez's Gmail Account, early months of 2012. All personally identifiable information has been removed.
index <- c(
  101, 105, 116, 162, 194, 211, 263, 308, 361, 374,
  375, 465, 509, 513, 571, 691, 785, 842, 966, 968,
  1051, 1201, 1251, 1433, 1519, 1727, 1760, 1777, 1899, 1920,
  1943, 2013, 2052, 2252, 2515, 2629, 2634, 2710, 2823, 2835,
  2944, 3098, 3227, 3360, 3452, 3496, 3530, 3665, 3786, 3877
)
order <- c(
  3, 33, 12, 1, 21, 15, 43, 49, 8, 6,
  34, 25, 24, 35, 41, 9, 22, 50, 4, 48,
  7, 14, 46, 10, 38, 32, 26, 18, 23, 45,
  30, 16, 17, 20, 40, 47, 31, 37, 27, 11,
  5, 44, 29, 19, 13, 36, 39, 42, 28, 2
)
d <- email[index, ][order, ]
identical(d, email50)
Pew Research conducted a poll to find whether American adults support regulation or believe the private market will move the American economy towards renewable energy.
env_regulation
A data frame with 705 observations on the following variable.
There were three possible outcomes for each person: "Regulations necessary", "Private marketplace will ensure", and "Don't know".
The exact statements being selected were: (1) Government regulations are necessary to encourage businesses and consumers to rely more on renewable energy sources. (2) The private marketplace will ensure that businesses and consumers rely more on renewable energy sources, even without government regulations.
The actual sample size was 1012. However, the original data were not from a simple random sample; after accounting for the design, the equivalent sample size was about 705, which was what was used for the dataset here to keep things simpler for intro stat analyses.
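The gap between the actual and equivalent sample sizes can be expressed as a ratio, often called a design effect. The sketch below is an illustration only: the term "design effect", the variable names, and the helper `se_prop()` are not part of the original documentation, and the proportion `p` is a hypothetical value chosen to show how the standard error changes.

```r
# Sample sizes stated in the dataset description
n_actual <- 1012
n_equivalent <- 705

# Implied design effect (assumed definition: actual n / equivalent n)
design_effect <- n_actual / n_equivalent

# Hypothetical sample proportion, used only for illustration
p <- 0.5

# Standard error of a proportion under simple random sampling
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

se_naive <- se_prop(p, n_actual)        # ignores the survey design
se_adjusted <- se_prop(p, n_equivalent) # uses the equivalent sample size

round(
  c(design_effect = design_effect, se_naive = se_naive, se_adjusted = se_adjusted),
  4
)
```

Working with 705 rather than 1012 observations gives a somewhat larger standard error, which is the practical consequence of accounting for the design.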
table(env_regulation)
Details from the EPA.
epa2012
A data frame with 1129 observations on the following 28 variables.
a numeric vector
Manufacturer name.
Vehicle division.
Vehicle line.
Manufacturer code.
Model type index.
Engine displacement.
Number of cylinders.
Transmission speed.
City mileage.
Highway mileage.
Combined mileage.
Whether the car is considered a "guzzler" or not, a factor with levels N and Y.
Air aspiration method.
Air aspiration method description.
Transmission type.
Transmission type description.
Number of gears.
Whether transmission locks up, a factor with levels N and Y.
A factor with level N only.
Drive system, a factor with levels.
Drive system description.
Fuel usage, a factor with levels.
Fuel usage description.
Class of car.
Car or truck, a factor with levels car, 1, 2.
Date of vehicle release.
Whether the car has a fuel cell or not, a factor with levels N, Y.
Fueleconomy.gov, Shared MPG Estimates: Toyota Prius 2012.
epa2021
library(ggplot2)
library(dplyr)

# Variable descriptions
distinct(epa2012, air_aspir_method_desc, air_aspir_method)
distinct(epa2012, transmission_desc, transmission)
distinct(epa2012, drive_desc, drive_sys)
distinct(epa2012, fuel_usage_desc, fuel_usage)

# Guzzlers and their mileages
ggplot(epa2012, aes(x = city_mpg, y = hwy_mpg, color = guzzler)) +
  geom_point() +
  facet_wrap(~guzzler, ncol = 1)
Details from the EPA.
epa2021
A data frame with 1108 observations on the following 28 variables.
a numeric vector
Manufacturer name.
Vehicle division.
Vehicle line.
Manufacturer code.
Model type index.
Engine displacement.
Number of cylinders.
Transmission speed.
City mileage.
Highway mileage.
Combined mileage.
Whether the car is considered a "guzzler" or not, a factor with levels N and Y.
Air aspiration method.
Air aspiration method description.
Transmission type.
Transmission type description.
Number of gears.
Whether transmission locks up, a factor with levels N and Y.
A factor with level N only.
Drive system, a factor with levels.
Drive system description.
Fuel usage, a factor with levels.
Fuel usage description.
Class of car.
Car or truck, a factor with levels car, 1, ??, 1.
Date of vehicle release.
Whether the car has a fuel cell or not, a factor with levels N, NA.
Fuel Economy Data from fueleconomy.gov. Retrieved 6 May, 2021.
epa2012
library(ggplot2)
library(dplyr)

# Variable descriptions
distinct(epa2021, air_aspir_method_desc, air_aspir_method)
distinct(epa2021, transmission_desc, transmission)
distinct(epa2021, drive_desc, drive_sys)
distinct(epa2021, fuel_usage_desc, fuel_usage)

# Guzzlers and their mileages
ggplot(epa2021, aes(x = city_mpg, y = hwy_mpg, color = guzzler)) +
  geom_point() +
  facet_wrap(~guzzler, ncol = 1)

# Compare to 2012
epa2021 |>
  bind_rows(epa2012) |>
  group_by(model_yr) |>
  summarise(
    mean_city = mean(city_mpg),
    mean_hwy = mean(hwy_mpg)
  )
This dataset comes from the 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship. Countries are given an overall sustainability score as well as scores in each of several different environmental areas.
esi
A data frame with 146 observations on the following 29 variables.
ISO3 country code.
Country.
Environmental Sustainability Index.
ESI core component: systems.
ESI core component: stresses.
ESI core component: vulnerability.
ESI core component: capacity.
ESI core component: global stewardship.
Air quality.
Biodiversity.
Land.
Water quality.
Water quantity.
Reducing air pollution.
Reducing ecosystem stress.
Reducing population pressure.
Reducing waste and consumption pressures.
Reducing water stress.
Natural resource management.
Environmental health.
Basic human sustenance.
Exposure to natural disasters.
Environmental governance.
Eco-efficiency.
Private sector responsiveness.
Science and technology.
Participation in international collaboration efforts.
Greenhouse gas emissions.
Reducing transboundary environmental pressures.
ESI and Component scores are presented as standard normal percentiles. Indicator scores are in the form of z-scores. See Appendix A of the report for information on the methodology and Appendix C for more detail on original data sources.
For more information on how each of the indices was calculated, see the documentation linked below.
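The two scales mentioned above, z-scores and standard normal percentiles, are related through the standard normal cumulative distribution function. The following is a minimal sketch of that relationship only; the z-scores are hypothetical values rather than entries from the esi data, and the code does not attempt to reproduce how the report aggregates or weights indicators.

```r
# Hypothetical indicator z-scores (not taken from the esi data)
z <- c(-1.5, 0, 0.5, 2)

# Standard normal percentile: percentage of a standard normal
# distribution that falls below each z-score
percentile <- 100 * pnorm(z)

data.frame(z = z, percentile = round(percentile, 1))
```

A z-score of 0 corresponds to the 50th percentile, and larger z-scores map to higher percentiles.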
ESI Component Indicators. 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship, Yale Center for Environmental Law and Policy, Yale University & Center for International Earth Science Information Network (CIESIN), Columbia University
In collaboration with: World Economic Forum, Geneva, Switzerland Joint Research Centre of the European Commission, Ispra, Italy.
Available at https://www.earth.columbia.edu/news/2005/images/ESI2005_policysummary.pdf.
Esty, Daniel C., Marc Levy, Tanja Srebotnjak, and Alexander de Sherbinin (2005). 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship. New Haven: Yale Center for Environmental Law and Policy
library(ggplot2)

ggplot(esi, aes(x = cap_st, y = glo_col)) +
  geom_point(color = ifelse(esi$code == "USA", "red", "black")) +
  geom_text(
    aes(label = ifelse(code == "USA", as.character(code), "")),
    hjust = 1.2, color = "red"
  ) +
  labs(x = "Science and technology", y = "Participation in international collaboration efforts")

ggplot(esi, aes(x = vulner, y = cap)) +
  geom_point(color = ifelse(esi$code == "USA", "red", "black")) +
  geom_text(
    aes(label = ifelse(code == "USA", as.character(code), "")),
    hjust = 1.2, color = "red"
  ) +
  labs(x = "Vulnerability", y = "Capacity")
An experiment in which three different ethanol treatments were tested on oral cancer tumors in hamsters.
ethanol
A data frame with 24 observations, each representing one hamster, on the following 2 variables.
Treatment the hamster received.
a factor with levels no, yes
The ethyl_cellulose and pure_ethanol treatments consisted of about a quarter of the volume of the tumors, while the pure_ethanol_16x treatment was 16x that, so about 4 times the size of the tumors.
Morhard R, et al. 2017. Development of enhanced ethanol ablation as an alternative to surgery in treatment of superficial solid tumors. Scientific Reports 7:8750.
table(ethanol)
fisher.test(table(ethanol))
The data were gathered from end-of-semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin. In addition, six students rated the professors' physical appearance. The result is a data frame where each row contains a different course and each column has information on the course and the professor who taught that course.
evals
A data frame with 463 observations on the following 23 variables.
Variable identifying the course (out of 463 courses).
Variable identifying the professor who taught the course (out of 94 professors).
Average professor evaluation score: (1) very unsatisfactory - (5) excellent.
Rank of professor: teaching, tenure track, tenured.
Ethnicity of professor: not minority, minority.
Gender of professor: female, male.
Language of school where professor received education: English or non-English.
Age of professor.
Percent of students in class who completed evaluation.
Number of students in class who completed evaluation.
Total number of students in class.
Class level: lower, upper.
Number of professors teaching sections in course in sample: single, multiple.
Number of credits of class: one credit (lab, PE, etc.), multi credit.
Beauty rating of professor from lower level female: (1) lowest - (10) highest.
Beauty rating of professor from upper level female: (1) lowest - (10) highest.
Beauty rating of professor from second upper level female: (1) lowest - (10) highest.
Beauty rating of professor from lower level male: (1) lowest - (10) highest.
Beauty rating of professor from upper level male: (1) lowest - (10) highest.
Beauty rating of professor from second upper level male: (1) lowest - (10) highest.
Average beauty rating of professor.
Outfit of professor in picture: not formal, formal.
Color of professor's picture: color, black & white.
Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, 2005. doi:10.1016/j.econedurev.2004.07.013.
evals
Grades on three exams and overall course grade for 233 students during several years for a statistics course at a university.
exam_grades
A data frame with 233 observations, each representing a student.
Semester when grades were recorded.
Sex of the student as recorded on the university registration system: Man or Woman.
Exam 1 grade.
Exam 2 grade.
Exam 3 grade.
Overall course grade.
library(ggplot2)
library(dplyr)

# Course grade vs. each exam
ggplot(exam_grades, aes(x = exam1, y = course_grade)) +
  geom_point()

ggplot(exam_grades, aes(x = exam2, y = course_grade)) +
  geom_point()

ggplot(exam_grades, aes(x = exam3, y = course_grade)) +
  geom_point()

# Semester averages (missing grades are dropped before averaging)
exam_grades |>
  group_by(semester) |>
  summarise(across(exam1:course_grade, \(x) mean(x, na.rm = TRUE)))
Exam scores from a class of 19 students.
exams
A data frame with 19 observations on the following variable.
a numeric vector
hist(exams$scores)
A survey conducted on a reasonably random sample of 203 undergraduates asked, among many other questions, about the number of exclusive relationships these students have been in.
exclusive_relationship
A data frame with 218 observations on the following variable.
Number of exclusive relationships.
summary(exclusive_relationship$num)
table(exclusive_relationship$num)
hist(exclusive_relationship$num)
Pew Research Center conducted a survey in 2018, asking a sample of U.S. adults to categorize five factual and five opinion statements. This dataset provides data from this survey, with information on the age group of the participant as well as the number of factual and opinion statements they classified correctly (out of 5).
fact_opinion
A data frame with 5,035 rows and 3 variables.
Age group of survey participant.
Number of factual statements classified correctly (out of 5).
Number of opinion statements classified correctly (out of 5).
Younger Americans are better than older Americans at telling factual news statements from opinions, Pew Research Center, October 23, 2018.
library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)

# Distribution of fact_correct by age group
ggplot(fact_opinion, aes(x = age_group, y = fact_correct)) +
  geom_boxplot() +
  labs(
    x = "Age group",
    y = "Number correct (factual)",
    title = "Number of factual statements classified correctly by age group"
  )

# Distribution of opinion_correct by age group
ggplot(fact_opinion, aes(x = age_group, y = opinion_correct)) +
  geom_boxplot() +
  labs(
    x = "Age group",
    y = "Number correct (opinion)",
    title = "Number of opinion statements classified correctly by age group"
  )

# Replicating the figure from Pew report (see source for link)
fact_opinion |>
  mutate(
    facts = case_when(
      fact_correct <= 2 ~ "Two or fewer",
      fact_correct %in% c(3, 4) ~ "Three or four",
      fact_correct == 5 ~ "All five"
    ),
    facts = fct_relevel(facts, "Two or fewer", "Three or four", "All five"),
    opinions = case_when(
      opinion_correct <= 2 ~ "Two or fewer",
      opinion_correct %in% c(3, 4) ~ "Three or four",
      opinion_correct == 5 ~ "All five"
    ),
    opinions = fct_relevel(opinions, "Two or fewer", "Three or four", "All five")
  ) |>
  select(-fact_correct, -opinion_correct) |>
  pivot_longer(cols = -age_group, names_to = "question_type", values_to = "n_correct") |>
  ggplot(aes(y = fct_rev(age_group), fill = n_correct)) +
  geom_bar(position = "fill") +
  facet_wrap(~question_type, ncol = 1) +
  scale_fill_viridis_d(guide = guide_legend(reverse = TRUE)) +
  labs(
    x = "Proportion",
    y = "Age group",
    fill = "Number of\ncorrect\nclassifications"
  )
Fade colors so they are transparent.
fadeColor(col, fade = "FF")
col | An integer, color name, or RGB hexadecimal. |
fade | The amount to fade |
David Diez
data(mariokart)
new <- mariokart$cond == "new"
used <- mariokart$cond == "used"

# ===> color numbers <===#
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80), pch = 20,
  col = 2, cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used], at = 2, add = TRUE, col = 4, pch = 20, cex = 2)
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80),
  col = fadeColor(2, "22"), pch = 20, cex = 2,
  main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = fadeColor(4, "22"), pch = 20, cex = 2
)

# ===> color names <===#
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80), pch = 20,
  col = "red", cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used], at = 2, add = TRUE, col = "blue", pch = 20, cex = 2)
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80),
  col = fadeColor("red", "22"), pch = 20, cex = 2,
  main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = fadeColor("blue", "22"), pch = 20, cex = 2
)

# ===> hexadecimal <===#
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80), pch = 20,
  col = "#FF0000", cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = "#0000FF", pch = 20, cex = 2
)
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80),
  col = fadeColor("#FF0000", "22"), pch = 20, cex = 2,
  main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = fadeColor("#0000FF", "22"), pch = 20, cex = 2
)

# ===> alternative: rgb function <===#
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80), pch = 20,
  col = rgb(1, 0, 0), cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = rgb(0, 0, 1), pch = 20, cex = 2
)
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80),
  col = rgb(1, 0, 0, 1 / 8), pch = 20, cex = 2,
  main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = rgb(0, 0, 1, 1 / 8), pch = 20, cex = 2
)
A simulated dataset based on real population summaries.
family_college
A data frame with 792 observations on the following 2 variables.
Whether the teen goes to college or not.
Whether the parent holds a college degree or not.
Simulation based on summary information provided at https://eric.ed.gov/?id=ED460660.
library(dplyr)

family_college |>
  count(teen, parents)
This dataset is a subset of the larger dataset from the Functional SNPs Associated with Muscle Size and Strength (FAMuSS) study by Thompson et al. It contains demographic and response variables, along with the SNP coding, for the study participants. Unlike the data in the previous version of the oibiostat data package, this dataset retains the missing values. The data are also discussed in the Foulkes text. Strength was measured in both dominant and non-dominant arms before and after resistance training. The particular gene of interest was ACTN3, the "sports gene."
famuss
A tibble with 1397 rows and 10 variables
ndrm.ch
A numeric vector, the percent change in strength in the non-dominant arm from before training to after.
drm.ch
A numeric vector, the percent change in strength in the dominant arm.
sex
A factor with levels Female and Male.
age
A numeric vector, age in years.
race
A factor with levels African Am, Asian, Caucasian, Hispanic, and Other.
height
A numeric vector, height in inches.
weight
A numeric vector, weight in pounds.
actn3.r577x
A factor with levels CC, CT, and TT that shows the genotype at residue rs540874 (location r577x) within the ACTN3 SNP.
bmi
A numeric vector, body mass index
Personal communication from A. Foulkes
Thompson P, Moyna N, Seip R, et al. Medicine and Science in Sports and Exercise 36(7): 1132-1139, 2004. Clarkson P, et al. Journal of Applied Physiology 99: 154-163, 2005. Pescatello L, et al. Highlights from the functional single nucleotide polymorphisms associated with human muscle size and strength or FAMuSS study. BioMed Research International, 2013. Foulkes, Andrea S. Applied Statistical Genetics using R for Population Association Studies. Springer, 2009.
Nutrition amounts in 515 fast food items. The author of the data scraped only entrees (not sides, drinks, desserts, etc.).
fastfood
A data frame with 515 observations on the following 17 variables.
Name of restaurant
Name of item
Number of calories
Calories from fat
Total fat
Saturated fat
Trans fat
Cholesterol
Sodium
Total carbs
Fiber
Sugar
Protein
Vitamin A
Vitamin C
Calcium
Salad or not
Retrieved from Tidy Tuesday Fast food entree data.
Sample of heights based on the weighted sample in the survey.
fcid
A data frame with 100 observations on the following 2 variables.
a numeric vector
a numeric vector
fcid
24 sample observations.
fheights
A data frame with 24 observations on the following variable.
height, in inches
hist(fheights$heights)
Samples of 50 Tobis fish, or Sand Eels, were collected at three different locations in the North Sea and the number of one-year-old fish was counted.
fish_age
A data frame with 300 rows and 3 variables:
Year the fish was caught with levels 1997 and 1998.
Site the fish was caught with levels A, B and C.
Is the fish one-year-old, yes or no?
Henrik Madsen, Paul Thyregod. 2011. Introduction to General and Generalized Linear Models. CRC Press, Boca Raton, FL. ISBN: 978-1-4200-9155-7.
library(dplyr)
library(tidyr)

# Count the number of one-year-old fish at each location.
fish_age |>
  filter(one_year_old == "yes") |>
  count(year, location) |>
  pivot_wider(names_from = location, values_from = n)
The results summarize each of the health outcomes for an experiment where 12,933 subjects received a 1g fish oil supplement daily and 12,938 received a placebo daily. The experiment lasted 5 years.
fish_oil_18
The format is a list of 24 matrices. Each matrix is a 2x2 table, and below are the named items in the list, which also represent the outcomes.
Major cardiovascular event. (Primary end point.)
Cardiovascular event in expanded composite endpoint.
Total myocardial infarction. (Heart attack.)
Total stroke.
Death from cardiovascular causes.
Percutaneous coronary intervention.
Coronary artery bypass graft.
Total coronary heart disease.
Ischemic stroke.
Hemorrhagic stroke.
Death from coronary heart disease.
Death from myocardial infarction.
Death from stroke.
Invasive cancer of any type. (Primary end point.)
Breast cancer.
Prostate cancer.
Colorectal cancer.
Death from cancer.
Death from any cause.
Major cardiovascular event, excluding the first 2 years of follow-up.
Total myocardial infarction, excluding the first 2 years of follow-up.
Invasive cancer of any type, excluding the first 2 years of follow-up.
Death from cancer, excluding the first 2 years of follow-up.
Death from any cause, excluding the first 2 years of follow-up.
Manson JE, et al. 2018. Marine n-3 Fatty Acids and Prevention of Cardiovascular Disease and Cancer. NEJMoa1811403. doi:10.1056/NEJMoa1811403.
names(fish_oil_18)

(tab <- fish_oil_18[["major_cardio_event"]])
chisq.test(tab)
fisher.test(tab)

(tab <- fish_oil_18[["myocardioal_infarction"]])
chisq.test(tab)
fisher.test(tab)
Flow rates (measured in cubic feet per second) of Clarks Creek, Leach Creek, Silver Creek, and Wildwood Creek Spring collected by volunteers of the Pierce Conservation District in the State of Washington in the US.
flow_rates
A data frame with 31 rows and 3 variables.
Location where measurements were taken.
Date measurements were taken.
Flow rate of the river in cubic feet per second.
Pierce County Water Data Viewer.
library(ggplot2)

# River flow rates by site
ggplot(flow_rates, aes(x = site, y = flow)) +
  geom_boxplot() +
  labs(
    title = "River flow rates by site",
    x = "Site",
    y = expression(paste("Flow (ft"^3 * "/s)"))
  )

# River flow rates over time
ggplot(flow_rates, aes(x = date, y = flow, color = site, shape = site)) +
  geom_point(size = 2) +
  labs(
    title = "River flow rates over time",
    x = "Date",
    y = expression(paste("Flow (ft"^3 * "/s)")),
    color = "Site",
    shape = "Site"
  )
Contains a subset of the variables from a larger 1987 study analyzing the effect of habitat fragmentation on bird abundance in the Latrobe Valley of southeastern Victoria, Australia. Habitat fragmentation is the process in which land development disrupts the native habitat of certain species. The dataset has variables on forest bird abundance in a forest patch (typically the response of interest) and features of the patch.
forest.birds
A tibble with 56 rows and 8 variables:
abundance
Numeric vector. Average number of forest birds observed in the patch, as calculated from several independent 20-minute counting sessions.
patch.area
Numeric vector. The area of the patch. Areas were measured in hectares; 1 hectare is 10,000 square meters and approximately 2.47 acres.
year.of.isolation
The year the patch was isolated by fragmentation of local environment.
dist.nearest
Numeric vector. Distance to the nearest patch, measured in kilometers.
dist.larger
Numeric vector. Distance to the nearest patch that is larger than the current patch, measured in kilometers.
grazing.intensity
Factor. A score indicating the extent of livestock grazing. The categories are: "light", "less than average", "average", "moderately heavy", "heavy".
altitude
Numeric vector. Altitude of the patch, measured in meters.
yrs.isolation
Numeric vector. Number of years of isolation at the time the study was conducted (1983). Computed as 1983 - year.of.isolation.
https://users.monash.edu.au/~murray/BDAR/ Listed under chapter 9 datasets
Loyn R.H. 1987. "Effects of patch area and habitat on bird abundances, species numbers and tree health in fragmented Victorian forests." In Nature Conservation: The Role of Remnants of Native Vegetation. Saunders DA, Arnold GW, Burbridge AA, and Hopkins AJM eds. Surrey Beatty and Sons, Chipping Norton, NSW, 65-77. Logan, M. 2011. Biostatistical Design and Analysis Using R. Wiley-Blackwell, Chapter 9.
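The derived variable yrs.isolation follows directly from year.of.isolation. The sketch below reproduces that arithmetic on a small made-up data frame; the three years are hypothetical and are not values from forest.birds.

```r
# Hypothetical patches; these years are illustrative, not values from forest.birds.
patches <- data.frame(year.of.isolation = c(1890, 1950, 1976))

# Years of isolation at the time of the study (1983).
study_year <- 1983
patches$yrs.isolation <- study_year - patches$year.of.isolation

patches
```

The same subtraction applied to the year.of.isolation column of forest.birds should reproduce its yrs.isolation column, assuming the documented definition holds for every row.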
This dataset addresses issues of how superstitions regarding Friday the 13th affect human behavior, and whether Friday the 13th is an unlucky day. Scanlon, et al. collected data on traffic and shopping patterns and accident frequency for Fridays the 6th and 13th between October of 1989 and November of 1992.
friday
A data frame with 61 observations and 6 variables.
Type of observation, traffic, shopping, or accident.
Year and month of observation.
Counts on the 6th of the month.
Counts on the 13th of the month.
Difference between the sixth and the thirteenth.
Location where data is collected.
There are three types of observations: traffic, shopping, and accident. For traffic, the researchers obtained information from the British Department of Transport regarding the traffic flows between junctions 7 to 8 and junctions 9 to 10 of the M25 motorway. For shopping, they collected the numbers of shoppers in nine different supermarkets in southeast England. For accidents, they collected numbers of emergency admissions to hospitals due to transport accidents.
Scanlon, T.J., Luben, R.N., Scanlon, F.L., Singleton, N. (1993), "Is Friday the 13th Bad For Your Health?," BMJ, 307, 1584-1586. https://dasl.datadescription.com/datafile/friday-the-13th-traffic and https://dasl.datadescription.com/datafile/friday-the-13th-accidents.
library(dplyr)
library(ggplot2)

friday |>
  filter(type == "traffic") |>
  ggplot(aes(x = sixth)) +
  geom_histogram(binwidth = 2000) +
  xlim(110000, 140000)

friday |>
  filter(type == "traffic") |>
  ggplot(aes(x = thirteenth)) +
  geom_histogram(binwidth = 2000) +
  xlim(110000, 140000)
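Because each row pairs a count from the 6th with a count from the 13th of the same month, a paired comparison is one natural way to use the difference variable. The sketch below uses made-up counts, not rows from friday, and the column name diff is an assumption. It shows that a paired t-test on the two columns matches a one-sample t-test on their difference.

```r
# Hypothetical paired counts; illustrative only, not rows from friday.
counts <- data.frame(
  sixth      = c(1200, 1350, 1280, 1410),
  thirteenth = c(1150, 1340, 1300, 1360)
)

# Difference between the sixth and the thirteenth (column name assumed).
counts$diff <- counts$sixth - counts$thirteenth

# A paired t-test on the two columns is equivalent to a
# one-sample t-test on the differences.
paired     <- t.test(counts$sixth, counts$thirteenth, paired = TRUE)
one_sample <- t.test(counts$diff, mu = 0)

c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
```

With the real data, the same comparison would usually be run separately for each type of observation, since traffic, shopping, and accident counts are on very different scales.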
From February to April 2013, the study team studied various populations of frogs living between 2035 and 3494 m above sea level in the eastern Tibetan Plateau. They located breeding ponds at various altitudes and, at each one, obtained a small sample of freshly laid egg clutches. They counted the number of eggs and weighed the clutch to determine egg weight, and approximated egg size from photographs. The data are used to estimate whether maternal investment changes at varying altitudes on the Tibetan Plateau. Investment is assessed by measuring how reproducing females allocate their energy between egg size and egg number, both characteristics of offspring fitness. Source data on size and volume in log_10 scale have been converted to standard numeric scale.
frog
A data frame with 431 observations on the following 6 variables.
altitude
Numeric, altitude of study site in meters above sea level.
latitude
Numeric, latitude of study site measured in degrees.
clutch.size
Numeric, estimated number of eggs in clutch.
body.size
Numeric, length of mother frog who laid the egg clutch in cm.
clutch.volume
Numeric, volume of egg clutch in mm^3.
egg.size
Numeric, average diameter of an individual egg to the 0.01mm.
https://dx.doi.org/10.5061/dryad.6v0c1
Chen, W., et al. Maternal investment increases with altitude in a frog on the Tibetan Plateau. Journal of evolutionary biology 26.12 (2013): 2710-2715. https://doi.org/10.1111/jeb.12271
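The description notes that size and volume were converted from log_10 scale to the standard numeric scale. A minimal sketch of that back-transformation follows; the input values are hypothetical and are not taken from the source data.

```r
# Hypothetical log10-scale values; illustrative only, not taken from the source data.
log10_clutch_volume <- c(2.5, 3.0, 3.2)

# Back-transform from log10 scale to the standard numeric scale (mm^3).
clutch_volume <- 10^log10_clutch_volume

# Round-trip check: taking log10 again recovers the original values.
all.equal(log10(clutch_volume), log10_clutch_volume)
```

Analyses that assume roughly symmetric distributions may still prefer the log scale, in which case log10() can be applied to clutch.volume or egg.size directly.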
Poll about use of full-body airport scanners, where about 4-in-5 people supported the use of the scanners.
full_body_scan
A data frame with 1137 observations on the following 2 variables.
a factor with levels do not know / no answer, should, should not
a factor with levels Democrat, Independent, Republican
S. Condon. Poll: 4 in 5 Support Full-Body Airport Scanners. In: CBS News (2010).
full_body_scan
From World Bank, GDP in current U.S. dollars 1960-2020 by decade
gdp_countries
A data frame with 659 rows and 9 variables.
Name of country.
description of data: GDP (in current US$), GDP growth (annual %), GDP per capita (in current US$)
value in 1960
value in 1970
value in 1980
value in 1990
value in 2000
value in 2010
value in 2020
library(dplyr)

# don't use scientific notation
options(scipen = 999)

# List the top 10 countries by GDP (there is a row for World, hence n = 11).
# Rank on the numeric column first, then format it for display.
gdp_countries |>
  filter(description == "GDP") |>
  slice_max(year_2020, n = 11) |>
  mutate(year2020 = format(year_2020, big.mark = ",")) |>
  select(country, year2020)

# List the 10 countries with the biggest GDP per capita change from 1960 to 2020
gdp_countries |>
  filter(description == "GDP per capita") |>
  mutate(change = round(year_2020 - year_1960, 0)) |>
  select(country, change, year_1960, year_2020) |>
  na.omit() |>
  slice_max(change, n = 10)
Made-up data for whether a sample of two gear companies' parts pass inspection.
gear_company
A data frame with 2000 observations on the following 2 variables.
a factor with levels current, prospective
a factor with levels not, pass
gear_company
Study from the 1970s about whether gender influences hiring recommendations.
gender_discrimination
A data frame with 48 observations on the following 2 variables.
a factor with levels female and male
a factor with levels not promoted and promoted
Rosen B and Jerdee T. 1974. Influence of sex role stereotypes on personnel decisions. Journal of Applied Psychology 59(1):9-14.
library(ggplot2)

table(gender_discrimination)

ggplot(gender_discrimination, aes(y = gender, fill = decision)) +
  geom_bar(position = "fill")
Get it Dunn is a small regional run that got extra attention when a runner, Nichole Porath, made the Guinness Book of World Records for the fastest time pushing a double stroller in a half marathon. This dataset contains results from the 2017 and 2018 races.
get_it_dunn_run
A data frame with 978 observations on the following 10 variables.
Date of the run.
Run distance.
Bib number of the runner.
First name of the runner.
Initial of the runner's last name.
Sex of the runner.
Age of the runner.
City of residence.
State of residence.
Run time, in minutes.
Data were collected from GSE Timing: 2018 data, 2017 race data.
d <- subset(
  get_it_dunn_run,
  race == "5k" & date == "2018-05-12" & !is.na(age) & state %in% c("MN", "WI")
)
head(d)

m <- lm(run_time_minutes ~ sex + age + state, d)
summary(m)

plot(m$fitted, m$residuals)
boxplot(m$residuals ~ d$sex)
plot(m$residuals ~ d$age)
hist(m$residuals)
An investigator is interested in understanding the relationship, if any, between the analytical skills of young gifted children and the following variables: father's IQ, mother's IQ, age in months when the child first said "mummy" or "daddy", age in months when the child first counted to 10 successfully, average number of hours per week the child's mother or father reads to the child, average number of hours per week the child watched an educational program on TV during the past three months, and average number of hours per week the child watched cartoons on TV during the past three months. The analytical skills are evaluated using a standard testing procedure, and the score on this test is used as the response variable.
gifted
A data frame with 36 observations and 8 variables.
Score in test of analytical skills.
Father's IQ.
Mother's IQ.
Age in months when the child first said "mummy" or "daddy".
Age in months when the child first counted to 10 successfully.
Average number of hours per week the child's mother or father reads to the child.
Average number of hours per week the child watched an educational program on TV during the past three months.
Average number of hours per week the child watched cartoons on TV during the past three months.
Data were collected from schools in a large city on a set of thirty-six children who were identified as gifted children soon after they reached the age of four.
Graybill, F.A. & Iyer, H.K., (1994) Regression Analysis: Concepts and Applications, Duxbury, p. 511-6.
gifted
A 2010 Pew Research poll asked 1,306 Americans, "From what you've read and heard, is there solid evidence that the average temperature on earth has been getting warmer over the past few decades, or not?"
global_warming_pew
A data frame with 2253 observations on the following 2 variables.
a factor with levels Conservative Republican, Liberal Democrat, Mod/Cons Democrat, Mod/Lib Republican
Response.
Pew Research Center, Majority of Republicans No Longer See Evidence of Global Warming, data collected on October 27, 2010.
global_warming_pew
Google stock data from 2006 to early 2014, where data from the first day each month was collected.
goog
A data frame with 98 observations on the following 7 variables.
a factor with levels 2006-01-03, 2006-02-01, and so on
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Yahoo! Finance.
goog
The poll's focus is on Obama and then Democrats and Republicans in Congress.
gov_poll
A data frame with 4223 observations on the following 2 variables.
a factor with levels approve, disapprove
a factor with levels Democrats, Obama, Republicans
See the Pew Research website: www.people-press.org/2012/03/14/romney-leads-gop-contest-trails-in-matchup-with-obama. The counts in Table 6.19 are approximate.
gov_poll
A survey of 55 Duke University students asked about their GPA, number of hours they study at night, number of nights they go out, and their gender.
gpa
A data frame with 55 observations on the following 5 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a factor with levels female, male
gpa
Data on 78 students including GPA, IQ, and gender.
gpa_iq
A data frame with 78 observations representing students on the following 5 variables.
a numeric vector
Grade point average (GPA).
IQ.
Gender.
a numeric vector
gpa_iq
A data frame with 193 rows and 2 columns. The columns represent the variables gpa and study_hours for a sample of 193 undergraduate students who took an introductory statistics course in 2012 at a private US university.
gpa_study_hours
A data frame with 193 observations on the following 2 variables.
Grade point average (GPA) of student.
Number of hours students study per week.
GPA ranges from 0 to 4 points; however, one student reported a GPA > 4. This is a data error, but the observation has been left in the dataset because it is used to illustrate issues with real survey data. Both variables are self-reported and hence may not be accurate.
Collected at a private US university as part of an anonymous survey in an introductory statistics course.
library(ggplot2)

ggplot(gpa_study_hours, aes(x = study_hours, y = gpa)) +
  geom_point(alpha = 0.5) +
  labs(x = "Study hours/week", y = "GPA")
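Since the details mention a reported GPA above the 0 to 4 range, one reasonable first step is to flag such values rather than silently drop them. The sketch below does this on a small made-up data frame; the rows are hypothetical and the helper column gpa_out_of_range is introduced here for illustration only.

```r
# Hypothetical survey responses; illustrative only, not rows from gpa_study_hours.
survey <- data.frame(
  gpa         = c(3.2, 3.9, 4.3, 2.8),
  study_hours = c(10, 25, 15, 8)
)

# Flag GPAs outside the valid 0-4 range (gpa_out_of_range is a hypothetical helper column).
survey$gpa_out_of_range <- survey$gpa < 0 | survey$gpa > 4

# Inspect the flagged rows before deciding how to handle them.
survey[survey$gpa_out_of_range, ]
```

Keeping the flag as a column makes it easy to run an analysis with and without the questionable observation and compare the results.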
This is a simulated dataset to be used to estimate the relationship between number of hours per week students watch TV and the grade they got in a statistics class.
gradestv
A data frame with 25 observations on the following 2 variables.
Number of hours per week students watch TV.
Grades students got in a statistics class (out of 100).
There are a few potential outliers in this dataset. When analyzing the data one should consider how (if at all) these outliers may affect the estimates of correlation coefficient and regression parameters.
Simulated data
library(ggplot2)

ggplot(gradestv, aes(x = tv, y = grades)) +
  geom_point() +
  geom_smooth(method = "lm")
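To see how a single outlier can change the correlation coefficient and the regression slope, it can help to fit the model with and without the suspect point. The sketch below uses made-up numbers, not values from gradestv, in which the last observation is deliberately unusual.

```r
# Hypothetical data with one unusual point; not values from gradestv.
tv     <- c(5, 10, 15, 20, 25, 30, 2)
grades <- c(90, 85, 80, 76, 70, 66, 40)

# Correlation and slope using every observation.
cor_all   <- cor(tv, grades)
slope_all <- coef(lm(grades ~ tv))[["tv"]]

# Same estimates after setting aside the last point (2 hours of TV, grade of 40).
keep       <- seq_along(tv) != 7
cor_trim   <- cor(tv[keep], grades[keep])
slope_trim <- coef(lm(grades[keep] ~ tv[keep]))[[2]]

round(c(cor_all = cor_all, cor_trim = cor_trim,
        slope_all = slope_all, slope_trim = slope_trim), 3)
```

In this toy example the remaining six points are almost perfectly linear, so removing one observation moves the correlation from near zero to strongly negative. Whether removal is justified for real data is a separate judgment that the comparison alone cannot settle.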
The data were simulated to look like sample results from a Google search experiment.
gsearch
A data frame with 10000 observations on the following 2 variables.
a factor with levels new search, no new search
a factor with levels current, test 1, test 2
library(ggplot2)

table(gsearch$type, gsearch$outcome)

ggplot(gsearch, aes(x = type, fill = outcome)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
A data frame containing data from the General Social Survey.
gss_wordsum_class
A data frame with 795 observations on the following 2 variables.
A vocabulary score calculated based on a ten question vocabulary test, where a higher score means better vocabulary. Scores range from 1 to 10.
Self-identified social class has 4 levels: lower, working, middle, and upper class.
library(dplyr)

gss_wordsum_class |>
  group_by(class) |>
  summarize(mean_wordsum = mean(wordsum))
Data from the 2010 General Social Survey.
gss2010
A data frame with 2044 observations on the following 5 variables.
After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?
For how many days during the past 30 days was your mental health, which includes stress, depression, and problems with emotions, not good?
Hours worked each week.
Educational attainment or degree.
Do you think the use of marijuana should be made legal, or not?
US 2010 General Social Survey.
gss2010
Survey responses from 20,000 respondents to the Behavioral Risk Factor Surveillance System.
health_coverage
A data frame with 20000 observations on the following 2 variables.
Whether the person had health coverage or not.
The person's health status.
Office of Surveillance, Epidemiology, and Laboratory Services Behavioral Risk Factor Surveillance System, BRFSS 2010 Survey Data.
table(health_coverage)
Pew Research Center conducted a survey with the following question: "As you may know, by 2014 nearly all Americans will be required to have health insurance. [People who do not buy insurance will pay a penalty] while [People who cannot afford it will receive financial help from the government]. Do you approve or disapprove of this policy?" For each randomly sampled respondent, the statements in brackets were randomized: either they were kept in the order given above, or the two statements were reversed.
healthcare_law_survey
A data frame with 1503 observations on the following 2 variables.
a factor with levels cannot_afford_second, penalty_second
a factor with levels approve, disapprove, other
www.people-press.org/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/. Sample sizes for each polling group are approximate.
healthcare_law_survey
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that the patient was gravely ill and would most likely benefit from a new heart. The actual transplant then occurred anywhere from a few weeks to several months later, depending on the availability of a donor. A very small number of candidates showed improvement during this waiting period and were deselected as heart transplant candidates, but for the purposes of this experiment those patients were kept in the data as continuing candidates.
heart_transplant
A data frame with 103 observations on the following 8 variables.
ID number of the patient.
Year of acceptance as a heart transplant candidate.
Age of the patient at the beginning of the study.
Survival status with levels alive and dead.
Number of days patients were alive after the date they were determined to be a candidate for a heart transplant until the termination date of the study.
Whether or not the patient had prior surgery with levels yes and no.
Transplant status with levels control (did not receive a transplant) and treatment (received a transplant).
Waiting Time for Transplant
http://www.stat.ucla.edu/~jsanchez/data/stanford.txt
Turnbull B, Brown B, and Hu M (1974). "Survivorship of heart transplant data." Journal of the American Statistical Association, vol. 69, pp. 74-80.
library(ggplot2)

ggplot(heart_transplant, aes(x = transplant, y = survtime)) +
  geom_boxplot() +
  labs(x = "Transplant", y = "Survival time (days)")

ggplot(heart_transplant, aes(x = transplant, fill = survived)) +
  geom_bar(position = "fill") +
  labs(x = "Transplant", y = "Proportion", fill = "Outcome")
At the 1976 Pro Bowl, Ray Guy, a punter for the Oakland Raiders, punted a ball that hung mid-air long enough for officials to question whether the pigskin was filled with helium. The ball was found to be filled with air, but since then many have tossed around the idea that a helium-filled football would outdistance an air-filled one. Students at Ohio State University conducted an experiment to test this myth. They used two identical footballs, one filled with air and one filled with helium. Each football was kicked 39 times and the two footballs were alternated with each kick.
helium
A data frame with 39 observations on the following 3 variables.
Trial number.
Distance in yards for air-filled football.
Distance in yards for helium-filled football.
Lafferty, M. B. (1993), "OSU scientists get a kick out of sports controversy," The Columbus Dispatch (November 21, 1993), B7.
Previously part of the Data and Story Library, https://dasl.datadescription.com. Removed as of 2020.
boxPlot(helium$air, xlab = "air")
boxPlot(helium$helium, xlab = "helium")
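Because the two footballs were alternated and each row records one trial for both balls, one possible summary is the per-trial difference in distance. The sketch below computes it on made-up distances, not rows from helium; treating each row as a matched pair is an assumption of this sketch rather than something the source states.

```r
# Hypothetical distances in yards; illustrative only, not rows from helium.
kicks <- data.frame(
  trial  = 1:5,
  air    = c(25, 23, 18, 16, 35),
  helium = c(25, 16, 25, 14, 23)
)

# Per-trial difference: helium distance minus air distance.
kicks$diff <- kicks$helium - kicks$air

# A positive mean would favor the helium-filled ball, a negative mean the air-filled one.
mean(kicks$diff)
summary(kicks$diff)
```

If the pairing by trial is not considered meaningful, comparing the two columns as independent samples is the alternative.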
Examining the relationship between socioeconomic status measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (lunch) and the percentage of bike riders in the neighborhood wearing helmets (helmet).
helmet
A data frame with 12 observations representing neighborhoods on the following 2 variables.
Percent of students receiving reduced-fee school lunches.
Percent of bike riders wearing helmets.
library(ggplot2)

ggplot(helmet, aes(x = lunch, y = helmet)) +
  geom_point()
The Human Freedom Index is a report that attempts to summarize the idea of "freedom" through many different variables for countries around the globe. It serves as a rough objective measure for the relationships between the different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The Human Freedom Index is an annual report co-published by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
hfi
A data frame with 1458 observations on the following 123 variables.
Year
ISO code of country
Name of country
Region where country is located
Procedural justice
Civil justice
Criminal justice
Rule of law
Homicide
Disappearances
Violent conflicts
Violent conflicts
Terrorism fatalities
Terrorism injuries
Disappearances, conflict, and terrorism
Female genital mutilation
Missing women
Inheritance rights for widows
Inheritance rights for daughters
Inheritance
Women's security
Security and safety
Freedom of domestic movement
Freedom of foreign movement
Women's movement
Freedom of movement
Freedom to establish religious organizations
Freedom to operate religious organizations
Freedom to establish and operate religious organizations
Harassment and physical hostilities
Legal and regulatory restrictions
Religious freedom
Freedom of association
Freedom of assembly
Freedom to establish political parties
Freedom to operate political parties
Freedom to establish and operate political parties
Freedom to establish professional organizations
Freedom to operate professional organizations
Freedom to establish and operate professional organizations
Freedom to establish educational, sporting, and cultural organizations
Freedom to operate educational, sporting, and cultural organizations
Freedom to establish and operate educational, sporting, and cultural organizations
Freedom to associate and assemble with peaceful individuals or organizations
Press killed
Press jailed
Laws and regulations that influence media content
Political pressures and controls on media content
Access to cable/satellite
Access to foreign newspapers
State control over internet access
Freedom of expression
Legal gender
Parental rights in marriage
Parental rights after divorce
Parental rights
Male-to-male relationships
Female-to-female relationships
Same-sex relationships
Divorce
Identity and relationships
Personal Freedom (score)
Personal Freedom (rank)
Government consumption
Transfers and subsidies
Government enterprises and investments
Top marginal income tax rate - Top marginal income tax rates
Top marginal income tax rate - Top marginal income and payroll tax rate
Top marginal tax rate
Size of government
Judicial independence
Impartial courts
Protection of property rights
Military interference in rule of law and politics
Integrity of the legal system
Legal enforcement of contracts
Regulatory restrictions on the sale of real property
Reliability of police
Business costs of crime
Gender adjustment
Legal system and property rights
Money growth
Standard deviation of inflation
Inflation - most recent year
Freedom to own foreign currency bank account
Sound money
Tariffs - Revenue from trade taxes (percentage of trade sector)
Tariffs - Mean tariff rate
Tariffs - Standard deviation of tariffs rates
Tariffs
Regulatory trade barriers - Nontariff trade barriers
Regulatory trade barriers - Compliance costs of importing and exporting
Regulatory trade barriers
Black-market exchange rates
Controls of the movement of capital and people - Foreign ownership/investment restrictions
Controls of the movement of capital and people - Capital controls
Controls of the movement of capital and people - Freedom of foreigners to visit
Controls of the movement of capital and people
Freedom to trade internationally
Credit market regulations - Ownership of banks
Credit market regulations - Private sector credit
Credit market regulations - Interest rate controls/negative real interest rates
Credit market regulation
Labor market regulations - Hiring regulations and minimum wage
Labor market regulations - Hiring and firing regulations
Labor market regulations - Centralized collective bargaining
Labor market regulations - Hours regulations
Labor market regulations - Dismissal regulations
Labor market regulations - Conscription
Labor market regulation
Business regulations - Administrative requirements
Business regulations - Bureaucracy costs
Business regulations - Starting a business
Business regulations - Extra payments/bribes/favoritism
Business regulations - Licensing restrictions
Business regulations - Cost of tax compliance
Business regulation
Economic freedom regulation score
Economic freedom score
Economic freedom rank
Human freedom score
Human freedom rank
Human freedom quartile
This dataset contains information from Human Freedom Index reports from 2008-2016.
Ian Vasquez and Tanja Porcnik, The Human Freedom Index 2018: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute, Fraser Institute, and the Friedrich Naumann Foundation for Freedom, 2018). https://www.cato.org/sites/cato.org/files/human-freedom-index-files/human-freedom-index-2016.pdf. https://www.kaggle.com/gsutters/the-human-freedom-index.
Create histograms and hollow histograms. This function permits easy color and appearance customization.
histPlot(
  x,
  col = fadeColor("black", "22"),
  border = "black",
  breaks = "default",
  probability = FALSE,
  hollow = FALSE,
  add = FALSE,
  lty = 2,
  lwd = 1,
  freqTable = FALSE,
  right = TRUE,
  axes = TRUE,
  xlab = NULL,
  ylab = NULL,
  xlim = NULL,
  ylim = NULL,
  ...
)
x | Numerical vector or a frequency table (matrix) where the first column represents the observed values and the second column the frequencies. See also |
col | Shading of the histogram bins. |
border | Color of histogram bin borders. |
breaks | A vector for the bin boundaries or an approximate number of bins. |
probability | If |
hollow | If |
add | If |
lty | Line type. Applies only if |
lwd | Line width. Applies only if |
freqTable | Set to |
right | Set to |
axes | If |
xlab | Label for the x axis. |
ylab | Label for the y axis. |
xlim | Limits for the x axis. |
ylim | Limits for the y axis. |
... | Additional arguments to |
David Diez
histPlot(tips$tip, main = "Tips")

# overlaid hollow histograms
histPlot(tips$tip[tips$day == "Tuesday"],
  probability = TRUE,
  hollow = TRUE,
  main = "Tips by day"
)
histPlot(tips$tip[tips$day == "Friday"],
  probability = TRUE,
  hollow = TRUE,
  add = TRUE,
  lty = 3,
  border = "red"
)
# legend line types match the histograms: lty = 2 (the default) and lty = 3
legend("topright",
  col = c("black", "red"),
  lty = 2:3,
  legend = c("Tuesday", "Friday")
)

# breaks and colors
histPlot(tips$tip,
  col = fadeColor("yellow", "33"),
  border = "darkblue",
  probability = TRUE,
  breaks = 30,
  lwd = 3
)

# custom breaks
brks <- c(-1, 0, 1, 2, 3, 4, seq(5, 20, 5), 22, 24, 26)
histPlot(tips$tip,
  probability = TRUE,
  breaks = brks,
  col = fadeColor("darkgoldenrod4", "33"),
  xlim = c(0, 26)
)
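The x argument may also be a frequency table rather than raw values. The following is a minimal sketch of that input form, assuming histPlot() accepts a two-column matrix as described for x. The helper name as_freq_table and the toy values are hypothetical and not part of the package.

```r
# Hypothetical helper: collapse a numeric vector into a two-column
# frequency table (observed value, frequency).
as_freq_table <- function(values) {
  counts <- table(values)
  cbind(
    value = as.numeric(names(counts)),
    freq = as.vector(counts)
  )
}

# Toy values, made up for illustration only.
toy <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
ft <- as_freq_table(toy)

# Plot only if the openintro package is installed.
if (requireNamespace("openintro", quietly = TRUE)) {
  openintro::histPlot(ft, freqTable = TRUE, xlab = "Value")
}
```

Passing the table with freqTable = TRUE should give the same picture as plotting the raw vector, which can be handy when only tabulated counts are available.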
The make-up of the United States House of Representatives every two years since 1789. The last Congress included is the 112th Congress, which completed its term in 2013.
house
A data frame with 112 observations on the following 12 variables.
The number of that year's Congress
Starting year
Ending year
Total number of seats
Name of the first political party
Number of seats held by the first political party
Name of the second political party
Number of seats held by the second political party
Other
Vacancy
Delegate
Resident commissioner
Party Divisions of the House of Representatives, 1789 to Present. https://history.house.gov/Institution/Party-Divisions/Party-Divisions.
library(dplyr)
library(ggplot2)
library(forcats)

# Examine two-party relationship since 1855
house_since_1855 <- house |>
  filter(year_start >= 1855) |>
  mutate(
    p1_perc = 100 * np1 / seats,
    p2_perc = 100 * np2 / seats,
    era = case_when(
      between(year_start, 1861, 1865) ~ "Civil War",
      between(year_start, 1914, 1918) ~ "World War I",
      between(year_start, 1929, 1939) ~ "Great Depression",
      between(year_start, 1940, 1945) ~ "World War II",
      between(year_start, 1960, 1965) ~ "Vietnam War Start",
      between(year_start, 1965, 1975) ~ "Vietnam War Escalated",
      TRUE ~ NA_character_
    ),
    era = fct_relevel(
      era, "Civil War", "World War I", "Great Depression",
      "World War II", "Vietnam War Start", "Vietnam War Escalated"
    )
  )

ggplot(house_since_1855, aes(x = year_start)) +
  geom_rect(aes(
    xmin = year_start, xmax = lead(year_start),
    ymin = -Inf, ymax = Inf, fill = era
  )) +
  geom_line(aes(y = p1_perc, color = "Democrats")) + # Democrats
  geom_line(aes(y = p2_perc, color = "Republicans")) + # Republicans
  scale_fill_brewer(palette = "Pastel1", na.translate = FALSE) +
  scale_color_manual(
    name = "Party",
    values = c("Democrats" = "blue", "Republicans" = "red"),
    labels = c("Democrats", "Republicans")
  ) +
  theme_minimal() +
  ylim(0, 100) +
  labs(x = "Year", y = "Percentage of seats", fill = "Era")
Each observation represents a simulated rent price for a student.
housing
A data frame with 75 observations on the following variable.
a numeric vector
housing
Two hundred observations were randomly sampled from the High School and Beyond survey, a survey conducted on high school seniors by the National Center for Education Statistics.
hsb2
A data frame with 200 observations and 11 variables.
Student ID.
Student's gender, with levels female and male.
Student's race, with levels african american, asian, hispanic, and white.
Socioeconomic status of student's family, with levels low, middle, and high.
Type of school, with levels public and private.
Type of program, with levels general, academic, and vocational.
Standardized reading score.
Standardized writing score.
Standardized math score.
Standardized science score.
Standardized social studies score.
UCLA Institute for Digital Research & Education - Statistical Consulting.
library(ggplot2)

ggplot(hsb2, aes(x = read - write, y = ses)) +
  geom_boxplot() +
  labs(
    x = "Difference between reading and writing scores",
    y = "Socio-economic status"
  )
The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights of the husbands and wives.
husbands_wives
A data frame with 199 observations on the following 8 variables.
Age of husband.
Age of wife.
Height of husband (mm).
Height of wife (mm).
Age of husband at the time they married.
Age of wife at the time they married.
Number of years married.
Hand DJ. 1994. A handbook of small data sets. Chapman & Hall/CRC.
library(ggplot2)

ggplot(husbands_wives, aes(x = ht_husband, y = ht_wife)) +
  geom_point()
These data are from a cross-sectional study examining the association of hyperuricemia with dietary magnesium in 5,168 participants in China. The study measured several other possible predictors, including body mass index (BMI, measured in kg/m^2). The data are used in the chapter on logistic regression in Introductory Statistics for the Life and Biomedical Sciences (ISLBS).
hyperuricemia
A tibble with 5168 rows and 8 variables:
sex
Factor with levels male and female
age
Numeric, measured in years
height
Numeric, measured in cm
weight
Numeric, measured in kg
bmi
Numeric, body mass index: weight in kg divided by the square of height in meters
uric.acid
Measured in micromolar/liter. Hyperuricemia (HU) was defined as uric acid >= 416 micromolar/L for males and >= 360 micromolar/L for females.
magnesium.intake
Daily magnesium intake from a food frequency questionnaire, measured in mg/day
hu
A factor, with levels no (hyperuricemia absent) and yes (hyperuricemia present). Hyperuricemia (HU) was defined as uric acid >= 416 micromolar/L for males and >= 360 micromolar/L for females.
Wang, Yi-lun, et al. "Association between dietary magnesium intake and hyperuricemia." PLoS One 10.11 (2015): e0141079. 10.1371/journal.pone.0141079
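The sex-specific definition of hyperuricemia given above can be written as a small function. This is a sketch only: the helper name classify_hu is hypothetical, and the choice of predictors in the logistic regression is an illustration rather than the model used in ISLBS.

```r
# Hypothetical helper: apply the thresholds quoted above
# (>= 416 micromolar/L for males, >= 360 micromolar/L for females).
classify_hu <- function(sex, uric_acid) {
  threshold <- ifelse(sex == "male", 416, 360)
  factor(ifelse(uric_acid >= threshold, "yes", "no"), levels = c("no", "yes"))
}

# With the package installed, compare the derived labels to the stored hu
# column and fit an illustrative logistic regression.
if (requireNamespace("openintro", quietly = TRUE)) {
  hu_data <- openintro::hyperuricemia
  derived <- classify_hu(hu_data$sex, hu_data$uric.acid)
  print(table(derived = derived, stored = hu_data$hu))

  fit <- glm(hu ~ magnesium.intake + bmi, data = hu_data, family = binomial)
  print(coef(summary(fit)))
}
```

Because the levels of hu are ordered no, yes, the fitted coefficients describe the log-odds of hyperuricemia being present.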
Random sample of 500 cases from the hyperuricemia dataset.
hyperuricemia.samp
A tibble with 500 rows and 8 variables:
sex
Factor with levels male and female
age
Numeric, measured in years
height
Numeric, measured in cm
weight
Numeric, measured in kg
bmi
Numeric, body mass index: weight in kg divided by the square of height in meters
uric.acid
Measured in micromolar/liter. Hyperuricemia (HU) was defined as uric acid >= 416 micromolar/L for males and >= 360 micromolar/L for females.
magnesium.intake
Daily magnesium intake from a food frequency questionnaire, measured in mg/day
hu
A factor, with levels no (hyperuricemia absent) and yes (hyperuricemia present). Hyperuricemia (HU) was defined as uric acid >= 416 micromolar/L for males and >= 360 micromolar/L for females.
Wang, Yi-lun, et al. "Association between dietary magnesium intake and hyperuricemia." PLoS One 10.11 (2015): e0141079. 10.1371/journal.pone.0141079
910 randomly sampled registered voters in Tampa, FL were asked about their political ideology and whether they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country.
immigration
A data frame with 910 observations on the following 2 variables.
a factor with levels Apply for citizenship, Guest worker, Leave the country, and Not sure
a factor with levels conservative, liberal, and moderate
SurveyUSA, News Poll #18927, data collected Jan 27-29, 2012.
immigration
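A natural first look at these data is the share of each response within each ideology group. The sketch below assumes the first column holds the response and the second the political ideology, following the order of the variable descriptions above. The helper name response_by_ideology is hypothetical.

```r
# Hypothetical helper: share of each response within each ideology group.
# `response` and `ideology` are two vectors or factors of equal length.
response_by_ideology <- function(response, ideology) {
  counts <- table(response, ideology)
  # Column proportions: each ideology column sums to 1 (up to rounding).
  round(prop.table(counts, margin = 2), 3)
}

if (requireNamespace("openintro", quietly = TRUE)) {
  imm <- openintro::immigration
  # Columns are taken by position, which is an assumption about their order.
  print(response_by_ideology(imm[[1]], imm[[2]]))
}
```

Conditioning on ideology (margin = 2) makes the groups comparable even when they differ in size.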
These are the core colors used for the Introduction to Modern Statistics textbook. The blue, green, pink, yellow, and red colors are also gray-scaled, meaning no changes are required when printing black and white copies.
IMSCOL
An 8-by-13 matrix of 7 colors with four fading scales: blue, green, pink, yellow, red, black, gray, and light gray.
plot(1:7, 7:1,
  col = IMSCOL, pch = 19, cex = 6, xlab = "", ylab = "",
  xlim = c(0.5, 7.5), ylim = c(-2.5, 8), axes = FALSE
)
text(1:7, 7:1 + 0.7, paste("IMSCOL[", 1:7, "]", sep = ""), cex = 0.9)
points(1:7, 7:1 - 0.7, col = IMSCOL[, 2], pch = 19, cex = 6)
points(1:7, 7:1 - 1.4, col = IMSCOL[, 3], pch = 19, cex = 6)
points(1:7, 7:1 - 2.1, col = IMSCOL[, 4], pch = 19, cex = 6)
Infant mortality data extracted from September 2023 posting of US Centers for Disease Control and Prevention. Mortality data for 2022 is listed as provisional and is subject to change. Physician data extracted from table 16 of Health United States 2019, National Center for Health Statistics (US) and represents number of physicians in patient care per 100,000 resident population in 2018, by state.
infant_mortality_2022
A data frame with 51 rows and 3 columns.
state_name
Character vector, US state, including the District of Columbia
infant_mortality_rate
Numeric vector, number of deaths per 1000 live births between 1 day and 1 year of age
doctors
Numeric, number of physicians in patient care per 100,000 population
https://www.cdc.gov/nchs/pressroom/sosmap/infant_mortality_rates/infant_mortality.htm, https://www.ncbi.nlm.nih.gov/books/NBK569310/table/ch2.tab16/
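The two numeric columns invite a simple look at how infant mortality varies with physician density across states. The following is a sketch under that framing, using the column names listed above. The helper name mortality_vs_doctors is hypothetical, and a state-level association of this kind is descriptive rather than causal.

```r
# Hypothetical helper: least-squares fit of infant mortality on physician
# density, returning the slope and the correlation.
mortality_vs_doctors <- function(df) {
  fit <- lm(infant_mortality_rate ~ doctors, data = df)
  c(
    slope = unname(coef(fit)["doctors"]),
    correlation = cor(df$doctors, df$infant_mortality_rate,
      use = "complete.obs"
    )
  )
}

if (requireNamespace("openintro", quietly = TRUE)) {
  print(mortality_vs_doctors(openintro::infant_mortality_2022))
}
```

The slope is in deaths per 1,000 live births per additional physician per 100,000 population.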
This entry gives the number of deaths of infants under one year old in 2012 per 1,000 live births in the same year. This rate is often used as an indicator of the level of health in a country.
infmortrate
A data frame with 222 observations on the following 2 variables.
Name of country.
Infant mortality rate per 1,000 live births.
The data is given in decreasing order of infant mortality rates. There are a few potential outliers.
CIA World Factbook, https://www.cia.gov/the-world-factbook/field/infant-mortality-rate/country-comparison.
library(ggplot2)

ggplot(infmortrate, aes(x = inf_mort_rate)) +
  geom_histogram(binwidth = 10)

ggplot(infmortrate, aes(x = inf_mort_rate)) +
  geom_density()
A data frame containing information about the 2016 US Presidential Election for the state of Iowa.
iowa
A data frame with 1386 observations on the following 5 variables.
The office that the candidates were running for.
President/Vice President pairs who were running for office.
Political party of the candidate.
County in Iowa where the votes were cast.
Number of votes received by the candidate.
library(ggplot2)
library(dplyr)

plot_data <- iowa |>
  filter(candidate != "Total") |>
  group_by(candidate) |>
  summarize(total_votes = sum(votes) / 1000)

ggplot(plot_data, aes(total_votes, candidate)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "2016 Presidential Election in Iowa",
    subtitle = "Popular vote",
    y = "",
    x = "Number of Votes (in thousands)"
  )
On Feb 1st, 2012, Facebook Inc. filed an S-1 form with the Securities and Exchange Commission as part of their initial public offering (IPO). This dataset includes the text of that document as well as text from the IPOs of two competing companies: Google and LinkedIn.
ipo
The format is a list of three character vectors. Each vector contains the line-by-line text of the IPO Prospectus of Facebook, Google, and LinkedIn, respectively.
Each of the three prospectuses is encoded in UTF-8 format and contains some non-word characters related to the layout of the original documents. For analysis on the words, it is recommended that the data be processed with packages such as tidytext. See examples below.
All IPO prospectuses are available from the U.S. Securities and Exchange Commission: Facebook, Google, LinkedIn.
Zweig, J., 2020. Mark Zuckerberg: CEO For Life?. WSJ.
library(tidytext)
library(tibble)
library(dplyr)
library(ggplot2)
library(forcats)

# Analyzing Facebook IPO text
facebook <- tibble(text = ipo$facebook, company = "Facebook")

facebook |>
  unnest_tokens(word, text) |>
  anti_join(stop_words) |>
  count(word, sort = TRUE) |>
  slice_head(n = 20) |>
  ggplot(aes(y = fct_reorder(word, n), x = n, fill = n)) +
  geom_col() +
  labs(
    title = "Top 20 most common words in Facebook IPO",
    x = "Frequency",
    y = "Word"
  )

# Comparisons to Google and LinkedIn IPO texts
google <- tibble(text = ipo$google, company = "Google")
linkedin <- tibble(text = ipo$linkedin, company = "LinkedIn")

ipo_texts <- bind_rows(facebook, google, linkedin)

ipo_texts |>
  unnest_tokens(word, text) |>
  count(company, word, sort = TRUE) |>
  bind_tf_idf(word, company, n) |>
  arrange(desc(tf_idf)) |>
  group_by(company) |>
  slice_max(tf_idf, n = 15) |>
  ungroup() |>
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = company)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~company, ncol = 3, scales = "free") +
  labs(x = "tf-idf", y = NULL)
A simulated dataset on lengths of songs on an iPod.
ipod
A data frame with 3000 observations on the following variable.
Length of song (in minutes).
Simulated data.
library(ggplot2)

ggplot(ipod, aes(x = song_length)) +
  geom_histogram(binwidth = 0.5)
A data frame containing information about the 2009 Presidential Election in Iran. There were widespread claims of election fraud in this election both internationally and within Iran.
iran
A data frame with 366 observations on the following 9 variables.
Iranian province where votes were cast.
City within province where votes were cast.
Number of votes received by Ahmadinejad.
Number of votes received by Rezai.
Number of votes received by Karrubi.
Number of votes received by Mousavi.
Total number of votes cast.
Number of votes that were not counted.
Number of votes that were counted.
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)

plot_data <- iran |>
  summarize(
    ahmadinejad = sum(ahmadinejad) / 1000,
    rezai = sum(rezai) / 1000,
    karrubi = sum(karrubi) / 1000,
    mousavi = sum(mousavi) / 1000
  ) |>
  pivot_longer(
    cols = c(ahmadinejad, rezai, karrubi, mousavi),
    names_to = "candidate",
    values_to = "votes"
  ) |>
  mutate(candidate = str_to_title(candidate))

ggplot(plot_data, aes(votes, candidate)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "2009 Iranian Presidential Election",
    x = "Number of votes (in thousands)",
    y = ""
  )
Simulated dataset of registered voters proportions and representation on juries.
jury
A data frame with 275 observations on the following variable.
a factor with levels black, hispanic, other, and white
jury
Data from the five games the Los Angeles Lakers played against the Orlando Magic in the 2009 NBA finals.
kobe_basket
A data frame with 133 rows and 6 variables:
A categorical vector, ORL if the Los Angeles Lakers played against Orlando
A numerical vector, game in the 2009 NBA finals
A categorical vector, quarter in the game, OT stands for overtime
A character vector, time at which Kobe took a shot
A character vector, description of the shot
A categorical vector, H if the shot was a hit, M if the shot was a miss
Each row represents a shot Kobe Bryant took during the five games of the 2009 NBA finals. Kobe Bryant's performance earned him the title of Most Valuable Player and many spectators commented on how he appeared to show a hot hand.
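One way to explore the hot hand claim is to look at the lengths of runs of consecutive hits. The sketch below uses base R's rle() for this. The helper name hit_streaks is hypothetical, shot is the assumed name of the hit/miss column described above, and this simple version counts only runs of at least one hit, which may differ from other streak definitions. It also treats the shots as one continuous sequence, so a run may span a quarter or game boundary.

```r
# Hypothetical helper: lengths of runs of consecutive hits ("H").
hit_streaks <- function(shots) {
  runs <- rle(as.character(shots))
  runs$lengths[runs$values == "H"]
}

if (requireNamespace("openintro", quietly = TRUE)) {
  # `shot` is the assumed name of the hit/miss column.
  streaks <- hit_streaks(openintro::kobe_basket$shot)
  print(table(streaks))
}
```

Comparing this distribution with the one from a shooter whose shots are independent is the usual next step.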
Acts as a simplified template to common parameters passed to rmarkdown::html_document().
lab_report(
  highlight = "pygments",
  theme = "spacelab",
  toc = TRUE,
  toc_float = TRUE,
  code_download = TRUE,
  code_folding = "show"
)
highlight | Syntax highlighting style. Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", "haddock", and "textmate". Pass |
theme | Visual theme ("default", "cerulean", "journal", "flatly", "readable", "spacelab", "united", "cosmo", "lumen", "paper", "sandstone", "simplex", or "yeti"). Pass |
toc | |
toc_float | |
code_download | Embed the Rmd source code within the document and provide a link that can be used by readers to download the code. |
code_folding | Enable document readers to toggle the display of R code chunks. Specify |
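Since lab_report() wraps rmarkdown::html_document(), it can be passed to rmarkdown::render() as an output format. The following is a minimal sketch of that usage. The wrapper name render_lab and the file name in the commented call are hypothetical.

```r
# Hypothetical wrapper: render an R Markdown file with the lab_report format.
# `path` is a placeholder for your own .Rmd file; extra arguments are
# forwarded to lab_report().
render_lab <- function(path, ...) {
  rmarkdown::render(
    input = path,
    output_format = openintro::lab_report(...)
  )
}

# Example call (not run): override the theme, keep the other defaults.
# render_lab("my_lab.Rmd", theme = "cosmo")
```

The more common route is to name the format in the document's YAML header as output: openintro::lab_report, which uses the defaults shown in the usage above.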
Original data from the experiment run by Bertrand and Mullainathan (2004).
labor_market_discrimination
A tibble with 4870 observations of 63 variables.
Highest education, with levels of 0 = not reported; 1 = high school diploma; 2 = high school graduate; 3 = some college; 4 = college or more.
Number of jobs listed on resume.
Number of years of work experience on the resume.
Indicator variable for which 1 = resume mentions some honors.
Indicator variable for which 1 = resume mentions some volunteering experience.
Indicator variable for which 1 = resume mentions some military experience.
Indicator variable for which 1 = resume mentions some employment holes.
1990 Census Occupation Code. See sources for a key.
Occupation broad with levels 1 = executives and managerial occupations, 2 = administrative supervisors, 3 = sales representatives, 4 = sales workers, 5 = secretaries and legal assistants, 6 = clerical occupations
Indicator variable for which 1 = resume mentions some work experience while at school
Indicator variable for which 1 = email address on applicant's resume.
Indicator variable for which 1 = resume mentions some computer skills.
Indicator variable for which 1 = resume mentions some special skills.
Applicant's first name.
Sex, with levels of 'f' = female; 'm' = male.
Race, with levels of 'b' = black; 'w' = white.
Indicator variable for which 1 = high quality resume.
Indicator variable for which 1 = low quality resume.
Indicator variable for which 1 = applicant was called back.
City, with levels of 'c' = Chicago; 'b' = Boston.
Kind, with levels of 'a' = administrative; 's' = sales.
Employment ad identifier.
Fraction of blacks in applicant's zip.
Fraction of whites in applicant's zip.
Log median household income in applicant's zip.
Fraction of high-school dropouts in applicant's zip.
Fraction of college degree or more in applicant's zip
Log per capita income in applicant's zip.
Indicator variable for which 1 = applicant has college degree or more.
Minimum experience required, if any (in years when numeric).
Specific education requirement, if any. 'hsg' = high school graduate, 'somcol' = some college, 'colp' = four year degree or higher
Indicator variable for which 1 = ad mentions employer is 'Equal Opportunity Employer'.
Sales of parent company (in millions of US $).
Number of parent company employees.
Sales of branch (in millions of US $).
Number of branch employees.
Indicator variable for which 1 = employer is a federal contractor.
Fraction of blacks in employer's zipcode.
Fraction of whites in employer's zipcode.
Log median household income in employer's zipcode.
Fraction of high-school dropouts in employer's zipcode.
Fraction of college degree or more in employer's zipcode.
Log per capita income in employer's zipcode.
Indicator variable for which 1 = executives or managers wanted.
Indicator variable for which 1 = administrative supervisors wanted.
Indicator variable for which 1 = secretaries or legal assistants wanted.
Indicator variable for which 1 = clerical workers wanted.
Indicator variable for which 1 = sales representative wanted.
Indicator variable for which 1 = retail sales worker wanted.
Indicator variable for which 1 = ad mentions any requirement for job.
Indicator variable for which 1 = ad mentions some experience requirement.
Indicator variable for which 1 = ad mentions some communication skills requirement.
Indicator variable for which 1 = ad mentions some educational requirement.
Indicator variable for which 1 = ad mentions some computer skill requirement.
Indicator variable for which 1 = ad mentions some organizational skills requirement.
Indicator variable for which 1 = employer industry is manufacturing.
Indicator variable for which 1 = employer industry is transport or communication.
Indicator variable for which 1 = employer industry is finance, insurance or real estate.
Indicator variable for which 1 = employer industry is wholesale or retail trade.
Indicator variable for which 1 = employer industry is business or personal services.
Indicator variable for which 1 = employer industry is health, education or social services.
Indicator variable for which 1 = employer industry is other or unknown.
Ownership status of employer, with levels of 'non-profit'; 'private'; 'public'
From the summary: "We study race in the labor market by sending fictitious resumes to help-wanted ads in Boston and Chicago newspapers. To manipulate perceived race, resumes are randomly assigned African-American- or White-sounding names. White names receive 50 percent more callbacks for interviews. Callbacks are also more responsive to resume quality for White names than for African-American ones. The racial gap is uniform across occupation, industry, and employer size. We also find little evidence that employers are inferring social class from the names. Differential treatment by race still appears to be prominent in the U. S. labor market."
Bertrand, Marianne, and Mullainathan, Sendhil. Replication data for: Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. Nashville, TN: American Economic Association [publisher], 2004. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2019-12-06. doi:10.3886/E116023V1.
Note: The description of the variables follows closely the labels provided in the original dataset, with small edits for clarity.
library(dplyr)

# Percent callback for typical White names and typical African-American names (table 1, p. 997)
labor_market_discrimination |>
  group_by(race) |>
  summarise(call_back = mean(call))
Data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010.
LAhomes
A data frame with 1594 observations on the following 8 variables.
City where the home is located.
Type of home with levels Condo/Twh - condo or townhouse, SFR - single family residence, and NA
Number of bedrooms in the home.
Number of bathrooms in the home.
Number of cars that can be parked in the garage. Note that a value of 4 refers to 4 or more garage spaces.
Square footage of the home.
Indicates if the home has a pool.
Listing price of the home.
library(ggplot2)

ggplot(LAhomes, aes(sqft, price)) +
  geom_point(alpha = 0.2) +
  theme_minimal() +
  labs(
    title = "Can we predict list price from square footage?",
    subtitle = "Homes in the Los Angeles area",
    x = "Square feet",
    y = "List price"
  )
Resumes were sent out to 316 top law firms in the United States, and there were two randomized characteristics of each resume. First, the gender associated with the resume was randomized by assigning a first name of either James or Julia. Second, the socioeconomic class of the candidate was randomly assigned and represented through five minor changes associated with personal interests and other minor details (e.g., an extracurricular activity of sailing team vs. track and field). The outcome variable was whether the candidate received an interview.
law_resume
A data frame with 316 observations on the following 3 variables. Each row represents a resume sent to a top law firm for this experiment.
The resume included irrelevant details suggesting either "low" or "high" socioeconomic class.
The resume implied the candidate was either "male" or "female".
Whether the candidate received an invitation for an "interview" or "not".
For a casual overview, see https://hbr.org/2016/12/research-how-subtle-class-cues-can-backfire-on-your-resume.
For the academic paper, see Tilcsik A, Rivera LA. 2016. Class Advantage, Commitment Penalty. The Gendered Effect of Social Class Signals in an Elite Labor Market. American Sociological Review 81:6 p1097-1131. doi:10.1177/0003122416668154.
tapply(law_resume$outcome == "interview", law_resume[, c("class", "gender")], mean)

m <- glm(I(outcome == "interview") ~ gender * class, data = law_resume, family = binomial)
summary(m)
predict(m, type = "response")
This study examined whether early exposure to peanuts increased tolerance and protection from developing a peanut allergy in children who are allergic to eggs or who have severe eczema. Participants between 4 and 11 months old were randomized to either avoid or consume peanut-based products during the first three years of life. The longer title of the study is Induction of Tolerance Through Early Introduction of Peanut in High-Risk Children and can be found at https://clinicaltrials.gov/ as study NCT00329784.
LEAP
A data frame with 640 rows and 7 columns
participant.ID
Character vector, unique identifier for each participant.
stratum
Factor, outcome of a skin prick test (SPT) conducted before randomization, with levels SPT-Negative (participant shows no evidence of peanut allergy) and SPT-Positive (evidence of a peanut allergy). Participants were randomized separately within each stratum. The primary analysis of the study is typically restricted to the SPT-Negative group.
treatment.group
Factor, randomized assignment for each participant, with levels Peanut Avoidance and Peanut Consumption.
age.months
Participant age in months at randomization.
sex
Factor, sex of participant with levels Female and Male.
primary.ethnicity
Factor variable with levels Asian, Black, Other, Mixed, and White.
overall.V60.outcome
Factor, indicating whether, after 5 years, the participant had an allergic reaction in a peanut-based oral food challenge (OFC), with levels FAIL OFC (allergic reaction) and PASS OFC (no allergic reaction).
More variables are available at the site in the source.
These data are a subset of variables from the file ADSTART0_2015-03-03_14-20-10.txt, available by downloading study files from https://www.immport.org/shared/study/SDY660
Du Toit, George, et al. "Randomized trial of peanut consumption in infants at risk for peanut allergy." New England Journal of Medicine 372.9 (2015): 803-813. doi:10.1056/nejmoa1414850
Data was collected from 276 students in a university psychology course to determine the effect of lecture delivery method on learning. Students were presented a live lecture by the professor on one day and a pre-recorded lecture on a different topic by the same professor on a different day. Survey data was collected during the lectures to determine mind wandering, interest, and motivation. Students were also ultimately asked about the preferred lecture delivery method. Finally, students completed an assessment at the end of the lecture to determine memory recall.
lecture_learning
A data frame with 552 rows and 8 variables.
Identification number of a specific student. Each identification number appears twice because the same student heard both lecture delivery methods.
Gender of student. Recorded as a binary variable with levels Male and Female in the study.
Delivery method of lecture was either in-person (Live) or pre-recorded (Video).
An indicator of distraction during the lecture. It is a proportion of six mind wandering probes during the lecture when a student answered yes that mind wandering had just occurred.
An indicator of recall of information provided during the lecture. It is the proportion of correct answers in a six question assessment given at the end of the lecture presentation.
A Likert scale that gauged student interest level concerning the lecture.
After experiencing both lecture delivery methods, students were asked which method most motivated them to remain attentive.
After a single lecture delivery experience, this Likert scale was used to gauge motivation to remain attentive during the lecture.
library(dplyr)
library(ggplot2)

# Calculate the average memory test proportion by lecture delivery method
# and gender.
lecture_learning |>
  group_by(method, gender) |>
  summarize(average_memory = mean(memory), count = n(), .groups = "drop")

# Compare visually the differences in memory test proportions by delivery
# method and gender.
ggplot(lecture_learning, aes(x = method, y = memory, fill = gender)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Difference in memory test proportions",
    x = "Method",
    y = "Memory",
    fill = "Gender"
  )

# Use a paired t-test to determine whether memory test proportion score
# differed by delivery method. Note that paired t-tests are identical
# to one sample t-test on the difference between the Live and Video methods.
learning_diff <- lecture_learning |>
  tidyr::pivot_wider(id_cols = student, names_from = method, values_from = memory) |>
  mutate(time_diff = Live - Video)
t.test(time_diff ~ 1, data = learning_diff)

# Calculating the proportion of students who were most motivated to remain
# attentive in each delivery method.
lecture_learning |>
  count(motivation_both) |>
  mutate(proportion = n / sum(n))
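The example's comment that a paired t-test matches a one-sample t-test on the differences can be checked with a small base R sketch. The scores below are simulated and hypothetical (the names `live` and `video` are illustrative and are not taken from `lecture_learning`), so only the equivalence itself carries over.

```r
# Simulated, hypothetical scores for 30 students under two conditions.
set.seed(1)
live  <- round(runif(30), 2)
video <- round(runif(30), 2)

# Paired t-test on the two measurements per student.
paired <- t.test(live, video, paired = TRUE)

# One-sample t-test on the within-student differences.
one_sample <- t.test(live - video, mu = 0)

# Both tests report the same statistic, degrees of freedom, and p-value.
same_statistic <- isTRUE(all.equal(unname(paired$statistic), unname(one_sample$statistic)))
same_p_value   <- isTRUE(all.equal(paired$p.value, one_sample$p.value))
same_statistic
same_p_value
```

The two calls differ only in how the data are supplied, which is why the package example can reshape to one row per student and test the difference column.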
In a 2010 Survey USA poll, 70% of the 119 respondents between the ages of 18 and 34 said they would vote in the 2010 general election for Prop 19, which would change California law to legalize marijuana and allow it to be regulated and taxed.
leg_mari
A data frame with 119 observations on the following variable.
One of two values: oppose or support.
Survey USA, Election Poll #16804, data collected July 8-11, 2010.
table(leg_mari)
Data about LEGO sets for sale, based on a JSDSE article by Anna Peterson and Laura Ziegler. Data from their article was scraped from multiple sources, including brickset.com.
lego_population
A data frame with 1304 rows and 14 variables.
Set Item number
Name of the set.
Set theme: Duplo, City or Friends.
Number of pieces in the set.
Recommended retail price from LEGO.
Price of the set at Amazon.
Year that it was produced.
LEGO's recommended ages of children for the set
Pages in the instruction booklet.
Number of LEGO people in the set; "NA" was recorded if unknown.
Type of packaging: bag, box, etc.
Weight of the set of LEGOS in pounds and kilograms.
Number of pieces classified as unique in the instruction manual.
Size of the lego pieces: Large if safe for small children and Small for older children.
Peterson, A. D., & Ziegler, L. (2021). Building a multiple linear regression model with LEGO brick data. Journal of Statistics and Data Science Education, 29(3),1-7. doi:10.1080/26939169.2021.1946450
BrickInstructions.com. (n.d.). Retrieved February 2, 2021 from
Brickset. (n.d.). BRICKSET: Your LEGO® set guide. Retrieved February 2, 2021 from
library(ggplot2)
library(dplyr)

lego_population |>
  filter(theme == "Friends" | theme == "City") |>
  ggplot(aes(x = pieces, y = amazon_price)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Pieces in the Set",
    y = "Amazon Price",
    title = "Amazon Price vs Number of Pieces in Lego Sets",
    subtitle = "Friends and City Themes"
  )
Data about LEGO sets for sale, based on a JSDSE article by Anna Peterson and Laura Ziegler. Data from their article was scraped from multiple sources, including brickset.com.
lego_sample
A data frame with 75 rows and 15 variables.
Set Item number
Name of the set.
Set theme: Duplo, City or Friends.
Number of pieces in the set.
Recommended retail price from LEGO.
Price of the set at Amazon.
Year that it was produced.
LEGO's recommended ages of children for the set
Pages in the instruction booklet.
Number of LEGO people in the set; "NA" was recorded if unknown.
Type of packaging: bag, box, etc.
Weight of the set of LEGOS in pounds and kilograms.
Number of pieces classified as unique in the instruction manual.
Size of the lego pieces: Large if safe for small children and Small for older children.
Peterson, A. D., & Ziegler, L. (2021). Building a multiple linear regression model with LEGO brick data. Journal of Statistics and Data Science Education, 29(3),1-7. doi:10.1080/26939169.2021.1946450
BrickInstructions.com. (n.d.). Retrieved February 2, 2021 from
Brickset. (n.d.). BRICKSET: Your LEGO® set guide. Retrieved February 2, 2021 from
library(ggplot2)
library(dplyr)

lego_sample |>
  filter(theme == "Friends" | theme == "City") |>
  ggplot(aes(x = pieces, y = amazon_price)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Pieces in the Set",
    y = "Amazon Price",
    title = "Amazon Price vs Number of Pieces in Lego Sets",
    subtitle = "Friends and City Themes"
  )
A data frame with 3142 rows and 4 columns. County level data for life expectancy and median income in the United States.
life_exp
A data frame with 3142 observations on the following 4 variables.
Name of the state.
Name of the county.
Life expectancy in the county.
Median income in the county, measured in US $.
library(ggplot2)

# Income V Expectancy
ggplot(life_exp, aes(x = income, y = expectancy)) +
  geom_point(color = openintro::IMSCOL["green", "full"], alpha = 0.2) +
  theme_minimal() +
  labs(
    title = "Is there a relationship between median income and life expectancy?",
    x = "Median income (US $)",
    y = "Life Expectancy (year)"
  )
Create a simple regression plot with residual plot.
linResPlot(
  x, y,
  axes = FALSE, wBox = TRUE, wLine = TRUE,
  lCol = "#00000088", lty = 1, lwd = 1,
  main = "", xlab = "", ylab = "",
  marRes = NULL,
  col = fadeColor(4, "88"), pch = 20, cex = 1.5,
  yR = 0.1, ylim = NULL, subset = NULL,
  ...
)
x | Predictor variable. |
y | Outcome variable. |
axes | Whether to plot axis labels. |
wBox | Whether to plot boxes around each plot. |
wLine | Add a regression line. |
lCol | Line color. |
lty | Line type. |
lwd | Line width. |
main | Title for the top plot. |
xlab | x-label. |
ylab | y-label. |
marRes | Margin for the residuals plot. |
col | Color of the points. |
pch | Plotting character of points. |
cex | Size of points. |
yR | An additional vertical stretch factor on the plot. |
ylim | y-limits. |
subset | Boolean vector, if wanting a subset of the data. |
... | Additional arguments passed to both plots. |
# Currently seems broken for this example.
n <- 25
x <- runif(n)
y <- 5 * x + rnorm(n)
myMat <- rbind(matrix(1:2, 2))
myW <- 1
myH <- c(1, 0.45)
par(mar = c(0.35, 0.654, 0.35, 0.654))
layout(myMat, myW, myH)
linResPlot(x, y, col = COL[1, 2])
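For readers who only need the general idea of a regression plot stacked over a residual plot, the following base R sketch may help. It is not the implementation of `linResPlot()` and uses none of its arguments; the simulated data, margins, and panel heights are illustrative assumptions.

```r
# Simulated data, mirroring the shape of the example above.
set.seed(42)
n <- 25
x <- runif(n)
y <- 5 * x + rnorm(n)

# Fit a simple linear regression and extract its residuals.
fit <- lm(y ~ x)
res <- resid(fit)

# Stack two panels: the main plot on top, a shorter residual plot below.
old_mar <- par("mar")
layout(matrix(1:2, nrow = 2), widths = 1, heights = c(1, 0.45))

# Top panel: data with the fitted line (x axis suppressed).
par(mar = c(0.5, 4, 1, 1))
plot(x, y, pch = 20, xaxt = "n", xlab = "", ylab = "y")
abline(fit, col = "#00000088")

# Bottom panel: residuals against the predictor, with a zero reference line.
par(mar = c(4, 4, 0.5, 1))
plot(x, res, pch = 20, xlab = "x", ylab = "Residual")
abline(h = 0, lty = 2)

# Restore a single-panel layout and the original margins.
layout(1)
par(mar = old_mar)
```

Sharing the x variable between the two panels is what makes the residual pattern easy to line up with the fit above it.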
Data on where lizards were observed and the level of sunlight. The data were collected on Sceloporus occidentalis (western fence lizards) by Stephen C. Adolph in 1983 (in desert and mountain sites) and by Dee Asbury in 2002-3 (in valley site).
lizard_habitat
A data frame with 332 observations on the following 2 variables.
Site of lizard observation: desert, mountain, or valley.
Sunlight level at time of observation: sun (lizard was observed perching in full sunlight), partial (lizard was observed perching with part of its body in the sun, part in the shade), shade (lizard was observed perching in the shade).
Adolph, S. C. 1990. Influence of behavioral thermoregulation on microhabitat use by two Sceloporus lizards. Ecology 71: 315-327.
Asbury, D. A., and S. C. Adolph. 2007. Behavioral plasticity in an ecological generalist: microhabitat use by western fence lizards. Evolutionary Ecology Research 9: 801-815.
library(ggplot2)

# Frequencies
table(lizard_habitat)

# Stacked bar plots
ggplot(lizard_habitat, aes(y = site, fill = sunlight)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion")
Data on top speeds measured on a laboratory race track for two species of lizards: Western fence lizard (Sceloporus occidentalis) and Sagebrush lizard (Sceloporus graciosus).
lizard_run
A data frame with 48 observations on the following 3 variables.
Top speed of lizard, meters per second.
Common name: Western fence lizard and Sagebrush lizard.
Scientific name (Genus and species): Sceloporus occidentalis and Sceloporus graciosus.
Adolph, S. C. 1987. Physiological and behavioral ecology of the lizards Sceloporus occidentalis and Sceloporus graciosus. Dissertation. University of Washington, Seattle, Washington, USA.
library(ggplot2)
library(dplyr)

# Top speed by species
ggplot(lizard_run, aes(x = top_speed, color = common_name, fill = common_name)) +
  geom_density(alpha = 0.5)

# Top speed summary statistics by species
lizard_run |>
  group_by(common_name) |>
  summarise(
    n = n(),
    mean = mean(top_speed),
    sd = sd(top_speed)
  )
Plot data, the linear model, and a residual plot simultaneously.
lmPlot(
  x, y,
  xAxis = 0, yAxis = 4, resAxis = 3, resSymm = TRUE,
  wBox = TRUE, wLine = TRUE,
  lCol = "#00000088", lty = 1, lwd = 1,
  xlab = "", ylab = "",
  marRes = NULL,
  col = "#22558888", pch = 20, cex = 1.5,
  xR = 0.02, yR = 0.1, xlim = NULL, ylim = NULL,
  subset = NULL, parCustom = FALSE, myHeight = c(1, 0.45),
  plots = c("both", "mainOnly", "resOnly"),
  highlight = NULL, hlCol = NULL, hlCex = 1.5, hlPch = 20,
  na.rm = TRUE,
  ...
)
x | The x coordinates of points in the plot. |
y | The y coordinates of points in the plot. |
xAxis | The maximum number of x axis labels. |
yAxis | The maximum number of y axis labels. |
resAxis | The maximum number of y axis labels in the residual plot. |
resSymm | Boolean determining whether the range of the residual plot should be symmetric about zero. |
wBox | Boolean determining whether a box should be added around each plot. |
wLine | Boolean determining whether to add a regression line to the plot. |
lCol | The color of the regression line to be added. |
lty | The line type of the regression line to be added. |
lwd | The line width of the regression line to be added. |
xlab | A label for the x axis. |
ylab | A label for the y axis. |
marRes | Margin specified for the residuals. |
col | Color of points. |
pch | Plotting character. |
cex | Plotting character size. |
xR | Scaling the limits of the x axis. Ignored if |
yR | Scaling the limits of the y axis. Ignored if |
xlim | Limits for the x axis. |
ylim | Limits for the y axis. |
subset | A subset of the data to be used for the linear model. |
parCustom | If |
myHeight | A numerical vector of length 2 representing the ratio of the primary plot to the residual plot, in height. |
plots | Not currently utilized. |
highlight | Numerical vector specifying particular points to highlight. |
hlCol | Color of highlighted points. |
hlCex | Size of highlighted points. |
hlPch | Plotting characters of highlighted points. |
na.rm | Remove cases with |
... | Additional arguments to |
David Diez
lmPlot(satgpa$sat_sum, satgpa$fy_gpa)
lmPlot(gradestv$tv, gradestv$grades,
  xAxis = 4,
  xlab = "time watching TV", yR = 0.2, highlight = c(1, 15, 20)
)
This dataset represents thousands of loans made through the Lending Club platform, which is a platform that allows individuals to lend to other individuals. Of course, not all loans are created equal. Someone who is essentially a sure bet to pay back a loan will have an easier time getting a loan with a low interest rate than someone who appears to be riskier. And for people who are very risky? They may not even get a loan offer, or they may not have accepted the loan offer due to a high interest rate. It is important to keep that last part in mind, since this dataset only represents loans actually made, i.e. do not mistake this data for loan applications!
loans_full_schema
A data frame with 10,000 observations on the following 55 variables.
Job title.
Number of years in the job, rounded down. If longer than 10 years, then this is represented by the value 10.
Two-letter state code.
The ownership status of the applicant's residence.
Annual income.
Type of verification of the applicant's income.
Debt-to-income ratio.
If this is a joint application, then the annual income of the two parties applying.
Type of verification of the joint income.
Debt-to-income ratio for the two parties.
Delinquencies on lines of credit in the last 2 years.
Months since the last delinquency.
Year of the applicant's earliest line of credit
Inquiries into the applicant's credit during the last 12 months.
Total number of credit lines in this applicant's credit history.
Number of currently open lines of credit.
Total available credit, e.g. if only credit cards, then the total of all the credit limits. This excludes a mortgage.
Total credit balance, excluding a mortgage.
Number of collections in the last 12 months. This excludes medical collections.
The number of derogatory public records, which roughly means the number of times the applicant failed to pay.
Months since the last time the applicant was 90 days late on a payment.
Number of accounts where the applicant is currently delinquent.
The total amount that the applicant has had against them in collections.
Number of installment accounts, which are (roughly) accounts with a fixed payment amount and period. A typical example might be a 36-month car loan.
Number of new lines of credit opened in the last 24 months.
Number of months since the last credit inquiry on this applicant.
Number of satisfactory accounts.
Number of current accounts that are 120 days past due.
Number of current accounts that are 30 days past due.
Number of currently active bank cards.
Total of all bank card limits.
Total number of credit card accounts in the applicant's history.
Total number of currently open credit card accounts.
Number of credit cards that are carrying a balance.
Number of mortgage accounts.
Percent of all lines of credit where the applicant was never delinquent.
a numeric vector
Number of bankruptcies listed in the public record for this applicant.
The category for the purpose of the loan.
The type of application: either individual or joint.
The amount of the loan the applicant received.
The number of months of the loan the applicant received.
Interest rate of the loan the applicant received.
Monthly payment for the loan the applicant received.
Grade associated with the loan.
Detailed grade associated with the loan.
Month the loan was issued.
Status of the loan.
Initial listing status of the loan. (I think this has to do with whether the lender provided the entire loan or if the loan is across multiple lenders.)
Disbursement method of the loan.
Current balance on the loan.
Total that has been paid on the loan by the applicant.
The difference between the original loan amount and the current balance on the loan.
The amount of interest paid so far by the applicant.
Late fees paid by the applicant.
This data comes from Lending Club (https://www.lendingclub.com/info/statistics.action), which provides a very large, open set of data on the people who received loans through their platform.
loans_full_schema
This dataset contains the coordinates of the boundaries of all 32 boroughs of the Greater London area.
london_boroughs
A data frame with 45341 observations on the following 3 variables.
Name of the borough.
The "easting" component of the coordinate, see details.
The "northing" component of the coordinate, see details.
Map data was made available through the Ordnance Survey Open Data initiative. The data use the National Grid coordinate system, based upon eastings (x) and northings (y) instead of longitude and latitude.
The name variable covers all 32 boroughs in Greater London: Barking & Dagenham, Barnet, Bexley, Brent, Bromley, Camden, Croydon, Ealing, Enfield, Greenwich, Hackney, Hammersmith & Fulham, Haringey, Harrow, Havering, Hillingdon, Hounslow, Islington, Kensington & Chelsea, Kingston, Lambeth, Lewisham, Merton, Newham, Redbridge, Richmond, Southwark, Sutton, Tower Hamlets, Waltham Forest, Wandsworth, Westminster.
https://data.london.gov.uk/dataset/ordnance-survey-code-point
Contains Ordnance Survey data released under the Open Government License, OGL v2.
london_murders
library(dplyr)
library(ggplot2)

# Calculate number of murders by borough
london_murders_counts <- london_murders |>
  group_by(borough) |>
  add_tally()

london_murders_counts

## Not run:
# Add number of murders to geographic boundary data
london_boroughs_murders <- inner_join(london_boroughs, london_murders_counts, by = "borough")

# Map murders
ggplot(london_boroughs_murders) +
  geom_polygon(aes(x = x, y = y, group = borough, fill = n), colour = "white") +
  scale_fill_distiller(direction = 1) +
  labs(x = "Easting", y = "Northing", fill = "Number of murders")

## End(Not run)
This dataset contains the victim name, age, and location of every murder recorded in the Greater London area by the Metropolitan Police from January 1, 2006 to September 7, 2011.
london_murders
A data frame with 838 observations on the following 5 variables.
First name(s) of the victim.
Age of the victim.
Date of the murder (YYYY-MM-DD).
Year of the murder.
The London borough in which the murder took place. See the Details section for a list of all the boroughs.
To visualize this dataset using a map, see the london_boroughs dataset, which contains the easting and northing coordinates of polygons that define the boundaries of the 32 boroughs of Greater London.
The borough variable covers all 32 boroughs in Greater London: Barking & Dagenham, Barnet, Bexley, Brent, Bromley, Camden, Croydon, Ealing, Enfield, Greenwich, Hackney, Hammersmith & Fulham, Haringey, Harrow, Havering, Hillingdon, Hounslow, Islington, Kensington & Chelsea, Kingston, Lambeth, Lewisham, Merton, Newham, Redbridge, Richmond, Southwark, Sutton, Tower Hamlets, Waltham Forest, Wandsworth, Westminster.
https://www.theguardian.com/news/datablog/2011/oct/05/murder-london-list#data
Inspired by The Guardian Datablog.
library(dplyr)
library(ggplot2)
library(lubridate)

london_murders |>
  mutate(
    day_count = as.numeric(date - ymd("2006-01-01")),
    date_cut = cut(day_count, seq(0, 2160, 90))
  ) |>
  # One row per 90-day bin, so each bar's height equals the bin count.
  count(date_cut) |>
  ggplot(aes(x = date_cut, y = n)) +
  geom_col() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(x = "Date from 01/2006 - 09/2011", y = "Number of deaths per 90 days")
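The example above bins day counts into 90-day intervals with `cut()`. The base R sketch below shows how that binning behaves on a handful of hypothetical dates (they are made up for illustration and are not rows of `london_murders`). One detail worth knowing: by default `cut()` builds left-open intervals, so a day count of exactly 0 falls outside the first bin unless `include.lowest = TRUE` is set.

```r
# Hypothetical event dates, for illustration only.
dates <- as.Date(c("2006-01-01", "2006-01-15", "2006-03-31",
                   "2006-04-02", "2006-09-30"))

# Days elapsed since the start of the observation window.
day_count <- as.numeric(dates - as.Date("2006-01-01"))

# Default binning: intervals are (0,90], (90,180], ... so day 0 becomes NA.
bins_default <- cut(day_count, breaks = seq(0, 360, 90))
n_dropped <- sum(is.na(bins_default))

# Closing the first interval on the left keeps day 0 in the first bin.
bins <- cut(day_count, breaks = seq(0, 360, 90), include.lowest = TRUE)
bin_counts <- table(bins)
bin_counts
```

Whether day 0 should be included is a judgment call for the analysis; the sketch only makes the default behavior visible.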
NOTE: utils::txtProgressBar() and utils::setTxtProgressBar() are better.
Output a message while inside a for loop to update the user on progress. This function is useful in tracking progress when the number of iterations is large or the procedures in each iteration take a long time.
loop(i, n = NULL, every = 1, extra = NULL)
i | The index value used in the loop. |
n | The last entry in the loop. |
every | The number of loops between messages. |
extra | Additional information to print. |
David Diez
for (i in 1:160) {
  loop(i, 160, 20, paste("iter", i))
}
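Since the note recommends `utils::txtProgressBar()` and `utils::setTxtProgressBar()`, here is a minimal sketch of the same 160-iteration loop written with them. The running sum is only a stand-in for real work and is an assumption of this sketch.

```r
n <- 160
total <- 0

# Create a text progress bar that runs from 0 to n (style 3 shows a percentage).
pb <- utils::txtProgressBar(min = 0, max = n, style = 3)

for (i in seq_len(n)) {
  total <- total + i               # stand-in for the work done in iteration i
  utils::setTxtProgressBar(pb, i)  # advance the bar to the current iteration
}

# Close the bar so the console cursor moves to a fresh line.
close(pb)
```

Unlike `loop()`, the bar redraws in place rather than printing a new message every few iterations.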
Create a simple plot showing a line segment.
lsegments(
  x = c(3, 7),
  l = "o",
  r = "c",
  ticks = TRUE,
  labs = 1,
  add = 0,
  ylim = c(-0.75, 0.25)
)
x | The endpoints of the interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity. |
l | Indicate whether the left end point should be open ( |
r | Indicate whether the right end point should be open ( |
ticks | Indicate whether to show tick marks ( |
labs | The position for the point labels. Set to |
add | Indicate whether the line segment should be added to an existing plot ( |
ylim | A vector of length 2 specifying the vertical plotting limits, which may be useful for fine-tuning plots. The default is |
David Diez
lsegments(c(2, 7), "o", "c", ylim = c(-0.3, 0.2))
lsegments(c(5, 7), "c", "c", ylim = c(-0.3, 0.2))
lsegments(c(4, 1000), "o", "o", ylim = c(-0.3, 0.2))
This study investigated whether finding a coin influenced a person's likelihood of mailing a sealed but addressed letter that appeared to have been accidentally left in a conspicuous place. Several variables were collected during the experiment, including two randomized variables of whether there was a coin to be found and whether the letter already had a stamp on it.
mail_me
A data frame with 42 observations on the following 4 variables.
A factor with levels no and yes.
A factor with levels coin and no_coin.
A factor with levels female and male.
A factor with levels no and yes.
The precise context was in a phone booth (this study is from the 1970s!), where a person who entered a phone booth would find a dime in the phone tray, which would be sufficient to pay for their phone call. There was also a letter next to the phone, which sometimes had a stamp on it.
Levin PF, Isen AM. 1975. Studies on the Effect of Feeling Good on Helping. Sociometry 31(1), p141-147.
table(mail_me)

(x <- table(mail_me[, c("mailed_letter", "found_coin")]))
chisq.test(x)

(x <- table(mail_me[, c("mailed_letter", "stamped")]))
chisq.test(x)

m <- glm(mailed_letter ~ stamped + found_coin + gender,
  data = mail_me,
  family = binomial
)
summary(m)
Survey of 218 students, collecting information on their GPAs and their academic major.
major_survey
A data frame with 218 observations on the following 2 variables.
Grade point average (GPA).
Area of academic major.
library(ggplot2)

ggplot(major_survey, aes(x = major, y = gpa)) +
  geom_boxplot()
Produce a linear, quadratic, or nonparametric tube for regression data.
makeTube(
  x, y,
  Z = 2, R = 1,
  col = "#00000022", border = "#00000000",
  type = c("lin", "quad", "robust"),
  stDev = c("constant", "linear", "other"),
  length.out = 99,
  bw = "default",
  plotTube = TRUE, addLine = TRUE,
  ...
)
x | |
y | |
Z | Number of standard deviations out from the regression line to extend the tube. |
R | Control of how far the tube extends to the left and right. |
col | Fill color of the tube. |
border | Border color of the tube. |
type | The type of model fit to the data. Here |
stDev | Choices are constant variance ( |
length.out | The number of observations used to build the regression model. This argument may be increased to increase the smoothing of a quadratic or nonparametric curve. |
bw | Bandwidth used if |
plotTube | Whether the tube should be plotted. |
addLine | Whether the linear model should be plotted. |
... | Additional arguments passed to the |
X | |
Y | |
tubeX | |
tubeY | |
David Diez
# possum example
plot(possum$total_l, possum$head_l)
makeTube(possum$total_l, possum$head_l, 1)
makeTube(possum$total_l, possum$head_l, 2)
makeTube(possum$total_l, possum$head_l, 3)

# grades and TV example
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5)
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, stDev = "o")
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, type = "robust")
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, type = "robust", stDev = "o")

# what can go wrong with a basic least squares model
# 1
x <- runif(100)
y <- 25 * x - 20 * x^2 + rnorm(length(x), sd = 1.5)
plot(x, y)
makeTube(x, y, type = "q")

# 2
x <- c(-0.6, -0.46, -0.091, runif(97))
y <- 25 * x + rnorm(length(x))
y[2] <- y[2] + 8
y[1] <- y[1] + 1
plot(x, y, ylim = range(y) + c(-10, 5))
makeTube(x, y)

# 3
x <- runif(100)
y <- 5 * x + rnorm(length(x), sd = x)
plot(x, y)
makeTube(x, y, stDev = "l", bw = 0.03)
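To make the role of Z concrete, the following base R sketch draws a band that extends Z residual standard deviations above and below a least squares line. It is not the implementation of makeTube; it assumes a linear fit with constant variance, and the simulated data and the names fit, s, grid_x, center, upper, and lower are illustrative only.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
x <- runif(100)
y <- 5 * x + rnorm(100)

# Least squares fit and its residual standard deviation
fit <- lm(y ~ x)
s <- summary(fit)$sigma

# Grid of x values and a band Z residual SDs either side of the line
Z <- 2
grid_x <- seq(min(x), max(x), length.out = 99)
center <- predict(fit, newdata = data.frame(x = grid_x))
upper <- center + Z * s
lower <- center - Z * s

# Draw the points, the shaded band, and the fitted line
plot(x, y)
polygon(c(grid_x, rev(grid_x)), c(lower, rev(upper)),
  col = "#00000022", border = NA
)
lines(grid_x, center)
```

With constant variance the band has the same vertical width, 2 * Z * s, everywhere; a variance that changes with x would make the band widen or narrow along the line.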
Volunteer patients were randomized into one of two experiment groups where they would receive an experimental vaccine or a placebo. They were subsequently exposed to a drug-sensitive strain of malaria and observed to see whether they came down with an infection.
malaria
A data frame with 20 observations on the following 2 variables.
Whether a person was given the experimental vaccine or a placebo.
Whether the person got an infection or no infection.
In this study, volunteer patients were randomized into one of two experiment groups: 14 patients received an experimental vaccine and 6 patients received a placebo vaccine. Nineteen weeks later, all 20 patients were exposed to a drug-sensitive strain of malaria; a drug-sensitive strain was used for ethical reasons, so that any infections could be treated effectively.
Lyke et al. 2017. PfSPZ vaccine induces strain-transcending T cells and durable protection against heterologous controlled human malaria infection. PNAS 114(10):2711-2716. doi:10.1073/pnas.1615324114.
library(dplyr)

# Calculate conditional probabilities of infection after vaccine/placebo
malaria |>
  count(treatment, outcome) |>
  group_by(treatment) |>
  mutate(prop = n / sum(n))

# Fisher's exact test
fisher.test(table(malaria))
Random sample based on Food Commodity Intake Database distribution
male_heights
A data frame with 100 observations on the following variable.
a numeric vector
What We Eat In America - Food Commodity Intake Database. Available at https://fcid.foodrisk.org/.
male_heights
This sample is based on data from the USDA Food Commodity Intake Database.
male_heights_fcid
A data frame with 100 observations on the following variable.
Height, in inches.
Simulated based on data from USDA.
data(male_heights_fcid)
histPlot(male_heights_fcid$height_inch)
This dataset includes data for 39 species of mammals distributed over 13 orders. The data were used for analyzing the relationship between constitutional and ecological factors and sleeping in mammals. Two qualitatively different sleep variables (dreaming and non dreaming) were recorded. Constitutional variables such as life span, body weight, brain weight and gestation time were evaluated. Ecological variables such as severity of predation, safety of sleeping place and overall danger were inferred from field observations in the literature.
mammals
A data frame with 62 observations on the following 11 variables.
Species of mammals
Total body weight of the mammal (in kg)
Brain weight of the mammal (in kg)
Number of hours of non dreaming sleep
Number of hours of dreaming sleep
Total number of hours of sleep
Life span (in years)
Gestation time (in days)
An index of how likely the mammal is to be preyed upon. 1 = least likely to be preyed upon. 5 = most likely to be preyed upon.
An index of how exposed the mammal is during sleep. 1 = least exposed (e.g., sleeps in a well-protected den). 5 = most exposed.
An index of how much danger the mammal faces from other animals. This index is based upon Predation and Exposure. 1 = least danger from other animals. 5 = most danger from other animals.
http://www.statsci.org/data/general/sleep.txt
T. Allison and D. Cicchetti, "Sleep in mammals: ecological and constitutional correlates," Arch. Hydrobiol, vol. 75, p. 442, 1975.
library(ggplot2)

ggplot(mammals, aes(x = log(body_wt), y = log(brain_wt))) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Log of body weight", y = "Log of brain weight")
An experiment where 89,835 women were randomized to either get a mammogram or a non-mammogram breast screening. The response measured was whether they had died from breast cancer within 25 years.
mammogram
A data frame with 89835 observations on the following 2 variables.
a factor with levels control and mammogram
a factor with levels no and yes
Miller AB. 2014. Twenty five year follow-up for breast cancer incidence and mortality of the Canadian National Breast Screening Study: randomised screening trial. BMJ 2014;348:g366.
table(mammogram)
chisq.test(table(mammogram))
A data frame containing data on apartment rentals in Manhattan.
manhattan
A data frame with 20 observations on the following 1 variable.
Monthly rent for a 1 bedroom apartment listed as "For rent by owner".
library(ggplot2)

ggplot(manhattan, aes(rent)) +
  geom_histogram(color = "white", binwidth = 300) +
  theme_minimal() +
  labs(
    title = "Rent in Manhattan",
    subtitle = "1 Bedroom Apartments",
    x = "Rent (in US$)",
    caption = "Source: Craigslist"
  )
Marathon times of male and female winners of the New York City Marathon, 1970-1999. See nyc_marathon for a more up-to-date dataset. We recommend not using this dataset since the data source has been taken off the web.
marathon
A data frame with 60 observations on the following 3 variables.
Year
Gender
Running time (in hours)
Data source has been removed.
library(ggplot2)

ggplot(marathon, aes(x = time)) +
  geom_histogram(binwidth = 0.15)

ggplot(marathon, aes(y = time, x = gender)) +
  geom_boxplot()
Auction data from Ebay for the game Mario Kart for the Nintendo Wii. This data was collected in early October 2009.
mariokart
A data frame with 143 observations on the following 12 variables. All prices are in US dollars.
Auction ID assigned by Ebay.
Auction length, in days.
Number of bids.
Game condition, either new or used.
Start price of the auction.
Shipping price.
Total price, which equals the auction price plus the shipping price.
Shipping speed or method.
The seller's rating on Ebay. This is the number of positive ratings minus the number of negative ratings for the seller.
Whether the auction feature photo was a stock photo or not. If the picture was used in many auctions, then it was called a stock photo.
Number of Wii wheels included in the auction. These are steering wheel attachments to make it seem as though you are actually driving in the game. When used with the controller, turning the wheel actually causes the character on screen to turn.
The title of the auctions.
There are several interesting features in the data. First off, note that there are two outliers in the data. These serve as a nice example of what one should do when encountering an outlier: examine the data point and remove it only if there is a good reason. In these two cases, we can see from the auction titles that they included other items in their auctions besides the game, which justifies removing them from the dataset.
This dataset includes all auctions for a full week in October 2009. Auctions were included in the dataset if they satisfied a number of conditions. (1) They were included in a search for "wii mario kart" on ebay.com, (2) items were in the Video Games > Games > Nintendo Wii section of Ebay, (3) the listing was an auction and not exclusively a "Buy it Now" listing (sellers sometimes offer an optional higher price for a buyer to end bidding and win the auction immediately, which is an optional Buy it Now auction), (4) the item listed was the actual game, (5) the item was being sold from the US, (6) the item had at least one bidder, (7) there were no other items included in the auction with the exception of racing wheels, either generic or brand-name being acceptable, and (8) the auction did not end with a Buy It Now option.
Ebay.
library(ggplot2)
library(broom)
library(dplyr)

# Identify outliers
ggplot(mariokart, aes(x = total_pr, y = cond)) +
  geom_boxplot()

# Replot without the outliers
mariokart |>
  filter(total_pr < 80) |>
  ggplot(aes(x = total_pr, y = cond)) +
  geom_boxplot()

# Fit multiple regression models
mariokart_no <- mariokart |>
  filter(total_pr < 80)
m1 <- lm(total_pr ~ cond + stock_photo + duration + wheels, data = mariokart_no)
tidy(m1)
m2 <- lm(total_pr ~ cond + stock_photo + wheels, data = mariokart_no)
tidy(m2)
m3 <- lm(total_pr ~ cond + wheels, data = mariokart_no)
tidy(m3)

# Fit diagnostics
aug_m3 <- augment(m3)

ggplot(aug_m3, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

ggplot(aug_m3, aes(x = .fitted, y = abs(.resid))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Absolute value of residuals")

ggplot(aug_m3, aes(x = 1:nrow(aug_m3), y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Order of data collection", y = "Residuals")

ggplot(aug_m3, aes(x = cond, y = .resid)) +
  geom_boxplot() +
  labs(x = "Condition", y = "Residuals")

ggplot(aug_m3, aes(x = wheels, y = .resid)) +
  geom_point() +
  labs(
    x = "Number of wheels", y = "Residuals",
    title = "Notice curvature"
  )
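The advice in the details above (examine an outlier, and remove it only for a good reason) can be sketched as a two-step check. The data frame below is made up for illustration and is not taken from mariokart; the column name title is an assumption about how the auction title is stored, while total_pr and the cutoff of 80 follow the examples above.

```r
# Hypothetical auctions (not real data); `title` is an assumed column name
auctions <- data.frame(
  total_pr = c(51.55, 37.04, 118.50, 45.50),
  title = c(
    "Mario Kart Wii with wheel",
    "Mario Kart Wii (used)",
    "Wii console bundle with Mario Kart and 10 games",
    "Mario Kart Wii new sealed"
  )
)

# Step 1: read the titles of the unusually expensive auctions first
expensive <- subset(auctions, total_pr >= 80)
expensive$title

# Step 2: drop them only because the titles show that extra items were sold
auctions_no <- subset(auctions, total_pr < 80)
nrow(auctions_no)
```

Keeping the inspection step separate from the filtering step makes the reason for each removal visible in the analysis script.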
The Massachusetts Comprehensive Assessment System (MCAS, https://www.doe.mass.edu/mcas/) uses state-wide testing to assess whether school districts, schools, and students are meeting expectations. This dataset records the percentage of students scoring proficient or advanced in the 2018 Mathematics test. School-level variables include possible predictors of test performance such as the demographics of the student population and administrative features of the school.
mcas
A data frame with 356 rows and 21 columns.
PA_perc
Numeric, percentage of students scoring proficient or advanced.
average_class_size
Numeric, average class size in the school, regardless of subject.
average_math_class_size
Numeric, average size of math classes in the school.
student_teacher_ratio
Numeric, average student-teacher ratio in the school.
attendance_rate
Numeric, the number of full-time equivalent student-days attended by full-time students in grades 1-10 as a percentage of the total number of possible student-days during the period.
number_of_students
Numeric, the total number of students including special education beyond grade 12.
largest_minority
Character, largest minority group.
school_name
Character, school name.
district_name
Character, Massachusetts school district.
english_learner
Numeric, percentage of students for whom the first language is other than English and who cannot perform ordinary classroom work in English.
students_disabilities
Numeric, percentage of students in the school with an individual education plan (IEP) identifying special learning needs
econ_dis
Numeric, percentage of students from economically disadvantaged background. Determined based on student participation in one or more of the following state-administered programs: the Supplemental Nutrition Assistance Program (SNAP); the Transitional Assistance for Families with Dependent Children (TAFDC); the Department of Children and Families' (DCF) foster care program; and Medicaid.
african_american
Numeric, percentage of students in the school having origins in any of the black racial groups of Africa.
asian
Numeric, percentage of students having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent.
white
Numeric, percentage of students having origins in any of the original peoples of Europe, the Middle East, or North Africa.
hispanic
Numeric, percentage of students of Cuban, Mexican, Puerto Rican, South or Central American descent, or other Spanish culture or origin, regardless of race.
native_american
Numeric, percentage of students having origins in any of the original peoples of North and South America (including Central America), and who maintain tribal affiliation or community attachment.
native_hawaiian_pacific_islander
Numeric, percentage of students having origins in any of the original peoples of Hawaii, Guam, Samoa, or other Pacific Islands.
multi_race_non_hispanic
Numeric, percentage of students selecting more than one racial category and non-Hispanic.
exp_per_pupil
Numeric, amount spent by the school district per pupil, in dollars. Calculated by dividing a district's operating expenditures by its average pupil membership.
majority
Character, coded white if 50% of the students in the school are in racial category white, otherwise coded minority.
https://profiles.doe.mass.edu/statereport/
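As a worked illustration of how exp_per_pupil is described above (operating expenditures divided by average pupil membership), the sketch below uses made-up district figures. The numbers and the two input names are hypothetical and do not come from the mcas data.

```r
# Hypothetical district figures (not taken from the mcas data)
operating_expenditures <- 45000000 # dollars
average_pupil_membership <- 3000 # students

# Per-pupil expenditure, in dollars
exp_per_pupil <- operating_expenditures / average_pupil_membership
exp_per_pupil
```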
A list of Marvel Cinematic Universe films through the Infinity Saga. The Infinity Saga is a 23-movie storyline spanning from Iron Man in 2008 to Endgame in 2019.
mcu_films
A data frame with 23 rows and 7 variables.
Title of the movie.
Length of the movie: hours portion.
Length of the movie: minutes portion.
Date the movie was released in the US.
Box office totals for opening weekend in the US.
All box office totals in US.
All box office totals worldwide.
Box office figures are not adjusted to a specific year. They are from the year the film was released.
library(ggplot2)
library(scales)

ggplot(mcu_films, aes(x = opening_weekend_us, y = gross_us)) +
  geom_point() +
  labs(
    title = "MCU Box Office Totals: Opening weekend vs. all-time",
    x = "Opening weekend totals (USD in millions)",
    y = "All-time totals (USD in millions)"
  ) +
  scale_x_continuous(labels = label_dollar(scale = 1 / 1000000)) +
  scale_y_continuous(labels = label_dollar(scale = 1 / 1000000))
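Because the running time is split into an hours portion and a minutes portion, a single duration has to be assembled before lengths can be compared. The sketch below shows one way to do that on a made-up data frame; the column names length_hrs and length_min are assumptions and should be checked with names(mcu_films) before reuse.

```r
# Hypothetical rows; `length_hrs` and `length_min` are assumed column names
films <- data.frame(
  movie = c("Film A", "Film B"),
  length_hrs = c(2, 3),
  length_min = c(6, 1)
)

# Total running time in minutes: hours portion times 60, plus minutes portion
films$total_min <- films$length_hrs * 60 + films$length_min
films
```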
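Covers midterm elections: for each midterm year, the president in office, the president's party, the unemployment rate, and the change in House seats for the president's party.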
midterms_house
A data frame with 29 observations on the following 5 variables.
Year.
The president in office.
President's party: Democrat or Republican.
Unemployment rate.
Change in House seats for the President's party.
An older version of this data is at unemploy_pres.
Wikipedia.
library(ggplot2)

ggplot(midterms_house, aes(x = unemp, y = house_change)) +
  geom_point()
Experiment involving acupuncture and sham acupuncture (as placebo) in the treatment of migraines.
migraine
A data frame with 89 observations on the following 2 variables.
a factor with levels control and treatment
a factor with levels no and yes
G. Allais et al. Ear acupuncture in the treatment of migraine attacks: a randomized trial on the efficacy of appropriate versus inappropriate acupoints. In: Neurological Sci. 32.1 (2011), pp. 173-175.
migraine
This dataset contains demographic information on every member of the US armed forces including gender, race, and rank.
military
A data frame with 1,414,593 observations on the following 6 variables.
The status of the service member as enlisted, officer, or warrant officer.
The branch of the armed forces: air force, army, marine corps, navy.
Whether the service member is female or male.
The race identified by the service member: ami/aln (american indian/alaskan native), asian, black, multi (multi-ethnic), p/i (pacific islander), unk (unknown), or white.
Whether a service member identifies with being hispanic (TRUE) or not (FALSE).
The numeric rank of the service member (higher number indicates higher rank).
The branches covered by this dataset include the Army, Navy, Air Force, and Marine Corps. Demographic information on the Coast Guard is contained in the original dataset but has not been included here.
Data provided by the Department of Defense and made available at https://catalog.data.gov/dataset/personnel-trends-by-gender-race, retrieved 2012-02-20.
## Not run:
library(dplyr)
library(ggplot2)
library(forcats)

# Proportion of females in military branches
military |>
  ggplot(aes(x = branch, fill = gender)) +
  geom_bar(position = "fill") +
  labs(
    x = "Branch", y = "Proportion", fill = "Gender",
    title = "Proportion of females in military branches"
  )

# Proportion of army officer females across ranks
military |>
  filter(
    grade == "officer",
    branch == "army"
  ) |>
  ggplot(aes(x = factor(rank), fill = fct_rev(gender))) +
  geom_bar(position = "fill") +
  labs(
    x = "Rank", y = "Proportion", fill = "Gender",
    title = "Proportion of army officer females across ranks"
  )
## End(Not run)
Salary data for Major League Baseball players in the year 2010.
mlb
A data frame with 828 observations on the following 4 variables.
Player name
Team
Field position
Salary (in $1000s)
https://databases.usatoday.com/mlb-salaries/, retrieved 2011-02-23.
# _____ Basic Histogram _____ #
hist(mlb$salary / 1000,
  breaks = 15,
  main = "", xlab = "Salary (millions of dollars)", ylab = "",
  axes = FALSE,
  col = "#22558844"
)
axis(1, seq(0, 40, 10))
axis(2, c(0, 500))
axis(2, seq(100, 400, 100), rep("", 4), tcl = -0.2)

# _____ Histogram on Log Scale _____ #
hist(log(mlb$salary / 1000),
  breaks = 15,
  main = "", xlab = "log(Salary)", ylab = "",
  axes = FALSE, col = "#22558844"
)
axis(1) # , seq(0, 40, 10))
axis(2, seq(0, 300, 100))

# _____ Box plot of log(salary) against position _____ #
boxPlot(log(mlb$salary / 1000), mlb$position, horiz = TRUE, ylab = "")
Batter statistics for 2018 Major League Baseball season.
mlb_players_18
A data frame with 1270 observations on the following 19 variables.
Player name
Team abbreviation
Position abbreviation: 1B = first base, 2B = second base, 3B = third base, C = catcher, CF = center field (outfield), DH = designated hitter, LF = left field (outfield), P = pitcher, RF = right field (outfield), SS = shortstop.
Number of games played.
At bats.
Runs.
Hits.
Doubles.
Triples.
Home runs.
Runs batted in.
Walks.
Strike outs.
Stolen bases.
Number of times caught stealing a base.
Batting average.
On-base percentage.
Slugging percentage.
On-base percentage plus slugging percentage.
d <- subset(mlb_players_18, !position %in% c("P", "DH") & AB >= 100)
dim(d)

# _____ Per Position, No Further Grouping _____ #
plot(d$OBP ~ as.factor(d$position))
model <- lm(OBP ~ as.factor(position), d)
summary(model)
anova(model)

# _____ Simplified Analysis, Fewer Positions _____ #
pos <- list(
  c("LF", "CF", "RF"),
  c("1B", "2B", "3B", "SS"),
  "C"
)
POS <- c("OF", "IF", "C")
table(d$position)

# _____ On-Base Percentage Across Positions _____ #
out <- c()
gp <- c()
for (i in seq_along(pos)) {
  these <- which(d$position %in% pos[[i]])
  out <- c(out, d$OBP[these])
  gp <- c(gp, rep(POS[i], length(these)))
}
plot(out ~ as.factor(gp))
summary(lm(out ~ as.factor(gp)))
anova(lm(out ~ as.factor(gp)))
A subset of data on Major League Baseball teams from Lahman's Baseball Database. The full dataset is available in the Lahman R package.
mlb_teams
A data frame with 2784 rows and 41 variables.
Year of play.
League the team plays in with levels AL (American League) and NL (National League).
Division the team plays in with levels W (west), E (east) and C (central).
Team's rank in their division at the end of the regular season.
Games played.
Games played at home.
Number of games won.
Number of games lost.
Did the team win their division? Levels of Y (yes) and N (no).
Was the team a wild card winner? Levels of Y (yes) and N (no).
Did the team win their league? Levels of Y (yes) and N (no).
Did the team win the World Series? Levels of Y (yes) and N (no).
Number of runs scored during the season.
Number of at bats during the season.
Number of hits during the season. Includes singles, doubles, triples and home runs.
Number of doubles hit.
Number of triples hit.
Home runs by batters.
Number of walks.
Number of batters struck out.
Number of stolen bases.
Number of base runners caught stealing.
Number of batters hit by a pitch.
Number of sacrifice flies.
Number of runs scored by opponents.
Number of earned runs allowed.
Earned run average.
Number of games where a single pitcher played the entire game.
Number of shutouts.
Number of saves.
Number of outs pitched for the season (number of innings pitched times 3).
Number of hits made by opponents.
Number of home runs hit by opponents.
Number of opponents who were walked.
Number of opponents who were struck out.
Number of errors.
Number of double plays.
Team's fielding percentage.
Full name of team.
Home ballpark name.
Home attendance total.
Lahman's Baseball Database.
library(dplyr)

# List the World Series winning teams for each year
mlb_teams |>
  filter(world_series_winner == "Y") |>
  select(year, team_name, ball_park)

# List the teams with their average number of wins and losses
mlb_teams |>
  group_by(team_name) |>
  summarize(mean_wins = mean(wins), mean_losses = mean(losses)) |>
  arrange(team_name)
Major League Baseball Player Hitting Statistics for 2010.
mlbbat10
A data frame with 1199 observations on the following 19 variables.
Player name
Team abbreviation
Player position
Number of games
Number of at bats
Number of runs
Number of hits
Number of doubles
Number of triples
Number of home runs
Number of runs batted in
Total bases, computed as 3*HR + 2*3B + 1*2B + H
Number of walks
Number of strikeouts
Number of stolen bases
Number of times caught stealing
On base percentage
Slugging percentage (total_base / at_bat)
Batting average
https://www.mlb.com, retrieved 2011-04-22.
library(ggplot2)
library(dplyr)
library(scales)

mlbbat10_200 <- mlbbat10 |>
  filter(at_bat > 200)

# On-base percentage across positions
ggplot(mlbbat10_200, aes(x = position, y = obp, fill = position)) +
  geom_boxplot(show.legend = FALSE) +
  scale_y_continuous(labels = label_number(suffix = "%", accuracy = 0.01)) +
  labs(
    title = "On-base percentage across positions",
    y = "On-base percentage",
    x = "Position"
  )

# Batting average across positions
ggplot(mlbbat10_200, aes(x = bat_avg, fill = position)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Batting average across positions",
    fill = NULL,
    y = "Density",
    x = "Batting average"
  )

# Mean number of home runs across positions
mlbbat10_200 |>
  group_by(position) |>
  summarise(mean_home_run = mean(home_run)) |>
  ggplot(aes(x = position, y = mean_home_run, fill = position)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Mean number of home runs across positions",
    y = "Home runs",
    x = "Position"
  )

# Runs across positions
ggplot(mlbbat10_200, aes(x = position, y = run, fill = position)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Runs across positions",
    y = "Runs",
    x = "Position"
  )
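The total bases and slugging percentage definitions in the variable list can be checked by hand. The sketch below uses a made-up season line rather than a row of mlbbat10, and the variable names are chosen for readability only.

```r
# Hypothetical season line (not a real player)
hits <- 150
doubles <- 30
triples <- 5
home_runs <- 20
at_bats <- 500

# Every hit is worth one base; doubles, triples, and home runs add 1, 2, and 3
total_base <- hits + 1 * doubles + 2 * triples + 3 * home_runs

# Cross-check by counting bases per hit type
singles <- hits - doubles - triples - home_runs
total_base_check <- 1 * singles + 2 * doubles + 3 * triples + 4 * home_runs

# Slugging percentage is total bases per at bat
slg <- total_base / at_bats
c(total_base = total_base, check = total_base_check, slg = slg)
```

Both ways of counting give the same total, which is why the formula only needs the extra bases on top of the hit count.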
Minneapolis police use of force data from 2016 through August 2021.
mn_police_use_of_force
A data frame with 12925 rows and 13 variables.
DateTime of police response.
Problem that required police response.
Whether response was initiated by call to 911.
Offense of subject.
Whether subject was injured Yes/No/null.
Type of police force used.
Detail of police force used.
Race of subject.
Gender of subject.
Age of subject.
Resistance to police by subject.
Precinct where response occurred.
Neighborhood where response occurred.
library(dplyr)
library(ggplot2)

# List percent of total for each race
mn_police_use_of_force |>
  count(race) |>
  mutate(percent = round(n / sum(n) * 100, 2)) |>
  arrange(desc(percent))

# Display use of force count by three races
race_sub <- c("Asian", "White", "Black")

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(
  mn_police_use_of_force |> filter(race %in% race_sub),
  aes(force_type, after_stat(count))
) +
  geom_point(stat = "count", size = 4) +
  coord_flip() +
  facet_grid(race ~ .) +
  labs(
    x = "Force Type",
    y = "Number of Incidents"
  )
Plot a mosaic plot custom built for a particular figure.
MosaicPlot( formula, data, col = "#00000022", border = 1, dir = c("v", "h"), off = 0.01, cex.axis = 0.7, col.dir = "v", flip = c("v"), ... )
formula |
Formula describing the variable relationship. |
data |
Data frame for the variables, optional. |
col |
Colors for plotting. |
border |
Ignored. |
dir |
Ignored. |
off |
Fraction of white space between each box in the plot. |
cex.axis |
Axis label size. |
col.dir |
Direction to lay out colors. |
flip |
Whether to flip the ordering of the vertical ( |
... |
Ignored. |
David Diez
data(email)
data(COL)
email$spam <- ifelse(email$spam == 0, "not\nspam", "spam")
MosaicPlot(number ~ spam, email, col = COL[1:3], off = 0.02)
A dataset with information about movies released in 2003.
movies
A data frame with 140 observations on the following 5 variables.
Title of the movie.
Genre of the movie.
Critics score of the movie on a 0 to 100 scale.
MPAA rating of the film.
Millions of dollars earned at the box office in the US and Canada.
Investigating Statistical Concepts, Applications and Methods
library(ggplot2)

ggplot(movies, aes(score, box_office, color = genre)) +
  geom_point() +
  theme_minimal() +
  labs(
    title = "Does a critic score predict box office earnings?",
    x = "Critic rating",
    y = "Box office earnings (millions US$)",
    color = "Genre"
  )
The data are from a convenience sample of 25 women and 10 men who were middle-aged or older. The purpose of the study was to understand the relationship between sedentary behavior and thickness of the medial temporal lobe (MTL) in the brain.
mtl
A data frame with 35 observations on the following 23 variables.
ID for the individual.
Gender, which takes values F (female) or M (male).
Ethnicity, simplified to Caucasian and Other.
Years of education.
APOE-4 status, taking a value of E4 or Non-E4.
Age, in years.
Score from the Mini-Mental State Examination, which is a global cognition evaluation.
Score on the Hamilton Rating Scale for anxiety.
Score on the Hamilton Rating Scale for depression.
We (the authors of this R package) are unsure as to the meaning of this variable.
We (the authors of this R package) are unsure as to the meaning of this variable.
We (the authors of this R package) are unsure as to the meaning of this variable.
Self-reported time sitting per day, averaged to the nearest hour.
Metabolic equivalent units score (activity level). A score of 0 means "no activity" while 3000 is considered "high activity".
Classification of METminwk into Low or High.
Thickness of the CA1 subregion of the MTL.
Thickness of the CA23DG subregion of the MTL.
Thickness of a subregion of the MTL.
Thickness of the fusiform gyrus subregion of the MTL.
Thickness of the perirhinal cortex subregion of the MTL.
Thickness of the entorhinal cortex subregion of the MTL.
Thickness of the subiculum subregion of the MTL.
Total MTL thickness.
Siddarth P, Burggren AC, Eyre HA, Small GW, Merrill DA. 2018. Sedentary behavior associated with reduced medial temporal lobe thickness in middle-aged and older adults. PLoS ONE 13(4): e0195549. doi:10.1371/journal.pone.0195549.
Thank you to Professor Silas Bergen of Winona State University for pointing us to this dataset!
A New York Times article references this study. https://www.nytimes.com/2018/04/19/opinion/standing-up-at-your-desk-could-make-you-smarter.html
    library(ggplot2)
    ggplot(mtl, aes(x = ipa_qgrp, y = met_minwk)) +
      geom_boxplot()
Population, percent in poverty, percent unemployment, and murder rate.
murders
A data frame with 20 metropolitan areas on the following 4 variables.
Population.
Percent in poverty.
Percent unemployed.
Number of murders per year per million people.
We do not have provenance for these data and therefore recommend not using them for analysis.
    library(ggplot2)
    ggplot(murders, aes(x = perc_pov, y = annual_murders_per_mil)) +
      geom_point() +
      labs(
        x = "Percent in poverty",
        y = "Number of murders per year per million people"
      )
A similar function to pdf and png, except that different defaults are provided, including for the plotting parameters.
myPDF( fileName, width = 5, height = 3, mar = c(3.9, 3.9, 1, 1), mgp = c(2.8, 0.55, 0), las = 1, tcl = -0.3, ... )
Argument | Description |
---|---|
fileName | File name for the image to be output. The name should end in |
width | The width of the image file (inches). Default: |
height | The height of the image file (inches). Default: |
mar | Plotting margins. To change, input a numerical vector of length 4. |
mgp | Margin graphing parameters. To change, input a numerical vector of length 3. The first argument specifies where x and y labels are placed; the second specifies where the axis labels are placed; and the third specifies how far to pull the entire axis from the plot. |
las | Orientation of axis labels. Input |
tcl | The tick mark length as a proportion of text height. The default is |
... | Additional arguments to |
David Diez
    # save a plot to a PDF
    # myPDF("myPlot.pdf")
    histPlot(mariokart$total_pr)
    # dev.off()

    # save a plot to a PNG
    # myPNG("myPlot.png")
    histPlot(mariokart$total_pr)
    # dev.off()
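As a rough illustration of what "different defaults" can mean in practice, the sketch below opens a PDF device and then sets the plotting parameters from the usage line. The name `my_pdf_sketch` is hypothetical and this is not the package's implementation; it only shows one way such a wrapper could be written with base R.

```r
# Hypothetical wrapper, not the openintro source: open a PDF device and
# then apply the plotting defaults listed in the usage above.
my_pdf_sketch <- function(file_name, width = 5, height = 3,
                          mar = c(3.9, 3.9, 1, 1),
                          mgp = c(2.8, 0.55, 0),
                          las = 1, tcl = -0.3, ...) {
  grDevices::pdf(file_name, width = width, height = height, ...)
  # par() applies to the device that was just opened
  graphics::par(mar = mar, mgp = mgp, las = las, tcl = tcl)
  invisible(file_name)
}

out_file <- tempfile(fileext = ".pdf")
my_pdf_sketch(out_file)
hist(rnorm(100), main = "", xlab = "Simulated values")
dev.off()
file.exists(out_file)
```

Setting `par()` after the device is opened matters, because graphical parameters belong to the active device rather than to the session.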
This dataset contains information about the teams who played in the NBA Finals from 1950 - 2022.
nba_finals
A data frame with 73 rows and 9 variables:
The year in which the Finals took place.
The team who won the series.
Number of series wins by the Western Conference Champions.
Number of series wins by the Eastern Conference Champions.
Team that won the Western Conference title and played in the Finals.
Team that won the Eastern Conference title and played in the Finals.
Coach of the Western Conference champions.
Coach of the Eastern Conference champions.
Which conference held home court advantage for the series.
Wikipedia: List of NBA Champions
    library(dplyr)
    library(ggplot2)
    library(tidyr)

    # Top 5 Appearing Coaches
    nba_finals |>
      pivot_longer(
        cols = c("western_coach", "eastern_coach"),
        names_to = "conference", values_to = "coach"
      ) |>
      count(coach, sort = TRUE) |>
      slice_head(n = 5)

    # Top 5 Winning Coaches
    nba_finals |>
      mutate(
        winning_coach = case_when(
          western_wins == 4 ~ western_coach,
          eastern_wins == 4 ~ eastern_coach
        )
      ) |>
      count(winning_coach, sort = TRUE) |>
      slice_head(n = 5)
A dataset with individual team summaries for the NBA Finals series from 1950 to 2022. To win the Finals, a team must win 4 games. The maximum number of games in a series is 7.
nba_finals_teams
A data frame with 33 rows and 7 variables:
Team name.
Number of NBA Championships won.
Number of NBA Championships lost.
Number of NBA Finals appearances.
Win percentage.
Years in which the team won a Championship.
Years in which the team lost a Championship.
Notes:
The Chicago Stags folded in 1950, the Washington Capitols in 1951 and the Baltimore Bullets in 1954.
This list uses current team names. For example, the Seattle SuperSonics are not on the list as that team moved and became the Oklahoma City Thunder.
    library(ggplot2)
    library(dplyr)
    library(openintro)

    teams_with_apps <- nba_finals_teams |>
      filter(apps != 0)

    ggplot(teams_with_apps, aes(x = win)) +
      geom_histogram(binwidth = 2) +
      labs(
        title = "Number of NBA Finals series wins",
        x = "Number of wins",
        y = "Number of teams"
      )

    ggplot(teams_with_apps, aes(x = apps, y = win)) +
      geom_point(alpha = 0.3) +
      labs(
        title = "Can we predict how many NBA Championships a team has based on the number of appearances?",
        x = "Number of NBA Finals appearances",
        y = "Number of NBA Finals series wins"
      )
Heights of all NBA players from the 2008-9 season.
nba_heights
A data frame with 435 observations (players) on the following 4 variables.
Last name.
First name.
Height, in meters.
Height, in inches.
Collected from NBA.
qqnorm(nba_heights$h_meters)
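The two height columns describe the same measurement in different units. The short sketch below shows the conversion on made-up values; the vector names are local to the example and are not read from the dataset.

```r
# Hypothetical heights in meters; these are not rows from nba_heights.
h_meters <- c(1.83, 1.98, 2.11)

# One inch is defined as exactly 0.0254 meters.
h_in <- h_meters / 0.0254
round(h_in, 1)
```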
Summary information from the NBA players for the 2018-2019 season.
nba_players_19
A data frame with 494 observations on the following 7 variables.
First name.
Last name.
Team name.
3-letter team abbreviation.
Player position.
Jersey number.
Height, in inches.
    hist(nba_players_19$height, 20)
    table(nba_players_19$team)
In 2004, the state of North Carolina released to the public a large dataset containing information on births recorded in this state. This dataset has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from this dataset.
ncbirths
A data frame with 1000 observations on the following 13 variables.
Father's age in years.
Mother's age in years.
Maturity status of mother.
Length of pregnancy in weeks.
Whether the birth was classified as premature (premie) or full-term.
Number of hospital visits during pregnancy.
Weight gained by mother during pregnancy in pounds.
Weight of the baby at birth in pounds.
Whether baby was classified as low birthweight (low) or not (not low).
Gender of the baby, female or male.
Status of the mother as a nonsmoker or a smoker.
Whether mother is married or not married at birth.
Whether mom is white or not white.
We do not have ideal provenance for these data. For a better documented and more recent dataset on a similar topic with similar variables, see births14.
    library(ggplot2)

    ggplot(ncbirths, aes(x = habit, y = weight)) +
      geom_boxplot() +
      labs(x = "Smoking status of mother", y = "Birth weight of baby (in lbs)")

    ggplot(ncbirths, aes(x = whitemom, y = visits)) +
      geom_boxplot() +
      labs(x = "Mother's race", y = "Number of doctor visits during pregnancy")

    ggplot(ncbirths, aes(x = mature, y = gained)) +
      geom_boxplot() +
      labs(x = "Mother's age category", y = "Weight gained during pregnancy")
The dataset NHANES (US National Health and Nutrition Examination Study) is part of the CRAN package NHANES (author Randall Pruim) and contains 76 variables on 100,000 participants from surveys conducted between 2009 and 2012. The surveys are part of a series conducted by the US National Center for Health Statistics (NCHS) since the 1960s. See the NHANES package documentation for more information about the surveys.
nhanes.samp
A dataframe with 200 rows and 76 variables
The dataset NHANES is a weighted sample from the full survey dataset constructed so that it may be treated as a random sample of the US population. The dataset nhanes.samp contains data from a random sample of size 200 from NHANES and all 76 variables. See the NHANES package for variable definitions and coding.
Pruim R (2015). NHANES: Data from the US National Health and Nutrition Examination Study. R package version 2.1.0, https://CRAN.R-project.org/package=NHANES.
The dataset NHANES (US National Health and Nutrition Examination Study) is part of the CRAN package NHANES (author Randall Pruim) and contains 76 variables on 100,000 participants from surveys conducted between 2009 and 2012. The surveys are part of a series conducted by the US National Center for Health Statistics (NCHS) since the 1960s. See the NHANES package documentation for more information about the surveys.
nhanes.samp.adult
A dataframe with 135 rows and 76 variables
The dataset NHANES is a weighted sample from the full survey dataset constructed so that it may be treated as a random sample of the US population. The dataset nhanes.samp.adult contains data from the 135 participants 21 years of age or older from the nhanes.samp dataset. See the NHANES package for variable definitions and coding.
Pruim R (2015). NHANES: Data from the US National Health and Nutrition Examination Study. R package version 2.1.0, https://CRAN.R-project.org/package=NHANES.
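The adult subset is described as the participants aged 21 or older from nhanes.samp. A minimal sketch of that kind of filter is shown below on a made-up data frame; the column name `Age` is an assumption about how age is stored, and the values are illustrative only.

```r
# Hypothetical stand-in for nhanes.samp: only an age column matters here,
# and these values are made up for illustration.
toy_samp <- data.frame(
  ID = 1:6,
  Age = c(8, 20, 21, 34, 57, 15)
)

# Keep participants 21 years of age or older.
toy_adult <- subset(toy_samp, Age >= 21)
nrow(toy_adult)
```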
The dataset NHANES (US National Health and Nutrition Examination Study) is part of the CRAN package NHANES (author Randall Pruim) and contains 76 variables on 100,000 participants from surveys conducted between 2009 and 2012. The surveys are part of a series conducted by the US National Center for Health Statistics (NCHS) since the 1960s. See the NHANES package documentation for more information about the surveys.
nhanes.samp.adult.500
A dataframe with 500 rows and 76 variables
The dataset NHANES is a weighted sample from the full survey dataset constructed so that it may be treated as a random sample of the US population. The dataset nhanes.samp.adult.500 contains data from 500 participants 21 years of age or older randomly sampled from the NHANES dataset. See the NHANES package for variable definitions and coding.
Pruim R (2015). NHANES: Data from the US National Health and Nutrition Examination Study. R package version 2.1.0, https://CRAN.R-project.org/package=NHANES.
Produce a normal (or t) distribution and shaded tail.
normTail( m = 0, s = 1, L = NULL, U = NULL, M = NULL, df = 1000, curveColor = 1, border = 1, col = "#CCCCCC", xlim = NULL, ylim = NULL, xlab = "", ylab = "", digits = 2, axes = 1, detail = 999, xLab = c("number", "symbol"), cex.axis = 1, xAxisIncr = 1, add = FALSE, ... )
Argument | Description |
---|---|
m | Numerical value for the distribution mean. |
s | Numerical value for the distribution standard deviation. |
L | Numerical value representing the cutoff for a shaded lower tail. |
U | Numerical value representing the cutoff for a shaded upper tail. |
M | Numerical value representing the cutoff for a shaded central region. |
df | Numerical value describing the degrees of freedom. Default is |
curveColor | The color for the distribution curve. |
border | The color for the border of the shaded area. |
col | The color for filling the shaded area. |
xlim | Limits for the x axis. |
ylim | Limits for the y axis. |
xlab | A title for the x axis. |
ylab | A title for the y axis. |
digits | The maximum number of digits past the decimal to use in axes values. |
axes | A numeric value denoting whether to draw both axes ( |
detail | A number describing the number of points to use in drawing the normal curve. Smaller values correspond to a less smooth curve but reduced memory usage in the final file. |
xLab | If |
cex.axis | Numerical value controlling the size of the axis labels. |
xAxisIncr | A number describing how often axis labels are placed, scaled by standard deviations. This argument is ignored if |
add | Boolean indicating whether to add this normal curve to the existing plot. |
... | Additional arguments to |
David Diez
    normTail(3, 2, 5)
    normTail(3, 2, 1, xLab = "symbol")
    normTail(3, 2, M = 1:2, xLab = "symbol", cex.axis = 0.8)
    normTail(3, 2, U = 5, axes = FALSE)
    normTail(L = -1, U = 2, M = c(0, 1), axes = 3, xAxisIncr = 2)
    normTail(
      L = -1, U = 2, M = c(0, 1),
      xLab = "symbol", cex.axis = 0.8, xAxisIncr = 2
    )
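For readers who want to see the idea behind a shaded tail without the package, the base R sketch below draws a normal curve and fills the region below a cutoff, loosely mirroring `normTail(3, 2, 5)`. It is an illustration only, not normTail's own code, and the variable names are chosen for this example.

```r
# Base R sketch (not normTail's own code): shade the lower tail of a
# N(m, s) curve below the cutoff L.
m <- 3
s <- 2
L <- 5

x <- seq(m - 4 * s, m + 4 * s, length.out = 999)
y <- dnorm(x, mean = m, sd = s)

plot(x, y, type = "l", xlab = "", ylab = "", axes = FALSE)
axis(1, at = m + (-3:3) * s)

# Trace the curve up to the cutoff, then close the region along the x axis.
in_tail <- x <= L
polygon(
  c(x[in_tail], L, L, min(x)),
  c(y[in_tail], dnorm(L, mean = m, sd = s), 0, 0),
  col = "#CCCCCC", border = 1
)

# Area of the shaded region.
tail_area <- pnorm(L, mean = m, sd = s)
tail_area
```

The shaded area equals the lower-tail probability, so `pnorm()` gives the number that the picture represents.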
A simple random sample of 1,028 US adults in March 2013 found that 56% support nuclear arms reduction.
nuclear_survey
A data frame with 1028 observations on the following variable.
Responses of favor or against.
Gallup report: In U.S., 56 percent Favor U.S.-Russian Nuclear Arms Reductions. Available at https://news.gallup.com/poll/161198/favor-russian-nuclear-arms-reductions.aspx.
table(nuclear_survey)
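The summary figures in the description (56% of 1,028 respondents in favor) are enough to sketch a confidence interval for the population proportion. The example below uses the usual normal approximation and works from those two numbers rather than from the data frame itself.

```r
# Summary values stated in the description: 56% of 1,028 respondents in favor.
n <- 1028
p_hat <- 0.56

# Standard error and a normal-approximation 95% confidence interval.
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)
```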
Zagat is a public survey where anyone can provide scores to a restaurant. The scores from the general public are then gathered to produce ratings. This dataset contains a list of 168 NYC restaurants and their Zagat Ratings.
nyc
A data frame with 168 observations on the following 6 variables.
Name of the restaurant.
Price of a meal for two, with drinks, in US $.
Zagat rating for food.
Zagat rating for decor.
Zagat rating for service.
Indicator variable for location of the restaurant. 0 = west of 5th Avenue, 1 = east of 5th Avenue.
For each category the scales are as follows:
0 - 9: poor to fair
10 - 15: fair to good
16 - 19: good to very good
20 - 25: very good to excellent
25 - 30: extraordinary to perfection
    library(dplyr)
    library(ggplot2)

    location_labs <- c("West", "East")
    names(location_labs) <- c(0, 1)

    ggplot(nyc, mapping = aes(x = price, group = east, fill = east)) +
      geom_boxplot(alpha = 0.5) +
      facet_grid(east ~ ., labeller = labeller(east = location_labs)) +
      labs(
        title = "Is food more expensive east of 5th Avenue?",
        x = "Price (US$)"
      ) +
      guides(fill = "none") +
      theme_minimal() +
      theme(axis.text.y = element_blank())
Marathon times of runners in the Men and Women divisions of the New York City Marathon, 1970 - 2023.
nyc_marathon
A data frame with 108 observations on the following 7 variables.
Year of marathon.
Name of winner.
Country of winner.
Running time (HH:MM:SS).
Running time (in hours).
Division: Men or Women.
Note about the race or the winning time.
Wikipedia, List of winners of the New York City Marathon. Retrieved 6 November, 2023.
    library(ggplot2)
    ggplot(nyc_marathon, aes(x = year, y = time_hrs, color = division, shape = division)) +
      geom_point()
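The dataset stores the running time both as HH:MM:SS and in hours. One way the second form could be derived from the first is sketched below; `hms_to_hours` is a hypothetical helper written for this example, and the input times are made up.

```r
# Hypothetical helper (not part of openintro): convert "HH:MM:SS" strings
# to decimal hours.
hms_to_hours <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] + p[2] / 60 + p[3] / 3600
  }, numeric(1))
}

# Made-up times for illustration, not rows from nyc_marathon.
hours <- hms_to_hours(c("02:30:00", "02:07:30"))
hours
```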
On-time data for a random sample of flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
nycflights
A tbl_df with 32,735 rows and 16 variables:
Date of departure.
Departure and arrival times, local tz.
Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
Time of departure broken into hour and minutes.
Two letter carrier abbreviation. See airlines in the nycflights13 package for more information or google the airline code.
Plane tail number.
Flight number.
Origin and destination. See airports in the nycflights13 package for more information or google the airport code.
Amount of time spent in the air.
Distance flown.
Hadley Wickham (2014). nycflights13: Data about flights departing NYC in 2013. R package version 0.1.
    library(dplyr)

    # Longest departure delays
    nycflights |>
      select(flight, origin, dest, dep_delay, arr_delay) |>
      arrange(desc(dep_delay))

    # Longest arrival delays
    nycflights |>
      select(flight, origin, dest, dep_delay, arr_delay) |>
      arrange(desc(arr_delay))
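The format lists a departure time alongside that time broken into hour and minutes. The sketch below shows how such a split can be computed with integer arithmetic. It assumes the time is stored as a number in HHMM form (for example 517 for 5:17), which is an assumption about the encoding, and it uses made-up values.

```r
# Made-up departure times in HHMM form; treating the time as a number
# such as 517 for 5:17 is an assumption about the column's encoding.
dep_time <- c(517, 1342, 2359)

hour <- dep_time %/% 100   # integer division keeps the hour
minute <- dep_time %% 100  # remainder keeps the minutes
data.frame(dep_time, hour, minute)
```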
A 2010 survey asking a random sample of registered voters in California for their position on drilling for oil and natural gas off the coast of California.
offshore_drilling
A data frame with 827 observations on the following 2 variables.
a factor with levels do not know, oppose, support
a factor with levels no, yes
Survey USA, Election Poll #16804, data collected July 8-11, 2010.
offshore_drilling
A character string of full colors from IMSCOL[,1]
openintro_colors
A named character string with 9 elements: "blue", "green", "pink", "yellow", "red", "black", "gray", "lgray
    openintro_colors
    openintro_colors["blue"]
Uses full colors from IMSCOL
openintro_cols(...)
Argument | Description |
---|---|
... | Character names of openintro_colors |
    openintro_cols("blue")
    openintro_cols("red")
Not exported
openintro_pal(palette = "main", reverse = FALSE, ...)
Argument | Description |
---|---|
palette | Character name of palette in openintro_palettes |
reverse | Boolean indicating whether the palette should be reversed |
... | Additional arguments to pass to |
A list with OpenIntro color palettes
openintro_palettes
A list with 8 color palettes: main, two, three, four, five, cool, hot, gray
    openintro_palettes
    openintro_palettes$main
    openintro_palettes$three
    openintro_palettes$cool
    openintro_palettes$hot
Opportunity Insights (https://opportunityinsights.org/) is a research initiative with the goal of understanding upward mobility in the United States by studying barriers to economic opportunity and translating findings into policy recommendations. These data are a subset of an anonymized dataset, gathered in 2017, on all college students in the United States from 1999 - 2013 (30 million students), compiled to examine the association between the higher education system and upward mobility. The data include parental income distributions and student earnings outcomes by college. The data in this package do not include tiers 12 (less than two year schools of any type), 13 (students attending college with insufficient data), and 14 (students not in college between the ages of 19-22). Monetary values are measured in 2015 dollars; i.e. adjusted for inflation to 2015 dollars.
opp_insights_colleges
A dataframe with 2153 rows and 26 columns
super_opeid
Numeric, a college or university identifier constructed by the Opportunity Insights team based on tax records. It is similar but not identical to the U.S. Department of Education’s Office of Postsecondary Education ID (OPEID) and different from the ID in the Integrated Postsecondary Education Data System (IPEDS).
name
Character vector, college name
region
Factor, with levels 1 (Northeast), 2 (Midwest), 3 (South), 4 (West)
state
Character vector, two letter state ID
tier_name
Character vector, selectivity and type of college with 8 values, Ivy Plus, Other elite schools (private and public), Highly selective public, Highly selective private, Selective public, Selective private, Nonselective 4-year public, Nonselective 4-year private, Two-year (public and private not-for-profit), Four-year for-profit, Two-year for-profit
type
Factor with 3 levels, public, private non-profit, for-profit
exp_instr_pc_2013
Numeric, instructional expenditures per student in 2013
ipeds_enrollment_2013
Numeric, total undergraduate enrollment in Fall 2013
sticker_price_2013
Numeric, average annual cost of attendance in 2013
scorecard_netprice_2013
Numeric, net annual cost of attendance for bottom income quintile in 2013
grad_rate_150_p_2013
Numeric, percentage of students graduating within 150% of normal time in 2013
avgfacsal_2013
Numeric, average faculty salary in 2013
sat_avg_2013
Numeric, average SAT scores (scaled to 1600) in 2013
endowment_pc_2000
endowment assets per student in 2000
mr_kq5_pq1
Numeric, mobility rate, top 20% of the income distribution
mr_ktop1_pq1
Numeric, mobility rate, top 1% of the income distribution
par_median
Numeric, median parent household income
par_q1
Numeric, fraction of parents in first (bottom) income quintile
par_q2
Numeric, fraction of parents in second income quintile
par_q3
Numeric, fraction of parents in third income quintile
par_q4
Numeric, fraction of parents in fourth income quintile
par_q5
Numeric, fraction of parents in fifth income quintile
par_top5pc
Numeric, fraction of parents in top 5% of income distribution
par_top1pc
Numeric, fraction of parents in top 1% of income distribution
k_median
Numeric, median child individual earnings in 2014 (at age 34)
k_top5pc
Numeric, fraction of children in top 5% of income distribution
k_top1pc
Numeric, fraction of children in top 1% of income distribution
Tables mrc_table2.csv and mrc_table10.csv from https://opportunityinsights.org/data/
Chetty, Raj, et al. "Income segregation and intergenerational mobility across colleges in the United States." The Quarterly Journal of Economics 135.3 (2020): 1567-1633.
Data from opp_insights_colleges that is restricted to 4-year, not-for-profit colleges.
opp_insights_colleges_4year
A dataframe with 1285 rows and 26 variables
super_opeid
Numeric, a college or university identifier constructed by the Opportunity Insights team based on tax records. It is similar but not identical to the U.S. Department of Education’s Office of Postsecondary Education ID (OPEID) and different from the ID in the Integrated Postsecondary Education Data System (IPEDS).
name
Character vector, college name
region
Factor, with levels 1 (Northeast), 2 (Midwest), 3 (South), 4 (West)
state
Character vector, two letter state ID
tier_name
Character vector, selectivity and type of college with 8 values, Ivy Plus, Other elite schools (private and public), Highly selective public, Highly selective private, Selective public, Selective private, Nonselective 4-year public, Nonselective 4-year private, Two-year (public and private not-for-profit), Four-year for-profit, Two-year for-profit
type
Factor with 3 levels, public, private non-profit, for-profit
exp_instr_pc_2013
Numeric, instructional expenditures per student in 2013
ipeds_enrollment_2013
Numeric, total undergraduate enrollment in Fall 2013
sticker_price_2013
Numeric, average annual cost of attendance in 2013
scorecard_netprice_2013
Numeric, net annual cost of attendance for bottom income quintile in 2013
grad_rate_150_p_2013
Numeric, percentage of students graduating within 150% of normal time in 2013
avgfacsal_2013
Numeric, average faculty salary in 2013
sat_avg_2013
Numeric, average SAT scores (scaled to 1600) in 2013
endowment_pc_2000
endowment assets per student in 2000
mr_kq5_pq1
Numeric, mobility rate, top 20% of the income distribution
mr_ktop1_pq1
Numeric, mobility rate, top 1% of the income distribution
par_median
Numeric, median parent household income
par_q1
Numeric, fraction of parents in first (bottom) income quintile
par_q2
Numeric, fraction of parents in second income quintile
par_q3
Numeric, fraction of parents in third income quintile
par_q4
Numeric, fraction of parents in fourth income quintile
par_q5
Numeric, fraction of parents in fifth income quintile
par_top5pc
Numeric, fraction of parents in top 5% of income distribution
par_top1pc
Numeric, fraction of parents in top 1% of income distribution
k_median
Numeric, median child individual earnings in 2014 (at age 34)
k_top5pc
Numeric, fraction of children in top 5% of income distribution
k_top1pc
Numeric, fraction of children in top 1% of income distribution
Tables mrc_table2.csv and mrc_table10.csv from https://opportunityinsights.org/data/
Chetty, Raj, et al. "Income segregation and intergenerational mobility across colleges in the United States." The Quarterly Journal of Economics 135.3 (2020): 1567-1633.
In a study on opportunity cost, 150 students were given the following statement: "Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99. What would you do in this situation? Please circle one of the options below." Half of the students were given the following two options: (A) Buy this entertaining video. (B) Not buy this entertaining video. The other half were given the following two options (note the modified option B): (A) Buy this entertaining video. (B) Not buy this entertaining video. Keep the $14.99 for other purchases. The results of this study are in this dataset.
opportunity_cost
A data frame with 150 observations on the following 2 variables.
a factor with levels control and treatment
a factor with levels buy video and not buy video
Frederick S, Novemsky N, Wang J, Dhar R, Nowlis S. 2009. Opportunity Cost Neglect. Journal of Consumer Research 36: 553-561.
    library(ggplot2)

    table(opportunity_cost)

    ggplot(opportunity_cost, aes(y = group, fill = decision)) +
      geom_bar(position = "fill")
On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch. The table below summarizes observational data on O-rings for 23 shuttle missions, where the mission order is based on the temperature at the time of the launch.
orings
A data frame with 23 observations on the following 4 variables.
Shuttle mission number.
Temperature, in Fahrenheit.
Number of damaged O-rings (out of 6).
Number of undamaged O-rings (out of 6).
https://archive.ics.uci.edu/dataset/92/challenger+usa+space+shuttle+o+ring
    library(dplyr)
    library(forcats)
    library(tidyr)
    library(broom)

    # This is a wide data frame. You can convert it to a long
    # data frame to predict probability of O-ring damage based
    # on temperature using logistic regression.
    orings_long <- orings |>
      pivot_longer(cols = c(damaged, undamaged), names_to = "outcome", values_to = "n") |>
      uncount(n) |>
      mutate(outcome = fct_relevel(outcome, "undamaged", "damaged"))

    orings_mod <- glm(outcome ~ temperature, data = orings_long, family = "binomial")
    tidy(orings_mod)
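The example above reshapes the data to one row per O-ring before fitting. As a complementary sketch, a binomial `glm()` can also be fit directly to wide counts by passing a two-column matrix of successes and failures. The data frame below is made up in the same shape as orings (it is not the real mission data), and the comparison shows that the two approaches give matching coefficients.

```r
# Made-up wide data in the same shape as orings; not the real missions.
toy <- data.frame(
  temperature = c(53, 57, 63, 70, 75, 81),
  damaged     = c(5, 1, 1, 1, 0, 0),
  undamaged   = c(1, 5, 5, 5, 6, 6)
)

# Wide format: one row per mission, counts of damaged and undamaged rings.
mod_wide <- glm(
  cbind(damaged, undamaged) ~ temperature,
  data = toy, family = binomial
)

# Long format: one row per O-ring, with damage coded as 1.
toy_long <- data.frame(
  temperature = rep(rep(toy$temperature, 2), c(toy$damaged, toy$undamaged)),
  damage = rep(c(1, 0), c(sum(toy$damaged), sum(toy$undamaged)))
)
mod_long <- glm(damage ~ temperature, data = toy_long, family = binomial)

# The two fits agree up to numerical tolerance.
all.equal(coef(mod_wide), coef(mod_long), tolerance = 1e-4)
```

The wide form avoids the reshaping step, while the long form is convenient when each O-ring needs its own row for plotting or resampling.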
Best actor and actress Oscar winners from 1929 to 2018.
oscars
A data frame with 182 observations on the following 10 variables.
Oscar ceremony number.
Year the Oscar ceremony was held.
Best actress or Best actor.
Name of winning actor or actress.
Name of the movie for which the actor or actress won the Oscar.
Age at which the actor or actress won the Oscar.
US State where the actor or actress was born, country if foreign.
Birth date of actor or actress.
Birth month of actor or actress.
Birth day of actor or actress.
Birth year of actor or actress.
Although there have been only 84 Oscar ceremonies until 2012, there are 85 male winners and 85 female winners because ties happened on two occasions (1933 for the best actor and 1969 for the best actress).
Journal of Statistics Education, http://jse.amstat.org/datasets/oscars.dat.txt, updated through 2019 using information from Oscars.org and Wikipedia.org.
library(ggplot2)
library(dplyr)

ggplot(oscars, aes(x = award, y = age)) +
  geom_boxplot()

ggplot(oscars, aes(x = factor(birth_mo))) +
  geom_bar()

oscars |>
  count(birth_pl, sort = TRUE)
Data sets for showing different types of outliers
outliers
A data frame with 50 observations on the following 5 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
outliers
Compiled gold medal times for the 1500m race in the Olympic Games and the Paralympic Games. The times given for contestants competing in the Paralympic Games are for athletes with different visual impairments; T11 indicates fully blind (with an option to race with a guide-runner) with T12 and T13 as lower levels of visual impairment.
paralympic_1500
A data frame with 83 rows and 10 variables.
Year the games took place.
City of the games.
Country of the games.
Division: Men or Women.
Type.
Name of the athlete.
Country of athlete.
Time of gold medal race, in m:s.
Time of gold medal race, in decimal minutes (min + sec/60).
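The conversion described for the decimal-minute variable (min + sec/60) can be sketched as below. The helper name `to_decimal_minutes` and the assumption that times are stored as "m:ss.s" strings are illustrative only and are not taken from the package.

```r
# Hypothetical helper: convert "m:ss.s" strings to decimal minutes.
to_decimal_minutes <- function(time) {
  parts <- strsplit(time, ":", fixed = TRUE)
  vapply(
    parts,
    function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60,
    numeric(1)
  )
}

# Example inputs (made up for illustration).
to_decimal_minutes(c("3:30.0", "4:06.5"))
```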
https://www.paralympic.org/ and https://en.wikipedia.org/wiki/1500_metres_at_the_Olympics.
library(ggplot2)
library(dplyr)

paralympic_1500 |>
  mutate(
    sight_level = case_when(
      type == "T11" ~ "total impairment",
      type == "T12" ~ "some impairment",
      type == "T13" ~ "some impairment",
      type == "Olympic" ~ "no impairment"
    )
  ) |>
  filter(division == "Men", year > 1920) |>
  filter(type == "Olympic" | type == "T11") |>
  ggplot(aes(x = year, y = time_min, color = sight_level, shape = sight_level)) +
  geom_point() +
  scale_x_continuous(breaks = seq(1924, 2020, by = 8)) +
  labs(
    title = "Men's Olympic and Paralympic 1500m race times",
    x = "Year",
    y = "Time of Race (minutes)",
    color = "Sight level",
    shape = "Sight level"
  )
The data was collected by the Planet Money podcast to test a theory about crowd-sourcing. Penelope's actual weight was 1,355 pounds.
penelope
A data frame with 17,184 observations on the following variable.
Guesses of Penelope's weight, in pounds.
library(ggplot2)

ggplot(penelope, aes(x = weight)) +
  geom_histogram(binwidth = 250)

summary(penelope$weight)
The channel Project Farm on YouTube investigated penetrating oils and other options for loosening rusty bolts. Eight options were evaluated, including a control group, to determine which was most effective.
penetrating_oil
A data frame with 30 observations on the following 2 variables.
The different treatments tried: none (control), Heat (via blow torch), Acetone/ATF, AeroKroil, Liquid Wrench, PB Blaster, Royal Purple, and WD-40.
Torque required to loosen the rusty bolt, which was measured in foot-pounds.
https://www.youtube.com/watch?v=xUEob2oAKVs
m <- lm(torque ~ treatment, data = penetrating_oil)
anova(m)

# There are 28 pairwise comparisons to be made.
xbar <- tapply(penetrating_oil$torque, penetrating_oil$treatment, mean)
n <- tapply(penetrating_oil$torque, penetrating_oil$treatment, length)
s <- summary(m)$sigma
# Residual degrees of freedom: the second element of summary(m)$df
# (the first element is the number of estimated coefficients).
df <- summary(m)$df[2]
diff <- c()
se <- c()
k <- 0
N <- length(n)
K <- N * (N - 1) / 2
for (i in 1:(N - 1)) {
  for (j in (i + 1):N) {
    k <- k + 1
    diff[k] <- xbar[i] - xbar[j]
    se[k] <- s * sqrt(1 / n[i] + 1 / n[j])
    # Bonferroni-adjusted two-sided p-value for this pair.
    p_adj <- 2 * K * pt(-abs(diff[k] / se[k]), df)
    if (p_adj < 0.05) {
      cat("0.05 - ", names(n)[c(i, j)], "\n")
    } else if (p_adj < 0.1) {
      cat("0.1 - ", names(n)[c(i, j)], "\n")
    } else if (p_adj < 0.2) {
      cat("0.2 - ", names(n)[c(i, j)], "\n")
    } else if (p_adj < 0.3) {
      cat("0.3 - ", names(n)[c(i, j)], "\n")
    }
  }
}

# Smallest p-value using Bonferroni
min(2 * K * pt(-abs(diff / se), df))

# Better pairwise comparison method.
anova(m1 <- aov(torque ~ treatment, data = penetrating_oil))
TukeyHSD(m1)
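For comparison, base R's pairwise.t.test() performs the same kind of pooled-SD, Bonferroni-adjusted pairwise comparison in a single call. The sketch below uses simulated data with eight made-up groups (the names `sim`, `T1` to `T8`, and the simulated torque values are assumptions, not the Project Farm measurements).

```r
set.seed(1)

# Simulated stand-in data: 8 hypothetical treatments, 4 bolts each.
sim <- data.frame(
  treatment = factor(rep(paste0("T", 1:8), each = 4)),
  torque    = rnorm(32, mean = 100, sd = 10)
)

# Number of pairwise comparisons among 8 groups.
n_pairs <- choose(nlevels(sim$treatment), 2)

# Pooled-SD t-tests with Bonferroni adjustment.
res <- pairwise.t.test(
  sim$torque, sim$treatment,
  p.adjust.method = "bonferroni",
  pool.sd = TRUE
)

n_pairs
res$p.value
```

The returned p-value matrix is lower triangular, with one entry per pair of groups.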
Sample of pennies and their ages. Taken in 2004.
penny_ages
A data frame with 648 observations on the following 2 variables.
Penny's year.
Age as of 2004.
hist(penny_ages$year)
US-based survey on support for expanding six different sources of energy: solar, wind, offshore drilling, hydraulic fracturing ("fracking"), coal, and nuclear.
pew_energy_2018
The format is: List of 6
$ solar_panel_farms : List of responses on solar farms.
$ wind_turbine_farms : List of responses on wind turbine farms.
$ offshore_drilling : List of responses on offshore drilling.
$ hydrolic_fracturing : List of responses on hydraulic fracturing.
$ coal_mining : List of responses on coal mining.
$ nuclear_power_plants: List of responses on nuclear.
We did not have access to individual responses in the original dataset, so we took the published percentages and backed out the breakdown.
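One way to back out individual responses from published percentages is sketched below. The sample size, category names, and percentages are hypothetical placeholders rather than the Pew figures, and the rounding rule is an assumption about how such a reconstruction might be done.

```r
# Hypothetical inputs (not the published Pew numbers).
n <- 1000
published_pct <- c(favor = 62, oppose = 38)

# Convert percentages to whole-number counts.
counts <- round(n * published_pct / 100)

# Absorb any rounding drift in the first category so counts sum to n.
counts[1] <- counts[1] + (n - sum(counts))

# Expand the counts into one response per reconstructed respondent.
responses <- rep(names(counts), times = counts)

# The reconstructed proportions reproduce the published percentages.
table(responses) / length(responses)
```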
data(pew_energy_2018)
lapply(pew_energy_2018, head)
lapply(pew_energy_2018, length)
lapply(pew_energy_2018, table)

Prop <- function(x) {
  table(x) / length(x)
}
lapply(pew_energy_2018, Prop)
This is a simulated dataset for photo classifications based on a machine learning algorithm versus what the true classification is for those photos. While the data are not real, they resemble performance that would be reasonable to expect in a well-built classifier.
photo_classify
A data frame with 1822 observations on the following 2 variables.
The prediction by the machine learning system as to whether the photo is about fashion or not.
The actual classification of the photo by a team of humans.
The hypothetical ML algorithm has a precision of 90%, meaning that of the photos it claims are fashion, about 90% are actually about fashion. The recall of the ML algorithm is about 64%, meaning that of the photos that are about fashion, it correctly predicts that they are about fashion about 64% of the time.
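Precision and recall can be computed from a 2 x 2 table of predicted versus actual classes, as in the sketch below. The counts and the dimension names are illustrative assumptions chosen to give rates close to those quoted in this entry; they are not read from the dataset.

```r
# Illustrative confusion matrix (assumed counts): rows are predictions,
# columns are the true classes.
conf_mat <- matrix(
  c(197,   22,
    112, 1491),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    predicted = c("fashion", "not"),
    actual    = c("fashion", "not")
  )
)

true_pos <- conf_mat["fashion", "fashion"]

# Precision: of the photos predicted as fashion, the share that are fashion.
precision <- true_pos / sum(conf_mat["fashion", ])

# Recall: of the photos that are fashion, the share predicted as fashion.
recall <- true_pos / sum(conf_mat[, "fashion"])

round(c(precision = precision, recall = recall), 2)
```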
The data are simulated / hypothetical.
data(photo_classify)
table(photo_classify)
This dataset contains observations on all 100 US Senators and 434 of the 435 US Congressional Representatives related to their support of anti-piracy legislation that was introduced at the end of 2011.
piracy
A data frame with 534 observations on the following 8 variables.
Name of legislator.
Party affiliation as Democrat (D), Republican (R), or Independent (I).
Two-letter state abbreviation.
Amount of money in dollars contributed to the legislator's campaign in 2010 by groups generally thought to be supportive of PIPA/SOPA: movie and TV studios, record labels.
Amount of money in dollars contributed to the legislator's campaign in 2010 by groups generally thought to be opposed to PIPA/SOPA: computer and internet companies.
Number of years of service in Congress.
Degree of support for PIPA/SOPA with levels Leaning No, No, Undecided, Unknown, Yes.
Whether the legislator is a member of either the house or senate.
The Stop Online Piracy Act (SOPA) and the Protect Intellectual Property Act (PIPA) were two bills introduced in the US House of Representatives and the US Senate, respectively, to curtail copyright infringement. The bills were controversial because of concerns that they limited free speech rights. ProPublica, the independent and non-profit news organization, compiled this dataset to compare the stance of legislators towards the bills with the amount of campaign funds that they received from groups considered to be supportive of or in opposition to the legislation.
For more background on the legislation and the formulation of money_pro and money_con, read the documentation on ProPublica, linked below.
https://projects.propublica.org/sopa
The list may be slightly out of date since many politicians' perspectives on the legislation were in flux at the time of data collection.
library(dplyr)
library(ggplot2)

pipa <- filter(piracy, chamber == "senate")

pipa |>
  group_by(stance) |>
  summarise(money_pro_mean = mean(money_pro, na.rm = TRUE)) |>
  ggplot(aes(x = stance, y = money_pro_mean)) +
  geom_col() +
  labs(
    x = "Stance", y = "Average contribution, in $",
    title = "Average contribution to the legislator's campaign in 2010",
    subtitle = "by groups supportive of PIPA/SOPA (movie and TV studios, record labels)"
  )

ggplot(pipa, aes(x = stance, y = money_pro)) +
  geom_boxplot() +
  labs(
    x = "Stance", y = "Contribution, in $",
    title = "Contribution by groups supportive of PIPA/SOPA",
    subtitle = "Movie and TV studios, record labels"
  )

ggplot(pipa, aes(x = stance, y = money_con)) +
  geom_boxplot() +
  labs(
    x = "Stance", y = "Contribution, in $",
    title = "Contribution by groups opposed to PIPA/SOPA",
    subtitle = "Computer and internet companies"
  )

pipa |>
  filter(money_pro > 0, money_con > 0) |>
  # The documented stance levels are capitalized, so match "Yes".
  mutate(for_pipa = ifelse(stance == "Yes", "yes", "no")) |>
  ggplot(aes(x = money_pro, y = money_con, color = for_pipa)) +
  geom_point() +
  scale_color_manual(values = c("gray", "red")) +
  scale_y_log10() +
  scale_x_log10() +
  labs(
    x = "Contribution by pro-PIPA groups",
    y = "Contribution by anti-PIPA groups",
    color = "For PIPA"
  )
A table describing each of the 52 cards in a deck.
playing_cards
A data frame with 52 observations on the following 3 variables.
The number or card type.
Card suit, which takes one of four values: Club, Diamond, Heart, or Spade.
Whether the card counts as a face card.
This extremely complex dataset was generated from scratch.
playing_cards <- data.frame(
  number = rep(c(2:10, "J", "Q", "K", "A"), 4),
  suit = rep(c("Spade", "Diamond", "Club", "Heart"), rep(13, 4))
)
playing_cards$face_card <- ifelse(playing_cards$number %in% c(2:10, "A"), "no", "yes")
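Because the deck is built from scratch, it is a convenient table for basic probability calculations. The sketch below rebuilds the data frame so that it runs on its own and then computes the probability of drawing a face card, overall and within one suit; the variable names `p_face` and `p_face_given_heart` are introduced here for illustration.

```r
# Rebuild the deck exactly as in the example above.
playing_cards <- data.frame(
  number = rep(c(2:10, "J", "Q", "K", "A"), 4),
  suit = rep(c("Spade", "Diamond", "Club", "Heart"), rep(13, 4))
)
playing_cards$face_card <- ifelse(playing_cards$number %in% c(2:10, "A"), "no", "yes")

# Probability that a randomly drawn card is a face card (12 of 52).
p_face <- mean(playing_cards$face_card == "yes")

# Conditional probability of a face card given that the card is a heart.
hearts <- playing_cards[playing_cards$suit == "Heart", ]
p_face_given_heart <- mean(hearts$face_card == "yes")

# Counts of face and non-face cards by suit.
table(playing_cards$suit, playing_cards$face_card)
```

The two probabilities are equal, which illustrates that suit and face-card status are independent in a standard deck.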
Plot data and add a regression line.
PlotWLine(
  x,
  y,
  xlab = "",
  ylab = "",
  col = fadeColor(4, "88"),
  cex = 1.2,
  pch = 20,
  n = 4,
  nMax = 4,
  yR = 0.1,
  axes = TRUE,
  ...
)
x | Predictor variable. |
y | Outcome variable. |
xlab | x-axis label. |
ylab | y-axis label. |
col | Color of points. |
cex | Size of points. |
pch | Plotting character. |
n | The preferred number of axis labels. |
nMax | The maximum number of axis labels. |
yR | y-limit buffer factor. |
axes | Boolean to indicate whether or not to include axes. |
... | Passed to |
PlotWLine(1:10, seq(-5, -2, length.out = 10) + rnorm(10))
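For readers without the package installed, the same idea (a scatterplot with a fitted least-squares line) can be sketched in base R. This is only an approximation of what PlotWLine draws, not its implementation; the axis-label and buffer arguments are not reproduced.

```r
set.seed(1)

# Same toy data as in the example: a linear trend plus normal noise.
x <- 1:10
y <- seq(-5, -2, length.out = 10) + rnorm(10)

# Fit a simple linear regression.
fit <- lm(y ~ x)

# Scatterplot with the fitted line overlaid.
plot(x, y, pch = 20, cex = 1.2, xlab = "", ylab = "")
abline(fit)

coef(fit)
```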
Daily air quality is measured by the air quality index (AQI) reported by the Environmental Protection Agency in 2011.
pm25_2011_durham
A data frame with 449 observations on the following 20 variables.
Date.
The numeric site ID.
A numeric vector, the Parameter Occurrence Code.
A numeric vector with the average daily concentration of fine particulates, or particulate matter 2.5.
A character vector with value ug/m3 LC.
A numeric vector with the daily air quality index.
A numeric vector.
A numeric vector.
A numeric vector.
A factor with levels PM2.5 - Local Conditions and Acceptable PM2.5 AQI & Speciation Mass.
A numeric vector.
A character vector with value Durham, NC.
A numeric vector.
A character vector with value North Carolina.
A numeric vector.
A character vector with value Durham.
A numeric vector of the latitude.
A numeric vector of the longitude.
A numeric vector.
A factor with levels Raleigh-Durham-Cary, NC.
US Environmental Protection Agency, AirData, 2011. http://www3.epa.gov/airdata/ad_data_daily.html
library(ggplot2)

ggplot(pm25_2011_durham, aes(x = date, y = daily_mean_pm2_5_concentration, group = 1)) +
  geom_line()
Daily air quality is measured by the air quality index (AQI) reported by the Environmental Protection Agency in 2022.
pm25_2022_durham
A data frame with 356 observations on the following 20 variables.
Date.
The numeric site ID.
A numeric vector, the Parameter Occurrence Code.
A numeric vector with the average daily concentration of fine particulates, or particulate matter 2.5.
A character vector with value ug/m3 LC.
A numeric vector with the daily air quality index.
A numeric vector.
A numeric vector.
A numeric vector.
A factor vector with level PM2.5 - Local Conditions.
A numeric vector.
A character vector with value Durham-Chapel Hill, NC.
A numeric vector.
A character vector with value North Carolina.
A numeric vector.
A character vector with value Durham.
A numeric vector of the latitude.
A numeric vector of the longitude.
A character vector with value Durham Armory.
US Environmental Protection Agency, AirData, 2022. http://www3.epa.gov/airdata/ad_data_daily.html
library(ggplot2)

ggplot(pm25_2022_durham, aes(x = date, y = daily_mean_pm2_5_concentration, group = 1)) +
  geom_line()
Poker winnings (and losses) for 50 days by a professional poker player.
poker
A data frame with 49 observations on the following variable.
Poker winnings and losses, in US dollars.
Anonymity has been requested by the player.
library(ggplot2)

ggplot(poker, aes(x = winnings)) +
  geom_histogram(binwidth = 250)
Data representing possums in Australia and New Guinea. This is a copy of the dataset by the same name in the DAAG package; however, the dataset included here has fewer variables.
possum
A data frame with 104 observations on the following 8 variables.
The site number where the possum was trapped.
Population, either Vic (Victoria) or other (New South Wales or Queensland).
Gender, either m (male) or f (female).
Age.
Head length, in mm.
Skull width, in mm.
Total length, in cm.
Tail length, in cm.
Lindenmayer, D. B., Viggers, K. L., Cunningham, R. B., and Donnelly, C. F. 1995. Morphological variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia). Australian Journal of Zoology 43: 449-458.
library(ggplot2)

# Skull width vs. head length
ggplot(possum, aes(x = head_l, y = skull_w)) +
  geom_point()

# Total length vs. sex
ggplot(possum, aes(x = total_l, fill = sex)) +
  geom_density(alpha = 0.5)
A poll of 691 people, with party affiliation collected, asked whether they think it's better to raise taxes on the rich or raise taxes on the poor.
ppp_201503
A data frame with 691 observations on the following 2 variables.
Political party affiliation.
Support for who to raise taxes on.
Public Policy Polling, Americans on College Degrees, Classic Literature, the Seasons, and More, data collected Feb 20-22, 2015.
library(ggplot2)

ggplot(ppp_201503, aes(x = party, fill = taxes)) +
  geom_bar(position = "fill") +
  labs(x = "Party", y = "Proportion", fill = "Taxes")
An updated version of the historical Arbuthnot dataset. Numbers of boys and girls born in the United States between 1940 and 2002.
present
A data frame with 63 observations on the following 3 variables.
Year.
Number of boys born.
Number of girls born.
Mathews, T. J., and Brady E. Hamilton. "Trend analysis of the sex ratio at birth in the United States." National vital statistics reports 53.20 (2005): 1-17.
library(ggplot2)

ggplot(present, mapping = aes(x = year, y = boys / girls)) +
  geom_line()
Summary of the changes in the president and vice president for the United States of America.
president
A data frame with 67 observations on the following 5 variables.
President of the United States
Political party of the president
Start year
End year
Vice President of the United States
Presidents of the United States (table) – infoplease.com (visited: Nov 2nd, 2010)
https://www.infoplease.com/us/government/executive-branch/presidents and https://www.infoplease.com/us/government/executive-branch/vice-presidents
president
Data from the Prevention of REnal and Vascular END-stage Disease (PREVEND) study, which took place in the Netherlands. The study collected various demographic and cardiovascular risk factors. This dataset is from the third survey, which participants completed in 2003-2006; data are provided for 4,095 individuals who completed cognitive testing with RFFT.
prevend
A tibble with 4095 rows and 31 variables:
Casenr
case number, numeric
Age
Numeric, age in years, recorded at time of enrollment.
Gender
Numeric vector: 0 = males; 1 = females.
Ethnicity
Numeric vector: 0 = Western European; 1 = African; 2 = Asian; 3 = Other.
Education
Highest level of education. Numeric: 0 = primary school; 1 = lower secondary education; 3 = university.
RFFT
Numeric, performance on the Ruff Figural Fluency Test. Scores range from 0 (worst) to 175 (best).
VAT
Numeric, Visual Association Test score. The VAT is a learning task based on image recognition. Scores may range from 0 (worst) to 12 (best)
CVD
History of cardiovascular event. Numeric vector: 0 = No; 1 = Yes.
DM
Diabetes mellitus (Type 2 diabetes) status at enrollment. Numeric vector: 0 = No; 1 = Yes.
Smoking
Smoking status at enrollment. numeric vector: 0 = No; 1 = Yes.
Hypertension
status of hypertension at enrollment. Numeric vector: 0 = No; 1 = Yes.
BMI
Numeric, body mass index, weight divided by height-squared, in kg/m^2
SBP
Numeric, systolic blood pressure, in mmHg
DBP
Numeric, diastolic blood pressure, in mmHg
MAP
Numeric, mean arterial pressure, in mmHg
eGFR
Numeric, estimated glomerular filtration rate, a measure of kidney function. Low values indicate possible kidney damage, in mL/min.
Albuminuria.1
Albuminuria (mg/24hr) in two categories. Numeric vector: 0 = (< 30); 1 = (≥ 30).
Albuminuria.2
Albuminuria (mg/24hr) in three categories. Numeric: 0 = (0 to < 10), 1 = (10 to < 30); 3 = (≥ 30).
Chol
Numeric, total cholesterol, in mmol/L.
HDL
Numeric, HDL cholesterol, in mmol/L.
Statin
Statin use at enrollment. Numeric vector: 0 = No; 1 = Yes.
Solubility
Statin solubility. Numeric vector: 0 = lipophilic; 1 = hydrophilic; 2 = no statin use. NA indicates statin solubility is missing.
Days
Numeric, total duration of statin use, in days. -1 indicates participant did not use statins
Years
Numeric, total duration of statin use, in years. -1 indicates participant did not use statins.
DDD
Defined daily dose of the statin. Numeric vector: From the PLOS One paper, "DDD is defined by the WHO as the drug units representing dosages with approximately similar efficacy. One DDD corresponds to the following dosage for each statin respectively: Simvastatin 30 mg, Pravastatin 30 mg, Fluvastatin 60 mg, Atorvastatin 20 mg and Rosuvastatin 10 mg."
FRS
Framingham risk score. Numeric vector. The score is a measure of risk for a cardiovascular event within 10 years. Higher values imply increased risk. For details see D’Agostino RBS, Vasan RS, Pencina MJ, Wolf PA, Cobain M, et al. (2008) General cardiovascular risk profile for use in primary care: The Framingham Heart Study. Circulation 117: 743–753.
PS
Propensity score of statin use. Numeric vector. See the PLOS One paper for the model used to calculate the score
PSquint
Quintile of PS. Numeric vector.
GRS
Indicator for random sample of 1638 Groningen residents in the study. Numeric vector.
Match_1
Numeric, statin users and non-users matched 1:1 on age and educational level. Matched pairs share a common integer label. -1 indicates participant not matched.
Match_2
Numeric, statin users and non-users matched 1:1 on Framingham risk score. Matched pairs share a common integer label. -1 indicates participant not matched
http://doi.org/10.5061/dryad.6qs53
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115755
Random sample of size 500 from the 4,095 cases in the prevend dataset with all 31 variables.
prevend.samp
A tibble with 500 rows and 31 variables:
Casenr
case number, numeric
Age
Numeric, age in years, recorded at time of enrollment.
Gender
Numeric vector: 0 = males; 1 = females.
Ethnicity
Numeric vector: 0 = Western European; 1 = African; 2 = Asian; 3 = Other.
Education
Highest level of education. Numeric: 0 = primary school; 1 = lower secondary education; 3 = university.
RFFT
Numeric, performance on the Ruff Figural Fluency Test. Scores range from 0 (worst) to 175 (best).
VAT
Numeric, Visual Association Test score. The VAT is a learning task based on image recognition. Scores may range from 0 (worst) to 12 (best)
CVD
History of cardiovascular event. Numeric vector: 0 = No; 1 = Yes.
DM
Diabetes mellitus (Type 2 diabetes) status at enrollment. Numeric vector: 0 = No; 1 = Yes.
Smoking
Smoking status at enrollment. numeric vector: 0 = No; 1 = Yes.
Hypertension
status of hypertension at enrollment. Numeric vector: 0 = No; 1 = Yes.
BMI
Numeric, body mass index, weight divided by height-squared, in kg/m^2
SBP
Numeric, systolic blood pressure, in mmHg
DBP
Numeric, diastolic blood pressure, in mmHg
MAP
Numeric, mean arterial pressure, in mmHg
eGFR
Numeric, estimated glomerular filtration rate, a measure of kidney function. Low values indicate possible kidney damage, in mL/min.
Albuminuria.1
Albuminuria (mg/24hr) in two categories. Numeric vector: 0 = (< 30); 1 = (≥ 30).
Albuminuria.2
Albuminuria (mg/24hr) in three categories. Numeric: 0 = (0 to < 10), 1 = (10 to < 30); 3 = (≥ 30).
Chol
Numeric, total cholesterol, in mmol/L.
HDL
Numeric, HDL cholesterol, in mmol/L.
Statin
Statin use at enrollment. Numeric vector: 0 = No; 1 = Yes.
Solubility
Statin solubility. Numeric vector: 0 = lipophilic; 1 = hydrophilic; 2 = no statin use. NA indicates statin solubility is missing.
Days
Numeric, total duration of statin use, in days. -1 indicates participant did not use statins
Years
Numeric, total duration of statin use, in years. -1 indicates participant did not use statins.
DDD
Defined daily dose of the statin. Numeric vector: From the PLOS One paper, "DDD is defined by the WHO as the drug units representing dosages with approximately similar efficacy. One DDD corresponds to the following dosage for each statin respectively: Simvastatin 30 mg, Pravastatin 30 mg, Fluvastatin 60 mg, Atorvastatin 20 mg and Rosuvastatin 10 mg."
FRS
Framingham risk score. Numeric vector. The score is a measure of risk for a cardiovascular event within 10 years. Higher values imply increased risk. For details see D’Agostino RBS, Vasan RS, Pencina MJ, Wolf PA, Cobain M, et al. (2008) General cardiovascular risk profile for use in primary care: The Framingham Heart Study. Circulation 117: 743–753.
PS
Propensity score of statin use. Numeric vector. See the PLOS One paper for the model used to calculate the score
PSquint
Quintile of PS. Numeric vector.
GRS
Indicator for random sample of 1638 Groningen residents in the study. Numeric vector.
Match_1
Numeric, statin users and non-users matched 1:1 on age and educational level. Matched pairs share a common integer label. -1 indicates participant not matched.
Match_2
Numeric, statin users and non-users matched 1:1 on Framingham risk score. Matched pairs share a common integer label. -1 indicates participant not matched
http://doi.org/10.5061/dryad.6qs53
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115755
Subjects from Central Prison in Raleigh, NC, volunteered for an experiment involving an "isolation" experience. The goal of the experiment was to find a treatment that reduces subjects' psychopathic deviant T scores. This score measures a person's need for control or their rebellion against control, and it is part of a commonly used mental health test called the Minnesota Multiphasic Personality Inventory (MMPI) test.
prison
A data frame with 14 observations on the following 6 variables.
Pre-treatment 1.
Post-treatment 1.
Pre-treatment 2.
Post-treatment 2.
Pre-treatment 3.
Post-treatment 3.
https://stat.duke.edu/datasets/prison-isolation
prison
Fueleconomy.gov, the official US government source for fuel economy information, allows users to share gas mileage information on their vehicles. These data come from 19 users sharing gas mileage on their 2017 Toyota Prius Prime. Note that these data are user estimates, and since the source data cannot be verified, the accuracy of these estimates is not guaranteed.
prius_mpg
A data frame with 19 observations on the following 5 variables.
Average mileage as estimated by the user.
US State the user lives in.
Proportion of stop and go driving.
Proportion of highway driving.
Date estimate was last updated.
Fueleconomy.gov, https://www.fueleconomy.gov/mpg/MPG.do?action=mpgData&vehicleID=38531&browser=true&details=on, retrieved 2019-04-14.
library(ggplot2)
library(dplyr)

ggplot(prius_mpg, aes(x = average_mpg)) +
  geom_histogram(binwidth = 25)
Create a 3 x 3 grid of quantile-quantile plots, the first of which corresponds to the input data. The other eight plots arise from simulating random normal data with the same mean, standard deviation, and length as the data. For use in comparing known-normal qqplots to an observed qqplot to assess normality.
qqnormsim(sample, data)
sample | The variable to be plotted. |
data | Data frame to use. |
A 3 x 3 grid of qqplots.
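The idea behind the grid can be sketched in base R as follows: draw the observed Q-Q plot first, then eight Q-Q plots of normal samples simulated with the same mean, standard deviation, and length. This is an illustrative approximation under those stated assumptions, not the package's implementation, and the example vector `x` is simulated.

```r
set.seed(42)

# Stand-in for an observed variable (right-skewed, so clearly non-normal).
x <- rexp(50)

# Eight normal samples matching the mean, SD, and length of x.
sims <- replicate(
  8,
  rnorm(length(x), mean = mean(x), sd = sd(x)),
  simplify = FALSE
)

# 3 x 3 grid: observed data first, then the simulated samples.
old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
qqnorm(x, main = "data")
qqline(x)
for (i in seq_along(sims)) {
  qqnorm(sims[[i]], main = paste("sim", i))
  qqline(sims[[i]])
}
par(old_par)
```

If the observed panel looks no more curved than the simulated ones, the normal model is plausible for the data.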
Results from a Yahoo! News poll conducted by YouGov on May 29-31, 2020. In total 1060 U.S. adults were asked a series of questions regarding race and justice in the wake of the killing of George Floyd by a police officer. Results in this dataset are percentages for the question, "Do you think Blacks and Whites receive equal treatment from the police?" For this particular question there were 1059 respondents.
race_justice
A data frame with 1,059 rows and 2 variables.
Race/ethnicity of respondent, with levels White, Black, Hispanic, and Other.
Response to the question "Do you think Black and White people receive equal treatment from the police?", with levels Yes, No, and Not sure.
Yahoo! News Race and Justice - May 31, 2020.
library(ggplot2)
library(dplyr)

# Conditional probabilities of response for each race/ethnicity
race_justice |>
  count(race_eth, response) |>
  group_by(race_eth) |>
  mutate(prop = n / sum(n))

# Stacked bar plot of counts
ggplot(race_justice, aes(x = race_eth, fill = response)) +
  geom_bar() +
  labs(
    x = "Race / ethnicity",
    y = "Count",
    title = "Do you think Black and White people receive equal treatment from the police?",
    fill = "Response"
  )

# Stacked bar plot of proportions
ggplot(race_justice, aes(x = race_eth, fill = response)) +
  geom_bar(position = "fill") +
  labs(
    x = "Race / ethnicity",
    y = "Proportion",
    title = "Do you think Black and White people receive equal treatment from the police?",
    fill = "Response"
  )
A reduced set of the official results of the 2020 FI Survey from Reddit (r/financialindependence). Only responses that represent the respondent (not other contributors in the household) are listed. Retired individuals are not included. As instructed, respondents gave dollar values in their native currency.
reddit_finance
A data frame with 1998 rows and 65 variables.
How many individuals contribute to your household income?
As a result of the pandemic, did your earned income increase, decrease, or remain the same?
By how much did your earned income change?
As a result of the pandemic, did your expenses increase, decrease, or remain the same?
By how much did your expenses change?
As a result of the pandemic, did your FI (financially independent) number...
As a result of the pandemic, did your planned RE (retirement) date...
Overall, how would you characterize the pandemic's impact on your finances?
With which political party do you most closely identify? You do not need to be registered with a party to select it, answer based on your personal views.
What is your race/ethnicity? Select all that apply.
What is your gender?
What is your age?
What is the highest level of education you have completed?
What is your relationship status?
Do you have children?
What country are you in?
Are you financially independent? Meaning you do not need to work for money, regardless of whether you work for money.
At what amount invested will you consider yourself Financially Independent? (What is your FI number?)
What percent FI are you? (What percent of your FI number do you currently have?)
At what amount invested do you intend to retire? (What is your RE number)
What is your target safe withdrawal rate? (If your answer is 3.5%, enter it as 3.5)
How much annual income do you expect to have from the sources you selected in question T5 at the point where you are utilizing all of them (or a majority if you do not intend to use all at the same time)? Enter your answer as a dollar amount.
How much money (from your savings and other sources) do you intend to spend each year once you are retired? Enter your answer as a dollar amount.
At what amount invested did you consider yourself Financially Independent? (AKA what was your "FI number")
Which of the following would you have considered yourself at the time you reached Financial Independence:
At what age do you intend to retire?
Do you intend to stop working for money when you reach financial independence?
Which of the following best describes the industry in which you currently or most recently work(ed)?
Which of the following best describes your current or most recent employer?
Which of the following best describes your current or most recent job role?
What is your current employment status? - Full Time
What is your current employment status? - Part Time, Regular
What is your current employment status? - Side Gig, Intermittent
What is your current employment status? - Not Employed
What is your current educational status?
What is your current housing situation?
Primary residence value.
Brokerage accounts (Taxable).
Retirement accounts (Tax Advantaged).
Cash / cash equivalents (Savings, Checking, C.D.s, Money Market).
Dedicated Savings/Investment Accounts (Healthcare, Education).
Speculation (Crypto, P2P Lending, Gold, etc.).
Investment properties / owned business(es).
Other assets.
Outstanding student loans.
Outstanding mortgage / HELOC.
Outstanding auto loans.
Outstanding credit cards / personal loans.
Outstanding medical debt.
Debt from investment properties / owned business.
Debt from other sources.
What was your 2020 gross (pre-tax, pre-deductions) annual household income?
Housing expenses (rent, mortgage, insurance, taxes, upkeep).
Utilities expenses (phone, internet, gas, electric, water, sewer).
Transportation expenses (car payment, bus / subway tickets, gas, insurance, maintenance).
Necessities expenses (Groceries, Clothing, Personal Care, Household Supplies).
Luxury expenses (Restaurants/Dining, Entertainment, Hobbies, Travel, Pets, Gifts).
Children expenses (child care, soccer team, etc.).
Debt repayment (excluding mortgage/auto).
Investments / savings.
Charity / Tithing.
Healthcare expenses (direct costs, co-pays, insurance you pay).
Taxes (the sum of all taxes paid, including amounts deducted from paychecks).
Education expenses.
Other expenses.
Reddit Official 2020 FI Survey Results, https://www.reddit.com/r/financialindependence/comments/m1q8ia/official_2020_fi_survey_results.
library(ggplot2)

# Bar plot of expected retirement age.
ggplot(reddit_finance, aes(retire_age)) +
  geom_bar(na.rm = TRUE) +
  labs(
    title = "At what age do you expect to retire?",
    x = "Age Bracket",
    y = "Number of Respondents"
  )

# Histogram of the dollar amount at which FI was reached.
ggplot(reddit_finance, aes(whn_fin_indy_num)) +
  geom_histogram(na.rm = TRUE, bins = 20) +
  labs(
    title = "At what amount invested did you consider\nyourself Financially Independent?",
    x = "Dollar Amount (in local currency)",
    y = "Number of Respondents"
  )
Simulated data for regression
res_demo_1
A data frame with 100 observations on the following 3 variables.
a numeric vector
a numeric vector
a numeric vector
res_demo_1
Simulated data for regression
res_demo_2
A data frame with 300 observations on the following 3 variables.
a numeric vector
a numeric vector
a numeric vector
res_demo_2
These experimental data come from a study that sought to understand the influence of race and gender on job application callback rates. The study monitored job postings in Boston and Chicago for several months during 2001 and 2002 and used this to build up a set of test cases. Over this time period, the researchers randomly generated the details of each resume sent to a job posting, such as years of experience and education details, to create a realistic-looking resume. They then randomly assigned a name to the resume that would communicate the applicant's gender and race. The first names chosen for the study were selected so that the names would predominantly be recognized as belonging to black or white individuals. For example, Lakisha was a name that their survey indicated would be interpreted as belonging to a black woman, while Greg was a name that would generally be interpreted as belonging to a white man.
resume
A data frame with 4870 observations, representing 4870 resumes, over 30 different variables that describe the job details, the outcome (received_callback), and attributes of the resume.
Unique ID associated with the advertisement.
City where the job was located.
Industry of the job.
Type of role.
Indicator for if the employer is a federal contractor.
Indicator for if the employer is an Equal Opportunity Employer.
The type of company, e.g. a nonprofit or a private company.
Indicator for if any job requirements are listed. If so, the other job_req_* fields give more detail.
Indicator for if communication skills are required.
Indicator for if some level of education is required.
Amount of experience required.
Indicator for if computer skills are required.
Indicator for if organization skills are required.
Level of education required.
Indicator for if there was a callback from the job posting for the person listed on this resume.
The first name used on the resume.
Inferred race associated with the first name on the resume.
Inferred gender associated with the first name on the resume.
Years of college education listed on the resume.
Indicator for if the resume listed a college degree.
Indicator for if the resume listed that the candidate has been awarded some honors.
Indicator for if the resume listed working while in school.
Years of experience listed on the resume.
Indicator for if computer skills were listed on the resume. These skills were adapted for listings, though the skills were assigned independently of other details on the resume.
Indicator for if any special skills were listed on the resume.
Indicator for if volunteering was listed on the resume.
Indicator for if military experience was listed on the resume.
Indicator for if there were holes in the person's employment history.
Indicator for if the resume lists an email address.
Each resume was generally classified as either lower or higher quality.
Because this is an experiment, where the race and gender attributes are being randomly assigned to the resumes, we can conclude that any statistically significant difference in callback rates is causally linked to these attributes.
Do you think it's reasonable to make a causal conclusion? You may have some healthy skepticism. However, do take care to appreciate that this was an experiment: the first name (and so the inferred race and gender) was randomly assigned to each resume, and the quality and attributes of a resume were assigned independently of the race and gender. This means that any effects we observe are in fact causal, and the effects related to race are both statistically significant and very large: white applicants had about a 50\
Do you still have doubts lingering in the back of your mind about the validity of this study? Maybe a counterargument about why the standard conclusions from this study may not apply? The article summarizing the results was exceptionally well-written, and it addresses many potential concerns about the study's approach. So if you're feeling skeptical about the conclusions, please find the link below and explore!
Bertrand M, Mullainathan S. 2004. "Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination". The American Economic Review 94:4 (991-1013). doi:10.3386/w9873.
head(resume, 5)

# Some checks to confirm balance between race and
# other attributes of a resume. There should be
# some minor differences due to randomness, but
# each variable should be (and is) generally
# well-balanced.
table(resume$race, resume$years_college)
table(resume$race, resume$college_degree)
table(resume$race, resume$honors)
table(resume$race, resume$worked_during_school)
table(resume$race, resume$years_experience)
table(resume$race, resume$computer_skills)
table(resume$race, resume$special_skills)
table(resume$race, resume$volunteer)
table(resume$race, resume$military)
table(resume$race, resume$employment_holes)
table(resume$race, resume$has_email_address)
table(resume$race, resume$resume_quality)

# Regarding the callback outcome for race,
# we observe a very large difference.
tapply(
  resume$received_callback,
  resume[c("race", "gender")],
  mean
)

# Natural question: is this statistically significant?
# A proper analysis would take into account the
# paired nature of the data. For each ad, let's
# compute the following statistic:
#   <callback rate for white candidates>
#   - <callback rate for black candidates>
# First construct the callbacks for white and
# black candidates by ad ID:
table(resume$race)
cb_white <- with(
  subset(resume, race == "white"),
  tapply(received_callback, job_ad_id, mean)
)
cb_black <- with(
  subset(resume, race == "black"),
  tapply(received_callback, job_ad_id, mean)
)
# Next, compute the differences, where the
# names(cb_white) part ensures we matched up the
# job ad IDs.
diff <- cb_white - cb_black[names(cb_white)]
# Finally, we can apply a t-test on the differences:
t.test(diff)
# There is very strong evidence of an effect.

# Here's a similar check with gender. There are
# more female-inferred candidates used on the resumes.
table(resume$gender)
cb_male <- with(
  subset(resume, gender == "m"),
  tapply(received_callback, job_ad_id, mean)
)
cb_female <- with(
  subset(resume, gender == "f"),
  tapply(received_callback, job_ad_id, mean)
)
diff <- cb_female - cb_male[names(cb_female)]
# The `na.rm = TRUE` part ensures we limit to jobs
# where both a male and female resume were sent.
t.test(diff, na.rm = TRUE)
# There is no statistically significant difference.

# Was that the best analysis? Absolutely not!
# However, the analysis was unbiased. To get more
# precision on the estimates, we could build a
# multivariate model that includes many characteristics
# of the resumes sent, e.g. years of experience.
# Since those other characteristics were assigned
# independently of the race characteristics, this
# means the race finding will almost certainly
# hold. However, it is possible that we'll find
# more interesting results with the gender investigation.
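The closing comments above suggest a multivariate model for more precision. A minimal sketch of that idea follows. It uses simulated data rather than the resume dataset so that it runs on its own; the sample size, the data-generating coefficients, and the object names (sim, fit_race, fit_both) are all made-up assumptions, not estimates from the study.

```r
# Simulated stand-in for the resume data (all values hypothetical).
set.seed(1)
n <- 5000
sim <- data.frame(
  # Randomly assigned label, independent of every other attribute.
  race = sample(c("black", "white"), n, replace = TRUE),
  years_experience = sample(1:20, n, replace = TRUE)
)

# Assumed data-generating model on the log-odds scale.
log_odds <- -2.8 + 0.4 * (sim$race == "white") + 0.03 * sim$years_experience
sim$received_callback <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression with and without the extra covariate.
fit_race <- glm(received_callback ~ race, data = sim, family = binomial)
fit_both <- glm(received_callback ~ race + years_experience,
  data = sim, family = binomial
)

# Because race was assigned independently of experience, the race
# coefficient should be similar in both fits.
race_coefs <- c(
  race_only = unname(coef(fit_race)["racewhite"]),
  adjusted = unname(coef(fit_both)["racewhite"])
)
race_coefs
```

With the real data, the same two glm() calls could be pointed at resume instead of sim, adding whichever resume attributes are of interest.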
Public health has improved and evolved, but has the public's knowledge changed with it? This dataset explores sample responses for two survey questions posed by Hans Rosling during lectures to a wide array of well-educated audiences.
rosling_responses
A data frame with 278 rows and 3 variables:
ID for the question being posed.
Noting whether the response was correct or incorrect.
The probability the person would have guessed the answer correctly if they were guessing completely randomly.
The samples we describe are plausible based on the exact rates observed in larger samples. For more info on the actual rates observed, visit https://www.gapminder.org.
Another relevant reference is a book by Hans Rosling, Anna Rosling Ronnlund, and Ola Rosling called Factfulness.
frac_correct <- tapply(
  rosling_responses$response == "correct",
  rosling_responses$question,
  mean
)
frac_correct
n <- table(rosling_responses$question)
n
expected <- tapply(
  rosling_responses$prob_random_correct,
  rosling_responses$question,
  mean
)

# Construct confidence intervals.
se <- sqrt(frac_correct * (1 - frac_correct) / n)
# Lower bounds.
frac_correct - 1.96 * se
# Upper bounds.
frac_correct + 1.96 * se

# Construct Z-scores and p-values.
z <- (frac_correct - expected) / se
pt(z, df = n - 1)
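As a complement to the normal-approximation calculation above, the comparison against random guessing can also be sketched with an exact binomial test. The counts below are hypothetical and are not taken from rosling_responses; the guessing probability of 1/3 assumes a three-option question.

```r
# Hypothetical counts for one three-option question (not from the dataset).
n_asked <- 100
n_correct <- 20
p_random <- 1 / 3 # chance of a correct answer under pure guessing

# Exact one-sided test: is the success rate below the guessing rate?
res <- binom.test(n_correct, n_asked, p = p_random, alternative = "less")
res$estimate
res$p.value
```

A small p-value here would suggest respondents did worse than random guessing, which is the pattern the dataset is meant to explore.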
Survey of Russian citizens on whether they believed their government tried to influence the 2016 US election. The survey was taken in Spring 2018 by Pew Research.
russian_influence_on_us_election_2016
A data frame with 506 observations on the following variable.
Response of the Russian survey participant to the question of whether their government tried to influence the 2016 election in the United States.
The actual sample size was 1000. However, the original data were not from a simple random sample; after accounting for the design, the equivalent sample size was 506, which is the size used for the dataset here to keep things simpler for intro stat analyses.
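To see what the smaller equivalent sample size implies, the following sketch computes the design effect (actual size divided by equivalent size) and compares margins of error under the two sample sizes. The proportion of 0.5 is a worst-case placeholder rather than a value from the survey, and the helper margin_of_error() is defined here only for illustration.

```r
n_actual <- 1000 # respondents actually surveyed
n_effective <- 506 # equivalent simple-random-sample size

# Design effect: how much the sampling design inflates variance
# relative to a simple random sample of the same size.
design_effect <- n_actual / n_effective

# Margin of error for a proportion at 95% confidence.
# p = 0.5 is a worst-case placeholder, not a value from the survey.
margin_of_error <- function(p, n) 1.96 * sqrt(p * (1 - p) / n)
moe_naive <- margin_of_error(0.5, n_actual)
moe_design <- margin_of_error(0.5, n_effective)

round(c(design_effect = design_effect, naive = moe_naive, design = moe_design), 3)
```

Treating the 506 rows as a simple random sample therefore gives wider, more honest intervals than treating all 1000 original responses that way.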
table(russian_influence_on_us_election_2016)
Yearly data for South Africa on GDP, GNI, CO2 emissions, and business start-up costs.
sa_gdp_elec
A data frame with 16 rows and 7 variables.
Year data collected.
Access to electricity as a percentage of the population.
Cost of business startup procedures as a percent of GNI.
CO2 emission in kt (kiloton).
GDP per capita, PPP in constant 2017 international dollars.
GNI per capita, PPP in constant 2017 international dollars.
CO2 emissions in kg per 2017 PPP dollar of GDP.
library(ggplot2)

ggplot(sa_gdp_elec, aes(year, access_elec)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Year",
    y = "Percent of Population",
    title = "Access to Electricity in South Africa 2003 - 2018"
  )
Data collected at three different water masses in the Bimini Lagoon, Bahamas.
salinity
A data frame with 30 rows and 2 variables.
Location where measurements were taken.
Salinity value in parts per thousand.
Till, R. (1974) Statistical Methods for the Earth Scientist: An Introduction. London: Macmillan, 104.
library(ggplot2)
library(broom)

ggplot(salinity, aes(x = salinity_ppt)) +
  geom_dotplot() +
  facet_wrap(~site_number, ncol = 1)

tidy(aov(salinity_ppt ~ site_number, data = salinity))
Fake data for score improvements from students who took a course from an SAT score improvement company.
sat_improve
A data frame with 30 observations on the following variable.
a numeric vector
sat_improve
SAT and GPA data for 1000 students at an unnamed college.
satgpa
A data frame with 1000 observations on the following 6 variables.
Gender of the student.
Verbal SAT percentile.
Math SAT percentile.
Total of verbal and math SAT percentiles.
High school grade point average.
First year (college) grade point average.
Educational Testing Service originally collected the data.
https://chance.dartmouth.edu/course/Syllabi/Princeton96/ETSValidation.html
library(ggplot2)
library(broom)

# Verbal scores
ggplot(satgpa, aes(x = sat_v, fy_gpa)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Verbal SAT percentile",
    y = "First year (college) grade point average"
  )

mod <- lm(fy_gpa ~ sat_v, data = satgpa)
tidy(mod)

# Math scores
ggplot(satgpa, aes(x = sat_m, fy_gpa)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Math SAT percentile",
    y = "First year (college) grade point average"
  )

mod <- lm(fy_gpa ~ sat_m, data = satgpa)
tidy(mod)
Color scale constructor for OpenIntro IMS colors
scale_color_openintro(palette = "main", discrete = TRUE, reverse = FALSE, ...)
palette | Character name of palette in openintro_palettes |
discrete | Boolean indicating whether color aesthetic is discrete or not |
reverse | Boolean indicating whether the palette should be reversed |
... | Additional arguments passed to |
library(ggplot2)

# Categorical variable with three levels
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = rank, shape = rank
)) +
  geom_jitter(size = 2, alpha = 0.6) +
  scale_color_openintro("three")

# Categorical variable with two levels
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = language, shape = language
)) +
  geom_jitter(size = 2, alpha = 0.6) +
  scale_color_openintro("two")

# Continuous variable
# Generates a palette, but not recommended
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = score
)) +
  geom_jitter(size = 2, alpha = 0.8) +
  scale_color_openintro(discrete = FALSE)

# For continuous palettes
# use scale_color_gradient instead
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = score
)) +
  geom_jitter(size = 2) +
  scale_color_gradient(low = IMSCOL["blue", "full"], high = IMSCOL["blue", "f6"])

ggplot(evals, aes(
  x = bty_avg, y = score,
  color = cls_perc_eval
)) +
  geom_jitter(size = 2) +
  scale_color_gradient(low = COL["red", "full"], high = COL["red", "f8"])
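To illustrate what the palette and reverse arguments conceptually control, here is a small base R sketch of a palette generator. It is not the package's implementation: demo_palettes and demo_pal() are hypothetical names, and the hex codes are placeholders rather than the IMS colors.

```r
# Hypothetical palettes; these hex codes are placeholders, not the IMS colors.
demo_palettes <- list(
  two = c("#1B9E77", "#D95F02"),
  three = c("#1B9E77", "#D95F02", "#7570B3")
)

# Return a function that produces n colors from a named palette,
# optionally reversed, interpolating when n exceeds the palette size.
demo_pal <- function(palette = "three", reverse = FALSE) {
  cols <- demo_palettes[[palette]]
  if (reverse) cols <- rev(cols)
  grDevices::colorRampPalette(cols)
}

demo_pal("three")(3)
demo_pal("three", reverse = TRUE)(3)
demo_pal("two")(5)
```

A generator of this shape could feed a discrete scale directly or be sampled at many points for a continuous gradient, which is roughly the distinction the discrete argument draws.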
Fill scale constructor for OpenIntro IMS colors
scale_fill_openintro(palette = "main", discrete = TRUE, reverse = FALSE, ...)
palette | Character name of palette in openintro_palettes |
discrete | Boolean indicating whether color aesthetic is discrete or not |
reverse | Boolean indicating whether the palette should be reversed |
... | Additional arguments passed to |
library(ggplot2)
library(dplyr)

# Categorical variable with two levels
ggplot(evals, aes(x = ethnicity, fill = ethnicity)) +
  geom_bar() +
  scale_fill_openintro("two")

# Categorical variable with three levels
ggplot(evals, aes(x = rank, fill = rank)) +
  geom_bar() +
  scale_fill_openintro("three")

# Continuous variable with levels
# Generates a palette, but may not be the best palette
# in terms of color-blind and grayscale friendliness
ggplot(diamonds, aes(x = clarity, fill = clarity)) +
  geom_bar() +
  scale_fill_openintro()

# For continuous palettes
# use scale_color_gradient instead
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = score
)) +
  geom_jitter(size = 2) +
  scale_color_gradient(low = IMSCOL["blue", "full"], high = IMSCOL["blue", "f6"])

ggplot(evals, aes(
  x = bty_avg, y = score,
  color = cls_perc_eval
)) +
  geom_jitter(size = 2) +
  scale_color_gradient(low = IMSCOL["green", "full"], high = IMSCOL["green", "f6"])
On June 28, 2012 the U.S. Supreme Court upheld the much debated 2010 healthcare law, declaring it constitutional. A Gallup poll released the day after this decision indicates that 46% of 1,012 Americans agree with this decision.
scotus_healthcare
A data frame with 1012 observations on the following variable.
Response values reported are agree and other.
Gallup, Americans Issue Split Decision on Healthcare Ruling, retrieved 2012-06-28.
table(scotus_healthcare)
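The summary figures quoted in the description (46% of 1,012 respondents) are enough to sketch a 95% confidence interval for the proportion who agree. This sketch assumes the poll can be treated as a simple random sample, which is a simplification, and it uses only the two numbers stated above rather than the dataset itself.

```r
n <- 1012 # respondents, from the description
p_hat <- 0.46 # share who agree, from the description

# Standard error and 95% interval under the normal approximation.
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)
```

The same interval could be obtained from the data by replacing p_hat with the observed share of agree responses in scotus_healthcare.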
Names of registered pets in Seattle, WA, between 2003 and 2018, provided by the city's Open Data Portal.
seattlepets
A data frame with 52,519 rows and 7 variables:
Date the animal was registered with Seattle
Unique license number
Animal's name
Animal's species (dog, cat, goat, etc.)
Primary breed of the animal
Secondary breed if mixed
Zip code animal is registered in
These data come from Seattle's Open Data Portal, https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb
Study from the 1970s about whether sex influences hiring recommendations.
sex_discrimination
A data frame with 48 observations on the following 2 variables.
a factor with levels female and male
a factor with levels not promoted and promoted
Rosen B and Jerdee T. 1974. Influence of sex role stereotypes on personnel decisions. Journal of Applied Psychology 59(1):9-14.
library(ggplot2)

table(sex_discrimination)

ggplot(sex_discrimination, aes(y = sex, fill = decision)) +
  geom_bar(position = "fill")
A dataset on Delta variant COVID-19 cases in the UK. This dataset gives a great example of Simpson's Paradox. When aggregating results without regard to age group, the death rate for vaccinated individuals is higher, but the vaccinated group is a much higher-risk population. Once we break the data out by age group to compare populations with more similar risk, the vaccinated group tends to have the lower death rate within each age group; the aggregate reversal arises because many of the higher-risk patients had gotten vaccinated. The dataset was brought to OpenIntro's attention by Matthew T. Brenneman of Embry-Riddle Aeronautical University. Note: some totals in the original source differ as there were some cases that did not have ages associated with them.
simpsons_paradox_covid
A data frame with 286,166 rows and 3 variables:
Age of the person. Levels: under 50, 50 +.
Vaccination status of the person. Note: the vaccinated group includes those who were only partially vaccinated. Levels: vaccinated, unvaccinated.
Did the person die from the Delta variant? Levels: death and survived.
Public Health England: Technical briefing 20
library(dplyr)
library(scales)

# Calculate the mortality rate for all cases by vaccination status
simpsons_paradox_covid |>
  group_by(vaccine_status, outcome) |>
  summarize(count = n()) |>
  ungroup() |>
  group_by(vaccine_status) |>
  mutate(total = sum(count)) |>
  filter(outcome == "death") |>
  select(c(vaccine_status, count, total)) |>
  mutate(mortality_rate = label_percent(accuracy = 0.01)(round(count / total, 4))) |>
  select(-c(count, total))

# Calculate mortality rate by age group and vaccination status
simpsons_paradox_covid |>
  group_by(age_group, vaccine_status, outcome) |>
  summarize(count = n()) |>
  ungroup() |>
  group_by(age_group, vaccine_status) |>
  mutate(total = sum(count)) |>
  filter(outcome == "death") |>
  select(c(age_group, vaccine_status, count, total)) |>
  mutate(mortality_rate = label_percent(accuracy = 0.01)(round(count / total, 4))) |>
  select(-c(count, total))
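The mechanics of the reversal can be seen with a tiny base R sketch. The counts below are made up to exhibit the paradox and are not the Public Health England figures; only the level names mirror the dataset.

```r
# Made-up counts chosen to exhibit the paradox; not the real data.
cases <- data.frame(
  age_group = c("under 50", "under 50", "50 +", "50 +"),
  vaccine_status = c("unvaccinated", "vaccinated", "unvaccinated", "vaccinated"),
  deaths = c(20, 1, 100, 400),
  total = c(100000, 10000, 2000, 20000)
)

# Mortality rate within each age group: vaccinated is lower in both.
cases$rate <- cases$deaths / cases$total
cases

# Mortality rate after pooling the age groups: vaccinated is higher,
# because most vaccinated cases fall in the higher-risk age group.
pooled <- aggregate(cbind(deaths, total) ~ vaccine_status, data = cases, FUN = sum)
pooled$rate <- pooled$deaths / pooled$total
pooled
```

The direction flips purely because the two vaccination groups have very different age mixes, which is the same structure the real dataset displays.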
Data were simulated in R, and some of the simulations do not represent data from actual normal distributions.
simulated_dist
The format is a list of 6 elements:
$ d1: dataset of 100 observations.
$ d2: dataset of 50 observations.
$ d3: num dataset of 500 observations.
$ d4: dataset of 15 observations.
$ d5: num dataset of 25 observations.
$ d6: dataset of 50 observations.
data(simulated_dist)
lapply(simulated_dist, qqnorm)
Data were simulated using rnorm.
simulated_normal
The format is: List of 3
$ n40 : 40 observations from a standard normal distribution.
$ n100: 100 observations from a standard normal distribution.
$ n400: 400 observations from a standard normal distribution.
data(simulated_normal)
lapply(simulated_normal, qqnorm)
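A list with the same shape can be regenerated with rnorm, which is handy for seeing how much normal Q-Q plots vary from sample to sample. This is only a sketch: the seed is arbitrary and is not the one used to create the packaged data, and sim_normal is a hypothetical name.

```r
set.seed(2024) # arbitrary seed, not the one used for the package data
sizes <- c(n40 = 40, n100 = 100, n400 = 400)

# One standard normal sample per size; lapply keeps the element names.
sim_normal <- lapply(sizes, rnorm)
str(sim_normal)

# Normal Q-Q plots, one panel per sample size.
op <- par(mfrow = c(1, 3))
for (nm in names(sim_normal)) {
  qqnorm(sim_normal[[nm]], main = nm)
  qqline(sim_normal[[nm]])
}
par(op)
```

Rerunning with different seeds shows that the smallest sample tends to stray furthest from the reference line even though every sample is truly normal.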
Fake data.
simulated_scatter
A data frame with 500 observations on the following 3 variables.
Group, representing data for a specific plot.
x-value.
y-value.
library(ggplot2)

ggplot(simulated_scatter, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~group)
Researchers studying the effect of antibiotic treatment for acute sinusitis randomly assigned patients to one of two groups: treatment or control.
sinusitis
A data frame with 166 observations on the following 2 variables.
a factor with levels control and treatment
a factor with levels no and yes
J.M. Garbutt et al. Amoxicillin for Acute Rhinosinusitis: A Randomized Controlled Trial. In: JAMA: The Journal of the American Medical Association 307.7 (2012), pp. 685-692.
sinusitis
The National Sleep Foundation conducted a survey on the sleep habits of randomly sampled transportation workers and a control sample of non-transportation workers.
sleep_deprivation
A data frame with 1087 observations on the following 2 variables.
a factor with levels <6, 6-8, and >8
a factor with levels bus / taxi / limo drivers, control, pilots, train operators, and truck drivers
National Sleep Foundation, 2012 Sleep in America Poll: Transportation Workers' Sleep, 2012. https://www.sleepfoundation.org/professionals/sleep-americar-polls/2012-sleep-america-poll-transportation-workers-sleep
sleep_deprivation
A sample of 6,224 individuals from the year 1721 who were exposed to smallpox in Boston. Some of them had received a vaccine (inoculated) while others had not. Doctors at the time believed that inoculation, which involves exposing a person to the disease in a controlled form, could reduce the likelihood of death.
smallpox
A data frame with 6224 observations on the following 2 variables.
Whether the person died or lived.
Whether the person was inoculated.
Fenner F. 1988. Smallpox and Its Eradication (History of International Public Health, No. 6). Geneva: World Health Organization. ISBN 92-4-156110-6.
data(smallpox)
table(smallpox)
Survey data on smoking habits from the UK. The dataset can be used for analyzing the demographic characteristics of smokers and types of tobacco consumed.
smoking
A data frame with 1691 observations on the following 12 variables.
Gender with levels Female and Male.
Age.
Marital status with levels Divorced, Married, Separated, Single and Widowed.
Highest education level with levels A Levels, Degree, GCSE/CSE, GCSE/O Level, Higher/Sub Degree, No Qualification, ONC/BTEC and Other/Sub Degree.
Nationality with levels British, English, Irish, Scottish, Welsh, Other, Refused and Unknown.
Ethnicity with levels Asian, Black, Chinese, Mixed, White, Refused and Unknown.
Gross income with levels Under 2,600; 2,600 to 5,200; 5,200 to 10,400; 10,400 to 15,600; 15,600 to 20,800; 20,800 to 28,600; 28,600 to 36,400; Above 36,400; Refused; and Unknown.
Region with levels London, Midlands & East Anglia, Scotland, South East, South West, The North and Wales.
Smoking status with levels No and Yes.
Number of cigarettes smoked per day on weekends.
Number of cigarettes smoked per day on weekdays.
Type of cigarettes smoked with levels Packets, Hand-Rolled, Both/Mainly Packets and Both/Mainly Hand-Rolled.
National STEM Centre, Large Datasets from stats4schools, https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools.
library(ggplot2)

ggplot(smoking, aes(x = amt_weekends)) +
  geom_histogram(binwidth = 5)

ggplot(smoking, aes(x = amt_weekdays)) +
  geom_histogram(binwidth = 5)

ggplot(smoking, aes(x = gender, fill = smoke)) +
  geom_bar(position = "fill")

ggplot(smoking, aes(x = marital_status, fill = smoke)) +
  geom_bar(position = "fill")
Annual snowfall data for Paradise, Mt. Rainier National Park. To include a full winter season, snowfall is recorded from July 1 to June 30. Data from 1943-1946 are not available due to road closure during World War II. Records are also unavailable from 1948-1954.
snowfall
A data frame with 100 rows and 3 variables.
The year snowfall measurement began on July 1.
The year snowfall measurement ended on June 30.
Snowfall measured in inches.
library(ggplot2)

ggplot(snowfall, aes(x = total_snow)) +
  geom_histogram(binwidth = 50) +
  labs(
    title = "Annual Snowfall",
    subtitle = "Paradise, Mt. Rainier National Park",
    x = "Snowfall (in.)",
    y = "Number of Years",
    caption = "Source: National Parks Services"
  )

ggplot(snowfall, aes(x = year_start, y = total_snow, group = 1)) +
  geom_line() +
  labs(
    title = "Annual Snowfall",
    subtitle = "Paradise, Mt. Rainier National Park",
    y = "Snowfall (in.)",
    x = "Year",
    caption = "Source: National Parks Services"
  )
A randomly generated dataset of soda preference (cola or orange) based on location.
soda
A data frame with 60 observations on the following 2 variables.
Soda preference, cola or orange.
Is the person from the West coast or East coast?
library(dplyr)

soda |>
  count(location, drink)
The data provide the energy output for several months from two roof-top solar arrays in San Francisco. This city is known for having highly variable weather, so while these two arrays are only about 1 mile apart from each other, the Inner Sunset location tends to have more fog.
solar
A data frame with 284 observations on the following 3 variables. Each row represents a single day for one of the arrays.
Location for the array.
Date.
Number of kWh.
The Haight-Ashbury array is a 10.4 kWh array, while the Inner Sunset array is a 2.8 kWh array. The kWh unit represents kilowatt-hours, the unit of energy typically used for electricity bills. The cost per kWh in San Francisco was about $0.25 in 2016.
These data were provided by Larry Rosenfeld, a resident in San Francisco.
solar.is <- subset(solar, location == "Inner_Sunset")
solar.ha <- subset(solar, location == "Haight_Ashbury")
plot(solar.is$date, solar.is$kwh, type = "l", ylim = c(0, max(solar$kwh)))
lines(solar.ha$date, solar.ha$kwh, col = 4)
d <- merge(solar.ha, solar.is, by = "date")
plot(d$date, d$kwh.x / d$kwh.y, type = "l")
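The quoted price of about $0.25 per kWh allows daily energy output to be converted into an approximate dollar value. The sketch below uses made-up daily outputs; daily_kwh and price_per_kwh are hypothetical names, and the values are not rows from solar.

```r
# Hypothetical daily outputs in kWh; these are not rows from solar
daily_kwh <- c(12.4, 30.1, 25.7)

# Approximate 2016 San Francisco price per kWh, as quoted in the details above
price_per_kwh <- 0.25

# Approximate dollar value of the energy produced on each day
daily_value <- daily_kwh * price_per_kwh
daily_value

# Approximate value over the whole period
sum(daily_value)
```

With the packaged data, the same multiplication could be applied to the kwh column used in the example above.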
Child mortality data from UNICEF's State of the World's Children 2019 Statistical Tables.
sowc_child_mortality
A data frame with 195 rows and 19 variables.
Country or area name.
Under-5 mortality rate (deaths per 1,000 live births) in 1990.
Under-5 mortality rate (deaths per 1,000 live births) in 2000.
Under-5 mortality rate (deaths per 1,000 live births) in 2018.
Annual rate of reduction in under-5 mortality rate (%) 2000–2018.
Under-5 mortality rate male (deaths per 1,000 live births) 2018.
Under-5 mortality rate female (deaths per 1,000 live births) 2018.
Infant mortality rate (deaths per 1,000 live births) 1990.
Infant mortality rate (deaths per 1,000 live births) 2018.
Neonatal mortality rate (deaths per 1,000 live births) 1990.
Neonatal mortality rate (deaths per 1,000 live births) 2000.
Neonatal mortality rate (deaths per 1,000 live births) 2018.
Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 1990.
Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 2018.
Annual number of under-5 deaths (thousands) 2018.
Annual number of neonatal deaths (thousands) 2018.
Neonatal deaths as proportion of all under-5 deaths (%) 2018.
Number of deaths among children aged 5–14 (thousands) 2018.
United Nations Children's Emergency Fund (UNICEF)
library(dplyr)
library(ggplot2)

# List countries and areas where children aged 5 to 14 had a higher probability of dying in
# 2018 than in 1990
sowc_child_mortality |>
  mutate(decrease_prob_dying = prob_dying_age5to14_1990 - prob_dying_age5to14_2018) |>
  select(countries_and_areas, decrease_prob_dying) |>
  filter(decrease_prob_dying < 0) |>
  arrange(decrease_prob_dying)

# List countries and areas and their relative rank for neonatal mortality in 2018
sowc_child_mortality |>
  mutate(rank = round(rank(-neonatal_mortality_2018))) |>
  select(countries_and_areas, rank, neonatal_mortality_2018) |>
  arrange(rank)
Demographic data from UNICEF's State of the World's Children 2019 Statistical Tables.
sowc_demographics
A data frame with 202 rows and 18 variables.
Country or area name.
Population in 2018 in thousands.
Population under age 18 in 2018 in thousands.
Population under age 5 in 2018 in thousands.
Rate at which population is growing in 2018.
Rate at which population is estimated to grow in 2030.
Number of births in 2018 in thousands.
Number of live births per woman in 2018. A total fertility level of 2.1 is called replacement level and represents a level at which the population would remain the same size.
Life expectancy at birth in 1970.
Life expectancy at birth in 2000.
Life expectancy at birth in 2018.
The ratio of the not-working-age population to the working-age population of 15 - 64 years.
The ratio of the under 15 population to the working-age population of 15 - 64 years.
The ratio of the over 64 population to the working-age population of 15 - 64 years.
Percent of population living in urban areas.
Annual urban population growth rate from 2000 to 2018.
Estimated annual urban population growth rate from 2018 to 2030.
Net migration rate per 1000 population from 2015 to 2020.
United Nations Children's Emergency Fund (UNICEF)
library(dplyr)
library(ggplot2)

# List countries and areas' life expectancy, ordered by rank of life expectancy in 2018
sowc_demographics |>
  mutate(life_expectancy_change = life_expectancy_2018 - life_expectancy_1970) |>
  mutate(rank_life_expectancy = round(rank(-life_expectancy_2018), 0)) |>
  select(
    countries_and_areas, rank_life_expectancy, life_expectancy_2018,
    life_expectancy_change
  ) |>
  arrange(rank_life_expectancy)

# List countries and areas' migration rate and population, ordered by rank of migration rate
sowc_demographics |>
  mutate(rank = round(rank(migration_rate))) |>
  mutate(population_millions = total_pop_2018 / 1000) |>
  select(countries_and_areas, rank, migration_rate, population_millions) |>
  arrange(rank)

# Scatterplot of life expectancy in 1970 vs. 2018, with points sized by 2018 population
ggplot(sowc_demographics, aes(life_expectancy_1970, life_expectancy_2018, size = total_pop_2018)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Life Expectancy",
    subtitle = "1970 v. 2018",
    x = "Life Expectancy in 1970",
    y = "Life Expectancy in 2018",
    size = "2018 Total Population"
  )
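The three dependency ratios described in the format section share the working-age population (15 - 64 years) as their denominator, so the total ratio is the sum of the child and old-age ratios. The sketch below illustrates this with hypothetical population counts; all object names are made up, and whether the dataset stores these quantities as ratios or as percentages is not stated here.

```r
# Hypothetical population counts in thousands; not taken from sowc_demographics
pop_under_15 <- 3000
pop_15_to_64 <- 10000
pop_over_64 <- 1500

# Each ratio divides a not-working-age group by the working-age population
child_ratio <- pop_under_15 / pop_15_to_64
old_age_ratio <- pop_over_64 / pop_15_to_64
total_ratio <- (pop_under_15 + pop_over_64) / pop_15_to_64

c(child = child_ratio, old_age = old_age_ratio, total = total_ratio)

# The total ratio equals the sum of the two component ratios
all.equal(total_ratio, child_ratio + old_age_ratio)
```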
Data from UNICEF's State of the World's Children 2019 Statistical Tables.
sowc_maternal_newborn
A data frame with 202 rows and 18 variables.
Country or area name.
Life expectancy: female in 2018.
Demand for family planning satisfied with modern methods (%) 2013–2018 Women aged 15 to 49.
Demand for family planning satisfied with modern methods (%) 2013–2018 Women aged 15 to 19.
Adolescent birth rate 2013 to 2018.
Births by age 18 (%) 2013 to 2018.
Antenatal care (%) 2013 to 2018 At least one visit.
Antenatal care (%) 2013 to 2018 At least four visits Women aged 15 to 49.
Antenatal care (%) 2013 to 2018 At least four visits Women aged 15 to 19.
Delivery care (%) 2013 to 2018 Skilled birth attendant Women aged 15 to 49.
Delivery care (%) 2013 to 2018 Skilled birth attendant Women aged 15 to 19.
Delivery care (%) 2013 to 2018 Institutional delivery.
Delivery care (%) 2013–2018 C-section.
Postnatal health check (%) 2013 to 2018 For newborns.
Postnatal health check (%) 2013 to 2018 For mothers.
Maternal mortality 2017 Number of maternal deaths.
Maternal mortality 2017 Maternal Mortality Ratio.
Maternal mortality 2017 Lifetime risk of maternal death (1 in X).
United Nations Children's Emergency Fund (UNICEF)
library(dplyr)
library(ggplot2)

# List countries and lifetime risk of maternal death (1 in X), ranked
sowc_maternal_newborn |>
  mutate(rank = round(rank(risk_maternal_death_2017), 0)) |>
  select(countries_and_areas, rank, risk_maternal_death_2017) |>
  arrange(rank)

# Graph scatterplot of Maternal Mortality Ratio 2017 and Antenatal Care 4+ Visits %
sowc_maternal_newborn |>
  select(antenatal_care_4_1549, maternal_mortality_ratio_2017) |>
  remove_missing(na.rm = TRUE) |>
  ggplot(aes(antenatal_care_4_1549, maternal_mortality_ratio_2017)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Antenatal Care and Mortality",
    x = "Antenatal Care 4+ visits %",
    y = "Maternal Mortality Ratio"
  )
Fifty companies were randomly sampled from the 500 companies in the S&P 500, and their financial information was collected on March 8, 2012.
sp500
A data frame with 50 observations on the following 12 variables.
Total value of all company shares, in millions of dollars.
The name of the stock (e.g. AAPL for Apple).
Enterprise value, which is an alternative to market cap that also accounts for things like cash and debt, in millions of dollars.
The market cap divided by the earnings (profits) over the last year.
The market cap divided by the forecasted earnings (profits) over the next year.
Enterprise value divided by the company's revenue.
Percent of earnings that are profits.
Revenue, in millions of dollars.
Quarterly revenue growth (year over year), in millions of dollars.
Earnings before interest, taxes, depreciation, and amortization, in millions of dollars.
Total cash, in millions of dollars.
Total debt, in millions of dollars.
Yahoo! Finance, retrieved 2012-03-08.
library(ggplot2)

# Original scales
ggplot(sp500, aes(x = ent_value, y = earn_before)) +
  geom_point() +
  labs(x = "Enterprise value", y = "Earnings")

ggplot(sp500, aes(x = ev_over_rev, y = forward_pe)) +
  geom_point() +
  labs(
    x = "Enterprise value / revenue",
    y = "Market cap / forecasted earnings"
  )

# Log scales on both axes
ggplot(sp500, aes(x = ent_value, y = earn_before)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Enterprise value", y = "Earnings")

ggplot(sp500, aes(x = ev_over_rev, y = forward_pe)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    x = "Enterprise value / revenue, logged",
    y = "Market cap / forecasted earnings, logged"
  )
Data runs from 1950 to near the end of 2018.
sp500_1950_2018
A data frame with 17346 observations on the following 7 variables.
Date of the form "YYYY-MM-DD".
Opening price.
Highest price of the day.
Lowest price of the day.
Closing price of the day.
Adjusted price at close after accounting for dividends paid out.
Trading volume.
Yahoo! Finance
data(sp500_1950_2018)
sp500.ten.years <- subset(
  sp500_1950_2018,
  "2009-01-01" <= as.Date(Date) & as.Date(Date) <= "2018-12-31"
)
d <- diff(sp500.ten.years$Adj.Close)
mean(d > 0)
Daily stock returns from the S&P500 for 1990-2011 can be used to assess whether stock activity each day is independent of the stock's behavior on previous days. We label each day as Up (U) or Down (D) depending on whether the market was up or down that day. From the sequence of price changes and their Up and Down labels, we then record the number of days that must be observed before each Up day.
sp500_seq
A data frame with 2948 observations on the following variable.
a factor with levels 1, 2, 3, 4, 5, 6, and 7+
sp500_seq
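The labelling and counting described for sp500_seq can be sketched on a short series of price changes. The values below are illustrative only and are not taken from the dataset; the counting convention (each wait includes the Up day itself, so the smallest value is 1) is an assumption inferred from the factor levels starting at 1.

```r
# Illustrative daily price changes; not taken from sp500_seq
change <- c(2.52, -1.46, 0.51, -4.07, 3.36, 1.10, -5.46, -1.03, -2.99, 1.71)

# Label each day Up (U) or Down (D)
label <- ifelse(change > 0, "U", "D")

# Days observed until each Up day, counted from the day after the previous Up day
# (assumed convention: the Up day itself is included in the count)
up_days <- which(label == "U")
wait <- diff(c(0, up_days))

# Collapse long waits into a "7+" level, mirroring the factor levels listed above
wait_factor <- factor(ifelse(wait >= 7, "7+", wait), levels = c(1:6, "7+"))

label
wait
table(wait_factor)
```

If days really were independent, these waiting times would be expected to follow a geometric distribution, which is the comparison the description alludes to.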
A survey asked 1,325 UCLA students about their height, the fastest speed they have ever driven, and their gender.
speed_gender_height
A data frame with 1325 observations on the following 3 variables.
a numeric vector
a factor with levels female and male
a numeric vector
speed_gender_height
User-submitted data on 1 TB solid state drives (SSDs).
ssd_speed
A data frame with 54 rows and 7 variables.
Brand name of the drive.
Model name of the drive.
Number of user-submitted benchmarks.
Physical form of the drive with levels 2.5, m.2, and mSATA.
Whether the drive uses the NVMe protocol: 1 if it does, 0 if it does not.
Average read speed from user benchmarks in MB/s.
Average write speed from user benchmarks in MB/s.
UserBenchmark, retrieved September 1, 2020.
library(ggplot2)
library(dplyr)

ssd_speed |>
  count(form_factor)

ssd_speed |>
  filter(form_factor != "mSATA") |>
  ggplot(aes(x = read, y = write, color = form_factor)) +
  geom_point() +
  labs(
    title = "Average read vs. write speed of SSDs",
    x = "Read speed (MB/s)",
    y = "Write speed (MB/s)"
  ) +
  facet_wrap(~form_factor, ncol = 1, scales = "free") +
  # guides(color = FALSE) is deprecated in recent ggplot2 versions; use "none"
  guides(color = "none")
Nutrition facts for several Starbucks food items.
starbucks
A data frame with 77 observations on the following 7 variables.
Food item.
Calories.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a factor with levels bakery, bistro box, hot breakfast, parfait, petite, salad, and sandwich
https://www.starbucks.com/menu, retrieved 2011-03-10.
starbucks
Scores range from 57 to 94.
stats_scores
A data frame with 20 observations on the following variable.
a numeric vector
stats_scores
Does treatment using embryonic stem cells (ESCs) help improve heart function following a heart attack? Each sheep in the study was randomly assigned to the ESC or control group, and the change in their hearts' pumping capacity was measured in the study. A positive value corresponds to increased pumping capacity, which generally suggests a stronger recovery.
stem_cell
A data frame with 18 observations on the following 3 variables.
a factor with levels ctrl and esc
a numeric vector
a numeric vector
doi:10.1016/S0140-6736(05)67380-1
stem_cell
An experiment studying the effectiveness of stents in treating patients at risk of stroke, with some unexpected results. stent30 represents the results 30 days after stroke and stent365 represents the results 365 days after stroke.
stent30
A data frame with 451 observations on the following 2 variables.
a factor with levels control and treatment
a factor with levels no event and stroke
Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. doi:10.1056/NEJMoa1105335.
NY Times article reporting on the study: https://www.nytimes.com/2011/09/08/health/research/08stent.html.
# 30-day results
table(stent30)

# 365-day results
table(stent365)
Monthly return data for a few stocks, which covers stock prices from November 2015 through October 2018.
stocks_18
A data frame with 36 observations on the following 4 variables.
First day of the month corresponding to the returns.
Google stock price change.
Caterpillar stock price change.
Exxon Mobil stock price change.
Yahoo! Finance, direct download.
d <- stocks_18
dim(d)
apply(d[, 2:3], 2, mean)
apply(d[, 2:3], 2, sd)
These are simulated data and intended to represent housing prices of students at a college.
student_housing
A data frame with 175 observations on the following variable.
Monthly housing price, simulated.
set.seed(5)
generate_student_housing <- data.frame(
  price = round(rnorm(175, 515, 65) + exp(rnorm(175, 4.2, 1)))
)
hist(student_housing$price, 20)
t.test(student_housing$price)
mean(student_housing$price)
sd(student_housing$price)
identical(student_housing, generate_student_housing)
A simulated dataset for how much 110 college students each slept in a single night.
student_sleep
A data frame with 110 observations on the following variable.
Number of hours slept by this student (simulated).
Simulated data.
set.seed(2)
x <- exp(c(
  rnorm(100, log(7.5), 0.15),
  rnorm(10, log(10), 0.196)
))
x <- round(x - mean(x) + 7.42, 2)
identical(x, student_sleep$hours)
Simulated individual fasting blood sugar levels (mmol/L) drawn from a normal distribution. Generally, normal fasting blood sugar levels are between 3.0 and 5.6 mmol/L; levels in the range 5.6 - 6.9 mmol/L are considered pre-diabetes. These data and sugar.levels.B are used in the Unit 4 lab of Introductory Statistics for the Life and Biomedical Sciences (ISLBS). See https://github.com/OI-Biostat/oi_biostat_labs for the full set of labs.
sugar.levels.A
A tibble with 100 rows and 1 variable
fasting.blood.sugar
Numeric, simulated fasting blood sugar in mmol/L
Simulated individual fasting blood sugar levels (mmol/L) drawn from a normal distribution. Generally, normal fasting blood sugar levels are between 3.0 and 5.6 mmol/L; levels in the range 5.6 - 6.9 mmol/L are considered pre-diabetes. These data and sugar.levels.A are used in the Unit 4 lab of Introductory Statistics for the Life and Biomedical Sciences (ISLBS). See https://github.com/OI-Biostat/oi_biostat_labs for the full set of labs.
sugar.levels.B
A tibble with 100 rows and 1 variable
fasting.blood.sugar
Numeric, simulated fasting blood sugar in mmol/L
Experiment data for studying the efficacy of treating patients who have had a heart attack with Sulphinpyrazone.
sulphinpyrazone
A data frame with 1475 observations on the following 2 variables.
a factor with levels control and treatment
a factor with levels died and lived
Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256.
sulphinpyrazone
Summary of a random survey of 976 people.
supreme_court
A data frame with 976 observations on the following variable.
a factor with levels approve and not
supreme_court
Data from an experiment comparing maximum swim velocities when swimmers are wearing a wetsuit versus a regular swimsuit. Paired velocity measurements were taken on each of 12 participants. The data include each swimmer's biological sex and an indication of whether the swimmer was a triathlete or just a swimmer. These data are also contained in the package Lock5Data.
swim
A data frame with 12 rows and 6 columns
swimmer.number
Numeric, index of a swimmer
swimmer.sex
Factor, with levels male, female
swimmer.class
Factor, classification of swimmer, with levels swimmer, triathlete
wet.suit.velocity
Numeric, maximum velocity wearing a wet suit, in meters/sec
swim.suit.velocity
Numeric, maximum velocity wearing a swim suit, in meters/sec
velocity.diff
Numeric, wet.suit.velocity - swim.suit.velocity
https://doi.org/10.1016/S1440-2440(00)80042-0
Table 3 of De Lucas, Ricardo Dantas, et al. The effects of wet suits on physiological and biomechanical indices during swimming. Journal of Science and Medicine in Sport 3.1 (2000): 1-8.
The Lackey study was a prospective cohort study of adult smear-positive tuberculosis (TB) patients with no prior TB disease, enrolled between January 2010 and December 2011. Data from the cohort were used to model the association of several predictors with a treatment interruption before completion of the full course of therapy. The analysis of treatment outcome in the original article uses methods for binary data. A time-to-event analysis might be more appropriate, but the dataset does not contain sufficient data for that analysis.
tb.interruption
A tibble with 1293 rows and 18 variables:
id
Character vector, unique participant ID
age.group
A factor with 4 levels: 21 and younger; 22 to 26; 27 to 37; 38 and older
bmi
A factor with 3 levels: Normal; Overweight/Obese; Underweight. These categories reflect older WHO coding and do not apply to all populations.
chronic.disease
A factor with 2 levels: No, no other chronic disease; Yes, other chronic diseases present in the participant
hiv.test
Outcome of HIV test, a factor with 3 levels: Negative; Positive; Test not Done
marital.status
A factor with 4 levels: Divorced/separated; Married/cohabitating; Single; Widowed
poverty
Socioeconomic status, a factor with 2 levels: No, not living in extreme poverty; Yes, living in extreme poverty
prison.history
A factor with 2 levels: No, no history of having been incarcerated; Yes, participant has been incarcerated
education
A factor with 2 levels: No, participant does not have at least a secondary school education; Yes, participant does have a secondary school education
tobacco.use
A factor with 3 levels: Currently smokes; Never smoked; Used to Smoke
alcohol.use
A factor with 2 levels: No, participant does not use alcohol at least weekly; Yes, participant does use alcohol at least weekly
drug.use
A factor with 2 levels: No, no history of illicit drug use; Yes, a history of illicit drug use
rehab.history
A factor with 2 levels: No, no history of residence in a rehabilitation facility; Yes, prior residence in a rehabilitation facility
mdr.tb
A factor with 2 levels: No, participant has not been treated for multi-drug resistant TB; Yes, participant has been treated for MDR TB
diabetes
A factor with 2 levels: No, participant does not have type 2 diabetes; Yes, participant does have diabetes
trt.outcome
A factor with 5 levels denoting treatment outcome: Cured; Default (treatment was interrupted before 2 months); Died; Still in treatment; Transferred out
doi:10.5061/dryad.fp94d
Lackey, Brian, et al. "Patient characteristics associated with tuberculosis treatment default: a cohort study in a high-incidence area of Lima, Peru." PLoS One 10.6 (2015): e0128541. doi:10.1371/journal.pone.0128541
This dataset contains teacher salaries from 2009-2010 for 71 teachers employed by the St. Louis Public School in Michigan, as well as several covariates.
teacher
A data frame with 71 observations on the following 8 variables.
Identification code for each teacher, assigned randomly.
Highest educational degree attained: BA (bachelor's degree) or MA (master's degree).
Full-time enrollment status: full-time 1 or part-time 0.5.
Number of years employed by the school district.
Base annual salary, in dollars.
Amount paid into Social Security and Medicare per year through the Federal Insurance Contribution Act (FICA), in dollars.
Amount paid into the retirement fund of the teacher per year, in dollars.
Total annual salary of the teacher, resulting from the sum of base salary + fica + retirement, in dollars.
Originally posted on SODA Developers (dev.socrata.com/data), removed in 2020.
library(ggplot2)

# Salary and education level
ggplot(teacher, aes(x = degree, y = base)) +
  geom_boxplot() +
  labs(
    x = "Highest educational degree attained",
    y = "Base annual salary, in $",
    color = "Degree",
    title = "Salary and education level"
  )

# Salary and years of employment
ggplot(teacher, aes(x = years, y = base, color = degree)) +
  geom_point() +
  labs(
    x = "Number of years employed by the school district",
    y = "Base annual salary, in $",
    color = "Degree",
    title = "Salary and years of employment"
  )
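The format section describes total annual salary as the sum of base salary, FICA, and retirement contributions. A minimal sketch of that arithmetic on made-up figures follows; the data frame pay and its values are hypothetical, and the column names fica, retirement, and total are assumed rather than confirmed by this page (only base appears in the example above).

```r
# Hypothetical salary components in dollars; these are not rows from teacher
pay <- data.frame(
  base = c(45000, 60000),
  fica = c(3442.50, 4590.00),
  retirement = c(9000, 12000)
)

# Total annual salary as described above: base + fica + retirement
pay$total <- pay$base + pay$fica + pay$retirement
pay
```

Under the same naming assumption, a check such as with(teacher, all.equal(total, base + fica + retirement)) could be used to confirm the relationship in the packaged data.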
A random sample was taken of nearly 10% of UCLA courses. The most expensive textbook for each course was identified, and its new prices at the UCLA Bookstore and on Amazon.com were recorded.
textbooks
A data frame with 73 observations on the following 7 variables.
Course department (abbreviated).
Course number.
Book ISBN.
New price at the UCLA Bookstore.
New price on Amazon.com.
Whether additional books were required for the course (Y means "yes, additional books were required").
The UCLA Bookstore price minus the Amazon.com price for each book.
The sample represents only courses where textbooks were listed online through UCLA Bookstore's website. The most expensive textbook was selected based on the UCLA Bookstore price, which may insert bias into the data; for this reason, it may be beneficial to analyze only the data where more is "N".
Collected by David Diez.
library(ggplot2) ggplot(textbooks, aes(x = diff)) + geom_histogram(binwidth = 5) t.test(textbooks$diff)
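The details above suggest restricting the analysis to courses that required only one book. A minimal sketch of that idea follows, using made-up prices rather than the real data. The column names ucla_new and amaz_new are assumptions for illustration; more and diff follow the variable descriptions.

```r
# Hypothetical prices; not taken from the textbooks dataset.
toy_books <- data.frame(
  ucla_new = c(27.67, 40.59, 31.68, 16.00, 18.95, 14.95),
  amaz_new = c(27.95, 31.14, 32.00, 11.52, 14.21, 10.17),
  more     = c("Y", "Y", "N", "N", "N", "N")
)

# UCLA Bookstore price minus Amazon price for each book.
toy_books$diff <- toy_books$ucla_new - toy_books$amaz_new

# Keep only courses where no additional books were required.
single_book <- subset(toy_books, more == "N")

# One-sample t-test of the mean price difference against zero.
res <- t.test(single_book$diff)
res$estimate
res$conf.int
```

With the package loaded, replacing toy_books with textbooks should give the corresponding analysis on the real sample.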
This entry gives simulated spending data for Americans during Thanksgiving in 2009 based on findings of a Gallup poll.
thanksgiving_spend
A data frame with 436 observations on the following 1 variable.
Amount of spending, in US dollars.
library(ggplot2) ggplot(thanksgiving_spend, aes(x = spending)) + geom_histogram(binwidth = 20)
Data derived from a study examining whether the population mean body temperature is 98.6 degrees Fahrenheit. Participant-level data were constructed from histograms in the cited reference.
thermometry
A tibble with 130 rows and 3 variables:
body.temp
Numeric, body temperature in degrees Fahrenheit
gender
Factor, recorded gender of participant, with levels female, male
heart.rate
Numeric, heart rate, in beats per minute
http://jse.amstat.org/v4n2/datasets.shoemaker.html
Mackowiak, P. A., Wasserman, S. S., and Levine, M. M. (1992), A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich, Journal of the American Medical Association, 268, 1578-1580.
Shoemaker, A.L., College, C. (1996) What's Normal? – Temperature, Gender, and Heart Rate, Journal of Statistics Education, 4 (2).
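The question behind this dataset, whether the population mean body temperature is 98.6 degrees Fahrenheit, maps naturally onto a one-sample t-test. The sketch below uses simulated temperatures as a stand-in; the mean and standard deviation passed to rnorm() are arbitrary assumptions, not values from the study. With the package loaded, thermometry$body.temp could be used in place of the simulated vector.

```r
# Simulated stand-in for body temperatures (hypothetical values).
set.seed(98)
body_temp <- rnorm(130, mean = 98.2, sd = 0.7)

# Test H0: mean body temperature equals 98.6 degrees Fahrenheit.
res <- t.test(body_temp, mu = 98.6)
res$statistic
res$p.value
res$conf.int
```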
A simulated dataset of tips over a few weeks on a couple of days per week. Each tip is associated with a single group, which may include several bills and tables (i.e. groups paid in one lump sum in simulations).
tips
A data frame with 95 observations on the following 5 variables.
Week number.
Day, either Friday or Tuesday.
Number of people associated with the group.
Total bill for the group.
Total tip from the group.
This dataset was built using simulations of tables, then bills, then tips based on the bills. Large groups were assumed to only pay the gratuity, which is evident in the data. Tips were set to be plausible round values; they were often (but not always) rounded to dollars, quarters, etc.
Simulated dataset.
library(ggplot2) ggplot(tips, aes(x = day, y = tip)) + geom_boxplot() ggplot(tips, aes(x = tip, fill = factor(week))) + geom_density(alpha = 0.5) + labs(x = "Tip", y = "Density", fill = "Week") ggplot(tips, aes(x = tip)) + geom_dotplot() ggplot(tips, aes(x = tip, fill = factor(day))) + geom_density(alpha = 0.5) + labs(x = "Tip", y = "Density", fill = "Day")
Simulated data for a fake political candidate.
toohey
A data frame with 500 observations on the following variable.
A factor with levels no and yes.
toohey
Summary of tourism in Turkey.
tourism
A data frame with 47 observations on the following 3 variables.
a numeric vector
a numeric vector
a numeric vector
Association of Turkish Travel Agencies, Foreign Visitors Figure & Tourist Spendings By Years. http://www.tursab.org.tr/en/statistics/foreign-visitors-figure-tourist-spendings-by-years_1083.html
tourism
Simulated dataset for building intuition about the ideas that ANOVA is based on.
toy_anova
A data frame with 70 observations on the following 3 variables.
A factor with levels I, II, and III.
a numeric vector
toy_anova
Summarizing whether there was or was not a complication for 62 patients who used a particular medical consultant.
transplant
A data frame with 62 observations on the following variable.
A factor with levels complications and okay.
transplant
Construct beautiful tree diagrams
treeDiag( main, p1, p2, out1 = c("Yes", "No"), out2 = c("Yes", "No"), textwd = 0.15, solwd = 0.2, SBS = c(TRUE, TRUE), showSol = TRUE, solSub = NULL, digits = 4, textadj = 0.015, cex.main = 1.3, col.main = "#999999", showWork = FALSE )
main |
Character vector with two variable names, descriptions, or questions |
p1 |
Vector of probabilities for the primary branches |
p2 |
List for the secondary branches, where each list item should be a
numerical vector of probabilities corresponding to the primary branches of
|
out1 |
Character vector of the outcomes corresponding to the primary branches |
out2 |
Character vector of the outcomes corresponding to the secondary branches |
textwd |
The width provided for text with a default of |
solwd |
The width provided for the solution with a default of |
SBS |
A boolean vector indicating whether to place text and probability side-by-side for the primary and secondary branches |
showSol |
Boolean indicating whether to show the solution in the tree diagram |
solSub |
An optional list of vectors corresponding to |
digits |
The number of digits to show in the solution |
textadj |
Vertical adjustment of text |
cex.main |
Size of |
col.main |
Color of |
showWork |
Whether work should be shown for the solutions |
David Diez, Christopher Barr
treeDiag(
  c("Flight on time?", "Luggage on time?"),
  c(0.8, 0.2),
  list(c(0.97, 0.03), c(0.15, 0.85))
)

# Secondary-branch probabilities must sum to 1 within each primary branch
treeDiag(c("Breakfast?", "Go to class"), c(.4, .6),
  list(c(0.4, 0.36, 0.24), c(0.6, 0.3, 0.1)),
  c("Yes", "No"),
  c("Statistics", "English", "Sociology"),
  showWork = TRUE
)

treeDiag(
  c("Breakfast?", "Go to class"),
  c(0.4, 0.11, 0.49),
  list(c(0.4, 0.36, 0.24), c(0.6, 0.3, 0.1), c(0.1, 0.4, 0.5)),
  c("one", "two", "three"),
  c("Statistics", "English", "Sociology")
)

treeDiag(c("Dow Jones rise?", "NASDAQ rise?"),
  c(0.53, 0.47),
  list(c(0.75, 0.25), c(0.72, 0.28)),
  solSub = list(c("(a)", "(b)"), c("(c)", "(d)")),
  solwd = 0.08
)
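For readers who want to check a tree diagram by hand, the joint probability at the end of each path is the product of the primary and secondary branch probabilities along it. The sketch below reproduces that arithmetic in base R for the flight and luggage example; it illustrates the probability rule only and is not a description of how treeDiag computes its solutions internally.

```r
# Primary branch probabilities: flight on time (Yes, No).
p1 <- c(0.8, 0.2)
# Secondary branch probabilities: luggage on time, given each primary branch.
p2 <- list(c(0.97, 0.03), c(0.15, 0.85))

# Multiply each primary probability by its conditional probabilities.
joint <- unlist(Map(function(p, q) p * q, p1, p2))
names(joint) <- c("Yes & Yes", "Yes & No", "No & Yes", "No & No")
joint

# The joint probabilities over all paths should sum to 1.
sum(joint)
```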
A data frame containing data collected in the mid 20th century by Cyril Burt from a study that tracked down identical twins who were separated at birth: one child was raised in the home of their biological parents and the other in a foster home. In an attempt to answer the question of whether intelligence is the result of nature or nurture, both children were given IQ tests.
twins
A data frame with 27 observations on the following 2 variables.
IQ score of the twin raised by Foster parents.
IQ score of the twin raised by Biological parents.
library(ggplot2) library(dplyr) library(tidyr) plot_data <- twins |> pivot_longer(cols = c(foster, biological), names_to = "twin", values_to = "iq") ggplot(plot_data, aes(iq, fill = twin)) + geom_histogram(color = "white", binwidth = 5) + facet_wrap(~twin) + theme_minimal() + labs( title = "IQ of identical twins", subtitle = "Separated at birth", x = "IQ", y = "Count", fill = "" )
List of all courses at UCLA during Fall 2018.
ucla_f18
A data frame with 3950 observations on the following 14 variables.
Year the course was offered
Term the course was offered
Subject
Subject abbreviation, if any
Course name
Course number, complete
Course number, numeric only
Boolean for if this is a seminar course
Boolean for if this is some form of independent study
Boolean for if this is an apprenticeship
Boolean for if this is an internship
Boolean for if this is an honors contracts course
Boolean for if this is a lab
Boolean for if this is any of the special types of courses listed
https://sa.ucla.edu/ro/public/soc, retrieved 2018-11-22.
nrow(ucla_f18) table(ucla_f18$special_topic) subset(ucla_f18, is.na(course_numeric)) table(subset(ucla_f18, !special_topic)$course_numeric < 100) elig_courses <- subset(ucla_f18, !special_topic & course_numeric < 100) set.seed(1) ucla_textbooks_f18 <- elig_courses[sample(nrow(elig_courses), 100), ] tmp <- order( ucla_textbooks_f18$subject, ucla_textbooks_f18$course_numeric ) ucla_textbooks_f18 <- ucla_textbooks_f18[tmp, ] rownames(ucla_textbooks_f18) <- NULL head(ucla_textbooks_f18)
A sample of courses were collected from UCLA from Fall 2018, and the corresponding textbook prices were collected from the UCLA bookstore and also from Amazon.
ucla_textbooks_f18
A data frame with 201 observations on the following 20 variables.
Year the course was offered
Term the course was offered
Subject
Subject abbreviation, if any
Course name
Course number, complete
Course number, numeric only
Boolean for if this is a seminar course.
Boolean for if this is some form of independent study
Boolean for if this is an apprenticeship
Boolean for if this is an internship
Boolean for if this is an honors contracts course
Boolean for if this is a lab
Boolean for if this is any of the special types of courses listed
Textbook ISBN
New price at the UCLA bookstore
Used price at the UCLA bookstore
New price sold by Amazon
Used price sold by Amazon
Any relevant notes
A past dataset was collected from UCLA courses in Spring 2010, and Amazon prices at that time were found to be almost uniformly lower than those of the UCLA bookstore. Now in 2018, the UCLA bookstore is about even with Amazon on the vast majority of titles, and there is no statistical difference in the sample data.
The most expensive book required for the course was generally used.
The reason why we advocate for using raw amount differences instead of percent differences is that a 20% price difference on low-priced books would balance numerically (but not in a practical sense) a moderate but important price difference on more expensive books. So while this tends to result in a bit less sensitivity in detecting some effect, we believe the absolute difference compares prices in a more meaningful way.
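A small worked example may make the argument for raw differences concrete. The prices below are entirely hypothetical: one cheap book that costs $2 more at the bookstore and one expensive book that costs $20 less. Percent differences are computed relative to the Amazon price, which is one of several possible conventions.

```r
# Hypothetical new prices for one cheap and one expensive book.
bookstore <- c(cheap = 12, expensive = 180)
amazon    <- c(cheap = 10, expensive = 200)

# Raw and percent differences (bookstore minus Amazon).
raw_diff <- bookstore - amazon
pct_diff <- (bookstore - amazon) / amazon * 100

raw_diff
pct_diff

# The averages point in opposite directions: the mean percent difference
# is positive, while the mean dollar difference is negative.
mean(pct_diff)
mean(raw_diff)
```

Here the large percent gap on the cheap book outweighs the smaller percent gap on the expensive one, even though the dollar amounts say the opposite.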
Used prices contain the shipping cost but do not contain tax. The used prices are a more nuanced comparison, since these are all 3rd party sellers. Amazon is often more a marketplace than a retail site at this point, and many people buy from 3rd party sellers on Amazon now without realizing it. The relationship Amazon has with 3rd party sellers is also challenging. Given the frequently changing dynamics in this space, we don't think any analysis here will be very reliable for long term insights since products from these sellers change frequently in quantity and price. For this reason, we focus only on new books sold directly by Amazon in our comparison. In a future round of data collection, it may be interesting to explore whether the dynamics have changed in the used market.
https://sa.ucla.edu/ro/public/soc
library(ggplot2)
library(dplyr)

ggplot(ucla_textbooks_f18, aes(x = bookstore_new, y = amazon_new)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "orange") +
  labs(
    x = "UCLA Bookstore price",
    y = "Amazon price",
    title = "Amazon vs. UCLA Bookstore prices of new textbooks",
    subtitle = "Orange line represents y = x"
  )

# The following outliers were double checked for accuracy
ucla_textbooks_f18_with_diff <- ucla_textbooks_f18 |>
  mutate(diff = bookstore_new - amazon_new)

ucla_textbooks_f18_with_diff |>
  filter(diff > 20 | diff < -20)

# Distribution of price differences
ggplot(ucla_textbooks_f18_with_diff, aes(x = diff)) +
  geom_histogram(binwidth = 5)

# t-test of price differences
t.test(ucla_textbooks_f18_with_diff$diff)
This dataset comes from the Guardian's Data Blog and includes five financial demographic variables.
ukdemo
A data frame with 12 observations on the following 6 variables.
Region in the United Kingdom
Average regional debt, not including mortgages, in pounds
Percent unemployment
Average house price, in pounds
Average hourly pay, in pounds
Retail price index, which is standardized to 100 for the entire UK, and lower index scores correspond to lower prices
The data was described in the Guardian Data Blog: https://www.theguardian.com/news/datablog/interactive/2011/oct/27/debt-money-expert-facts, retrieved 2011-11-01.
Guardian Data Blog
library(ggplot2) ggplot(ukdemo, aes(x = pay, y = rpi)) + geom_point() + labs(x = "Average hourly pay", y = "Retail price index")
A compilation of two datasets that provides an estimate of unemployment from 1890 to 2010.
unempl
A data frame with 121 observations on the following 3 variables.
Year
Unemployment rate, in percent
1 if from the Bureau of Labor Statistics, 0 otherwise
The data are from Wikipedia at the following URL accessed on November 1st, 2010:
https://en.wikipedia.org/wiki/File:US_Unemployment_1890-2009.gif
Below is a direct quote from Wikipedia describing the sources of the data:
Own work by Peace01234 Complete raw data are on Peace01234. 1930-2009 data are from Bureau of Labor Statistics (BLS), Employment status of the civilian noninstitutional population, 1940 to date retrieved on March 6, 2009 and February 12, 2010 from the BLS' FTP server. Data prior to 1948 are for persons age 14 and over. Data beginning in 1948 are for persons age 16 and over. See also "Historical Comparability" under the Household Data section of the Explanatory Notes at https://www.bls.gov/cps/eetech_methods.pdf. 1890-1930 data are from Christina Romer (1986). "Spurious Volatility in Historical Unemployment Data", The Journal of Political Economy, 94(1): 1-37. 1930-1940 data are from Robert M. Coen (1973). "Labor Force and Unemployment in the 1920's and 1930's: A Re-Examination Based on Postwar Experience", The Review of Economics and Statistics, 55(1): 46-55. Unemployment data was only surveyed once each decade until 1940 when yearly surveys were begun. The yearly data estimates before 1940 are based on the decade surveys combined with other relevant surveys that were collected during those years. The methods are described in detail by Coen and Romer.
# =====> Time Series Plot of Data <=====#
COL <- c("#DDEEBB", "#EEDDBB", "#BBDDEE", "#FFD5DD", "#FFC5CC")
plot(unempl$year, unempl$unemp, type = "n")
rect(0, -50, 3000, 100, col = "#E2E2E2")
rect(1914.5, -1000, 1918.9, 1000, col = COL[1], border = "#E2E2E2")
rect(1929, -1000, 1939, 1000, col = COL[2], border = "#E2E2E2")
rect(1939.7, -1000, 1945.6, 1000, col = COL[3], border = "#E2E2E2")
rect(1955.8, -1000, 1965.3, 1000, col = COL[4], border = "#E2E2E2")
rect(1965.3, -1000, 1975.4, 1000, col = COL[5], border = "#E2E2E2")
abline(h = seq(0, 50, 5), col = "#F8F8F8", lwd = 2)
abline(v = seq(1900, 2000, 20), col = "#FFFFFF", lwd = 1.3)
lines(unempl$year, unempl$unemp)
points(unempl$year, unempl$unemp, pch = 20)
legend("topright",
  fill = COL,
  c(
    "World War I", "Great Depression", "World War II",
    "Vietnam War Start", "Vietnam War Escalated"
  ),
  bg = "#FFFFFF", border = "#FFFFFF"
)
Covers midterm elections.
unemploy_pres
A data frame with 29 observations on the following 5 variables.
Year.
The president in office.
President's party.
Unemployment rate.
Change in House seats for the president's party.
Wikipedia.
unemploy_pres
A representative set of monitoring locations were taken from NOAA data in 1950 and 2022 such that the locations are sampled roughly geographically across the continental US (the observations do not represent a random sample of geographical locations).
us_temperature
A data frame with 18759 observations on the following 9 variables.
Location of the NOAA weather station.
Formal ID of the NOAA weather station.
Latitude of the NOAA weather station.
Longitude of the NOAA weather station.
Elevation of the NOAA weather station.
Date the measurement was taken (Y-m-d).
Maximum daily temperature (Fahrenheit).
Minimum daily temperature (Fahrenheit).
Year of the measurement.
Please keep in mind that the data represent two annual snapshots, and a complete analysis would consider more than two years of data and a random or more complete sampling of weather stations across the US.
NOAA Climate Data Online. Retrieved 23 September, 2023.
library(dplyr) library(ggplot2) library(maps) summarized_temp <- us_temperature |> group_by(station, year, latitude, longitude) |> summarize(tmax_med = median(tmax, na.rm = TRUE)) |> mutate(plot_shift = ifelse(year == "1950", 0, 1)) |> mutate(year = as.factor(year)) usa <- map_data("state") ggplot(data = usa, aes(x = long, y = lat)) + geom_polygon(aes(group = group), color = "black", fill = "white") + geom_point( data = summarized_temp, aes( x = longitude + plot_shift, y = latitude, color = tmax_med, shape = year ) ) + scale_color_gradient(high = IMSCOL["red", 1], low = IMSCOL["yellow", 1]) + ggtitle("Median of the daily high temp, 1950 & 2022") + labs( x = "longitude", color = "median high temp" ) + guides(shape = guide_legend(override.aes = list(color = "black")))
Data from a study carried out by the Graduate Division of the University of California, Berkeley in the early 1970's to evaluate whether there was a sex bias in graduate admissions.
ucb_admit
A data frame with 4526 observations on the following 3 variables.
Was the applicant admitted to the university?
Whether the applicant identified as male or female.
What department did the applicant apply to, noted as A through F for confidentiality.
library(ggplot2) library(dplyr) plot_data <- ucb_admit |> count(dept, gender, admit) ggplot(plot_data, aes(dept, n, fill = gender)) + geom_col(position = "dodge") + facet_wrap(~admit) + theme_minimal() + labs( title = "Does gender discrimination play a role in college admittance?", x = "Department", y = "Number of Students", fill = "Gender", caption = "Source: UC Berkeley, 1970's" )
A data frame with 217 rows and 11 variables from the World Development Indicators (WDI) available from the World Bank. The rows contain only country level data. Regional groupings such as the European Union (EU) and financial groupings such as low income countries have been eliminated. World Bank Country codes (iso2c, iso3c) have been dropped. The data were downloaded from the World Bank on 17 July 2024 using the R package WDI, version 2.8.8, Arel-Bundock V (2022). WDI: World Development Indicators and Other World Bank Data. R package version 2.7.8, https://CRAN.R-project.org/package=WDI. These data update the dataset wdi.2011 in the previous version of the package, which is outdated and has been removed. Some variable names have been changed for readability and some constructed variables (e.g., log(gdp)) have not been included. Missing values have been retained.
wdi_2022
A data frame with 217 rows and 11 columns
Character variable with country name
Numeric, gross national income (GNI) per capita, based on purchasing power parity (PPP) in international $
Numeric, gross domestic product (GDP) per capita, based on PPP in international $
Numeric, life expectancy at birth, in years
Numeric, adolescent fertility rate, births per 1,000 women age 15 - 19
Numeric, total fertility rate, births per woman
Numeric, infant deaths per 1,000 live births
Numeric, percent of the population with access to basic sanitation
Numeric, adult literacy rate, percent of population above the age of 15 considered literate
Numeric, government expenditures on education as a percent of GDP
Numeric, primary school completion rate among the relevant population of women
https://data.worldbank.org/indicator
These times represent times between gondolas at Sterling Winery. The main take-away: there are 7 cars, as evidenced by the somewhat regular increases in splits between every 7 cars. The reason the times are slightly non-constant is that the gondolas come off the tracks, so times will change a little between each period.
winery_cars
A data frame with 52 observations on the following 2 variables.
The observation number, e.g. observation 3 was immediately preceded by observation 2.
Time until this gondola car arrived since the last car had left.
Important context: there was a sufficient line that people were leaving the winery.
So why is this data valuable? It indicates that the winery should add one more car since it has a lot of time wasted every 7th car. By adding another car, fewer visitors are likely to be turned away, resulting in increased revenue.
In-person data collection by David Diez (OpenIntro) on 2013-07-04.
winery_cars$car_number <- rep(1:7, 10)[1:nrow(winery_cars)] col <- COL[ifelse(winery_cars$car_number == 3, 4, 1)] plot(winery_cars[, c("obs_number", "time_until_next")], col = col, pch = 19 ) plot(winery_cars$car_number, winery_cars$time_until_next, col = fadeColor(col, "88"), pch = 19 )
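The seven-car pattern described above can also be checked numerically by averaging the gaps within each position of the cycle. The sketch below does this on simulated gaps, not on the real measurements; the gap lengths (about 30 with one slow position of about 60) are arbitrary assumptions chosen only to make the pattern visible. With the package loaded, winery_cars$time_until_next could replace the simulated vector.

```r
# Simulated gaps between gondola cars (hypothetical values).
set.seed(2013)
n_obs <- 52
position <- rep(1:7, length.out = n_obs)  # position within a 7-car cycle

# One position in the cycle has a much longer gap than the others.
gap <- ifelse(position == 3, 60, 30) + rnorm(n_obs, sd = 2)

# Average gap at each position in the cycle.
mean_gap <- tapply(gap, position, mean)
round(mean_gap, 1)

# Position with the longest average wait.
slow_position <- names(which.max(mean_gap))
slow_position
```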
From World Bank, population 1960-2020
world_pop
A data frame with 216 rows and 62 variables.
Name of country.
population in 1960.
population in 1961.
population in 1962.
population in 1963.
population in 1964.
population in 1965.
population in 1966.
population in 1967.
population in 1968.
population in 1969.
population in 1970.
population in 1971.
population in 1972.
population in 1973.
population in 1974.
population in 1975.
population in 1976.
population in 1977.
population in 1978.
population in 1979.
population in 1980.
population in 1981.
population in 1982.
population in 1983.
population in 1984.
population in 1985.
population in 1986.
population in 1987.
population in 1988.
population in 1989.
population in 1990.
population in 1991.
population in 1992.
population in 1993.
population in 1994.
population in 1995.
population in 1996.
population in 1997.
population in 1998.
population in 1999.
population in 2000.
population in 2001.
population in 2002.
population in 2003.
population in 2004.
population in 2005.
population in 2006.
population in 2007.
population in 2008.
population in 2009.
population in 2010.
population in 2011.
population in 2012.
population in 2013.
population in 2014.
population in 2015.
population in 2016.
population in 2017.
population in 2018.
population in 2019.
population in 2020.
library(dplyr)
library(ggplot2)
library(tidyr)

# List percentage of population change from 1960 to 2020
world_pop |>
  mutate(percent_change = round((year_2020 - year_1960) / year_2020 * 100, 2)) |>
  mutate(rank_pop_change = round(rank(-percent_change), 0)) |>
  select(rank_pop_change, country, percent_change) |>
  arrange(rank_pop_change)

# Graph population in millions by decade for specified countries
world_pop |>
  select(
    country, year_1960, year_1970, year_1980, year_1990,
    year_2000, year_2010, year_2020
  ) |>
  filter(country %in% c("China", "India", "United States")) |>
  pivot_longer(
    cols = c(year_1960, year_1970, year_1980, year_1990, year_2000, year_2010, year_2020),
    names_to = "year",
    values_to = "population"
  ) |>
  mutate(year = as.numeric(gsub("year_", "", year))) |>
  ggplot(aes(year, population, color = country)) +
  geom_point() +
  geom_smooth(method = "loess", formula = "y ~ x") +
  labs(
    title = "Population",
    subtitle = "by Decade",
    x = "Year",
    y = "Population (in millions)",
    color = "Country"
  )
The function should be run with a path to a package directory. It will then look through the data directory of the package, and for all datasets that are data frames, create CSV variants in a data-csv directory.
write_pkg_data( pkg, dir = paste0("data-", out_type), overwrite = FALSE, out_type = c("csv", "tab", "R") )
pkg |
The R package where we'd like to generate CSVs of any data frames. |
dir |
A character string representing the path to the folder where the CSV files should be written. If no such directory exists, one will be created (recursively). |
overwrite |
Boolean indicating whether to overwrite any existing files that have conflicting names in the specified directory. |
out_type |
Format for the type of output as a CSV ( |
## Not run:
write_pkg_data("openintro")
list.files("data-csv")
## End(Not run)
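To make the described behavior concrete, here is a minimal base R sketch of the general idea: walk over a set of named objects, skip anything that is not a data frame, and write the rest as CSV files. The helper write_frames_csv is hypothetical and is not the implementation of write_pkg_data; it only mirrors the dir and overwrite arguments described above and writes into a temporary directory.

```r
# Hypothetical helper: write every data frame in a named list to CSV.
write_frames_csv <- function(objs, dir, overwrite = FALSE) {
  # Create the output directory (recursively) if it does not exist.
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  written <- character(0)
  for (nm in names(objs)) {
    # Only data frames are exported; other objects are skipped.
    if (!is.data.frame(objs[[nm]])) next
    path <- file.path(dir, paste0(nm, ".csv"))
    # Respect existing files unless overwrite is requested.
    if (file.exists(path) && !overwrite) next
    utils::write.csv(objs[[nm]], path, row.names = FALSE)
    written <- c(written, path)
  }
  invisible(written)
}

# Usage with one data frame and one non-data-frame object.
out_dir <- file.path(tempdir(), "data-csv")
objs <- list(cars = head(cars), letters = letters)
written <- write_frames_csv(objs, out_dir, overwrite = TRUE)
list.files(out_dir)
```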
Monthly data covering 2006 through early 2014.
xom
A data frame with 98 observations on the following 7 variables.
Date.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Yahoo! Finance.
xom
An experiment conducted by the MythBusters, a science entertainment TV program on the Discovery Channel, tested if a person can be subconsciously influenced into yawning if another person near them yawns. 50 people were randomly assigned to two groups: 34 to a group where a person near them yawned (treatment) and 16 to a group where there wasn't a person yawning near them (control).
yawn
A data frame with 50 observations on the following 2 variables.
A factor with levels not yawn and yawn.
A factor with levels ctrl and trmt.
MythBusters, Season 3, Episode 28.
yawn
Select variables from YRBSS.
yrbss
A data frame with 13583 observations on the following 13 variables.
Age, in years.
Gender.
School grade.
Hispanic or not.
Race / ethnicity.
Height, in meters (3.28 feet per meter).
Weight, in kilograms (2.2 pounds per kilogram).
How often did you wear a helmet when biking in the last 12 months?
How many days did you text while driving in the last 30 days?
How many days were you physically active for 60+ minutes in the last 7 days?
How many hours of TV do you typically watch on a school night?
How many days did you do strength training (e.g. lift weights) in the last 7 days?
How many hours of sleep do you typically get on a school night?
CDC's Youth Risk Behavior Surveillance System (YRBSS)
table(yrbss$physically_active_7d)
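The variable descriptions give approximate conversion factors for height and weight (3.28 feet per meter, 2.2 pounds per kilogram). The sketch below applies them to a few made-up measurements; the column names height and weight are assumed for illustration, and the values are hypothetical.

```r
# Hypothetical heights (meters) and weights (kilograms).
toy_yrbss <- data.frame(
  height = c(1.50, 1.73, 1.60),
  weight = c(52.0, 70.3, 84.4)
)

# Convert using the approximate factors given in the variable descriptions.
toy_yrbss$height_ft <- toy_yrbss$height * 3.28
toy_yrbss$weight_lb <- toy_yrbss$weight * 2.2
toy_yrbss
```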
A sample of the yrbss dataset.
yrbss_samp
A data frame with 100 observations on the following 13 variables.
Age, in years.
Gender.
School grade.
Hispanic or not.
Race / ethnicity.
Height, in meters (3.28 feet per meter).
Weight, in kilograms (2.2 pounds per kilogram).
How often did you wear a helmet when biking in the last 12 months?
How many days did you text while driving in the last 30 days?
How many days were you physically active for 60+ minutes in the last 7 days?
How many hours of TV do you typically watch on a school night?
How many days did you do strength training (e.g. lift weights) in the last 7 days?
How many hours of sleep do you typically get on a school night?
CDC's Youth Risk Behavior Surveillance System (YRBSS)
table(yrbss_samp$physically_active_7d)
Social experiment
Description
A "social experiment" conducted by a TV program questioned what people do when they see a very obviously bruised woman getting picked on by her boyfriend. On two different occasions at the same restaurant, the same couple was depicted. In one scenario the woman was dressed "provocatively" and in the other scenario the woman was dressed "conservatively". The table below shows how many restaurant diners were present under each scenario, and whether or not they intervened.
Usage
Format
A data frame with 45 observations on the following 2 variables.
Whether other diners intervened or not.
How the woman was dressed.
Examples