Tidy Tuesday Exercise

Author

Leah Lariscy

Introduction

In this exercise, I will analyze the Tidy Tuesday data from the week of April 11, 2023. The data comes from the Humane League’s US Egg Production dataset, which is based on USDA reports of cage-free egg supply from 2007 to 2021.

Load packages

library(here)

here() starts at /Users/leahlariscy/Desktop/MADA2023/leahlariscy-MADA-portfolio

library(skimr)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(rsample) #Data split
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.3     ✔ recipes      1.0.5
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.4     ✔ yardstick    1.1.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org

library(rpart) #Model fit


Attaching package: 'rpart'

The following object is masked from 'package:dials':

    prune

library(ranger) #Model fit
library(glmnet) #Model fit

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-7

library(rpart.plot)  #viz of decision tree
library(vip) #viz of variable importance plots


Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi

library(ggpmisc) #for adding linear regression to plots

Loading required package: ggpp

Attaching package: 'ggpp'

The following object is masked from 'package:ggplot2':

    annotate

Load the data

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: 
#install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2023-04-11')

--- Compiling #TidyTuesday Information for 2023-04-11 ----

--- There are 2 files available ---

--- Starting Download ---


    Downloading file 1 of 2: `egg-production.csv`
    Downloading file 2 of 2: `cage-free-percentages.csv`

--- Download complete ---

eggproduction <- tuesdata$`egg-production`
cagefreepercentages <- tuesdata$`cage-free-percentages`

Data exploration

#Look at the data using skim
skim(eggproduction)

Data summary
Name	eggproduction
Number of rows	220
Number of columns	6
_______________________
Column type frequency:
character	3
Date	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
prod_type	1	10	13	2
prod_process	1	3	23	3
source	1	23	23	108

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
observed_month	0	1	2016-07-31	2021-02-28	2018-11-15	56

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
n_hens	0	1	110839873	124121204	13500000	17284500	59939500	125539250	341166000	▇▁▁▁▂
n_eggs	0	1	2606667580	3082457619	298074240	423962023	1154550000	2963010996	8601000000	▇▁▁▁▂

skim(cagefreepercentages)

Data summary
Name	cagefreepercentages
Number of rows	96
Number of columns	4
_______________________
Column type frequency:
character	1
Date	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
source	0	1	8	35	0	31	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
observed_month	0	1	2007-12-31	2021-02-28	2018-11-15	91

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
percent_hens	0	1.00	17.95	6.58	3.20	13.46	17.30	23.46	29.20	▂▅▇▆▆
percent_eggs	42	0.56	17.10	4.29	9.56	14.52	16.23	19.46	24.55	▆▇▇▆▇

#Visualize the data to get a better idea of what we are working with
#Plot number of eggs over time
eggproduction %>% ggplot(aes(observed_month, log10(n_eggs))) +
  geom_point()

We clearly need to remove some variables in order to correctly visualize this time series.

Data cleaning

Remove unnecessary variables and observations

#For egg production data, keep all variables but source
#Then remove observations that do not contain "all" in production process
#Then remove observations that do not contain "table egg" in production type
eggprod_clean <- eggproduction %>% 
  select(!source) %>% 
  filter(prod_process == "all", prod_type == "table eggs")

#For cage free percentage data, keep all variables but source
#Since there are so many NAs in the percent eggs variable, I will also remove that variable since it is not going to be useful in analysis
#Then remove data prior to 2016 since that is as far back as egg production data goes
cagefree_clean <- cagefreepercentages %>% 
  select(!c(source, percent_eggs)) %>% 
  filter(observed_month >= "2016-04-30")

Look at the data again

eggprod_clean %>% ggplot(aes(observed_month, log10(n_eggs))) +
  geom_point() +
  geom_line()

Now that we have cleaned the data by:

Removing unnecessary variables
Subsetting the production product to “all”
Subsetting the production type to “table eggs”

We can see a distinct upward trend in the production of table eggs since 2017. We can also see that there are certain months that the egg production drops, like in the early months of each year. Egg production rates also dropped significantly at the start of the COVID-19 pandemic (summer 2020).

cagefree_clean %>% ggplot(aes(observed_month, percent_hens)) +
  geom_point() +
  geom_line()

There is a distinct upward trend in the percentage of cage free hens, rising from 10% to 30% from 2017 to 2021