Tidy Tuesday Exercise

Author

Leah Lariscy

Introduction

In this exercise, I will analyze the Tidy Tuesday data from the week of April 11, 2023. The data comes from the Humane League’s US Egg Production dataset, which is based on USDA reports of cage-free egg supply from 2007 to 2021.

Load packages

library(here)
here() starts at /Users/leahlariscy/Desktop/MADA2023/leahlariscy-MADA-portfolio
library(skimr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(rsample) #Data split
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.3     ✔ recipes      1.0.5
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.4     ✔ yardstick    1.1.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
library(rpart) #Model fit

Attaching package: 'rpart'

The following object is masked from 'package:dials':

    prune
library(ranger) #Model fit
library(glmnet) #Model fit
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-7
library(rpart.plot)  #viz of decision tree
library(vip) #viz of variable importance plots

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi
library(ggpmisc) #for adding linear regression to plots
Loading required package: ggpp

Attaching package: 'ggpp'

The following object is masked from 'package:ggplot2':

    annotate

Load the data

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: 
#install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2023-04-11')
--- Compiling #TidyTuesday Information for 2023-04-11 ----
--- There are 2 files available ---
--- Starting Download ---

    Downloading file 1 of 2: `egg-production.csv`
    Downloading file 2 of 2: `cage-free-percentages.csv`
--- Download complete ---
eggproduction <- tuesdata$`egg-production`
cagefreepercentages <- tuesdata$`cage-free-percentages`

Data exploration

#Look at the data using skim
skim(eggproduction)
Data summary
Name eggproduction
Number of rows 220
Number of columns 6
_______________________
Column type frequency:
character 3
Date 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
prod_type 0 1 10 13 0 2 0
prod_process 0 1 3 23 0 3 0
source 0 1 23 23 0 108 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
observed_month 0 1 2016-07-31 2021-02-28 2018-11-15 56

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
n_hens 0 1 110839873 124121204 13500000 17284500 59939500 125539250 341166000 ▇▁▁▁▂
n_eggs 0 1 2606667580 3082457619 298074240 423962023 1154550000 2963010996 8601000000 ▇▁▁▁▂
skim(cagefreepercentages)
Data summary
Name cagefreepercentages
Number of rows 96
Number of columns 4
_______________________
Column type frequency:
character 1
Date 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 8 35 0 31 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
observed_month 0 1 2007-12-31 2021-02-28 2018-11-15 91

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
percent_hens 0 1.00 17.95 6.58 3.20 13.46 17.30 23.46 29.20 ▂▅▇▆▆
percent_eggs 42 0.56 17.10 4.29 9.56 14.52 16.23 19.46 24.55 ▆▇▇▆▇
#Visualize the data to get a better idea of what we are working with
#Plot number of eggs over time
eggproduction %>% ggplot(aes(observed_month, log10(n_eggs))) +
  geom_point() 

We clearly need to remove some variables in order to correctly visualize this time series.

Data cleaning

Remove unnecessary variables and observations

#For egg production data, keep all variables but source
#Then remove observations that do not contain "all" in production process
#Then remove observations that do not contain "table egg" in production type
eggprod_clean <- eggproduction %>% 
  select(!source) %>% 
  filter(prod_process == "all", prod_type == "table eggs")

#For cage free percentage data, keep all variables but source
#Since there are so many NAs in the percent eggs variable, I will also remove that variable since it is not going to be useful in analysis
#Then remove data prior to 2016 since that is as far back as egg production data goes
cagefree_clean <- cagefreepercentages %>% 
  select(!c(source, percent_eggs)) %>% 
  filter(observed_month >= "2016-04-30")

Look at the data again

eggprod_clean %>% ggplot(aes(observed_month, log10(n_eggs))) +
  geom_point() +
  geom_line()

Now that we have cleaned the data by:

  • Removing unnecessary variables

  • Subsetting the production product to “all”

  • Subsetting the production type to “table eggs”

We can see a distinct upward trend in the production of table eggs since 2017. We can also see that there are certain months that the egg production drops, like in the early months of each year. Egg production rates also dropped significantly at the start of the COVID-19 pandemic (summer 2020).

cagefree_clean %>% ggplot(aes(observed_month, percent_hens)) +
  geom_point() +
  geom_line()

There is a distinct upward trend in the percentage of cage free hens, rising from 10% to 30% from 2017 to 2021