In this exercise, I will analyze the Tidy Tuesday data from the week of April 11, 2023. The data comes from the Humane League’s US Egg Production dataset, which is based on USDA reports of cage-free egg supply from 2007 to 2021.
Load packages
library(here)
here() starts at /Users/leahlariscy/Desktop/MADA2023/leahlariscy-MADA-portfolio
library(tidyverse)
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'rpart'
The following object is masked from 'package:dials':
prune
library(ranger) #Model fit
library(glmnet) #Model fit
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loaded glmnet 4.1-7
library(rpart.plot) #viz of decision tree
library(vip) #viz of variable importance plots
Attaching package: 'vip'
The following object is masked from 'package:utils':
vi
library(ggpmisc) #for adding linear regression to plots
Loading required package: ggpp
Attaching package: 'ggpp'
The following object is masked from 'package:ggplot2':
annotate
Load the data
# Get the Data
# Read in with tidytuesdayR package
# Install from CRAN via:
#install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest
# Either ISO-8601 date or year/week works!
tuesdata <- tidytuesdayR::tt_load('2023-04-11')
--- Compiling #TidyTuesday Information for 2023-04-11 ----
--- There are 2 files available ---
--- Starting Download ---
Downloading file 1 of 2: `egg-production.csv`
Downloading file 2 of 2: `cage-free-percentages.csv`
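The code further down refers to `eggproduction` and `cagefreepercentages`, but the assignments that create them do not appear above. A minimal sketch of that step, assuming each file is pulled out of `tuesdata` by its file name without the `.csv` extension, might look like this. The toy `tuesdata` list is a stand-in that is only used when the real download is not in the session, so the snippet can run on its own; its values are invented.

```r
# Stand-in for the tt_load() result, created only if `tuesdata` does not exist.
# These values are invented for illustration and are NOT the real data.
if (!exists("tuesdata")) {
  tuesdata <- list(
    `egg-production` = data.frame(
      observed_month = as.Date(c("2016-07-31", "2016-08-31")),
      n_eggs = c(100, 110)
    ),
    `cage-free-percentages` = data.frame(
      observed_month = as.Date(c("2016-07-31", "2016-08-31")),
      percent_hens = c(10, 11)
    )
  )
}

# Pull each dataset out by name; backticks are needed because of the hyphens.
eggproduction <- tuesdata$`egg-production`
cagefreepercentages <- tuesdata$`cage-free-percentages`
```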
#Visualize the data to get a better idea of what we are working with
#Plot number of eggs over time
eggproduction %>%
  ggplot(aes(observed_month, log10(n_eggs))) +
  geom_point()
The plot mixes several production types and production processes, so we clearly need to remove some variables and observations before this time series can be visualized correctly.
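One way to see why the raw plot is hard to read is to count how many observations fall into each combination of production type and production process. The sketch below uses a small invented data frame in place of `eggproduction`; the column names follow the code in this post, and the category labels are assumptions about what the real data contains.

```r
library(dplyr)

# Toy rows standing in for `eggproduction`; all values are invented.
toy_eggs <- data.frame(
  observed_month = rep(as.Date(c("2016-07-31", "2016-08-31")), each = 3),
  prod_type      = rep(c("hatching eggs", "table eggs", "table eggs"), times = 2),
  prod_process   = rep(c("all", "all", "cage-free (organic)"), times = 2),
  n_eggs         = c(1.0e9, 7.0e9, 4.0e8, 1.1e9, 7.1e9, 4.1e8)
)

# One row per type/process combination, with the number of observations in each.
combos <- toy_eggs %>%
  count(prod_type, prod_process)

combos
```

Each combination is a separate series, so plotting them together on one pair of axes stacks several time series on top of each other.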
Data cleaning
Remove unnecessary variables and observations
#For egg production data, keep all variables but source
#Then remove observations that do not contain "all" in production process
#Then remove observations that do not contain "table eggs" in production type
eggprod_clean <- eggproduction %>%
  select(!source) %>%
  filter(prod_process == "all", prod_type == "table eggs")

#For cage free percentage data, keep all variables but source
#Since there are so many NAs in the percent eggs variable, I will also remove that variable since it is not going to be useful in analysis
#Then remove data prior to 2016 since that is as far back as egg production data goes
cagefree_clean <- cagefreepercentages %>%
  select(!c(source, percent_eggs)) %>%
  filter(observed_month >= "2016-04-30")
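The last filter compares `observed_month` with a character string. Assuming that column is stored as a `Date`, R converts the string for the comparison, but spelling out the conversion makes the intent clearer. A small sketch with invented dates:

```r
# Invented example dates; the real column is assumed to be of class Date.
dates <- as.Date(c("2015-12-31", "2016-04-30", "2016-05-31"))

# Convert the cutoff once, explicitly, instead of relying on coercion.
cutoff <- as.Date("2016-04-30")

# Keep only dates on or after the cutoff.
kept <- dates[dates >= cutoff]
kept
```

Inside the pipeline above, the same idea would read `filter(observed_month >= as.Date("2016-04-30"))`.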
We can see a distinct upward trend in the production of table eggs since 2017. We can also see that egg production drops in certain months, such as the early months of each year. Egg production also dropped sharply at the start of the COVID-19 pandemic (summer 2020).
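The plot behind these observations is not reproduced here. As a rough sketch, a time series like it could be drawn from the cleaned data as follows. The toy data frame stands in for `eggprod_clean` and its values are invented; only the column names `observed_month` and `n_eggs` come from the code above, and the axis labels are my own.

```r
library(ggplot2)

# Toy stand-in for `eggprod_clean`; values are invented for illustration.
toy_clean <- data.frame(
  observed_month = seq(as.Date("2016-08-01"), by = "month", length.out = 6) - 1,
  n_eggs = c(7.0e9, 7.1e9, 6.9e9, 7.2e9, 7.3e9, 7.1e9)
)

# A line plus points makes month-to-month dips easier to spot than points alone.
egg_trend_plot <- ggplot(toy_clean, aes(observed_month, n_eggs)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Table eggs produced")

egg_trend_plot
```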