Data Analysis Exercise

Author

Leah Lariscy

Load libraries here

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)

Data description

This dataset was accessed from the Bacteria, Enterics, Amoeba, and Mycotics (BEAM) Dashboard, which houses data collected by the System for Enteric Disease Response, Investigation, and Coordination (SEDRIC). Each row reports pathogens isolated from infected individuals, namely Salmonella, E. coli, Shigella, and Campylobacter bacteria.

You can access the BEAM Dashboard here.

beam_raw <- read_csv("data_analysis_exercise/data/raw_data/BEAM_Dashboard_Report_Data.csv")

Rows: 128342 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): State, Source, Pathogen, Serotype/Species
dbl (6): Year, Month, Number_of_isolates, Outbreak_associated_isolates, New_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data cleaning

Begin by combining the Year and Month variables into one Date variable, then select for variables of interest and rename them. Lastly, change all NAs to 0s where necessary.

beam_clean <- beam_raw %>%
  # Combine Year and Month into a single Date (first day of the month)
  mutate(date = make_date(year = Year, month = Month, day = 1)) %>%
  # Keep the variables of interest and rename them
  select(date, state = State, source = Source, pathogen = Pathogen,
         species = `Serotype/Species`, n_isolates = Number_of_isolates,
         n_outbreak_associated = Outbreak_associated_isolates) %>%
  # Replace missing counts with 0
  mutate(n_isolates = ifelse(is.na(n_isolates), 0, n_isolates),
         n_outbreak_associated = ifelse(is.na(n_outbreak_associated), 0, n_outbreak_associated))

print(beam_clean)

# A tibble: 128,342 × 7
   date       state source pathogen    species                   n_iso…¹ n_out…²
   <date>     <chr> <chr>  <chr>       <chr>                       <dbl>   <dbl>
 1 2017-09-01 TX    Stool  Escherichia Shigella Flexneri Seroty…       1       0
 2 2017-09-01 TX    Stool  Escherichia sonnei                         14       0
 3 2017-09-01 TX    Stool  Salmonella  Agbeni                          1       0
 4 2017-09-01 TX    Stool  Salmonella  Agona                           5       0
 5 2017-09-01 TX    Stool  Salmonella  Altona                          1       0
 6 2017-09-01 TX    Stool  Salmonella  Anatum                          2       2
 7 2017-09-01 TX    Stool  Salmonella  Anatum, Newington, Minne…       2       0
 8 2017-09-01 TX    Stool  Salmonella  Baildon                         1       0
 9 2017-09-01 TX    Stool  Salmonella  Bareilly                        2       0
10 2017-09-01 TX    Stool  Salmonella  Berta                           2       0
# … with 128,332 more rows, and abbreviated variable names ¹n_isolates,
#   ²n_outbreak_associated
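The same cleaning steps can also be sketched with `tidyr::replace_na()` and `dplyr::across()`, which fill several columns in one call. This is only an illustration on a made-up table: `toy_raw` and `toy_clean` are hypothetical names, and the values are not BEAM data.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical stand-in for beam_raw (made-up values, not BEAM data)
toy_raw <- tibble(
  Year = c(2017, 2017, 2018),
  Month = c(9, 10, 1),
  Number_of_isolates = c(1, NA, 5),
  Outbreak_associated_isolates = c(NA, 2, NA)
)

toy_clean <- toy_raw %>%
  # Build a Date on the first day of each month
  mutate(date = make_date(year = Year, month = Month, day = 1)) %>%
  select(date,
         n_isolates = Number_of_isolates,
         n_outbreak_associated = Outbreak_associated_isolates) %>%
  # Replace NA with 0 in both count columns at once
  mutate(across(c(n_isolates, n_outbreak_associated), ~ replace_na(.x, 0)))

print(toy_clean)
```

Whether an NA really means zero isolates is an assumption about how the dashboard reports data, so it is worth checking before filling.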

Save cleaned data to RDS file

saveRDS(beam_clean, "data_analysis_exercise/data/raw_data/BEAM_Report_clean.RDS")

Read in RDS file

#read_rds("data_analysis_exercise/data/raw_data/BEAM_Report_clean.RDS")

Exploratory analysis

I would be interested in grouping by pathogen and exploring trends across states. I’d also like to see which pathogens are more likely to be associated with outbreaks.

This section was added by Kimberly Perez

Reading in RDS

Here I will use the readRDS() function to load Leah’s cleaned BEAM data.

#Utilize readRDS for this task
newbc<-readRDS("data_analysis_exercise/data/raw_data/BEAM_Report_clean.RDS")
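As a quick check of how `saveRDS()` and `readRDS()` work together, the sketch below writes a small made-up data frame to a temporary file and reads it back. The names `toy`, `tmp_path`, and `toy_back` are hypothetical, and the temporary file stands in for the project path used above.

```r
# Hypothetical toy data frame (not BEAM data)
toy <- data.frame(state = c("GA", "CA"), n_isolates = c(3, 0))

# Write to a temporary file instead of the project folder
tmp_path <- tempfile(fileext = ".RDS")
saveRDS(toy, tmp_path)

# Read the object back; column types and values are preserved
toy_back <- readRDS(tmp_path)
str(toy_back)
```

Unlike a CSV, an RDS file keeps column types such as Date, so the cleaned data does not need to be re-parsed when it is loaded.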

A bit more data wrangling

I will now select four states to analyze: Georgia, California, North Dakota, and Oregon.

#Creating a new dataframe with four selected states
bc <- newbc %>%
  filter(state %in% c("GA", "CA", "ND", "OR"))

Selecting and condensing data for visualization!

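Next, starting from the four-state data, I will count the records for each pathogen in each state and month, then remove the “n_isolates”, “species”, “source”, and “n_outbreak_associated” columns. I will then produce a visualization of these counts for each pathogen by state throughout 2017-2022. Note that each count is the number of rows reported for a pathogen in a given state and month (one row per serotype and source), so it is a proxy for case counts rather than a total of isolates. Hopefully, this will provide us with some basic trend information.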

#Creating a new dataframe, counting rows per date, pathogen, and state, and removing n_isolates, species, source, and n_outbreak_associated columns
bc_path<- bc %>% 
  group_by(date, pathogen, state) %>% 
  mutate(count=n()) %>% 
  ungroup() %>%  
  select(-c(n_isolates, species, source, n_outbreak_associated))

#Collapsing duplicate rows (the mean of identical counts returns the count itself) to get one count per pathogen, state, and month (2017-2022)
final_bcc<- bc_path %>% 
  group_by(date,state,pathogen) %>% 
  summarize_if(is.numeric, mean) %>% 
  ungroup()
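Because `n()` counts rows while `n_isolates` holds the number of isolates in each row, the two measures can differ. The sketch below compares them on a made-up table; `toy_bc` and `toy_counts` are hypothetical names, and the values are not BEAM data.

```r
library(dplyr)

# Hypothetical rows for one state and month (made-up values)
toy_bc <- tibble(
  date = as.Date(c("2017-09-01", "2017-09-01", "2017-09-01")),
  state = c("GA", "GA", "GA"),
  pathogen = c("Salmonella", "Salmonella", "Campylobacter"),
  n_isolates = c(1, 5, 20)
)

toy_counts <- toy_bc %>%
  group_by(date, state, pathogen) %>%
  summarize(n_records = n(),                   # number of rows (serotype/source records)
            total_isolates = sum(n_isolates),  # number of isolates across those rows
            .groups = "drop")

print(toy_counts)
```

In this toy example Salmonella has more records but fewer isolates than Campylobacter, which is why the choice of measure can change how a trend plot reads.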

#Graphing
final_bcc %>% 
  ggplot() +geom_line(
  aes(x = date,
      y = count,
      color = pathogen,
      linetype = state)) +
  theme_bw() +
  labs(x = "Year",
       y = "Case Count",
       color= "Pathogen",
       linetype="State",
       title = "Pathogen trends from 2017-2022") +
  theme(plot.title = element_text(hjust = 0.6))

From the graph rendered above, it seems as though Salmonella accounts for the largest share of reported records in most of the four states.

Visualizing Outbreak by State and Pathogen (2017-2022) in two ways…

theme_set(theme_bw())  

#Visualizing the data...
bc1 <- ggplot(bc, aes(state, n_outbreak_associated)) + 
  labs(title = "Outbreak-Associated Isolates by Pathogen and State (2017-2022)",
       y = "Outbreak-Associated Isolates",
       x = "State")

bc1 + geom_jitter(aes(col = pathogen, size = n_outbreak_associated)) + 
  geom_smooth(aes(col = pathogen), method = "lm", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

#Another way to visualize the data...
bc %>% 
  ggplot() +geom_line(
  aes(x = date,
      y = n_outbreak_associated,
      color = pathogen,
      linetype = state)) +
  theme_bw() +
  labs(x = "Year",
       y = "Outbreak-Associated Isolates",
       color = "Pathogen",
       linetype = "State",
       title = "Outbreak-Associated Isolates by Pathogen and State (2017-2022)") +
  theme(plot.title = element_text(hjust = 0.6))

Salmonella and E. coli look to be the pathogens with the largest number of outbreak-associated isolates among these four states.
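One way to put a number on which pathogens are more likely to be associated with outbreaks is the share of isolates flagged as outbreak-associated. The sketch below shows the calculation on a made-up table; `toy_bc`, `outbreak_share`, and `share_outbreak` are hypothetical names, and the values are not BEAM data.

```r
library(dplyr)

# Hypothetical rows (made-up values, not BEAM data)
toy_bc <- tibble(
  pathogen = c("Salmonella", "Salmonella", "Escherichia", "Campylobacter"),
  n_isolates = c(10, 30, 20, 50),
  n_outbreak_associated = c(2, 6, 5, 0)
)

outbreak_share <- toy_bc %>%
  group_by(pathogen) %>%
  summarize(total_isolates = sum(n_isolates),
            outbreak_isolates = sum(n_outbreak_associated),
            .groups = "drop") %>%
  # Fraction of each pathogen's isolates linked to an outbreak
  mutate(share_outbreak = outbreak_isolates / total_isolates) %>%
  arrange(desc(share_outbreak))

print(outbreak_share)
```

A share corrects for the fact that a pathogen with many isolates overall will tend to show more outbreak-associated isolates in absolute terms.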

Pathogen Source by State

Have you ever wondered what laboratories isolate pathogens from? Let’s explore the possibilities below!

#Plotting 
ggplot(bc, aes(state, pathogen, colour = source)) +
  geom_count(show.legend = TRUE) +
  labs(y = "Pathogen", 
       x = "State", 
       title = "Pathogen Source by State (2017-2022)")

From the visualization above, it looks like stool is the most common sample type from which labs isolate pathogens. However, there may be variation by state!
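To check the variation by state in numbers rather than by eye, the isolates from each source can be tabulated as a share of each state's total. This is a sketch on a made-up table; `toy_bc`, `source_share`, `isolates`, and `share` are hypothetical names, and the values are not BEAM data.

```r
library(dplyr)

# Hypothetical rows (made-up values, not BEAM data)
toy_bc <- tibble(
  state = c("GA", "GA", "GA", "CA", "CA"),
  source = c("Stool", "Stool", "Blood", "Stool", "Urine"),
  n_isolates = c(8, 2, 10, 9, 1)
)

source_share <- toy_bc %>%
  # Sum isolates per state and source (wt weights each row by n_isolates)
  count(state, source, wt = n_isolates, name = "isolates") %>%
  group_by(state) %>%
  # Share of each state's isolates that came from this source
  mutate(share = isolates / sum(isolates)) %>%
  ungroup()

print(source_share)
```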

Potentially helpful resource

This website seems to be a helpful resource for data visualization. I found some interesting visualizations and hope to practice utilizing several.