Flu Data Wrangling

Author

Leah Lariscy

Wrangling time.

Load libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
here() starts at /Users/leahlariscy/Desktop/MADA2023/leahlariscy-MADA-portfolio

Load data

symptoms <- readRDS(here("fluanalysis/data/raw_data/SympAct_Any_Pos.Rda"))

Look at data

glimpse(symptoms)
Rows: 735
Columns: 63
$ DxName1           <fct> "Influenza like illness - Clinical Dx", "Acute tonsi…
$ DxName2           <fct> NA, "Influenza like illness - Clinical Dx", "Acute p…
$ DxName3           <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Fever, unspecified"…
$ DxName4           <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Other fatigue", NA,…
$ DxName5           <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Headache", NA, NA, …
$ Unique.Visit      <chr> "340_17632125", "340_17794836", "342_17737773", "342…
$ ActivityLevel     <int> 10, 6, 2, 2, 5, 3, 4, 0, 0, 5, 9, 1, 3, 6, 5, 2, 2, …
$ ActivityLevelF    <fct> 10, 6, 2, 2, 5, 3, 4, 0, 0, 5, 9, 1, 3, 6, 5, 2, 2, …
$ SwollenLymphNodes <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Y…
$ ChestCongestion   <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y…
$ ChillsSweats      <fct> No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, …
$ NasalCongestion   <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y…
$ CoughYN           <fct> Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No, …
$ Sneeze            <fct> No, No, Yes, Yes, No, Yes, No, Yes, No, No, No, No, …
$ Fatigue           <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
$ SubjectiveFever   <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes…
$ Headache          <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes…
$ Weakness          <fct> Mild, Severe, Severe, Severe, Moderate, Moderate, Mi…
$ WeaknessYN        <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
$ CoughIntensity    <fct> Severe, Severe, Mild, Moderate, None, Moderate, Seve…
$ CoughYN2          <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes…
$ Myalgia           <fct> Mild, Severe, Severe, Severe, Mild, Moderate, Mild, …
$ MyalgiaYN         <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
$ RunnyNose         <fct> No, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No…
$ AbPain            <fct> No, No, Yes, No, No, No, No, No, No, No, Yes, Yes, N…
$ ChestPain         <fct> No, No, Yes, No, No, Yes, Yes, No, No, No, No, Yes, …
$ Diarrhea          <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, No,…
$ EyePn             <fct> No, No, No, No, Yes, No, No, No, No, No, Yes, No, Ye…
$ Insomnia          <fct> No, No, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, Y…
$ ItchyEye          <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes,…
$ Nausea            <fct> No, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Y…
$ EarPn             <fct> No, Yes, No, Yes, No, No, No, No, No, No, No, Yes, Y…
$ Hearing           <fct> No, Yes, No, No, No, No, No, No, No, No, No, No, No,…
$ Pharyngitis       <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, …
$ Breathless        <fct> No, No, Yes, No, No, Yes, No, No, No, Yes, No, Yes, …
$ ToothPn           <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, N…
$ Vision            <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
$ Vomit             <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, Yes, N…
$ Wheeze            <fct> No, No, No, Yes, No, Yes, No, No, No, No, No, Yes, N…
$ BodyTemp          <dbl> 98.3, 100.4, 100.8, 98.8, 100.5, 98.4, 102.5, 98.4, …
$ RapidFluA         <fct> Presumptive Negative For Influenza A, NA, Presumptiv…
$ RapidFluB         <fct> Presumptive Negative For Influenza B, NA, Presumptiv…
$ PCRFluA           <fct> NA, NA, NA, NA, NA, NA,  Influenza A Not Detected, N…
$ PCRFluB           <fct> NA, NA, NA, NA, NA, NA,  Influenza B Not Detected, N…
$ TransScore1       <dbl> 1, 3, 4, 5, 0, 2, 2, 5, 4, 4, 2, 3, 2, 5, 3, 5, 1, 5…
$ TransScore1F      <fct> 1, 3, 4, 5, 0, 2, 2, 5, 4, 4, 2, 3, 2, 5, 3, 5, 1, 5…
$ TransScore2       <dbl> 1, 2, 3, 4, 0, 2, 2, 4, 3, 3, 1, 2, 2, 4, 2, 4, 1, 4…
$ TransScore2F      <fct> 1, 2, 3, 4, 0, 2, 2, 4, 3, 3, 1, 2, 2, 4, 2, 4, 1, 4…
$ TransScore3       <dbl> 1, 1, 2, 3, 0, 2, 2, 3, 2, 2, 0, 1, 1, 3, 1, 3, 1, 3…
$ TransScore3F      <fct> 1, 1, 2, 3, 0, 2, 2, 3, 2, 2, 0, 1, 1, 3, 1, 3, 1, 3…
$ TransScore4       <dbl> 0, 2, 4, 4, 0, 1, 1, 4, 3, 3, 2, 2, 2, 4, 3, 4, 0, 4…
$ TransScore4F      <fct> 0, 2, 4, 4, 0, 1, 1, 4, 3, 3, 2, 2, 2, 4, 3, 4, 0, 4…
$ ImpactScore       <int> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9…
$ ImpactScore2      <int> 6, 7, 13, 11, 10, 11, 7, 6, 9, 6, 12, 16, 10, 12, 8,…
$ ImpactScore3      <int> 3, 4, 9, 7, 6, 7, 3, 3, 6, 4, 7, 11, 6, 8, 4, 4, 5, …
$ ImpactScoreF      <fct> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9…
$ ImpactScore2F     <fct> 6, 7, 13, 11, 10, 11, 7, 6, 9, 6, 12, 16, 10, 12, 8,…
$ ImpactScore3F     <fct> 3, 4, 9, 7, 6, 7, 3, 3, 6, 4, 7, 11, 6, 8, 4, 4, 5, …
$ ImpactScoreFD     <fct> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9…
$ TotalSymp1        <dbl> 8, 11, 18, 17, 11, 14, 10, 12, 14, 11, 15, 20, 13, 1…
$ TotalSymp1F       <fct> 8, 11, 18, 17, 11, 14, 10, 12, 14, 11, 15, 20, 13, 1…
$ TotalSymp2        <dbl> 8, 10, 17, 16, 11, 14, 10, 11, 13, 10, 14, 19, 13, 1…
$ TotalSymp3        <dbl> 8, 9, 16, 15, 11, 14, 10, 10, 12, 9, 13, 18, 12, 16,…
skimr::skim(symptoms)
Data summary
Name symptoms
Number of rows 735
Number of columns 63
_______________________
Column type frequency:
character 1
factor 50
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Unique.Visit 0 1 10 12 0 735 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
DxName1 0 1.00 FALSE 25 Inf: 328, Inf: 131, Fev: 101, Cou: 66
DxName2 280 0.62 FALSE 42 Inf: 126, Inf: 115, Fev: 45, Cou: 41
DxName3 626 0.15 FALSE 37 Inf: 23, Inf: 14, Cou: 10, Fev: 6
DxName4 716 0.03 FALSE 14 Inf: 3, Acu: 2, Enc: 2, Inf: 2
DxName5 734 0.00 FALSE 1 Hea: 1, Acu: 0, Enc: 0, Oth: 0
ActivityLevelF 0 1.00 FALSE 11 3: 125, 5: 97, 4: 95, 2: 80
SwollenLymphNodes 0 1.00 FALSE 2 No: 421, Yes: 314
ChestCongestion 0 1.00 FALSE 2 Yes: 409, No: 326
ChillsSweats 0 1.00 FALSE 2 Yes: 604, No: 131
NasalCongestion 0 1.00 FALSE 2 Yes: 565, No: 170
CoughYN 0 1.00 FALSE 2 Yes: 660, No: 75
Sneeze 0 1.00 FALSE 2 Yes: 395, No: 340
Fatigue 0 1.00 FALSE 2 Yes: 671, No: 64
SubjectiveFever 0 1.00 FALSE 2 Yes: 505, No: 230
Headache 0 1.00 FALSE 2 Yes: 620, No: 115
Weakness 0 1.00 FALSE 4 Mod: 341, Mil: 224, Sev: 121, Non: 49
WeaknessYN 0 1.00 FALSE 2 Yes: 686, No: 49
CoughIntensity 0 1.00 FALSE 4 Mod: 360, Sev: 172, Mil: 156, Non: 47
CoughYN2 0 1.00 FALSE 2 Yes: 688, No: 47
Myalgia 0 1.00 FALSE 4 Mod: 327, Mil: 214, Sev: 115, Non: 79
MyalgiaYN 0 1.00 FALSE 2 Yes: 656, No: 79
RunnyNose 0 1.00 FALSE 2 Yes: 524, No: 211
AbPain 0 1.00 FALSE 2 No: 642, Yes: 93
ChestPain 0 1.00 FALSE 2 No: 501, Yes: 234
Diarrhea 0 1.00 FALSE 2 No: 636, Yes: 99
EyePn 0 1.00 FALSE 2 No: 622, Yes: 113
Insomnia 0 1.00 FALSE 2 Yes: 419, No: 316
ItchyEye 0 1.00 FALSE 2 No: 553, Yes: 182
Nausea 0 1.00 FALSE 2 No: 477, Yes: 258
EarPn 0 1.00 FALSE 2 No: 573, Yes: 162
Hearing 0 1.00 FALSE 2 No: 705, Yes: 30
Pharyngitis 0 1.00 FALSE 2 Yes: 614, No: 121
Breathless 0 1.00 FALSE 2 No: 438, Yes: 297
ToothPn 0 1.00 FALSE 2 No: 569, Yes: 166
Vision 0 1.00 FALSE 2 No: 716, Yes: 19
Vomit 0 1.00 FALSE 2 No: 656, Yes: 79
Wheeze 0 1.00 FALSE 2 No: 514, Yes: 221
RapidFluA 407 0.45 FALSE 2 Pos: 169, Pre: 159
RapidFluB 407 0.45 FALSE 2 Pre: 302, Pos: 26
PCRFluA 581 0.21 FALSE 3 In: 120, In: 33, Ind: 1, Ass: 0
PCRFluB 581 0.21 FALSE 2 In: 145, In: 9, Ass: 0
TransScore1F 0 1.00 FALSE 6 4: 210, 5: 195, 3: 157, 2: 107
TransScore2F 0 1.00 FALSE 5 4: 294, 3: 201, 2: 138, 1: 89
TransScore3F 0 1.00 FALSE 4 3: 323, 2: 222, 1: 166, 0: 24
TransScore4F 0 1.00 FALSE 5 3: 230, 4: 198, 2: 154, 1: 103
ImpactScoreF 0 1.00 FALSE 17 8: 105, 9: 104, 10: 88, 7: 84
ImpactScore2F 0 1.00 FALSE 16 7: 107, 8: 102, 9: 90, 10: 86
ImpactScore3F 0 1.00 FALSE 14 4: 134, 5: 112, 3: 108, 6: 102
ImpactScoreFD 0 1.00 FALSE 17 8: 105, 9: 104, 10: 88, 7: 84
TotalSymp1F 0 1.00 FALSE 19 12: 86, 13: 84, 14: 80, 11: 72

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ActivityLevel 0 1.00 4.46 2.64 0.0 3.0 4.0 6.0 10.0 ▆▇▆▅▂
BodyTemp 5 0.99 98.94 1.20 97.2 98.2 98.5 99.3 103.1 ▇▇▂▁▁
TransScore1 0 1.00 3.47 1.31 0.0 3.0 4.0 5.0 5.0 ▂▅▆▇▇
TransScore2 0 1.00 2.92 1.11 0.0 2.0 3.0 4.0 4.0 ▁▂▃▆▇
TransScore3 0 1.00 2.15 0.88 0.0 1.0 2.0 3.0 3.0 ▁▅▁▆▇
TransScore4 0 1.00 2.58 1.21 0.0 2.0 3.0 4.0 4.0 ▂▃▆▇▇
ImpactScore 0 1.00 9.51 2.84 2.0 8.0 9.0 11.0 18.0 ▂▇▇▅▁
ImpactScore2 0 1.00 8.58 2.78 2.0 7.0 8.0 10.0 17.0 ▂▇▆▃▁
ImpactScore3 0 1.00 5.06 2.34 0.0 3.0 5.0 7.0 13.0 ▂▇▃▂▁
TotalSymp1 0 1.00 12.99 3.41 5.0 11.0 13.0 15.0 23.0 ▂▇▇▅▁
TotalSymp2 0 1.00 12.43 3.22 4.0 10.0 12.0 15.0 22.0 ▁▇▇▅▁
TotalSymp3 0 1.00 11.66 3.10 3.0 10.0 12.0 14.0 21.0 ▁▇▇▅▁

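After viewing the data, I see there are 63 variables and 735 observations. Most variables are coded as factors (50), 12 are numeric (integers and doubles), and 1 is a character variable. Several variables have a lot of missing values: DxName2 through DxName5, the rapid flu tests, and the PCR results are each missing for more than a third of observations, while BodyTemp is missing for only 5.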

Remove unnecessary variables

symptoms <- symptoms %>% select(9:40) %>% select(!contains("YN", ignore.case = FALSE))

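Since the variables of interest were all consecutive (SwollenLymphNodes through BodyTemp, columns 9 to 40), I was able to select them by their range of column numbers.

I then kept only the variables whose names do not contain “YN”. This removes CoughYN, CoughYN2, WeaknessYN, and MyalgiaYN, which duplicate the information already in the severity variables.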
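Selecting by position works here, but it depends on the column order never changing. As a rough sketch of an alternative, the same range can be selected by name. The `toy` data frame below is a made-up stand-in with a few columns named like the real ones; it is not the actual data.

```r
library(dplyr)

# Hypothetical stand-in for the raw data: a few columns named like the real ones
toy <- data.frame(
  ActivityLevel     = c(10L, 6L, 2L),
  SwollenLymphNodes = factor(c("Yes", "No", "Yes")),
  CoughYN           = factor(c("Yes", "Yes", "No")),
  CoughIntensity    = factor(c("Severe", "Mild", "None")),
  BodyTemp          = c(98.3, 100.4, 100.8),
  TransScore1       = c(1, 3, 4)
)

toy_selected <- toy %>%
  # Name-based range: every column from SwollenLymphNodes through BodyTemp
  select(SwollenLymphNodes:BodyTemp) %>%
  # Drop the yes/no duplicates; the match is case-sensitive on purpose
  select(!contains("YN", ignore.case = FALSE))

names(toy_selected)
```

With name-based ranges, the code also documents which variables mark the start and end of the block of interest.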

Remove “near-zero” variables with fewer than 50 entries in one binary category

summary(symptoms) # Vision and Hearing have fewer than 50 "Yes" entries
 SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion Sneeze   
 No :421           No :326         No :131      No :170         No :340  
 Yes:314           Yes:409         Yes:604      Yes:565         Yes:395  
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
 Fatigue   SubjectiveFever Headache      Weakness    CoughIntensity
 No : 64   No :230         No :115   None    : 49   None    : 47   
 Yes:671   Yes:505         Yes:620   Mild    :224   Mild    :156   
                                     Moderate:341   Moderate:360   
                                     Severe  :121   Severe  :172   
                                                                   
                                                                   
                                                                   
     Myalgia    RunnyNose AbPain    ChestPain Diarrhea  EyePn     Insomnia 
 None    : 79   No :211   No :642   No :501   No :636   No :622   No :316  
 Mild    :214   Yes:524   Yes: 93   Yes:234   Yes: 99   Yes:113   Yes:419  
 Moderate:327                                                              
 Severe  :115                                                              
                                                                           
                                                                           
                                                                           
 ItchyEye  Nausea    EarPn     Hearing   Pharyngitis Breathless ToothPn  
 No :553   No :477   No :573   No :705   No :121     No :438    No :569  
 Yes:182   Yes:258   Yes:162   Yes: 30   Yes:614     Yes:297    Yes:166  
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
 Vision    Vomit     Wheeze       BodyTemp     
 No :716   No :656   No :514   Min.   : 97.20  
 Yes: 19   Yes: 79   Yes:221   1st Qu.: 98.20  
                               Median : 98.50  
                               Mean   : 98.94  
                               3rd Qu.: 99.30  
                               Max.   :103.10  
                               NA's   :5       
symptoms <- symptoms %>% select(!c(Vision, Hearing))
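Reading the counts off `summary()` works for a data set this size, but the same rule could also be applied programmatically. The sketch below is one possible way to do that, not the approach used above: `toy`, `is_near_zero()`, and `min_count` are hypothetical names, and the data are invented.

```r
library(dplyr)

# Hypothetical stand-in data: two binary factors, one 4-level factor, one numeric
toy <- data.frame(
  Vision   = factor(c(rep("No", 190), rep("Yes", 10))),
  Nausea   = factor(c(rep("No", 120), rep("Yes", 80))),
  Weakness = factor(rep(c("None", "Mild", "Moderate", "Severe"), 50)),
  BodyTemp = seq(97, 103, length.out = 200)
)

min_count <- 50  # threshold from the rule above

# TRUE only for two-level factors whose rarer level has fewer than min_count rows
is_near_zero <- function(x) {
  is.factor(x) && nlevels(x) == 2 && min(table(x)) < min_count
}

near_zero_vars <- names(toy)[vapply(toy, is_near_zero, logical(1))]
toy_reduced <- toy %>% select(!all_of(near_zero_vars))

near_zero_vars
```

Going by the counts in the `summary()` output above, this rule would flag only Vision and Hearing in the real data, since every other binary variable has at least 64 entries in its smaller category.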

Remove NAs

symptoms <- symptoms %>% na.omit()

Only 5 observations were removed, leaving 730 complete observations. All 5 were missing BodyTemp, the only remaining variable with NAs, so selecting the relevant variables had already taken care of most of the missing values.
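A quick way to confirm where the missing values sit, and how many rows `na.omit()` drops, is to count NAs per column and compare row counts before and after. This is a minimal sketch on invented data; `toy` and the other object names are hypothetical.

```r
# Hypothetical stand-in data with missing values in one column only
toy <- data.frame(
  Fatigue  = factor(c("Yes", "No", "Yes", "Yes")),
  BodyTemp = c(98.3, NA, 100.8, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Compare row counts before and after dropping incomplete rows
rows_before  <- nrow(toy)
toy_complete <- na.omit(toy)
rows_dropped <- rows_before - nrow(toy_complete)

na_per_column
rows_dropped
```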

Cleaning is complete; save as an RDS file

saveRDS(symptoms, here("fluanalysis/data/processed_data/symptoms_clean.RDS"))
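One reason to save as RDS rather than CSV is that R-specific attributes such as factor levels survive the round trip. The sketch below illustrates this with a temporary file and invented data; it does not touch the project paths, and `toy` and `tmp` are hypothetical names.

```r
# Hypothetical stand-in data with an explicit factor level order
toy <- data.frame(
  Weakness = factor(c("Mild", "Severe"),
                    levels = c("None", "Mild", "Moderate", "Severe")),
  BodyTemp = c(98.3, 100.4)
)

# Write to a temporary file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)
toy_reloaded <- readRDS(tmp)

# Check that the reloaded object matches the original, including unused levels
round_trip_ok <- identical(toy, toy_reloaded)
levels(toy_reloaded$Weakness)
```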