library(tidymodels)
library(tidyverse)
library(here)
library(dotwhisker)Flu Data Model Fitting
Modeling!
Load packages
Load data
symptoms_fit <- readRDS(here("fluanalysis/data/processed_data/symptoms_clean.RDS"))
tibble(symptoms_fit) #quick look at data# A tibble: 730 × 32
SwollenLymph…¹ Chest…² Chill…³ Nasal…⁴ CoughYN Sneeze Fatigue Subje…⁵ Heada…⁶
<fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 Yes No No No Yes No Yes Yes Yes
2 Yes Yes No Yes Yes No Yes Yes Yes
3 Yes Yes Yes Yes No Yes Yes Yes Yes
4 Yes Yes Yes Yes Yes Yes Yes Yes Yes
5 Yes No Yes No No No Yes Yes Yes
6 No No Yes No Yes Yes Yes Yes Yes
7 No No Yes No Yes No Yes Yes No
8 No Yes Yes Yes Yes Yes Yes Yes Yes
9 Yes Yes Yes Yes Yes No Yes Yes Yes
10 No Yes No Yes Yes No Yes No Yes
# … with 720 more rows, 23 more variables: Weakness <fct>, WeaknessYN <fct>,
# CoughIntensity <fct>, CoughYN2 <fct>, Myalgia <fct>, MyalgiaYN <fct>,
# RunnyNose <fct>, AbPain <fct>, ChestPain <fct>, Diarrhea <fct>,
# EyePn <fct>, Insomnia <fct>, ItchyEye <fct>, Nausea <fct>, EarPn <fct>,
# Hearing <fct>, Pharyngitis <fct>, Breathless <fct>, ToothPn <fct>,
# Vision <fct>, Vomit <fct>, Wheeze <fct>, BodyTemp <dbl>, and abbreviated
# variable names ¹SwollenLymphNodes, ²ChestCongestion, ³ChillsSweats, …
Data looks good to go
Model fitting 1: Predicting body temp using multivariate linear regression
Split data into training set and testing set
data_split <- initial_split(symptoms_fit, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)Define model
lm_mod <- linear_reg() #set model type as linear regressionCreate recipe
recipe_bodytemp <- recipe(BodyTemp ~ ., data = train_data)Add model and recipe together to create workflow
bodytemp_lm_workflow <- workflow() %>%
add_model(lm_mod) %>%
add_recipe(recipe_bodytemp)Model using the workflow created above
bodytemp_fit <- bodytemp_lm_workflow %>%
fit(data = train_data)
tidy(bodytemp_fit)# A tibble: 38 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 98.2 0.346 284. 0
2 SwollenLymphNodesYes -0.108 0.106 -1.01 0.313
3 ChestCongestionYes 0.0478 0.115 0.414 0.679
4 ChillsSweatsYes 0.313 0.145 2.15 0.0320
5 NasalCongestionYes -0.213 0.131 -1.63 0.105
6 CoughYNYes 0.405 0.289 1.40 0.163
7 SneezeYes -0.374 0.112 -3.34 0.000912
8 FatigueYes 0.111 0.186 0.596 0.552
9 SubjectiveFeverYes 0.348 0.121 2.88 0.00412
10 HeadacheYes 0.0126 0.149 0.0847 0.933
# … with 28 more rows
Create a dotwhisker plot to look at regression coefficients
tidy(bodytemp_fit) %>%
dwplot(dot_args = list(size = 2, color = "black"),
whisker_args = list(color = "black"),
vline = geom_vline(xintercept = 0,
colour = "grey50", linetype = 2))
Use model on test data to predict body temp outcome
bodytemp_aug_test <- augment(bodytemp_fit, test_data)Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response"): prediction from a rank-deficient fit may be misleading
bodytemp_aug_test %>% select(BodyTemp, .pred)# A tibble: 183 × 2
BodyTemp .pred
<dbl> <dbl>
1 98.3 99.1
2 100. 99.5
3 102. 99.5
4 98.4 98.8
5 98.2 99.2
6 99.2 98.8
7 98.5 99.0
8 98.2 98.6
9 100. 99.0
10 100 99.0
# … with 173 more rows
Use RMSE to evaluate model
Info on RMSE or Root Mean Squared Error can be found here.
bodytemp_aug_test %>%
rmse(truth = BodyTemp, .pred)# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.20
I think this means that this model is estimating body temp incorrectly by this amount (in either direction)
Model fitting 2: Predicting body temp using single-variate linear regression
Predictor of interest: Runny nose
Create recipe
Since we already defined the training set and testing set, we don’t need to do that again. We will just reuse those. We also already defined the model, so we don’t need to do that again.
recipe_bodytemp2 <- recipe(BodyTemp ~ RunnyNose, data = train_data)Here, we have set the outcome of interest to be body temperature, and predictor of interest to be runny nose.
Combine model and recipe to create workflow
bodytemp_lm_workflow2 <- workflow() %>%
add_model(lm_mod) %>%
add_recipe(recipe_bodytemp2)Model using workflow and training data set
bodytemp_fit2 <- bodytemp_lm_workflow2 %>%
fit(data = train_data)
tidy(bodytemp_fit2)# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 99.1 0.0931 1065. 0
2 RunnyNoseYes -0.317 0.110 -2.88 0.00419
Now we have trained the model using the training data set.
Create a dotwhisker plot to look at regression coefficient
tidy(bodytemp_fit2) %>%
dwplot(dot_args = list(size = 2, color = "black"),
whisker_args = list(color = "black"),
vline = geom_vline(xintercept = 0,
colour = "grey50", linetype = 2))
Looks like runny nose is a negative predictor of body temp (body temp is more likely to be lower if runny nose symptom is present)
Use model on test data to predict body temp
bodytemp_aug_test2 <- augment(bodytemp_fit2, test_data)
bodytemp_aug_test2 %>% select(BodyTemp, .pred)# A tibble: 183 × 2
BodyTemp .pred
<dbl> <dbl>
1 98.3 99.1
2 100. 99.1
3 102. 98.8
4 98.4 98.8
5 98.2 99.1
6 99.2 98.8
7 98.5 98.8
8 98.2 98.8
9 100. 98.8
10 100 98.8
# … with 173 more rows
Use RMSE to evaluate model
bodytemp_aug_test2 %>%
rmse(truth = BodyTemp, .pred)# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.26
This is a similar output to what we saw in the first model, where estimation of body temp is slightly off by about 1 degree.
Model fitting 3: Predicting nausea using multivariate logistic regression
Define model
We will continue to use the model testing and training data sets that we created at the start. However, since we now want to create a logistic regression (outcome of interest is categorical), we will need to define a new model
log_mod <- logistic_reg() %>%
set_engine("glm")Create recipe
recipe_nausea <- recipe(Nausea ~., data = train_data)This sets the stage to predict nausea outcomes based on all other variables in the data set (predictors)
Combine model and recipe to create workflow
nausea_log_wf <- workflow() %>%
add_model(log_mod) %>%
add_recipe(recipe_nausea)Model using workflow and training data
nausea_fit <- nausea_log_wf %>%
fit(data = train_data)
nausea_fit %>% extract_fit_parsnip() %>%
tidy()# A tibble: 38 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -9.21 9.18 -1.00 0.316
2 SwollenLymphNodesYes -0.274 0.231 -1.19 0.235
3 ChestCongestionYes 0.337 0.256 1.32 0.188
4 ChillsSweatsYes 0.254 0.335 0.757 0.449
5 NasalCongestionYes 0.415 0.296 1.40 0.161
6 CoughYNYes -0.382 0.603 -0.633 0.527
7 SneezeYes 0.231 0.245 0.943 0.346
8 FatigueYes 0.0278 0.429 0.0649 0.948
9 SubjectiveFeverYes 0.419 0.267 1.57 0.117
10 HeadacheYes 0.364 0.346 1.05 0.292
# … with 28 more rows
Now we have trained the model, so let’s use the test data to see how well this model predicts nausea.
Use model on test data
predict(nausea_fit, test_data)Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
# A tibble: 183 × 1
.pred_class
<fct>
1 No
2 No
3 No
4 No
5 Yes
6 No
7 No
8 No
9 No
10 Yes
# … with 173 more rows
Use ROC curve to assess model fit
Info on ROC curves or Receiver Operating Characteristic can be found here.
nausea_aug_test <- augment(nausea_fit, test_data)
nausea_aug_test %>%
roc_curve(truth = Nausea, .pred_Yes, event_level = "second") %>%
autoplot()
nausea_aug_test %>% roc_auc(truth = Nausea, .pred_Yes,
event_level = "second")# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.773
I think what this shows is that the model is a pretty decent predictor of nausea, however, it is more sensitive than it is specific, meaning that it is good at indicating true positives but not as good at indicating true negatives.
Model fitting 4: Predicting nausea using single-variate logistic regression
Predictor of interest: Runny nose
Create recipe
Since we have already split the data and defined the appropriate model, we will just need to create a new recipe that predicts nausea based on runny nose.
recipe_nausea2 <- recipe(Nausea ~ RunnyNose, data = train_data)Combine model and recipe to create workflow
nausea_log_wf2 <- workflow() %>%
add_model(log_mod) %>%
add_recipe(recipe_nausea2)Model using workflow and training data
nausea_fit2 <- nausea_log_wf2 %>%
fit(data = train_data)
nausea_fit2 %>% extract_fit_parsnip() %>%
tidy()# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.590 0.167 -3.54 0.000400
2 RunnyNoseYes -0.0804 0.198 -0.406 0.685
Use model on test data
predict(nausea_fit2, test_data)# A tibble: 183 × 1
.pred_class
<fct>
1 No
2 No
3 No
4 No
5 No
6 No
7 No
8 No
9 No
10 No
# … with 173 more rows
Use ROC curve to assess model fit
nausea_aug_test2 <- augment(nausea_fit2, test_data)
nausea_aug_test2 %>%
roc_curve(truth = Nausea, .pred_Yes, event_level = "second") %>%
autoplot()
nausea_aug_test2 %>% roc_auc(truth = Nausea, .pred_Yes,
event_level = "second")# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.456