C5 - Exercise Sheet

This is the exercise sheet for Chapter 5, “Regression Models”.

Getting ready

Task 1

Create an R project for solving this Exercise Sheet.

Task 2

Download the CSV file SSRC_data.csv and the R script SSRC_C5_template.R and put them in the R project folder you created in Task 1.

Task 3

Open the R script SSRC_C5_template.R.

Task 4

Load the tidyverse package.

Task 5

Use the read.csv() command to load the SSRC data into R and call the respective data object SSRC_data.

Task 6

Get a first impression of the dataset by inspecting it with the str() command.
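
The following is a minimal sketch of how read.csv() and str() work together. It does not use the SSRC data: the file, the column names and the values are invented for illustration only.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained.
# The column names and values are invented and are NOT taken from SSRC_data.csv.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("age,gender,bmi",
    "34,female,22.5",
    "51,male,27.1",
    "45,female,30.4"),
  toy_path
)

# read.csv() returns a data frame; in R 4.0 and later, text columns arrive
# as character vectors rather than factors.
toy_data <- read.csv(toy_path)

# str() prints one line per column: name, type and the first few values.
str(toy_data)
```

In the str() output, look at the type of each column. Columns that hold categories but are shown as chr are the ones that need to be converted to factors.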

Task 7

Transform the three categorical variables in the dataset into factor variables. Make sure that the levels of the physical_activity_level and education_level variables are ordered in a reasonable way. (You learned how to do that in Chapter 2.)
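
As a reminder of why the level order matters, here is a small sketch with a made-up vector. The labels "low", "medium" and "high" are an assumption for illustration; check the actual labels in the SSRC data before writing your own code.

```r
# Made-up values; the actual category labels in the SSRC data may differ.
activity <- c("low", "high", "medium", "low", "high")

# Without `levels`, factor() sorts the levels alphabetically: high, low, medium.
levels(factor(activity))

# Passing `levels` explicitly fixes the order from lowest to highest.
activity_f <- factor(activity, levels = c("low", "medium", "high"))
levels(activity_f)
```

With R's default contrasts, the first level of a factor serves as the reference category in a regression, so the order you choose here affects how the coefficients are reported later.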

Linear Regression

Task 8

Analyze the relationship between age and bmi by fitting a linear regression model with the variable bmi as dependent variable and age as independent variable. Call this model lm_mod_restricted. Use the summary() command to check out the results.

Task 9

Estimate a linear regression model with bmi as dependent variable and all other variables in the SSRC dataset as independent variables. Call this model lm_mod_full. Use the summary() command to check out the results.
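
The formula syntax for both kinds of model can be sketched as follows. The example uses the built-in mtcars data as a stand-in, and the object names mod_small and mod_all are hypothetical.

```r
# mtcars ships with R; it stands in for the SSRC data in this sketch.
cars <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# One predictor: mpg is the dependent variable, wt the independent variable.
mod_small <- lm(mpg ~ wt, data = cars)

# The dot stands for "every other column in `data`".
mod_all <- lm(mpg ~ ., data = cars)

summary(mod_small)
summary(mod_all)
```

Because the dot picks up every column that is present when the model is fitted, it is safest to fit such a model before adding derived columns to the data frame.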

Fitted Values and Residuals

Task 10

Use the lm_mod_full model you just fitted in Task 9 to add the fitted bmi values to the SSRC dataset. Call this new variable bmi_fitted. Use the str() command to check whether it worked.

Task 11

Calculate the residuals by subtracting the fitted values from the observed bmi values and add the resulting residuals variable to the SSRC dataset. Check whether it worked.
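
The relationship between observed values, fitted values and residuals can be sketched like this. The example again uses mtcars as a stand-in, and the column names mpg_fitted and mpg_resid are hypothetical.

```r
cars <- mtcars[, c("mpg", "wt", "hp")]
mod <- lm(mpg ~ wt + hp, data = cars)

# fitted() returns one fitted value per row used in the fit.
cars$mpg_fitted <- fitted(mod)

# Residual = observed value minus fitted value.
cars$mpg_resid <- cars$mpg - cars$mpg_fitted

# The hand-computed residuals agree with R's own residuals() extractor
# (attributes such as names are ignored in the comparison).
all.equal(cars$mpg_resid, residuals(mod), check.attributes = FALSE)

str(cars)
```

Note that with R's default handling of missing values, rows containing NA are dropped before fitting, so fitted() can return fewer values than the data frame has rows. The toy data here has no missing values.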

Task 12

Analyze the distribution of the residuals variable you created in Task 11 with a histogram and a density plot.
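
A minimal sketch of both plot types, using base R graphics and the mtcars stand-in data. If you prefer the tidyverse, ggplot2 offers geom_histogram() and geom_density() for the same purpose.

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(mod)

# Histogram: counts of residuals per bin.
hist(res, main = "Histogram of residuals", xlab = "Residual")

# Density plot: a smoothed estimate of the same distribution.
res_density <- density(res)
plot(res_density, main = "Density of residuals", xlab = "Residual")
```

When reading the plots, check whether the residuals are roughly centered around zero and whether the distribution looks approximately symmetric.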

Prediction

Task 13

Create a data frame with two observations and the four variables age, gender, education_level and physical_activity_level. The first observation is a 28-year-old woman with a high level of education and a high level of physical activity. The second observation is a 59-year-old man with a medium level of education and a low level of physical activity. Call this data frame SSRC_data_new.

Task 14

Use the lm_mod_full model to predict the bmi for the two new observations described in Task 13.
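
Prediction for new observations can be sketched as follows, again with mtcars as a stand-in and the hypothetical name cars_new. The point of the sketch is that the new data frame must use the same column names as the predictors, and that factor columns should carry the same levels as in the data used for fitting.

```r
# Toy training data: cyl is turned into a factor with three levels.
cars <- mtcars[, c("mpg", "wt", "cyl")]
cars$cyl <- factor(cars$cyl, levels = c(4, 6, 8))

mod <- lm(mpg ~ wt + cyl, data = cars)

# New observations: same column names as the predictors, and the factor
# is built with the same levels as in the training data.
cars_new <- data.frame(
  wt  = c(2.5, 3.5),
  cyl = factor(c(4, 8), levels = levels(cars$cyl))
)

# predict() returns one predicted value per row of `newdata`.
cars_pred <- predict(mod, newdata = cars_new)
cars_pred
```

If a value in the new data does not match any level the model was fitted with (for example because of a typo or different capitalization), predict() cannot produce a valid prediction for that row, so it is worth comparing the spelling against levels() of the original variable.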

Extracting Coefficients

Task 15

Extract the beta coefficient for the age variable from each of the lm_mod_full and lm_mod_restricted model objects. Combine them in a data frame that displays the two age coefficients next to each other, with columns called full_model and restricted_model. Call this data frame age_coefficients and check it out after creating it.
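
One possible way to pull a single coefficient out of a model object is sketched below with the mtcars stand-in data. All object and column names in the sketch are hypothetical.

```r
mod_small <- lm(mpg ~ wt, data = mtcars)
mod_all   <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef() returns a named numeric vector; index it by the term's name.
wt_coefficients <- data.frame(
  all_predictors = coef(mod_all)["wt"],
  one_predictor  = coef(mod_small)["wt"]
)

wt_coefficients
```

Comparing the two columns shows how the estimated effect of one variable changes once further predictors are held constant.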

Logistic Regression

Task 16

Create an indicator variable that equals one if the bmi of a person is greater than or equal to 30 and zero otherwise. Add this variable to the SSRC dataset and call it obesity. (You can use the mutate() command in combination with the ifelse() command to do this in one line of code.)
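
The ifelse() pattern can be sketched on the mtcars stand-in data as follows. The variable name heavy and the cut-off of 3.5 are arbitrary choices for illustration.

```r
cars <- mtcars[, c("mpg", "wt")]

# ifelse() is vectorised: it tests every row and returns 1 or 0 per row.
# wt is measured in 1000 lbs; 3.5 is an arbitrary cut-off for this sketch.
cars$heavy <- ifelse(cars$wt >= 3.5, 1, 0)

# dplyr equivalent, if the tidyverse is loaded:
# cars <- mutate(cars, heavy = ifelse(wt >= 3.5, 1, 0))

# Count how many rows fall into each group.
table(cars$heavy)
```

The >= operator includes the boundary value itself, which matches the wording "greater than or equal to".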

Task 17

Fit a logistic regression model with the obesity variable as dependent variable and age, gender, education_level and physical_activity_level as independent variables. Call this model logit_mod. Use the summary() command to check out the results.

Task 18

Predict the probability that each of the two new observations described in Task 13 is obese. You can reuse the SSRC_data_new data frame. You can also use the type option of the predict() command to get the predicted probabilities directly: just set type equal to "response".
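
The effect of the type option can be sketched with the mtcars stand-in data. The names logit_toy and cars_new are hypothetical, and the binary variable am (transmission type) simply plays the role of a 0/1 outcome.

```r
# am (0 = automatic, 1 = manual) is a binary variable that ships with mtcars.
logit_toy <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit_toy)

cars_new <- data.frame(wt = c(2.5, 3.5))

# Default scale: log-odds (the linear predictor).
p_link <- predict(logit_toy, newdata = cars_new)

# type = "response" applies the inverse logit and returns probabilities.
p_response <- predict(logit_toy, newdata = cars_new, type = "response")

# The two are linked by the logistic function plogis().
all.equal(plogis(p_link), p_response)
```

Without type = "response", the predictions are on the log-odds scale and can be negative or larger than one, so they should not be read as probabilities.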