This is the exercise sheet for chapter 5: “Regression Models”.
Create an R project for solving this exercise sheet.
Download the CSV file SSRC_data.csv and the R script SSRC_C5_template.R and put them in the R project folder you created in Task 1.
Open the R script SSRC_C5_template.R.
Load the tidyverse package.
Use the read.csv() command to load the SSRC data into R and call the respective data object SSRC_data.
Get a first impression of the dataset by inspecting it with the str() command.
Transform the three categorical variables in the dataset into factor variables. Make sure that the levels of the physical_activity_level and education_level variables are ordered sensibly, for example from lowest to highest. (You learned how to do that in chapter 2.)
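As a hint for the loading and inspection steps, here is a minimal sketch on a made-up toy file rather than on SSRC_data.csv. The column names id and grp are invented for illustration. The pattern of read.csv() followed by str() is the same for the real file.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, grp = c("a", "b", "a")), tmp, row.names = FALSE)

# Read the file into a data frame and inspect its structure
toy <- read.csv(tmp)
str(toy)
```

In R 4.0 and later, read.csv() reads text columns as character vectors by default, so categorical variables have to be converted to factors explicitly.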
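A short reminder of how level ordering works, sketched on a made-up vector. The level names low, medium and high are an assumption for illustration; check the actual values in the SSRC data before using them.

```r
# Toy vector standing in for a categorical column (values are made up)
activity <- c("low", "high", "medium", "low", "high")

# Without explicit levels, factor() sorts the levels alphabetically
activity_default <- factor(activity)
levels(activity_default)

# The levels argument fixes the order explicitly
activity_ordered <- factor(activity, levels = c("low", "medium", "high"))
levels(activity_ordered)
```

The order of the levels matters later on, because the first level serves as the reference category in a regression model.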
Analyze the relationship between age and bmi by fitting a linear regression model with the variable bmi as dependent variable and age as independent variable. Call this model lm_mod_restricted. Use the summary() command to check out the results.
Estimate a linear regression model with bmi as dependent variable and all other variables in the SSRC dataset as independent variables. Call this model lm_mod_full. Use the summary() command to check out the results.
Use the lm_mod_full model you just fitted in Task 9 to add the fitted bmi values to the SSRC dataset. Call this new variable bmi_fitted. Use the str() command to check whether it worked.
Calculate the residuals by subtracting the fitted values from the observed bmi values and add this residuals variable to the SSRC dataset. Check whether it worked.
Analyze the distribution of the residuals variable you created in Task 11 with a histogram and a density plot.
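The logic of fitted values and residuals can be sketched on the built-in mtcars data, which stands in for the exercise data here. The names cars, toy_mod, mpg_fitted and mpg_resid are chosen for illustration only.

```r
# mtcars ships with R and stands in for the exercise data
cars <- mtcars
toy_mod <- lm(mpg ~ wt + hp, data = cars)

# fitted() returns one fitted value per row used in the model
cars$mpg_fitted <- fitted(toy_mod)

# Residual = observed value minus fitted value
cars$mpg_resid <- cars$mpg - cars$mpg_fitted

str(cars[, c("mpg", "mpg_fitted", "mpg_resid")])

# Cross-check against the residuals stored in the model object
all.equal(unname(residuals(toy_mod)), unname(cars$mpg_resid))
```

Comparing the manual result with residuals() is a quick way to check that the calculation worked.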
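One possible way to draw both plots, sketched with ggplot2 (part of the tidyverse) on simulated values. The data frame toy and its column resid are made up and only stand in for the residuals variable.

```r
library(ggplot2)

# Simulated values standing in for a residuals variable
set.seed(1)
toy <- data.frame(resid = rnorm(200))

# Histogram: counts per bin
p_hist <- ggplot(toy, aes(x = resid)) +
  geom_histogram(bins = 30)

# Density plot: smoothed estimate of the distribution
p_dens <- ggplot(toy, aes(x = resid)) +
  geom_density()

p_hist
p_dens
```

The number of bins is a free choice. Trying a few values helps to see whether the shape of the histogram is stable.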
Create a data frame with two observations and the four variables: age, gender, education_level and physical_activity_level. The first observation is a female person of age 28 who features both a high level of education and a high level of physical activity. The second observation is a male person of age 59 who features a medium level of education and a low level of physical activity. Call this data frame SSRC_data_new.
Use the lm_mod_full model to predict the bmi for the two new observations described in Task 13.
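A minimal sketch of predicting for new observations, using simulated toy data instead of the SSRC data. All names here (toy, group, toy_mod, toy_new, toy_pred) are invented for illustration. The point is that the new data frame needs the same variable names as the model, with factor levels that match those in the training data.

```r
# Simulated toy data with one numeric and one categorical predictor
set.seed(42)
toy <- data.frame(
  age = sample(20:70, 100, replace = TRUE),
  group = factor(sample(c("a", "b"), 100, replace = TRUE))
)
toy$y <- 20 + 0.1 * toy$age + 2 * (toy$group == "b") + rnorm(100)

toy_mod <- lm(y ~ age + group, data = toy)

# New observations: same column names, same factor levels as in the training data
toy_new <- data.frame(
  age = c(28, 59),
  group = factor(c("a", "b"), levels = levels(toy$group))
)

# One prediction per row of toy_new
toy_pred <- predict(toy_mod, newdata = toy_new)
toy_pred
```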
Extract the beta coefficients for the age variable from the lm_mod_full and the lm_mod_restricted model objects. Combine them in a data frame that displays the two age coefficients next to each other, with columns called full_model and restricted_model. Call this data frame age_coefficients and check it out after creating it.
Create an indicator variable that equals one if the bmi of a person is greater than or equal to 30 and zero otherwise. Add this variable to the SSRC dataset and call it obesity. (You can use the mutate() command in combination with the ifelse() command to do this in one line of code.)
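As a hint, coef() returns the coefficients of a model as a named vector, so a single coefficient can be picked out by name. The sketch below uses the built-in mtcars data. The model and column names are made up for illustration.

```r
# Two nested models on the built-in mtcars data
small_mod <- lm(mpg ~ wt, data = mtcars)
large_mod <- lm(mpg ~ wt + hp, data = mtcars)

# Pick the coefficient of wt from each model by name
wt_coefficients <- data.frame(
  large_model = coef(large_mod)["wt"],
  small_model = coef(small_mod)["wt"]
)
wt_coefficients
```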
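The combination of mutate() and ifelse() can be sketched on a made-up data frame. The names toy, score and high_score are invented for illustration.

```r
library(dplyr)

# Made-up values, including one just below the threshold
toy <- data.frame(score = c(12, 30, 29.9, 45))

# ifelse(condition, value_if_true, value_if_false) works element by element
toy <- toy %>%
  mutate(high_score = ifelse(score >= 30, 1, 0))
toy
```

Note that >= includes the threshold itself, so a value of exactly 30 gets the indicator 1.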
Fit a logistic regression model with the obesity variable as dependent variable and age, gender, education_level and physical_activity_level as independent variables. Call this model logit_mod. Use the summary() command to check out the results.
Predict the probability that the two new observations described in Task 13 are obese. You can reuse the SSRC_data_new data frame. By default, predict() returns values on the log-odds scale for a logistic regression model. To get the predicted probabilities directly, set the type option of the predict() command to "response".
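A minimal sketch of a logistic regression and the two prediction scales, using simulated toy data. The names toy, event, toy_logit and toy_new are invented for illustration and do not refer to the SSRC data.

```r
# Simulated binary outcome that depends on one numeric predictor
set.seed(7)
toy <- data.frame(x = rnorm(200))
toy$event <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * toy$x))

# family = binomial gives a logistic regression (logit link by default)
toy_logit <- glm(event ~ x, data = toy, family = binomial)

toy_new <- data.frame(x = c(-1, 1))

# Default scale: log-odds (the linear predictor)
p_link <- predict(toy_logit, newdata = toy_new)

# type = "response": predicted probabilities
p_resp <- predict(toy_logit, newdata = toy_new, type = "response")

# Applying the logistic function to the log-odds gives the same probabilities
all.equal(unname(plogis(p_link)), unname(p_resp))
```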