Here you will find the sample solution for the Chapter 3 exercise sheet.
Create an R project for solving this Exercise Sheet.
Download the CSV file SSRC_data.csv and the R script SSRC_C3_template.R and put them in the R project folder you created in Task 1.
Open the R script SSRC_C3_template.R.
Use the read.csv() command to load the SSRC data into R and call the respective data object SSRC_data.
# Load the dataset
SSRC_data <- read.csv("SSRC_data.csv")
Get a first impression of the dataset by checking out the first 6 rows of the dataset.
# Check out the first 6 rows
head(SSRC_data)
  age gender education_level physical_activity_level  bmi
1  64   male          medium                     low 27.9
2  59 female             low                  medium 27.5
3  39 female            high                     low 27.4
4  30 female            high                     low 24.2
5  49   male          medium                     low 23.9
6  37   male          medium                  medium 30.7
Install and load the tidyverse package. (If you have already installed the package, loading it is sufficient.)
# Install tidyverse package (if you have not done it yet)
# install.packages("tidyverse")
# Load tidyverse package
library("tidyverse")
Create a new variable that contains the age of an individual in months. Call the variable age_in_months and add it to the SSRC dataset. Check out the first six rows of the dataset after creating this new variable.
# Add variable to the SSRC dataset
SSRC_data <- mutate(SSRC_data, age_in_months = 12*age)
# Check out dataset
head(SSRC_data)
  age gender education_level physical_activity_level  bmi
1  64   male          medium                     low 27.9
2  59 female             low                  medium 27.5
3  39 female            high                     low 27.4
4  30 female            high                     low 24.2
5  49   male          medium                     low 23.9
6  37   male          medium                  medium 30.7
  age_in_months
1           768
2           708
3           468
4           360
5           588
6           444
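A side note on the type of the new variable: age is stored as an integer, but the literal 12 is a double in R, so the product is a double as well. That is why str() reports age_in_months as num rather than int. The following minimal sketch illustrates this with a small made-up vector (age_example, months_double and months_integer are hypothetical names, not part of the exercise):

```r
# Hypothetical example vector (not read from SSRC_data.csv)
age_example <- c(64L, 59L, 39L)

# 12 is a double, so the result is stored as "numeric"
months_double <- 12 * age_example
class(months_double)

# 12L is an integer literal, so the result stays "integer"
months_integer <- 12L * age_example
class(months_integer)
```

For this exercise either type works; the distinction only matters if a later step explicitly requires integers.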
Use the str() command to check the variable types in the SSRC data.
# Check variable types
str(SSRC_data)
'data.frame': 1000 obs. of 6 variables:
$ age : int 64 59 39 30 49 37 33 44 27 55 ...
$ gender : chr "male" "female" "female" "female" ...
$ education_level : chr "medium" "low" "high" "high" ...
$ physical_activity_level: chr "low" "medium" "low" "low" ...
$ bmi : num 27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
$ age_in_months : num 768 708 468 360 588 444 396 528 324 660 ...
Transform the variables gender, education_level and physical_activity_level into factor variables. Use the str() command to check whether it worked.
# Transform into factor variables (without dplyr)
SSRC_data$gender <- as.factor(SSRC_data$gender)
SSRC_data$education_level <- as.factor(SSRC_data$education_level)
SSRC_data$physical_activity_level <- as.factor(SSRC_data$physical_activity_level)
# Transform into factor variables (with dplyr)
SSRC_data <- mutate(SSRC_data, gender = as.factor(gender),
                    education_level = as.factor(education_level),
                    physical_activity_level = as.factor(physical_activity_level))
# Check whether it worked
str(SSRC_data)
'data.frame': 1000 obs. of 6 variables:
$ age : int 64 59 39 30 49 37 33 44 27 55 ...
$ gender : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ education_level : Factor w/ 3 levels "high","low","medium": 3 2 1 1 3 3 1 3 1 3 ...
$ physical_activity_level: Factor w/ 3 levels "high","low","medium": 2 3 2 2 2 3 2 1 3 2 ...
$ bmi : num 27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
$ age_in_months : num 768 708 468 360 588 444 396 528 324 660 ...
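As a possible alternative, the conversion can already happen at import time, because read.csv() has a stringsAsFactors argument. The sketch below is only an illustration under assumptions: it writes a tiny made-up CSV to a temporary file instead of using SSRC_data.csv, and demo_file and demo_data are hypothetical names.

```r
# Write a small made-up dataset to a temporary CSV file
demo_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(age = c(64L, 59L, 39L),
             gender = c("male", "female", "female")),
  demo_file,
  row.names = FALSE
)

# Read it back and convert all character columns to factors on import
demo_data <- read.csv(demo_file, stringsAsFactors = TRUE)
str(demo_data)

# Remove the temporary file again
unlink(demo_file)
```

Note that stringsAsFactors = TRUE converts every character column, so converting selected columns with as.factor() gives more control when some text columns should stay as character.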
Check the level order of the variable physical_activity_level and, if necessary, adjust it to “low”, “medium”, “high”.
# Check level ordering
levels(SSRC_data$physical_activity_level)
[1] "high" "low" "medium"
# Change order (without forcats package from the tidyverse)
SSRC_data$physical_activity_level <- factor(SSRC_data$physical_activity_level,
                                            levels = c("low", "medium", "high"))
# Change order (with forcats package from the tidyverse)
SSRC_data <- mutate(SSRC_data, physical_activity_level = fct_relevel(physical_activity_level, c("low", "medium", "high")))
# Check whether releveling worked
levels(SSRC_data$physical_activity_level)
[1] "low" "medium" "high"
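To see what releveling with factor() does and does not change, here is a minimal base-R sketch on a small made-up vector (activity_example, releveled and misspelled are hypothetical names). It also shows a pitfall of this approach: a level that is misspelled in levels = silently turns the affected values into NA.

```r
# Made-up example; as.factor()/factor() sort levels alphabetically by default
activity_example <- factor(c("low", "medium", "low", "high"))
levels(activity_example)

# Re-create the factor with an explicit level order
releveled <- factor(activity_example, levels = c("low", "medium", "high"))
levels(releveled)

# The values themselves are unchanged ...
identical(as.character(releveled), as.character(activity_example))

# ... only the underlying integer codes follow the new order
as.integer(releveled)

# Pitfall: a misspelled level ("hgih") turns the matching values into NA
misspelled <- factor(activity_example, levels = c("low", "medium", "hgih"))
sum(is.na(misspelled))
```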
Use the mutate() command to create a logical variable indicating whether an individual has a bmi larger than 25 and call this variable overweight_indicator. Check the class of the new variable using the class() and str() commands.
# Create overweight indicator
SSRC_data <- mutate(SSRC_data, overweight_indicator = bmi > 25)
# Check class of overweight indicator
class(SSRC_data$overweight_indicator)
[1] "logical"
str(SSRC_data$overweight_indicator)
logi [1:1000] TRUE TRUE TRUE FALSE FALSE TRUE ...
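One practical reason to store such an indicator as a logical is that TRUE counts as 1 and FALSE as 0 in arithmetic, so sum() gives a count and mean() gives a share. A minimal sketch, using only the six bmi values shown in the head() output above (bmi_example and overweight_example are hypothetical names):

```r
# bmi values of the first six rows shown above
bmi_example <- c(27.9, 27.5, 27.4, 24.2, 23.9, 30.7)

# Same comparison as in the mutate() call
overweight_example <- bmi_example > 25
overweight_example

# Number of TRUE values (count of individuals with bmi > 25)
sum(overweight_example)

# Share of TRUE values
mean(overweight_example)
```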
Change the name of the variable gender into sex. Use the str() command to check whether it worked.
# Rename variable
SSRC_data <- rename(SSRC_data, sex = gender)
# Check whether it worked
str(SSRC_data)
'data.frame': 1000 obs. of 7 variables:
$ age : int 64 59 39 30 49 37 33 44 27 55 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ education_level : Factor w/ 3 levels "high","low","medium": 3 2 1 1 3 3 1 3 1 3 ...
$ physical_activity_level: Factor w/ 3 levels "low","medium",..: 1 2 1 1 1 2 1 3 2 1 ...
$ bmi : num 27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
$ age_in_months : num 768 708 468 360 588 444 396 528 324 660 ...
$ overweight_indicator : logi TRUE TRUE TRUE FALSE FALSE TRUE ...
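For comparison, renaming also works without dplyr by assigning to names(). The sketch below uses a small made-up data frame (rename_demo is a hypothetical name) and is meant as one possible base-R equivalent, not as the recommended approach of the exercise:

```r
# Made-up data frame with a column called "gender"
rename_demo <- data.frame(age = c(64L, 59L),
                          gender = c("male", "female"))

# Replace the name of the column that is currently called "gender"
names(rename_demo)[names(rename_demo) == "gender"] <- "sex"
names(rename_demo)
```

Note the reversed order compared with rename(), where the new name is written on the left: rename(data, new_name = old_name).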
Create a new factor variable that indicates whether an individual is “underweight” (bmi < 18.5), “normal” (bmi between 18.5 and 25) or “overweight” (bmi > 25). Call this variable weight_category. Use the str() and head() commands to check whether it worked.
# Create weight_category variable
SSRC_data <- SSRC_data %>%
  mutate(weight_category = ifelse(bmi < 18.5, "underweight",
                                  ifelse(bmi > 25, "overweight", "normal"))) %>%
  mutate(weight_category = as_factor(weight_category))
# There are many different ways to do this; this is only one of them.
# Another useful command for creating categorical variables with multiple levels from a numerical variable is cut().
# Check whether it worked
head(SSRC_data)
  age    sex education_level physical_activity_level  bmi
1  64   male          medium                     low 27.9
2  59 female             low                  medium 27.5
3  39 female            high                     low 27.4
4  30 female            high                     low 24.2
5  49   male          medium                     low 23.9
6  37   male          medium                  medium 30.7
  age_in_months overweight_indicator weight_category
1           768                 TRUE      overweight
2           708                 TRUE      overweight
3           468                 TRUE      overweight
4           360                FALSE          normal
5           588                FALSE          normal
6           444                 TRUE      overweight
str(SSRC_data)
'data.frame': 1000 obs. of 8 variables:
$ age : int 64 59 39 30 49 37 33 44 27 55 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ education_level : Factor w/ 3 levels "high","low","medium": 3 2 1 1 3 3 1 3 1 3 ...
$ physical_activity_level: Factor w/ 3 levels "low","medium",..: 1 2 1 1 1 2 1 3 2 1 ...
$ bmi : num 27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
$ age_in_months : num 768 708 468 360 588 444 396 528 324 660 ...
$ overweight_indicator : logi TRUE TRUE TRUE FALSE FALSE TRUE ...
$ weight_category : Factor w/ 3 levels "overweight","normal",..: 1 1 1 2 2 1 1 1 2 1 ...
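The cut() command mentioned in the comments above could be used roughly as follows. This is a sketch on made-up values (bmi_example and weight_category_cut are hypothetical names), not a drop-in replacement for the ifelse() solution. With right = TRUE the intervals are closed on the right, so a bmi of exactly 18.5 falls into "underweight", whereas the nested ifelse() labels it "normal". In addition, cut() orders the levels by the breaks instead of by first appearance, as as_factor() does.

```r
# Made-up bmi values, including both boundary values 18.5 and 25
bmi_example <- c(17.0, 18.5, 22.0, 25.0, 30.7)

# Intervals: (-Inf, 18.5], (18.5, 25], (25, Inf)
weight_category_cut <- cut(bmi_example,
                           breaks = c(-Inf, 18.5, 25, Inf),
                           labels = c("underweight", "normal", "overweight"),
                           right = TRUE)

# Levels follow the order of the breaks
levels(weight_category_cut)

# Resulting categories; note that 18.5 is labeled "underweight" here
as.character(weight_category_cut)
```

If values of exactly 18.5 must count as "normal", the nested ifelse() from the solution (or an equivalent explicit comparison) is the safer choice.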