C3 - Sample Solution

Here you will find the sample solution for the Chapter 3 exercise sheet.

Start a project and import data

Task 1

Create an R project for solving this Exercise Sheet.

Task 2

Download the CSV file SSRC_data.csv and the R script SSRC_C3_template.R and put both in the R project folder you created in Task 1.

Task 3

Open the R script SSRC_C3_template.R.

Task 4

Use the read.csv() command to load the SSRC data into R and call the respective data object SSRC_data.

# Load the dataset
SSRC_data <- read.csv("SSRC_data.csv")

Task 5

Get a first impression of the dataset by checking out the first 6 rows of the dataset.

# Check out the first 6 rows
head(SSRC_data)
  age gender education_level physical_activity_level  bmi
1  64   male          medium                     low 27.9
2  59 female             low                  medium 27.5
3  39 female            high                     low 27.4
4  30 female            high                     low 24.2
5  49   male          medium                     low 23.9
6  37   male          medium                  medium 30.7

Task 6

Install and load the tidyverse package. (If you have already installed the package, loading it is sufficient.)

# Install tidyverse package (if you have not done it yet)
# install.packages("tidyverse")

# Load tidyverse package
library("tidyverse")

The mutate() command

Task 7

Create a new variable that contains the age of an individual in months. Call the variable age_in_months and add it to the SSRC dataset. Check out the first six rows of the dataset after creating this new variable.

# Add variable to the SSRC dataset 
SSRC_data <- mutate(SSRC_data, age_in_months = 12*age)

# Check out dataset
head(SSRC_data)
  age gender education_level physical_activity_level  bmi
1  64   male          medium                     low 27.9
2  59 female             low                  medium 27.5
3  39 female            high                     low 27.4
4  30 female            high                     low 24.2
5  49   male          medium                     low 23.9
6  37   male          medium                  medium 30.7
  age_in_months
1           768
2           708
3           468
4           360
5           588
6           444

Check variable types

Task 8

Use the str() command to check the variable types in the SSRC data.

# Check variable types
str(SSRC_data)
'data.frame':   1000 obs. of  6 variables:
 $ age                    : int  64 59 39 30 49 37 33 44 27 55 ...
 $ gender                 : chr  "male" "female" "female" "female" ...
 $ education_level        : chr  "medium" "low" "high" "high" ...
 $ physical_activity_level: chr  "low" "medium" "low" "low" ...
 $ bmi                    : num  27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
 $ age_in_months          : num  768 708 468 360 588 444 396 528 324 660 ...
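The chr entries above reflect the default stringsAsFactors = FALSE that read.csv() uses in R 4.0.0 and later. As a minimal sketch of how that argument changes the result, the following writes a tiny made-up CSV to a temporary file and reads it both ways. The file contents and the object names csv_path, as_chr and as_fct are assumptions for illustration and are not part of the SSRC data.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(64, 59), gender = c("male", "female")),
          csv_path, row.names = FALSE)

# Default behaviour in R >= 4.0.0: text columns are read as character
as_chr <- read.csv(csv_path)
class(as_chr$gender)

# With stringsAsFactors = TRUE, text columns are read as factors
as_fct <- read.csv(csv_path, stringsAsFactors = TRUE)
class(as_fct$gender)
levels(as_fct$gender)
```

Converting explicitly after import, as in the tasks that follow, keeps control over which columns become factors.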

Factor and Indicator Variables

Task 9

Transform the variables gender, education_level and physical_activity_level into factor variables. Use the str() command to check whether it worked.

# Option 1: transform into factor variables (without dplyr)
SSRC_data$gender <- as.factor(SSRC_data$gender)
SSRC_data$education_level <- as.factor(SSRC_data$education_level)
SSRC_data$physical_activity_level <- as.factor(SSRC_data$physical_activity_level)

# Option 2: transform into factor variables (with dplyr)
# Either option is sufficient; running both does no harm.
SSRC_data <- mutate(SSRC_data, gender = as.factor(gender),
                               education_level = as.factor(education_level),
                               physical_activity_level = as.factor(physical_activity_level))

# Check whether it worked 
str(SSRC_data)
'data.frame':   1000 obs. of  6 variables:
 $ age                    : int  64 59 39 30 49 37 33 44 27 55 ...
 $ gender                 : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ education_level        : Factor w/ 3 levels "high","low","medium": 3 2 1 1 3 3 1 3 1 3 ...
 $ physical_activity_level: Factor w/ 3 levels "high","low","medium": 2 3 2 2 2 3 2 1 3 2 ...
 $ bmi                    : num  27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
 $ age_in_months          : num  768 708 468 360 588 444 396 528 324 660 ...
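When many columns need the same conversion, dplyr's across() helper can apply as.factor() to all of them in one call. The following is only a sketch on a hypothetical toy data frame (toy_data, built from the first three rows shown earlier), not a replacement for the solution above.

```r
library(dplyr)

# Hypothetical toy data mirroring the first three rows of the SSRC data
toy_data <- data.frame(
  gender = c("male", "female", "female"),
  education_level = c("medium", "low", "high"),
  physical_activity_level = c("low", "medium", "low"),
  bmi = c(27.9, 27.5, 27.4)
)

# Convert every character column to a factor; numeric columns are left alone
toy_data <- mutate(toy_data, across(where(is.character), as.factor))

str(toy_data)
```

Selecting with where(is.character) converts all text columns, so it only fits when every character column really is categorical.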

Task 10

Check the level order of the variable physical_activity_level and adjust it if necessary to “low”, “medium”, “high”.

# Check level ordering
levels(SSRC_data$physical_activity_level)
[1] "high"   "low"    "medium"
# Change order (without forcats package from the tidyverse)
SSRC_data$physical_activity_level <- factor(SSRC_data$physical_activity_level, 
                                            levels = c("low", "medium", "high"))

# Change order (with forcats package from the tidyverse)
SSRC_data <- mutate(SSRC_data,
                    physical_activity_level = fct_relevel(physical_activity_level,
                                                          c("low", "medium", "high")))

# Check whether releveling worked
levels(SSRC_data$physical_activity_level)
[1] "low"    "medium" "high"  
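To see what the level order actually controls, here is a small sketch on a made-up vector (activity is a hypothetical name). The level order determines the integer codes behind the factor and the order in which table() reports the categories.

```r
# Made-up activity values with an explicit level order
activity <- factor(c("low", "medium", "low", "high"),
                   levels = c("low", "medium", "high"))

# Levels appear in the order given, not alphabetically
levels(activity)

# The underlying integer codes follow the level order
as.integer(activity)

# Frequency tables list the categories in level order
table(activity)
```

One caution when using factor() with the levels argument: any value that is not listed in levels, for example because of a typo, silently becomes NA.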

Task 11

Use the mutate() command to create a logical variable indicating whether an individual has a bmi larger than 25 and call this variable overweight_indicator. Check the class of the new variable using the class() and str() commands.

# Create overweight indicator
SSRC_data <- mutate(SSRC_data, overweight_indicator = bmi > 25)

# Check class of overweight indicator
class(SSRC_data$overweight_indicator)
[1] "logical"
str(SSRC_data$overweight_indicator)
 logi [1:1000] TRUE TRUE TRUE FALSE FALSE TRUE ...
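A logical indicator is convenient because R treats TRUE as 1 and FALSE as 0 in arithmetic. As a brief sketch, using only the six bmi values shown in the head() output (the vector names bmi_first_six and overweight are made up for this example):

```r
# bmi values of the first six rows of the SSRC data
bmi_first_six <- c(27.9, 27.5, 27.4, 24.2, 23.9, 30.7)

# Same comparison as in the solution above
overweight <- bmi_first_six > 25

# Number of TRUE values (count of overweight individuals)
sum(overweight)

# Share of TRUE values (proportion of overweight individuals)
mean(overweight)
```

The same sum() and mean() calls work on SSRC_data$overweight_indicator for the full dataset.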

Task 12

Change the name of the variable gender into sex. Use the str() command to check whether it worked.

# Rename variable
SSRC_data <- rename(SSRC_data, sex = gender)

# Check whether it worked
str(SSRC_data)
'data.frame':   1000 obs. of  7 variables:
 $ age                    : int  64 59 39 30 49 37 33 44 27 55 ...
 $ sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ education_level        : Factor w/ 3 levels "high","low","medium": 3 2 1 1 3 3 1 3 1 3 ...
 $ physical_activity_level: Factor w/ 3 levels "low","medium",..: 1 2 1 1 1 2 1 3 2 1 ...
 $ bmi                    : num  27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
 $ age_in_months          : num  768 708 468 360 588 444 396 528 324 660 ...
 $ overweight_indicator   : logi  TRUE TRUE TRUE FALSE FALSE TRUE ...
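Note that rename() expects the pair in the order new_name = old_name. A minimal sketch on a hypothetical two-row data frame, shown next to a base R equivalent for comparison (toy_dplyr and toy_base are made-up names):

```r
library(dplyr)

# Two identical, made-up data frames to compare both approaches
toy_dplyr <- data.frame(age = c(64, 59), gender = c("male", "female"))
toy_base  <- toy_dplyr

# dplyr: new name on the left, old name on the right
toy_dplyr <- rename(toy_dplyr, sex = gender)

# Base R: overwrite the matching entry of the names vector
names(toy_base)[names(toy_base) == "gender"] <- "sex"

names(toy_dplyr)
names(toy_base)
```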

Task 13

Create a new factor variable that indicates whether an individual is “underweight” (bmi < 18.5), “normal” (bmi between 18.5 and 25) or “overweight” (bmi > 25). Call this variable weight_category. Use the str() and head() commands to check whether it worked.

# Create weight_category variable
SSRC_data <- SSRC_data %>% 
              mutate(weight_category = ifelse(bmi < 18.5, "underweight", 
                                              ifelse(bmi > 25, "overweight", "normal"))) %>% 
              mutate(weight_category = as_factor(weight_category))

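# There are many different ways to do this. This is only one of them. 
# Another useful command to create categorical variables with multiple levels from a numerical variable is the cut() command.
# Note: as_factor() orders the levels by first appearance in the data, which is why "overweight" is the first level below.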

# Check whether it worked
head(SSRC_data)
  age    sex education_level physical_activity_level  bmi
1  64   male          medium                     low 27.9
2  59 female             low                  medium 27.5
3  39 female            high                     low 27.4
4  30 female            high                     low 24.2
5  49   male          medium                     low 23.9
6  37   male          medium                  medium 30.7
  age_in_months overweight_indicator weight_category
1           768                 TRUE      overweight
2           708                 TRUE      overweight
3           468                 TRUE      overweight
4           360                FALSE          normal
5           588                FALSE          normal
6           444                 TRUE      overweight
str(SSRC_data)
'data.frame':   1000 obs. of  8 variables:
 $ age                    : int  64 59 39 30 49 37 33 44 27 55 ...
 $ sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ education_level        : Factor w/ 3 levels "high","low","medium": 3 2 1 1 3 3 1 3 1 3 ...
 $ physical_activity_level: Factor w/ 3 levels "low","medium",..: 1 2 1 1 1 2 1 3 2 1 ...
 $ bmi                    : num  27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
 $ age_in_months          : num  768 708 468 360 588 444 396 528 324 660 ...
 $ overweight_indicator   : logi  TRUE TRUE TRUE FALSE FALSE TRUE ...
 $ weight_category        : Factor w/ 3 levels "overweight","normal",..: 1 1 1 2 2 1 1 1 2 1 ...
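The cut() command mentioned in the comments above could be used roughly as follows. This is a sketch on made-up bmi values, not the sheet's official solution, and it differs from the ifelse() version at one boundary: with right = FALSE the intervals are closed on the left, so a bmi of exactly 18.5 is "normal" (as in the solution), but a bmi of exactly 25 falls into "overweight" rather than "normal".

```r
# Made-up bmi values, including both boundary values
bmi <- c(17.0, 18.5, 24.2, 25.0, 30.7)

# Intervals: [-Inf, 18.5), [18.5, 25), [25, Inf)
weight_category <- cut(bmi,
                       breaks = c(-Inf, 18.5, 25, Inf),
                       labels = c("underweight", "normal", "overweight"),
                       right = FALSE)

weight_category
levels(weight_category)
```

Unlike as_factor(), cut() returns the levels in the order of the labels argument, so the result is already ordered from "underweight" to "overweight".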