Here you find the sample solution for the exercise sheet of chapter 4.
Create an R project for solving this Exercise Sheet.
Download the CSV file SSRC_data.csv and the R script SSRC_C4_template.R and put them in the R project folder you created in Task 1.
Open the R script SSRC_C4_template.R.
Load the tidyverse package.
# Load tidyverse package
library("tidyverse")
Use the read.csv() command to load the SSRC data into R and call the respective data object SSRC_data.
# Load the dataset
SSRC_data <- read.csv("SSRC_data.csv")
Get a first impression of the dataset by checking out the dataset using the str() command.
# Check out the dataset
str(SSRC_data)
'data.frame': 1000 obs. of 5 variables:
$ age : int 64 59 39 30 49 37 33 44 27 55 ...
$ gender : chr "male" "female" "female" "female" ...
$ education_level : chr "medium" "low" "high" "high" ...
$ physical_activity_level: chr "low" "medium" "low" "low" ...
$ bmi : num 27.9 27.5 27.4 24.2 23.9 30.7 25.1 30.3 18.5 30.2 ...
Transform the three character variables in the dataset into factor variables. Make sure that the levels of the physical_activity_level and education_level variables are ordered in a reasonable way. (You learned how to do that in chapter 3.)
# Transform into factor variables
SSRC_data <- mutate(SSRC_data,
                    gender = as.factor(gender),
                    education_level = as.factor(education_level),
                    physical_activity_level = as.factor(physical_activity_level))
# Order education_level and physical_activity_level in a reasonable way
SSRC_data <- SSRC_data %>%
  mutate(education_level = fct_relevel(education_level, c("low", "medium", "high"))) %>%
  mutate(physical_activity_level = fct_relevel(physical_activity_level, c("low", "medium", "high")))
# Check whether it worked out
levels(SSRC_data$education_level)
[1] "low" "medium" "high"
levels(SSRC_data$physical_activity_level)
[1] "low" "medium" "high"
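As a side note, base R's factor() can convert and order in a single step when the levels are passed explicitly. The following is only a minimal sketch of that alternative; the vector education_chr is made up for illustration and stands in for a character column such as education_level before conversion.

```r
# Made-up character vector standing in for an unconverted column
education_chr <- c("medium", "low", "high", "high", "low")

# factor() with an explicit levels argument converts and orders in one step
education_fct <- factor(education_chr, levels = c("low", "medium", "high"))

# Check whether it worked out
levels(education_fct)
```

Values that do not appear in the levels argument become NA, so the spelling of the levels has to match the data exactly.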
What kind of plot would be useful to analyze the …
# Variation of gender -> Bar Chart
# Variation of bmi -> Histogram
# Covariation of physical_activity_level and education_level -> Heat Map
# Covariation of bmi and physical_activity_level -> Boxplot
# Covariation of age and bmi -> Scatterplot
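Since no later task draws the heat map, here is a minimal sketch of how one could look, assuming the common approach of counting each combination with count() and filling tiles with geom_tile(). The data frame demo_data is a made-up stand-in that is only used if SSRC_data is not loaded; plot_data, combo_counts and heat_map are names chosen for this sketch.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in so the sketch runs on its own; the real SSRC_data is used if loaded
lvl <- c("low", "medium", "high")
demo_data <- data.frame(
  education_level = factor(c("low", "low", "medium", "high", "high", "medium"), levels = lvl),
  physical_activity_level = factor(c("low", "medium", "medium", "high", "low", "medium"), levels = lvl)
)
plot_data <- if (exists("SSRC_data")) SSRC_data else demo_data

# Count the observations for each combination of the two factor variables
combo_counts <- count(plot_data, education_level, physical_activity_level)

# Create heat map: one tile per combination, fill colour shows the count n
heat_map <- ggplot(data = combo_counts,
                   mapping = aes(x = education_level, y = physical_activity_level)) +
  geom_tile(mapping = aes(fill = n))
heat_map
```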
Use a bar chart to analyze the distribution of the physical_activity_level variable.
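One way to write this, sketched here with the mapping argument placed inside geom_bar(). The data frame demo_data is a made-up stand-in that is only used if SSRC_data is not loaded; plot_data and bar_chart are names chosen for this sketch.

```r
library(ggplot2)

# Made-up stand-in so the sketch runs on its own; the real SSRC_data is used if loaded
demo_data <- data.frame(
  physical_activity_level = factor(
    c("low", "medium", "high", "low", "medium", "low"),
    levels = c("low", "medium", "high")
  )
)
plot_data <- if (exists("SSRC_data")) SSRC_data else demo_data

# Create bar chart (mapping inside geom_bar())
bar_chart <- ggplot(data = plot_data) +
  geom_bar(mapping = aes(x = physical_activity_level))
bar_chart
```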
An alternative to putting the mapping argument in the geom_bar() command is to put it in the ggplot() command in the first line:
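A minimal sketch of this variant, with the mapping moved into ggplot(). The data frame demo_data is a made-up stand-in that is only used if SSRC_data is not loaded; plot_data and bar_chart_alt are names chosen for this sketch.

```r
library(ggplot2)

# Made-up stand-in so the sketch runs on its own; the real SSRC_data is used if loaded
demo_data <- data.frame(
  physical_activity_level = factor(
    c("low", "medium", "high", "low", "medium", "low"),
    levels = c("low", "medium", "high")
  )
)
plot_data <- if (exists("SSRC_data")) SSRC_data else demo_data

# Create bar chart (mapping inside ggplot(), inherited by geom_bar())
bar_chart_alt <- ggplot(data = plot_data, mapping = aes(x = physical_activity_level)) +
  geom_bar()
bar_chart_alt
```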
Both alternatives work, but the latter is probably the more commonly used approach. Hence, we will use the second approach whenever we create plots in the SSRC sample solutions.
Create the same bar chart as in Task 9 but with colored bars and a decreased bar width of 0.5.
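A possible sketch for this task, assuming that "colored bars" means one fixed fill colour for all bars; the colour "steelblue" is an arbitrary choice. The data frame demo_data is a made-up stand-in that is only used if SSRC_data is not loaded; plot_data and bar_chart_col are names chosen for this sketch.

```r
library(ggplot2)

# Made-up stand-in so the sketch runs on its own; the real SSRC_data is used if loaded
demo_data <- data.frame(
  physical_activity_level = factor(
    c("low", "medium", "high", "low", "medium", "low"),
    levels = c("low", "medium", "high")
  )
)
plot_data <- if (exists("SSRC_data")) SSRC_data else demo_data

# Create bar chart with coloured bars and a bar width of 0.5
# fill and width are set outside aes() because they are fixed values, not variables
bar_chart_col <- ggplot(data = plot_data, mapping = aes(x = physical_activity_level)) +
  geom_bar(fill = "steelblue", width = 0.5)
bar_chart_col
```

If each level should get its own colour instead, fill = physical_activity_level can be mapped inside aes().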
Create a histogram to analyze the distribution of the variable age.
# Create histogram
ggplot(data = SSRC_data, mapping = aes(x = age)) +
  geom_histogram()
Create the same histogram as in Task 11 but change the binwidth to 1.
# Create histogram
ggplot(data = SSRC_data, mapping = aes(x = age)) +
  geom_histogram(binwidth = 1)
Create a density plot to analyze the distribution of the variable bmi.
# Create density plot
ggplot(data = SSRC_data, mapping = aes(x = bmi)) +
  geom_density()
Create a plot that depicts the distributions of bmi for males and females in a single plot.
# Create density plots
ggplot(data = SSRC_data, mapping = aes(x = bmi, col = gender)) +
  geom_density()
Create a set of parallel boxplots to describe the relationship between education_level and bmi.
# Create boxplots
ggplot(data = SSRC_data, mapping = aes(x = education_level, y = bmi)) +
  geom_boxplot()
Create a scatterplot to analyze the relationship between age and bmi.
# Create Scatterplot
ggplot(data = SSRC_data, mapping = aes(x = age, y = bmi)) +
  geom_point()
Create the same scatterplot as in Task 16 and add a line that approximates the relationship between age and bmi. (Use method = "lm".)
# Create Scatterplot with a smoother
ggplot(data = SSRC_data, mapping = aes(x = age, y = bmi)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Create the same scatterplot as in Task 17 and add three horizontal lines that indicate bmi levels of 18.5, 25 and 30.
# Create Scatterplot with horizontal lines
ggplot(data = SSRC_data, mapping = aes(x = age, y = bmi)) +
  geom_point() +
  geom_hline(yintercept = c(18.5, 25, 30))
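The task refers to the plot from Task 17, which also contains the fitted line, whereas the solution above shows only the points and the horizontal lines. If the fitted line is meant to stay in the plot, a sketch could look as follows. The data frame demo_data is a made-up stand-in that is only used if SSRC_data is not loaded; plot_data and scatter_full are names chosen for this sketch.

```r
library(ggplot2)

# Made-up stand-in so the sketch runs on its own; the real SSRC_data is used if loaded
demo_data <- data.frame(
  age = c(25, 32, 41, 48, 56, 63),
  bmi = c(21.4, 23.0, 26.1, 24.8, 28.3, 29.5)
)
plot_data <- if (exists("SSRC_data")) SSRC_data else demo_data

# Create Scatterplot with a smoother and horizontal lines
scatter_full <- ggplot(data = plot_data, mapping = aes(x = age, y = bmi)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = c(18.5, 25, 30))
scatter_full
```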