Functions Exercises

Open R Sessions 2024

Authors

Etka Yapar

Iain Moodie

Simon Jacobsen Ellerstrand

Violeta Caballero López

Ximena Alva Caballero

Welcome to the functions exercises! We suggest you work in pairs or small groups for these exercises. In this session you will need to think creatively about how to solve each problem. There is going to be more than one write answer for each.

Remember to annotate your code with comments to help you remember what you did and why. To write a comment in R, use the # symbol. Anything after the # symbol will be ignored by R.

It’s up to you how you want to organise your R scripts. We recommend you setup a new working directory for these exercises, as you will be working with multiple R scripts.

1 Simple functions

In this section, you will write some functions so solve simple tasks. For this section only you can work in a single script, or save each function in seperate scripts. In later sections, we will practise a script-based workflow. For now, the goal is to get comfortable with writing functions.

1.1 Farenheit to Celcius

Imagine you’re working with ecological data collected from around the world. One variable that was recorded was mean daily temperatures. However, some of the data was recorded in Farenheit and some in Celcius. Write a function that when given a temperature in Farenheit, converts it to Celcius. The formula to do so is:

\[ ^{\circ}C=\frac{5}{9}(^{\circ}F-32) \]

Give your function a helpful name. Remember, the general formulation of a function is:

function_name <- function(input) {
  # some stuff my function does
  return(output)
}

Once you’ve got a solution, try it out! Does $41^{\circ}F=5^{\circ}C$ ? Does you function still work if instead of a single temperature, you give a vector of temperatures as the input?

Show potential solution

fahrenheit_to_celsius <- function(temp_f) {
  temp_c <- (5/9) * (temp_f - 32)
  return(temp_c)
}

1.2 Standard error

In base R, there is no single function to calculate the standard error, like there is for the standard deviation, sd(). The standard error of a sample is defined as:

\[ SE=\frac{\sigma}{N} \]

where $\sigma$ is the standard deviation of the sample, and $N$ is the number of samples.

Write a function that takes a numeric vector as an input and returns the standard error. Try it on these vectors to see if it works as it should:

vec1 <- c(36, 1, 53,  2, 75, 74, 59, 28, 63, 87)
vec2 <- c(39, 84, 8, 26, 4, 21, 29, 35, 17, 14)

The standard error for vec1 and vec2 should be 3.012124 and 2.273543 respectively.

Show potential solution

standard_error <- function(x) {
  sd(x)/length(x) # note that we don't always need to use return()
}

1.3 Compare means

Write a function that takes two vectors as inputs, x and y. Your function output should be a printed statement declaring which vector has a larger mean. You will likely need to use some if() statments here, and the print() function might come in helpful. Once you’ve got a function, try it out on these vec1 and vec2 from the previous problem.

vec1 has a larger mean than vec2. Does your function agree? What about with these next two vectors?

vec3 <- c(27, 71, 79, 20, 47, 89, 85, 44, 4, 58)
vec4 <- c(84 ,18 ,75 ,36 ,96, 10, 82, 15, 39, 69)

vec3 and vec4 have the same mean. Can your function handle this scenario? If not, update it so it can.

Show potential solution

compare_means <- function(x, y) {
  if (mean(x) > mean(y)) {
    print("The mean of vector x is larger than the mean of vector y.")
  } else if (mean(x) < mean(y)) {
    print("The mean of vector y is larger than the mean of vector x.")
  } else {
    print("The means of vector x and vector y are equal.")
  }
}

2 Script-based workflow

Save what you have done in Section 1 and close those scripts. We are going to create a small script-based analysis of the palmerpenguins dataset. We will use a main script that will layout the structure of our analysis, that will import functions from separate scripts, all within the same folder. Our “pipeline” should:

Import and manipulate the dataset
Perform a statistical analysis
Make a figure

Each of these steps should be done using one (or more) function(s).

2.1 Download the dataset

The dataset can be found on canvas or here, as a CSV file called palmerpenguins.csv. It’s the same file we used in the plotting exercises. Download the dataset and put it in your working directory.

2.2 Setting up your main script

Create a new .R file, and name is palmerpenguin_main.R. This script will be where we layout the structure of our analysis. Using comments, give the script a title, write your name, and todays date, like the following:

# Palmer Penguins Dataset Analysis
# Iain Moodie
# 2023-11-02

Make four sections, one for each part of the analysis pipeline, and an additional one where we will define (load) our functions. In Rstudio, you can use the shortcut Command+Shift+R or Control+Shift+R to create a new section. These sections are now also shown at the bottom left of the script window, and you can quickly navigate between them. It should look something like this:

# functions ---------------------------------------------------------------


# import dataset ----------------------------------------------------------


# analysis ----------------------------------------------------------------


# plot --------------------------------------------------------------------

Often, we put the command rm(list=ls()) at the start of our main script to clear the environment, so that we can be sure our script contains all the information needed to run the analysis. Add it to yours.

The main script is setup, so now let’s work on our functions.

2.3 `import_data()`

Create a new .R file called import_data.R and save it in your working directory. In this file, we are going to make our import function. The function should take the filename of the palmerpenguins dataset as it’s input, and return a list of dataframes.

Breaking that down, the function should:

Import the CSV file into R
Remove any rows that have NA values in them
Split the dataframe into a list of dataframes, one for each species

Try to write a function that does all this.

Show potential solution

import_data <- function(file_path) {
  # Import CSV file into R
  penguins <- read.csv(file_path)
  
  # Remove rows with NA values
  penguins <- na.omit(penguins)
  
  # Split dataframe into a list of dataframes, one for each species
  penguins_list <- split(penguins, penguins$species)
  
  return(penguins_list)
}

Save the import_data.R file, and open up your main script palmerpenguin_main.R. In the functions section, we will now define (load) the function from import_data.R so that it can be used in the main script. We do that using source().

# functions ---------------------------------------------------------------
source("import_data.R")

You can check that the function is loaded using ls(). The function import_data() should show up if you have run the source() command. You can now add it to the import dataset section, to start building up your analysis. Does the whole script run (select everything, then run)?

Next we will move onto the statistical analysis, where we will test if the mean mass of males and females differ significantly from each other in each species.

2.4 `analyse_data()`

Create a new .R file called analyse_data.R and save it in your working directory. This function should take a single species’ dataframe as it’s input, and output the results of a statistical test comparing the mean body masses of males and females.

In detail, the function should:

Isolate the variables of interest
Split the data into two vectors, one of male body masses, and the other of female body masses.
Perform a t-test to test if the mean body mass of males and females differ, and return the results.

We will then use a for loop in the main script to loop the function over each species.

Don’t worry about the stats here, that’s not the goal of this course. To run a t-test in R, use the t.test() function. Have a look at the helpfiles if you are unsure with ?. If you can produce the same outcome with less steps than above (it’s possible), then go for it!

Show potential solution

analyse_data <- function(df) {
  # Isolate variables of interest
  body_mass <- df$body_mass_g
  sex <- df$sex
  
  # Split data into two vectors, one for male body masses and one for female body masses
  male_mass <- body_mass[sex == "male"]
  female_mass <- body_mass[sex == "female"]
  
  # Perform t-test to compare mean body masses of males and females
  t_test <- t.test(male_mass, female_mass)
  
  # Return results
  return(t_test)
}

Save the analyse_data.R script. As before, go into your main script and define the function using source().

In the main script, in the analysis section, use your knowledge of for() loops to run your analyse_data() function over each dataset in the list returned by import_data(). You will need to save the results in a list, so you can access them after it has run. Naming each item in the list with the species name seems helpful too. If you get stuck, I show a potential solution as to what my script looked like at this point below.

Show an example main script

# Palmer Penguins Dataset Analysis
# Iain Moodie
# 2023-11-02

rm(list=ls())

# functions ---------------------------------------------------------------
source("import_data.R")
source("analyse_data.R")

# import dataset ----------------------------------------------------------

penguin_data <- import_data("palmerpenguins.csv")

# analysis ----------------------------------------------------------------

# Loop over each species in penguin_data
results <- list() # create an empty list to store the results
species_names <- names(penguin_data) # get the species names
for (i in seq_along(penguin_data)) {
  results[[species_names[[i]]]] <- analyse_data(penguin_data[[i]]) # run analyse_data() on each df and store the results in the list
  
}

# plot --------------------------------------------------------------------

2.5 `plot_data()`

The final step is to make a function that will produce a figure that illustrates our results. Again, create a new script called plot_data.R, and write a function which produces a plot that you think would illustrate these results. This one is up to you to decide what should be an argument, but the output needs to be at least one plot that can demonstrate the sexual dimorphism in each of these three penguin species.

Show potential solution

plot_data <- function(df, species_name) {
  boxplot(body_mass_g ~ sex, data = df, main = paste("Body Mass by Sex for", species_name))
}

Once you’re happy with it, add it to your main script like before.

2.6 Finishing up

If you select your whole script and run it all , does it work as intended? If so, that’s brilliant! You’ve written a full analysis, start to finish, using a script-based workflow. Congratulations! Make sure all your scripts are well commented and saved. That’s all for today. See you at the next session!

Show an example completed main script

# Palmer Penguins Dataset Analysis
# Iain Moodie
# 2023-11-02

rm(list=ls())

# functions ---------------------------------------------------------------
source("import_data.R")
source("analyse_data.R")
source("plot_data.R")

# import dataset ----------------------------------------------------------

penguin_data <- import_data("palmerpenguins.csv")

# analysis ----------------------------------------------------------------

# Loop over each species in penguin_data
results <- list() # create an empty list to store the results
species_names <- names(penguin_data) # get the species names
for (i in seq_along(penguin_data)) {
  # run analyse_data() on each df and store the results in the list
  results[[species_names[[i]]]] <- analyse_data(penguin_data[[i]])
  
}

# plot --------------------------------------------------------------------

par(mfrow=c(1,3)) # 3 panels
# Loop over each species in penguin_data
for (i in seq_along(penguin_data)) {
  # Use plot_data() to make a plot for each species
  plot_data(penguin_data[[i]], species_names[[i]])
}
par(mfrow=c(1,1)) # reset to 1 panel

3 Extra exercises

If you’ve completed all that and still want more to do, try this problem out.

3.1 Fizzbuzz

This is a classic in programming, and is (allegedly) often used as a question in coding interviews. The problem is simple: create a function that takes a number as an argument and returns "Fizz", "Buzz", or FizzBuzz. The rules to decide which to return are as follows:

If the number is a multiple of 3 the output should be "Fizz".
If the number given is a multiple of 5, the output should be "Buzz".
If the number given is a multiple of both 3 and 5, the output should be "FizzBuzz".
If the number is not a multiple of either 3 or 5, the input number should be returned.

For example, an input of 9 should return "Fizz", as should an input of 12. An input of 10 should return "Buzz". An input of 15 should return "FizzBuzz". An input of 11 should return 11.

Give it a go!

Show hint

# A very helpful operator for this is the modulus operator %%
# It returns the remainder from the division of two numbers
# For example:
10 %% 5 # returns 0 
10 %% 3 # returns 1

Show potential solution

fizzbuzz <- function(num) {
  if (num %% 3 == 0 & num %% 5 == 0) {
    return("FizzBuzz")
  } else if (num %% 3 == 0) {
    return("Fizz")
  } else if (num %% 5 == 0) {
    return("Buzz")
  } else {
    return(num)
  }
}

Now, write a for loop that runs fizzbuzz() on all numbers between 1 and 10000.

--- title: "Functions Exercises" subtitle: "Open R Sessions 2024" author: - Etka Yapar - "**Iain Moodie**" - Simon Jacobsen Ellerstrand - Violeta Caballero López - Ximena Alva Caballero resources: - "palmerpenguins.csv" format: html: code-fold: true code-tools: true code-overflow: wrap toc: true toc-depth: 4 self-contained: true anchor-sections: true smooth-scroll: true theme: light: flatly dark: darkly number-sections: true number-depth: 3 execute: echo: true warning: false eval: false --- Welcome to the functions exercises! We suggest you work in pairs or small groups for these exercises. In this session you will need to think creatively about how to solve each problem. There is going to be more than one write answer for each. Remember to annotate your code with comments to help you remember what you did and why. To write a comment in `R`, use the `#` symbol. Anything after the `#` symbol will be ignored by `R`. It's up to you how you want to organise your `R` scripts. We recommend you setup a new working directory for these exercises, as you will be working with multiple R scripts. # Simple functions In this section, you will write some functions so solve simple tasks. For this section **only** you can work in a single script, or save each function in seperate scripts. In later sections, we will practise a script-based workflow. For now, the goal is to get comfortable with writing functions. ## Farenheit to Celcius Imagine you're working with ecological data collected from around the world. One variable that was recorded was mean daily temperatures. However, some of the data was recorded in Farenheit and some in Celcius. Write a function that when given a temperature in Farenheit, converts it to Celcius. The formula to do so is: $$ ^{\circ}C=\frac{5}{9}(^{\circ}F-32) $$ Give your function a helpful name. Remember, the general formulation of a function is: ```{r} #| code-fold: false function_name <- function(input) { # some stuff my function does return(output) } ``` Once you've got a solution, try it out! Does $41^{\circ}F=5^{\circ}C$ ? Does you function still work if instead of a single temperature, you give a vector of temperatures as the input? ```{r} #| code-summary: "Show potential solution" fahrenheit_to_celsius <- function(temp_f) { temp_c <- (5/9) * (temp_f - 32) return(temp_c) } ``` ## Standard error In base `R`, there is no single function to calculate the standard error, like there is for the standard deviation, `sd()`. The standard error of a sample is defined as: $$ SE=\frac{\sigma}{N} $$ where $\sigma$ is the standard deviation of the sample, and $N$ is the number of samples. Write a function that takes a numeric vector as an input and returns the standard error. Try it on these vectors to see if it works as it should: ```{r} #| code-fold: false vec1 <- c(36, 1, 53, 2, 75, 74, 59, 28, 63, 87) vec2 <- c(39, 84, 8, 26, 4, 21, 29, 35, 17, 14) ``` The standard error for `vec1` and `vec2` should be `3.012124` and `2.273543` respectively. ```{r} #| code-summary: "Show potential solution" standard_error <- function(x) { sd(x)/length(x) # note that we don't always need to use return() } ``` ## Compare means Write a function that takes two vectors as inputs, `x` and `y`. Your function output should be a printed statement declaring which vector has a larger mean. You will likely need to use some `if()` statments here, and the `print()` function might come in helpful. Once you've got a function, try it out on these `vec1` and `vec2` from the previous problem. `vec1` has a larger mean than `vec2`. Does your function agree? What about with these next two vectors? ```{r} #| code-fold: false vec3 <- c(27, 71, 79, 20, 47, 89, 85, 44, 4, 58) vec4 <- c(84 ,18 ,75 ,36 ,96, 10, 82, 15, 39, 69) ``` `vec3` and `vec4` have the same mean. Can your function handle this scenario? If not, update it so it can. ```{r} #| code-summary: "Show potential solution" compare_means <- function(x, y) { if (mean(x) > mean(y)) { print("The mean of vector x is larger than the mean of vector y.") } else if (mean(x) < mean(y)) { print("The mean of vector y is larger than the mean of vector x.") } else { print("The means of vector x and vector y are equal.") } } ``` # Script-based workflow Save what you have done in **Section 1** and close those scripts. We are going to create a small script-based analysis of the `palmerpenguins` dataset. We will use a main script that will layout the structure of our analysis, that will import functions from separate scripts, all within the same folder. Our "pipeline" should: 1. Import and manipulate the dataset 2. Perform a statistical analysis 3. Make a figure Each of these steps should be done using one (or more) function(s). ## Download the dataset The dataset can be found on canvas or [here](palmerpenguins.csv), as a CSV file called `palmerpenguins.csv`. It's the same file we used in the plotting exercises. Download the dataset and put it in your working directory. ## Setting up your main script Create a new `.R` file, and name is `palmerpenguin_main.R`. This script will be where we layout the structure of our analysis. Using comments, give the script a title, write your name, and todays date, like the following: ```{r} #| code-fold: false # Palmer Penguins Dataset Analysis # Iain Moodie # 2023-11-02 ``` Make four sections, one for each part of the analysis pipeline, and an additional one where we will define (load) our functions. In Rstudio, you can use the shortcut `Command+Shift+R` or `Control+Shift+R` to create a new section. These sections are now also shown at the bottom left of the script window, and you can quickly navigate between them. It should look something like this: ```{r} #| code-fold: false # functions --------------------------------------------------------------- # import dataset ---------------------------------------------------------- # analysis ---------------------------------------------------------------- # plot -------------------------------------------------------------------- ``` Often, we put the command `rm(list=ls())` at the start of our main script to clear the environment, so that we can be sure our script contains all the information needed to run the analysis. Add it to yours. The main script is setup, so now let's work on our functions. ## `import_data()` Create a new `.R` file called `import_data.R` and save it in your working directory. In this file, we are going to make our import function. The function should take the filename of the palmerpenguins dataset as it's input, and return a list of dataframes. Breaking that down, the function should: - Import the CSV file into `R` - Remove any rows that have `NA` values in them - Split the dataframe into a list of dataframes, one for each species Try to write a function that does all this. ```{r} #| code-summary: "Show potential solution" import_data <- function(file_path) { # Import CSV file into R penguins <- read.csv(file_path) # Remove rows with NA values penguins <- na.omit(penguins) # Split dataframe into a list of dataframes, one for each species penguins_list <- split(penguins, penguins$species) return(penguins_list) } ``` Save the `import_data.R` file, and open up your main script `palmerpenguin_main.R`. In the `functions` section, we will now define (load) the function from `import_data.R` so that it can be used in the main script. We do that using `source()`. ```{r} #| code-fold: false #| eval: false # functions --------------------------------------------------------------- source("import_data.R") ``` You can check that the function is loaded using `ls()`. The function `import_data()` should show up if you have run the `source()` command. You can now add it to the `import dataset` section, to start building up your analysis. Does the whole script run (select everything, then run)? Next we will move onto the statistical analysis, where we will test if the mean mass of males and females differ significantly from each other in each species. ## `analyse_data()` Create a new `.R` file called `analyse_data.R` and save it in your working directory. This function should take a single species' dataframe as it's input, and output the results of a statistical test comparing the mean body masses of males and females. In detail, the function should: - Isolate the variables of interest - Split the data into two vectors, one of male body masses, and the other of female body masses. - Perform a t-test to test if the mean body mass of males and females differ, and return the results. We will then use a for loop in the main script to loop the function over each species. Don't worry about the stats here, that's not the goal of this course. To run a t-test in `R`, use the `t.test()` function. Have a look at the helpfiles if you are unsure with `?`. If you can produce the same outcome with less steps than above (it's possible), then go for it! ```{r} #| code-summary: "Show potential solution" analyse_data <- function(df) { # Isolate variables of interest body_mass <- df$body_mass_g sex <- df$sex # Split data into two vectors, one for male body masses and one for female body masses male_mass <- body_mass[sex == "male"] female_mass <- body_mass[sex == "female"] # Perform t-test to compare mean body masses of males and females t_test <- t.test(male_mass, female_mass) # Return results return(t_test) } ``` Save the `analyse_data.R` script. As before, go into your main script and define the function using `source()`. In the main script, in the analysis section, use your knowledge of `for()` loops to run your `analyse_data()` function over each dataset in the list returned by `import_data()`. You will need to save the results in a list, so you can access them after it has run. Naming each item in the list with the species name seems helpful too. If you get stuck, I show a potential solution as to what my script looked like at this point below. ```{r} #| code-summary: "Show an example main script" # Palmer Penguins Dataset Analysis # Iain Moodie # 2023-11-02 rm(list=ls()) # functions --------------------------------------------------------------- source("import_data.R") source("analyse_data.R") # import dataset ---------------------------------------------------------- penguin_data <- import_data("palmerpenguins.csv") # analysis ---------------------------------------------------------------- # Loop over each species in penguin_data results <- list() # create an empty list to store the results species_names <- names(penguin_data) # get the species names for (i in seq_along(penguin_data)) { results[[species_names[[i]]]] <- analyse_data(penguin_data[[i]]) # run analyse_data() on each df and store the results in the list } # plot -------------------------------------------------------------------- ``` ## `plot_data()` The final step is to make a function that will produce a figure that illustrates our results. Again, create a new script called `plot_data.R`, and write a function which produces a plot that you think would illustrate these results. This one is up to you to decide what should be an argument, but the output needs to be at least one plot that can demonstrate the sexual dimorphism in each of these three penguin species. ```{r} #| code-summary: "Show potential solution" plot_data <- function(df, species_name) { boxplot(body_mass_g ~ sex, data = df, main = paste("Body Mass by Sex for", species_name)) } ``` Once you're happy with it, add it to your main script like before. ## Finishing up If you select your whole script and run it all , does it work as intended? If so, that's brilliant! You've written a full analysis, start to finish, using a script-based workflow. Congratulations! Make sure all your scripts are well commented and saved. That's all for today. See you at the next session! ```{r} #| code-summary: "Show an example completed main script" # Palmer Penguins Dataset Analysis # Iain Moodie # 2023-11-02 rm(list=ls()) # functions --------------------------------------------------------------- source("import_data.R") source("analyse_data.R") source("plot_data.R") # import dataset ---------------------------------------------------------- penguin_data <- import_data("palmerpenguins.csv") # analysis ---------------------------------------------------------------- # Loop over each species in penguin_data results <- list() # create an empty list to store the results species_names <- names(penguin_data) # get the species names for (i in seq_along(penguin_data)) { # run analyse_data() on each df and store the results in the list results[[species_names[[i]]]] <- analyse_data(penguin_data[[i]]) } # plot -------------------------------------------------------------------- par(mfrow=c(1,3)) # 3 panels # Loop over each species in penguin_data for (i in seq_along(penguin_data)) { # Use plot_data() to make a plot for each species plot_data(penguin_data[[i]], species_names[[i]]) } par(mfrow=c(1,1)) # reset to 1 panel ``` # Extra exercises If you've completed all that and still want more to do, try this problem out. ## Fizzbuzz This is a classic in programming, and is (allegedly) often used as a question in coding interviews. The problem is simple: create a function that takes a number as an argument and returns `"Fizz"`, `"Buzz"`, or `FizzBuzz`. The rules to decide which to return are as follows: - If the number is a multiple of 3 the output should be `"Fizz"`. - If the number given is a multiple of 5, the output should be `"Buzz"`. - If the number given is a multiple of both 3 and 5, the output should be `"FizzBuzz"`. - If the number is not a multiple of either 3 or 5, the input number should be returned. For example, an input of `9` should return `"Fizz"`, as should an input of `12`. An input of `10` should return `"Buzz"`. An input of `15` should return `"FizzBuzz"`. An input of `11` should return `11`. Give it a go! ```{r} #| code-summary: "Show hint" # A very helpful operator for this is the modulus operator %% # It returns the remainder from the division of two numbers # For example: 10 %% 5 # returns 0 10 %% 3 # returns 1 ``` ```{r} #| code-summary: "Show potential solution" fizzbuzz <- function(num) { if (num %% 3 == 0 & num %% 5 == 0) { return("FizzBuzz") } else if (num %% 3 == 0) { return("Fizz") } else if (num %% 5 == 0) { return("Buzz") } else { return(num) } } ``` Now, write a for loop that runs `fizzbuzz()` on all numbers between 1 and 10000.

1 Simple functions

1.1 Farenheit to Celcius

1.2 Standard error

1.3 Compare means

2 Script-based workflow

2.1 Download the dataset

2.2 Setting up your main script

2.3 import_data()

2.4 analyse_data()

2.5 plot_data()

2.6 Finishing up

3 Extra exercises

3.1 Fizzbuzz

2.3 `import_data()`

2.4 `analyse_data()`

2.5 `plot_data()`