Open R Sessions 2023

Authors

Violeta Caballero López

Laura Hildesheim

Simon Jacobsen Ellerstrand

Iain Moodie

Pedro Rosero

Welcome to the functions exercises! We suggest you work in pairs or small groups for these exercises. In this session you will need to think creatively about how to solve each problem. There is going to be more than one write answer for each.

Remember to annotate your code with comments to help you remember what you did and why. To write a comment in R, use the # symbol. Anything after the # symbol will be ignored by R.

It’s up to you how you want to organise your R scripts. We recommend you setup a new working directory for these exercises, as you will be working with multiple R scripts.

1 Simple functions

In this section, you will write some functions so solve simple tasks. For this section only you can work in a single script, or save each function in seperate scripts. In later sections, we will practise a script-based workflow. For now, the goal is to get comfortable with writing functions.

1.1 Farenheit to Celcius

Imagine you’re working with ecological data collected from around the world. One variable that was recorded was mean daily temperatures. However, some of the data was recorded in Farenheit and some in Celcius. Write a function that when given a temperature in Farenheit, converts it to Celcius. The formula to do so is:

\[ ^{\circ}C=\frac{5}{9}(^{\circ}F-32) \]

Give your function a helpful name. Remember, the general formulation of a function is:

function_name <- function(input) {
  # some stuff my function does
  return(output)
}

Once you’ve got a solution, try it out! Does \(41^{\circ}F=5^{\circ}C\) ? Does you function still work if instead of a single temperature, you give a vector of temperatures as the input?

Show potential solution
fahrenheit_to_celsius <- function(temp_f) {
  temp_c <- (5/9) * (temp_f - 32)
  return(temp_c)
}

1.2 Standard error

In base R, there is no single function to calculate the standard error, like there is for the standard deviation, sd(). The standard error of a sample is defined as:

\[ SE=\frac{\sigma}{N} \]

where \(\sigma\) is the standard deviation of the sample, and \(N\) is the number of samples.

Write a function that takes a numeric vector as an input and returns the standard error. Try it on these vectors to see if it works as it should:

vec1 <- c(36, 1, 53,  2, 75, 74, 59, 28, 63, 87)
vec2 <- c(39, 84, 8, 26, 4, 21, 29, 35, 17, 14)

The standard error for vec1 and vec2 should be 3.012124 and 2.273543 respectively.

Show potential solution
standard_error <- function(x) {
  sd(x)/length(x) # note that we don't always need to use return()
}

1.3 Compare means

Write a function that takes two vectors as inputs, x and y. Your function output should be a printed statement declaring which vector has a larger mean. You will likely need to use some if() statments here, and the print() function might come in helpful. Once you’ve got a function, try it out on these vec1 and vec2 from the previous problem.

vec1 has a larger mean than vec2. Does your function agree? What about with these next two vectors?

vec3 <- c(27, 71, 79, 20, 47, 89, 85, 44, 4, 58)
vec4 <- c(84 ,18 ,75 ,36 ,96, 10, 82, 15, 39, 69)

vec3 and vec4 have the same mean. Can your function handle this scenario? If not, update it so it can.

Show potential solution
compare_means <- function(x, y) {
  if (mean(x) > mean(y)) {
    print("The mean of vector x is larger than the mean of vector y.")
  } else if (mean(x) < mean(y)) {
    print("The mean of vector y is larger than the mean of vector x.")
  } else {
    print("The means of vector x and vector y are equal.")
  }
}

2 Script-based workflow

Save what you have done in Section 1 and close those scripts. We are going to create a small script-based analysis of the palmerpenguins dataset. We will use a main script that will layout the structure of our analysis, that will import functions from separate scripts, all within the same folder. Our “pipeline” should:

  1. Import and manipulate the dataset
  2. Perform a statistical analysis
  3. Make a figure

Each of these steps should be done using one (or more) function(s).

2.1 Download the dataset

The dataset can be found on canvas, as a CSV file called palmerpenguins.csv. It’s the same file we used in the plotting exercises. Download the dataset and put it in your working directory.

2.2 Setting up your main script

Create a new .R file, and name is palmerpenguin_main.R. This script will be where we layout the structure of our analysis. Using comments, give the script a title, write your name, and todays date, like the following:

# Palmer Penguins Dataset Analysis
# Iain Moodie
# 2023-11-02

Make four sections, one for each part of the analysis pipeline, and an additional one where we will define (load) our functions. In Rstudio, you can use the shortcut Command+Shift+R or Control+Shift+R to create a new section. These sections are now also shown at the bottom left of the script window, and you can quickly navigate between them. It should look something like this:

# functions ---------------------------------------------------------------


# import dataset ----------------------------------------------------------


# analysis ----------------------------------------------------------------


# plot --------------------------------------------------------------------

Often, we put the command rm(list=ls()) at the start of our main script to clear the environment, so that we can be sure our script contains all the information needed to run the analysis. Add it to yours.

The main script is setup, so now let’s work on our functions.

2.3 import_data()

Create a new .R file called import_data.R and save it in your working directory. In this file, we are going to make our import function. The function should take the filename of the palmerpenguins dataset as it’s input, and return a list of dataframes.

Breaking that down, the function should:

  • Import the CSV file into R
  • Remove any rows that have NA values in them
  • Split the dataframe into a list of dataframes, one for each species

Try to write a function that does all this.

Show potential solution
import_data <- function(file_path) {
  # Import CSV file into R
  penguins <- read.csv(file_path)
  
  # Remove rows with NA values
  penguins <- na.omit(penguins)
  
  # Split dataframe into a list of dataframes, one for each species
  penguins_list <- split(penguins, penguins$species)
  
  return(penguins_list)
}

Save the import_data.R file, and open up your main script palmerpenguin_main.R. In the functions section, we will now define (load) the function from import_data.R so that it can be used in the main script. We do that using source().

# functions ---------------------------------------------------------------
source("import_data.R")

You can check that the function is loaded using ls(). The function import_data() should show up if you have run the source() command. You can now add it to the import dataset section, to start building up your analysis. Does the whole script run (select everything, then run)?

Next we will move onto the statistical analysis, where we will test if the mean mass of males and females differ significantly from each other in each species.

2.4 analyse_data()

Create a new .R file called analyse_data.R and save it in your working directory. This function should take a single species’ dataframe as it’s input, and output the results of a statistical test comparing the mean body masses of males and females.

In detail, the function should:

  • Isolate the variables of interest
  • Split the data into two vectors, one of male body masses, and the other of female body masses.
  • Perform a t-test to test if the mean body mass of males and females differ, and return the results.

We will then use a for loop in the main script to loop the function over each species.

Don’t worry about the stats here, that’s not the goal of this course. To run a t-test in R, use the t.test() function. Have a look at the helpfiles if you are unsure with ?. If you can produce the same outcome with less steps than above (it’s possible), then go for it!

Show potential solution
analyse_data <- function(df) {
  # Isolate variables of interest
  body_mass <- df$body_mass_g
  sex <- df$sex
  
  # Split data into two vectors, one for male body masses and one for female body masses
  male_mass <- body_mass[sex == "male"]
  female_mass <- body_mass[sex == "female"]
  
  # Perform t-test to compare mean body masses of males and females
  t_test <- t.test(male_mass, female_mass)
  
  # Return results
  return(t_test)
}

Save the analyse_data.R script. As before, go into your main script and define the function using source().

In the main script, in the analysis section, use your knowledge of for() loops to run your analyse_data() function over each dataset in the list returned by import_data(). You will need to save the results in a list, so you can access them after it has run. Naming each item in the list with the species name seems helpful too. If you get stuck, I show a potential solution as to what my script looked like at this point below.

Show an example main script
# Palmer Penguins Dataset Analysis
# Iain Moodie
# 2023-11-02

rm(list=ls())

# functions ---------------------------------------------------------------
source("import_data.R")
source("analyse_data.R")

# import dataset ----------------------------------------------------------

penguin_data <- import_data("palmerpenguins.csv")

# analysis ----------------------------------------------------------------

# Loop over each species in penguin_data
results <- list() # create an empty list to store the results
species_names <- names(penguin_data) # get the species names
for (i in seq_along(penguin_data)) {
  results[[species_names[[i]]]] <- analyse_data(penguin_data[[i]]) # run analyse_data() on each df and store the results in the list
  
}

# plot --------------------------------------------------------------------

2.5 plot_data()

The final step is to make a function that will produce a figure that illustrates our results. Again, create a new script called plot_data.R, and write a function which produces a plot that you think would illustrate these results. This one is up to you to decide what should be an argument, but the output needs to be at least one plot that can demonstrate the sexual dimorphism in each of these three penguin species.

Show potential solution
plot_data <- function(df, species_name) {
  boxplot(body_mass_g ~ sex, data = df, main = paste("Body Mass by Sex for", species_name))
}

Once you’re happy with it, add it to your main script like before.

2.6 Finishing up

If you select your whole script and run it all , does it work as intended? If so, that’s brilliant! You’ve written a full analysis, start to finish, using a script-based workflow. Congratulations! Make sure all your scripts are well commented and saved. That’s all for today. See you at the next session!

Show an example completed main script
# Palmer Penguins Dataset Analysis
# Iain Moodie
# 2023-11-02

rm(list=ls())

# functions ---------------------------------------------------------------
source("import_data.R")
source("analyse_data.R")
source("plot_data.R")

# import dataset ----------------------------------------------------------

penguin_data <- import_data("palmerpenguins.csv")

# analysis ----------------------------------------------------------------

# Loop over each species in penguin_data
results <- list() # create an empty list to store the results
species_names <- names(penguin_data) # get the species names
for (i in seq_along(penguin_data)) {
  # run analyse_data() on each df and store the results in the list
  results[[species_names[[i]]]] <- analyse_data(penguin_data[[i]])
  
}

# plot --------------------------------------------------------------------

par(mfrow=c(1,3)) # 3 panels
# Loop over each species in penguin_data
for (i in seq_along(penguin_data)) {
  # Use plot_data() to make a plot for each species
  plot_data(penguin_data[[i]], species_names[[i]])
}
par(mfrow=c(1,1)) # reset to 1 panel

3 Extra exercises

If you’ve completed all that and still want more to do, try this problem out.

3.1 Fizzbuzz

This is a classic in programming, and is (allegedly) often used as a question in coding interviews. The problem is simple: create a function that takes a number as an argument and returns "Fizz", "Buzz", or FizzBuzz. The rules to decide which to return are as follows:

  • If the number is a multiple of 3 the output should be "Fizz".
  • If the number given is a multiple of 5, the output should be "Buzz".
  • If the number given is a multiple of both 3 and 5, the output should be "FizzBuzz".
  • If the number is not a multiple of either 3 or 5, the input number should be returned.

For example, an input of 9 should return "Fizz", as should an input of 12. An input of 10 should return "Buzz". An input of 15 should return "FizzBuzz". An input of 11 should return 11.

Give it a go!

Show hint
# A very helpful operator for this is the modulus operator %%
# It returns the remainder from the division of two numbers
# For example:
10 %% 5 # returns 0 
10 %% 3 # returns 1
Show potential solution
fizzbuzz <- function(num) {
  if (num %% 3 == 0 & num %% 5 == 0) {
    return("FizzBuzz")
  } else if (num %% 3 == 0) {
    return("Fizz")
  } else if (num %% 5 == 0) {
    return("Buzz")
  } else {
    return(num)
  }
}

Now, write a for loop that runs fizzbuzz() on all numbers between 1 and 10000.