Functions in R

Open R Sessions 2023

Violeta Caballero López
Laura Hildesheim
Simon Jacobsen Ellerstrand
Iain Moodie
Pedro Rosero

So far you’ve covered:

  • How to work with R and RStudio
  • Data types, data handling, and data visualisation with R
  • Boolean logic, for loops, and big data

Coming up:

  • Functions - what, how, and why? [Now]
  • The tidyverse [16th November]

Goals for this session

  • Understand what a function is, and when to use one
  • Develop some “best practices” for creating functions
  • An introduction to a script-based workflow in R

What is a function?

flowchart LR
A[Input] --> B{Function} --> C[Output]

Example: area of a circle

\[A = \pi r^2\]

flowchart LR
A[radius] --> B{"circle_area()"} --> C[area]

Example: area of a circle

circle_area <- function(radius) {

  area <- pi * radius^2
  
  return(area)

}

Result:

circle_area(radius = 60)
[1] 11309.73
circle_area(radius = 400)
[1] 502654.8
circle_area(radius = 0.5)
[1] 0.7853982
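
Because `*` and `^` work element-wise in R, the same function should also accept several radii at once. A minimal sketch, reusing the definition from the slide; the `radii` vector is just an illustrative input:

```r
# Same definition as on the slide
circle_area <- function(radius) {
  area <- pi * radius^2
  return(area)
}

# Illustrative input: several radii in one vector
radii <- c(60, 400, 0.5)

# One call returns one area per radius
areas <- circle_area(radius = radii)
areas
```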

But why?

Example: re-scaling vectors

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

Example: re-scaling vectors

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

rescale_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  rescaled <- (x - rng[1]) / (rng[2] - rng[1])
  return(rescaled)
}

df$a <- rescale_01(df$a)
df$b <- rescale_01(df$b)
df$c <- rescale_01(df$c)
df$d <- rescale_01(df$d)

Example: re-scaling vectors

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

rescale_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  rescaled <- (x - rng[1]) / (rng[2] - rng[1])
  return(rescaled)
}

for (col in names(df)) {
  df[[col]] <- rescale_01(df[[col]])
}
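
As an alternative to the for loop, the function can be applied to every column with `lapply()`. This is a sketch rather than the recommended way; the `set.seed()` call is an assumption added only to make the example reproducible.

```r
set.seed(1)  # assumed seed, only for reproducibility of this sketch

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

rescale_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  rescaled <- (x - rng[1]) / (rng[2] - rng[1])
  return(rescaled)
}

# lapply() calls rescale_01() once per column;
# assigning into df[] keeps df a data frame
df[] <- lapply(df, rescale_01)

# Every column should now run from 0 to 1
sapply(df, range)
```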

But why?

Three big advantages over using copy-and-paste:

  • You can give a function an evocative name that makes your code easier to understand.
  • As requirements change, you only need to update code in one place, instead of many.
  • You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).
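
To illustrate the second point, suppose a new requirement appears: a vector where every value is the same should rescale to 0 instead of NaN (division by zero). This requirement is hypothetical, invented for the sketch. With a function, only one definition changes, and every call to it picks up the new behaviour.

```r
# Hypothetical updated requirement: constant vectors map to 0, not NaN
rescale_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # min equals max, so avoid dividing by zero
    return(x - rng[1])
  }
  rescaled <- (x - rng[1]) / (rng[2] - rng[1])
  return(rescaled)
}

rescale_01(c(5, 5, 5))  # constant input
rescale_01(c(1, 2, 3))  # ordinary input behaves as before
```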

Function arguments (inputs)

Function arguments

mean_ci <- function(x, conf) {
  # calculate standard error
  se <- sd(x) / sqrt(length(x))
  # get alpha value
  alpha <- 1 - conf
  # return vector with lower and upper conf int
  return(mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2)))
}

sample <- rnorm(100)

mean_ci(x = sample, conf = 0.95)
[1] -0.2591254  0.1288432
  • No limit to number of arguments
  • There are two broad classes of arguments: data (what to compute on, e.g. x) and details (how to compute it, e.g. conf)
  • We can take objects from the R environment, and “copy” them into the function
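
The “copy” idea can be checked directly: changing an argument inside a function should leave the original object untouched. A small sketch; `add_one()` and `values` are hypothetical names used only for illustration.

```r
# Hypothetical helper that changes its argument internally
add_one <- function(x) {
  x <- x + 1  # modifies the function's own copy of x
  return(x)
}

values <- c(1, 2, 3)
result <- add_one(values)

values  # the original object is unchanged
result  # the returned copy has the new values
```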

Function arguments

  • Arguments can have default values
mean_ci <- function(x, conf = 0.95) {
  # calculate standard error
  se <- sd(x) / sqrt(length(x))
  # get alpha value
  alpha <- 1 - conf
  # return vector with lower and upper conf int
  return(mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2)))
}

sample <- rnorm(100)

mean_ci(x = sample)
[1] -0.2159416  0.1640188

Function arguments

  • Default values get overwritten if a value is provided
mean_ci(x = sample, conf = 0.99)
[1] -0.2756377  0.2237149
  • Arguments can be given without names if in the correct order
mean_ci(sample, 0.99)
[1] -0.2756377  0.2237149
  • But must be named if given in another order
mean_ci(conf = 0.99, x = sample)
[1] -0.2756377  0.2237149

A word of warning

f <- function(x) {
  z <- x + y
  return(z)
}

y <- 100

f(10)
[1] 110
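  • Since y is not defined inside the function, R will look in the environment where the function was defined (here, the global environment) and find y there
  • This is generally not advised, and a recipe for bugs: the result of f(10) silently changes whenever y changes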
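
One way to avoid this is to make everything the function needs an explicit argument. A minimal sketch of the same function with `y` passed in:

```r
# y is now an argument, so the function no longer depends on
# whatever happens to be defined outside it
f <- function(x, y) {
  z <- x + y
  return(z)
}

f(10, y = 100)
```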

Return values (output)

Return values

  • By default, a function returns the value of the last expression it evaluates, so an explicit return() is optional here:
f <- function(a, b) {
  a + b
}

f(4, 8)
[1] 12
f <- function(a, b) {
  return(a + b)
}

f(4, 8)
[1] 12

Return values

  • Anything after return() will not be evaluated
f <- function(a, b) {
  a_plus_b <- a + b
  return(a_plus_b)
  a_plus_b <- 0
}

f(4, 8)
[1] 12
  • This is most useful when you want to make your function return “early” instead of doing something complicated
  • e.g. if the arguments are of the wrong type
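
A sketch of an early return, assuming we want `f()` to give back `NA` when either argument is not numeric. That choice is an assumption for illustration; signalling an error with `stop()` is another common option.

```r
# Hypothetical variant of f() that returns early for non-numeric input
f <- function(a, b) {
  if (!is.numeric(a) || !is.numeric(b)) {
    return(NA)  # leave the function here; nothing below is evaluated
  }
  a_plus_b <- a + b
  return(a_plus_b)
}

f(4, 8)    # normal path
f("4", 8)  # early return
```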

Return values

  • If you want to return multiple objects, put them in a list.
f <- function(a, b) {
  a_plus_b <- a + b
  return(list(a = a, b = b, a_plus_b = a_plus_b))
}

f(4, 8)
$a
[1] 4

$b
[1] 8

$a_plus_b
[1] 12
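
Once the list is returned, its elements can be picked out by name. A short sketch using the same `f()`; `result` is just an illustrative variable name.

```r
f <- function(a, b) {
  a_plus_b <- a + b
  return(list(a = a, b = b, a_plus_b = a_plus_b))
}

# Store the returned list, then extract elements by name
result <- f(4, 8)
result$a_plus_b  # with $
result[["a"]]    # with [[ ]]
```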

What should be a function?

What should be a function?

You should consider writing a function if:

  • you’ve copied and pasted a block of code more than twice
  • you plan to reuse the code in another project or with another dataset
  • you want to share your code for others to re-use
  • you want to break up your script into defined “chunks” for readability

How to decide the scope of a function?

  • A function should perform a well defined task (e.g. calculate confidence intervals)
  • Consider writing pseudo-code to figure out what the arguments and return values need to be, and what happens inside the function
function(sample) {
  # get standard error
  # get alpha value
  # get mean
  # use qnorm to get quantiles of normal dist
  # get ci with mean(sample) + se * qnorm(alpha)
  # return ci
}

How to decide the scope of a function?

function(sample) {
  # get standard error
  se <- sd(sample) / sqrt(length(sample))
  # get alpha value
  # get mean
  # use qnorm to get quantiles of normal dist
  # get ci with mean(sample) + se * qnorm(alpha)
  # return ci
}
  • Fill in your pseudo-code with real R code
  • Helpful to find issues before spending a lot of time on a function
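
Carrying on in the same way, one possible fully filled-in version might look like the sketch below. The pseudo-code does not say where alpha comes from, so a `conf` argument is assumed here, as in the earlier `mean_ci()` slides; the intermediate names `m`, `q` and `ci` are illustrative.

```r
# One possible completion of the pseudo-code (conf is an assumed argument)
mean_ci <- function(sample, conf = 0.95) {
  # get standard error
  se <- sd(sample) / sqrt(length(sample))
  # get alpha value
  alpha <- 1 - conf
  # get mean
  m <- mean(sample)
  # use qnorm to get quantiles of normal dist
  q <- qnorm(c(alpha / 2, 1 - alpha / 2))
  # get ci with mean(sample) + se * qnorm(alpha)
  ci <- m + se * q
  # return ci
  return(ci)
}

mean_ci(c(2, 4, 6, 8))
```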

Script-based workflow

Script-based workflow

flowchart LR
A[Function A] --> B{Main Script}
C[Function B] --> B
D[Function C] --> B
E(Data) --> B
B --> F(Output)

  • Useful where functions are likely to be reused multiple times
  • Goal is to save functions in one (or more) .R files, and then load them into the “main” script that defines your analysis

Script-based workflow

# functions.R:
my_function <- function(a, b) {
  a_plus_b <- a + b
  return(list(a = a, b = b, a_plus_b = a_plus_b))
}
  • To define (load) all functions within a .R file, use source()
source("functions.R")
ls()
[1] "my_function"
my_function
function (a, b) 
{
    a_plus_b <- a + b
    return(list(a = a, b = b, a_plus_b = a_plus_b))
}
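
To try the whole round trip in one go, the sketch below writes a small functions file, sources it, and calls the function. Writing the file to `tempdir()` is an assumption made so the example does not touch your project folder; in a real project `functions.R` would sit next to your main script.

```r
# Create a functions file in a temporary location (assumed path)
functions_file <- file.path(tempdir(), "functions.R")
writeLines(c(
  "my_function <- function(a, b) {",
  "  a_plus_b <- a + b",
  "  return(list(a = a, b = b, a_plus_b = a_plus_b))",
  "}"
), functions_file)

# "Main script": load the functions, then use them
source(functions_file)
result <- my_function(4, 8)
result$a_plus_b
```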

Script-based workflow

  • It’s not the only way to work in R, and other methods might be more suitable for you (e.g. coding “notebooks” like Quarto/Jupyter)
  • Benefits might not seem obvious now, but will pay off in the future, especially if your projects get big
  • Makes integrating with version control tools like git very clean and useful

Script-based workflow

Any questions before the exercises?

Exercises

  1. Write your own functions to solve simple but repetitive tasks
  2. Set up a script-based workflow

You can also work on previous exercises, or your own work. We are here to help with anything R related!

Exercise session will be in Heden.

Thanks!