Introduction to R

Author

Iain R. Moodie

Published

April 23, 2025

What is R?

R is a powerful, open-source programming language specifically designed for statistical computing, data analysis, and visualization. For biologists, it offers an invaluable toolkit to analyse experimental results, manage large datasets (e.g., genomic or ecological data), and create publication-quality graphs. Unlike point-and-click software, R allows you to automate repetitive tasks, ensuring efficiency and reproducibility in your research. Its flexibility and extensive capabilities make it a staple in almost all fields within biology, in both academia and industry. A major reason for this is that R is free to use, and as such has a global community continually developing new tools and resources tailored to scientific research.

How do I use R?

R can be used in a number of ways. In the next exercise session, we will install R on your computer, along with RStudio, which is a friendly user interface for R. In this exercise, you will use R in your browser to explore its capabilities.

Note that once the webpage has loaded, you can edit the code in any of the boxes below (I strongly encourage you to do this!). Press the “Run code” button to run the code you have written. You will learn a lot through experimenting, and you can always reset the code box back to its original state with the “Start over” button.

R as a calculator

R, like most programming languages, can perform arithmetic operations. It follows the order of operations used in mathematics. If you want to review that, you can do so in Chapter 1 of Duthie (2025).

You can use the following operators to write equations in R:

  • + : Addition
  • - : Subtraction
  • * : Multiplication
  • / : Division
  • ^ or ** : Exponentiation
  • %% : Modulus (remainder from division)
  • %/% : Integer division
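As a quick illustration of the operators listed above, here is a minimal sketch. The numbers are arbitrary and chosen only for this example:

```r
9 + 4      # addition: 13
9 - 4      # subtraction: 5
9 * 4      # multiplication: 36
9 / 4      # division: 2.25
2 ^ 5      # exponentiation: 32
2 ** 5     # same as 2 ^ 5: 32
9 %% 4     # modulus, the remainder after dividing 9 by 4: 1
9 %/% 4    # integer division, how many whole times 4 fits into 9: 2

# Order of operations: multiplication happens before addition
2 + 3 * 4      # 14
(2 + 3) * 4    # brackets are evaluated first: 20
```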

Use these to solve the questions below:

Fill in the blank so that the result of the sum is 10. You need to delete the ______ and replace it with a number.


Fill in the blank so that the result of the sum is 12.


Fill in the blank so that the result of the sum is 81.

Programming concepts

While you do not need to be an experienced computer programmer to use R, there is still a set of basic programming concepts that new R users need to understand. We will cover these first. You do not need to memorise these things.

Objects

In R, data can be stored in objects. An object can be thought of as a container that holds data. You can create an object by assigning a value to a name using the assignment operator <-. In the example below, I assign the value 5 to the object x, and the value 10 to the object y. We can then perform maths or other operations using these objects. Calculate the sum of x and y using + on the line below.

x <- 5
y <- 10
x + y

Add a third object called z and assign it the value 12. Write a math equation that will output the value 24, using x, y, and z only.

x <- 5
y <- 10
z <- 12

y / x * z

Objects can hold any sort of data in R. It could be a single value like in the above example, multiple values, text, a whole dataset, or a plot.

Data types

In R, data can come in various types, and it’s important to understand these types to manipulate and analyse data effectively. Here are some of the most common data types in R:

  • Numeric: Represents numbers and can be either integers or floating-point numbers. For example, 42 and 3.14 are numeric values.
  • Character: Represents text or string data. Character values are enclosed in quotes, such as "Hello, world!".
  • Logical: Represents boolean values, which can be either TRUE or FALSE.
  • Factor: Used to represent categorical data. Factors are useful for storing data that has a fixed number of unique values, such as “Species A” and “Species B” for species ID.
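One way to check the type of a value is the base R function class(). The sketch below reuses the example values from the list above; the object name species is an assumption made for illustration:

```r
class(42)                # "numeric"
class(3.14)              # "numeric"
class("Hello, world!")   # "character"
class(TRUE)              # "logical"

# A factor built from text labels (hypothetical species IDs)
species <- factor(c("Species A", "Species B", "Species A"))
class(species)           # "factor"
levels(species)          # the unique categories: "Species A" "Species B"
```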

Vectors

Vectors are one of the most basic data structures in R. A vector is a sequence of data elements of the same basic type. We will sometimes directly use vectors in this course, so it will be good to be familiar with them.

Creating Vectors: You can create a vector using the c() function, which stands for “combine” or “concatenate”. For example, here I create 3 vectors, and assign them to different objects:
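A minimal sketch of what creating three vectors might look like. The name character_vector is used in the text; the names numeric_vector and logical_vector, and all of the values, are assumptions chosen for illustration:

```r
# Each vector holds values of a single type
numeric_vector   <- c(1.5, 2.5, 3.5)
character_vector <- c("apple", "banana", "cherry")
logical_vector   <- c(TRUE, FALSE, TRUE)

# Typing the name of an object prints its contents
numeric_vector
character_vector
logical_vector
```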

Accessing Elements: You can access an element of a vector by its position using square brackets []. For example, to access the second element of character_vector:
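A minimal sketch, assuming character_vector holds three hypothetical fruit names:

```r
# Hypothetical contents, chosen for illustration
character_vector <- c("apple", "banana", "cherry")

character_vector[2]   # the second element: "banana"
```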

Note that in R, the first position is [1], not [0] like in some programming languages.

Vector Operations: You can perform operations on vectors. These operations are applied element-wise. For example:
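A minimal sketch, assuming a hypothetical numeric_vector that we multiply by 2:

```r
# Hypothetical values, chosen for illustration
numeric_vector <- c(1, 2, 3)

numeric_vector * 2   # 2 4 6
```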

Note that every value in the vector gets multiplied and returned.

Vector Length: You can find the length of a vector (the number of values it contains) using the length() function:
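A minimal sketch, again assuming a hypothetical numeric_vector:

```r
# Hypothetical values, chosen for illustration
numeric_vector <- c(1, 2, 3)

length(numeric_vector)   # 3
```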

Dataframes

Dataframes are like spreadsheets. They have rows and columns, and all columns are the same length. These are the primary way we will represent data in this course.

species   mass_g  sex
blue_tit  9.1     male
blue_tit  10.6    male
sparrow   27.3    female
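One possible way to build the table above in R is with the base function data.frame(), where each column is a vector of the same length. This is only a sketch; the object name birds is an assumption made for illustration:

```r
# Each argument becomes a column; all columns have 3 values
birds <- data.frame(
  species = c("blue_tit", "blue_tit", "sparrow"),
  mass_g  = c(9.1, 10.6, 27.3),
  sex     = c("male", "male", "female")
)

birds         # print the dataframe
nrow(birds)   # number of rows: 3
ncol(birds)   # number of columns: 3
```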

We will come back to them soon.

Boolean and logical operators

Boolean operators are used to perform logical operations and return boolean values (TRUE or FALSE). We will use them in this course to describe our hypotheses. Here are the most common boolean operators in R:

  • Comparison Operators: These operators compare two values and return a boolean value.
    • == : Equal to
    • != : Not equal to
    • < : Less than
    • > : Greater than
    • <= : Less than or equal to
    • >= : Greater than or equal to

For example, this bit of code should evaluate to TRUE:

And this should be FALSE:
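As a minimal sketch of both cases, using arbitrary numbers chosen for illustration:

```r
5 > 3         # TRUE, because 5 is greater than 3
2 + 2 == 5    # FALSE, because 2 + 2 is 4, not 5
```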

Use the operators above to fill in the blanks below such that the code will evaluate to TRUE:

100 == 100

p <- 48

8 + p == 56

q <- 24
r <- 88

q + 65 > r

Any number greater than 64 will work in place of 65 in the last example.

We can now add in some logical operators:

  • Logical Operators: These operators are used to combine multiple boolean expressions.
    • & : Logical AND
    • | : Logical OR
    • ! : Logical NOT

For example, this bit of code should evaluate to TRUE, because both the first part 1 + 3 == 4 and the second part 5 >= 4 are TRUE:

Whereas this evaluates to FALSE, because only the first part is TRUE:

But if we change the & to an OR operator |, it evaluates to TRUE because at least one part of it is TRUE:
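The three cases described above can be sketched as follows. The first expression uses the two parts named in the text; the second part of the other two expressions (5 <= 4) is an assumption, chosen so that it evaluates to FALSE:

```r
(1 + 3 == 4) & (5 >= 4)   # TRUE: both parts are TRUE
(1 + 3 == 4) & (5 <= 4)   # FALSE: only the first part is TRUE
(1 + 3 == 4) | (5 <= 4)   # TRUE: at least one part is TRUE

# The NOT operator flips a boolean value
!(5 <= 4)                 # TRUE
```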

Use the operators above to fill in the blanks below such that the code will evaluate to TRUE:

fruit_a <- "apple"
fruit_b <- "banana"

(fruit_a != fruit_b) & (1.5 > 1.2)

OR (|) would also work here in place of &.

fruit_a <- "apple"
fruit_b <- "banana"

(fruit_a == fruit_a) | (35 + 12 > 47)

Functions

Functions perform tasks in R. Functions can take inputs, called arguments, and return outputs. We put the arguments inside the brackets. For example, in R there is a function called mean(). This function’s first argument x should be a vector of numeric data. The function then outputs the mean as a single numeric value. For example, here we assign a vector of tree heights (cm) to an object called trees. We then calculate the mean tree height using the mean() function.
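A minimal sketch of this, assuming some hypothetical tree heights in cm (the values are made up for illustration):

```r
# Hypothetical tree heights in cm
trees <- c(310, 275, 402, 350, 288)

# Name the argument explicitly: x is the vector we want the mean of
mean(x = trees)   # 325
```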

Note that if we are going to supply arguments in the order that the function expects them, we do not have to tell the function which object is for each argument. Since mean() expects the first argument to be the vector you want the mean of, we can also write:
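A minimal sketch of the shorter form, again assuming hypothetical tree heights in cm:

```r
# Hypothetical tree heights in cm
trees <- c(310, 275, 402, 350, 288)

# The first argument is matched to x by position, so this is the same as mean(x = trees)
mean(trees)   # 325
```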

To find out what a function can do, and its arguments, you can write ?function_name, and the R helpfile will be returned for that function (e.g., ?mean). These helpfiles can be confusing at first, but the more you use R, the more they will make sense.

We will work with functions a lot in this course, so don’t worry if it still seems confusing.

Pipes

One of the final concepts I will introduce is the pipe operator |>. Note that you will often see it written as %>% when searching online. This is for historical reasons (R by default did not have a pipe operator until recently, so people had made their own). |> comes with R by default now, while %>% requires you to load a package called magrittr first (we will cover packages soon).

Pipes allow you to write code in a way that often makes more sense to people, especially non-programmers. To explain, here’s an example. Note that this is not real code, so you cannot run it.

Say I wanted to run 3 different functions on a dataframe called my_data. The functions are function_1(), function_2(), and function_3(). Imagine function_1() first transforms my data into the right scale, function_2() then performs a statistical test, and function_3() then makes a plot (again, these are not real functions, just for the example).

I could write that in a few ways. The first way would look like this:

my_data_1 <- function_1(my_data)
my_data_2 <- function_2(my_data_1)
my_data_final <- function_3(my_data_2)

  1. The original data, my_data, is passed to function_1(), and the result is stored in my_data_1.
  2. The transformed data, my_data_1, is then passed to function_2(), and the result is stored in my_data_2.
  3. Finally, the data from my_data_2 is passed to function_3(), and the result is stored in my_data_final.

While this method is quite clear to read, it creates a lot of objects that we might not want to do anything with. This is not a huge issue, but could become one if you are working with very large data sets.

We could also write it like this:

my_data_final <- function_3(function_2(function_1(my_data)))

We can wrap functions within functions to put this whole operation on one line. This gets rid of those extra objects, having only a my_data_final as the output. However, the order in which the functions are written no longer matches the order in which they are run. In the above example, function_1() runs first, then function_2(), then function_3(). But they are written in reverse order when we read it left to right.

A final method of writing this makes use of pipes |>, and has the best of both approaches:

my_data_final <- my_data |> function_1() |> function_2() |> function_3()

Pipes also allow us to spread our code over multiple lines, and the |> will look for the next bit of code on the next line if nothing comes after it:

my_data_final <- 
  my_data |> 
  function_1() |> 
  function_2() |> 
  function_3()

All the above examples produce the same my_data_final output, but are just written in different ways. Because the result is identical in each case, the main benefit is how readable your code is.
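Since function_1(), function_2(), and function_3() are not real, here is a runnable sketch of the same three styles using the base R functions sqrt(), sum(), and round() as stand-ins. The object names and values are assumptions chosen for illustration, and the |> operator requires R 4.1 or later:

```r
# Hypothetical input values
values <- c(1, 4, 9, 16)

# Style 1: intermediate objects
step_1   <- sqrt(values)
step_2   <- sum(step_1)
result_a <- round(step_2)

# Style 2: nested functions (written in reverse order of execution)
result_b <- round(sum(sqrt(values)))

# Style 3: pipes (written in the order of execution)
result_c <- values |> sqrt() |> sum() |> round()

# All three styles give the same result
identical(result_a, result_b)   # TRUE
identical(result_a, result_c)   # TRUE
```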

In this course, we will use pipes extensively, along with a set of packages that are designed for this kind of workflow. Below, rewrite the examples to use pipes. You can check the solutions tab to see if you are on the right track:

trees |> mean()

Take the trees vector, and then pipe (|>) it into the mean() function.

The log() function performs a natural logarithm transformation of the data.

trees |>
  log() |>
  mean()

Take the trees vector, pipe (|>) it into the log() function, and then into the mean() function.

Packages

An R package is a set of functions, data and/or information that someone else has written, that you can first load, then use in your own R code. Packages are written by other R users, and distributed for free via repositories, like The Comprehensive R Archive Network (CRAN).

R packages are often used to save you time. While you could write the functions in an R package yourself, why bother? If someone else has done it already and shared it, fantastic! In this course, we are going to use two package “families”: tidyverse and tidymodels. Note that both start with tidy. Remember from the lecture that tidy refers to a particular format of data; these packages all assume your data will be in that format, and will always return data in that format. They are also all built with pipes in mind, and are designed to make complex programming tasks very easy, especially those common in data science, which much of biology fits into well. We will cover these packages in detail soon, but to use them you need to do two things:

  1. Install the package. This needs to be done once on your computer, using the install.packages() function. For example:
install.packages("ggplot2")

This will install ggplot2, a package for plotting data. It will install it from CRAN by default, and probably (assuming you are in Sweden) will be downloaded from a server in Umeå.

  2. We now need to load the package, so that we can access it while we write code. To do that, we use the library() function.
library(ggplot2)

Note that we no longer require the quotation marks (") around the package name, although the function would still work if you did include them.

Below I have written some code that makes a plot using an inbuilt R dataset called iris using the package ggplot2. But if you try to run it, you will get an error. The ggplot2 package has already been installed, so fix the code by loading the ggplot2 package before the code that makes the plot.

library(ggplot2)

iris |>
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

Make sure to load the ggplot2 package before calling the ggplot() function. Code is always executed top to bottom.

That was a lot of concepts in a very short amount of time! Take a well deserved break before the next exercise.