Exercise 1: Data analysis using R

Author

Iain R. Moodie

Welcome to RStudio

Open RStudio

Just like you would any other program on your computer, open RStudio (not R!).

It should detect your R installation automatically, but if not, a window will open asking you to select it. If R does not appear here, I suggest you restart your computer first.

You should be met by a scene that looks like this:

A screenshot of a new RStudio installation on macOS.

Important settings

Before we go any further, we need to change some default settings in RStudio. If you know what you are doing and don’t like these suggestions, feel free to ignore them. If this is your first time using RStudio, then please follow them carefully. It will make following along easier.

Change the following settings

Go to Tools -> Global Settings, then:

  1. Go to the General tab.
    1. Un-tickRestore .RData into workspace at startup
    2. Set “Save workspace to RData on exit:” to Never.
  2. Go to the Code tab
    1. TickUse native pipe operator, l> (requires R 4.1+)
  3. Go to the RMarkdown tab
    1. Un-tickShow output inline for all R Markdown documents

While we are here, if you wanted to change the font size or theme, you can do that in the Appearance tab. RStudio also has screenreader support. You can enable that in the Accessibility tab.

Great, now that’s done, let’s move on.

RStudio layout

Rstudio is designed around a four panel layout. Currently you can see three of them. Follow the instructions to open an RMarkdown document, which will also reveal the fourth panel.

How to open an RMarkdown document

Go to File -> New file -> R markdown…

This will open a new window and you will be prompted to fill in some details about your new RMarkdown document.

  1. Give your document a title, such as "BIOC13 Exercise Session 1".
  2. Write your name as the author.
  3. Leave the rest of the options as default for now, and click OK.

This will open the RMarkdown document with some example text and code. If you followed the Getting familiar with R set of execises, you’re already familiar with Rmarkdown documents, as that page behaves exactly like one! They allow you to mix code “chunks” with text and images. Sort of like a powerful word document. We will use RMarkdown documents exclusively in this course.

Your screen should look like the screenshot below. Note we now have four panels.

A screenshot of RStudio on macOS with an RMarkdown file open, and the panels labelled.
  1. Source: This is where we edit code related documents. Anything you want to be able to save should be written here.
  2. Console: the console is where R lives. This is where any command you write in the source pane and run will be sent to be executed.
  3. Environments: this panel shows you objects loaded into R. For example, if you were to assign a value to an object (e.g.x <- 1), then it would appear here.
  4. Output: this panel has many functions, but is commonly used to navigate files, show plots, show a rendered RMarkdown file or to read the R help documentation.

RMarkdown

RMarkdown is a file format for making dynamic documents with R. It combines plain text with embedded R code chunks that are run when the document is rendered, allowing you to include results and your R code directly in the document. This makes it a powerful tool for creating reproducible analyses, which are extremely important in science.

The RMarkdown document you opened has some example text and code. An RMarkdown document consists of three main parts:

  1. YAML Header: This section, enclosed by --- at the beginning and end, contains metadata about the document, such as the title, author, date, and output format.

  2. Text: You can write plain text using Markdown syntax to format it. Markdown is a lightweight markup language with plain text formatting syntax, which is easy to read and write.

  3. Code Chunks: These are sections of R code enclosed by triple backticks and {r}. You can click the green arrow to run all the code in a code chunk, or run each line of code using the Run button, or by using Ctrl+Enter (Windows) or Cmd+Enter (macOS)When the document is rendered, the code is executed, and the results are included in the output.

Notice at the top left of the Source panel, there are two buttons: Source and Visual. These allow you to switch betwee two views of the RMarkdown document. The Source view is what you are looking at, and it is the raw text document. You can also use the Visual view, which allows you to work in a WYSIWYG (what you see is what you get) view, similar to Microsoft Office or other text editors. This “renders” your markdown code for you while you write. It also gives you a series of menus to help you format text, which means you do not need to learn how to write markdown code (although it is extremely simple, and you likely know some already).

Try switching between the Source and Visual modes

Use the toggle to see the document in both views. Which ever view you prefer (and you can switch as often as you like), the code part stays the same. It is primarily there for editing the text around your code.

Working directory

I strongly recommend you create a folder where you save all the work you do as part of this section of the course. I also recommend you make this folder in a part of your computer that is not being synced with a cloud service (iCloud, OneDrive, Google Drive, Dropbox, etc). These services can cause issues with RStudio. You can always back up your work at the end of a session.

Make a folder for the statistics section of BIOC13

Generally this is easier outside of RStudio.

  1. Make a folder for the statistics section of BIOC13.
  2. Make a folder within that folder for this exercise.

For example, this is exercise 1, so my main folder might be called bioc13, and within that folder I might make a folder called exercise_1.

You should make a new folder for each exercise we do. This will make it easy for you to stay organsied and submit work you do to me for feedback. It also makes your code reproducible by simply sending someone the contents of the folder in question.

We now want to set our working directory to this bioc13/exercise_1 folder. A working directory is the directory (folder) in a file system where a user is currently working. It is the default location where all your R code will be executed and where files are read from or written to unless specified otherwise.

Set your working directory

Within RStudio go to Session -> Set working directory -> Choose directory, then navigate to the folder you just made for this exercise.

Notice that now in your Output pane, in the files tab, you can see the contents of your folder (which is probably nothing currently). Let’s change that.

Saving your document

Let’s save this example RMarkdown document that RStudio has made for us. You do that exactly how you might expect.

Saving a document

Go to File -> Save, or use the floppy disc icon. Ensure you save it in your working directory with a descriptive name (e.g. exercise_1.Rmd).

The file should have appeared in your Output pane, with the extension .Rmd.

Let’s move onto working with some data!


A walkthrough analysis

Today we will work with a dataset called tephritis_phenotype.csv. The dataset comes from a study conducted at Lund University by Nilsson et al. (2022).

Background

The dataset describes morphological measurements (lengths, widths of different body parts) of the fly Tephritis conura. This species has specialised to utilise two different host plants (host_plant), Cirsium heterophyllum and C. oleraceum, and formed stable “host races”. Individuals of both host races were collected in both sympatry (where both Cirsium heterophyllum and C. oleraceum host plants co-occur) and allopatry (where only one Cirsium species occurs) (patry) from eight different populations in northern Europe (region) from both sides of the Baltic sea (baltic). Individuals were measured after having been hatched in a common lab environment. One female and one male (sex) from each bud was measured. The authors took magnified photographs of each individual, and of the wings of each individual.

Figure 1 from Nilsson et al. (2024): Sampling design, host plants, and traits investigated. a. Parallel sampling of allopatric and sympatric populations of the two host races of T. conura flies east and west of the Baltic. CH denotes the C. heterophyllum host race and CO denotes the C. oleraceum host race. b. Size measurements of T. conura. c. The ancestral host plant, C. heterophyllum. d. The derived host plant, C. oleraceum

Measured traits included the length of a wing (wing_length_mm), the width of a wing (wing_width_mm), the amount of the wing that was melanised (melanized_percent), the length of the body (body_length_mm) and the length of the ovipositor (ovipositor_length_mm).

Getting everything setup

Before we conduct the analysis, we need to download the dataset and move it to our working directory.

Download the dataset

You can download the dataset called tephritis_phenotype.csv from here.

Once downloaded, you should move it to your working directory folder for this exercise before continuing.

Next, let’s clean up our document.

Cleaning up the RMarkdown document

Using the RMarkdown file we generated at the start (that I called exercise_1.Rmd):

  1. Delete all the code and text that RStudio automatically generated, except the YAML header (the text at the start between the ---). You can do that as you would expect in any other text editor.

Now we have our blank RMarkdown file, let’s get started.

Installing R packages

In this exercise, we will use the tidyverse package, and the infer package. To install them you need to use the install.packages() function. Since we only need to do this once per computer, we should run this function directly in the Console panel.

Install the R packages on your computer

Type or copy the install function into the console, and press enter to run:

install.packages("tidyverse")
install.packages("infer")

From now on, we won’t write things directly in the Console, and instead write code in the RMarkdown document in the Source panel, which we then “Run” and send the Console.

Creating code cells

Code cells are where we write code in an RMarkdown document. This allows use to write normal text outside these sections.

Make a code cell

In your Source panel, in the RMarkdown document, add a R code cell. This is done in different ways depending on if you are using the visual view or the source view:

Go to Insert -> Executable Cell -> R.

Use three back-ticks (```) to mark the start and end of a code cell. Additionally at the start, we declare the language used by enclosing it in two curly brackets {r}.

```{r}

```

In both views, you can also use the shortcut Shift-Alt-I or Option-Command-I.

Loading R packages

After installing an R package, we need to tell R that we want to use the package in this document. We use the library() function to do that.

R code should always be executed “top to bottom”, so this bit of code should come right at the start.

Tell R that we want to use the packages we just installed

Inside that code cell you just made, use the library() function to load the tidyverse and infer packages:

library(tidyverse)
library(infer)

Then run the code. (To run code in the Source panel, you can click on the line you want to run, and then press the “Run” button. Or you can also use the keyboard shortcut Ctrl+Enter or Cmd+Return.)

If that worked, you will get a message that reads something similar to:

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Not all packages produce a message when they are loaded (for example, infer did not).

Writing text alongside code

Anywhere outside a code cell you can write normal text. In this course, you might find it helpful to write yourself notes alongside your code, so that you can come back to your notes during other exercises, or later in your studies.

Along side normal text, you can structure an RMarkdown document using headings.

Change the type of text you are typing in the menu at the top:

Use #s to indicate the level of the heading:

# Heading level 1
## Heading level 2
### Heading level 3

I leave it up to you to decide how and when to use headings and text.

Importing data into R

We will now load the tephritis_phenotype.csv data file that you downloaded earlier. A .csv file is a file that stores information in a table-like format with Comma Separated Values. A typical .csv file will look something like this:

species,height,n_flowers
persica,1.2,12
persica,1.5,18
banksiae,2.4,3
banksiae,1.7,8

.csv files are especially suited to storing data that can be used across a wide variety of programmes, as everything is stored as plain text.

Load the data into R

We will load the tephritis_phenotype.csv data file using the read_csv() function and assign it to an object named tephritis_data. To do that, first make a new code cell, then within it write:

1tephritis_data <- read_csv("tephritis_phenotype.csv")
1
Be sure to use quote marks around the file name.

If that worked, you should get the following message with some information about the data:

Rows: 579 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): region, host_plant, patry, sex, baltic
dbl (5): body_length_mm, ovipositor_length_mm, wing_length_mm, wing_width_mm...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This has loaded a copy of the data from tephritis_phenotype.csv into R. Notice that the object tephritis_data has also appeared in the Environment panel.

View the data in R

To view the data, click on the object tephritis_data in the Environment panel with your mouse.

This allows you to view the dataset like you would in a spreadsheet software like Microsoft Excel. Note however, there is no way to edit the data in this view. This is by design. Any editing of the data needs to be done in the RMarkdown document with code. That way, you can keep a record of any edits you make, without touching the original data file.

Exploring data

Let’s use a few more functions to get a better understanding of the dataset.

Print the data to the console

Make a new code cell, and write the following:

1print(tephritis_data)
1
We can also just simply write tephritis_data without the print statement, and we would get the same output.

Let’s unpack what has appeared in your console:

  • The first line of the output is telling us what sort of data is in the object. In this case, its a tibble, a way of storing data in a table/spreadsheet format.
  • That same line tells us the number of rows (579) and number of columns (10) in the table.
  • The next line tell us the variable types.
  • Then the first 10 rows of the dataframe have been printed, along with the column (variable) names at the top.
  • Below that, we get two important pieces of information, denoted by the small symbol:
    • The first tells us how many rows are not being shown here. Since by default 10 are shown, \(579-10=569\).
    • The second tells us how many columns (variables) are not being shown here, as there was not enough room to print them.
  • We can also see that some rows have NA values (not applicable). This is the symbol R uses to say no data present.

Let’s use a few other ways of viewing our data in the console:

Take a glimpse at the data

Make a new code cell, and write the following:

glimpse(tephritis_data)
  • glimpse() transposes the dataframe, with columns now running down the page. This view means we can see all the variables.

You can also use summary() to calculate some useful summary statistics of the dataset as a whole:

Calculate some summary statistics of your dataset

Make a new code cell, and write the following:

summary(tephritis_data)
Questions

Answer the following questions in text in your RMarkdown document. Use what you have learned from exploring the data so far, and the background to the study:

  1. What is the unit of observation in this dataset? In other words, what does each row represent?
  2. What populations (statistical use of the word) could we make inferences about using this data?
  3. Are there any NA values in the dataset? In which variable(s) and why might this be?

Answering a research question

One of the questions that Nilsson et al. (2022) were interested in was if there has been any morphological (size of body parts) divergence between the host races of these flies. In other words, do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in some predictable way?

Let’s use the variable ovipositor_length_mm to explore this research question.

Filtering the dataset

First, we should filter() our dataset to only contain females (as only females have an ovipositor). filter() is a function that let’s us write conditional statements that only allow rows that meet those conditions to “filter” through. For example, we can use filter() to only allow rows where sex == "female" to filter through, thereby creating a new dataset of just female flies.

Filter the dataset

Make a new code cell, and write the following:

1female_data <-
  tephritis_data |> 
2  filter(sex == "female")
1
I assigned the new dataset to a new object, called female_data.
2
Inside the filter() function, I say which rows should be allowed through the filter.

Now we have the rows we are interested in, let’s move on.

Exploratory plot

Often before we start any formal statistics, we want to plot our data.

R has a built in method to make plots, but here we will use a different approach. Instead we will use ggplot2, a plotting package that is installed with tidyverse. ggplot2 provides a clear and simple way to customise your plots. It is based in a data visualisation theory known as the grammer of graphics (Wilkinson 2013).

The grammer of graphics

The grammer of graphics gives us a way to describe any plot.

A ggplot2 plot has three essential components:

  • data: the dataset that contains the variables you want to plot
  • geom: the geometric object you want to use to display your data (e.g. a point, a line, a bar).
  • aes: aesthetic attributes that you want to map to your geometric object. For example, the x and y location of a point geometry could be mapped to two variables in your dataset, and the colour of those points could be mapped to a third.

ggplot2 uses a layered approach to the grammer of graphics. This makes it very easy to start contructing plots by putting together a “recipe” step-by-step. Let’s walk through an example.

In our case, we want to plot ovipositor_length_mm for each host_plant group. One example might be a boxplot:

A description of the data a boxplot shows
ggplot( 
1  data = female_data,
2  mapping = aes(x = host_plant, y = ovipositor_length_mm)
3  ) +
4  geom_boxplot()
1
Inside the ggplot() function, I specify the dataset as female_data
2
I map the host_plant column to the x axis aesthetic and the ovipositor_length_mm to the y axis aesthetic.
3
To add layers to a ggplot() object, we can use a +
4
I add a boxplot geometry, that will use the aesthetics I specified before.

Copy the code into a new code cell and run it.

Questions
  1. What does the graph tell you? Is there a difference between the host races? Write your answer in your RMarkdown document.

Calculating the observed statistic

Our research question is:

  • Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?

The question concerns a difference in a continuous variable between two groups (host_plant). Let’s start by exploring if there is a difference in the central tendancy of ovipositor_length_mm in the two groups. That is, does average ovipositor_length_mm differ between them?

A suitable statistic that captures our question could be a difference in means.

\[ \bar{x}_{\text{heterophyllum}} - \bar{x}_{\text{oleraceum}} \]

We are going to do that using the infer package we loaded at the start.

To use infer to calculate an observed statistic, we use the approach we covered in the lecture:

In code, that looks like this:

4obs_stat <-
1  female_data |>
2  specify(response = ovipositor_length_mm, explanatory = host_plant) |>
3  calculate(stat = "diff in means", order = c("heterophyllum", "oleraceum"))
1
Using the dataset we just made, we pipe |> it into the next line.
2
We specify() which columns we are interested in, and which is our response variable, and which is our explanatory variable. In this case, we want to explain ovipositor_length_mm using host_plant, and express that as a "diff in means".
3
We calculate() our chosen statisitc "diff in means", and we say that the subtraction should be "heterophyllum" minus "oleraceum".
4
We assign this to a new object called obs_stat.
Calculate the observed test statistic

Use the code above in a new code cell to calculate the observed statistic.

To see your observed statistic, simply print it by writing the name of the object.

obs_stat
Questions
  1. Describe in words what this observed statistic tells us. Is there a difference in mean ovipositor length? If so, which host race is bigger, and by how much?
  2. Does it make sense given the graph you made earlier?

Write your answer in your RMarkdown document.

Quantifying uncertainty

If we took another sample (that is, if we repeated the field work that Nilsson et al. (2022) did), it is unlikely that we would get exactly the same observed statistic, just due to chance. We want to quantify this effect. In this exercise, we will use a 95% confidence interval to do that.

A general approach to bootstrapping confidence intervals.

We have already calculated our observed statistic, so let’s generate a sampling distribution using bootstrapping.

Recall from the lecture that a bootstrap sample is generated from the original dataset by resampling (randomly selecting rows) it with replacement (we can select the same row more than once). To complete one bootstrap sample, we resample with replacement until we have the same number of rows of data that were in our original dataset.

We then calculate our observed statistic using the bootstrap sample. This can then be repeated many times to construct a sampling distribution, which show the range of statistics we think it would be possible to observe, if we had actually sampled our population again.

To do that in code, we write the following:

2sampling_dist <-
  female_data |> 
  specify(response = ovipositor_length_mm, explanatory = host_plant) |>
1  generate(reps = 10000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("heterophyllum", "oleraceum"))
1
Notice how this is almost identical to the code we used to get the observed statistic, but we have one extra step, where we generate 10000 bootstrapped datasets, then calculate the diff in means for all of them.
2
We assign this to a new object, called sampling_dist
Simulate and visualize a sampling distribution

Use the code above in a new code cell to get a bootstrap sampling distribution.

Again we can print it by just writing its name:

sampling_dist

infer has a helpful function to quickly see the distribution:

visualize(sampling_dist)
Questions
  1. What does the output of visualize(sampling_dist) show? How would you describe it to a friend who is not in this class?

Write your answer in your RMarkdown document.

Confidence intervals

We can use the percentile method to calculate a confidence interval. To do that we take the middle 95% of the bootstrapped sampling distribution. Again, infer has a helpful function to do this for us.

Calculate a 95% confidence interval

Use the code below in a new code cell to calculate your confidence interval.

conf_int <-
  get_confidence_interval(sampling_dist, level = 0.95, type = "percentile")

Again, print the object to see the confidence intervals:

conf_int

We can also visualise the confidence interval on our sampling distribution:

visualize(sampling_dist) +
  shade_confidence_interval(endpoints = conf_int)
Questions

Using our original research question:

Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?

Write a statement to address our original research question using your observed statistic and confidence intervals.

Hypothesis testing

While our confidence interval is one way to address this question, we can also address it through “null hypothesis testing”. A null hypothesis test tests if flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm more so than is expected by chance.

To do that we need to construct a null and alternative hypothesis.

Constructing a null hypothesis

A null hypothesis posits that the explanatory variable we test does NOT affect our data. We then test if our data is sufficiently different from the null hypothesis such that we can reject it.

A null hypothesis does not have be defined by a “zero effect”.

Constructing an alternative hypothesis

Opposite of the Null hypothesis (the explanatory variable affects our data). We normally think in terms of an alternative hypothesis

Questions

Using our original research question:

Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?

Write a null and alternative hypotheses in your document in text.

Generating data assuming the null hypothesis is true

As we did in the lecture, we will use a shuffling or permute method of creating a sampling distribution compatible with our null hypothesis.

To do that with infer, the code looks like this:

3null_dist <-
  female_data |> 
  specify(response = ovipositor_length_mm, explanatory = host_plant) |>
1  hypothesise(null = "independence") |>
2  generate(reps = 10000, type = "permute") |>
  calculate(stat = "diff in means", order = c("heterophyllum", "oleraceum"))
1
Again, the code looks very similar, but now we have added an extra hypothesise() step, where we declare our null hypothesis to be that the response and explanatory variables are independent of each other.
2
We generate 10000 permutated datasets under this hypothesis, and calculate the statistic for each dataset
3
Save as an object called null_dist
Simulate and visualize a null distribution

Use the code above in a new code cell to simulate a null distribution.

You can print by writing the name:

null_dist

The same command as we used before for the bootstrap sampling distribution can be used to visualise your null distribution. Use visualise() to show your null distribution.

Questions

Answer these questions in your document:

  1. What is shown when you visualise() the null distribution? What does that graph represent?

Answer in a way that an ecology student who is not in this class could understand.

Visualising and calculating a p-value

Recall that we want to answer the following question:

Are flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm more so than is expected by chance?

To help answer it, we can calculate and visualise a p-value, which is a probability of observing our original statistic, or one more extreme, assuming the null hypothesis is true.

Using our null distribution, we can take our definition above, and simple count the number of times we found an absolute difference in mean as great (or greater than) our observed test statistic, and divide that by the total number of statistics in our null distribution:

\[ p = \frac{\text{Number of times } |\text{null statistic}| \geq |\text{observed statistic}|}{\text{Total number of null statistics}} \]

This is a two-sided p-value.

Calculate and visualise the p-value

In a new code cell, calculate the p-value using the code below:

null_dist |>
1  get_p_value(obs_stat, direction = "two-sided")
1
Since we did not have a reason to think one particular host race would be bigger, we use a two-sided hypothesis test.

We can also visualise the p-value on the null distribution, to get a better idea of where that value comes from:

visualise(null_dist) +
  shade_p_value(obs_stat, direction = "two-sided")
Questions

With reference to your null and alternative hypotheses:

  1. Do you reject or fail to reject your null hypothesis?
  2. Does your conclusion align with what you found using the confidence interval approach?
  3. Summarise the difference in the type of question being answered when performing a null hypothesis test, vs using a confidence interval.

When you get to this stage, let the teacher know so that we can discuss your findings!

End of exercise.

References

Nilsson, K. J., M. Tsuboi, Ø. H. Opedal, and A. Runemark. 2024. Colonization of a Novel Host Plant Reduces Phenotypic Variation. Evolutionary Biology 51:269–282.
Nilsson, K., J. Ortega, M. Friberg, and A. Runemark. 2022. Morphological measurements of two host specialists of the dipteran Tephritis conura, both in sympatry and allopatry. Dryad.
Wilkinson, L. 2013. The Grammar of Graphics. Springer Science & Business Media.