Open R Sessions 2023

Authors

Violeta Caballero López

Laura Hildesheim

Simon Jacobsen Ellerstrand

Iain Moodie

Pedro Rosero

Welcome to the tidyverse exercises! We suggest you work in pairs or small groups for these exercises. Throughout this session, you should try to use packages from the tidyverse to solve most problems, but combining base R solutions with tidyverse ones is also fine. Use pipes (|>) where appropriate. All datasets mentioned can be found on Canvas.

Make sure your R version is 4.1.0 or newer; you can check it with the R.Version() command. If it is older, you will need to update R to have access to the base R pipe.
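As a rough sketch, the version check can also be done in code. getRversion() returns a version object that can be compared directly with a version string; the variable name has_native_pipe is only an illustration.

```r
# Compare the running R version with 4.1.0, the release that introduced |>
has_native_pipe <- getRversion() >= "4.1.0"
has_native_pipe

# On R 4.1.0 or newer, these two calls are equivalent
sqrt(16)
16 |> sqrt()
```

On an older R version the last line will not even parse, which is itself a sign that an update is needed.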

Next, install the tidyverse meta package. This installs all of the tidyverse packages at the same time.

install.packages("tidyverse")

Now load the package.

library(tidyverse)
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Note the message that pops up when you load the tidyverse package. It shows the “core” packages that are loaded, and their versions. It also shows which functions from other packages are now masked by tidyverse functions of the same name. This is very helpful information, especially if you have loaded other packages prior to loading tidyverse.
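If you ever need a specific version of a function that shares its name with another, you can name the package explicitly with ::. The short sketch below uses a toy data frame (values invented for illustration) to show that the two filter() functions do very different things.

```r
library(dplyr)

# Toy data frame (values invented for illustration)
df <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows that match a condition
kept <- dplyr::filter(df, x > 3)
kept

# stats::filter() is unrelated: it applies a linear filter to a series
# (here a 3-point moving average)
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```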

1 Bergmann’s crabs (but with tidyverse)

The Atlantic marsh fiddler crab, Minuca pugnax, lives in salt marshes throughout the eastern coast of the United States. Historically, M. pugnax was distributed from northern Florida to Cape Cod, Massachusetts, but, like other species, it has expanded its range northward due to ocean warming.

The pie_crab.csv data sample is from a study by Johnson and colleagues at the Plum Island Ecosystem Long Term Ecological Research site.

Data sampling overview:

  • 13 marshes were sampled on the Atlantic coast of the United States in summer 2016
  • Spanning > 12 degrees of latitude, from northeast Florida to northeast Massachusetts
  • Between 25 and 37 adult male fiddler crabs were collected from each marsh, and their sizes recorded

Data columns:

  • site: a string that identifies each site sampled
  • latitude: latitude for each site (does not change within site)
  • size: carapace width measurements (mm) for male crabs in the study
  • air_temp: mean air temperature for each site (does not change within site)
  • water_temp: mean water temperature for each site (does not change within site)

Using the readr package that is installed with tidyverse, read in the CSV file called pie_crab.csv.

pie_crab <- read_csv("pie_crab.csv")

Note the message that shows when you read in the file. Besides the number of rows and columns, it also tells us that we have 1 column of class character (chr), and 4 columns of class double (dbl).
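If you would rather state the column types yourself than rely on guessing, read_csv() accepts a col_types argument. The sketch below uses a small piece of literal CSV text (wrapped in I(), with invented values) in place of a real file, so the names toy_csv and toy are only illustrative.

```r
library(readr)

# Literal CSV text wrapped in I() stands in for a file; values are invented
toy_csv <- I("site,latitude,size\nA,30.0,12.5\nB,42.7,17.1\n")

toy <- read_csv(
  toy_csv,
  col_types = cols(
    site = col_character(),   # read site as text
    .default = col_double()   # read every other column as a number
  )
)

# spec() reports the column types that were used
spec(toy)
```

When col_types is supplied, the column specification message is no longer printed.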


For this dataset, use the tidyverse packages and pipes to do the following:

  1. Check the size column for outliers/mistakes, and remove any you find using filter().
  2. Produce a table that shows the mean size at each latitude, using group_by() and summarise().
  3. Use ggplot() to produce a figure that shows how size changes with latitude.
Potential solution:
# using a histogram to check for outliers
pie_crab |>
  ggplot(aes(x = size)) +
  geom_histogram()

# use filter to remove outliers, and replot to check
pie_crab_clean <-
  pie_crab |>
  filter(size < 300)

pie_crab_clean |>
  ggplot(aes(x = size)) +
  geom_histogram()

# make summarised dataset
pie_crab_means <-
  pie_crab_clean |>
  group_by(latitude) |>
  summarise(mean_size = mean(size))

# plot (many options)
pie_crab_means |>
  ggplot(aes(x = latitude, y = mean_size)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm")
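
A summary table is often easier to interpret if it also reports how many observations went into each mean. The sketch below shows one way to do that with n() and sd() inside summarise(); toy_crabs is a made-up stand-in for the cleaned crab data, not the real values.

```r
library(dplyr)

# Toy stand-in for the cleaned crab data (values invented for illustration)
toy_crabs <- data.frame(
  latitude = c(30, 30, 30, 42, 42),
  size = c(10, 12, 14, 18, 20)
)

toy_summary <-
  toy_crabs |>
  group_by(latitude) |>
  summarise(
    n = n(),                 # number of crabs at this latitude
    mean_size = mean(size),  # mean carapace width
    sd_size = sd(size)       # spread around the mean
  )

toy_summary
```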

2 Bison weights

This dataset (knz_records.csv and knz_ind.csv) provides age and weight records for the bison herd at Konza Prairie Biological Station in the USA, recorded from 1994 to 2020.

knz_records.csv is in a very annoying format, but one that you often encounter when first importing biological data. Each individual has their own row, and each column is a date on which the bison were weighed, with the name of the column as the date in the format year-month-day, and the values in the columns being weight in kg. The id column is a unique identifier for each bison.

2.1 Import and tidy

Start by reading in the knz_records.csv file and have a look at the format. Your first job is to take this dataset from a “wide” format to a “long” format using the tidyr package, and specifically the pivot_* functions. You are aiming for a dataset with a tidy structure, as shown in the table below:

id      date        weight_kg
bison1  2000-01-01  500
bison1  2001-01-01  600
bison2

Here, each row is a single observation, with columns id, date and weight_kg. Have a go yourself, and remember that using the ? help files and searching the internet effectively are important skills when programming, so if you need help, do that!

Once you’ve got the data in the right format, you should then remove all rows that have NA in the weight_kg column.

Potential solution:
knz_records <-
  read_csv("knz_records.csv") |>
  pivot_longer(!id, names_to = "date", values_to = "weight_kg") |>
  filter(!is.na(weight_kg))
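
As an alternative sketch, pivot_longer() can drop the missing values itself through its values_drop_na argument, which combines the reshaping and filtering steps. The example uses a small made-up wide table (toy_wide) in the same shape as the bison records, so it can be run without the course files.

```r
library(tidyr)

# Toy wide table: one row per bison, one column per weighing date
# (values invented for illustration)
toy_wide <- data.frame(
  id = c("bison1", "bison2"),
  `2000-01-01` = c(500, NA),
  `2001-01-01` = c(600, 450),
  check.names = FALSE  # keep the dates as column names unchanged
)

toy_long <-
  toy_wide |>
  pivot_longer(
    cols = !id,              # pivot every column except id
    names_to = "date",       # old column names go here
    values_to = "weight_kg", # cell values go here
    values_drop_na = TRUE    # skip cells that are NA
  )

toy_long
```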

Next, read in the knz_ind.csv file. It contains information about each individual. We want to use one of the dplyr *_join() functions to join this to the bison record dataset. Look at the helpfile for any of the *_join() functions (e.g. ?full_join()) to help you decide which of the functions you need to use here. We want to join the datasets by the id column. The final dataset should have 5 columns: id, date, weight_kg, sex and birth_year.

Potential solution:
knz_ind <- read_csv("knz_ind.csv")

knz_data <-
  left_join(
    x = knz_records, 
    y = knz_ind,
    by = join_by(id)
  )

knz_data
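
To see how the *_join() functions differ, it can help to try them on tiny tables where some ids do not match. The sketch below uses invented tables (toy_records and toy_ind); in the real data the ids may all match, in which case the joins would give the same rows.

```r
library(dplyr)

# Toy tables (values invented): bison3 has a record but no entry in toy_ind,
# and bison4 has an entry in toy_ind but no records
toy_records <- data.frame(
  id = c("bison1", "bison1", "bison3"),
  weight_kg = c(500, 600, 400)
)
toy_ind <- data.frame(
  id = c("bison1", "bison4"),
  sex = c("F", "M")
)

left  <- left_join(toy_records, toy_ind, by = join_by(id))   # keeps every row of x
inner <- inner_join(toy_records, toy_ind, by = join_by(id))  # keeps only matching ids
full  <- full_join(toy_records, toy_ind, by = join_by(id))   # keeps rows from both tables

# Compare how many rows each join returns
c(left = nrow(left), inner = nrow(inner), full = nrow(full))
```

Rows without a match are filled with NA in the columns that come from the other table.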

2.2 Transform

Now we want to calculate the age of the bison at each weighing date.

  1. First, separate() the date column into three columns: rec_year, rec_month and rec_day.
  2. Then, use the mutate() function to add a new column to the dataset called rec_age.
  3. Finally, arrange() the dataset by rec_age in descending order. Do this all in a single “step”, using pipes.
Potential solution:
knz_data <-
  knz_data |>
  separate(col = date, into = c("rec_year", "rec_month", "rec_day"), sep = "-") |>
  mutate(rec_age = as.numeric(rec_year) - birth_year) |>
  arrange(desc(rec_age))
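
Another possible approach, sketched here on invented data, is to parse the date with the lubridate package (attached with the core tidyverse in version 2.0.0) and extract the year directly. This keeps date as a proper Date column instead of splitting it into text pieces. The table toy_data is hypothetical.

```r
library(dplyr)
library(lubridate)

# Toy data in the same shape as the joined bison data (values invented)
toy_data <- data.frame(
  id = c("bison1", "bison1"),
  date = c("2000-01-01", "2001-06-15"),
  birth_year = c(1998, 1998)
)

toy_aged <-
  toy_data |>
  mutate(
    date = ymd(date),                # parse "year-month-day" text into a Date
    rec_year = year(date),           # extract the year as a number
    rec_age = rec_year - birth_year  # age in whole years at weighing
  ) |>
  arrange(desc(rec_age))

toy_aged
```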

2.3 Plot

Use ggplot2 to create a plot that illustrates the differences in growth trajectories for male and female bison. Your plot should also:

  • use colour to indicate sex
  • make use of transparency to better show the data
  • have a title, and clear x and y axis labels
  • have a legend, which should be at the bottom of the plot
  • use a theme other than the default one
  • (if you use points), use some “jitter” to better show the data points
Potential solution:
ggplot(data = knz_data, aes(x = rec_age, y = weight_kg)) +
  geom_point(aes(colour = sex), alpha = 0.3, position = position_jitter(width = 0.2, height = 0)) +
  labs(title = "Bison growth", x = "Bison age (years)", y = "Recorded weight (kg)") +
  theme_minimal() +
  theme(legend.position = "bottom")

Export your plot as a PDF with ggsave().

Potential solution:
ggsave(filename = "knz_bison_plot.pdf", width = 6, height = 4) # assumes the last plot you made was the one in the previous step
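
Relying on the last plot can go wrong if you have drawn something else in between. A slightly safer sketch is to store the plot in an object and pass it to the plot argument of ggsave(). The data, the object name toy_plot and the output path below are made up for illustration.

```r
library(ggplot2)

# Toy data (values invented for illustration)
toy_data <- data.frame(
  rec_age = c(1, 2, 3),
  weight_kg = c(200, 350, 450)
)

# Store the plot in an object instead of only printing it
toy_plot <-
  ggplot(toy_data, aes(x = rec_age, y = weight_kg)) +
  geom_point()

# Save that specific plot; a temporary folder is used here so nothing is overwritten
out_file <- file.path(tempdir(), "toy_bison_plot.pdf")
ggsave(filename = out_file, plot = toy_plot, width = 6, height = 4)
```

The width and height are in inches by default.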

3 Still want more?

The tidyverse is a huge project, and this has only been an introduction to some of its most commonly used parts. For a full course, we highly recommend the book R for Data Science by Hadley Wickham and co-authors.

If you still want more to do this session, ask one of the TAs and we will help!