Tidyverse Exercises

Open R Sessions 2024

Authors

Etka Yapar

Iain Moodie

Simon Jacobsen Ellerstrand

Violeta Caballero López

Ximena Alva Caballero

Welcome to the tidyverse exercises! We suggest you work in pairs or small groups for these exercises. Throughout this session, you should try to use packages from tidyverse to solve most problems, but combining base R solutions and tidyverse is also ok. Use pipes (|>) where appropriate. All datasets mentioned can be found on Canvas.

Make sure your R version is 4.1.0 or newer, which you can check with the R.Version() command. If it is older, you will need to update R to have access to the base R pipe.
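
A quick way to run this check from the console, sketched with base R only: getRversion() returns the running version as an object that can be compared against a version string, and the second part confirms that the base pipe works. The vector c(1, 4, 9) and the object names are just illustrative.

```r
# Compare the running R version against the minimum needed for the base pipe
has_base_pipe <- getRversion() >= "4.1.0"
has_base_pipe

# If the above is TRUE, this pipes the vector into sqrt() and then into sum()
total <- c(1, 4, 9) |> sqrt() |> sum()
total
```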

Next, install the tidyverse meta package. This installs all of the tidyverse packages at the same time.

install.packages("tidyverse")

Now load the package.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Note the message that appears when you load the tidyverse package. It lists the “core” packages that are attached, along with their versions. It also shows which functions from other packages are now masked by tidyverse functions of the same name (here, dplyr::filter() and dplyr::lag() mask the stats versions). This is helpful to know, especially if you loaded other packages before loading tidyverse.
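
To make the conflict message concrete, here is a small sketch using made-up data (the objects toy, kept and smoothed are hypothetical). Once tidyverse is loaded, a bare filter() call refers to the dplyr version, while the stats version can still be reached with the package::function() form.

```r
library(tidyverse)

# dplyr's filter() keeps the rows of a data frame that match a condition
toy <- tibble(x = 1:5)
kept <- toy |> filter(x > 3)
kept

# stats::filter() is a different function (linear filtering of a series);
# the explicit namespace prefix reaches it even though the bare name is masked
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

If you often mix packages with clashing names, writing the prefix explicitly is one simple way to keep the code unambiguous.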

1 Bergmann’s crabs (but with tidyverse)

The Atlantic marsh fiddler crab, Minuca pugnax, lives in salt marshes throughout the eastern coast of the United States. Historically, M. pugnax were distributed from northern Florida to Cape Cod, Massachusetts, but like other species have expanded their range northward due to ocean warming.

The pie_crab.csv data sample is from a study by Johnson and colleagues at the Plum Island Ecosystem Long Term Ecological Research site.

Data sampling overview:

13 marshes were sampled on the Atlantic coast of the United States in summer 2016
Spanning > 12 degrees of latitude, from northeast Florida to northeast Massachusetts
Between 25 and 37 adult male fiddler crabs were collected, and their sizes recorded

Data columns:

site: a string that identifies each site sampled
latitude: latitude for each site (does not change within site)
size: carapace width measurements (mm) for male crabs in the study
air_temp: mean air temperature for each site (does not change within site)
water_temp: a mean water temperature for each site (does not change within site)

Using the readr package that is installed with tidyverse, read in the CSV file called pie_crab.csv.

pie_crab <- read_csv("pie_crab.csv")

Note the message that shows when you read in the file. Besides the number of rows and columns, it also tells us that we have 1 column of class character (chr) and 4 columns of class double (dbl).
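
If you would rather state the column types yourself than rely on what readr guesses, read_csv() accepts a col_types argument. The sketch below does not use the real file: it reads a two-row inline CSV with made-up values, wrapped in I() so that readr treats the string as data rather than a file path (this assumes readr 2.0 or newer).

```r
library(tidyverse)

# Made-up stand-in for pie_crab.csv, with the same column names
csv_text <- "site,latitude,size,air_temp,water_temp
AAA,30.0,12.5,21.0,24.0
BBB,42.7,20.1,10.5,13.0"

toy_crab <- read_csv(
  I(csv_text),
  col_types = cols(
    site = col_character(),  # chr
    .default = col_double()  # dbl for every other column
  )
)

# spec() reports the column specification that was used
spec(toy_crab)
```

Supplying col_types also silences the column specification message, which can be handy in scripts that run unattended.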

For this dataset, use the tidyverse packages and pipes to do the following:

Check the size column for outliers/mistakes, and remove any you find using filter().
Produce a table that shows the mean size at each latitude, using group_by() and summarise().
Use ggplot() to produce a figure that shows how size changes with latitude.

Show potential solution

# using a histogram to check for outliers
pie_crab |>
  ggplot(aes(x = size)) +
  geom_histogram()

# use filter to remove outliers, and replot to check
pie_crab_clean <-
  pie_crab |>
  filter(size < 300)

pie_crab_clean |>
  ggplot(aes(x = size)) +
  geom_histogram()

# make summarised dataset
pie_crab_means <-
  pie_crab_clean |>
  group_by(latitude) |>
  summarise(mean_size = mean(size))

# plot (many options)
pie_crab_means |>
  ggplot(aes(x = latitude, y = mean_size)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm")
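
A histogram is one way to spot the outlier; a numeric check is another. This sketch uses a made-up tibble (toy_crab, with one deliberately implausible value) rather than the real data, and the 300 mm cutoff simply mirrors the solution above.

```r
library(tidyverse)

toy_crab <- tibble(
  latitude = c(30, 30, 42, 42),
  size = c(12.1, 13.4, 19.8, 2000)  # 2000 mm is a deliberate data-entry error
)

# Sort by size, largest first, so suspicious values appear at the top
toy_crab |> arrange(desc(size))

# Count how many rows each latitude keeps after filtering
toy_counts <-
  toy_crab |>
  filter(size < 300) |>
  count(latitude)
toy_counts
```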

2 Bison weights

This dataset (knz_records.csv and knz_ind.csv) provides age and weight records for the bison herd at Konza Prairie Biological Station in the USA, recorded from 1994 to 2020.

knz_records.csv is in a very annoying format, but one that you often encounter when first importing biological data. Each individual has their own row, and each column is a date on which the bison were weighed, with the name of the column as the date in the format year-month-day, and the values in the columns being weight in kg. The id column is a unique identifier for each bison.

2.1 Import and tidy

Start by reading in the knz_records.csv file and have a look at the format. Your first job is to take this dataset from a “wide” format to a “long” format using the tidyr package, specifically the pivot_* functions. You are aiming for a dataset with a tidy structure, as shown in the table below:

id	date	weight_kg
bison1	2000-01-01	500
bison1	2001-01-01	600
bison2	…	…

Where each row is a single observation, with columns id, date and weight_kg. Have a go yourself, and remember that using the ? helpfiles and/or searching the internet effectively are important skills to use when programming, so if you need help, do that!

Once you’ve got the data in the right format, you should then remove all rows that have NA in the weight_kg column.

Show potential solution

knz_records <-
  read_csv("knz_records.csv") |>
  pivot_longer(!id, names_to = "date", values_to = "weight_kg") |>
  filter(!is.na(weight_kg))
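
To see what the pivot does on a scale small enough to read at a glance, here is a sketch with a made-up two-bison table (toy_wide is hypothetical, not the real file). It also shows an alternative to the separate filter() step: the values_drop_na argument of pivot_longer() removes the NA rows during the pivot.

```r
library(tidyverse)

# Made-up wide data: one row per bison, one column per weighing date
toy_wide <- tibble(
  id = c("bison1", "bison2"),
  `2000-01-01` = c(500, NA),
  `2001-01-01` = c(600, 450)
)

toy_long <-
  toy_wide |>
  pivot_longer(
    !id,                     # pivot every column except id
    names_to = "date",       # old column names go here
    values_to = "weight_kg", # old cell values go here
    values_drop_na = TRUE    # drop rows where the weight is NA
  )
toy_long
```

Both approaches give the same rows here; the explicit filter() is arguably easier to read, while values_drop_na keeps everything in one call.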

Next, read in the knz_ind.csv file. It contains information about each individual. We want to use one of the dplyr *_join() functions to join this to the bison record dataset. Look at the helpfile for any of the *_join() functions (e.g. ?full_join()) to help you decide which of the functions you need to use here. We want to join the datasets by the id column. The final dataset should have 5 columns: id, date, weight_kg, sex and birth_year.

Show potential solution

knz_ind <- read_csv("knz_ind.csv")

knz_data <-
  left_join(
    x = knz_records, 
    y = knz_ind,
    by = join_by(id)
  )

knz_data
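
The difference between the *_join() functions is easiest to see when the two tables do not match perfectly. In this sketch the tibbles toy_records and toy_ind are made up: one bison has records but no individual information, and one has individual information but no records.

```r
library(tidyverse)

toy_records <- tibble(id = c("b1", "b1", "b2"), weight_kg = c(500, 600, 450))
toy_ind <- tibble(id = c("b1", "b3"), sex = c("F", "M"))

# left_join: keeps every row of x; b2 gets NA for sex
left <- left_join(toy_records, toy_ind, by = join_by(id))
# inner_join: keeps only ids present in both tables
inner <- inner_join(toy_records, toy_ind, by = join_by(id))
# full_join: keeps everything from both tables; b3 gets NA for weight_kg
full <- full_join(toy_records, toy_ind, by = join_by(id))

c(left = nrow(left), inner = nrow(inner), full = nrow(full))
```

A left join with the records as x is a reasonable choice when the goal is to keep every weighing record, whether or not the individual is listed.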

2.2 Transform

Now we want to calculate the age of the bison at each weighing date.

First, separate() the date column into three columns: rec_year, rec_month and rec_day.
Then, use the mutate() function to add a new column to the dataset called rec_age, the age of the bison (in years) at each weighing.
Finally, arrange() the dataset by rec_age in descending order. Do this all in a single “step”, using pipes.

Show potential solution

knz_data <-
  knz_data |>
  separate(col = date, into = c("rec_year", "rec_month", "rec_day"), sep = "-") |>
  mutate(rec_age = as.numeric(rec_year) - birth_year) |>
  arrange(desc(rec_age))
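
The exercise asks for separate(), but the same age can also be reached by parsing the date properly. This is an alternative sketch, not the intended solution: it uses ymd() and year() from lubridate (attached with tidyverse 2.0.0) on a made-up tibble, toy_data.

```r
library(tidyverse)

toy_data <- tibble(
  id = c("b1", "b1"),
  date = c("2000-11-04", "2005-11-02"),
  birth_year = c(1998, 1998)
)

toy_aged <-
  toy_data |>
  mutate(
    rec_date = ymd(date),                 # character -> Date
    rec_age = year(rec_date) - birth_year # age in whole years
  ) |>
  arrange(desc(rec_age))
toy_aged
```

Keeping a real Date column can be useful later, for example if you want to plot weights against the exact weighing date.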

2.3 Plot

Use ggplot2 to create a plot that illustrates the differences in growth trajectories for male and female bison. Your plot should also:

use colour to indicate sex
make use of transparency to better show the data
have a title, and clear x and y axis labels
have a legend, which should be at the bottom of the plot
use a theme other than the default one
(if you use points), use some “jitter” to better show the data points

Show potential solution

ggplot(data = knz_data, aes(x = rec_age, y = weight_kg)) +
  geom_point(aes(colour = sex), alpha = 0.3, position = position_jitter(width = 0.2, height = 0)) +
  labs(title = "Bison growth", x = "Bison age (years)", y = "Recorded weight (kg)") +
  theme_minimal() +
  theme(legend.position = "bottom")

Export your plot as a PDF file with ggsave().

Show potential solution

ggsave(filename = "knz_bison_plot.pdf", width = 6, height = 4) # assumes the last plot you made was the one in the previous step
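
Relying on “the last plot” can go wrong if you draw something else in between. One way around that, sketched here with a made-up three-point plot (toy_plot) and a temporary output path, is to store the plot in an object and pass it to ggsave() through its plot argument.

```r
library(tidyverse)

# Store the plot in an object instead of only printing it
toy_plot <-
  tibble(rec_age = c(1, 2, 3), weight_kg = c(200, 350, 450)) |>
  ggplot(aes(x = rec_age, y = weight_kg)) +
  geom_point()

# Write to a temporary folder so the sketch does not clutter your project
out_file <- file.path(tempdir(), "toy_plot.pdf")
ggsave(filename = out_file, plot = toy_plot, width = 6, height = 4)

file.exists(out_file)
```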

3 Still want more?

The tidyverse is a huge project, and this has only been an introduction to some of its most commonly used parts. For a full course, we highly recommend R for Data Science by Hadley Wickham (https://r4ds.hadley.nz/).

If you still want more to do this session, ask one of the TAs and we will help!
