install.packages("tidyverse")
Tidyverse Exercises
Open R Sessions 2023
Welcome to the tidyverse exercises! We suggest you work in pairs or small groups for these exercises. Throughout this session, you should try to use packages from tidyverse to solve most problems, but combining base R solutions and tidyverse is also ok. Use pipes (|>) where appropriate. All datasets mentioned can be found on Canvas.
Check that your R version is 4.1.0 or newer, using the R.Version() command. If it is older, you will need to update R to have access to the base R pipe.
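As a minimal sketch of that check, getRversion() returns a version object that can be compared directly against a version string. The variable name has_base_pipe is only illustrative.

```r
# getRversion() returns a version object that compares correctly against a string
has_base_pipe <- getRversion() >= "4.1.0"
has_base_pipe

# If the check passes, the base pipe hands the left-hand side to the
# function on the right as its first argument
c(1, 4, 9) |> sqrt()
```

Note that on an R version older than 4.1.0 the second statement will not even parse, so run the version check on its own first.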
Next, install the tidyverse meta package with install.packages("tidyverse"). This installs all of the tidyverse packages at the same time.
Now load the package.
library(tidyverse)
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Note the message that pops up when you load the tidyverse package. It shows the “core” packages that are loaded, and their versions. It also shows which functions from other packages have been masked by tidyverse functions of the same name. This is very helpful information, especially if you have loaded other packages prior to loading tidyverse.
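To make the masking concrete, here is a small sketch on made-up data (the objects toy, kept and smoothed are illustrative only). After dplyr is attached, a bare filter() call refers to the dplyr version, while the masked function can still be reached with an explicit namespace prefix.

```r
library(dplyr)

toy <- tibble(x = 1:5)

# A bare filter() now refers to dplyr::filter(), which keeps matching rows
kept <- toy |> filter(x > 3)

# The masked function is still available through its namespace;
# stats::filter() applies a linear filter (here a 3-point moving average)
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
```

Writing the prefix (dplyr::filter() or stats::filter()) is a simple way to be explicit whenever two loaded packages share a function name.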
1 Bergmann’s crabs (but with tidyverse)
The Atlantic marsh fiddler crab, Minuca pugnax, lives in salt marshes throughout the eastern coast of the United States. Historically, M. pugnax were distributed from northern Florida to Cape Cod, Massachusetts, but, like other species, they have expanded their range northward due to ocean warming.
The pie_crab.csv data sample is from a study by Johnson and colleagues at the Plum Island Ecosystem Long Term Ecological Research site.
Data sampling overview:
- 13 marshes were sampled on the Atlantic coast of the United States in summer 2016
- Spanning > 12 degrees of latitude, from northeast Florida to northeast Massachusetts
- Between 25 and 37 adult male fiddler crabs were collected, and their sizes recorded
Data columns:
- site: a string that identifies each site sampled
- latitude: latitude for each site (does not change within site)
- size: carapace width measurements (mm) for male crabs in the study
- air_temp: mean air temperature for each site (does not change within site)
- water_temp: mean water temperature for each site (does not change within site)
Using the readr package that is installed with tidyverse, read in the CSV file called pie_crab.csv.
pie_crab <- read_csv("pie_crab.csv")
Note the message that shows when you read in the file. Besides the number of rows and columns, it also tells us that we have 1 column of class character (chr), and 4 columns of class double (dbl).
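If you would rather state the column types than rely on guessing, read_csv() accepts a col_types argument. The sketch below uses a tiny inline dataset instead of pie_crab.csv, so the values are invented; I() marks the string as literal data and assumes readr 2.0 or newer.

```r
library(readr)

# I() marks the string as literal CSV data rather than a file path
toy <- read_csv(
  I("site,size\nA,10.5\nB,12.1"),
  col_types = cols(site = col_character(), size = col_double())
)

# spec() reports the column specification that was used
spec(toy)
```

Supplying col_types also suppresses the column specification message, which can be useful in scripts that are run repeatedly.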
For this dataset, use the tidyverse packages and pipes to do the following:
- Check the size column for outliers/mistakes, and remove any you find using filter().
- Produce a table that shows the mean size at each latitude, using group_by() and summarise().
- Use ggplot() to produce a figure that shows how size changes with latitude.
Show potential solution
# using a histogram to check for outliers
pie_crab |>
  ggplot(aes(x = size)) +
  geom_histogram()
# use filter to remove outliers, and replot to check
pie_crab_clean <-
  pie_crab |>
  filter(size < 300)
pie_crab_clean |>
  ggplot(aes(x = size)) +
  geom_histogram()
# make summarised dataset
pie_crab_means <-
  pie_crab_clean |>
  group_by(latitude) |>
  summarise(mean_size = mean(size))
# plot (many options)
pie_crab_means |>
  ggplot(aes(x = latitude, y = mean_size)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm")
2 Bison weights
This dataset (knz_records.csv and knz_ind.csv) provides age and weight records for the bison herd at Konza Prairie Biological Station in the USA, recorded from 1994 to 2020.
knz_records.csv is in a very annoying format, but one that you often encounter when first importing biological data. Each individual has their own row, and each column is a date on which the bison were weighed. The column names are the dates in the format year-month-day, and the values in the columns are weights in kg. The id column is a unique identifier for each bison.
2.1 Import and tidy
Start by reading in the knz_records.csv file and have a look at the format. Your first job is to take this dataset from a “wide” format to a “long” format using the tidyr package, and specifically the pivot_* functions. You are aiming for a dataset with a tidy structure, as shown in the table below:
id | date | weight_kg |
---|---|---|
bison1 | 2000-01-01 | 500 |
bison1 | 2001-01-01 | 600 |
bison2 | … | … |
Here each row is a single observation, with columns id, date and weight_kg. Have a go yourself, and remember that using the ? helpfiles and/or searching the internet effectively are important skills to use when programming, so if you need help, do that!
Once you’ve got the data in the right format, you should then remove all rows that have NA in the weight_kg column.
Show potential solution
knz_records <-
  read_csv("knz_records.csv") |>
  pivot_longer(!id, names_to = "date", values_to = "weight_kg") |>
  filter(!is.na(weight_kg))
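To see what the reshaping does without the full dataset, here is a sketch on a two-animal toy table. The ids and weights are invented for illustration and are not taken from knz_records.csv.

```r
library(tidyr)
library(dplyr)

# Toy wide table: one row per animal, one column per weighing date
wide <- tibble(
  id = c("bison1", "bison2"),
  `2000-01-01` = c(500, NA),
  `2001-01-01` = c(600, 450)
)

# Every column except id is stacked into a date/weight_kg pair,
# then rows without a recorded weight are dropped
long <- wide |>
  pivot_longer(!id, names_to = "date", values_to = "weight_kg") |>
  filter(!is.na(weight_kg))

long
```

pivot_longer() also has a values_drop_na argument, so setting values_drop_na = TRUE would be another way to drop the missing weights in the same call.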
Next, read in the knz_ind.csv file. It contains information about each individual. We want to use one of the dplyr *_join() functions to join this to the bison record dataset. Look at the helpfile for any of the *_join() functions (e.g. ?full_join()) to help you decide which of the functions you need to use here. We want to join the datasets by the id column. The final dataset should have 5 columns: id, date, weight_kg, sex and birth_year.
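If the difference between the join functions is unclear, it can help to try them on two tiny tables where some ids appear in only one of them. The tables below are made up for illustration, and join_by() assumes dplyr 1.1.0 or newer.

```r
library(dplyr)

# "c" has records but no individual info; "d" has info but no records
records <- tibble(id = c("a", "a", "b", "c"), weight_kg = c(500, 600, 450, 700))
individuals <- tibble(id = c("a", "b", "d"), sex = c("F", "M", "F"))

# left_join keeps every row of the first table (sex is NA for "c")
left <- left_join(records, individuals, by = join_by(id))

# inner_join keeps only ids present in both tables
inner <- inner_join(records, individuals, by = join_by(id))

# full_join keeps all ids from both tables
full <- full_join(records, individuals, by = join_by(id))

c(left = nrow(left), inner = nrow(inner), full = nrow(full))
```

Comparing the row counts and where the NA values appear is usually enough to decide which join matches the result you want.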
Show potential solution
knz_ind <- read_csv("knz_ind.csv")
knz_data <-
  left_join(
    x = knz_records,
    y = knz_ind,
    by = join_by(id)
  )
knz_data
2.2 Transform
Now we want to calculate the age of the bison at each weighing date.
- First, separate() the date column into three columns: rec_year, rec_month and rec_day.
- Then, use the mutate() function to add a new column to the dataset called rec_age, the age of the bison (in years) at each recording.
- Finally, arrange() the dataset by rec_age in descending order.
Do this all in a single “step”, using pipes.
Show potential solution
knz_data <-
  knz_data |>
  separate(col = date, into = c("rec_year", "rec_month", "rec_day"), sep = "-") |>
  mutate(rec_age = as.numeric(rec_year) - birth_year) |>
  arrange(desc(rec_age))
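One possible variation, sketched here on invented toy data: separate() has a convert argument that turns the new columns into numbers where possible, which removes the need for as.numeric() in the mutate() step.

```r
library(tidyr)
library(dplyr)

# Toy records; ids, dates and birth years are made up for illustration
toy <- tibble(
  id = c("x", "y"),
  date = c("2005-10-15", "1999-11-02"),
  birth_year = c(2000, 1998)
)

aged <- toy |>
  # convert = TRUE turns "2005" into the integer 2005, "10" into 10, etc.
  separate(col = date, into = c("rec_year", "rec_month", "rec_day"),
           sep = "-", convert = TRUE) |>
  mutate(rec_age = rec_year - birth_year) |>
  arrange(desc(rec_age))

aged
```

Both approaches give the same ages; the explicit as.numeric() version makes the type change more visible, which some people prefer.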
2.3 Plot
Use ggplot2 to create a plot that illustrates the differences in growth trajectories for male and female bison. Your plot should also:
- use colour to indicate sex
- make use of transparency to better show the data
- have a title, and clear x and y axis labels
- have a legend, which should be at the bottom of the plot
- use a theme other than the default one
- (if you use points), use some “jitter” to better show the data points
Show potential solution
ggplot(data = knz_data, aes(x = rec_age, y = weight_kg)) +
geom_point(aes(colour = sex), alpha = 0.3, position = position_jitter(width = 0.2, height = 0)) +
labs(title = "Bison growth", x = "Bison age (years)", y = "Recorded weight (kg)") +
theme_minimal() +
theme(legend.position = "bottom")
Export your plot to PDF format with ggsave().
Show potential solution
ggsave(filename = "knz_bison_plot.pdf", width = 6, height = 4) # assumes the last plot you made was the one in the previous step
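If you would rather not depend on which plot was drawn last, ggsave() also takes a plot argument. The sketch below uses a throwaway plot and a temporary file path, both chosen purely for illustration.

```r
library(ggplot2)

# Store the plot in an object so it can be passed to ggsave() explicitly
p <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8)), aes(x = x, y = y)) +
  geom_point()

# Write to a temporary directory here; use your own file path in practice
out_file <- file.path(tempdir(), "toy_plot.pdf")
ggsave(filename = out_file, plot = p, width = 6, height = 4)
```

The file type is taken from the filename extension, and width and height are in inches by default.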
3 Still want more?
The tidyverse is a huge project, and this has only been an introduction to some of the most commonly used parts of it. For a full course, we highly recommend R for Data Science by Hadley Wickham.
If you still want more to do this session, ask one of the TAs and we will help!