Open R Sessions 2023

Authors

Violeta Caballero López

Laura Hildesheim

Simon Jacobsen Ellerstrand

Iain Moodie

Pedro Rosero

Welcome to the plotting exercises! We suggest you work in pairs or small groups for these exercises. In this session we will work with real published ecological datasets as it will mirror the workflow you will likely go through when it comes to making plots of your own data. We will start by importing a CSV file into a dataframe, and then explore each dataset with a series of plots, with the goal to create at least one plot for each dataset that visualises a result or prediction from each study. All CSV files can be found on canvas.

If you get stuck or need a hint on how to proceed, you can always click the “Show potential solution button” to see one way to solve the problem (there’s always more than one way). You can copy that code to your clipboard by clicking the clipboard icon in the right of the code box. Make use of the R helpfiles using ?, as well as the resources listed in Canvas. Also feel free to search the internet for code examples of what you want to do. Learning to efficiently search the internet is a big part of learning a programming language. However, everyone makes mistakes, including people with high scores on StackOverflow, so be sure you can understand what the code is doing line-by-line, especially if it is important.

Remember to annotate your code with comments to help you remember what you did and why. To write a comment in R, use the # symbol. Anything after the # symbol will be ignored by R.

It’s up to you how you want to organise your R scripts. We would suggest to create a new R script for each dataset, all saved in a single folder somewhere safe that you can find again. Remember to set your working directory to that folder for each script (do you remember the command?).

1 Bergmann’s crabs

1.1 Dataset

The Atlantic marsh fiddler crab, Minuca pugnax, lives in salt marshes throughout the eastern coast of the United States. Historically, M. pugnax were distributed from northern Florida to Cape Cod, Massachusetts, but like other species have expanded their range northward due to ocean warming.

The pie_crab.csv data sample is from a study by Johnson and colleagues at the Plum Island Ecosystem Long Term Ecological Research site.

Data sampling overview:

  • 13 marshes were sampled on the Atlantic coast of the United States in summer 2016
  • Spanning > 12 degrees of latitude, from northeast Florida to northeast Massachusetts
  • Between 25 and 37 adult male fiddler crabs were collected, and their sizes recorded

Data columns:

  • site: a string that identifies each site sampled
  • latitude: latitude for each site (does not change within site)
  • size: carapace width measurements (mm) for male crabs in the study
  • air_temp: mean air temperature for each site (does not change within site)
  • water_temp: a mean water temperature for each site (does not change within site)

1.2 Exercises

1.2.1

The data is stored as a comma-separated values (CSV) file. You can load the data into your R session using the read.csv() function. Remember, if you are unsure about how to use a function, you can use the ? operator to access the help documentation (e.g. ?read.csv).

Load the data into your R session and assign it to an object called crab_data. Inspect the data using the head() and str() functions. What are the column names? What are the data types? How many rows of data are there?

crab_data <- read.csv("pie_crab.csv", stringsAsFactors = TRUE) # load the data

head(crab_data) # inspect the data

str(crab_data)  # 392 rows, 5 columns (site, latitude, size, air_temp, water_temp)

1.2.2

Let’s get a better understanding of each of our variables, and check for any obvious errors. We can use the summary() function to get a quick summary of each variable. What is the mean carapace width of a crab (size)? What are the min and max air and water temperatures? Plot histograms of carapace widths (size) using the hist() function. Does anything stand out to you?

summary(crab_data) # get a summary of each variable
hist(crab_data$size) # plot a histogram of carapace widths

That looks a bit weird right? Outliers are expected in real-world data, but we should check to make sure it isn’t a mistake. size here is measured in mm. Does a fiddler crab with a carapace width >1m seem resonable here? I hope not. It’s probably a data entry mistake. Let’s remove it from our dataset. There’s a few ways to do this. We can of course open the CSV file in excel, and edit it there, but we can also do this in R.

We can either locate the outline by row number, and then remove that row, or we can use a condition to remove all rows that meet that condition. For the former, we can use which() to find the row number of the outlier, and then use the - operator to remove that row.

# find the outlier as the max value in the size vector
outlier <- max(crab_data$size)
# find the row number of the outlier
outlier_row <- which(crab_data$size == outlier) # row 270
# remove the outlier
crab_data <- crab_data[-outlier_row, ]

For the latter, we can use the subset() function to remove all rows that meet a condition. Re-plot the histogram of size again.

crab_data <- subset(crab_data, size < 1000)
hist(crab_data$size) # plot a histogram of carapace widths

That looks better!


1.2.3

Let’s look at the range of carapace widths recorded at each site. What sort of plot would be best for this? Give it a try! Add a descriptive title and some axis labels. Do you agree with how we’ve done it?

Show potential solution
boxplot(size ~ site, data = crab_data, main = "Sizes at each site", xlab = "Site", ylab = "Carapace width (mm)")

We also might want to only compare means at each site. What sort of plot would be best for this? Here’s how you can calculate means for each site using the aggregate() function.

# FUN = mean tells the aggregate function to use the mean() function
mean_sizes <- aggregate(size ~ site, data = crab_data, FUN = mean)

Have a look at the format of the data now, and make a plot with it. Add a descriptive title and some axis labels, and use the ? helpfiles if you get stuck.

Show potential solution
barplot(
  mean_sizes$size, # the heights of the bars
  names.arg = mean_sizes$site, # the names of the bars
  main = "Mean sizes at each site", 
  xlab = "Site", 
  ylab = "Carapace width (mm)"
  )
BONUS: how to add error bars
# Really, don't stress about it, it's just often people will ask about this

# you might be thinking that error bars would be nice
# error bars can be made using the arrows() function in a bit of a hacky way
# here's an example:

# calculate the standard error of the mean for each site
mean_sizes$se <- aggregate(size ~ site, data = crab_data, FUN = function(x) sd(x)/sqrt(length(x)))$size

# we need to assign it to a variable for later
(my_barplot <-
barplot(
  mean_sizes$size, # the heights of the bars
  names.arg = mean_sizes$site, # the names of the bars
  ylim = c(0, 30), # set the y axis limits
  main = "Mean sizes at each site", 
  xlab = "Site", 
  ylab = "Carapace width (mm)"
  )
)

# add the error bars
arrows(
  x0 = my_barplot, # the x position of the error bars
  x1 = my_barplot, # the x position of the error bars
  y0 = mean_sizes$size - mean_sizes$se, # the lower y position of the error bars
  y1 = mean_sizes$size + mean_sizes$se, # the upper y position of the error bars
  code = 3, # the type of arrow to use
  angle = 90, # the angle of the arrow
  length = 0.1 # the length of the arrow
  )

1.2.4

The dataset was originally collected to test Bergmann’s rule:

“One of the best-known patterns in biogeography is Bergmann’s rule. It predicts that organisms at higher latitudes are larger than ones at lower latitudes. Many organisms follow Bergmann’s rule, including insects, birds, snakes, marine invertebrates, and terrestrial and marine mammals (Johnson et al. 2019).”

Can you make a plot that would help us to visualise the main prediciton from Bergmann’s rule in M. pugnax? Give it a try! Some helpful functions might be plot(), abline(), and lm(). Check the the potential solution, do you disagree? There’s more than one way to do this. If you have a plot in mind, but don’t know how to make it, ask for help! Does Bergmann’s rule hold for M. pugnax?

Show potential solution
# I chose to use a scatterplot with a linear model fit
plot(crab_data$latitude, crab_data$size, main = "Bergmann's rule in Minuca pugnax", xlab = "Latitude", ylab = "Carapace width (mm)")

model <- lm(crab_data$size ~ crab_data$latitude) # fit a linear model

abline(model, col = "red") # abline will use the intercept and slope from the linear model to plot a line on the figure

1.2.5

Hopefully you’ve made a plot you are happy with. Suppose you wanted to put this in your thesis, how would export it? You can use the Rstudio “Export” button above your plot, and this will work alright for most things. However, if you want to have more control over the size of your figure, one method is to use the pdf() function to create a pdf file, then plot your plot, and then use the dev.off() function to “close” the PDF file.

pdf("my_plot.pdf")
#
# plot your plot here
#
dev.off()

Using a format like PDF is nearly always preferable to something like a PNG file, because it is a vector format. This means that the file contains information about shapes in the plot, rather than just a bunch of pixels. This means that you can zoom in on the plot without it getting blurry.


Congratulations! First dataset done! If you want to see the actual figure that was used in the publication, click here. How does your plot compare? Feel free to make additional plots you have in mind for this dataset (we didn’t use air_temp or water_temp yet), or move onto the next one.

Johnson, D. S., C. Crowley, K. Longmire, J. Nelson, B. Williams, and S. Wittyngham. 2019. The fiddler crab, Minuca pugnax, follows Bergmann’s rule. Ecology and Evolution 9:14489–14497.

2 Frozen lakes

2.1 Dataset

The dataset contains information aboout the ice cover of two lakes in Madison, WI, USA. The dataset starts in the 19th century, so has the potential to be informative about how the climate has changed on long time scales.

Data columns:

  • lakeid: name of the lake
  • year: year that the record was from
  • ice_duration: the number of days that the lake was covered with ice

2.2 Exercises

2.2.1

Load the data into your R session and assign it to an object called lake_data. Inspect the data using the head() and str() functions as before.

Show potential solution
lake_data <- read.csv("ntl_icecover.csv", stringsAsFactors = TRUE)

head(lake_data)

str(lake_data)

2.2.2

Plot a histogram of ice_duration.

Show potential solution
hist(
  x = lake_data$ice_duration,
  xlab = "Ice duration (days)",
  main = "" # note that providing an empty string is one way to not plot a title or axis.
  )

Say we wanted to plot separately histograms for each lake. To do that we need to learn a bit about par(). par() hold global (i.e. effects everything) plotting parameters. Most of these are the default values for plot shape, size, colour, etc that are used when you do no specifiy what they should be. par() also controls how many panels the plotting window has. To plot two plots one above the other, we need to modify the mfrow argument (remember, use ?par to find out more). We can then make our two plots, separating the data for each lake.

# Set the layout of the plotting window to have two rows
par(mfrow = c(2, 1))

# Plot the histogram of ice duration for Lake Mendota
hist(
  lake_data$ice_duration[lake_data$lake == "Lake Mendota"],
  main = "Lake Mendota",
  xlab = "Ice duration (days)"
  )

# Plot the histogram of ice duration for Lake Monona
hist(
  lake_data$ice_duration[lake_data$lake == "Lake Monona"],
  main = "Lake Monona",
  xlab = "Ice duration (days)"
  )

# Set the layout back to a single plot per window
par(mfrow = c(1, 1))

For some of you, the fact that the breaks (bin sizes) are not the same, and that x axis does not match, will be very annoying. We can fix this by changing some arguments! Pass a vector of two numbers (the min and max value) to the argument xlim = in each plot, to see if you can fix this issue. To enforce the breaks to be the same, you can use the breaks = argument. Try to figure this out with ?hist before looking at the solution.

Show potential solution
# to save typing it twice, we can define our xlim and breaks vector before hand
# we can use the functions min() and max() to find a suitble range

min(lake_data$ice_duration) # 21
max(lake_data$ice_duration) # 161

x_limit <- c(10,170)
n_breaks <- 20

# Set the layout of the plotting window to have two rows
par(mfrow = c(2, 1))

# Plot the histogram of ice duration for Lake Mendota
hist(
  lake_data$ice_duration[lake_data$lake == "Lake Mendota"],
  main = "Lake Mendota",
  xlab = "Ice duration (days)",
  xlim = x_limit,
  breaks = n_breaks
  )

# Plot the histogram of ice duration for Lake Monona
hist(
  lake_data$ice_duration[lake_data$lake == "Lake Monona"],
  main = "Lake Monona",
  xlab = "Ice duration (days)",
  xlim = x_limit,
  breaks = n_breaks
  )

# Set the layout back to a single plot per window
par(mfrow = c(1, 1))

You might have noticed that the code for our histograms is almost identical, apart from which lake is selected. We could make our code much shorter using for loops, where we just replace which lake is used and the same code is run twice. This will be covered in a dedicated session on the 19th October! For now, let’s move on.


2.2.3

The big question here is probably surrounding how ice_duration has changed over time (that is, the number of days a year that the lakes are covered in ice). How could we visualise that? One option might be to plot the relationship between ice_duration and year. Try to do that using the plot() function. Give the plot descripte axis labels and a title. Check the helpfiles if you are unsure.

Show potential solution
# using the formula notation
plot(
  ice_duration ~ year, 
  data = lake_data,
  main = "Ice duration over time", 
  xlab = "Year", 
  ylab = "Ice duration (days)"
  )

# OR

# providing just the vectors
plot(
  x = lake_data$year,
  y = lake_data$ice_duration, 
  main = "Ice duration over time", 
  xlab = "Year", 
  ylab = "Ice duration (days)"
  )

# Both make the same plot

This sort of data is often called timeseries data, and we usually see it presented using a line, not individual points. We can use the type = l argument for the plot() function to acheive this. However, this will “join” the dots in order they appear in the dataframe. We also have two lakes, and therefore two data points for each value of the x axis. If you don’t see what the issue is, you can try plot it! Let’s deal with the two lake problem first. Try to adapt the code we used in the previous dataset to calculate a mean ice_duration for each year across the two lakes.

Show potential solution
mean_ice_duration <- aggregate(ice_duration ~ year, data = lake_data, FUN = mean)

Now let’s ensure our data is in the correct order. We can do that by ordering the dataset by the year column. Use ? to find out how order() works.

Code
mean_ice_duration <- mean_ice_duration[order(mean_ice_duration$year), ]

Ok, now we are ready to make the plot. Use the type = l argument for the plot() function to make the timeseries plot with a line connecting the data points. Change the colour of the line to one of your preference. As an additional execrise, how could you change the Y axis to be in weeks, rather than days?

Show potential solution
plot(
  ice_duration ~ year, 
  data = mean_ice_duration,
  type = "l",
  main = "Mean ice duration over time", 
  xlab = "Year", 
  ylab = "Mean Ice duration (days)",
  col = "red"
  )
Change days to weeks
plot(
  ice_duration/7 ~ year, # we can make the edit here, without editing the dataframe object. diving by 7 will give weeks.
  data = mean_ice_duration,
  type = "l",
  main = "Mean ice duration over time", 
  xlab = "Year", 
  ylab = "Mean Ice duration (weeks)",
  col = "red"
  )

Now, modify your plot to instead show both lines and points (look at ?plot for help). Make the points solid instead of having no fill.

Show potential solution
plot(
  ice_duration ~ year, 
  data = mean_ice_duration,
  type = "b",
  main = "Mean ice duration over time", 
  xlab = "Year", 
  ylab = "Mean Ice duration (days)",
  col = "red",
  pch = 16 # 19 will also work
  )

Use lm() and abline() to add a trendline to the figure. Make the line dashed instead of solid, and increase it’s width to 2. Maybe you also need to change the colour so it stands out on your plot?

Show potential solution
abline(lm(ice_duration ~ year, data = lake_data), lty = 5, lwd = 2)

2.2.4

From what we have covered in the last two datasets, do you think you can make a similar plot as in the last section, but have a separate line for each lake? There’s a few ways to do it. You will need to use the lines() function to add a second line to the plot. Add in a legend to show which line is which. You don’t need to add a trendline with abline(). Good luck!

Show potential solution
# separate the data for each lake
mendota <- lake_data[lake_data$lakeid == "Lake Mendota", ]
monona <- lake_data[lake_data$lakeid == "Lake Monona", ]

# make sure both datasets are in order
mendota <- mendota[order(mendota$year), ]
monona <- monona[order(monona$year), ]

# make the plot with the first lake
plot(
  ice_duration ~ year, 
  data = mendota,
  type = "l",
  main = "Mean ice duration over time", 
  xlab = "Year", 
  ylab = "Mean Ice duration (days)",
  col = "red"
  )

# add the second lake

lines(
  ice_duration ~ year, 
  data = monona,
  col = "blue"
  )

# add a legend
legend(
  "topright", 
  legend = c("Lake Mendota", "Lake Monona"), 
  col = c("red", "blue"), 
  lty = 1
  )

Awesome work! Congratulations for getting this far! The next section is quite open for you to do what you like, so feel free to use the Sugar Maple dataset, or to use one of your own, or to keep working on other plots from the two datasets above.

3 Sugar Maples

3.1 Dataset

The hbr_maples.csv dataset contains observations on sugar maple seedlings in untreated and calcium-amended watersheds at Hubbard Brook Experimental Forest in New Hampshire, USA. To investigate the impact of soil calcium supplementation on sugar maple seedling growth, the authors recorded “general sugar maple germinant health by height, leaf area, biomass, and chlorophyll content” for seedlings in untreated and previously calcium-treated watersheds. By comparing seedling growth in calcium-treated (W1) versus untreated (Reference) watersheds, calcium impacts on sugar maple seedling growth can be explored.

Sugar maple seedlings were examined on calcium-treated and reference sites during August 2003 and June 2004. Seedlings were sampled every ten steps in transects.

  • year: a number denoting the year that the sample was taken
  • watershed: a factor denoting the watershed where the sample was collected; W1 = calcium-treated, Reference = reference
  • elevation: a factor describing the Elevation of transect; Low = low elevation, Mid = mid elevation
  • transect: a factor denoting the transect number within the watershed
  • sample: a factor denoting the sample number within transect. There are twenty samples in each transect
  • stem_length: a number denoting the height of the seedling in millimeters
  • leaf1area: a number denoting the area of the first sampled leaf in square centimeters
  • leaf2area: a number denoting the area of the second sampled leaf in square centimeters
  • leaf_dry_mass: a number denoting the dry mass of both sampled leaves in grams
  • stem_dry_mass: a number denoting the dry mass of the stem in grams
  • corrected_leaf_area: a number denoting the area of both leaves in addition to the area removed for chlorophyll measurements in square centimeters

Field observations suggest that the seedlings in the calcium-treated watershed appeared to be more abundant and healthier than in the reference areas. They looked larger, greener, and had bigger root systems that held more soil on the roots when they were plucked out of the ground.

3.2 Exercises

Based on those observations, you should produce a set of plots that examine the differences between the sugar maple seedlings across treatments. Use plots and summary functions to understand the dataset, fix any issues or wrong data types, and then produce figures that could be part of a report or presentation, based on the field observations above, and any others you think about after looking at the data. Be creative! Try and export your figures to PDF files at the end.