Correlation and linear regression I

Exercise 9

Author

Iain R. Moodie

Published

April 4, 2025

Get RStudio setup

Each time we start a new exercise, you should:

Make a new folder in your course folder for the exercise (e.g. biob11/exercise_9)
Open RStudio
- If you haven’t closed RStudio since the last exercise, I recommend you do so and then re-open it. If it asks if you want to save your R Session data, choose no.
Set your working directory by going to Session -> Set working directory -> Choose directory, then navigate to the folder you just made for this exercise.
Create a new R Markdown document (File -> New File -> R Markdown...). Give it a clear title.

Please ensure you have followed the steps above before you start!

Bergmann’s rule and fiddler crabs

The Atlantic marsh fiddler crab, Minuca pugnax, lives in salt marshes along the eastern coast of the United States. Historically, M. pugnax was distributed from northern Florida to Cape Cod, Massachusetts, but, like other species, it has expanded its range northward due to ocean warming.

The pie_crab.csv dataset is from a study by Johnson and colleagues at the Plum Island Ecosystem Long Term Ecological Research site.

You can download the dataset here.

Data sampling overview:

13 marshes were sampled on the Atlantic coast of the United States in summer 2016
The marshes span more than 12 degrees of latitude, from northeast Florida to northeast Massachusetts
Between 25 and 37 adult male fiddler crabs were collected from each marsh, and their carapace size (mm) was recorded

The dataset was collected to test Bergmann’s rule:

One of the best-known patterns in biogeography is Bergmann’s rule. It predicts that organisms at higher latitudes are larger than ones at lower latitudes. Many organisms follow Bergmann’s rule, including insects, birds, snakes, marine invertebrates, and terrestrial and marine mammals (Johnson et al. 2019).

Analysis

General

In your opinion, what (statistical) population are the researchers trying to make inferences about?

Data handling and plotting

Ensure you have loaded the tidyverse and infer packages.
Import the dataset using read_csv().
Check the data for mistakes.
Make illustrative plot(s) of the key variables to address “Bergmann’s rule” in the dataset using ggplot().
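
As a rough sketch of what checking and plotting can look like, the example below uses a small simulated dataset in place of pie_crab.csv so that it runs on its own. The object name `toy_crabs`, the column names `site`, `latitude` and `size`, and all of the simulated values are assumptions made for illustration; your imported data may be named and structured differently.

```r
library(tidyverse)

set.seed(1)

# Simulated stand-in data (hypothetical names and values, not the real dataset)
toy_crabs <- tibble(
  site = rep(c("A", "B", "C", "D"), each = 25),
  latitude = rep(c(30, 34, 39, 42), each = 25),
  size = 0.5 * latitude + rnorm(100, sd = 3)
)

# Quick checks: column types, value ranges, and missing values per column
glimpse(toy_crabs)
summary(toy_crabs)
n_missing <- toy_crabs |>
  summarise(across(everything(), ~ sum(is.na(.x))))
n_missing

# Scatter plot of the two key variables
crab_plot <- ggplot(toy_crabs, aes(x = latitude, y = size)) +
  geom_point(alpha = 0.5) +
  labs(x = "Latitude (degrees)", y = "Carapace size (mm)")
crab_plot
```

With real data, look for impossible values (for example negative sizes or latitudes outside the sampled range) and unexpected missing values before moving on.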

Does Minuca pugnax follow Bergmann’s rule?

State the null and alternative hypotheses.
What method and test statistic(s) will you use? Why?

Hint

I think this is one of the examples where you could argue for either a correlation analysis or a regression analysis. It depends on whether you think it would be meaningful to make the statement: “For each degree increase in latitude, Minuca pugnax will be on average X bigger”. If yes, and the relationship appears to be linear, then use linear regression. If you would feel safer saying “There is a positive/negative correlation between Minuca pugnax size and latitude”, without ascribing causality or a strict rule, then you should use a correlation approach. You could also answer the question in a completely different way, for example by testing whether crabs from the highest latitude differ from those at the lowest.

Calculate the observed statistic(s).

Code hint

Correlation:

observed_statistic <-
  ______ |>  # <1>
  specify(______ ~ ______) |>  # <2>
  calculate(stat = "______")  # <3>

observed_statistic  # <4>

1: The name of the dataset.
2: A formula in the form of dependent ~ independent
3: Calculate the observed statistic.
4: Print the observed statistic to the console.

Linear regression:

observed_fit <-
  ______ |>  # <1>
  specify(______ ~ ______) |>  # <2>
  fit()  # <3>

observed_fit  # <4>

1: The name of the dataset.
2: A formula in the form of dependent ~ independent
3: Fit the linear model.
4: Print the observed fit to the console.
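
To show what the two templates above return, here is a minimal sketch on simulated data. The names `toy_crabs`, `latitude` and `size`, and the simulated relationship, are assumptions for illustration only and are not the exercise dataset.

```r
library(tidyverse)
library(infer)

set.seed(1)

# Simulated stand-in data (hypothetical names and values)
toy_crabs <- tibble(
  latitude = runif(100, min = 30, max = 43),
  size = 0.5 * latitude + rnorm(100, sd = 3)
)

# Correlation: a one-row tibble with a single column called `stat`
observed_correlation <- toy_crabs |>
  specify(size ~ latitude) |>
  calculate(stat = "correlation")
observed_correlation

# Linear regression: one row per term (intercept and slope),
# with the fitted values in a column called `estimate`
observed_fit <- toy_crabs |>
  specify(size ~ latitude) |>
  fit()
observed_fit
```

The slope row of `observed_fit` is the number you would use in a statement of the form “for each degree increase in latitude, size changes by X on average”.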

Calculate 95% confidence intervals for your observed statistic(s).

Code hint

Correlation:

______ <-
  ______ |>  # <1>
  specify(______ ~ ______) |>  # <2>
  generate(reps = ______, type = "______") |>  # <3>
  calculate(stat = "______")  # <4>

percentile_ci <- get_ci(______, type = "______", level = ______)  # <5>

percentile_ci  # <6>

1: The name of the dataset.
2: A formula in the form of dependent ~ independent (order does not matter for a correlation)
3: Decide how you will generate the datasets to use in your sampling distribution.
4: Calculate the statistic for each sample to make a sampling distribution.
5: Use your sampling distribution object to calculate the confidence intervals.
6: Print the CI.

Linear regression:

______ <-
  ______ |>  # <1>
  specify(______ ~ ______) |>  # <2>
  generate(reps = ______, type = "______") |>  # <3>
  fit()  # <4>

percentile_ci <- get_ci(______, point_estimate = ______, level = ______)  # <5>

percentile_ci  # <6>

1: The name of the dataset.
2: A formula in the form of dependent ~ independent
3: Decide how you will generate the datasets to use in your sampling distribution.
4: Fit the linear regression to each sample to get a sampling distribution.
5: Use your sampling distribution object and your observed statistics to calculate the confidence intervals.
6: Print the CI.
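
One way the confidence interval templates could be filled in is with bootstrap resampling and percentile intervals, sketched below on simulated data. The dataset `toy_crabs`, its column names, the choice of 1000 replicates, and the use of the bootstrap are all assumptions for this illustration; you should justify your own choices.

```r
library(tidyverse)
library(infer)

set.seed(1)

# Simulated stand-in data (hypothetical names and values)
toy_crabs <- tibble(
  latitude = runif(100, min = 30, max = 43),
  size = 0.5 * latitude + rnorm(100, sd = 3)
)

# Correlation: resample rows with replacement, recompute the statistic each time
boot_correlation <- toy_crabs |>
  specify(size ~ latitude) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "correlation")

correlation_ci <- get_ci(boot_correlation, type = "percentile", level = 0.95)
correlation_ci

# Linear regression: refit the model to each bootstrap sample
observed_fit <- toy_crabs |>
  specify(size ~ latitude) |>
  fit()

boot_fits <- toy_crabs |>
  specify(size ~ latitude) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

# For fitted models, get_ci() also needs the observed fit as point_estimate
fit_ci <- get_ci(boot_fits, point_estimate = observed_fit, level = 0.95)
fit_ci
```

The regression result has one interval per term, so the intercept and the slope each get their own lower and upper bound.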

Generate a null distribution. Describe why your chosen method is appropriate.

Code hint

Correlation:

______ <-
  ______ |>  # <1>
  specify(______ ~ ______) |>  # <2>
  hypothesize(null = "______") |>  # <3>
  generate(reps = ______, type = "______") |>  # <4>
  calculate(stat = "______")  # <5>

1: The name of the dataset.
2: A formula in the form of dependent ~ independent (order does not matter for a correlation)
3: What is your null hypothesis?
4: Decide how you will generate datasets under your null hypothesis to use in your null distribution.
5: Calculate the statistic for each sample to make a null distribution.

Linear regression:

______ <-
  ______ |>  # <1>
  specify(______ ~ ______) |>  # <2>
  hypothesize(null = "______") |>  # <3>
  generate(reps = ______, type = "______") |>  # <4>
  fit()  # <5>

1: The name of the dataset.
2: A formula in the form of dependent ~ independent
3: What is your null hypothesis?
4: Decide how you will generate datasets under your null hypothesis to use in your null distribution.
5: Fit the linear regression to each sample to get a null distribution.
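
As an illustration of one possible approach, the sketch below builds null distributions by permutation under a null hypothesis of independence, using simulated data. The names `toy_crabs`, `latitude` and `size`, the 1000 replicates, and the choice of permutation are assumptions for this example; the exercise asks you to explain why your own method is appropriate.

```r
library(tidyverse)
library(infer)

set.seed(1)

# Simulated stand-in data (hypothetical names and values)
toy_crabs <- tibble(
  latitude = runif(100, min = 30, max = 43),
  size = 0.5 * latitude + rnorm(100, sd = 3)
)

# Correlation: shuffling the response breaks any link between
# the two variables, which is what "independence" means here
null_correlation <- toy_crabs |>
  specify(size ~ latitude) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "correlation")

# Linear regression: refit the model to each shuffled dataset
null_fits <- toy_crabs |>
  specify(size ~ latitude) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

null_correlation
null_fits
```

Because the shuffled datasets contain no real relationship, the permuted correlations should be centred near zero.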

Plot the null distribution and the observed statistic. Calculate a p-value(s) for your null hypothesis.

Code hint

______ |>
  visualise() +  # <1>
  shade_p_value(obs_stat = ______, direction = "______") +  # <2>
  labs(x = "______")  # <3>

______ |>  # <4>
  get_p_value(obs_stat = ______, direction = "______")

1: Pipe your null distribution object into visualise().
2: Plot your observed statistic(s), and specify the direction of your hypothesis.
3: You can change the axis labels to make the plot clearer.
4: Your null distribution.
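
The final step could look something like the sketch below for the correlation case, again on simulated data. The names `toy_crabs`, `latitude` and `size` are hypothetical, and the "two-sided" direction is chosen only for illustration; the direction you use should follow from your own alternative hypothesis.

```r
library(tidyverse)
library(infer)

set.seed(1)

# Simulated stand-in data (hypothetical names and values)
toy_crabs <- tibble(
  latitude = runif(100, min = 30, max = 43),
  size = 0.5 * latitude + rnorm(100, sd = 3)
)

observed_correlation <- toy_crabs |>
  specify(size ~ latitude) |>
  calculate(stat = "correlation")

null_correlation <- toy_crabs |>
  specify(size ~ latitude) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "correlation")

# Histogram of the null distribution with the observed statistic marked
null_plot <- null_correlation |>
  visualise() +
  shade_p_value(obs_stat = observed_correlation, direction = "two-sided") +
  labs(x = "Correlation coefficient")
null_plot

# Proportion of permuted statistics at least as extreme as the observed one
p_value <- null_correlation |>
  get_p_value(obs_stat = observed_correlation, direction = "two-sided")
p_value
```

If none of the permuted statistics is as extreme as the observed one, infer reports a p-value of 0 together with a warning. In that case it is more accurate to report the p-value as smaller than 1 divided by the number of replicates than as exactly zero.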

Write a short statement that summarises your statistical methods and findings. You should:
1. State clearly the research question, and what your hypotheses were. Explain why these hypotheses answer your research question.
2. Explain your choice of test statistic/method. Relate this to your hypotheses and question.
3. State your observed statistic(s) and confidence intervals. Explain what these mean. Refer to a plot you made that shows the data.
4. State the outcome of your hypothesis test (quoting test statistic(s) and p-values). Interpret this result in terms of both your statistical hypotheses and the broad research question.