Regression modelling

Lecture 6

Iain R. Moodie

Wednesday 1st April, 2026

Correlation

Correlation

Do two continuous variables covary?

  • Used to assess if two continuous variables are independent, or if they covary (linearly)
  • We do not express one variable as a function of another:
    • No “response” and “explanatory”
  • Usually assumed that both variables are caused by the same thing
  • “Correlation does not mean causation”
    • But you can measure a correlation between a cause and effect

Correlation

Do two continuous variables covary?

Correlation coefficient (Pearson):

\[ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]

\[ r = \frac{\text{Covariance}(x,y)}{\text{Standard deviation}(x) \times \text{Standard deviation}(y)} \]

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Calculate standard deviation in \(x\):

\[ \sigma_x = \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \]

data |>
  observe(response = x, stat = "sd")
Response: x (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1  10.9

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Calculate standard deviation in \(y\):

\[ \sigma_y = \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2} \]

data |>
  observe(response = y, stat = "sd")
Response: y (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1  3.51

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Calculate the covariance between \(x\) and \(y\):

\[ Cov(x,y) = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \]

data |>
  cov()
          x        y
x 118.23297 33.87468
y  33.87468 12.31625

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

data |>
  specify(x ~ y) |>
  calculate(stat = "correlation")
Response: x (numeric)
Explanatory: y (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.888

Correlation

Do two continuous variables covary?

Correlation

Do two continuous variables covary?

Correlation coefficient (\(r\)): \[ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]

Coefficient of determination (\(r^2\)): \[ r^2 = \left(\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}\right)^2 \]

Correlation

Do two continuous variables covary?

Correlation

Do two continuous variables covary?

adelie_data <-
  penguins |>
  filter(species == "Adelie")
  
adelie_data |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(size = 3, alpha = 0.7)

Correlation

Observed statistic

observed_correlation <-
  adelie_data |>
  specify(flipper_length_mm ~ body_mass_g) |>
  calculate(stat = "correlation")

observed_correlation
Response: flipper_length_mm (numeric)
Explanatory: body_mass_g (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.468

Correlation

Confidence intervals

boot_dist <-
  adelie_data |>
  specify(flipper_length_mm ~ body_mass_g) |>
  generate(reps = 10000, type = "bootstrap") |>
  calculate(stat = "correlation")

percentile_ci <- 
    boot_dist |>
    get_ci(level = 0.95, type = "percentile")

Correlation

Confidence intervals

boot_dist |>
  visualize() +
  shade_confidence_interval(endpoints = percentile_ci) +
  labs(x = "r")

Correlation

Confidence intervals

  • The correlation between flipper length and body mass was 0.468 (95% CI: 0.355, 0.572).

Correlation

Hypothesis test

  • Null hypothesis:
    • The two variables do not covary (\(r=0\))
  • Alternative hypothesis:
    • The two variables do covary (\(r\neq0\))

Correlation

Hypothesis test

  • How could we generate a null distribution?

Correlation

Hypothesis test

Correlation

Hypothesis test

null_dist <-
  adelie_data |>
  specify(flipper_length_mm ~ body_mass_g) |>
  hypothesize(null = "independence") |>
  generate(reps = 10000, type = "permute") |>
  calculate(stat = "correlation")

Correlation

Hypothesis test

null_dist |>
  visualize() +
  shade_p_value(obs_stat = observed_correlation, direction = "two-sided")

Correlation

Hypothesis test

null_dist |>
  get_p_value(obs_stat = observed_correlation, direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1       0
  • Remember that our p-value is an approximation
    • The number of decimal places of accuracy we can observe is directly linked to the number of reps

Linear Regression

  • When your response is a quantitative variable, and you want to predict it using (at least) one other quantitative variable

Linear Regression

How does variable Y depend on variable X

  • A mathematical function describing the relationship between two continuous variables
  • Y is dependant on X (the)
  • Y is a function of X
  • Predictive model
  • Very powerful framework

Linear Regression

How does variable Y depend on variable X

\[ y = \text{Slope}\times x + \text{Intercept} \]

\[ y = mx+c \]

\[ y = \beta_1x+\beta_0 \]

\[ y = \beta_0+\beta_1x+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5 \]

Linear Regression

How does variable Y depend on variable X

Linear Regression

How does variable Y depend on variable X

Linear Regression

How does variable Y depend on variable X

  • Example: For sexual selection to operate, an increase in mating success (number of mates) must result in an increase in reproductive success (number of offspring).

Linear Regression

How does variable Y depend on variable X

\[ y = \beta_1x+\beta_0 \]

\[ y = 3.83x-0.68 \]

  • \(\beta_1\) = strength of sexual selection
  • For each additional mate, an individual (on average) gains \(\beta_1\) additional offspring
  • For 5 mates (\(x=5\)):
    • \(y = 3.83\times5-0.68\)
    • \(y = 18.47\)

Linear Regression

How does variable Y depend on variable X

Linear Regression

How does variable Y depend on variable X

Linear Regression

How does variable Y depend on variable X

Linear Regression

How does variable Y depend on variable X

Linear Regression

  • Fit by solving to minimise the sum of the squared residuals (SSR)
    • Find \(\beta_1\) and \(\beta_0\) that minimise the SSR
    • Called a “loss function”

Linear Regression

How does variable Y depend on variable X

Linear Regression

How to fit a slope in R?

With infer (if no interest in the intercept):

observed_fit <-
  mating_data |>
  specify(reproductive_success ~ mating_success) |>
  calculate(stat = "slope")

observed_fit
Response: reproductive_success (numeric)
Explanatory: mating_success (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1  3.83

Linear Regression

Confidence intervals for a slope

boot_dist <-
  mating_data |>
  specify(reproductive_success ~ mating_success) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "slope")

boot_dist
Response: reproductive_success (numeric)
Explanatory: mating_success (numeric)
# A tibble: 1,000 × 2
   replicate  stat
       <int> <dbl>
 1         1  4.32
 2         2  4.73
 3         3  3.80
 4         4  2.84
 5         5  3.96
 6         6  3.98
 7         7  3.38
 8         8  3.60
 9         9  4.08
10        10  3.98
# ℹ 990 more rows

Linear Regression

Confidence intervals for a slope

conf_ints <- 
  boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")

conf_ints
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1     2.99     4.66

Linear Regression

Confidence intervals for a slope

boot_dist |>
  visualize() +
  shade_confidence_interval(endpoints = conf_ints)

Linear Regression

Confidence intervals for a slope

boot_dist |>
  visualize() +
  shade_confidence_interval(endpoints = conf_ints)

Linear Regression

Hypothesis test for a slope

  • Null hypothesis:
    • Slope = 0
  • Alternative hypothesis:
    • Slope \(\neq\) 0
    • Slope < 0
    • Slope > 0

Linear Regression

Hypothesis test for a slope

null_dist <-
  mating_data |>
  specify(reproductive_success ~ mating_success) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "slope")

null_dist
Response: reproductive_success (numeric)
Explanatory: mating_success (numeric...
# A tibble: 1,000 × 2
   replicate    stat
       <int>   <dbl>
 1         1  0.0382
 2         2 -0.603 
 3         3 -0.951 
 4         4 -0.570 
 5         5 -1.20  
 6         6 -0.0117
 7         7 -0.548 
 8         8 -0.707 
 9         9 -0.423 
10        10 -0.0407
# ℹ 990 more rows

Linear Regression

Hypothesis test for a slope

visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")

Linear Regression

Multiple linear regression

\[ y = \beta_1x+\beta_0 \]

\[ y = \beta_0+\beta_1x+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5... \]

Linear Regression

Multiple linear regression: intercept only model

\[ \text{flipper_length_mm} = \beta_0 \]

penguins |>
  specify(response = flipper_length_mm) |>
  fit()
# A tibble: 1 × 2
  term      estimate
  <chr>        <dbl>
1 intercept     201.

Linear Regression

Multiple linear regression: intercept and slope model

\[ \text{flipper_length_mm} = \beta_0+ \\ \beta_1 \times \text{body_mass_g} \]

penguins |>
  specify(response = flipper_length_mm, explanatory = body_mass_g) |>
  fit()
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 intercept   137.    
2 body_mass_g   0.0153

Linear Regression

Multiple linear regression: additive model with parallel slopes (ANCOVA)

\[ \text{flipper_length_mm} = \beta_0+ \\ \beta_1 \times \text{body_mass_g}+ \\ \beta_2 \times \text{species} \]

penguins |>
  specify(flipper_length_mm ~ body_mass_g + species) |>
  fit()
# A tibble: 4 × 2
  term              estimate
  <chr>                <dbl>
1 intercept        159.     
2 body_mass_g        0.00840
3 speciesChinstrap   5.60   
4 speciesGentoo     15.7    

Linear Regression

Multiple linear regression: interaction model

\[ \text{flipper_length_mm} = \beta_0 + \\ \beta_1 \times \text{body_mass_g} + \\ \beta_2 \times \text{species} + \\ \beta_3 \times \text{body_mass_g} \times \text{species} \]

penguins |>
  specify(flipper_length_mm ~ body_mass_g * species) |>
  fit()
# A tibble: 6 × 2
  term                          estimate
  <chr>                            <dbl>
1 intercept                    165.     
2 body_mass_g                    0.00668
3 speciesChinstrap             -13.9    
4 speciesGentoo                  6.06   
5 body_mass_g:speciesChinstrap   0.00523
6 body_mass_g:speciesGentoo      0.00236

Linear Regression

Multiple linear regression

  • The methods used in infer for hypothesis testing in multiple linear regression are a bit limited.
    • Need to move to parsnip package or base R lm() function.
    • Bit beyond this course, but you will learn it when you need to (hard to avoid!)
  • Super powerful framework

Logistic Regression

  • When your response is a binomial (categorical variable with two levels), and you want to predict it using (at least) one quantitative variable

Logistic Regression

Predicting a binomial response with a quantitative variable

Logistic Regression

Predicting a binomial response with a quantitative variable

  • Type of generalized linear model (GLM)
  • Two categories are treated as 0 or 1.
  • Requires a transformation of the response variable, the probability of being 1, \(p\)
    • Linear regression is a straight line
    • Will predict values less than 0 or more than 1

Logistic Regression

Logit transformation

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) \]

  • When \(p=0.5\) (equally likely either species), \(\text{logit}(p)=0\)
  • When \(p\) approaches 1 (almost certainly Gentoo), \(\text{logit}(p)\) approaches \(+\infty\)
  • When \(p\) approaches 0 (almost certainly Adelie), \(\text{logit}(p)\) approaches \(-\infty\)

\[ \text{logit}(p) = \beta_0+\beta_1x+\beta_2x_2+... \]

Logistic Regression

Logit transformation

Logistic Regression

Inverse logit transformation

  • After fitting a logistic regression model on the logit scale, you get predictions in logit units.
  • The inverse logit transforms them back to probabilities (0–1).

\[ p = \frac{e^{\beta_0+\beta_1x...}}{1 + e^{\beta_0+\beta_1x...}} \]

Logistic Regression

How to fit in R

\[ \text{logit}(p) = \beta_0 + \beta_1 \times \text{body_mass_g} \]

penguins |>
  filter(species == "Adelie" | species == "Gentoo") |>
  specify(species ~ body_mass_g) |>
  fit()
# A tibble: 2 × 2
  term         estimate
  <chr>           <dbl>
1 intercept   -27.5    
2 body_mass_g   0.00623

Logistic Regression

Interpreting coefficients: intercept

\(\beta_0 \approx -27.5\)

  • Predicted logit when body_mass_g = 0
  • When body mass = 0 grams: \(p \approx 0\) (almost certainly Adelie)
  • Rarely interpretable (0 grams is biologically impossible)

Logistic Regression

Interpreting coefficients: slope

\(\beta_1 \approx 0.00623\)

  • For each additional gram of body mass, log-odds increase by 0.00623
  • Odds ratio: \(e^{0.00623} \approx 1.0062\)
  • Each gram of body mass multiplies the odds of being Gentoo by 1.0062
  • A 100 gram difference: \(1.0062^{100} \approx 1.87\) times higher odds }

Logistic Regression

Interpreting coefficients

\[ \text{logit}(p) = -27.5 + 0.00623 \times \text{body_mass_g} \]

Inverse logit (to get probability): \[ p = \frac{e^{\text{logit}(p)}}{1 + e^{\text{logit}(p)}} \]

# logit scale prediction
logit_pred <- -27.5 + 0.00623 * 4500

# convert to probability
prob_pred <- plogis(logit_pred)
prob_pred
[1] 0.6306485

Logistic Regression

Multiple logisitic regression

  • The methods used in infer for hypothesis testing in multiple logistic regression are a bit limited.
    • Need to move to parsnip package.
    • Bit beyond this course, but you will learn it when you need to (hard to avoid!)
  • Super powerful framework