Computational statistics

Lecture 2

Iain R. Moodie

BIOC13 - Ekologi

Department of Biology, Lund University

2025-04-25

Review

Populations and Samples

Review

Populations and Samples

tephritis_data

region	host_plant	patry	sex	body_length_mm	ovipositor_length_mm	wing_length_mm	wing_width_mm	melanized_percent	baltic
estonia	heterophyllum	sympatry	male	3.94	NA	4.455026	1.873016	55.96312	east
estonia	heterophyllum	sympatry	female	4.48	1.65	4.710000	2.170000	63.89701	east
estonia	heterophyllum	sympatry	male	4.61	NA	4.990196	2.294118	62.52199	east
estonia	heterophyllum	sympatry	female	5.31	1.78	5.611650	2.582524	61.37781	east
estonia	heterophyllum	sympatry	male	4.51	NA	4.468750	2.093750	62.25657	east
estonia	heterophyllum	sympatry	male	4.74	NA	4.906250	2.333333	57.46406	east

Review

Populations and Samples

Review

Populations and Samples

Review

Populations and Samples

Review

Populations and Samples

Review

Research question

Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?

Review

Observed statistic

obs_stat <-
  tephritis_data |> 
  specify(response = ovipositor_length_mm, explanatory = host_plant) |> 
  calculate(stat = "diff in means", order = c("heterophyllum", "oleraceum"))

obs_stat

Response: ovipositor_length_mm (numeric)
Explanatory: host_plant (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.108

Was this just luck? What would happen if I measured another 600 flies?

Review

The sampling distribution

Review

The sampling distribution

Problem: we (usually) only collect one sample

Review

The sampling distribution

Problem: we (usually) only collect one sample
One (of many) solutions: resample our sample (with replacement) to generate more samples
- Bootstrap resampling
- Assumes that our sample is representative of the population!

Review

The bootstrap sampling distibution

Review

The bootstrap sampling distibution

Example in R for a difference in means:

library(infer)

boot_dist <-
1  tephritis_data |>
2  specify(response = ovipositor_length_mm, explanatory = host_plant) |>
3  generate(reps = 10000, type = "bootstrap") |>
4  calculate(stat = "diff in means", order = c("heterophyllum", "oleraceum"))

1: Name of the dataset we are working with.
2: specify which variables we are interested in.
3: generate new samples with bootstrap resampling.
4: calculate the chosen statistic for each new sample generated.

Review

The bootstrap sampling distibution

visualize(boot_dist)

Review

Research question

Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?

obs_stat

Response: ovipositor_length_mm (numeric)
Explanatory: host_plant (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.108

What would happen if we sampled again?

Review

Confidence intervals

Review

Confidence intervals

If we collected another sample of the same size, how much would our test statistic be likely to vary?
Percentile method
- middle X% of sampling distribution

Review

Confidence intervals

Example in R:

percentile_ci <-
1  boot_dist |>
2  get_confidence_interval(level = 0.95, type = "percentile")

1: The bootstrap sampling distribution
2: Function from infer to calculate a 95% CI with "percentile" method

percentile_ci

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1   0.0738    0.141

Review

Confidence intervals

visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)

Review

Research question

Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?

obs_stat

Response: ovipositor_length_mm (numeric)
Explanatory: host_plant (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.108

What would happen if we sampled again?
- We are pretty sure the difference would be between:

percentile_ci

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1   0.0738    0.141

Review

Research question

Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?
- Observed difference in means: 0.108
- 95% confidence interval: 0.074 - 0.141

Review

Research question

Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?
- Observed difference in means: 0.108
- 95% confidence interval: 0.074 - 0.141
Could we have observed this if there was truly no difference?

Review

Hypothesis testing

Review

Hypothesis testing

Null hypothesis
- There is no difference in ovipositor_length_mm between flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant.
Alternative hypothesis
- There is difference in ovipositor_length_mm between flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant.

Review

Hypothesis testing

Null hypothesis
- There is no difference in ovipositor_length_mm between flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant.
We test to see if our sample allows us to reject the null hypothesis with sufficient confidence.

Review

Hypothesis testing

Need to know what sort of differences we could observe if the null hypothesis was true
Need to generate a null distribution which we will then compare our observed statistic against.

Review

Hypothesis testing

Review

Hypothesis testing

If null hypothesis is true:
- host_plant variable has no predictive value for ovipositor_length_mm
- Randomly shuffling (permuting) host_plant should produce samples similar to observed sample
If null hypothesis is false:
- host_plant variable has predictive value for ovipositor_length_mm
- Randomly shuffling (permuting) host_plant should produce samples different to observed sample

Review

Hypothesis testing

null_dist <-
  tephritis_data |> 
  specify(response = ovipositor_length_mm, explanatory = host_plant) |>
1  hypothesise(null = "independence") |>
2  generate(reps = 10000, type = "permute") |>
  calculate(stat = "diff in means", order = c("heterophyllum", "oleraceum"))

1: We declare our null hypothesis to be that the response and explanatory variables are independent of each other.
2: We generate 10000 permutated datasets under this hypothesis, and calculate the statistic for each dataset

Review

Hypothesis testing

visualize(null_dist)

Review

Hypothesis testing

visualize(null_dist) +
  shade_p_value(obs_stat, direction = "both")

Review

Hypothesis testing

null_dist |>
  get_p_value(obs_stat, direction = "both")

# A tibble: 1 × 1
  p_value
    <dbl>
1       0

Review

Research question

Do flies that use "heterophyllum" as a host_plant and flies that use "oleraceum" as a host_plant differ in their ovipositor_length_mm?
- Observed difference in means: 0.108
- 95% confidence interval: 0.074 - 0.141
Could we have observed this if there was truly no difference?
- Very unlikely (p < 0.0001)

How to address other research questions

Difference between >2 groups in a continuous variable

Comparing means between >2 groups

Analysis of Variance (ANOVA)

Comparing means between >2 groups

Analysis of Variance (ANOVA)

Comparing means between >2 groups

Analysis of Variance (ANOVA)

Comparing means between >2 groups

The ratio of variances (F-statistic)

\[ F = \frac{\text{Variance 1}}{\text{Variance 2}} \]

Can (in general) be used to test if the variances are equal between sources of variance
- \(F=1\) if variances are equal
- \(F\) is very low if variance 1 < variance 2
- \(F\) is very high if variance 1 > variance 2

Comparing means between >2 groups

Analysis of Variance (ANOVA)

\[ F = \frac{\text{Mean variance between-group}}{\text{Mean variance within-group}} \]

If \(F\) is very big, more likely that means are different

Comparing means between >2 groups

Analysis of Variance (ANOVA)

If you want to know if >2 groups differ in their means, use F-statistic
Null hypothesis
- The means of all groups are the same
- Mean A == Mean B == Mean C
Alternative hypothesis
- The means of at least one group differ from the global mean

Comparing means between >2 groups

Analysis of Variance (ANOVA)

Different from a hypothesised value?

Continuous variable

Is the mean ovipositor_length_mm in the heterophyllum host race different from 1.78 mm?
Confidence interval approach
Hypothesis testing approach

Different from a hypothesised value?

Continuous variable

Observed difference:

het_data <-
  tephritis_data |> 
  filter(host_plant == "heterophyllum" & sex == "female")

obs_stat <-
  het_data |>
  specify(response = ovipositor_length_mm) |>
  calculate(stat = "mean")

obs_stat

Response: ovipositor_length_mm (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1  1.76

Different from a hypothesised value?

Continuous variable (confidence interval approach)

Different from a hypothesised value?

Continuous variable (confidence interval approach)

boot_dist <-
  het_data |>
  specify(response = ovipositor_length_mm) |>
  generate(reps = 10000, type = "bootstrap") |>
  calculate(stat = "mean")

visualise(boot_dist)

Different from a hypothesised value?

Continuous variable (confidence interval approach)

conf_int <-
  boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")

visualise(boot_dist) +
  shade_confidence_interval(endpoints = conf_int)

Different from a hypothesised value?

Continuous variable (confidence interval approach)

visualise(boot_dist) +
  shade_confidence_interval(endpoints = conf_int) +
  geom_vline(xintercept = 1.78, colour = "red")

Different from a hypothesised value?

Continuous variable (confidence interval approach)

Is the mean ovipositor_length_mm in the heterophyllum host race different from 1.78 mm?
- Observed: 1.764 mm
- 95% confidence interval: 1.74 - 1.79 mm
- True mean is likely in this range
- Hypothesised value (1.78 mm) is also in the range
- We are not so confident that the true mean is not 1.78 mm

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Is the mean ovipositor_length_mm in the heterophyllum host race different from 1.78 mm?

01:30

Think 30 sec, discuss 60 sec

What is our null and alternative hypothesis?

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

How do we get a null distribution?
- Permuting is not an option here

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Bootstrap resample from our new dataset that is compatible with the null hypothesis

set.seed(123)
null_dist <-
  het_data |>
  specify(response = ovipositor_length_mm) |>
  hypothesise(null = "point", mu = 1.78) |>
  generate(reps = 10000, type = "bootstrap") |>
  calculate(stat = "mean")

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Bootstrap resample from our new dataset that is compatible with the null hypothesis

visualise(null_dist)

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Compare with observed value

visualise(null_dist) +
  shade_p_value(obs_stat, direction = "both")

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

null_dist |> get_p_value(obs_stat, direction = "both")

# A tibble: 1 × 1
  p_value
    <dbl>
1   0.200

Different from a hypothesised value?

Continuous variable (hypothesis testing approach)

Is the mean ovipositor_length_mm in the heterophyllum host race different from 1.78 mm?
- Observed: 1.764 mm
- If the true mean was 1.78 mm, probability to observe 1.764 mm or a value more extreme is 0.2 (20% chance)

Categorical data

Tests of proportion

Is a proportion different from a hypothesised value?

In Sweden, 58% prefer dogs (according to a survey of 2000 people by Folksam)
Do our preferences differ from this?
- Does the proportion of biology students who prefer dogs differ from 0.58?

Tests of proportion

Is a proportion different from a hypothesised value?

Tests of proportion

Is a proportion different from a hypothesised value?

Calculate the test statistic:

obs_stat <- 
  animal_pref |>
  specify(response = preference, success = "dog") |>
  calculate(stat = "prop")

obs_stat

Response: preference (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1  0.56

Tests of proportion

Is a proportion different from a hypothesised value? (CI)

Tests of proportion

Is a proportion different from a hypothesised value? (CI)

boot_dist <-
  animal_pref |>
  specify(response = preference, success = "dog") |>
  generate(reps = 10000, type = "bootstrap") |>
  calculate(stat = "prop")

visualise(boot_dist)

Tests of proportion

Is a proportion different from a hypothesised value? (CI)

conf_int <-
  boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")

visualise(boot_dist) +
  shade_confidence_interval(endpoints = conf_int)

Tests of proportion

Is a proportion different from a hypothesised value? (Hyp. test)

Tests of proportion

Is a proportion different from a hypothesised value? (Hyp. test)

null_dist <-
  animal_pref |>
  specify(response = preference, success = "dog") |>
  hypothesize(null = "point", p = 0.58) |>
  generate(reps = 10000, type = "draw") |>
  calculate(stat = "prop")

visualise(null_dist)

Tests of proportion

Is a proportion different from a hypothesised value? (Hyp. test)

visualise(null_dist) +
  shade_p_value(obs_stat = obs_stat, direction = "two-sided")

Tests of proportion

Is a proportion different from a hypothesised value? (Hyp. test)

null_dist |>
  get_p_value(obs_stat = obs_stat, direction = "two-sided")

# A tibble: 1 × 1
  p_value
    <dbl>
1   0.984

Tests of proportion, but more

Are 2 or more distributions different?

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

Exactly what we have just done, but we extend it beyond binary (yes/no, cat/dog) data
Similar to how F-statistic allows for diff in means for many groups
\(\chi^2\) allows for difference in proportions for many categories

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

# A tibble: 120 × 1
   phenotype
   <chr>    
 1 A-B-     
 2 A-B-     
 3 A-B-     
 4 A-B-     
 5 A-B-     
 6 A-B-     
 7 A-B-     
 8 A-B-     
 9 A-B-     
10 A-B-     
# ℹ 110 more rows

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

\[ \chi^2 = \sum \frac{(Observed_i - Expected_i)^2}{Expected_i} \]

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

obs_stat <- 
  mendelian_data |>
  specify(response = phenotype) |>
  hypothesize(
    null = "point",
    p = c(
      "A-B-" = 9/16,
      "A-bb" = 3/16,
      "aaB-" = 3/16,
      "aabb" = 1/16
    )
   ) |>
  calculate(stat = "Chisq")

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

Null hypothesis:
- The sample came from the hypothesised distribution
- The sample distribution is not different from the hypothesised distribution
Alternative hypotheis:
- The sample came from a different distribution to the one hypothesised
- The sample distribution is different from the hypothesised distribution

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

null_dist <- 
  mendelian_data |>
  specify(response = phenotype) |>
  hypothesize(
    null = "point",
    p = c(
      "A-B-" = 9/16,
      "A-bb" = 3/16,
      "aaB-" = 3/16,
      "aabb" = 1/16
    )
   ) |>
  generate(reps = 10000, type = "draw") |>
  calculate(stat = "Chisq")

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

null_dist |>
  visualize() + 
  shade_p_value(obs_stat, direction = "greater")

\(\chi^2\) Goodness of fit

Does the observed data differ from an expected distribution?

null_dist |>
  get_p_value(obs_stat = obs_stat, direction = "greater")

# A tibble: 1 × 1
  p_value
    <dbl>
1   0.570

Are two categorical variables associated with each other?

\(\chi^2\) Test of independence

Are two categorical variables associated with each other?

\(\chi^2\) Test of independence

Are two categorical variables associated with each other?

germ_data

# A tibble: 120 × 2
   treatment germination_success
   <chr>     <chr>              
 1 coating_a germinated         
 2 coating_a germinated         
 3 coating_a germinated         
 4 coating_a failed_to_germinate
 5 coating_a germinated         
 6 coating_a germinated         
 7 coating_a germinated         
 8 coating_a germinated         
 9 coating_a germinated         
10 coating_a germinated         
# ℹ 110 more rows

\(\chi^2\) Test of independence

Are two categorical variables associated with each other?

Null hypothesis:
- The two categorical variables are not associated with each other
- The two categorical variables are independent
Alternative hypothesis:
- The two categorical variables are associated with each other
- The two categorical variables are not independent

\(\chi^2\) Test of independence

Are two categorical variables associated with each other?

\(\chi^2\) Test of independence

Are two categorical variables associated with each other?

obs_stat <- 
  germ_data |>
  specify(response = germination_success, explanatory = treatment) |>
  calculate(stat = "Chisq")

obs_stat

Response: germination_success (factor)
Explanatory: treatment (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1  2.43

\(\chi^2\) Test of independence

Are two categorical variables associated with each other?

null_dist <- 
  germ_data |>
  specify(response = germination_success, explanatory = treatment) |>
  hypothesize(null = "independence") |>
  generate(reps = 10000, type = "permute") |>
  calculate(stat = "Chisq")

\(\chi^2\) Test of independence

Are two categorical variables associated with each other?

visualize(null_dist) +
  shade_p_value(obs_stat, direction = "greater")

\(\chi^2\) Test of independence

Are two categorical variables associated with each other?

null_dist |>
  get_p_value(obs_stat = obs_stat, direction = "greater")

# A tibble: 1 × 1
  p_value
    <dbl>
1   0.497

Correlation

Do two continuous variables covary?

Used to assess if two continuous variables are independent, or if they covary (linearly)
We do not express one variable as a function of another:
- No “response” and “explanatory”
Usually assumed that both variables are “effects of a common cause” (Sokal and Rohlf 1995)
“Correlation does not mean causation”
- But you can measure a correlation between a cause and effect

Correlation

Do two continuous variables covary?

Correlation coefficient (Pearson):

\[ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]

\[ r = \frac{\text{Covariance}(x,y)}{\text{Standard deviation}(x) \times \text{Standard deviation}(y)} \]

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Calculate standard deviation in \(x\):

\[ \sigma_x = \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \]

data |>
  observe(response = x, stat = "sd")

Response: x (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1  10.9

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Calculate standard deviation in \(y\):

\[ \sigma_y = \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2} \]

data |>
  observe(response = y, stat = "sd")

Response: y (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1  3.51

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

Calculate the covariance between \(x\) and \(y\):

\[ Cov(x,y) = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \]

data |>
  cov()

          x        y
x 118.23297 33.87468
y  33.87468 12.31625

Correlation

Do two continuous variables covary?

\[ r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} \]

data |>
  specify(x ~ y) |>
  calculate(stat = "correlation")

Response: x (numeric)
Explanatory: y (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.888

Correlation

Do two continuous variables covary?

Correlation

Do two continuous variables covary?

Correlation

Do two continuous variables covary?

obs_stat <-
  tephritis_data |>
  specify(ovipositor_length_mm ~ body_length_mm) |>
  calculate(stat = "correlation")

obs_stat

Response: ovipositor_length_mm (numeric)
Explanatory: body_length_mm (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.547

Correlation

Do two continuous variables covary? (CI)

Correlation

Do two continuous variables covary? (CI)

boot_dist <-
  tephritis_data |>
  specify(ovipositor_length_mm ~ body_length_mm) |>
  generate(reps = 10000, type = "bootstrap") |>
  calculate(stat = "correlation")

visualise(boot_dist)

Correlation

Do two continuous variables covary? (CI)

conf_int <- 
  boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")

visualise(boot_dist) +
  shade_confidence_interval(endpoints = conf_int)

Correlation

Do two continuous variables covary? (Hyp. test)

Correlation

Do two continuous variables covary? (Hyp. test)

02:00

Think 30 sec, discuss 90 sec

What is our null and alternative hypothesis? How could we generate data compatible with the null?

Correlation

Do two continuous variables covary? (Hyp. test)

null_dist <-
  tephritis_data |>
  specify(ovipositor_length_mm ~ body_length_mm) |>
  hypothesise(null = "independence") |>
  generate(reps = 10000, type = "permute") |>
  calculate(stat = "correlation")

Correlation

Do two continuous variables covary? (Hyp. test)

visualize(null_dist) +
  shade_p_value(obs_stat = obs_stat, direction = "both")

Regression

How does variable Y depend on variable X

A mathmatical function describing the relationship between two continuous variables
Y is dependant on X (the)
Y is a function of X
Predictive model
Very powerful framework

Regression

How does variable Y depend on variable X

\[ y = \text{Slope}\times x + \text{Intercept} \]

\[ y = mx+c \]

\[ y = \beta_1x+\beta_0 \]

\[ y = \beta_0+\beta_1x+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5 \]

Regression

How does variable Y depend on variable X

Regression

How does variable Y depend on variable X

Regression

How does variable Y depend on variable X

Example: For sexual selection to operate, an increase in mating success (number of mates) must result in an increase in reproductive success (number of offspring).

Regression

How does variable Y depend on variable X

\[ y = \beta_1x+\beta_0 \]

\[ y = 3.83x-0.68 \]

\(\beta_1\) = strength of sexual selection
For each additional mate, an individual (on average) gains \(\beta_1\) additional offspring
For 5 mates (\(x=5\)):
- \(y = 3.83\times5-0.68\)
- \(y = 18.47\)

Regression

How does variable Y depend on variable X

Regression

How does variable Y depend on variable X

Regression

How does variable Y depend on variable X

Regression

How does variable Y depend on variable X

Regression

How does variable Y depend on variable X

Fit by solving to minimise the sum of the squared residuals (SSR)
- Find \(\beta_1\) and \(\beta_0\) that minimise the SSR
- Called a “loss function”
- Many approaches to do this!

Regression

How does variable Y depend on variable X

Regression

How to fit in R?

Base R:

lm(data = mating_data, reproductive_success ~ mating_success)


Call:
lm(formula = reproductive_success ~ mating_success, data = mating_data)

Coefficients:
   (Intercept)  mating_success  
       -0.6844          3.8344

Regression

How to fit in R?

With infer:

observed_fit <-
  mating_data |>
  specify(reproductive_success ~ mating_success) |>
  fit()

observed_fit

# A tibble: 2 × 2
  term           estimate
  <chr>             <dbl>
1 intercept        -0.684
2 mating_success    3.83

Regression

Confidence intervals

boot_dist <-
  mating_data |>
  specify(reproductive_success ~ mating_success) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

boot_dist

# A tibble: 2,000 × 3
# Groups:   replicate [1,000]
   replicate term           estimate
       <int> <chr>             <dbl>
 1         1 intercept        -3.75 
 2         1 mating_success    4.32 
 3         2 intercept        -4.96 
 4         2 mating_success    4.73 
 5         3 intercept        -0.267
 6         3 mating_success    3.80 
 7         4 intercept         6.72 
 8         4 mating_success    2.84 
 9         5 intercept        -1.08 
10         5 mating_success    3.96 
# ℹ 1,990 more rows

Regression

Confidence intervals

conf_ints <- 
  get_confidence_interval(
    boot_dist, 
    level = .95, 
    point_estimate = observed_fit
  )

conf_ints

# A tibble: 2 × 3
  term           lower_ci upper_ci
  <chr>             <dbl>    <dbl>
1 intercept         -5.33     4.36
2 mating_success     2.99     4.66

Regression

Confidence intervals

visualize(boot_dist) +
  shade_confidence_interval(endpoints = conf_ints)

Regression

Hypothesis test

Null hypothesis:
- Slope = 0
Alternative hypothesis:
- Slope \(\neq\) 0
- Slope < 0
- Slope > 0

Regression

Hypothesis test

null_dist <-
  mating_data |>
  specify(reproductive_success ~ mating_success) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

null_dist

# A tibble: 2,000 × 3
# Groups:   replicate [1,000]
   replicate term           estimate
       <int> <chr>             <dbl>
 1         1 intercept       20.3   
 2         1 mating_success   0.0382
 3         2 intercept       23.9   
 4         2 mating_success  -0.603 
 5         3 intercept       25.8   
 6         3 mating_success  -0.951 
 7         4 intercept       23.7   
 8         4 mating_success  -0.570 
 9         5 intercept       27.2   
10         5 mating_success  -1.20  
# ℹ 1,990 more rows

Regression

Hypothesis test

visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")

Regression

Hypothesis test

null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")

# A tibble: 2 × 2
  term           p_value
  <chr>            <dbl>
1 intercept            0
2 mating_success       0

Regression

Multiple regression

\[ y = \beta_1x+\beta_0 \]

\[ y = \beta_0+\beta_1x+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5 \]

data |>
specify(reproductive_success ~ mating_success + sex + feather_colour) ...

Take-away points

Introduced two new methods of generating a null hypothesis
- "permute"
- "draw"
Introduced four new statistics
- "F-statistic" (compare variances to infer if means differ)
- "Chi-sq" (compare distributions of categorical data)
- "correlation" (do two variables covary linearly?)
- Regression coefficients (via fit())

Exercises

References

Sokal, R. R., and F. J. Rohlf. 1995. Biometry. https://lubcat.lub.lu.se/cgi-bin/koha/opac-detail.pl?biblionumber=2483009.

Computational statistics

Review

Review

Populations and Samples

Review

Populations and Samples

Review

Populations and Samples

Review

Populations and Samples

Review

Populations and Samples

Review

Populations and Samples

Review

Research question

Review

Observed statistic

Review

The sampling distribution

Review

The sampling distribution

Review

The sampling distribution

Review

The bootstrap sampling distibution

Review

The bootstrap sampling distibution

Review

The bootstrap sampling distibution

Review

Research question

Review

Confidence intervals

Review

Confidence intervals

Review

Confidence intervals

Review

Confidence intervals

Review

Research question

Review

Research question

Review

Research question

Review

Hypothesis testing

Review

Hypothesis testing

Review

Hypothesis testing

Review

Hypothesis testing

Review

Hypothesis testing

Review

Hypothesis testing

Review

Hypothesis testing

Review

Hypothesis testing

Review

Hypothesis testing

Review

Hypothesis testing

Review

Research question

How to address other research questions

Other research questions

Already covered

Other research questions

Today

Difference between >2 groups in a continuous variable

Comparing means between >2 groups

Analysis of Variance (ANOVA)

Comparing means between >2 groups

Analysis of Variance (ANOVA)

Comparing means between >2 groups

Analysis of Variance (ANOVA)