The scientific method & experimental design

Lecture 3

Iain R. Moodie

Thursday 26th March, 2026

Populations and samples

Populations and samples

Why we collect data

  • Recording some kind of observation or measurement
  • Example:
    • Measuring the heights of different trees in a forest
    • Measuring the carbon in the forest soil at different locations
  • We want to say something about the forest in general

A drone shot of a forest

Photo by Olena Bohovyk

Populations and samples

Why we collect data

  • We cannot measure every tree, or the soil at every location
  • Instead, we collect a sample of data
  • We use the sample to draw conclusions about the population
  • Statistics allows us to approximate properties of entire populations from a limited number of samples

Populations and samples

Definitions

Population

The totality of individual observations about which inferences are to be made, existing anywhere in the world or at least within a definitely specified sampling area limited in space and time.

Sample

A collection of individual observations selected by a specified procedure.

Populations and samples

Examples

Populations

  • All the spruce (gran) trees in Skåne
  • All the blue tits (blåmes) in Sweden
  • All the genes in the common fruit fly (Drosophila melanogaster)
  • All the herring (sill) in the Baltic sea

Samples

  • 300 spruce trees from forests in Skåne
  • 100 caught blue tits from nest boxes in Sweden
  • 20 genes from the Drosophila melanogaster genome
  • 1000 herring caught by a fishing boat off the coast of Karlskrona

Populations and samples

Parameters and statistics

  • Many statistical analyses focus on a numerical summary of the data.
    • E.g. mean, standard deviation, correlation
  • Calculated from every individual in the population (measurement error aside), the summary is a population parameter
  • Calculated from a sample, it is a sample statistic, which we use to estimate the parameter
  • If the sample was collected representatively, the sample statistic should be a good approximation of the population parameter.
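
As a rough illustration of the difference between a parameter and a statistic, the sketch below simulates a hypothetical population of 10 000 tree heights (the numbers are invented, not real data) and compares the population mean with the mean of a random sample of 300 trees.

```r
set.seed(1)

# Hypothetical population: heights (m) of 10 000 trees (simulated, not real data)
population <- rnorm(10000, mean = 20, sd = 4)

# Population parameter: calculated from every individual
pop_mean <- mean(population)

# Sample statistic: calculated from a simple random sample of 300 trees
tree_sample <- sample(population, size = 300)
sample_mean <- mean(tree_sample)

# How far the statistic lands from the parameter
sample_mean - pop_mean
```

In practice we almost never have `population` available, which is why we rely on the sample statistic.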

Populations and samples

Anecdotal evidence

“I saw a bumblebee in Skrylle that was huge! Therefore bumblebees in Skrylle must be unusually large.”

  • Few data points
  • Data collected haphazardly
  • Rare cases are more memorable than common ones

Populations and samples

How to sample from a population

How could we collect a representative sample of:

  • Students currently studying at the Department of Biology?
  • Students across the whole university?

Photo by Alexandra Roslund

Populations and samples

How to sample from a population

How could we collect a representative sample of:

  • DNA from red squirrels in Skåne?

Populations and samples

How to sample from a population

  • If we want to claim that our sample statistic is a good representation of the population parameter:
    • The sample must be unbiased
    • Random sampling is a good way to achieve that
      • But sometimes simple random sampling is not appropriate
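
A minimal sketch of why random sampling helps, using a hypothetical population of bumblebee body lengths (simulated values). The biased sample assumes that larger bees are more likely to be noticed and caught; the weighting by `exp(body_length)` is an arbitrary choice made only for illustration.

```r
set.seed(2)

# Hypothetical population of 5000 bumblebee body lengths (mm), simulated
body_length <- rnorm(5000, mean = 15, sd = 2)

# Simple random sample: every individual has the same chance of selection
random_sample <- sample(body_length, size = 50)

# Biased sample: larger bees are more likely to be caught
# (selection weights are an illustrative assumption)
biased_sample <- sample(body_length, size = 50, prob = exp(body_length))

c(
  population = mean(body_length),
  random     = mean(random_sample),
  biased     = mean(biased_sample)
)
```

The random sample mean should land close to the population mean, while the biased sample overestimates it.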

Populations and samples

If you know the population parameter, there is no need for statistics.

Experimental design

Experimental design

Observational vs experimental studies

What is the main difference between these two studies?

  1. I measure the biomass of wild Mercurialis annua plants found in sandy soils and in loamy soils in a nature reserve.
  2. I grow Mercurialis annua plants from seed in either sandy or loamy soil, and measure their biomass after 3 months.

Photo by Michael Becker

Experimental design

Principles of experimental design

Experiment:

  • When we assign treatments
  • When we make an intervention
  • When we manipulate something
  • No longer just observing

Experimental design

Principles of experimental design: controlling

Try to control for differences that we can control but are not interested in.

For example:

  • Water all plants the same amount
  • Keep the temperature in the greenhouse the same
  • Space out the plants evenly

Experimental design

Principles of experimental design: randomisation

Try to account for differences that we cannot control and are not interested in.

For example:

  • Randomly assign seeds to soil type (treatment)
  • Randomly assign pots to rooms in a greenhouse
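
Random assignment can be sketched in a few lines of R. The example assumes 20 hypothetical pots and a balanced design with 10 pots per soil type; the pot count and column names are illustrative.

```r
set.seed(3)

n_pots <- 20  # hypothetical number of pots

# Balanced design: 10 pots per soil type, shuffled into a random order
treatment <- sample(rep(c("sandy", "loamy"), each = n_pots / 2))

design <- data.frame(pot = seq_len(n_pots), soil = treatment)

# Check that the design is balanced
table(design$soil)
```

Because the shuffle does not depend on anything about the seeds or pots, uncontrolled differences between them are spread across both treatments on average.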

Experimental design

Principles of experimental design: replication

Which statement gives you more confidence? Why?

“A clinical trial of a new blood pressure medication reduced the number of heart attacks in the treatment group by 96% and no negative side effects were reported (sample size = 14 people)”

“A clinical trial of a new blood pressure medication reduced the number of heart attacks in the treatment group by 82%, and 2% of participants reported negative side effects (sample size = 300 people)”

Experimental design

Principles of experimental design: replication

  • The larger the sample size, the more precisely we can estimate the effect of our treatment (explanatory variable) on the response variable.
  • Each replicate should be independent of all others
    • Otherwise we risk pseudoreplication
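
The effect of sample size can be illustrated with a small simulation. The sketch below repeats a hypothetical experiment 2000 times at each sample size and records how much the sample mean varies between repeats; the mean of 10 and standard deviation of 3 are arbitrary values.

```r
set.seed(4)

# Standard deviation of the sample mean across many simulated experiments
sd_of_mean <- function(n, reps = 2000) {
  sd(replicate(reps, mean(rnorm(n, mean = 10, sd = 3))))
}

sample_sizes <- c(5, 20, 80, 320)
spread <- sapply(sample_sizes, sd_of_mean)

# Compare the simulated spread with the theoretical standard error sd / sqrt(n)
data.frame(
  n                 = sample_sizes,
  sd_of_sample_mean = spread,
  theory            = 3 / sqrt(sample_sizes)
)
```

Each fourfold increase in sample size roughly halves the spread of the estimate.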

Experimental design

Principles of experimental design: replication

Pseudoreplication

  • 50 plants in each treatment
  • I measure 10 leaves from each plant
  • Is my sample size per treatment:
    • n = 50
    • n = 500
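
One common way to handle this situation is to summarise the measurements to one value per independent unit. The sketch below assumes the plant is the independent replicate and uses simulated leaf lengths; all numbers and column names are hypothetical.

```r
set.seed(5)

n_plants <- 50
leaves_per_plant <- 10

# Each plant has its own mean leaf length; leaves on one plant share it,
# so leaves from the same plant are not independent of each other
plant_mean <- rnorm(n_plants, mean = 40, sd = 5)

leaves <- data.frame(
  plant = rep(seq_len(n_plants), each = leaves_per_plant),
  leaf_length = rnorm(
    n_plants * leaves_per_plant,
    mean = rep(plant_mean, each = leaves_per_plant),
    sd = 2
  )
)

# Average the leaves within each plant: one value per independent unit
per_plant <- aggregate(leaf_length ~ plant, data = leaves, FUN = mean)

nrow(leaves)     # 500 leaf measurements
nrow(per_plant)  # 50 independent replicates
```

Treating the 500 leaf measurements as independent would overstate how much information the treatment group contains.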

Experimental design

Principles of experimental design: blocking

Used when we suspect that variables other than the treatment influence the response. Sometimes also done for logistical reasons.

Examples:

  • Temporal blocks: split into experimental groups that are conducted at different times
  • Spatial blocks: split into experimental groups that are conducted in different locations
  • “Risk” blocks: split into experimental groups that you expect to react differently to the treatment
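
A randomised block design can be sketched by randomising treatments separately within each block. The example assumes three hypothetical greenhouse rooms as spatial blocks, with 8 pots each; the names are placeholders.

```r
set.seed(6)

blocks <- c("room_A", "room_B", "room_C")  # hypothetical spatial blocks
pots_per_block <- 8

# Within each block, half the pots get each soil type, in random order
design <- do.call(rbind, lapply(blocks, function(b) {
  data.frame(
    block = b,
    pot   = seq_len(pots_per_block),
    soil  = sample(rep(c("sandy", "loamy"), each = pots_per_block / 2))
  )
}))

# Every block contains both treatments equally often
table(design$block, design$soil)
```

Because each block contains every treatment, differences between blocks cannot be mistaken for a treatment effect.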

Experimental design

Common types of experimental design: factorial design

Experiments where multiple treatments are applied, and all combinations of treatments are used:

Soil type    Fertiliser
Sandy        None
Sandy        Added
Loamy        None
Loamy        Added
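
In R, the full set of treatment combinations can be generated with `expand.grid()`. The column names below are illustrative.

```r
# All combinations of two soil types and two fertiliser levels
design <- expand.grid(
  soil_type  = c("Sandy", "Loamy"),
  fertiliser = c("None", "Added"),
  stringsAsFactors = FALSE
)

design
nrow(design)  # 4 treatment combinations
```

Adding another factor, or another level, to the call extends the design without having to list the combinations by hand.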

Causation

Causation

Observational vs experimental studies

What is the main difference between these two studies?

  1. I measure the biomass of wild Mercurialis annua plants found in sandy soils and in loamy soils in a nature reserve.
  2. I grow Mercurialis annua plants from seed in either sandy or loamy soil, and measure their biomass after 3 months.

Causation

Causal pathways

DAG: Soil type → Plant biomass

Causation

Confounding variables

  • Anything that confuses you about the causation
    • Can be omitted variables

DAG nodes: Soil type, Plant biomass, u (an omitted variable)

Causation

Confounding variables

  • Anything that confuses you about the causation
    • Can be omitted variables
    • Can also be measured

DAG nodes: Soil type, Plant biomass, Plant density (a measured variable)

Causation

Causal reasoning via DAGs

  • Directed acyclic graphs
  • Causation flows along the arrows
    • A causes Y
    • B also causes Y
  • Used to define causal relationships to then:
    • Design experiments
    • Design statistical methods

DAG: A → Y ← B

Causation

Causal reasoning via DAGs: forks

  • X and Y are associated (not independent)
  • Z is a “common cause”
  • Once grouped by Z, no association between X and Y

DAG: X ← Z → Y

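# Simulate a fork: Z is a common cause of both X and Y
library(tidyverse)

set.seed(11)
N <- 300
Z <- rbinom(N, 1, 0.5)    # binary common cause
X <- rnorm(N, 2 * Z - 1)  # X depends only on Z
Y <- rnorm(N, 2 * Z - 1)  # Y depends only on Z

df <- tibble(X = X, Y = Y, Z = factor(Z))

# Coloured lines: fits within each level of Z; black line: fit ignoring Z
ggplot(df, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  geom_smooth(
    aes(color = NULL), method = "lm", se = FALSE, formula = y ~ x,
    color = "black", linetype = "solid", linewidth = 1.2
  )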
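
To put rough numbers on the fork pattern, the sketch below reruns the same kind of simulation and compares the overall correlation with the correlation inside each level of Z. It uses a larger N than the plot (an arbitrary choice, to make the estimates more stable) and base R only.

```r
set.seed(11)

N <- 2000
Z <- rbinom(N, 1, 0.5)    # common cause
X <- rnorm(N, 2 * Z - 1)  # Z -> X
Y <- rnorm(N, 2 * Z - 1)  # Z -> Y

# Association when Z is ignored
overall <- cor(X, Y)

# Association within each level of Z
within_Z <- sapply(split(data.frame(X, Y), Z), function(d) cor(d$X, d$Y))

overall
within_Z
```

The overall correlation should be clearly positive, while the two within-group correlations should be close to zero.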

Causation

Causal reasoning via DAGs: pipes

  • X and Y are associated (not independent)
  • The effect of X on Y is transmitted through Z
  • Once grouped by Z, no association between X and Y

DAG: X → Z → Y

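# Simulate a pipe: X influences Z, and Z influences Y
library(tidyverse)

set.seed(11)
N <- 300
X <- rnorm(N)
# Base R equivalents of Rlab::rbern() and rethinking::inv_logit()
Z <- rbinom(N, 1, plogis(X))  # probability that Z = 1 increases with X
Y <- rnorm(N, 2 * Z - 1)      # Y depends only on Z

df <- tibble(X = X, Y = Y, Z = factor(Z))

# Coloured lines: fits within each level of Z; black line: fit ignoring Z
ggplot(df, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  geom_smooth(
    aes(color = NULL), method = "lm", se = FALSE, formula = y ~ x,
    color = "black", linetype = "solid", linewidth = 1.2
  )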
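
Another way to look at the pipe, sketched here with linear models rather than a plot: the slope of Y on X should be positive when Z is left out, and close to zero once Z is included in the model. The larger N and the use of `lm()` are choices made for this illustration.

```r
set.seed(12)

N <- 2000
X <- rnorm(N)
Z <- rbinom(N, 1, plogis(X))  # X -> Z
Y <- rnorm(N, 2 * Z - 1)      # Z -> Y

# Slope of Y on X, ignoring Z
slope_unadjusted <- unname(coef(lm(Y ~ X))["X"])

# Slope of Y on X, holding Z constant
slope_given_Z <- unname(coef(lm(Y ~ X + Z))["X"])

c(unadjusted = slope_unadjusted, given_Z = slope_given_Z)
```

Conditioning on Z blocks the only path from X to Y, so the adjusted slope carries almost no information about the total effect of X.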

Causation

Causal reasoning via DAGs: colliders

  • X and Y are not associated (independent)
  • X and Y both influence Z
  • Once grouped by Z, X and Y are associated

DAG: X → Z ← Y

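# Simulate a collider: X and Y are independent, and both influence Z
library(tidyverse)

set.seed(11)
N <- 300
X <- rnorm(N)
Y <- rnorm(N)
# Base R equivalents of Rlab::rbern() and rethinking::inv_logit()
Z <- rbinom(N, 1, plogis(2 * X + 2 * Y - 2))

df <- tibble(X = X, Y = Y, Z = factor(Z))

# Coloured lines: fits within each level of Z; black line: fit ignoring Z
ggplot(df, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  geom_smooth(
    aes(color = NULL), method = "lm", se = FALSE, formula = y ~ x,
    color = "black", linetype = "solid", linewidth = 1.2
  )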
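
The collider pattern can also be summarised numerically. This sketch uses the same data-generating process with a larger N (an arbitrary choice for stability) and compares the overall correlation with the correlation inside each level of Z.

```r
set.seed(13)

N <- 2000
X <- rnorm(N)
Y <- rnorm(N)                                 # independent of X
Z <- rbinom(N, 1, plogis(2 * X + 2 * Y - 2))  # X -> Z <- Y

# Association when Z is ignored
overall <- cor(X, Y)

# Association within each level of Z
within_Z <- sapply(split(data.frame(X, Y), Z), function(d) cor(d$X, d$Y))

overall
within_Z
```

The overall correlation should be near zero, while both within-group correlations should be negative: grouping by the collider creates an association that is not there in the full data.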

Causation

Causal reasoning via DAGs: descendants

  • X and Y are causally associated via Z
  • A contains information about Z
  • Once grouped by A, X and Y are less associated
  • A is a proxy for Z

DAG nodes: X, Z, Y, A (A is a descendant of Z)
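
A minimal simulation of the descendant case, assuming the structure X → Z → Y with Z → A (the arrows between X, Z and Y are an assumption here, as are the noise levels). A is a noisy copy of Z, so adjusting for A should weaken, but not remove, the association between X and Y.

```r
set.seed(15)

N <- 2000
X <- rnorm(N)
Z <- rnorm(N, mean = X)            # X -> Z (assumed)
Y <- rnorm(N, mean = Z)            # Z -> Y (assumed)
A <- rnorm(N, mean = Z, sd = 0.5)  # Z -> A: A is a noisy proxy for Z

# Slope of Y on X, ignoring A
slope_unadjusted <- unname(coef(lm(Y ~ X))["X"])

# Slope of Y on X, holding the proxy A constant
slope_given_A <- unname(coef(lm(Y ~ X + A))["X"])

c(unadjusted = slope_unadjusted, given_A = slope_given_A)
```

The less noise there is in A, the closer conditioning on A comes to conditioning on Z itself.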

The scientific method

The scientific method

Why do we do science?

  • Why do you do science?
  • Why should we (as a society) do science?
  • Who do we do science for (if anyone)?

The scientific method

How do we do science?

  • Why do we study what we study?
    • Who decides?
  • How do we find agreement?
    • How do we handle disagreement?
  • How do we go from unknowns to knowns?
  • How does a scientific field progress?