Welcome to the {tidyverse}

Open R Sessions 2023

Violeta Caballero López
Laura Hildesheim
Simon Jacobsen Ellerstrand
Iain Moodie
Pedro Rosero

Functions re-cap

  • a piece of code that takes an input, does something, and returns an output.
do_something <- function(input){
  output <- input  # placeholder for the "does something" step
  return(output)
}


Stringing together multiple functions

  1. Multiple assignments
did_something <- do_something(data)

did_another_thing <- do_another_thing(did_something)

final_thing <- do_last_thing(did_another_thing)
  2. Nested functions
final_thing <- do_last_thing(do_another_thing(do_something(data)))
  3. Pipes!

Pipes in R

final_thing <-
  data |>
  do_something() |>
  do_another_thing() |>
  do_last_thing()
  • Could be verbalised as “and then”
  • Base R pipe (R >= 4.1.0) = |>
  • magrittr package pipe = %>%
  • they differ slightly (e.g. |> needs parentheses on the right-hand call and uses _ rather than . as its placeholder)
  • my advice: use |>
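As a minimal sketch of the three styles above, the snippet below runs the same three steps as multiple assignments, as nested calls, and as a pipe. The helper functions add_one(), times_two() and as_label() are hypothetical, invented only for this illustration; only base R is assumed.

```r
# Hypothetical helper functions, made up for this illustration.
add_one   <- function(x) x + 1
times_two <- function(x) x * 2
as_label  <- function(x) paste("value:", x)

data <- 3

# 1. Multiple assignments: every intermediate result gets a name.
step_one <- add_one(data)
step_two <- times_two(step_one)
assigned <- as_label(step_two)

# 2. Nested functions: read from the inside out.
nested <- as_label(times_two(add_one(data)))

# 3. Native pipe (needs R 4.1.0 or later): read top to bottom, "and then".
piped <-
  data |>
  add_one() |>
  times_two() |>
  as_label()

# All three approaches give the same result.
identical(assigned, nested)
identical(nested, piped)
```

The pipe version avoids both the throwaway intermediate names and the inside-out reading order.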

Biology is a Data Science

Data Science

  1. “Wrangling” the dataset
  2. Analysing the data
  3. Reporting the results

50-80% of a data scientist’s time is spent wrangling data

(not fun)

A collection of opinionated R packages designed for data science

What tidyverse is designed for

Install tidyverse like any other R package from CRAN:

install.packages("tidyverse")


Load the package with library():

library(tidyverse)

Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


  • read_delim() instead of base R's read.delim()
  • read_csv()
  • read_tsv()
  • write_*()
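A small round trip may help to see how the read and write functions listed above fit together. This is only a sketch: the toy data frame and the temporary file name are made up for illustration, and show_col_types = FALSE is used simply to silence the column specification message.

```r
library(readr)

# Toy data invented for this sketch (not the real penguin file).
toy <- data.frame(
  id   = c("1_Adelie_male_C", "2_Adelie_female_C"),
  year = c(2007, 2007)
)

# Write to a temporary file so nothing in the working directory is touched.
path <- file.path(tempdir(), "toy_penguins.csv")
write_csv(toy, path)

# Read it back; readr guesses the column types and returns a tibble.
toy_back <- read_csv(path, show_col_types = FALSE)
class(toy_back)
```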


penguins <- read_csv("palmerpenguins_untidy.csv")
Rows: 1376 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): id, measurement
dbl (2): year, value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


The data is stored as a tibble

class(penguins)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
penguins
# A tibble: 1,376 × 4
   id                 year measurement value
   <chr>             <dbl> <chr>       <dbl>
 1 1_Adelie_male_C    2007 bm_lb        8.27
 2 1_Adelie_male_C    2007 fl_cm       18.1 
 3 1_Adelie_male_C    2007 bl_in        1.54
 4 1_Adelie_male_C    2007 bd_mm       18.7 
 5 2_Adelie_female_C  2007 bm_lb        8.38
 6 2_Adelie_female_C  2007 fl_cm       18.6 
 7 2_Adelie_female_C  2007 bl_in        1.56
 8 2_Adelie_female_C  2007 bd_mm       17.4 
 9 3_Adelie_female_C  2007 bm_lb        7.16
10 3_Adelie_female_C  2007 fl_cm       19.5 
# ℹ 1,366 more rows


The tidyverse is built around “tidy data”

A dataset is tidy if:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.




penguins |>
  pivot_wider(names_from = measurement, values_from = value)
# A tibble: 344 × 6
   id                 year bm_lb fl_cm bl_in bd_mm
   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl>
 1 1_Adelie_male_C    2007  8.27  18.1  1.54  18.7
 2 2_Adelie_female_C  2007  8.38  18.6  1.56  17.4
 3 3_Adelie_female_C  2007  7.16  19.5  1.59  18  
 4 4_Adelie_NA_C      2007 NA     NA   NA     NA  
 5 5_Adelie_female_C  2007  7.61  19.3  1.44  19.3
 6 6_Adelie_male_C    2007  8.05  19    1.55  20.6
 7 7_Adelie_female_C  2007  7.99  18.1  1.53  17.8
 8 8_Adelie_male_C    2007 10.3   19.5  1.54  19.6
 9 9_Adelie_NA_C      2007  7.66  19.3  1.34  18.1
10 10_Adelie_NA_C     2007  9.37  19    1.65  20.2
# ℹ 334 more rows


penguins_tidy <-
penguins |>
  pivot_wider(names_from = measurement, values_from = value) |>
  separate(col = id, into = c("penguin_id", "species", "sex", "island_id"), sep = "_", convert = TRUE)

# A tibble: 344 × 9
   penguin_id species sex    island_id  year bm_lb fl_cm bl_in bd_mm
        <int> <chr>   <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>
 1          1 Adelie  male   C          2007  8.27  18.1  1.54  18.7
 2          2 Adelie  female C          2007  8.38  18.6  1.56  17.4
 3          3 Adelie  female C          2007  7.16  19.5  1.59  18  
 4          4 Adelie  <NA>   C          2007 NA     NA   NA     NA  
 5          5 Adelie  female C          2007  7.61  19.3  1.44  19.3
 6          6 Adelie  male   C          2007  8.05  19    1.55  20.6
 7          7 Adelie  female C          2007  7.99  18.1  1.53  17.8
 8          8 Adelie  male   C          2007 10.3   19.5  1.54  19.6
 9          9 Adelie  <NA>   C          2007  7.66  19.3  1.34  18.1
10         10 Adelie  <NA>   C          2007  9.37  19    1.65  20.2
# ℹ 334 more rows
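Since tidyr 1.3.0, separate() is marked as superseded (it still works). A possible alternative is separate_wider_delim(), sketched below on two made-up id strings. Note the assumption baked into this sketch: separate_wider_delim() has no convert argument, so the type conversion and the "NA" text are handled by hand afterwards.

```r
library(tidyr)

# Two id strings in the same format as the penguin data (toy example).
ids <- data.frame(id = c("1_Adelie_male_C", "4_Adelie_NA_C"))

split_ids <-
  ids |>
  separate_wider_delim(
    cols  = id,
    delim = "_",
    names = c("penguin_id", "species", "sex", "island_id")
  )

# All new columns are character, so convert by hand where needed.
split_ids$penguin_id <- as.integer(split_ids$penguin_id)
split_ids$sex[split_ids$sex == "NA"] <- NA

split_ids
```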


  • pivot_wider() & pivot_longer()
  • separate(), extract() & unite()
  • nest() & unnest()
  • replace_na()
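pivot_longer() is the inverse of pivot_wider(). The sketch below, on a made-up two-row table, goes from wide to long and back again; the column names mirror the penguin data but the values are for illustration only.

```r
library(tidyr)

# Toy wide table invented for this sketch.
wide <- tibble::tibble(
  id    = c("1_Adelie", "2_Gentoo"),
  bm_lb = c(8.27, 9.92),
  fl_cm = c(18.1, 21.1)
)

# Wide to long: one row per id and measurement.
long <-
  wide |>
  pivot_longer(
    cols      = c(bm_lb, fl_cm),
    names_to  = "measurement",
    values_to = "value"
  )

# Long back to wide: one column per measurement again.
back <-
  long |>
  pivot_wider(names_from = measurement, values_from = value)
```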



dplyr provides a grammar of data manipulation

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names
  • filter() picks cases based on their values
  • summarise() reduces multiple values down to a single summary
  • arrange() changes the ordering of the rows
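The five verbs above can be chained with the pipe. The following is a minimal sketch on a made-up four-row table, not the penguin data itself; the 453.6 g per lb factor is the one used elsewhere in these slides.

```r
library(dplyr)

# Toy data invented for this sketch.
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bm_lb   = c(8, 7, 11, 12)
)

result <-
  toy |>
  mutate(body_mass_g = bm_lb * 453.6) |>  # add a new variable
  select(species, body_mass_g) |>         # keep only these columns
  filter(body_mass_g > 3200) |>           # keep only the heavier birds
  arrange(desc(body_mass_g))              # heaviest first

# Reduce the remaining rows to a single summary value.
overall <-
  result |>
  summarise(mean_body_mass_g = mean(body_mass_g))
```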


  1. change body mass from lb into g
penguins_tidy |>
  mutate(body_mass_g = bm_lb * 453.6)
# A tibble: 344 × 10
   penguin_id species sex    island_id  year bm_lb fl_cm bl_in bd_mm body_mass_g
        <int> <chr>   <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
 1          1 Adelie  male   C          2007  8.27  18.1  1.54  18.7        3750
 2          2 Adelie  female C          2007  8.38  18.6  1.56  17.4        3800
 3          3 Adelie  female C          2007  7.16  19.5  1.59  18          3250
 4          4 Adelie  <NA>   C          2007 NA     NA   NA     NA            NA
 5          5 Adelie  female C          2007  7.61  19.3  1.44  19.3        3450
 6          6 Adelie  male   C          2007  8.05  19    1.55  20.6        3650
 7          7 Adelie  female C          2007  7.99  18.1  1.53  17.8        3625
 8          8 Adelie  male   C          2007 10.3   19.5  1.54  19.6        4675
 9          9 Adelie  <NA>   C          2007  7.66  19.3  1.34  18.1        3475
10         10 Adelie  <NA>   C          2007  9.37  19    1.65  20.2        4250
# ℹ 334 more rows


  2. change beak length from in to mm
penguins_tidy |>
  mutate(body_mass_g = bm_lb * 453.6) |>
  mutate(beak_length_mm = bl_in * 25.4)

within a single mutate() function

penguins_tidy |>
  mutate(
    body_mass_g = bm_lb * 453.6,
    beak_length_mm = bl_in * 25.4
  )


penguins_tidy |>
  mutate(
    body_mass_g = bm_lb * 453.6,
    beak_length_mm = bl_in * 25.4,
    beak_depth_mm = bd_mm,
    flipper_length_mm = fl_cm * 10
  )
# A tibble: 344 × 13
   penguin_id species sex    island_id  year bm_lb fl_cm bl_in bd_mm body_mass_g
        <int> <chr>   <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
 1          1 Adelie  male   C          2007  8.27  18.1  1.54  18.7        3750
 2          2 Adelie  female C          2007  8.38  18.6  1.56  17.4        3800
 3          3 Adelie  female C          2007  7.16  19.5  1.59  18          3250
 4          4 Adelie  <NA>   C          2007 NA     NA   NA     NA            NA
 5          5 Adelie  female C          2007  7.61  19.3  1.44  19.3        3450
 6          6 Adelie  male   C          2007  8.05  19    1.55  20.6        3650
 7          7 Adelie  female C          2007  7.99  18.1  1.53  17.8        3625
 8          8 Adelie  male   C          2007 10.3   19.5  1.54  19.6        4675
 9          9 Adelie  <NA>   C          2007  7.66  19.3  1.34  18.1        3475
10         10 Adelie  <NA>   C          2007  9.37  19    1.65  20.2        4250
# ℹ 334 more rows
# ℹ 3 more variables: beak_length_mm <dbl>, beak_depth_mm <dbl>,
#   flipper_length_mm <dbl>


penguins_tidy <-
  penguins_tidy |>
  mutate(
    body_mass_g = bm_lb * 453.6,
    beak_length_mm = bl_in * 25.4,
    beak_depth_mm = bd_mm,
    flipper_length_mm = fl_cm * 10
  ) |>
  select(-bl_in, -bd_mm, -fl_cm, -bm_lb)

# A tibble: 344 × 9
   penguin_id species sex    island_id  year body_mass_g beak_length_mm
        <int> <chr>   <chr>  <chr>     <dbl>       <dbl>          <dbl>
 1          1 Adelie  male   C          2007        3750           39.1
 2          2 Adelie  female C          2007        3800           39.5
 3          3 Adelie  female C          2007        3250           40.3
 4          4 Adelie  <NA>   C          2007          NA           NA  
 5          5 Adelie  female C          2007        3450           36.7
 6          6 Adelie  male   C          2007        3650           39.3
 7          7 Adelie  female C          2007        3625           38.9
 8          8 Adelie  male   C          2007        4675           39.2
 9          9 Adelie  <NA>   C          2007        3475           34.1
10         10 Adelie  <NA>   C          2007        4250           42  
# ℹ 334 more rows
# ℹ 2 more variables: beak_depth_mm <dbl>, flipper_length_mm <dbl>

island_data <- read_csv("palmerpenguins_island_data.csv")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): island, island_id
lgl (2): iba_status, is_cold

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 4
  island    island_id iba_status is_cold
  <chr>     <chr>     <lgl>      <lgl>  
1 Torgersen C         TRUE       TRUE   
2 Biscoe    A         FALSE      TRUE   
3 Dream     B         TRUE       TRUE   


Joins add columns from y to x, matching observations based on a key.

  • A left_join() keeps all observations in x.
  • A right_join() keeps all observations in y.
  • A full_join() keeps all observations in x and y.
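To see how the three joins above differ, here is a sketch with two made-up tables whose keys only partly overlap. The island names follow the island table used in these slides, but the counts and the extra island "D" are invented for illustration. join_by() needs dplyr 1.1.0 or later.

```r
library(dplyr)

# x has an island ("D") that is missing from y; y has one ("C") missing from x.
x <- tibble(island_id = c("A", "B", "D"), n_penguins = c(10, 20, 5))
y <- tibble(island_id = c("A", "B", "C"), island = c("Biscoe", "Dream", "Torgersen"))

left  <- left_join(x, y, by = join_by(island_id))   # keeps A, B, D
right <- right_join(x, y, by = join_by(island_id))  # keeps A, B, C
full  <- full_join(x, y, by = join_by(island_id))   # keeps A, B, C, D

# Unmatched rows are kept but filled with NA.
left
```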


Using a left_join()

left_join(
  x = penguins_tidy,
  y = island_data,
  by = join_by(island_id)
)

can also be written as:

penguins_tidy |>
left_join(island_data, by = join_by(island_id))


# A tibble: 344 × 12
   penguin_id species sex    island_id  year body_mass_g beak_length_mm
        <int> <chr>   <chr>  <chr>     <dbl>       <dbl>          <dbl>
 1          1 Adelie  male   C          2007        3750           39.1
 2          2 Adelie  female C          2007        3800           39.5
 3          3 Adelie  female C          2007        3250           40.3
 4          4 Adelie  <NA>   C          2007          NA           NA  
 5          5 Adelie  female C          2007        3450           36.7
 6          6 Adelie  male   C          2007        3650           39.3
 7          7 Adelie  female C          2007        3625           38.9
 8          8 Adelie  male   C          2007        4675           39.2
 9          9 Adelie  <NA>   C          2007        3475           34.1
10         10 Adelie  <NA>   C          2007        4250           42  
# ℹ 334 more rows
# ℹ 5 more variables: beak_depth_mm <dbl>, flipper_length_mm <dbl>,
#   island <chr>, iba_status <lgl>, is_cold <lgl>


penguins_tidy <-
  penguins_tidy |>
  left_join(island_data, by = join_by(island_id)) |>
  select(-island_id)  # drop the join key once the island columns are attached

glimpse(penguins_tidy)
Rows: 344
Columns: 11
$ penguin_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ beak_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ beak_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ iba_status        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ is_cold           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…


penguins_tidy |>
  filter(year == 2007)
# A tibble: 110 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          4 Adelie  <NA>    2007          NA           NA            NA  
 5          5 Adelie  female  2007        3450           36.7          19.3
 6          6 Adelie  male    2007        3650           39.3          20.6
 7          7 Adelie  female  2007        3625           38.9          17.8
 8          8 Adelie  male    2007        4675           39.2          19.6
 9          9 Adelie  <NA>    2007        3475           34.1          18.1
10         10 Adelie  <NA>    2007        4250           42            20.2
# ℹ 100 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  filter(year != 2007)
# A tibble: 234 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1         51 Adelie  female  2008        3500           39.6          17.7
 2         52 Adelie  male    2008        4300           40.1          18.9
 3         53 Adelie  female  2008        3450           35            17.9
 4         54 Adelie  male    2008        4050           42            19.5
 5         55 Adelie  female  2008        2900           34.5          18.1
 6         56 Adelie  male    2008        3700           41.4          18.6
 7         57 Adelie  female  2008        3550           39            17.5
 8         58 Adelie  male    2008        3800           40.6          18.8
 9         59 Adelie  female  2008        2850           36.5          16.6
10         60 Adelie  male    2008        3750           37.6          19.1
# ℹ 224 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  filter(year == 2007 & species == "Gentoo")
# A tibble: 34 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1        153 Gentoo  female  2007        4500           46.1          13.2
 2        154 Gentoo  male    2007        5700           50            16.3
 3        155 Gentoo  female  2007        4450           48.7          14.1
 4        156 Gentoo  male    2007        5700           50            15.2
 5        157 Gentoo  male    2007        5400           47.6          14.5
 6        158 Gentoo  female  2007        4550           46.5          13.5
 7        159 Gentoo  female  2007        4800           45.4          14.6
 8        160 Gentoo  male    2007        5200           46.7          15.3
 9        161 Gentoo  female  2007        4400           43.3          13.4
10        162 Gentoo  male    2007        5150           46.8          15.4
# ℹ 24 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  filter(beak_length_mm >= 55)
# A tibble: 5 × 11
  penguin_id species   sex     year body_mass_g beak_length_mm beak_depth_mm
       <int> <chr>     <chr>  <dbl>       <dbl>          <dbl>         <dbl>
1        186 Gentoo    male    2007        6050           59.6          17  
2        254 Gentoo    male    2009        5600           55.9          17  
3        268 Gentoo    male    2009        5850           55.1          16  
4        294 Chinstrap female  2007        3700           58            17.8
5        340 Chinstrap male    2009        4000           55.8          19.8
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  filter(!is.na(sex))  # assumed from the output: the 11 rows with missing sex are dropped
# A tibble: 333 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          5 Adelie  female  2007        3450           36.7          19.3
 5          6 Adelie  male    2007        3650           39.3          20.6
 6          7 Adelie  female  2007        3625           38.9          17.8
 7          8 Adelie  male    2007        4675           39.2          19.6
 8         13 Adelie  female  2007        3200           41.1          17.6
 9         14 Adelie  male    2007        3800           38.6          21.2
10         15 Adelie  male    2007        4400           34.6          21.1
# ℹ 323 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>
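The output above has 333 rows rather than 344 because rows with missing values have been filtered out. As a rough sketch of how filter() treats NA (toy data, invented for illustration): a row is kept only when the condition is TRUE, so rows where the condition evaluates to NA are dropped silently, and is.na() is the explicit way to test for missing values.

```r
library(dplyr)

# Toy data invented for this sketch.
toy <- tibble(
  penguin_id  = 1:4,
  sex         = c("male", NA, "female", NA),
  body_mass_g = c(3750, NA, 3250, 4250)
)

# The comparison is NA for penguin 2, so that row is dropped as well.
heavy <- toy |> filter(body_mass_g > 3500)

# Explicitly keep only rows where sex is known.
sexed <- toy |> filter(!is.na(sex))
```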


penguins_tidy |>
  arrange(body_mass_g)
# A tibble: 344 × 11
   penguin_id species   sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>     <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1        315 Chinstrap female  2008        2700           46.9          16.6
 2         59 Adelie    female  2008        2850           36.5          16.6
 3         65 Adelie    female  2008        2850           36.4          17.1
 4         55 Adelie    female  2008        2900           34.5          18.1
 5         99 Adelie    female  2008        2900           33.1          16.1
 6        117 Adelie    female  2009        2900           38.6          17  
 7        299 Chinstrap female  2007        2900           43.2          16.6
 8        105 Adelie    female  2009        2925           37.9          18.6
 9         48 Adelie    <NA>    2007        2975           37.5          18.9
10         45 Adelie    female  2007        3000           37            16.9
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  arrange(desc(body_mass_g))
# A tibble: 344 × 11
   penguin_id species sex    year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr> <dbl>       <dbl>          <dbl>         <dbl>
 1        170 Gentoo  male   2007        6300           49.2          15.2
 2        186 Gentoo  male   2007        6050           59.6          17  
 3        230 Gentoo  male   2008        6000           51.1          16.3
 4        270 Gentoo  male   2009        6000           48.8          16.2
 5        232 Gentoo  male   2008        5950           45.2          16.4
 6        264 Gentoo  male   2009        5950           49.8          15.9
 7        166 Gentoo  male   2007        5850           48.4          14.6
 8        168 Gentoo  male   2007        5850           49.3          15.7
 9        268 Gentoo  male   2009        5850           55.1          16  
10        220 Gentoo  male   2008        5800           49.5          16.2
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  arrange(species, sex)
# A tibble: 344 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          2 Adelie  female  2007        3800           39.5          17.4
 2          3 Adelie  female  2007        3250           40.3          18  
 3          5 Adelie  female  2007        3450           36.7          19.3
 4          7 Adelie  female  2007        3625           38.9          17.8
 5         13 Adelie  female  2007        3200           41.1          17.6
 6         16 Adelie  female  2007        3700           36.6          17.8
 7         17 Adelie  female  2007        3450           38.7          19  
 8         19 Adelie  female  2007        3325           34.4          18.4
 9         21 Adelie  female  2007        3400           37.8          18.3
10         23 Adelie  female  2007        3800           35.9          19.2
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  group_by(species)
# A tibble: 344 × 11
# Groups:   species [3]
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          4 Adelie  <NA>    2007          NA           NA            NA  
 5          5 Adelie  female  2007        3450           36.7          19.3
 6          6 Adelie  male    2007        3650           39.3          20.6
 7          7 Adelie  female  2007        3625           38.9          17.8
 8          8 Adelie  male    2007        4675           39.2          19.6
 9          9 Adelie  <NA>    2007        3475           34.1          18.1
10         10 Adelie  <NA>    2007        4250           42            20.2
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  group_by(species, sex)
# A tibble: 344 × 11
# Groups:   species, sex [8]
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          4 Adelie  <NA>    2007          NA           NA            NA  
 5          5 Adelie  female  2007        3450           36.7          19.3
 6          6 Adelie  male    2007        3650           39.3          20.6
 7          7 Adelie  female  2007        3625           38.9          17.8
 8          8 Adelie  male    2007        4675           39.2          19.6
 9          9 Adelie  <NA>    2007        3475           34.1          18.1
10         10 Adelie  <NA>    2007        4250           42            20.2
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  filter(!is.na(species) & !is.na(sex)) |>
  group_by(species, sex)
# A tibble: 333 × 11
# Groups:   species, sex [6]
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          5 Adelie  female  2007        3450           36.7          19.3
 5          6 Adelie  male    2007        3650           39.3          20.6
 6          7 Adelie  female  2007        3625           38.9          17.8
 7          8 Adelie  male    2007        4675           39.2          19.6
 8         13 Adelie  female  2007        3200           41.1          17.6
 9         14 Adelie  male    2007        3800           38.6          21.2
10         15 Adelie  male    2007        4400           34.6          21.1
# ℹ 323 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>


penguins_tidy |>
  filter(!is.na(species) & !is.na(sex)) |>
  group_by(species, sex) |>
  summarise(mean_body_mass_g = mean(body_mass_g))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    mean_body_mass_g
  <chr>     <chr>             <dbl>
1 Adelie    female            3369.
2 Adelie    male              4043.
3 Chinstrap female            3527.
4 Chinstrap male              3939.
5 Gentoo    female            4680.
6 Gentoo    male              5485.
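The message about grouped output refers to the .groups argument of summarise(). A minimal sketch on made-up data: .groups = "drop" returns an ungrouped tibble and silences the message, na.rm = TRUE is one way to handle missing values without filtering first, and n() counts the rows per group.

```r
library(dplyr)

# Toy data invented for this sketch.
toy <- tibble(
  species     = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex         = c("female", "female", "male", "female", "male", "male"),
  body_mass_g = c(3400, NA, 4000, 4700, 5400, 5600)
)

toy_summary <-
  toy |>
  group_by(species, sex) |>
  summarise(
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),  # ignore missing values
    n = n(),                                             # rows per group
    .groups = "drop"                                     # return an ungrouped tibble
  )

toy_summary
```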


  • grammar of graphics
    • learn once, use everywhere
  • aesthetic values
    • aes()
  • geometric objects
    • geom_*()



  • geom_point()
  • geom_line()
  • geom_bar()
  • geom_histogram()
  • geom_boxplot()
  • etc

mapping & aes()

  • x
  • y
  • fill
  • colour
  • shape
  • size
  • etc

ggplot(data = penguins_tidy)

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species)
  ) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species)
  ) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point(mapping = aes(colour = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point(mapping = aes(colour = species, shape = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species, shape = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(y = "Body mass (g)", x  = "Flipper length (mm)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species, shape = species)) +
  geom_smooth(method = "lm") +
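  labs(y = "Body mass (g)", x  = "Flipper length (mm)") +
  theme_classic()  # assumed: the final layer is missing from the source; theme_classic() is the theme used later in this deck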
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(size = 1))
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(size = 2))
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(size = 3))
Warning: Removed 2 rows containing missing values (`geom_point()`).


  • Showing information: inside aes()
  • Not showing information (just style): outside aes()
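The rule above can be sketched with a tiny made-up data frame: an aesthetic that carries information is mapped inside aes() and gets a legend, while a purely stylistic value is set outside aes() and applies to every point. The data values here are for illustration only.

```r
library(ggplot2)

# Toy data invented for this sketch.
toy <- data.frame(
  beak_length_mm = c(39.1, 46.1, 50.0),
  beak_depth_mm  = c(18.7, 13.2, 16.3),
  species        = c("Adelie", "Gentoo", "Gentoo")
)

# Mapped: colour depends on a column, so it goes inside aes().
p_mapped <-
  ggplot(toy, aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(colour = species))

# Set: size is the same constant for all points, so it goes outside aes().
p_set <-
  ggplot(toy, aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(size = 5)
```

Putting a constant inside aes(), as in aes(size = 3), treats the number as data: ggplot2 creates a one-level size scale and a legend, which is why the point size does not change with the number.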

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(size = 10)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(size = 5)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(colour = species), size = 5)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(colour = species), size = 5, alpha = 0.7)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 4
# Groups:   species [3]
  species   sex    mean_beak_length mean_beak_depth
  <chr>     <chr>             <dbl>           <dbl>
1 Adelie    female             37.3            17.6
2 Adelie    male               40.4            19.1
3 Chinstrap female             46.6            17.6
4 Chinstrap male               51.1            19.3
5 Gentoo    female             45.6            14.2
6 Gentoo    male               49.5            15.7

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  ) |>
  ggplot(aes(x = species, y = mean_beak_length))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  ) |>
  ggplot(aes(x = species, y = mean_beak_length)) +
  geom_col(aes(fill = sex))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  ) |>
  ggplot(aes(x = species, y = mean_beak_length)) +
  geom_col(aes(fill = sex), position = "dodge")
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  ) |>
  ggplot(aes(x = species, y = mean_beak_length)) +
  geom_col(aes(fill = sex), position = "dodge") +
  labs(x = "Penguin species", y = "Mean beak length (mm)") +
  theme_classic() +
  scale_y_continuous(expand = c(0,0)) +
  theme(legend.position = "top")

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.


ggsave(filename = "my_plot.pdf")
  • by default, will save the last plot made
  • can change width, height, dpi, etc
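As a sketch of the options mentioned above, the call below passes the plot explicitly instead of relying on the last plot made, and sets the size in inches. The plot (built from the built-in mtcars data) and the temporary output path are assumptions for illustration only; dpi mainly matters for raster formats such as PNG.

```r
library(ggplot2)

# A throwaway plot from a built-in dataset, just to have something to save.
p <-
  ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Save to a temporary directory so the working directory stays clean.
out_file <- file.path(tempdir(), "my_plot.pdf")

ggsave(
  filename = out_file,
  plot     = p,     # explicit plot rather than the last one displayed
  width    = 6,
  height   = 4,
  units    = "in"
)
```

Naming the plot explicitly makes the script behave the same way regardless of which plot happened to be drawn last.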



  • R version >= 4.1.0 (check with R.version.string)
  • install.packages("tidyverse")
  • use |> where appropriate