Welcome to the {tidyverse}

Open R Sessions 2023

Violeta Caballero López
Laura Hildesheim
Simon Jacobsen Ellerstrand
Iain Moodie
Pedro Rosero

Functions re-cap

  • a piece of code that takes an input, does something, and returns an output.
do_something <- function(input){
  ...
  ...
  return(output)
}

do_something(my_data)
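
A concrete instance of the template above, as a minimal sketch. The name lb_to_g and its argument are made up for illustration; 453.6 is the approximate grams-per-pound factor that these slides use later on.

```r
# Hypothetical helper: convert a mass in pounds (lb) to grams (g)
lb_to_g <- function(mass_lb) {
  mass_g <- mass_lb * 453.6  # approximate conversion factor
  return(mass_g)
}

# R functions are usually vectorised, so a whole column can be passed at once
lb_to_g(c(1, 2))
```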

Stringing together multiple functions

  1. Multiple assignments
did_something <- do_something(data)

did_another_thing <- do_another_thing(did_something)

final_thing <- do_last_thing(did_another_thing)
  2. Nested functions
final_thing <- do_last_thing(do_another_thing(do_something(data)))
  3. Pipes!

Pipes in R

final_thing <-
  data |>
  do_something() |>
  do_another_thing() |>
  do_last_thing()
  • Could be verbalised as “and then”
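  • Base R pipe (R >= 4.1.0) = |>
  • magrittr package pipe = %>%
  • they differ slightly: |> needs an explicit function call on the right-hand side (x |> f(), not x |> f), and its placeholder is _ (R >= 4.2.0, named arguments only) where %>% uses .
  • my advice: use |>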
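
To see that a pipe is just another way of writing a nested call, here is a small sketch using only base R functions. The vector x is toy data chosen for illustration; it needs R >= 4.1.0 for |>.

```r
x <- c(1, 4, 9, 16)

# Nested: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped: reads top to bottom, "and then"
piped <-
  x |>
  sqrt() |>
  mean() |>
  round(1)

# Both spellings compute the same value
identical(nested, piped)
```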

Biology is a Data Science

Data Science

  1. “Wrangling” the dataset
  2. Analysing the data
  3. Reporting the results

50-80% of a data scientist’s time is spent wrangling data

(not fun)

A collection of opinionated R packages designed for data science

What tidyverse is designed for

Install tidyverse like any other R package from CRAN:

install.packages("tidyverse")

Load the package with library():

library(tidyverse)
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

readr

  • read_delim() (readr's counterpart to base R's read.delim())
  • read_csv()
  • read_tsv()
  • write_*()
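
The write_*() and read_*() functions mirror each other. A minimal round-trip sketch, assuming readr is installed; the toy data frame and the temporary file are made up for illustration and are not part of the workshop data.

```r
library(readr)

# Toy data, not the penguin dataset
toy <- data.frame(id = c("a", "b"), value = c(1.5, 2))

# Write to a temporary CSV file, then read it back in
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# show_col_types = FALSE silences the column specification message
toy_back <- read_csv(path, show_col_types = FALSE)

class(toy_back)  # read_csv() returns a tibble, not a plain data.frame
```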

readr::read_csv()

penguins <- read_csv("palmerpenguins_untidy.csv")
Rows: 1376 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): id, measurement
dbl (2): year, value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tibble

The data is stored as a tibble

class(penguins)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
penguins
# A tibble: 1,376 × 4
   id                 year measurement value
   <chr>             <dbl> <chr>       <dbl>
 1 1_Adelie_male_C    2007 bm_lb        8.27
 2 1_Adelie_male_C    2007 fl_cm       18.1 
 3 1_Adelie_male_C    2007 bl_in        1.54
 4 1_Adelie_male_C    2007 bd_mm       18.7 
 5 2_Adelie_female_C  2007 bm_lb        8.38
 6 2_Adelie_female_C  2007 fl_cm       18.6 
 7 2_Adelie_female_C  2007 bl_in        1.56
 8 2_Adelie_female_C  2007 bd_mm       17.4 
 9 3_Adelie_female_C  2007 bm_lb        7.16
10 3_Adelie_female_C  2007 fl_cm       19.5 
# ℹ 1,366 more rows

tidyr

The tidyverse is built around “tidy data”

A dataset is tidy if:

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

tidyr::pivot_*()

penguins |>
  pivot_wider(names_from = measurement, values_from = value)
# A tibble: 344 × 6
   id                 year bm_lb fl_cm bl_in bd_mm
   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl>
 1 1_Adelie_male_C    2007  8.27  18.1  1.54  18.7
 2 2_Adelie_female_C  2007  8.38  18.6  1.56  17.4
 3 3_Adelie_female_C  2007  7.16  19.5  1.59  18  
 4 4_Adelie_NA_C      2007 NA     NA   NA     NA  
 5 5_Adelie_female_C  2007  7.61  19.3  1.44  19.3
 6 6_Adelie_male_C    2007  8.05  19    1.55  20.6
 7 7_Adelie_female_C  2007  7.99  18.1  1.53  17.8
 8 8_Adelie_male_C    2007 10.3   19.5  1.54  19.6
 9 9_Adelie_NA_C      2007  7.66  19.3  1.34  18.1
10 10_Adelie_NA_C     2007  9.37  19    1.65  20.2
# ℹ 334 more rows

tidyr::separate()

penguins_tidy <-
  penguins |>
  pivot_wider(names_from = measurement, values_from = value) |>
  separate(
    col = id, into = c("penguin_id", "species", "sex", "island_id"),
    sep = "_", convert = TRUE
  )

penguins_tidy
# A tibble: 344 × 9
   penguin_id species sex    island_id  year bm_lb fl_cm bl_in bd_mm
        <int> <chr>   <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>
 1          1 Adelie  male   C          2007  8.27  18.1  1.54  18.7
 2          2 Adelie  female C          2007  8.38  18.6  1.56  17.4
 3          3 Adelie  female C          2007  7.16  19.5  1.59  18  
 4          4 Adelie  <NA>   C          2007 NA     NA   NA     NA  
 5          5 Adelie  female C          2007  7.61  19.3  1.44  19.3
 6          6 Adelie  male   C          2007  8.05  19    1.55  20.6
 7          7 Adelie  female C          2007  7.99  18.1  1.53  17.8
 8          8 Adelie  male   C          2007 10.3   19.5  1.54  19.6
 9          9 Adelie  <NA>   C          2007  7.66  19.3  1.34  18.1
10         10 Adelie  <NA>   C          2007  9.37  19    1.65  20.2
# ℹ 334 more rows
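
separate() still works, but since tidyr 1.3.0 it is marked as superseded in favour of the separate_wider_*() family. A minimal sketch of the same split with separate_wider_delim(), run on a two-row toy tibble rather than the real data. There is no convert argument here, so the integer conversion is an explicit extra step, and a text "NA" would presumably stay a string until converted by hand.

```r
library(tidyr)

# Toy data with the same id layout as the penguin file
toy <- tibble(id = c("1_Adelie_male_C", "2_Adelie_female_C"))

toy_split <-
  toy |>
  separate_wider_delim(
    cols = id,
    delim = "_",
    names = c("penguin_id", "species", "sex", "island_id")
  )

# All new columns are character; convert the id by hand
toy_split$penguin_id <- as.integer(toy_split$penguin_id)

toy_split
```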

tidyr

  • pivot_wider() & pivot_longer()
  • separate(), extract() & unite()
  • nest() & unnest()
  • replace_na()
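
A small sketch of three functions from this list that are not shown elsewhere in the slides: pivot_longer() (the reverse of pivot_wider()), replace_na() and unite() (the reverse of separate()). The toy tibble and the "unknown" label are assumptions for illustration.

```r
library(tidyr)

# Toy wide data: one row per penguin
wide <- tibble(
  id = c(1, 2),
  species = c("Adelie", "Gentoo"),
  sex = c("male", NA),
  bm_lb = c(8.27, 11.0),
  fl_cm = c(18.1, 21.5)
)

# pivot_longer(): stack the measurement columns into name/value pairs
long <-
  wide |>
  pivot_longer(
    cols = c(bm_lb, fl_cm),
    names_to = "measurement",
    values_to = "value"
  )

# replace_na() fills missing values; unite() pastes columns together
labelled <-
  wide |>
  replace_na(list(sex = "unknown")) |>
  unite(col = "label", species, sex, sep = "_")

long
labelled
```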

dplyr

provides a grammar of data manipulation

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names
  • filter() picks cases based on their values
  • summarise() reduces multiple values down to a single summary
  • arrange() changes the ordering of the rows

dplyr::mutate()

  1. change body mass from lb into g
penguins_tidy |>
  mutate(body_mass_g = bm_lb * 453.6)
# A tibble: 344 × 10
   penguin_id species sex    island_id  year bm_lb fl_cm bl_in bd_mm body_mass_g
        <int> <chr>   <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
 1          1 Adelie  male   C          2007  8.27  18.1  1.54  18.7        3750
 2          2 Adelie  female C          2007  8.38  18.6  1.56  17.4        3800
 3          3 Adelie  female C          2007  7.16  19.5  1.59  18          3250
 4          4 Adelie  <NA>   C          2007 NA     NA   NA     NA            NA
 5          5 Adelie  female C          2007  7.61  19.3  1.44  19.3        3450
 6          6 Adelie  male   C          2007  8.05  19    1.55  20.6        3650
 7          7 Adelie  female C          2007  7.99  18.1  1.53  17.8        3625
 8          8 Adelie  male   C          2007 10.3   19.5  1.54  19.6        4675
 9          9 Adelie  <NA>   C          2007  7.66  19.3  1.34  18.1        3475
10         10 Adelie  <NA>   C          2007  9.37  19    1.65  20.2        4250
# ℹ 334 more rows

dplyr::mutate()

  2. change beak length from in to mm
penguins_tidy |>
  mutate(body_mass_g = bm_lb * 453.6) |>
  mutate(beak_length_mm = bl_in * 25.4)

or within a single mutate() call:

penguins_tidy |>
  mutate(
    body_mass_g = bm_lb * 453.6,
    beak_length_mm = bl_in * 25.4
  )

dplyr::mutate()

penguins_tidy |>
  mutate(
    body_mass_g = bm_lb * 453.6,
    beak_length_mm = bl_in * 25.4,
    beak_depth_mm = bd_mm,
    flipper_length_mm = fl_cm * 10
  )
# A tibble: 344 × 13
   penguin_id species sex    island_id  year bm_lb fl_cm bl_in bd_mm body_mass_g
        <int> <chr>   <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
 1          1 Adelie  male   C          2007  8.27  18.1  1.54  18.7        3750
 2          2 Adelie  female C          2007  8.38  18.6  1.56  17.4        3800
 3          3 Adelie  female C          2007  7.16  19.5  1.59  18          3250
 4          4 Adelie  <NA>   C          2007 NA     NA   NA     NA            NA
 5          5 Adelie  female C          2007  7.61  19.3  1.44  19.3        3450
 6          6 Adelie  male   C          2007  8.05  19    1.55  20.6        3650
 7          7 Adelie  female C          2007  7.99  18.1  1.53  17.8        3625
 8          8 Adelie  male   C          2007 10.3   19.5  1.54  19.6        4675
 9          9 Adelie  <NA>   C          2007  7.66  19.3  1.34  18.1        3475
10         10 Adelie  <NA>   C          2007  9.37  19    1.65  20.2        4250
# ℹ 334 more rows
# ℹ 3 more variables: beak_length_mm <dbl>, beak_depth_mm <dbl>,
#   flipper_length_mm <dbl>

dplyr::select()

penguins_tidy <-
  penguins_tidy |>
  mutate(
    body_mass_g = bm_lb * 453.6,
    beak_length_mm = bl_in * 25.4,
    beak_depth_mm = bd_mm,
    flipper_length_mm = fl_cm * 10
  ) |>
  select(-bl_in, -bd_mm, -fl_cm, -bm_lb)

penguins_tidy
# A tibble: 344 × 9
   penguin_id species sex    island_id  year body_mass_g beak_length_mm
        <int> <chr>   <chr>  <chr>     <dbl>       <dbl>          <dbl>
 1          1 Adelie  male   C          2007        3750           39.1
 2          2 Adelie  female C          2007        3800           39.5
 3          3 Adelie  female C          2007        3250           40.3
 4          4 Adelie  <NA>   C          2007          NA           NA  
 5          5 Adelie  female C          2007        3450           36.7
 6          6 Adelie  male   C          2007        3650           39.3
 7          7 Adelie  female C          2007        3625           38.9
 8          8 Adelie  male   C          2007        4675           39.2
 9          9 Adelie  <NA>   C          2007        3475           34.1
10         10 Adelie  <NA>   C          2007        4250           42  
# ℹ 334 more rows
# ℹ 2 more variables: beak_depth_mm <dbl>, flipper_length_mm <dbl>
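
When a column only needs a new name (as with bd_mm, which is already in mm), dplyr's rename() is an alternative to copying it with mutate() and dropping the original. A minimal sketch on toy data, using the same column names as above for illustration:

```r
library(dplyr)

toy <- tibble(
  penguin_id = 1:2,
  bm_lb = c(8.27, 8.38),
  bd_mm = c(18.7, 17.4)
)

toy_clean <-
  toy |>
  mutate(body_mass_g = bm_lb * 453.6) |>  # unit conversion needs mutate()
  rename(beak_depth_mm = bd_mm) |>        # new_name = old_name
  select(-bm_lb)                          # drop the converted original

names(toy_clean)
```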

island_data <- read_csv("palmerpenguins_island_data.csv")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): island, island_id
lgl (2): iba_status, is_cold

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
island_data
# A tibble: 3 × 4
  island    island_id iba_status is_cold
  <chr>     <chr>     <lgl>      <lgl>  
1 Torgersen C         TRUE       TRUE   
2 Biscoe    A         FALSE      TRUE   
3 Dream     B         TRUE       TRUE   

dplyr::*_join()

Joins add columns from y to x, matching observations based on a key.

  • A left_join() keeps all observations in x.
  • A right_join() keeps all observations in y.
  • A full_join() keeps all observations in x and y.
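
The difference between the three joins is easiest to see when the keys do not match completely. A minimal sketch with two toy tables, where island "D" exists only in x and island "C" only in y (both tables are made up for illustration):

```r
library(dplyr)

x <- tibble(penguin_id = 1:3, island_id = c("A", "B", "D"))
y <- tibble(
  island_id = c("A", "B", "C"),
  island = c("Biscoe", "Dream", "Torgersen")
)

# Keeps all rows of x; penguin 3 gets NA for island
left <- left_join(x, y, by = join_by(island_id))

# Keeps all rows of y; island C gets NA for penguin_id
right <- right_join(x, y, by = join_by(island_id))

# Keeps everything from both tables
full <- full_join(x, y, by = join_by(island_id))

nrow(left)
nrow(right)
nrow(full)
```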

dplyr::*_join()

Using a left_join()

left_join(
  x = penguins_tidy,
  y = island_data,
  by = join_by(island_id)
)

can also be written as:

penguins_tidy |>
  left_join(island_data, by = join_by(island_id))

dplyr::*_join()

# A tibble: 344 × 12
   penguin_id species sex    island_id  year body_mass_g beak_length_mm
        <int> <chr>   <chr>  <chr>     <dbl>       <dbl>          <dbl>
 1          1 Adelie  male   C          2007        3750           39.1
 2          2 Adelie  female C          2007        3800           39.5
 3          3 Adelie  female C          2007        3250           40.3
 4          4 Adelie  <NA>   C          2007          NA           NA  
 5          5 Adelie  female C          2007        3450           36.7
 6          6 Adelie  male   C          2007        3650           39.3
 7          7 Adelie  female C          2007        3625           38.9
 8          8 Adelie  male   C          2007        4675           39.2
 9          9 Adelie  <NA>   C          2007        3475           34.1
10         10 Adelie  <NA>   C          2007        4250           42  
# ℹ 334 more rows
# ℹ 5 more variables: beak_depth_mm <dbl>, flipper_length_mm <dbl>,
#   island <chr>, iba_status <lgl>, is_cold <lgl>

dplyr::glimpse()

penguins_tidy <-
  penguins_tidy |>
  left_join(island_data, by = join_by(island_id)) |>
  select(-island_id)

glimpse(penguins_tidy)
Rows: 344
Columns: 11
$ penguin_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ beak_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ beak_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ iba_status        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ is_cold           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

dplyr::filter()

penguins_tidy |>
  filter(year == 2007)
# A tibble: 110 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          4 Adelie  <NA>    2007          NA           NA            NA  
 5          5 Adelie  female  2007        3450           36.7          19.3
 6          6 Adelie  male    2007        3650           39.3          20.6
 7          7 Adelie  female  2007        3625           38.9          17.8
 8          8 Adelie  male    2007        4675           39.2          19.6
 9          9 Adelie  <NA>    2007        3475           34.1          18.1
10         10 Adelie  <NA>    2007        4250           42            20.2
# ℹ 100 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::filter()

penguins_tidy |>
  filter(year != 2007)
# A tibble: 234 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1         51 Adelie  female  2008        3500           39.6          17.7
 2         52 Adelie  male    2008        4300           40.1          18.9
 3         53 Adelie  female  2008        3450           35            17.9
 4         54 Adelie  male    2008        4050           42            19.5
 5         55 Adelie  female  2008        2900           34.5          18.1
 6         56 Adelie  male    2008        3700           41.4          18.6
 7         57 Adelie  female  2008        3550           39            17.5
 8         58 Adelie  male    2008        3800           40.6          18.8
 9         59 Adelie  female  2008        2850           36.5          16.6
10         60 Adelie  male    2008        3750           37.6          19.1
# ℹ 224 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::filter()

penguins_tidy |>
  filter(year == 2007 & species == "Gentoo")
# A tibble: 34 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1        153 Gentoo  female  2007        4500           46.1          13.2
 2        154 Gentoo  male    2007        5700           50            16.3
 3        155 Gentoo  female  2007        4450           48.7          14.1
 4        156 Gentoo  male    2007        5700           50            15.2
 5        157 Gentoo  male    2007        5400           47.6          14.5
 6        158 Gentoo  female  2007        4550           46.5          13.5
 7        159 Gentoo  female  2007        4800           45.4          14.6
 8        160 Gentoo  male    2007        5200           46.7          15.3
 9        161 Gentoo  female  2007        4400           43.3          13.4
10        162 Gentoo  male    2007        5150           46.8          15.4
# ℹ 24 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::filter()

penguins_tidy |>
  filter(beak_length_mm >= 55)
# A tibble: 5 × 11
  penguin_id species   sex     year body_mass_g beak_length_mm beak_depth_mm
       <int> <chr>     <chr>  <dbl>       <dbl>          <dbl>         <dbl>
1        186 Gentoo    male    2007        6050           59.6          17  
2        254 Gentoo    male    2009        5600           55.9          17  
3        268 Gentoo    male    2009        5850           55.1          16  
4        294 Chinstrap female  2007        3700           58            17.8
5        340 Chinstrap male    2009        4000           55.8          19.8
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::filter()

penguins_tidy |>
  filter(!is.na(sex))
# A tibble: 333 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          5 Adelie  female  2007        3450           36.7          19.3
 5          6 Adelie  male    2007        3650           39.3          20.6
 6          7 Adelie  female  2007        3625           38.9          17.8
 7          8 Adelie  male    2007        4675           39.2          19.6
 8         13 Adelie  female  2007        3200           41.1          17.6
 9         14 Adelie  male    2007        3800           38.6          21.2
10         15 Adelie  male    2007        4400           34.6          21.1
# ℹ 323 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>
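
A few more ways to combine conditions in filter(), sketched on a four-row toy tibble (made-up rows, not the real data). Conditions separated by commas are combined with "and", | means "or", %in% tests membership, and rows where the condition evaluates to NA are dropped.

```r
library(dplyr)

toy <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap", "Gentoo"),
  year = c(2007, 2007, 2008, 2009),
  sex = c("male", NA, "female", "male")
)

# A comma between conditions works like &
gentoo_2007 <- toy |> filter(year == 2007, species == "Gentoo")

# %in% matches any of several values
not_gentoo <- toy |> filter(species %in% c("Adelie", "Chinstrap"))

# | keeps rows where at least one condition is TRUE
either <- toy |> filter(year == 2009 | sex == "female")

# The row with sex = NA is dropped, because NA == "male" is NA, not TRUE
males <- toy |> filter(sex == "male")
```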

dplyr::arrange()

penguins_tidy |>
  arrange(body_mass_g)
# A tibble: 344 × 11
   penguin_id species   sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>     <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1        315 Chinstrap female  2008        2700           46.9          16.6
 2         59 Adelie    female  2008        2850           36.5          16.6
 3         65 Adelie    female  2008        2850           36.4          17.1
 4         55 Adelie    female  2008        2900           34.5          18.1
 5         99 Adelie    female  2008        2900           33.1          16.1
 6        117 Adelie    female  2009        2900           38.6          17  
 7        299 Chinstrap female  2007        2900           43.2          16.6
 8        105 Adelie    female  2009        2925           37.9          18.6
 9         48 Adelie    <NA>    2007        2975           37.5          18.9
10         45 Adelie    female  2007        3000           37            16.9
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::arrange()

penguins_tidy |>
  arrange(desc(body_mass_g))
# A tibble: 344 × 11
   penguin_id species sex    year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr> <dbl>       <dbl>          <dbl>         <dbl>
 1        170 Gentoo  male   2007        6300           49.2          15.2
 2        186 Gentoo  male   2007        6050           59.6          17  
 3        230 Gentoo  male   2008        6000           51.1          16.3
 4        270 Gentoo  male   2009        6000           48.8          16.2
 5        232 Gentoo  male   2008        5950           45.2          16.4
 6        264 Gentoo  male   2009        5950           49.8          15.9
 7        166 Gentoo  male   2007        5850           48.4          14.6
 8        168 Gentoo  male   2007        5850           49.3          15.7
 9        268 Gentoo  male   2009        5850           55.1          16  
10        220 Gentoo  male   2008        5800           49.5          16.2
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::arrange()

penguins_tidy |>
  arrange(species, sex)
# A tibble: 344 × 11
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          2 Adelie  female  2007        3800           39.5          17.4
 2          3 Adelie  female  2007        3250           40.3          18  
 3          5 Adelie  female  2007        3450           36.7          19.3
 4          7 Adelie  female  2007        3625           38.9          17.8
 5         13 Adelie  female  2007        3200           41.1          17.6
 6         16 Adelie  female  2007        3700           36.6          17.8
 7         17 Adelie  female  2007        3450           38.7          19  
 8         19 Adelie  female  2007        3325           34.4          18.4
 9         21 Adelie  female  2007        3400           37.8          18.3
10         23 Adelie  female  2007        3800           35.9          19.2
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::group_by()

penguins_tidy |>
  group_by(species)
# A tibble: 344 × 11
# Groups:   species [3]
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          4 Adelie  <NA>    2007          NA           NA            NA  
 5          5 Adelie  female  2007        3450           36.7          19.3
 6          6 Adelie  male    2007        3650           39.3          20.6
 7          7 Adelie  female  2007        3625           38.9          17.8
 8          8 Adelie  male    2007        4675           39.2          19.6
 9          9 Adelie  <NA>    2007        3475           34.1          18.1
10         10 Adelie  <NA>    2007        4250           42            20.2
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::group_by()

penguins_tidy |>
  group_by(species, sex)
# A tibble: 344 × 11
# Groups:   species, sex [8]
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          4 Adelie  <NA>    2007          NA           NA            NA  
 5          5 Adelie  female  2007        3450           36.7          19.3
 6          6 Adelie  male    2007        3650           39.3          20.6
 7          7 Adelie  female  2007        3625           38.9          17.8
 8          8 Adelie  male    2007        4675           39.2          19.6
 9          9 Adelie  <NA>    2007        3475           34.1          18.1
10         10 Adelie  <NA>    2007        4250           42            20.2
# ℹ 334 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::group_by()

penguins_tidy |>
  filter(!is.na(species) & !is.na(sex)) |>
  group_by(species, sex)
# A tibble: 333 × 11
# Groups:   species, sex [6]
   penguin_id species sex     year body_mass_g beak_length_mm beak_depth_mm
        <int> <chr>   <chr>  <dbl>       <dbl>          <dbl>         <dbl>
 1          1 Adelie  male    2007        3750           39.1          18.7
 2          2 Adelie  female  2007        3800           39.5          17.4
 3          3 Adelie  female  2007        3250           40.3          18  
 4          5 Adelie  female  2007        3450           36.7          19.3
 5          6 Adelie  male    2007        3650           39.3          20.6
 6          7 Adelie  female  2007        3625           38.9          17.8
 7          8 Adelie  male    2007        4675           39.2          19.6
 8         13 Adelie  female  2007        3200           41.1          17.6
 9         14 Adelie  male    2007        3800           38.6          21.2
10         15 Adelie  male    2007        4400           34.6          21.1
# ℹ 323 more rows
# ℹ 4 more variables: flipper_length_mm <dbl>, island <chr>, iba_status <lgl>,
#   is_cold <lgl>

dplyr::summarise()

penguins_tidy |>
  filter(!is.na(species) & !is.na(sex)) |>
  group_by(species, sex) |>
  summarise(mean_body_mass_g = mean(body_mass_g))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    mean_body_mass_g
  <chr>     <chr>             <dbl>
1 Adelie    female            3369.
2 Adelie    male              4043.
3 Chinstrap female            3527.
4 Chinstrap male              3939.
5 Gentoo    female            4680.
6 Gentoo    male              5485.
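
The message about grouped output can be avoided by stating what should happen to the groups. A minimal sketch on toy data (made-up values) that uses .groups = "drop" to return an ungrouped tibble, n() to count rows per group, and na.rm = TRUE as an alternative to filtering out missing values first:

```r
library(dplyr)

toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex = c("female", "female", "male", "male", "male", "female"),
  body_mass_g = c(3400, NA, 4000, 5500, 5300, 4700)
)

toy_summary <-
  toy |>
  group_by(species, sex) |>
  summarise(
    n = n(),                                         # rows per group
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop"                                 # no message, no leftover grouping
  )

toy_summary
```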

ggplot2

  • grammar of graphics
    • learn once, use everywhere
  • aesthetic values
    • aes()
  • geometric objects
    • geom_*()

ggplot2

geom_*

  • geom_point()
  • geom_line()
  • geom_bar()
  • geom_histogram()
  • geom_boxplot()
  • etc

mapping & aes()

  • x
  • y
  • fill
  • colour
  • shape
  • size
  • etc

ggplot(data = penguins_tidy)

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) 

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species)
  ) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species)
  ) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point(mapping = aes(colour = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins_tidy,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point(mapping = aes(colour = species, shape = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species, shape = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(y = "Body mass (g)", x  = "Flipper length (mm)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(y = "Body mass (g)", x  = "Flipper length (mm)") +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(size = 1))
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(size = 2))
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(size = 3))
Warning: Removed 2 rows containing missing values (`geom_point()`).

aes()

  • Mapping a variable to an aesthetic (showing information): inside aes()
  • Setting a fixed value (just style): outside aes()
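
The rule above, sketched on a three-row toy data frame (made-up values). The colour "steelblue" is an arbitrary choice for illustration. The third plot shows the common mistake: a constant inside aes() is treated as data, so ggplot2 draws a legend with a single level called "steelblue" and picks its own default colour.

```r
library(ggplot2)

toy <- data.frame(
  beak_length_mm = c(39.1, 46.1, 50.0),
  beak_depth_mm = c(18.7, 13.2, 16.3),
  species = c("Adelie", "Gentoo", "Gentoo")
)

base_plot <- ggplot(toy, aes(x = beak_length_mm, y = beak_depth_mm))

# Mapped: colour depends on the data, so it goes inside aes()
p_mapped <- base_plot + geom_point(aes(colour = species))

# Set: one fixed colour and size for every point, outside aes()
p_set <- base_plot + geom_point(colour = "steelblue", size = 3)

# Mistake: the constant is treated as a variable, not as a colour
p_mistake <- base_plot + geom_point(aes(colour = "steelblue"))
```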

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(size = 10)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(size = 5)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(colour = species), size = 5)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  ggplot(aes(x = beak_length_mm, y = beak_depth_mm)) +
  geom_point(aes(colour = species), size = 5, alpha = 0.7)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 4
# Groups:   species [3]
  species   sex    mean_beak_length mean_beak_depth
  <chr>     <chr>             <dbl>           <dbl>
1 Adelie    female             37.3            17.6
2 Adelie    male               40.4            19.1
3 Chinstrap female             46.6            17.6
4 Chinstrap male               51.1            19.3
5 Gentoo    female             45.6            14.2
6 Gentoo    male               49.5            15.7

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  ) |>
  ggplot(aes(x = species, y = mean_beak_length))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  ) |>
  ggplot(aes(x = species, y = mean_beak_length)) +
  geom_col(aes(fill = sex))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  ) |>
  ggplot(aes(x = species, y = mean_beak_length)) +
  geom_col(aes(fill = sex), position = "dodge")
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

penguins_tidy |>
  filter(!is.na(sex) & !is.na(species)) |>
  group_by(species, sex) |>
  summarise(
    mean_beak_length = mean(beak_length_mm),
    mean_beak_depth = mean(beak_depth_mm)
  ) |>
  ggplot(aes(x = species, y = mean_beak_length)) +
  geom_col(aes(fill = sex), position = "dodge") +
  labs(x = "Penguin species", y = "Mean beak length (mm)") +
  theme_classic() +
  scale_y_continuous(expand = c(0,0)) +
  theme(legend.position = "top")

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

ggplot2::ggsave()

ggsave(filename = "my_plot.pdf")
  • by default, will save the last plot made
  • can change width, height, dpi, etc
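
Relying on "the last plot made" can surprise you in longer scripts, so it may be safer to store the plot in an object and pass it explicitly. A minimal sketch, using the built-in mtcars dataset as a stand-in and a temporary folder so nothing is written to the working directory; the sizes are arbitrary.

```r
library(ggplot2)

# Stand-in plot built from a dataset that ships with R
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

out_file <- file.path(tempdir(), "my_plot.pdf")

# The file extension decides the output format
ggsave(
  filename = out_file,
  plot = p,
  width = 12,
  height = 8,
  units = "cm"
)

file.exists(out_file)
```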

Exercises

  • R version >= 4.1.0 (check with R.version.string)
  • install.packages("tidyverse")
  • use |> where appropriate