Last updated: 2020-11-21

Checks: 7 0

Knit directory: r4ds_book/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200814) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 6e7b3db. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/images/
    Untracked:  code_snipp.txt
    Untracked:  data/at_health_facilities.csv
    Untracked:  data/infant_hiv.csv
    Untracked:  data/measurements.csv
    Untracked:  data/person.csv
    Untracked:  data/ranking.csv
    Untracked:  data/visited.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch9_tidy_data.Rmd) and HTML (docs/ch9_tidy_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 7ed0458 sciencificity 2020-11-10 Build site.
html 86457fa sciencificity 2020-11-10 Build site.
html 4879249 sciencificity 2020-11-09 Build site.
html e423967 sciencificity 2020-11-08 Build site.
html 0d223fb sciencificity 2020-11-08 Build site.
html ecd1d8e sciencificity 2020-11-07 Build site.
html 274005c sciencificity 2020-11-06 Build site.
html 60e7ce2 sciencificity 2020-11-02 Build site.
html db5a796 sciencificity 2020-11-01 Build site.
html d8813e9 sciencificity 2020-11-01 Build site.
html bf15f3b sciencificity 2020-11-01 Build site.
html 0aef1b0 sciencificity 2020-10-31 Build site.
html bdc0881 sciencificity 2020-10-26 Build site.
html 8224544 sciencificity 2020-10-26 Build site.
html 2f8dcc0 sciencificity 2020-10-25 Build site.
Rmd 648d291 sciencificity 2020-10-25 completed Ch9
html 61e2324 sciencificity 2020-10-25 Build site.
Rmd 57f23a8 sciencificity 2020-10-25 added Ch9

options(scipen=10000)
library(tidyverse)
library(flair)
library(emo)
library(lubridate)
library(magrittr)
library(tidyquant)
theme_set(theme_tq())

What’s tidy data anyway?

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

In all the examples tidyr::table to tidyr::table4b, only tidyr::table1 is tidy.

(
  # practising the read_csv function to create table1
  # just note however that table1 is in tidyr ;)
  # tidyr::table1 etc. 
  # In all honesty, I only figured this out after "practising" :P
  table1 <- read_csv("country, year, cases, population
                     Afghanistan, 1999, 745, 19987071
                     Afghanistan, 2000, 2666, 20595360
                     Brazil, 1999, 37737, 172006362
                     Brazil, 2000, 80488, 174504898
                     China, 1999, 212258, 1272915272
                     China, 2000, 213766, 1280428583")
)
# A tibble: 6 x 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Working with tidy data

table1 %>% 
  mutate(rate = cases/population * 10000)
# A tibble: 6 x 5
  country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67 
table1 %>% 
  count(year, wt=cases) # same as group_by and sum
# A tibble: 2 x 2
   year      n
  <dbl>  <dbl>
1  1999 250740
2  2000 296920
table1 %>% 
  group_by(year) %>% 
  summarise(sum(cases))
# A tibble: 2 x 2
   year `sum(cases)`
  <dbl>        <dbl>
1  1999       250740
2  2000       296920
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country)) +
  scale_colour_tq()

Exercises

  1. Using prose, describe how the variables and observations are organised in each of the sample tables.

    tidyr::table1
    # A tibble: 6 x 4
      country      year  cases population
      <chr>       <int>  <int>      <int>
    1 Afghanistan  1999    745   19987071
    2 Afghanistan  2000   2666   20595360
    3 Brazil       1999  37737  172006362
    4 Brazil       2000  80488  174504898
    5 China        1999 212258 1272915272
    6 China        2000 213766 1280428583
    • Each column is a variable ✅
    • Each observation is in a row ✅
    • Each value is in a cell ✅
    • The table describes the number of cases, and the population count (each in its own column) for each country and year combination.


    tidyr::table2
    # A tibble: 12 x 4
       country      year type            count
       <chr>       <int> <chr>           <int>
     1 Afghanistan  1999 cases             745
     2 Afghanistan  1999 population   19987071
     3 Afghanistan  2000 cases            2666
     4 Afghanistan  2000 population   20595360
     5 Brazil       1999 cases           37737
     6 Brazil       1999 population  172006362
     7 Brazil       2000 cases           80488
     8 Brazil       2000 population  174504898
     9 China        1999 cases          212258
    10 China        1999 population 1272915272
    11 China        2000 cases          213766
    12 China        2000 population 1280428583
    • Each column is a variable ❌
    • Each observation is in a row ✅
    • Each value is in a cell ✅
    • Each row contains either the number of cases, or the population count for each country and year combination.


    tidyr::table3
    # A tibble: 6 x 3
      country      year rate             
    * <chr>       <int> <chr>            
    1 Afghanistan  1999 745/19987071     
    2 Afghanistan  2000 2666/20595360    
    3 Brazil       1999 37737/172006362  
    4 Brazil       2000 80488/174504898  
    5 China        1999 212258/1272915272
    6 China        2000 213766/1280428583
    • Each column is a variable ❌
    • Each observation is in a row ✅
    • Each value is in a cell ❌
    • Each row considers the country and year combination, and shows the number of cases and the population count (separated by a /) in one variable named rate.


    tidyr::table4a
    # A tibble: 3 x 3
      country     `1999` `2000`
    * <chr>        <int>  <int>
    1 Afghanistan    745   2666
    2 Brazil       37737  80488
    3 China       212258 213766
    • Each column is a variable ✅
    • Each observation is in a row ❌
    • Each value is in a cell ✅


    tidyr::table4b
    # A tibble: 3 x 3
      country         `1999`     `2000`
    * <chr>            <int>      <int>
    1 Afghanistan   19987071   20595360
    2 Brazil       172006362  174504898
    3 China       1272915272 1280428583
    • Each column is a variable ✅
    • Each observation is in a row ❌
    • Each value is in a cell ✅
    • Each table houses either the number of cases, or the population count for each country, in separated columns for each year.


    tidyr::table5
    # A tibble: 6 x 4
      country     century year  rate             
    * <chr>       <chr>   <chr> <chr>            
    1 Afghanistan 19      99    745/19987071     
    2 Afghanistan 20      00    2666/20595360    
    3 Brazil      19      99    37737/172006362  
    4 Brazil      20      00    80488/174504898  
    5 China       19      99    212258/1272915272
    6 China       20      00    213766/1280428583
    • Each column is a variable ❌
    • Each observation is in a row ✅
    • Each value is in a cell ❌
    • The table considers each year separated into century and year for each country and then similar to table3 it combines the cases and population count in one variable rate (separated within the column by /)


  2. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

    1. Extract the number of TB cases per country per year.
    2. Extract the matching population per country per year.
    3. Divide cases by population, and multiply by 10000.
    4. Store back in the appropriate place.

    Which representation is easiest to work with? Which is hardest? Why?

    (
      tbl1 <- tidyr::table2 %>% 
      filter(type == "cases") %>% 
      group_by(country, year) %>% 
      mutate(cases = count) %>% 
      ungroup() %>% 
      select(country, year, cases) %>% 
      arrange(country, year)
    )
    # A tibble: 6 x 3
      country      year  cases
      <chr>       <int>  <int>
    1 Afghanistan  1999    745
    2 Afghanistan  2000   2666
    3 Brazil       1999  37737
    4 Brazil       2000  80488
    5 China        1999 212258
    6 China        2000 213766
    (
      tbl2 <- tidyr::table2 %>% 
      filter(type == "population") %>% 
      group_by(country, year) %>% 
      mutate(population = count) %>% 
      ungroup() %>% 
      select(country_temp = country, 
             year_temp = year, 
             population) %>% 
      arrange(country_temp, year_temp)
    )
    # A tibble: 6 x 3
      country_temp year_temp population
      <chr>            <int>      <int>
    1 Afghanistan       1999   19987071
    2 Afghanistan       2000   20595360
    3 Brazil            1999  172006362
    4 Brazil            2000  174504898
    5 China             1999 1272915272
    6 China             2000 1280428583
    (
    tbl3 <- tbl1 %>% 
      bind_cols(tbl2) %>% 
      select(c(1:3,6)) %>% 
      mutate(rate = (cases / population) * 10000) %>% 
      arrange(country, year) %>% 
      select(country, year, rate) %>% 
      mutate(type = "rate",
             count = rate) %>% 
      select(c(1,2,4,5))
    )
    # A tibble: 6 x 4
      country      year type  count
      <chr>       <int> <chr> <dbl>
    1 Afghanistan  1999 rate  0.373
    2 Afghanistan  2000 rate  1.29 
    3 Brazil       1999 rate  2.19 
    4 Brazil       2000 rate  4.61 
    5 China        1999 rate  1.67 
    6 China        2000 rate  1.67 
    tidyr::table2 %>% 
      bind_rows(tbl3) %>% 
      mutate(count = round(count, 2)) %>% 
      arrange(country, year, type) %>% 
      gt::gt()
    country year type count
    Afghanistan 1999 cases 745.00
    Afghanistan 1999 population 19987071.00
    Afghanistan 1999 rate 0.37
    Afghanistan 2000 cases 2666.00
    Afghanistan 2000 population 20595360.00
    Afghanistan 2000 rate 1.29
    Brazil 1999 cases 37737.00
    Brazil 1999 population 172006362.00
    Brazil 1999 rate 2.19
    Brazil 2000 cases 80488.00
    Brazil 2000 population 174504898.00
    Brazil 2000 rate 4.61
    China 1999 cases 212258.00
    China 1999 population 1272915272.00
    China 1999 rate 1.67
    China 2000 cases 213766.00
    China 2000 population 1280428583.00
    China 2000 rate 1.67



    (
      tbl1_cases <- tidyr::table4a %>% 
      select(country, `1999`) %>% 
      mutate(year = 1999,
             cases  = `1999`) %>% 
      select(country, year, cases)
    )
    # A tibble: 3 x 3
      country      year  cases
      <chr>       <dbl>  <int>
    1 Afghanistan  1999    745
    2 Brazil       1999  37737
    3 China        1999 212258
    (
      tbl2_cases <- tidyr::table4a %>% 
      select(country, "2000") %>% 
      mutate(year = 2000, 
             cases = `2000`) %>% 
      select(country, year, cases)
    )
    # A tibble: 3 x 3
      country      year  cases
      <chr>       <dbl>  <int>
    1 Afghanistan  2000   2666
    2 Brazil       2000  80488
    3 China        2000 213766
    (
      tbl_cases <- tbl1_cases %>% 
      bind_rows(tbl2_cases) %>% 
      arrange(country, year)
    )
    # A tibble: 6 x 3
      country      year  cases
      <chr>       <dbl>  <int>
    1 Afghanistan  1999    745
    2 Afghanistan  2000   2666
    3 Brazil       1999  37737
    4 Brazil       2000  80488
    5 China        1999 212258
    6 China        2000 213766
    (
      tbl1_pop <- tidyr::table4b %>% 
      select(country, `1999`) %>% 
      mutate(year = 1999,
             population  = `1999`) %>% 
      select(country, year, population)
    )
    # A tibble: 3 x 3
      country      year population
      <chr>       <dbl>      <int>
    1 Afghanistan  1999   19987071
    2 Brazil       1999  172006362
    3 China        1999 1272915272
    (
      tbl2_pop <- tidyr::table4b %>% 
      select(country, "2000") %>% 
      mutate(year = 2000, 
             population = `2000`) %>% 
      select(country, year, population)
    )
    # A tibble: 3 x 3
      country      year population
      <chr>       <dbl>      <int>
    1 Afghanistan  2000   20595360
    2 Brazil       2000  174504898
    3 China        2000 1280428583
    (
      tbl_pop <- tbl1_pop %>% 
      bind_rows(tbl2_pop) %>% 
      arrange(country, year)
    )
    # A tibble: 6 x 3
      country      year population
      <chr>       <dbl>      <int>
    1 Afghanistan  1999   19987071
    2 Afghanistan  2000   20595360
    3 Brazil       1999  172006362
    4 Brazil       2000  174504898
    5 China        1999 1272915272
    6 China        2000 1280428583
    (
      tbl_rate <- tbl_cases %>% 
        bind_cols(tbl_pop) %>% 
        janitor::clean_names() %>% 
        select(country = country_1,
               year = year_2,
               cases, population) %>% 
        mutate(rate = cases / population * 10000)
    )
    # A tibble: 6 x 5
      country      year  cases population  rate
      <chr>       <dbl>  <int>      <int> <dbl>
    1 Afghanistan  1999    745   19987071 0.373
    2 Afghanistan  2000   2666   20595360 1.29 
    3 Brazil       1999  37737  172006362 2.19 
    4 Brazil       2000  80488  174504898 4.61 
    5 China        1999 212258 1272915272 1.67 
    6 China        2000 213766 1280428583 1.67 
    (
      tbl_1999 <- tbl_rate %>% 
        select(country, year, rate) %>% 
        filter(year == 1999) %>% 
        mutate(`1999` = rate) %>% 
        select(country, `1999`)
    )
    # A tibble: 3 x 2
      country     `1999`
      <chr>        <dbl>
    1 Afghanistan  0.373
    2 Brazil       2.19 
    3 China        1.67 
    (
      tbl_2000 <- tbl_rate %>% 
        select(country, year, rate) %>% 
        filter(year == 2000) %>% 
        mutate(`2000` = rate) %>% 
        select(country_temp = country, `2000`)
    )
    # A tibble: 3 x 2
      country_temp `2000`
      <chr>         <dbl>
    1 Afghanistan    1.29
    2 Brazil         4.61
    3 China          1.67
    (
      tbl_4c <- 
        tbl_1999 %>% 
        bind_cols(tbl_2000) %>% 
        select(country, `1999`, `2000`)
    )
    # A tibble: 3 x 3
      country     `1999` `2000`
      <chr>        <dbl>  <dbl>
    1 Afghanistan  0.373   1.29
    2 Brazil       2.19    4.61
    3 China        1.67    1.67
  3. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?

    tidyr::table1
    # A tibble: 6 x 4
      country      year  cases population
      <chr>       <int>  <int>      <int>
    1 Afghanistan  1999    745   19987071
    2 Afghanistan  2000   2666   20595360
    3 Brazil       1999  37737  172006362
    4 Brazil       2000  80488  174504898
    5 China        1999 212258 1272915272
    6 China        2000 213766 1280428583
    ggplot(table1, aes(year, cases)) +
      geom_line(aes(group = country), colour = "grey50") +
      geom_point(aes(colour = country)) +
      scale_colour_tq()

    tidyr::table2
    # A tibble: 12 x 4
       country      year type            count
       <chr>       <int> <chr>           <int>
     1 Afghanistan  1999 cases             745
     2 Afghanistan  1999 population   19987071
     3 Afghanistan  2000 cases            2666
     4 Afghanistan  2000 population   20595360
     5 Brazil       1999 cases           37737
     6 Brazil       1999 population  172006362
     7 Brazil       2000 cases           80488
     8 Brazil       2000 population  174504898
     9 China        1999 cases          212258
    10 China        1999 population 1272915272
    11 China        2000 cases          213766
    12 China        2000 population 1280428583
    table2 %>% 
      filter(type == "cases") %>% 
      ggplot(aes(year, count)) +
      geom_line(aes(group = country), colour = "grey50") +
      geom_point(aes(colour = country)) +
      scale_colour_tq()

Sometimes you will have to resolve one of two common problems:

  1. One variable might be spread across multiple columns.

  2. One observation might be scattered across multiple rows.

Pivot Longer / Gather

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

table4a

# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4a %>% 
  # gather(list out columns you want to gather like dplyr::select() style,
  #        key = what do you want to call the column
  #              these column names go into,
  #        value = the values of the columns will go here)
  gather(`1999`, `2000`,
         key = "year",
         value = "cases" )

# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766
(tidy_4a <- table4a %>% 
  # cols = list the columns you want to pivot
  # names_to = what will you call the new column these
  #            column names go into
  # values_to = the values in the columns will go here
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases"))

# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766
table4b

# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583
table4b %>% 
  gather(`1999`, `2000`,
         key = "year",
         value = "population")

# A tibble: 6 x 3
  country     year  population
  <chr>       <chr>      <int>
1 Afghanistan 1999    19987071
2 Brazil      1999   172006362
3 China       1999  1272915272
4 Afghanistan 2000    20595360
5 Brazil      2000   174504898
6 China       2000  1280428583
(tidy_4b <- table4b %>% 
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "population"))

# A tibble: 6 x 3
  country     year  population
  <chr>       <chr>      <int>
1 Afghanistan 1999    19987071
2 Afghanistan 2000    20595360
3 Brazil      1999   172006362
4 Brazil      2000   174504898
5 China       1999  1272915272
6 China       2000  1280428583
left_join(tidy_4a, tidy_4b) %>% 
  arrange(country, year)

# A tibble: 6 x 4
  country     year   cases population
  <chr>       <chr>  <int>      <int>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583

Pivot Wider / Spread

pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows.

table2

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
table2 %>% 
  # key = column with the variable name, here `type`
  spread(key = type,
  # value = column with the value that will be assigned
  # to new columns
         value = count)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
table2 %>% 
  pivot_wider(names_from = type,
              values_from = count)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Exercises

  1. Why are pivot_longer() and pivot_wider() not perfectly symmetrical?
    Carefully consider the following example:

    (stocks <- tibble(
      year   = c(2015, 2015, 2016, 2016),
      half  = c(   1,    2,     1,    2),
      return = c(1.88, 0.59, 0.92, 0.17)
    ))
    # A tibble: 4 x 3
       year  half return
      <dbl> <dbl>  <dbl>
    1  2015     1   1.88
    2  2015     2   0.59
    3  2016     1   0.92
    4  2016     2   0.17
    stocks %>% 
      pivot_wider(names_from = year, values_from = return) %>% 
      pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
    # A tibble: 4 x 3
       half year  return
      <dbl> <chr>  <dbl>
    1     1 2015    1.88
    2     1 2016    0.92
    3     2 2015    0.59
    4     2 2016    0.17

    (Hint: look at the variable types and think about column names.)

    pivot_longer() has a names_ptypes argument, e.g.  names_ptypes = list(year = double()). What does it do?

    # vignette("pivot")
    stocks %>% 
        pivot_wider(names_from = year, values_from = return)
    # A tibble: 2 x 3
       half `2015` `2016`
      <dbl>  <dbl>  <dbl>
    1     1   1.88   0.92
    2     2   0.59   0.17

    Let’s have a look at the first part - here we take the year and make it a variable. That means that 2015 and 2016 become variables (new columns) in our new tibble, and the return gets pulled into the appropriate column (2015/2016) against the appropriate half. By nature of this move we changed year which was a double into two new column names which are 2015 and 2016 and hence “character”.

    (stocks_ <- stocks %>% 
      pivot_wider(names_from = year, values_from = return) %>% 
      pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return"))
    # A tibble: 4 x 3
       half year  return
      <dbl> <chr>  <dbl>
    1     1 2015    1.88
    2     1 2016    0.92
    3     2 2015    0.59
    4     2 2016    0.17
    colnames(stocks)
    [1] "year"   "half"   "return"
    colnames(stocks_)
    [1] "half"   "year"   "return"

    So following on that we take these new columns and then collapse them into a column year again. But now we have changed the type given we made them columns in the pivot_wider() step. So they keep their “character” nature when they are made longer again. Final result is year started off double (when we created it) but ends up character (after the pivot_wider and pivot_longer steps).

    Th columns also get rearranged since the pivot_wider spreads the year column into 2015 and 2016 which come after half in that initial step. When we subsequently pivot_longer half remains as the first column, followed by the names_to = column (year in this case), and finally the values_to = column (return in this case).

    Q: pivot_longer() has a names_ptypes argument, e.g.  names_ptypes = list(year = double()). What does it do?

    Okay so upon reading the help page and the info I expected that this function would convert my character column year created after the pivot_wider() step into a double, but instead it throws an error. 😕

    stocks %>% 
      pivot_wider(names_from = year, values_from = return) %>% 
      pivot_longer(`2015`:`2016`, 
                   names_to = "year", 
                   names_ptypes = list(year = double()),
                   values_to = "return"
                   )
    Error: Can't convert <character> to <double>.

    We use this to confirm that the columns we create are of the type / class we expect - so here it provides a check it seems 🤷.

    To transform the column from character to double you would need to use the names_transform.

    (stocks_ptypes <- stocks %>% 
      pivot_wider(names_from = year, values_from = return) %>% 
      pivot_longer(`2015`:`2016`, 
                   names_to = "year", 
                   names_transform = list(year = as.double),
                   values_to = "return",
                   # is the value column of the type expected
                   values_ptypes = list(return = double())
                   ))
    # A tibble: 4 x 3
       half  year return
      <dbl> <dbl>  <dbl>
    1     1  2015   1.88
    2     1  2016   0.92
    3     2  2015   0.59
    4     2  2016   0.17

    Strangely though I would expect that if I transform a column from x to y (using names_transform), and then use names_ptypes to check if my name column is indeed now of type y that would be fine? It still throws an error, so my thinking is flawed here.

    stocks %>% 
      pivot_wider(names_from = year, values_from = return) %>% 
      pivot_longer(`2015`:`2016`, 
                   names_to = "year", 
                   names_transform = list(year = as.double),
                   names_ptypes = list(year = double()),
                   values_to = "return",
                   # is the value column of the type expected
                   values_ptypes = list(return = double())
                   )
    Error: Can't convert <character> to <double>.
  2. Why does this code fail?

    table4a %>% 
      pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
    Error: Can't subset columns that don't exist.
    x Locations 1999 and 2000 don't exist.
    i There are only 3 columns.
      # Error: Can't subset columns that don't exist.
      # x Locations 1999 and 2000 don't exist.
      # i There are only 3 columns.
    # fixing it
    table4a %>% 
      pivot_longer(c("1999", `2000`), names_to = "year", values_to = "cases")
    # A tibble: 6 x 3
      country     year   cases
      <chr>       <chr>  <int>
    1 Afghanistan 1999     745
    2 Afghanistan 2000    2666
    3 Brazil      1999   37737
    4 Brazil      2000   80488
    5 China       1999  212258
    6 China       2000  213766

    The 1999 and 2000 are non-syntactically named columns. These have to be surrounded by backticks (``) or quotations "". Here tidyr is trying to read columns numbered 1999, and 2000 which don’t exist.

  3. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?

    people <- tribble(
      ~name,             ~names,  ~values,
      #-----------------|--------|------
      "Phillip Woods",   "age",       45,
      "Phillip Woods",   "height",   186,
      "Phillip Woods",   "age",       50,
      "Jessica Cordero", "age",       37,
      "Jessica Cordero", "height",   156
    )

    You get a warning and it has a list for each variable age and height since Philips Woods has two ages which are different.

    people %>% 
      pivot_wider(names_from = names,
                  values_from = "values")
    # A tibble: 2 x 3
      name            age       height   
      <chr>           <list>    <list>   
    1 Phillip Woods   <dbl [2]> <dbl [1]>
    2 Jessica Cordero <dbl [1]> <dbl [1]>
    people2 <- tribble(
      ~name,             ~names,  ~values,
      #-----------------|--------|------
      "Phillip Woods",   "age",       45,
      "Phillip Woods",   "height",   186,
      "Phillip Woods",   "age2",      50, # second age gets diff col name
      "Jessica Cordero", "age",       37,
      "Jessica Cordero", "height",   156
    )
    people2 %>% 
      pivot_wider(names_from = names,
                  values_from = "values")
    # A tibble: 2 x 4
      name              age height  age2
      <chr>           <dbl>  <dbl> <dbl>
    1 Phillip Woods      45    186    50
    2 Jessica Cordero    37    156    NA
  4. Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?

    (preg <- tribble(
      ~pregnant, ~male, ~female,
      "yes",     NA,    10,
      "no",      20,    12
    ))
    # A tibble: 2 x 3
      pregnant  male female
      <chr>    <dbl>  <dbl>
    1 yes         NA     10
    2 no          20     12

    We need to make it longer. The variable’s are pregnant (yes or no), and the number of male(s)/female(s) in each outcome of pregnant.

    preg %>% 
      pivot_longer(c('male', 'female'), 
                   names_to = 'sex',
                   values_to = 'count')
    # A tibble: 4 x 3
      pregnant sex    count
      <chr>    <chr>  <dbl>
    1 yes      male      NA
    2 yes      female    10
    3 no       male      20
    4 no       female    12

Separating and Uniting

Separate

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears, and by default, separate() will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter e.g. ,, ;, etc.).

In tidyr::table3 we have a column rate that has a combination of the cases and population in one column separated by a /.

table3
# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

No sep specified

table3 %>% 
  # separate without specifying the separator
  separate(rate, into = c("cases", "population"))

# A tibble: 6 x 4
  country      year cases  population
  <chr>       <int> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

sep specified as single value

sep is a regular expression, so more complicated sequences than “/” maybe supplied, e.g. sep = '[0-9]' will separate on any digit encountered.

table3 %>% 
  # separate with specifying the separator
  separate(rate, into = c("cases", "population"),
           sep = '/')

# A tibble: 6 x 4
  country      year cases  population
  <chr>       <int> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
as_tibble("This0is1a2test.with3a4deliberate5mistake.") %>%
  separate(value, into = c(glue::glue("col_{c(1:7)}")),
         sep = '[0-9]')

# A tibble: 1 x 7
  col_1 col_2 col_3 col_4     col_5 col_6      col_7   
  <chr> <chr> <chr> <chr>     <chr> <chr>      <chr>   
1 This  is    a     test.with a     deliberate mistake.

sep specified as vector of integers / position

You can also pass a vector of integers to sep. separate() will interpret the integers as positions to split at.

  • Positive values start at 1 and start counting from the left of the string
  • Negative values start at -1 and start counting from the right of the string
table3 %>% 
  # separate with specifying the separator
  separate(year, into = c("century", "year"),
           # here we use a single value
           # use c(2, 5) to split in multiple places
           sep = 2,
           convert = TRUE)

# A tibble: 6 x 4
  country     century  year rate             
  <chr>         <int> <int> <chr>            
1 Afghanistan      19    99 745/19987071     
2 Afghanistan      20     0 2666/20595360    
3 Brazil           19    99 37737/172006362  
4 Brazil           20     0 80488/174504898  
5 China            19    99 212258/1272915272
6 China            20     0 213766/1280428583

convert = TRUE

separate() retains the character type of the original field. Use convert = TRUE to get separate() to make a guess at a better type for the resulting columns.

table3 %>% 
  # separate with specifying the separator
  separate(rate, into = c("cases", "population"),
           sep = '/',
           convert = TRUE)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Unite

unite() is the inverse of separate(): it combines multiple columns into a single column.

The year column was deliberately made untidy by splitting it into century and year in tidyr::table5. We can concat these using unite.

table5 %>% 
  unite(new, century, year)

# A tibble: 6 x 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 19_99 745/19987071     
2 Afghanistan 20_00 2666/20595360    
3 Brazil      19_99 37737/172006362  
4 Brazil      20_00 80488/174504898  
5 China       19_99 212258/1272915272
6 China       20_00 213766/1280428583

If you look at the help page of unite (?unite) you will see that the default separator is ___ (sep = "_"). To override this behaviour we need to specify the sep argument.

table5 %>% 
  unite(new, century, year, sep = '')

# A tibble: 6 x 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

Exercises

  1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
      separate(x, c("one", "two", "three"))
    
    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
      separate(x, c("one", "two", "three"))
    • The default for extra is “warn”.
    • extra warns if during the splitting more columns are found than those specified in into. For example I may have into = c("year_of_birth", "month_of_birth") but upon splitting on - three columns are found. If I don’t specify extra then separate function will warn me there are more columns found, but these will be missing from my result.
    • extra can be warn (default), drop (drop any extra values without a warning) or merge (only splits at most length(into) times).
    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
          separate(x, c("one", "two", "three"))
    
        Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
    
    
        # A tibble: 3 x 3
          one   two   three
          <chr> <chr> <chr>
        1 a     b     c    
        2 d     e     f    
        3 h     i     j    
    
    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
          separate(x, c("one", "two", "three"),
                   extra = 'drop')
    
        # A tibble: 3 x 3
          one   two   three
          <chr> <chr> <chr>
        1 a     b     c    
        2 d     e     f    
        3 h     i     j    
    
    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
          separate(x, c("one", "two", "three"),
                   extra = 'merge')
    
        # A tibble: 3 x 3
          one   two   three
          <chr> <chr> <chr>
        1 a     b     c    
        2 d     e     f,g  
        3 h     i     j    
    
    • The default for fill is “warn”.

    • fill warns if during the splitting less columns are found than those specified in into.
      For example I may have into = c("year_of_birth", "month_of_birth", "day_of_birth", "sex") but upon splitting on - only three columns are found. If I don’t specify fill then the separate function will warn me there are less columns than expected and these will be filled with NA.

    • fill can be:

      • warn (default)
      • right (fill with missing values on the right) - i.e. if there are not enough values for the columns in the tibble the right hand side columns must be filled with NA. You’d use this if you expect that more of the end columns are missing.
      • left (fill with missing values on the left) - i.e. if there are not enough values for the columns in the tibble the left hand side columns must be filled with NA. You’d use this if you expect that more of the beginning columns you’re making are likely to be missing.
    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
          separate(x, c("one", "two", "three"))
    
        Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
    
    
        # A tibble: 3 x 3
          one   two   three
          <chr> <chr> <chr>
        1 a     b     c    
        2 d     e     <NA> 
        3 f     g     i    
    
    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
          separate(x, c("one", "two", "three"),
                   fill = 'right')
    
        # A tibble: 3 x 3
          one   two   three
          <chr> <chr> <chr>
        1 a     b     c    
        2 d     e     <NA> 
        3 f     g     i    
    
    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
          separate(x, c("one", "two", "three"),
                   fill = 'left')
    
        # A tibble: 3 x 3
          one   two   three
          <chr> <chr> <chr>
        1 a     b     c    
        2 <NA>  d     e    
        3 f     g     i    
    
  2. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

    • remove is by default TRUE.
    • It means remove the column(s) from which you are separating / uniting.
    • You would set it to FALSE if you’d like to perform the action (separate / unite) but also keep the original column(s).
    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
          separate(x, c("one", "two", "three"))
    
        Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
    
    
        # A tibble: 3 x 3
          one   two   three
          <chr> <chr> <chr>
        1 a     b     c    
        2 d     e     f    
        3 h     i     j    
    
    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
          separate(x, c("one", "two", "three"),
                   remove = FALSE)
    
        Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
    
    
        # A tibble: 3 x 4
          x       one   two   three
          <chr>   <chr> <chr> <chr>
        1 a,b,c   a     b     c    
        2 d,e,f,g d     e     f    
        3 h,i,j   h     i     j    
    

    Notice that initially x does not exist. In the second call we added the remove = FALSE and x is kept.

  3. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

    • separate has by position, by separator, and with groups because it takes a string and splits it into multiple parts; unite on the other hand combines separate columns and there is only one way to do it - specify the columns you want to join, in the order they should be joined, and with the separator that is appropriate.

    • separate does not throw away any part of the string unless it “falls over the edge” so to say. I.e. if we have extra string left over after the splitting, and it has run out of columns to place it in. You may kinda simulate this by saying “Hey, keep only the second column” as this help example shows. While sep is a regex it tells you how to split the string.

      # If you want to split by any non-alphanumeric value (the default):
      df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
      df %>% separate(x, c("A", "B"))
           A    B
      1 <NA> <NA>
      2    a    b
      3    a    d
      4    b    c
      # If you just want the second variable:
      df %>% separate(x, c(NA, "B"))
           B
      1 <NA>
      2    b
      3    d
      4    c
    • extract will give you back only the “groups” you are interested in, and “throws” away the rest, and maybe better to use when you have complex patterns you want to extract. 🤷

    df <- tibble(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
    
    df %>% tidyr::extract(x,
                         # make 2 columns
                         c("A", "B"), 
                         # here's the groups I want - surrounded by ()
                         "([a-d]+)-([a-d]+)")
    # A tibble: 5 x 2
      A     B    
      <chr> <chr>
    1 <NA>  <NA> 
    2 a     b    
    3 a     d    
    4 b     c    
    5 <NA>  <NA> 
    df %>% 
      tidyr::extract(col = x,
                    into = c("A"),
                    regex = "([a-d]+)")
    # A tibble: 5 x 1
      A    
      <chr>
    1 <NA> 
    2 a    
    3 a    
    4 b    
    5 d    
    df <- tibble(x = c(NA, "a-b-c", "a-d-e", "b-c-d", "d-e-f"))
    
    df %>% tidyr::extract(x,
                         # make 2 columns
                         c("A", "C"), 
                         # here's the groups I want - surrounded by ()
                         # only first, and last
                         "([a-z]+)-[a-z]-([a-z]+)")
    # A tibble: 5 x 2
      A     C    
      <chr> <chr>
    1 <NA>  <NA> 
    2 a     c    
    3 a     e    
    4 b     d    
    5 d     f    
    df %>% tidyr::separate(x,
                         # You can also do it using separate
                         # make 2 columns
                         c("A", "C"), 
                         # here's the groups I want - surrounded by ()
                         # only first, and last
                         "-[a-z]-")
    # A tibble: 5 x 2
      A     C    
      <chr> <chr>
    1 <NA>  <NA> 
    2 a     c    
    3 a     e    
    4 b     d    
    5 d     f    

Missing values

A value can be missing in one of two possible ways:

  • Explicitly, i.e. flagged with NA.
  • Implicitly, i.e. simply not present in the data.

For example here we have explicit NA as well as 2016 is missing quarter 1.

(stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
))

# A tibble: 7 x 3
   year   qtr return
  <dbl> <dbl>  <dbl>
1  2015     1   1.88
2  2015     2   0.59
3  2015     3   0.35
4  2015     4  NA   
5  2016     2   0.92
6  2016     3   0.17
7  2016     4   2.66

Representing the dataset can make the missing values more clear.

  1. Show implicit missing values.

    stocks %>% 
      pivot_wider(names_from = year, values_from = return)
    # A tibble: 4 x 3
        qtr `2015` `2016`
      <dbl>  <dbl>  <dbl>
    1     1   1.88  NA   
    2     2   0.59   0.92
    3     3   0.35   0.17
    4     4  NA      2.66
  2. Turn explicit missing values, into implicit missing values.

    stocks %>% 
      pivot_wider(names_from = year, values_from = return) %>% 
      pivot_longer(
        cols = c(`2015`, `2016`), 
        names_to = "year", 
        values_to = "return", 
        values_drop_na = TRUE
      )
    # A tibble: 6 x 3
        qtr year  return
      <dbl> <chr>  <dbl>
    1     1 2015    1.88
    2     2 2015    0.59
    3     2 2016    0.92
    4     3 2015    0.35
    5     3 2016    0.17
    6     4 2016    2.66
  3. Use complete() : make missing values explicit, by looking at a set of columns and finding all combinations. The gaps are filled with NA explicitly.

    stocks %>% 
      complete(year, qtr)
    # A tibble: 8 x 3
       year   qtr return
      <dbl> <dbl>  <dbl>
    1  2015     1   1.88
    2  2015     2   0.59
    3  2015     3   0.35
    4  2015     4  NA   
    5  2016     1  NA   
    6  2016     2   0.92
    7  2016     3   0.17
    8  2016     4   2.66
  4. Sometimes when working with data the missing are an indication that the previous rows value must be filled down.

    (treatment <- tribble(
      ~ person,           ~ treatment, ~response,
      "Derrick Whitmore", 1,           7,
      NA,                 2,           10,
      NA,                 3,           9,
      "Katherine Burke",  1,           4
    ))
    # A tibble: 4 x 3
      person           treatment response
      <chr>                <dbl>    <dbl>
    1 Derrick Whitmore         1        7
    2 <NA>                     2       10
    3 <NA>                     3        9
    4 Katherine Burke          1        4

    Using fill() allows you to fill a columns values with the most recent non-missing value.

    treatment %>% 
      fill(person)
    # A tibble: 4 x 3
      person           treatment response
      <chr>                <dbl>    <dbl>
    1 Derrick Whitmore         1        7
    2 Derrick Whitmore         2       10
    3 Derrick Whitmore         3        9
    4 Katherine Burke          1        4

Exercises

  1. Compare and contrast the fill arguments to pivot_wider() and complete().

    • Both fill missing values with something other than NA.
    • In pivot_wider() it is called values_fill, and the spec is either a scalar or named list.
    • In complete() it is called fill, and the spec is a named list.
    (df <- tibble(
         group = c(1:2, NA_real_, 1),
         item_id = c(1:2, 1, 2),
         item_name = c("a", "NA", "b", "b"),
         value1 = 1:4,
         value2 = 4:7
     ))
    # A tibble: 4 x 5
      group item_id item_name value1 value2
      <dbl>   <dbl> <chr>      <int>  <int>
    1     1       1 a              1      4
    2     2       2 NA             2      5
    3    NA       1 b              3      6
    4     1       2 b              4      7
    df %>% 
      pivot_wider(names_from = item_name, 
                  values_from = value1,
                  # the values are filled with 0 if missing
                  values_fill = 0)
    # A tibble: 4 x 6
      group item_id value2     a  `NA`     b
      <dbl>   <dbl>  <int> <int> <int> <int>
    1     1       1      4     1     0     0
    2     2       2      5     0     2     0
    3    NA       1      6     0     0     3
    4     1       2      7     0     0     4
    df %>% 
      complete(group, item_id,
               # explicitly list cols you want filled
               # with alternate to NA
               fill = list(group = 0,
               item_name = "0"))
    # A tibble: 6 x 5
      group item_id item_name value1 value2
      <dbl>   <dbl> <chr>      <int>  <int>
    1     1       1 a              1      4
    2     1       2 b              4      7
    3     2       1 0             NA     NA
    4     2       2 NA             2      5
    5     0       1 b              3      6
    6     0       2 0             NA     NA
  2. What does the direction argument to fill() do?

    The argument is .direction and tells you in which direction the fill() must occur, and can be one of 4 different values.

    .direction = c(“down”, “up”, “downup”, “updown”)

    # Value (n_squirrels) is missing above and below within a group
    (squirrels <- tibble::tribble(
      ~group,    ~name,     ~role,     ~n_squirrels,
      1,      "Sam",    "Observer",   NA,
      1,     "Mara", "Scorekeeper",    8,
      1,    "Jesse",    "Observer",   NA,
      1,      "Tom",    "Observer",   NA,
      2,     "Mike",    "Observer",   NA,
      2,  "Rachael",    "Observer",   NA,
      2,  "Sydekea", "Scorekeeper",   14,
      2, "Gabriela",    "Observer",   NA,
      3,  "Derrick",    "Observer",   NA,
      3,     "Kara", "Scorekeeper",    9,
      3,    "Emily",    "Observer",   NA,
      3, "Danielle",    "Observer",   NA
    ))
    # A tibble: 12 x 4
       group name     role        n_squirrels
       <dbl> <chr>    <chr>             <dbl>
     1     1 Sam      Observer             NA
     2     1 Mara     Scorekeeper           8
     3     1 Jesse    Observer             NA
     4     1 Tom      Observer             NA
     5     2 Mike     Observer             NA
     6     2 Rachael  Observer             NA
     7     2 Sydekea  Scorekeeper          14
     8     2 Gabriela Observer             NA
     9     3 Derrick  Observer             NA
    10     3 Kara     Scorekeeper           9
    11     3 Emily    Observer             NA
    12     3 Danielle Observer             NA
    # The values are inconsistently missing by position within the group
    # Use .direction = "downup" to fill missing values in both directions
    squirrels %>%
      dplyr::group_by(group) %>%
      fill(n_squirrels, .direction = "downup") %>%
      dplyr::ungroup()
    # A tibble: 12 x 4
       group name     role        n_squirrels
       <dbl> <chr>    <chr>             <dbl>
     1     1 Sam      Observer              8
     2     1 Mara     Scorekeeper           8
     3     1 Jesse    Observer              8
     4     1 Tom      Observer              8
     5     2 Mike     Observer             14
     6     2 Rachael  Observer             14
     7     2 Sydekea  Scorekeeper          14
     8     2 Gabriela Observer             14
     9     3 Derrick  Observer              9
    10     3 Kara     Scorekeeper           9
    11     3 Emily    Observer              9
    12     3 Danielle Observer              9
    # contrast with "updown"
    squirrels %>%
      dplyr::group_by(group) %>%
      # should be the same in this e.g.
      fill(n_squirrels, .direction = "updown") %>%
      dplyr::ungroup()
    # A tibble: 12 x 4
       group name     role        n_squirrels
       <dbl> <chr>    <chr>             <dbl>
     1     1 Sam      Observer              8
     2     1 Mara     Scorekeeper           8
     3     1 Jesse    Observer              8
     4     1 Tom      Observer              8
     5     2 Mike     Observer             14
     6     2 Rachael  Observer             14
     7     2 Sydekea  Scorekeeper          14
     8     2 Gabriela Observer             14
     9     3 Derrick  Observer              9
    10     3 Kara     Scorekeeper           9
    11     3 Emily    Observer              9
    12     3 Danielle Observer              9
    # contrast with "down"
    squirrels %>%
      dplyr::group_by(group) %>%
      fill(n_squirrels, .direction = "down") %>%
      dplyr::ungroup()
    # A tibble: 12 x 4
       group name     role        n_squirrels
       <dbl> <chr>    <chr>             <dbl>
     1     1 Sam      Observer             NA
     2     1 Mara     Scorekeeper           8
     3     1 Jesse    Observer              8
     4     1 Tom      Observer              8
     5     2 Mike     Observer             NA
     6     2 Rachael  Observer             NA
     7     2 Sydekea  Scorekeeper          14
     8     2 Gabriela Observer             14
     9     3 Derrick  Observer             NA
    10     3 Kara     Scorekeeper           9
    11     3 Emily    Observer              9
    12     3 Danielle Observer              9
    # contrast with "up"
    squirrels %>%
      dplyr::group_by(group) %>%
      fill(n_squirrels, .direction = "up") %>%
      dplyr::ungroup()
    # A tibble: 12 x 4
       group name     role        n_squirrels
       <dbl> <chr>    <chr>             <dbl>
     1     1 Sam      Observer              8
     2     1 Mara     Scorekeeper           8
     3     1 Jesse    Observer             NA
     4     1 Tom      Observer             NA
     5     2 Mike     Observer             14
     6     2 Rachael  Observer             14
     7     2 Sydekea  Scorekeeper          14
     8     2 Gabriela Observer             NA
     9     3 Derrick  Observer              9
    10     3 Kara     Scorekeeper           9
    11     3 Emily    Observer             NA
    12     3 Danielle Observer             NA

Case Study

who
# A tibble: 7,240 x 60
   country iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
   <chr>   <chr> <chr> <int>       <int>        <int>        <int>        <int>
 1 Afghan~ AF    AFG    1980          NA           NA           NA           NA
 2 Afghan~ AF    AFG    1981          NA           NA           NA           NA
 3 Afghan~ AF    AFG    1982          NA           NA           NA           NA
 4 Afghan~ AF    AFG    1983          NA           NA           NA           NA
 5 Afghan~ AF    AFG    1984          NA           NA           NA           NA
 6 Afghan~ AF    AFG    1985          NA           NA           NA           NA
 7 Afghan~ AF    AFG    1986          NA           NA           NA           NA
 8 Afghan~ AF    AFG    1987          NA           NA           NA           NA
 9 Afghan~ AF    AFG    1988          NA           NA           NA           NA
10 Afghan~ AF    AFG    1989          NA           NA           NA           NA
# ... with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
#   new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
#   new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
#   new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
#   new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
#   new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
#   new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
#   new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
#   new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
#   new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>,
#   new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
#   new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>,
#   new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
#   new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>,
#   newrel_m2534 <int>, newrel_m3544 <int>, newrel_m4554 <int>,
#   newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>,
#   newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>,
#   newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>
(
  who1 <- who %>% 
          # change the wide data to long data
          pivot_longer(
            cols = new_sp_m014:newrel_f65, 
            names_to = "key", 
            values_to = "cases", 
            values_drop_na = TRUE
          )
)
# A tibble: 76,046 x 6
   country     iso2  iso3   year key          cases
   <chr>       <chr> <chr> <int> <chr>        <int>
 1 Afghanistan AF    AFG    1997 new_sp_m014      0
 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
 4 Afghanistan AF    AFG    1997 new_sp_m3544     3
 5 Afghanistan AF    AFG    1997 new_sp_m4554     5
 6 Afghanistan AF    AFG    1997 new_sp_m5564     2
 7 Afghanistan AF    AFG    1997 new_sp_m65       0
 8 Afghanistan AF    AFG    1997 new_sp_f014      5
 9 Afghanistan AF    AFG    1997 new_sp_f1524    38
10 Afghanistan AF    AFG    1997 new_sp_f2534    36
# ... with 76,036 more rows
who1 %>% 
  # how many keys do we have?
  count(key, sort = TRUE) %>% 
  DT::datatable()

To parse the columns, refer to Chaper 9 (Chapter 12 online).

(who2 <- who1 %>% 
  # Look in key. Do you see any strings that have "newrel"?
  # If you do replace these with new_rel for consistent
  # key values for easier parsing
  mutate(key = stringr::str_replace(key,
                                    "newrel",
                                    "new_rel")))
# A tibble: 76,046 x 6
   country     iso2  iso3   year key          cases
   <chr>       <chr> <chr> <int> <chr>        <int>
 1 Afghanistan AF    AFG    1997 new_sp_m014      0
 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
 4 Afghanistan AF    AFG    1997 new_sp_m3544     3
 5 Afghanistan AF    AFG    1997 new_sp_m4554     5
 6 Afghanistan AF    AFG    1997 new_sp_m5564     2
 7 Afghanistan AF    AFG    1997 new_sp_m65       0
 8 Afghanistan AF    AFG    1997 new_sp_f014      5
 9 Afghanistan AF    AFG    1997 new_sp_f1524    38
10 Afghanistan AF    AFG    1997 new_sp_f2534    36
# ... with 76,036 more rows
(
  who3 <- who2 %>% 
    separate(key,
             c("new", "type", "sexage"),
             sep = "_")
)
# A tibble: 76,046 x 8
   country     iso2  iso3   year new   type  sexage cases
   <chr>       <chr> <chr> <int> <chr> <chr> <chr>  <int>
 1 Afghanistan AF    AFG    1997 new   sp    m014       0
 2 Afghanistan AF    AFG    1997 new   sp    m1524     10
 3 Afghanistan AF    AFG    1997 new   sp    m2534      6
 4 Afghanistan AF    AFG    1997 new   sp    m3544      3
 5 Afghanistan AF    AFG    1997 new   sp    m4554      5
 6 Afghanistan AF    AFG    1997 new   sp    m5564      2
 7 Afghanistan AF    AFG    1997 new   sp    m65        0
 8 Afghanistan AF    AFG    1997 new   sp    f014       5
 9 Afghanistan AF    AFG    1997 new   sp    f1524     38
10 Afghanistan AF    AFG    1997 new   sp    f2534     36
# ... with 76,036 more rows

Ok, from the chapter text it seems we only have new cases, so let us verify that.

who3 %>% 
  count(new)
# A tibble: 1 x 2
  new       n
  <chr> <int>
1 new   76046

This is indeed the case! The count == the dataset count so we can drop this column

(
  who4 <- who3 %>% 
    select(-new, -iso2, iso3)
)
# A tibble: 76,046 x 6
   country     iso3   year type  sexage cases
   <chr>       <chr> <int> <chr> <chr>  <int>
 1 Afghanistan AFG    1997 sp    m014       0
 2 Afghanistan AFG    1997 sp    m1524     10
 3 Afghanistan AFG    1997 sp    m2534      6
 4 Afghanistan AFG    1997 sp    m3544      3
 5 Afghanistan AFG    1997 sp    m4554      5
 6 Afghanistan AFG    1997 sp    m5564      2
 7 Afghanistan AFG    1997 sp    m65        0
 8 Afghanistan AFG    1997 sp    f014       5
 9 Afghanistan AFG    1997 sp    f1524     38
10 Afghanistan AFG    1997 sp    f2534     36
# ... with 76,036 more rows

Let’s separate the sexage column.

(
  who5 <- who4 %>% 
    separate(sexage, c("sex", "age"),
             sep = 1)
)
# A tibble: 76,046 x 7
   country     iso3   year type  sex   age   cases
   <chr>       <chr> <int> <chr> <chr> <chr> <int>
 1 Afghanistan AFG    1997 sp    m     014       0
 2 Afghanistan AFG    1997 sp    m     1524     10
 3 Afghanistan AFG    1997 sp    m     2534      6
 4 Afghanistan AFG    1997 sp    m     3544      3
 5 Afghanistan AFG    1997 sp    m     4554      5
 6 Afghanistan AFG    1997 sp    m     5564      2
 7 Afghanistan AFG    1997 sp    m     65        0
 8 Afghanistan AFG    1997 sp    f     014       5
 9 Afghanistan AFG    1997 sp    f     1524     38
10 Afghanistan AFG    1997 sp    f     2534     36
# ... with 76,036 more rows

Pipeline version

who %>% 
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = "key",
               values_to = "cases",
               values_drop_na = TRUE) %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>% 
  separate(key, c("new", "type", "sexage"),
                  sep = "_") %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"),
           # first pos = sex, rest = age group
           sep = 1)
# A tibble: 76,046 x 6
   country      year type  sex   age   cases
   <chr>       <int> <chr> <chr> <chr> <int>
 1 Afghanistan  1997 sp    m     014       0
 2 Afghanistan  1997 sp    m     1524     10
 3 Afghanistan  1997 sp    m     2534      6
 4 Afghanistan  1997 sp    m     3544      3
 5 Afghanistan  1997 sp    m     4554      5
 6 Afghanistan  1997 sp    m     5564      2
 7 Afghanistan  1997 sp    m     65        0
 8 Afghanistan  1997 sp    f     014       5
 9 Afghanistan  1997 sp    f     1524     38
10 Afghanistan  1997 sp    f     2534     36
# ... with 76,036 more rows

Exercises

  1. In this case study I set values_drop_na = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

    If we consider the missing values in this data set, it does seem reasonable.

    who %>% 
      select(new_sp_m014:newrel_f65) %>% 
      skimr::skim()
    Data summary
    Name Piped data
    Number of rows 7240
    Number of columns 56
    _______________________
    Column type frequency:
    numeric 56
    ________________________
    Group variables None

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    new_sp_m014 4067 0.44 83.71 316.14 0 0.00 5.0 37.00 5001 ▇▁▁▁▁
    new_sp_m1524 4031 0.44 1015.66 4885.38 0 9.00 90.0 502.00 78278 ▇▁▁▁▁
    new_sp_m2534 4034 0.44 1403.80 5718.39 0 14.00 150.0 715.50 84003 ▇▁▁▁▁
    new_sp_m3544 4021 0.44 1315.88 6003.26 0 13.00 130.0 583.50 90830 ▇▁▁▁▁
    new_sp_m4554 4017 0.45 1103.86 5441.06 0 12.00 102.0 440.00 82921 ▇▁▁▁▁
    new_sp_m5564 4022 0.44 800.70 4418.31 0 8.00 63.0 279.00 63814 ▇▁▁▁▁
    new_sp_m65 4031 0.44 682.82 4089.14 0 8.00 53.0 232.00 70376 ▇▁▁▁▁
    new_sp_f014 4066 0.44 114.33 504.63 0 1.00 7.0 50.75 8576 ▇▁▁▁▁
    new_sp_f1524 4046 0.44 826.11 3552.02 0 7.00 66.0 421.00 53975 ▇▁▁▁▁
    new_sp_f2534 4040 0.44 917.30 3580.15 0 9.00 84.0 476.25 49887 ▇▁▁▁▁
    new_sp_f3544 4041 0.44 640.43 2542.51 0 6.00 57.0 308.00 34698 ▇▁▁▁▁
    new_sp_f4554 4036 0.44 445.78 1799.23 0 4.00 38.0 211.00 23977 ▇▁▁▁▁
    new_sp_f5564 4045 0.44 313.87 1381.25 0 3.00 25.0 146.50 18203 ▇▁▁▁▁
    new_sp_f65 4043 0.44 283.93 1267.94 0 4.00 30.0 129.00 21339 ▇▁▁▁▁
    new_sn_m014 6195 0.14 308.75 1727.25 0 1.00 9.0 61.00 22355 ▇▁▁▁▁
    new_sn_m1524 6210 0.14 513.02 3643.27 0 2.00 15.5 102.00 60246 ▇▁▁▁▁
    new_sn_m2534 6218 0.14 653.69 3430.03 0 2.00 23.0 135.50 50282 ▇▁▁▁▁
    new_sn_m3544 6215 0.14 837.87 8524.53 0 2.00 19.0 132.00 250051 ▇▁▁▁▁
    new_sn_m4554 6213 0.14 520.79 3301.70 0 2.00 19.0 127.50 57181 ▇▁▁▁▁
    new_sn_m5564 6219 0.14 448.62 3488.68 0 2.00 16.0 101.00 64972 ▇▁▁▁▁
    new_sn_m65 6220 0.14 460.36 3991.90 0 2.00 20.5 111.75 74282 ▇▁▁▁▁
    new_sn_f014 6200 0.14 291.95 1647.30 0 1.00 8.0 58.00 21406 ▇▁▁▁▁
    new_sn_f1524 6218 0.14 407.90 2379.13 0 1.00 12.0 89.00 35518 ▇▁▁▁▁
    new_sn_f2534 6224 0.14 466.26 2272.86 0 2.00 18.0 103.25 28753 ▇▁▁▁▁
    new_sn_f3544 6220 0.14 506.59 5013.53 0 1.00 11.0 82.25 148811 ▇▁▁▁▁
    new_sn_f4554 6222 0.14 271.16 1511.72 0 1.00 10.0 76.75 23869 ▇▁▁▁▁
    new_sn_f5564 6223 0.14 213.39 1468.62 0 1.00 8.0 56.00 26085 ▇▁▁▁▁
    new_sn_f65 6221 0.14 230.75 1597.70 0 1.00 13.0 74.00 29630 ▇▁▁▁▁
    new_ep_m014 6202 0.14 128.61 460.14 0 0.00 6.0 55.00 7869 ▇▁▁▁▁
    new_ep_m1524 6214 0.14 158.30 537.74 0 1.00 11.0 88.00 8558 ▇▁▁▁▁
    new_ep_m2534 6220 0.14 201.23 764.05 0 1.00 13.0 124.00 11843 ▇▁▁▁▁
    new_ep_m3544 6216 0.14 272.72 3381.41 0 1.00 10.5 91.25 105825 ▇▁▁▁▁
    new_ep_m4554 6220 0.14 108.11 380.61 0 1.00 8.5 63.25 5875 ▇▁▁▁▁
    new_ep_m5564 6225 0.14 72.17 234.55 0 1.00 7.0 46.00 3957 ▇▁▁▁▁
    new_ep_m65 6222 0.14 78.94 227.34 0 1.00 10.0 55.00 3061 ▇▁▁▁▁
    new_ep_f014 6208 0.14 112.89 446.55 0 0.00 5.0 50.00 6960 ▇▁▁▁▁
    new_ep_f1524 6219 0.14 149.17 543.89 0 1.00 9.0 78.00 7866 ▇▁▁▁▁
    new_ep_f2534 6219 0.14 189.52 761.79 0 1.00 12.0 95.00 10759 ▇▁▁▁▁
    new_ep_f3544 6219 0.14 241.70 3218.50 0 1.00 9.0 77.00 101015 ▇▁▁▁▁
    new_ep_f4554 6223 0.14 93.77 339.33 0 1.00 8.0 56.00 6759 ▇▁▁▁▁
    new_ep_f5564 6223 0.14 63.04 212.95 0 1.00 6.0 42.00 4684 ▇▁▁▁▁
    new_ep_f65 6226 0.14 72.31 202.72 0 0.00 10.0 51.00 2548 ▇▁▁▁▁
    newrel_m014 7050 0.03 538.18 2082.18 0 5.00 32.5 210.00 18617 ▇▁▁▁▁
    newrel_m1524 7058 0.03 1489.51 6848.18 0 17.50 171.0 684.25 84785 ▇▁▁▁▁
    newrel_m2534 7057 0.03 2139.72 7539.87 0 25.00 217.0 1091.00 76917 ▇▁▁▁▁
    newrel_m3544 7056 0.03 2036.40 7847.94 0 24.75 208.0 851.25 84565 ▇▁▁▁▁
    newrel_m4554 7056 0.03 1835.07 8324.28 0 19.00 175.0 688.50 100297 ▇▁▁▁▁
    newrel_m5564 7055 0.03 1525.30 8760.27 0 13.00 136.0 536.00 112558 ▇▁▁▁▁
    newrel_m65 7058 0.03 1426.00 9431.99 0 17.00 117.0 453.50 124476 ▇▁▁▁▁
    newrel_f014 7050 0.03 532.84 2117.78 0 5.00 32.5 226.00 18054 ▇▁▁▁▁
    newrel_f1524 7056 0.03 1161.85 4606.76 0 10.75 123.0 587.75 49491 ▇▁▁▁▁
    newrel_f2534 7058 0.03 1472.80 5259.59 0 18.00 161.0 762.50 44985 ▇▁▁▁▁
    newrel_f3544 7057 0.03 1125.01 4210.58 0 12.50 125.0 544.50 38804 ▇▁▁▁▁
    newrel_f4554 7057 0.03 877.27 3556.18 0 10.00 92.0 400.50 37138 ▇▁▁▁▁
    newrel_f5564 7057 0.03 686.41 3379.33 0 8.00 69.0 269.00 40892 ▇▁▁▁▁
    newrel_f65 7055 0.03 683.76 3618.47 0 9.00 69.0 339.00 47438 ▇▁▁▁▁
    who %>% 
      select(new_sp_m014) %>% 
      filter(new_sp_m014 == 0 |
             is.na(new_sp_m014)) %>% 
      skimr::skim()   
    Data summary
    Name Piped data
    Number of rows 4929
    Number of columns 1
    _______________________
    Column type frequency:
    numeric 1
    ________________________
    Group variables None

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    new_sp_m014 4067 0.17 0 0 0 0 0 0 0 ▁▁▇▁▁
    who %>% 
      select(new_sp_m014) %>% 
      filter(new_sp_m014 == 0) %>% 
      skimr::skim()   
    Data summary
    Name Piped data
    Number of rows 862
    Number of columns 1
    _______________________
    Column type frequency:
    numeric 1
    ________________________
    Group variables None

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    new_sp_m014 0 1 0 0 0 0 0 0 0 ▁▁▇▁▁
    who %>% 
      select(new_sp_m014) %>% 
      filter(is.na(new_sp_m014)) %>% 
      skimr::skim()   
    Data summary
    Name Piped data
    Number of rows 4067
    Number of columns 1
    _______________________
    Column type frequency:
    numeric 1
    ________________________
    Group variables None

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    new_sp_m014 4067 0 NaN NA NA NA NA NA NA

    There may be implicit values if a year is missing for a country.

    0 means I received a measure for that variable and it is no cases. NA could mean I received no information for that variable so cases may or may not exist.

  2. What happens if you neglect the mutate() step? (mutate(names_from = stringr::str_replace(key, "newrel", "new_rel")))

    For some of the steps we will have misaligned data when we separate these later in the pipeline.

    Let’s take for example newrel_m3544

    • in the separate(key, …) step new = newrel, type = m3544
    • this will complicate the dropping of “new”
    • it will also mean that type would be m3544.
  3. I claimed that iso2 and iso3 were redundant with country. Confirm this claim.

    who %>% 
      count(country, sort = TRUE)
    # A tibble: 219 x 2
       country                 n
       <chr>               <int>
     1 Afghanistan            34
     2 Albania                34
     3 Algeria                34
     4 American Samoa         34
     5 Andorra                34
     6 Angola                 34
     7 Anguilla               34
     8 Antigua and Barbuda    34
     9 Argentina              34
    10 Armenia                34
    # ... with 209 more rows
    who %>% 
      count(iso2, sort = TRUE)
    # A tibble: 219 x 2
       iso2      n
       <chr> <int>
     1 AD       34
     2 AE       34
     3 AF       34
     4 AG       34
     5 AI       34
     6 AL       34
     7 AM       34
     8 AO       34
     9 AR       34
    10 AS       34
    # ... with 209 more rows
    who %>% 
      count(iso3, sort = TRUE)
    # A tibble: 219 x 2
       iso3      n
       <chr> <int>
     1 ABW      34
     2 AFG      34
     3 AGO      34
     4 AIA      34
     5 ALB      34
     6 AND      34
     7 ARE      34
     8 ARG      34
     9 ARM      34
    10 ASM      34
    # ... with 209 more rows
    who %>% 
      select(country, iso2, iso3) %>% 
      skimr::skim()
    Data summary
    Name Piped data
    Number of rows 7240
    Number of columns 3
    _______________________
    Column type frequency:
    character 3
    ________________________
    Group variables None

    Variable type: character

    skim_variable n_missing complete_rate min max empty n_unique whitespace
    country 0 1 4 52 0 219 0
    iso2 0 1 2 2 0 219 0
    iso3 0 1 3 3 0 219 0

    In the skim for character variables the min tells you the min chars in a observation, and max is the max chars in a variable.

  4. For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.

    who5 %>% 
      group_by(country, year, sex) %>% 
      summarise(total_cases = sum(cases)) %>% 
      ungroup() %>% 
      arrange(country, year, sex) %>% 
      filter(country %in% c("South Africa", "Botswana",
                           "Zimbabwe", "Namibia", "Swaziland",
                           "Lesotho", "Mozambique")) %>% 
      ggplot(aes(x = year, y = total_cases, colour = sex)) +
      geom_point(alpha = 0.5) +
      facet_wrap(~ country, scales = "free_y") +
      labs(title = "Total Cases Across South Africa 
           and neighbouring countries over time",
           subtitle = "The number of cases is growing quite a bit over time",
           x = "", 
           y = "Cases Per Year") +
      scale_colour_tq()

    who5 %>% 
      group_by(country, year, sex) %>% 
      summarise(total_cases = sum(cases)) %>% 
      ungroup() %>% 
      arrange(country, year, sex) %>% 
      group_by(year, sex) %>% 
      summarise(cases_per_year = sum(total_cases)) %>% 
      ungroup() %>% 
      ggplot(aes(x = year, y = cases_per_year, fill = sex)) +
      geom_col(alpha = 0.5) +
      scale_fill_tq() +
      scale_y_log10() +
      labs(title = "Total Cases Across all countries over time",
           subtitle = "The number of cases is growing quite a bit over time",
           x = "", 
           y = "Cases Per Year")


sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
[3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
[5] LC_TIME=English_South Africa.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidyquant_1.0.0            quantmod_0.4.17           
 [3] TTR_0.23-6                 PerformanceAnalytics_2.0.4
 [5] xts_0.12-0                 zoo_1.8-7                 
 [7] magrittr_1.5               lubridate_1.7.9           
 [9] emo_0.0.0.9000             flair_0.0.2               
[11] forcats_0.5.0              stringr_1.4.0             
[13] dplyr_1.0.2                purrr_0.3.4               
[15] readr_1.4.0                tidyr_1.1.2               
[17] tibble_3.0.3               ggplot2_3.3.2             
[19] tidyverse_1.3.0            workflowr_1.6.2           

loaded via a namespace (and not attached):
 [1] fs_1.5.0          httr_1.4.2        rprojroot_1.3-2   repr_1.1.0       
 [5] tools_3.6.3       backports_1.1.6   utf8_1.1.4        R6_2.4.1         
 [9] DT_0.16           DBI_1.1.0         colorspace_1.4-1  withr_2.2.0      
[13] tidyselect_1.1.0  curl_4.3          compiler_3.6.3    git2r_0.26.1     
[17] cli_2.1.0         rvest_0.3.6       gt_0.2.2          xml2_1.3.2       
[21] labeling_0.3      sass_0.2.0        scales_1.1.0      checkmate_2.0.0  
[25] quadprog_1.5-8    digest_0.6.27     rmarkdown_2.4     base64enc_0.1-3  
[29] pkgconfig_2.0.3   htmltools_0.5.0   dbplyr_2.0.0      highr_0.8        
[33] htmlwidgets_1.5.1 rlang_0.4.8       readxl_1.3.1      rstudioapi_0.11  
[37] farver_2.0.3      generics_0.0.2    jsonlite_1.7.1    crosstalk_1.1.0.1
[41] Rcpp_1.0.4.6      Quandl_2.10.0     munsell_0.5.0     fansi_0.4.1      
[45] lifecycle_0.2.0   stringi_1.5.3     whisker_0.4       yaml_2.2.1       
[49] snakecase_0.11.0  grid_3.6.3        promises_1.1.0    crayon_1.3.4     
[53] lattice_0.20-38   haven_2.3.1       hms_0.5.3         knitr_1.28       
[57] ps_1.3.2          pillar_1.4.6      reprex_0.3.0      glue_1.4.2       
[61] evaluate_0.14     modelr_0.1.8      vctrs_0.3.2       httpuv_1.5.2     
[65] cellranger_1.1.0  gtable_0.3.0      assertthat_0.2.1  xfun_0.13        
[69] skimr_2.1.1       janitor_2.0.1     broom_0.7.2       later_1.0.0      
[73] ellipsis_0.3.1