Last updated: 2020-11-21
Checks: 7 passed, 0 failed
Knit directory: r4ds_book/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200814)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 6e7b3db. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Untracked files:
Untracked: analysis/images/
Untracked: code_snipp.txt
Untracked: data/at_health_facilities.csv
Untracked: data/infant_hiv.csv
Untracked: data/measurements.csv
Untracked: data/person.csv
Untracked: data/ranking.csv
Untracked: data/visited.csv
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch9_tidy_data.Rmd) and HTML (docs/ch9_tidy_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 7ed0458 | sciencificity | 2020-11-10 | Build site. |
html | 86457fa | sciencificity | 2020-11-10 | Build site. |
html | 4879249 | sciencificity | 2020-11-09 | Build site. |
html | e423967 | sciencificity | 2020-11-08 | Build site. |
html | 0d223fb | sciencificity | 2020-11-08 | Build site. |
html | ecd1d8e | sciencificity | 2020-11-07 | Build site. |
html | 274005c | sciencificity | 2020-11-06 | Build site. |
html | 60e7ce2 | sciencificity | 2020-11-02 | Build site. |
html | db5a796 | sciencificity | 2020-11-01 | Build site. |
html | d8813e9 | sciencificity | 2020-11-01 | Build site. |
html | bf15f3b | sciencificity | 2020-11-01 | Build site. |
html | 0aef1b0 | sciencificity | 2020-10-31 | Build site. |
html | bdc0881 | sciencificity | 2020-10-26 | Build site. |
html | 8224544 | sciencificity | 2020-10-26 | Build site. |
html | 2f8dcc0 | sciencificity | 2020-10-25 | Build site. |
Rmd | 648d291 | sciencificity | 2020-10-25 | completed Ch9 |
html | 61e2324 | sciencificity | 2020-10-25 | Build site. |
Rmd | 57f23a8 | sciencificity | 2020-10-25 | added Ch9 |
options(scipen=10000)
library(tidyverse)
library(flair)
library(emo)
library(lubridate)
library(magrittr)
library(tidyquant)
theme_set(theme_tq())
In all the examples, tidyr::table1 to tidyr::table5, only tidyr::table1 is tidy.
(
# practising the read_csv function to create table1
# just note however that table1 is in tidyr ;)
# tidyr::table1 etc.
# In all honesty, I only figured this out after "practising" :P
table1 <- read_csv("country, year, cases, population
Afghanistan, 1999, 745, 19987071
Afghanistan, 2000, 2666, 20595360
Brazil, 1999, 37737, 172006362
Brazil, 2000, 80488, 174504898
China, 1999, 212258, 1272915272
China, 2000, 213766, 1280428583")
)
# A tibble: 6 x 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table1 %>%
mutate(rate = cases/population * 10000)
# A tibble: 6 x 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
table1 %>%
count(year, wt=cases) # same as group_by and sum
# A tibble: 2 x 2
year n
<dbl> <dbl>
1 1999 250740
2 2000 296920
table1 %>%
group_by(year) %>%
summarise(sum(cases))
# A tibble: 2 x 2
year `sum(cases)`
<dbl> <dbl>
1 1999 250740
2 2000 296920
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_colour_tq()
Using prose, describe how the variables and observations are organised in each of the sample tables.
tidyr::table1
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
tidyr::table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
tidyr::table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table3 combines cases and population into a single rate column (separated within the column by /).
tidyr::table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
tidyr::table4b
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
tidyr::table5
# A tibble: 6 x 4
country century year rate
* <chr> <chr> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
table5 also has the combined rate column (separated within the column by /), with year split across century and year.
Compute the rate for table2, and table4a + table4b. You will need to perform four operations:
Extract the number of TB cases per country per year.
Extract the matching population per country per year.
Divide cases by population, and multiply by 10000.
Store back in the appropriate place.
Which representation is easiest to work with? Which is hardest? Why?
(
tbl1 <- tidyr::table2 %>%
filter(type == "cases") %>%
group_by(country, year) %>%
mutate(cases = count) %>%
ungroup() %>%
select(country, year, cases) %>%
arrange(country, year)
)
# A tibble: 6 x 3
country year cases
<chr> <int> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
(
tbl2 <- tidyr::table2 %>%
filter(type == "population") %>%
group_by(country, year) %>%
mutate(population = count) %>%
ungroup() %>%
select(country_temp = country,
year_temp = year,
population) %>%
arrange(country_temp, year_temp)
)
# A tibble: 6 x 3
country_temp year_temp population
<chr> <int> <int>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
(
tbl3 <- tbl1 %>%
bind_cols(tbl2) %>%
select(c(1:3,6)) %>%
mutate(rate = (cases / population) * 10000) %>%
arrange(country, year) %>%
select(country, year, rate) %>%
mutate(type = "rate",
count = rate) %>%
select(c(1,2,4,5))
)
# A tibble: 6 x 4
country year type count
<chr> <int> <chr> <dbl>
1 Afghanistan 1999 rate 0.373
2 Afghanistan 2000 rate 1.29
3 Brazil 1999 rate 2.19
4 Brazil 2000 rate 4.61
5 China 1999 rate 1.67
6 China 2000 rate 1.67
tidyr::table2 %>%
bind_rows(tbl3) %>%
mutate(count = round(count, 2)) %>%
arrange(country, year, type) %>%
gt::gt()
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745.00 |
Afghanistan | 1999 | population | 19987071.00 |
Afghanistan | 1999 | rate | 0.37 |
Afghanistan | 2000 | cases | 2666.00 |
Afghanistan | 2000 | population | 20595360.00 |
Afghanistan | 2000 | rate | 1.29 |
Brazil | 1999 | cases | 37737.00 |
Brazil | 1999 | population | 172006362.00 |
Brazil | 1999 | rate | 2.19 |
Brazil | 2000 | cases | 80488.00 |
Brazil | 2000 | population | 174504898.00 |
Brazil | 2000 | rate | 4.61 |
China | 1999 | cases | 212258.00 |
China | 1999 | population | 1272915272.00 |
China | 1999 | rate | 1.67 |
China | 2000 | cases | 213766.00 |
China | 2000 | population | 1280428583.00 |
China | 2000 | rate | 1.67 |
(
tbl1_cases <- tidyr::table4a %>%
select(country, `1999`) %>%
mutate(year = 1999,
cases = `1999`) %>%
select(country, year, cases)
)
# A tibble: 3 x 3
country year cases
<chr> <dbl> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
(
tbl2_cases <- tidyr::table4a %>%
select(country, "2000") %>%
mutate(year = 2000,
cases = `2000`) %>%
select(country, year, cases)
)
# A tibble: 3 x 3
country year cases
<chr> <dbl> <int>
1 Afghanistan 2000 2666
2 Brazil 2000 80488
3 China 2000 213766
(
tbl_cases <- tbl1_cases %>%
bind_rows(tbl2_cases) %>%
arrange(country, year)
)
# A tibble: 6 x 3
country year cases
<chr> <dbl> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
(
tbl1_pop <- tidyr::table4b %>%
select(country, `1999`) %>%
mutate(year = 1999,
population = `1999`) %>%
select(country, year, population)
)
# A tibble: 3 x 3
country year population
<chr> <dbl> <int>
1 Afghanistan 1999 19987071
2 Brazil 1999 172006362
3 China 1999 1272915272
(
tbl2_pop <- tidyr::table4b %>%
select(country, "2000") %>%
mutate(year = 2000,
population = `2000`) %>%
select(country, year, population)
)
# A tibble: 3 x 3
country year population
<chr> <dbl> <int>
1 Afghanistan 2000 20595360
2 Brazil 2000 174504898
3 China 2000 1280428583
(
tbl_pop <- tbl1_pop %>%
bind_rows(tbl2_pop) %>%
arrange(country, year)
)
# A tibble: 6 x 3
country year population
<chr> <dbl> <int>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
(
tbl_rate <- tbl_cases %>%
bind_cols(tbl_pop) %>%
janitor::clean_names() %>%
select(country = country_1,
year = year_2,
cases, population) %>%
mutate(rate = cases / population * 10000)
)
# A tibble: 6 x 5
country year cases population rate
<chr> <dbl> <int> <int> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
(
tbl_1999 <- tbl_rate %>%
select(country, year, rate) %>%
filter(year == 1999) %>%
mutate(`1999` = rate) %>%
select(country, `1999`)
)
# A tibble: 3 x 2
country `1999`
<chr> <dbl>
1 Afghanistan 0.373
2 Brazil 2.19
3 China 1.67
(
tbl_2000 <- tbl_rate %>%
select(country, year, rate) %>%
filter(year == 2000) %>%
mutate(`2000` = rate) %>%
select(country_temp = country, `2000`)
)
# A tibble: 3 x 2
country_temp `2000`
<chr> <dbl>
1 Afghanistan 1.29
2 Brazil 4.61
3 China 1.67
(
tbl_4c <-
tbl_1999 %>%
bind_cols(tbl_2000) %>%
select(country, `1999`, `2000`)
)
# A tibble: 3 x 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 0.373 1.29
2 Brazil 2.19 4.61
3 China 1.67 1.67
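The bind_cols approach above works, but table2 becomes much easier to handle once it is widened. As an aside (jumping ahead to pivot_wider(), which is covered later in this chapter), here is a sketch of a shorter route rather than the book's solution:
tidyr::table2 %>%
  # put cases and population into their own columns
  pivot_wider(names_from = type, values_from = count) %>%
  # compute the rate per 10,000 people
  mutate(rate = cases / population * 10000)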
Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
tidyr::table1
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_colour_tq()
tidyr::table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table2 %>%
filter(type == "cases") %>%
ggplot(aes(year, count)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_colour_tq()
Sometimes you will have to resolve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns.
table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4a %>%
# gather(list out columns you want to gather like dplyr::select() style,
# key = what do you want to call the column
# these column names go into,
# value = the values of the columns will go here)
gather(`1999`, `2000`,
key = "year",
value = "cases" )
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
(tidy_4a <- table4a %>%
# cols = list the columns you want to pivot
# names_to = what will you call the new column these
# column names go into
# values_to = the values in the columns will go here
pivot_longer(cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases"))
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
table4b
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
table4b %>%
gather(`1999`, `2000`,
key = "year",
value = "population")
# A tibble: 6 x 3
country year population
<chr> <chr> <int>
1 Afghanistan 1999 19987071
2 Brazil 1999 172006362
3 China 1999 1272915272
4 Afghanistan 2000 20595360
5 Brazil 2000 174504898
6 China 2000 1280428583
(tidy_4b <- table4b %>%
pivot_longer(cols = c(`1999`, `2000`),
names_to = "year",
values_to = "population"))
# A tibble: 6 x 3
country year population
<chr> <chr> <int>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
left_join(tidy_4a, tidy_4b) %>%
arrange(country, year)
# A tibble: 6 x 4
country year cases population
<chr> <chr> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
pivot_wider()
is the opposite of pivot_longer()
. You use it when an observation is scattered across multiple rows.
table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table2 %>%
# key = column with the variable name, here `type`
spread(key = type,
# value = column with the value that will be assigned
# to new columns
value = count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2 %>%
pivot_wider(names_from = type,
values_from = count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Why are pivot_longer()
and pivot_wider()
not perfectly symmetrical?
Carefully consider the following example:
(stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
))
# A tibble: 4 x 3
year half return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2016 1 0.92
4 2016 2 0.17
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
# A tibble: 4 x 3
half year return
<dbl> <chr> <dbl>
1 1 2015 1.88
2 1 2016 0.92
3 2 2015 0.59
4 2 2016 0.17
(Hint: look at the variable types and think about column names.)
pivot_longer()
has a names_ptypes
argument, e.g. names_ptypes = list(year = double())
. What does it do?
# vignette("pivot")
stocks %>%
pivot_wider(names_from = year, values_from = return)
# A tibble: 2 x 3
half `2015` `2016`
<dbl> <dbl> <dbl>
1 1 1.88 0.92
2 2 0.59 0.17
Let’s have a look at the first part: here we take year and spread its values into columns. That means that 2015 and 2016 become variables (new columns) in our new tibble, and each return gets pulled into the appropriate column (2015/2016) against the appropriate half. By nature of this move, year, which was a double, is turned into two new column names, 2015 and 2016, which are hence “character”.
(stocks_ <- stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return"))
# A tibble: 4 x 3
half year return
<dbl> <chr> <dbl>
1 1 2015 1.88
2 1 2016 0.92
3 2 2015 0.59
4 2 2016 0.17
colnames(stocks)
[1] "year" "half" "return"
colnames(stocks_)
[1] "half" "year" "return"
Following on from that, we take these new columns and collapse them back into a year column. But the type has changed, because we made the values column names in the pivot_wider() step, so they keep their “character” nature when they are made longer again. The final result is that year started off as a double (when we created it) but ends up as a character (after the pivot_wider() and pivot_longer() steps).
The columns also get rearranged, since pivot_wider() spreads the year column into 2015 and 2016, which come after half in that initial step. When we subsequently pivot_longer(), half remains as the first column, followed by the names_to column (year in this case), and finally the values_to column (return in this case).
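If you want the round trip to hand back something closer to the original, a small sketch (reusing the stocks tibble above): convert year back to a double and restore the column order afterwards.
stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return") %>%
  # year came back as character; make it a double again
  mutate(year = as.double(year)) %>%
  # restore the original column order
  select(year, half, return)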
Q: pivot_longer()
has a names_ptypes
argument, e.g. names_ptypes = list(year = double())
. What does it do?
Okay, so upon reading the help page I expected that this argument would convert my character year column (created after the pivot_wider() step) into a double, but instead it throws an error. 😕
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`,
names_to = "year",
names_ptypes = list(year = double()),
values_to = "return"
)
Error: Can't convert <character> to <double>.
We use this to confirm that the columns we create are of the type / class we expect - so here it provides a check it seems 🤷.
To transform the column from character to double you would need to use the names_transform argument instead.
(stocks_ptypes <- stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`,
names_to = "year",
names_transform = list(year = as.double),
values_to = "return",
# is the value column of the type expected
values_ptypes = list(return = double())
))
# A tibble: 4 x 3
half year return
<dbl> <dbl> <dbl>
1 1 2015 1.88
2 1 2016 0.92
3 2 2015 0.59
4 2 2016 0.17
Strangely though, I would expect that if I transform a column from x to y (using names_transform) and then use names_ptypes to check that my name column is indeed now of type y, that would be fine. It still throws an error, so my thinking is flawed here.
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`,
names_to = "year",
names_transform = list(year = as.double),
names_ptypes = list(year = double()),
values_to = "return",
# is the value column of the type expected
values_ptypes = list(return = double())
)
Error: Can't convert <character> to <double>.
Why does this code fail?
table4a %>%
pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
Error: Can't subset columns that don't exist.
x Locations 1999 and 2000 don't exist.
i There are only 3 columns.
# Error: Can't subset columns that don't exist.
# x Locations 1999 and 2000 don't exist.
# i There are only 3 columns.
# fixing it
table4a %>%
pivot_longer(c("1999", `2000`), names_to = "year", values_to = "cases")
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
1999 and 2000 are non-syntactic column names, so they have to be surrounded by backticks (``) or quotation marks (""). Without them, tidyr tries to select the columns at positions 1999 and 2000, which don’t exist.
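A small sketch of an alternative fix that avoids typing the non-syntactic names at all: use tidyselect to pivot everything except country.
table4a %>%
  # every column except country gets pivoted
  pivot_longer(-country, names_to = "year", values_to = "cases")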
What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?
people <- tribble(
~name, ~names, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
You get a warning, and the result has a list-column for each of age and height, since Phillip Woods has two different ages.
people %>%
pivot_wider(names_from = names,
values_from = "values")
# A tibble: 2 x 3
name age height
<chr> <list> <list>
1 Phillip Woods <dbl [2]> <dbl [1]>
2 Jessica Cordero <dbl [1]> <dbl [1]>
people2 <- tribble(
~name, ~names, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age2", 50, # second age gets diff col name
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
people2 %>%
pivot_wider(names_from = names,
values_from = "values")
# A tibble: 2 x 4
name age height age2
<chr> <dbl> <dbl> <dbl>
1 Phillip Woods 45 186 50
2 Jessica Cordero 37 156 NA
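Another way to uniquely identify each value (a sketch that keeps the original people tibble): add an observation counter per name/variable pair before widening, so repeated measurements land in separate rows instead of list-columns.
people %>%
  group_by(name, names) %>%
  # number repeated measurements of the same variable for the same person
  mutate(obs = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = names, values_from = values)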
Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?
(preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
))
# A tibble: 2 x 3
pregnant male female
<chr> <dbl> <dbl>
1 yes NA 10
2 no 20 12
We need to make it longer. The variables are pregnant (yes or no), sex (male or female), and count (the number of people in each combination).
preg %>%
pivot_longer(c('male', 'female'),
names_to = 'sex',
values_to = 'count')
# A tibble: 4 x 3
pregnant sex count
<chr> <chr> <dbl>
1 yes male NA
2 yes female 10
3 no male 20
4 no female 12
separate() pulls apart one column into multiple columns by splitting wherever a separator character appears. By default, separate() will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter, e.g. ",", ";", "/", etc.).
In tidyr::table3
we have a column rate
that has a combination of the cases and population in one column separated by a /.
table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table3 %>%
# separate without specifying the separator
separate(rate, into = c("cases", "population"))
# A tibble: 6 x 4
country year cases population
<chr> <int> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
sep is a regular expression, so more complicated sequences than "/" may be supplied, e.g. sep = '[0-9]' will separate on any digit encountered.
table3 %>%
# separate with specifying the separator
separate(rate, into = c("cases", "population"),
sep = '/')
# A tibble: 6 x 4
country year cases population
<chr> <int> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
as_tibble("This0is1a2test.with3a4deliberate5mistake.") %>%
separate(value, into = c(glue::glue("col_{c(1:7)}")),
sep = '[0-9]')
# A tibble: 1 x 7
col_1 col_2 col_3 col_4 col_5 col_6 col_7
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 This is a test.with a deliberate mistake.
You can also pass a vector of integers to sep
. separate()
will interpret the integers as positions to split at.
table3 %>%
# separate with specifying the separator
separate(year, into = c("century", "year"),
# here we use a single value
# use c(2, 5) to split in multiple places
sep = 2,
convert = TRUE)
# A tibble: 6 x 4
country century year rate
<chr> <int> <int> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 0 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 0 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 0 213766/1280428583
separate()
retains the character type of the original field. Use convert = TRUE
to get separate()
to make a guess at a better type for the resulting columns.
table3 %>%
# separate with specifying the separator
separate(rate, into = c("cases", "population"),
sep = '/',
convert = TRUE)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
unite()
is the inverse of separate()
: it combines multiple columns into a single column.
The year column was deliberately made untidy in tidyr::table5 by splitting it into century and year. We can concatenate these using unite().
table5 %>%
unite(new, century, year)
# A tibble: 6 x 3
country new rate
<chr> <chr> <chr>
1 Afghanistan 19_99 745/19987071
2 Afghanistan 20_00 2666/20595360
3 Brazil 19_99 37737/172006362
4 Brazil 20_00 80488/174504898
5 China 19_99 212258/1272915272
6 China 20_00 213766/1280428583
If you look at the help page of unite (?unite) you will see that the default separator is an underscore (sep = "_"). To override this behaviour we need to specify the sep argument.
table5 %>%
unite(new, century, year, sep = '')
# A tibble: 6 x 3
country new rate
<chr> <chr> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
extra warns if, during the splitting, more columns are found than those specified in into. For example, I may have into = c("year_of_birth", "month_of_birth") but upon splitting on - three columns are found. If I don’t specify extra, then separate() will warn me that more columns were found, and the extra pieces will be missing from my result.
extra can be "warn" (the default), "drop" (drop any extra values without a warning) or "merge" (only split at most length(into) times).
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"))
Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
# A tibble: 3 x 3
one two three
<chr> <chr> <chr>
1 a b c
2 d e f
3 h i j
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"),
extra = 'drop')
# A tibble: 3 x 3
one two three
<chr> <chr> <chr>
1 a b c
2 d e f
3 h i j
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"),
extra = 'merge')
# A tibble: 3 x 3
one two three
<chr> <chr> <chr>
1 a b c
2 d e f,g
3 h i j
The default for fill is "warn".
fill warns if, during the splitting, fewer columns are found than those specified in into.
For example, I may have into = c("year_of_birth", "month_of_birth", "day_of_birth", "sex") but upon splitting on - only three columns are found. If I don’t specify fill, then separate() will warn me that there are fewer columns than expected, and the missing ones will be filled with NA.
fill can be "warn" (the default), "right" (fill with missing values on the right) or "left" (fill with missing values on the left).
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
# A tibble: 3 x 3
one two three
<chr> <chr> <chr>
1 a b c
2 d e <NA>
3 f g i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"),
fill = 'right')
# A tibble: 3 x 3
one two three
<chr> <chr> <chr>
1 a b c
2 d e <NA>
3 f g i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"),
fill = 'left')
# A tibble: 3 x 3
one two three
<chr> <chr> <chr>
1 a b c
2 <NA> d e
3 f g i
Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?
remove is TRUE by default, which drops the input column from the output; set it to FALSE to keep the original column alongside the new ones (handy for checking the split).
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"))
Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
# A tibble: 3 x 3
one two three
<chr> <chr> <chr>
1 a b c
2 d e f
3 h i j
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"),
remove = FALSE)
Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
# A tibble: 3 x 4
x one two three
<chr> <chr> <chr> <chr>
1 a,b,c a b c
2 d,e,f,g d e f
3 h,i,j h i j
Notice that in the first call x is removed from the output. In the second call we added remove = FALSE and x is kept.
Compare and contrast separate()
and extract()
. Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
separate() has variations by position, by separator, and with groups because it takes a string and splits it into multiple parts; unite(), on the other hand, combines separate columns and there is only one way to do it: specify the columns you want to join, in the order they should be joined, with an appropriate separator.
separate() does not throw away any part of the string unless it "falls over the edge", i.e. there is extra string left over after the splitting and it has run out of columns to place it in. You can simulate discarding a piece by keeping only the second column, as the help example below shows. While sep is a regex, it only tells separate() where to split the string.
# If you want to split by any non-alphanumeric value (the default):
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
A B
1 <NA> <NA>
2 a b
3 a d
4 b c
# If you just want the second variable:
df %>% separate(x, c(NA, "B"))
B
1 <NA>
2 b
3 d
4 c
extract() gives you back only the “groups” you are interested in and “throws” away the rest; it may be better to use when you have complex patterns you want to extract. 🤷
df <- tibble(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
df %>% tidyr::extract(x,
# make 2 columns
c("A", "B"),
# here's the groups I want - surrounded by ()
"([a-d]+)-([a-d]+)")
# A tibble: 5 x 2
A B
<chr> <chr>
1 <NA> <NA>
2 a b
3 a d
4 b c
5 <NA> <NA>
df %>%
tidyr::extract(col = x,
into = c("A"),
regex = "([a-d]+)")
# A tibble: 5 x 1
A
<chr>
1 <NA>
2 a
3 a
4 b
5 d
df <- tibble(x = c(NA, "a-b-c", "a-d-e", "b-c-d", "d-e-f"))
df %>% tidyr::extract(x,
# make 2 columns
c("A", "C"),
# here's the groups I want - surrounded by ()
# only first, and last
"([a-z]+)-[a-z]-([a-z]+)")
# A tibble: 5 x 2
A C
<chr> <chr>
1 <NA> <NA>
2 a c
3 a e
4 b d
5 d f
df %>% tidyr::separate(x,
# You can also do it using separate
# make 2 columns
c("A", "C"),
# here's the groups I want - surrounded by ()
# only first, and last
"-[a-z]-")
# A tibble: 5 x 2
A C
<chr> <chr>
1 <NA> <NA>
2 a c
3 a e
4 b d
5 d f
A value can be missing in one of two possible ways: explicitly, i.e. flagged with NA, or implicitly, i.e. simply not present in the data.
For example, here we have an explicit NA (2015, quarter 4), while 2016 is implicitly missing quarter 1.
(stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
))
# A tibble: 7 x 3
year qtr return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2015 3 0.35
4 2015 4 NA
5 2016 2 0.92
6 2016 3 0.17
7 2016 4 2.66
How the dataset is represented can make the missing values clearer.
Show the implicit missing values.
stocks %>%
pivot_wider(names_from = year, values_from = return)
# A tibble: 4 x 3
qtr `2015` `2016`
<dbl> <dbl> <dbl>
1 1 1.88 NA
2 2 0.59 0.92
3 3 0.35 0.17
4 4 NA 2.66
Turn explicit missing values into implicit missing values.
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
# A tibble: 6 x 3
qtr year return
<dbl> <chr> <dbl>
1 1 2015 1.88
2 2 2015 0.59
3 2 2016 0.92
4 3 2015 0.35
5 3 2016 0.17
6 4 2016 2.66
Use complete() to make missing values explicit: it looks at a set of columns and finds all combinations, filling the gaps with explicit NAs.
stocks %>%
complete(year, qtr)
# A tibble: 8 x 3
year qtr return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2015 3 0.35
4 2015 4 NA
5 2016 1 NA
6 2016 2 0.92
7 2016 3 0.17
8 2016 4 2.66
Sometimes a missing value is an indication that the previous row’s value should be carried forward (filled down).
(treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
))
# A tibble: 4 x 3
person treatment response
<chr> <dbl> <dbl>
1 Derrick Whitmore 1 7
2 <NA> 2 10
3 <NA> 3 9
4 Katherine Burke 1 4
Using fill() allows you to fill a column’s missing values with the most recent non-missing value.
treatment %>%
fill(person)
# A tibble: 4 x 3
person treatment response
<chr> <dbl> <dbl>
1 Derrick Whitmore 1 7
2 Derrick Whitmore 2 10
3 Derrick Whitmore 3 9
4 Katherine Burke 1 4
Compare and contrast the fill arguments to pivot_wider() and complete().
In pivot_wider() the argument is values_fill, and the spec is either a scalar or a named list. In complete() the argument is fill, and the spec is a named list.
(df <- tibble(
group = c(1:2, NA_real_, 1),
item_id = c(1:2, 1, 2),
item_name = c("a", "NA", "b", "b"),
value1 = 1:4,
value2 = 4:7
))
# A tibble: 4 x 5
group item_id item_name value1 value2
<dbl> <dbl> <chr> <int> <int>
1 1 1 a 1 4
2 2 2 NA 2 5
3 NA 1 b 3 6
4 1 2 b 4 7
df %>%
pivot_wider(names_from = item_name,
values_from = value1,
# the values are filled with 0 if missing
values_fill = 0)
# A tibble: 4 x 6
group item_id value2 a `NA` b
<dbl> <dbl> <int> <int> <int> <int>
1 1 1 4 1 0 0
2 2 2 5 0 2 0
3 NA 1 6 0 0 3
4 1 2 7 0 0 4
df %>%
complete(group, item_id,
# explicitly list cols you want filled
# with alternate to NA
fill = list(group = 0,
item_name = "0"))
# A tibble: 6 x 5
group item_id item_name value1 value2
<dbl> <dbl> <chr> <int> <int>
1 1 1 a 1 4
2 1 2 b 4 7
3 2 1 0 NA NA
4 2 2 NA 2 5
5 0 1 b 3 6
6 0 2 0 NA NA
What does the direction argument to fill() do?
The argument is .direction; it tells fill() in which direction to fill and can be one of four values: .direction = c("down", "up", "downup", "updown").
# Value (n_squirrels) is missing above and below within a group
(squirrels <- tibble::tribble(
~group, ~name, ~role, ~n_squirrels,
1, "Sam", "Observer", NA,
1, "Mara", "Scorekeeper", 8,
1, "Jesse", "Observer", NA,
1, "Tom", "Observer", NA,
2, "Mike", "Observer", NA,
2, "Rachael", "Observer", NA,
2, "Sydekea", "Scorekeeper", 14,
2, "Gabriela", "Observer", NA,
3, "Derrick", "Observer", NA,
3, "Kara", "Scorekeeper", 9,
3, "Emily", "Observer", NA,
3, "Danielle", "Observer", NA
))
# A tibble: 12 x 4
group name role n_squirrels
<dbl> <chr> <chr> <dbl>
1 1 Sam Observer NA
2 1 Mara Scorekeeper 8
3 1 Jesse Observer NA
4 1 Tom Observer NA
5 2 Mike Observer NA
6 2 Rachael Observer NA
7 2 Sydekea Scorekeeper 14
8 2 Gabriela Observer NA
9 3 Derrick Observer NA
10 3 Kara Scorekeeper 9
11 3 Emily Observer NA
12 3 Danielle Observer NA
# The values are inconsistently missing by position within the group
# Use .direction = "downup" to fill missing values in both directions
squirrels %>%
dplyr::group_by(group) %>%
fill(n_squirrels, .direction = "downup") %>%
dplyr::ungroup()
# A tibble: 12 x 4
group name role n_squirrels
<dbl> <chr> <chr> <dbl>
1 1 Sam Observer 8
2 1 Mara Scorekeeper 8
3 1 Jesse Observer 8
4 1 Tom Observer 8
5 2 Mike Observer 14
6 2 Rachael Observer 14
7 2 Sydekea Scorekeeper 14
8 2 Gabriela Observer 14
9 3 Derrick Observer 9
10 3 Kara Scorekeeper 9
11 3 Emily Observer 9
12 3 Danielle Observer 9
# contrast with "updown"
squirrels %>%
dplyr::group_by(group) %>%
# should be the same in this e.g.
fill(n_squirrels, .direction = "updown") %>%
dplyr::ungroup()
# A tibble: 12 x 4
group name role n_squirrels
<dbl> <chr> <chr> <dbl>
1 1 Sam Observer 8
2 1 Mara Scorekeeper 8
3 1 Jesse Observer 8
4 1 Tom Observer 8
5 2 Mike Observer 14
6 2 Rachael Observer 14
7 2 Sydekea Scorekeeper 14
8 2 Gabriela Observer 14
9 3 Derrick Observer 9
10 3 Kara Scorekeeper 9
11 3 Emily Observer 9
12 3 Danielle Observer 9
# contrast with "down"
squirrels %>%
dplyr::group_by(group) %>%
fill(n_squirrels, .direction = "down") %>%
dplyr::ungroup()
# A tibble: 12 x 4
group name role n_squirrels
<dbl> <chr> <chr> <dbl>
1 1 Sam Observer NA
2 1 Mara Scorekeeper 8
3 1 Jesse Observer 8
4 1 Tom Observer 8
5 2 Mike Observer NA
6 2 Rachael Observer NA
7 2 Sydekea Scorekeeper 14
8 2 Gabriela Observer 14
9 3 Derrick Observer NA
10 3 Kara Scorekeeper 9
11 3 Emily Observer 9
12 3 Danielle Observer 9
# contrast with "up"
squirrels %>%
dplyr::group_by(group) %>%
fill(n_squirrels, .direction = "up") %>%
dplyr::ungroup()
# A tibble: 12 x 4
group name role n_squirrels
<dbl> <chr> <chr> <dbl>
1 1 Sam Observer 8
2 1 Mara Scorekeeper 8
3 1 Jesse Observer NA
4 1 Tom Observer NA
5 2 Mike Observer 14
6 2 Rachael Observer 14
7 2 Sydekea Scorekeeper 14
8 2 Gabriela Observer NA
9 3 Derrick Observer 9
10 3 Kara Scorekeeper 9
11 3 Emily Observer NA
12 3 Danielle Observer NA
who
# A tibble: 7,240 x 60
country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
<chr> <chr> <chr> <int> <int> <int> <int> <int>
1 Afghan~ AF AFG 1980 NA NA NA NA
2 Afghan~ AF AFG 1981 NA NA NA NA
3 Afghan~ AF AFG 1982 NA NA NA NA
4 Afghan~ AF AFG 1983 NA NA NA NA
5 Afghan~ AF AFG 1984 NA NA NA NA
6 Afghan~ AF AFG 1985 NA NA NA NA
7 Afghan~ AF AFG 1986 NA NA NA NA
8 Afghan~ AF AFG 1987 NA NA NA NA
9 Afghan~ AF AFG 1988 NA NA NA NA
10 Afghan~ AF AFG 1989 NA NA NA NA
# ... with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
# new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
# new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
# new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
# new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
# new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
# new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
# new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
# new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
# new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>,
# new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
# new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>,
# new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
# new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>,
# newrel_m2534 <int>, newrel_m3544 <int>, newrel_m4554 <int>,
# newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>,
# newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>,
# newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>
(
who1 <- who %>%
# change the wide data to long data
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
)
# A tibble: 76,046 x 6
country iso2 iso3 year key cases
<chr> <chr> <chr> <int> <chr> <int>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
7 Afghanistan AF AFG 1997 new_sp_m65 0
8 Afghanistan AF AFG 1997 new_sp_f014 5
9 Afghanistan AF AFG 1997 new_sp_f1524 38
10 Afghanistan AF AFG 1997 new_sp_f2534 36
# ... with 76,036 more rows
who1 %>%
# how many keys do we have?
count(key, sort = TRUE) %>%
DT::datatable()
To parse the columns, refer to Chapter 9 (Chapter 12 online).
(who2 <- who1 %>%
# Look in key. Do you see any strings that have "newrel"?
# If you do replace these with new_rel for consistent
# key values for easier parsing
mutate(key = stringr::str_replace(key,
"newrel",
"new_rel")))
# A tibble: 76,046 x 6
country iso2 iso3 year key cases
<chr> <chr> <chr> <int> <chr> <int>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
7 Afghanistan AF AFG 1997 new_sp_m65 0
8 Afghanistan AF AFG 1997 new_sp_f014 5
9 Afghanistan AF AFG 1997 new_sp_f1524 38
10 Afghanistan AF AFG 1997 new_sp_f2534 36
# ... with 76,036 more rows
(
who3 <- who2 %>%
separate(key,
c("new", "type", "sexage"),
sep = "_")
)
# A tibble: 76,046 x 8
country iso2 iso3 year new type sexage cases
<chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
1 Afghanistan AF AFG 1997 new sp m014 0
2 Afghanistan AF AFG 1997 new sp m1524 10
3 Afghanistan AF AFG 1997 new sp m2534 6
4 Afghanistan AF AFG 1997 new sp m3544 3
5 Afghanistan AF AFG 1997 new sp m4554 5
6 Afghanistan AF AFG 1997 new sp m5564 2
7 Afghanistan AF AFG 1997 new sp m65 0
8 Afghanistan AF AFG 1997 new sp f014 5
9 Afghanistan AF AFG 1997 new sp f1524 38
10 Afghanistan AF AFG 1997 new sp f2534 36
# ... with 76,036 more rows
Ok, from the chapter text it seems we only have new cases, so let us verify that.
who3 %>%
count(new)
# A tibble: 1 x 2
new n
<chr> <int>
1 new 76046
This is indeed the case! The count equals the number of rows in the dataset, so we can drop this column.
(
who4 <- who3 %>%
select(-new, -iso2, iso3)
)
# A tibble: 76,046 x 6
country iso3 year type sexage cases
<chr> <chr> <int> <chr> <chr> <int>
1 Afghanistan AFG 1997 sp m014 0
2 Afghanistan AFG 1997 sp m1524 10
3 Afghanistan AFG 1997 sp m2534 6
4 Afghanistan AFG 1997 sp m3544 3
5 Afghanistan AFG 1997 sp m4554 5
6 Afghanistan AFG 1997 sp m5564 2
7 Afghanistan AFG 1997 sp m65 0
8 Afghanistan AFG 1997 sp f014 5
9 Afghanistan AFG 1997 sp f1524 38
10 Afghanistan AFG 1997 sp f2534 36
# ... with 76,036 more rows
Let’s separate the sexage
column.
(
who5 <- who4 %>%
separate(sexage, c("sex", "age"),
sep = 1)
)
# A tibble: 76,046 x 7
country iso3 year type sex age cases
<chr> <chr> <int> <chr> <chr> <chr> <int>
1 Afghanistan AFG 1997 sp m 014 0
2 Afghanistan AFG 1997 sp m 1524 10
3 Afghanistan AFG 1997 sp m 2534 6
4 Afghanistan AFG 1997 sp m 3544 3
5 Afghanistan AFG 1997 sp m 4554 5
6 Afghanistan AFG 1997 sp m 5564 2
7 Afghanistan AFG 1997 sp m 65 0
8 Afghanistan AFG 1997 sp f 014 5
9 Afghanistan AFG 1997 sp f 1524 38
10 Afghanistan AFG 1997 sp f 2534 36
# ... with 76,036 more rows
Pipeline version
who %>%
pivot_longer(cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "type", "sexage"),
sep = "_") %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"),
# first pos = sex, rest = age group
sep = 1)
# A tibble: 76,046 x 6
country year type sex age cases
<chr> <int> <chr> <chr> <chr> <int>
1 Afghanistan 1997 sp m 014 0
2 Afghanistan 1997 sp m 1524 10
3 Afghanistan 1997 sp m 2534 6
4 Afghanistan 1997 sp m 3544 3
5 Afghanistan 1997 sp m 4554 5
6 Afghanistan 1997 sp m 5564 2
7 Afghanistan 1997 sp m 65 0
8 Afghanistan 1997 sp f 014 5
9 Afghanistan 1997 sp f 1524 38
10 Afghanistan 1997 sp f 2534 36
# ... with 76,036 more rows
In this case study I set values_drop_na = TRUE
just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA
and zero?
If we consider the missing values in this data set, it does seem reasonable.
who %>%
select(new_sp_m014:newrel_f65) %>%
skimr::skim()
Name | Piped data |
Number of rows | 7240 |
Number of columns | 56 |
_______________________ | |
Column type frequency: | |
numeric | 56 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
new_sp_m014 | 4067 | 0.44 | 83.71 | 316.14 | 0 | 0.00 | 5.0 | 37.00 | 5001 | ▇▁▁▁▁ |
new_sp_m1524 | 4031 | 0.44 | 1015.66 | 4885.38 | 0 | 9.00 | 90.0 | 502.00 | 78278 | ▇▁▁▁▁ |
new_sp_m2534 | 4034 | 0.44 | 1403.80 | 5718.39 | 0 | 14.00 | 150.0 | 715.50 | 84003 | ▇▁▁▁▁ |
new_sp_m3544 | 4021 | 0.44 | 1315.88 | 6003.26 | 0 | 13.00 | 130.0 | 583.50 | 90830 | ▇▁▁▁▁ |
new_sp_m4554 | 4017 | 0.45 | 1103.86 | 5441.06 | 0 | 12.00 | 102.0 | 440.00 | 82921 | ▇▁▁▁▁ |
new_sp_m5564 | 4022 | 0.44 | 800.70 | 4418.31 | 0 | 8.00 | 63.0 | 279.00 | 63814 | ▇▁▁▁▁ |
new_sp_m65 | 4031 | 0.44 | 682.82 | 4089.14 | 0 | 8.00 | 53.0 | 232.00 | 70376 | ▇▁▁▁▁ |
new_sp_f014 | 4066 | 0.44 | 114.33 | 504.63 | 0 | 1.00 | 7.0 | 50.75 | 8576 | ▇▁▁▁▁ |
new_sp_f1524 | 4046 | 0.44 | 826.11 | 3552.02 | 0 | 7.00 | 66.0 | 421.00 | 53975 | ▇▁▁▁▁ |
new_sp_f2534 | 4040 | 0.44 | 917.30 | 3580.15 | 0 | 9.00 | 84.0 | 476.25 | 49887 | ▇▁▁▁▁ |
new_sp_f3544 | 4041 | 0.44 | 640.43 | 2542.51 | 0 | 6.00 | 57.0 | 308.00 | 34698 | ▇▁▁▁▁ |
new_sp_f4554 | 4036 | 0.44 | 445.78 | 1799.23 | 0 | 4.00 | 38.0 | 211.00 | 23977 | ▇▁▁▁▁ |
new_sp_f5564 | 4045 | 0.44 | 313.87 | 1381.25 | 0 | 3.00 | 25.0 | 146.50 | 18203 | ▇▁▁▁▁ |
new_sp_f65 | 4043 | 0.44 | 283.93 | 1267.94 | 0 | 4.00 | 30.0 | 129.00 | 21339 | ▇▁▁▁▁ |
new_sn_m014 | 6195 | 0.14 | 308.75 | 1727.25 | 0 | 1.00 | 9.0 | 61.00 | 22355 | ▇▁▁▁▁ |
new_sn_m1524 | 6210 | 0.14 | 513.02 | 3643.27 | 0 | 2.00 | 15.5 | 102.00 | 60246 | ▇▁▁▁▁ |
new_sn_m2534 | 6218 | 0.14 | 653.69 | 3430.03 | 0 | 2.00 | 23.0 | 135.50 | 50282 | ▇▁▁▁▁ |
new_sn_m3544 | 6215 | 0.14 | 837.87 | 8524.53 | 0 | 2.00 | 19.0 | 132.00 | 250051 | ▇▁▁▁▁ |
new_sn_m4554 | 6213 | 0.14 | 520.79 | 3301.70 | 0 | 2.00 | 19.0 | 127.50 | 57181 | ▇▁▁▁▁ |
new_sn_m5564 | 6219 | 0.14 | 448.62 | 3488.68 | 0 | 2.00 | 16.0 | 101.00 | 64972 | ▇▁▁▁▁ |
new_sn_m65 | 6220 | 0.14 | 460.36 | 3991.90 | 0 | 2.00 | 20.5 | 111.75 | 74282 | ▇▁▁▁▁ |
new_sn_f014 | 6200 | 0.14 | 291.95 | 1647.30 | 0 | 1.00 | 8.0 | 58.00 | 21406 | ▇▁▁▁▁ |
new_sn_f1524 | 6218 | 0.14 | 407.90 | 2379.13 | 0 | 1.00 | 12.0 | 89.00 | 35518 | ▇▁▁▁▁ |
new_sn_f2534 | 6224 | 0.14 | 466.26 | 2272.86 | 0 | 2.00 | 18.0 | 103.25 | 28753 | ▇▁▁▁▁ |
new_sn_f3544 | 6220 | 0.14 | 506.59 | 5013.53 | 0 | 1.00 | 11.0 | 82.25 | 148811 | ▇▁▁▁▁ |
new_sn_f4554 | 6222 | 0.14 | 271.16 | 1511.72 | 0 | 1.00 | 10.0 | 76.75 | 23869 | ▇▁▁▁▁ |
new_sn_f5564 | 6223 | 0.14 | 213.39 | 1468.62 | 0 | 1.00 | 8.0 | 56.00 | 26085 | ▇▁▁▁▁ |
new_sn_f65 | 6221 | 0.14 | 230.75 | 1597.70 | 0 | 1.00 | 13.0 | 74.00 | 29630 | ▇▁▁▁▁ |
new_ep_m014 | 6202 | 0.14 | 128.61 | 460.14 | 0 | 0.00 | 6.0 | 55.00 | 7869 | ▇▁▁▁▁ |
new_ep_m1524 | 6214 | 0.14 | 158.30 | 537.74 | 0 | 1.00 | 11.0 | 88.00 | 8558 | ▇▁▁▁▁ |
new_ep_m2534 | 6220 | 0.14 | 201.23 | 764.05 | 0 | 1.00 | 13.0 | 124.00 | 11843 | ▇▁▁▁▁ |
new_ep_m3544 | 6216 | 0.14 | 272.72 | 3381.41 | 0 | 1.00 | 10.5 | 91.25 | 105825 | ▇▁▁▁▁ |
new_ep_m4554 | 6220 | 0.14 | 108.11 | 380.61 | 0 | 1.00 | 8.5 | 63.25 | 5875 | ▇▁▁▁▁ |
new_ep_m5564 | 6225 | 0.14 | 72.17 | 234.55 | 0 | 1.00 | 7.0 | 46.00 | 3957 | ▇▁▁▁▁ |
new_ep_m65 | 6222 | 0.14 | 78.94 | 227.34 | 0 | 1.00 | 10.0 | 55.00 | 3061 | ▇▁▁▁▁ |
new_ep_f014 | 6208 | 0.14 | 112.89 | 446.55 | 0 | 0.00 | 5.0 | 50.00 | 6960 | ▇▁▁▁▁ |
new_ep_f1524 | 6219 | 0.14 | 149.17 | 543.89 | 0 | 1.00 | 9.0 | 78.00 | 7866 | ▇▁▁▁▁ |
new_ep_f2534 | 6219 | 0.14 | 189.52 | 761.79 | 0 | 1.00 | 12.0 | 95.00 | 10759 | ▇▁▁▁▁ |
new_ep_f3544 | 6219 | 0.14 | 241.70 | 3218.50 | 0 | 1.00 | 9.0 | 77.00 | 101015 | ▇▁▁▁▁ |
new_ep_f4554 | 6223 | 0.14 | 93.77 | 339.33 | 0 | 1.00 | 8.0 | 56.00 | 6759 | ▇▁▁▁▁ |
new_ep_f5564 | 6223 | 0.14 | 63.04 | 212.95 | 0 | 1.00 | 6.0 | 42.00 | 4684 | ▇▁▁▁▁ |
new_ep_f65 | 6226 | 0.14 | 72.31 | 202.72 | 0 | 0.00 | 10.0 | 51.00 | 2548 | ▇▁▁▁▁ |
newrel_m014 | 7050 | 0.03 | 538.18 | 2082.18 | 0 | 5.00 | 32.5 | 210.00 | 18617 | ▇▁▁▁▁ |
newrel_m1524 | 7058 | 0.03 | 1489.51 | 6848.18 | 0 | 17.50 | 171.0 | 684.25 | 84785 | ▇▁▁▁▁ |
newrel_m2534 | 7057 | 0.03 | 2139.72 | 7539.87 | 0 | 25.00 | 217.0 | 1091.00 | 76917 | ▇▁▁▁▁ |
newrel_m3544 | 7056 | 0.03 | 2036.40 | 7847.94 | 0 | 24.75 | 208.0 | 851.25 | 84565 | ▇▁▁▁▁ |
newrel_m4554 | 7056 | 0.03 | 1835.07 | 8324.28 | 0 | 19.00 | 175.0 | 688.50 | 100297 | ▇▁▁▁▁ |
newrel_m5564 | 7055 | 0.03 | 1525.30 | 8760.27 | 0 | 13.00 | 136.0 | 536.00 | 112558 | ▇▁▁▁▁ |
newrel_m65 | 7058 | 0.03 | 1426.00 | 9431.99 | 0 | 17.00 | 117.0 | 453.50 | 124476 | ▇▁▁▁▁ |
newrel_f014 | 7050 | 0.03 | 532.84 | 2117.78 | 0 | 5.00 | 32.5 | 226.00 | 18054 | ▇▁▁▁▁ |
newrel_f1524 | 7056 | 0.03 | 1161.85 | 4606.76 | 0 | 10.75 | 123.0 | 587.75 | 49491 | ▇▁▁▁▁ |
newrel_f2534 | 7058 | 0.03 | 1472.80 | 5259.59 | 0 | 18.00 | 161.0 | 762.50 | 44985 | ▇▁▁▁▁ |
newrel_f3544 | 7057 | 0.03 | 1125.01 | 4210.58 | 0 | 12.50 | 125.0 | 544.50 | 38804 | ▇▁▁▁▁ |
newrel_f4554 | 7057 | 0.03 | 877.27 | 3556.18 | 0 | 10.00 | 92.0 | 400.50 | 37138 | ▇▁▁▁▁ |
newrel_f5564 | 7057 | 0.03 | 686.41 | 3379.33 | 0 | 8.00 | 69.0 | 269.00 | 40892 | ▇▁▁▁▁ |
newrel_f65 | 7055 | 0.03 | 683.76 | 3618.47 | 0 | 9.00 | 69.0 | 339.00 | 47438 | ▇▁▁▁▁ |
who %>%
select(new_sp_m014) %>%
filter(new_sp_m014 == 0 |
is.na(new_sp_m014)) %>%
skimr::skim()
Name | Piped data |
Number of rows | 4929 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
new_sp_m014 | 4067 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ▁▁▇▁▁ |
who %>%
select(new_sp_m014) %>%
filter(new_sp_m014 == 0) %>%
skimr::skim()
Name | Piped data |
Number of rows | 862 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
new_sp_m014 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ▁▁▇▁▁ |
who %>%
select(new_sp_m014) %>%
filter(is.na(new_sp_m014)) %>%
skimr::skim()
Name | Piped data |
Number of rows | 4067 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
new_sp_m014 | 4067 | 0 | NaN | NA | NA | NA | NA | NA | NA |
There may be implicit missing values if a year is missing for a country.
A 0 means a measurement was reported for that variable and there were no cases. An NA means no information was reported for that variable, so cases may or may not exist.
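A quick sketch to look for implicitly missing country-year combinations: complete() adds a row (with n = NA) for any combination that never appears in who.
who %>%
  count(country, year) %>%
  # add every country-year combination; n is NA where none was observed
  complete(country, year) %>%
  filter(is.na(n))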
What happens if you neglect the mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))
We will end up with misaligned data when we separate the key later in the pipeline. Take newrel_m3544 as an example: without the rename it splits into only two pieces, so new becomes "newrel", type becomes "m3544", and sexage is filled with NA (with a warning) for every newrel row, as the sketch below shows.
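A sketch of the damage (using who1 from above): skip the rename and the newrel keys split into only two pieces, so sexage ends up NA and type holds the sex/age code instead.
who1 %>%
  # no mutate() fixing "newrel" -> "new_rel"
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  filter(is.na(sexage)) %>%
  count(new, type)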
I claimed that iso2
and iso3
were redundant with country
. Confirm this claim.
who %>%
count(country, sort = TRUE)
# A tibble: 219 x 2
country n
<chr> <int>
1 Afghanistan 34
2 Albania 34
3 Algeria 34
4 American Samoa 34
5 Andorra 34
6 Angola 34
7 Anguilla 34
8 Antigua and Barbuda 34
9 Argentina 34
10 Armenia 34
# ... with 209 more rows
who %>%
count(iso2, sort = TRUE)
# A tibble: 219 x 2
iso2 n
<chr> <int>
1 AD 34
2 AE 34
3 AF 34
4 AG 34
5 AI 34
6 AL 34
7 AM 34
8 AO 34
9 AR 34
10 AS 34
# ... with 209 more rows
who %>%
count(iso3, sort = TRUE)
# A tibble: 219 x 2
iso3 n
<chr> <int>
1 ABW 34
2 AFG 34
3 AGO 34
4 AIA 34
5 ALB 34
6 AND 34
7 ARE 34
8 ARG 34
9 ARM 34
10 ASM 34
# ... with 209 more rows
who %>%
select(country, iso2, iso3) %>%
skimr::skim()
Name | Piped data |
Number of rows | 7240 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 4 | 52 | 0 | 219 | 0 |
iso2 | 0 | 1 | 2 | 2 | 0 | 219 | 0 |
iso3 | 0 | 1 | 3 | 3 | 0 | 219 | 0 |
In the skim for character variables, min is the minimum number of characters in an observation and max is the maximum. Each of country, iso2, and iso3 has 219 unique values and none missing, which is consistent with iso2 and iso3 being redundant with country.
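A more direct sketch of the same check: if every distinct country/iso2/iso3 row lines up one-to-one, the four counts below should all be equal.
who %>%
  distinct(country, iso2, iso3) %>%
  summarise(rows = n(),
            countries = n_distinct(country),
            iso2_codes = n_distinct(iso2),
            iso3_codes = n_distinct(iso3))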
For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.
who5 %>%
group_by(country, year, sex) %>%
summarise(total_cases = sum(cases)) %>%
ungroup() %>%
arrange(country, year, sex) %>%
filter(country %in% c("South Africa", "Botswana",
"Zimbabwe", "Namibia", "Swaziland",
"Lesotho", "Mozambique")) %>%
ggplot(aes(x = year, y = total_cases, colour = sex)) +
geom_point(alpha = 0.5) +
facet_wrap(~ country, scales = "free_y") +
labs(title = "Total Cases Across South Africa
and neighbouring countries over time",
subtitle = "The number of cases is growing quite a bit over time",
x = "",
y = "Cases Per Year") +
scale_colour_tq()
who5 %>%
group_by(country, year, sex) %>%
summarise(total_cases = sum(cases)) %>%
ungroup() %>%
arrange(country, year, sex) %>%
group_by(year, sex) %>%
summarise(cases_per_year = sum(total_cases)) %>%
ungroup() %>%
ggplot(aes(x = year, y = cases_per_year, fill = sex)) +
geom_col(alpha = 0.5) +
scale_fill_tq() +
scale_y_log10() +
labs(title = "Total Cases Across all countries over time",
subtitle = "The number of cases is growing quite a bit over time",
x = "",
y = "Cases Per Year")
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_South Africa.1252 LC_CTYPE=English_South Africa.1252
[3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C
[5] LC_TIME=English_South Africa.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tidyquant_1.0.0 quantmod_0.4.17
[3] TTR_0.23-6 PerformanceAnalytics_2.0.4
[5] xts_0.12-0 zoo_1.8-7
[7] magrittr_1.5 lubridate_1.7.9
[9] emo_0.0.0.9000 flair_0.0.2
[11] forcats_0.5.0 stringr_1.4.0
[13] dplyr_1.0.2 purrr_0.3.4
[15] readr_1.4.0 tidyr_1.1.2
[17] tibble_3.0.3 ggplot2_3.3.2
[19] tidyverse_1.3.0 workflowr_1.6.2
loaded via a namespace (and not attached):
[1] fs_1.5.0 httr_1.4.2 rprojroot_1.3-2 repr_1.1.0
[5] tools_3.6.3 backports_1.1.6 utf8_1.1.4 R6_2.4.1
[9] DT_0.16 DBI_1.1.0 colorspace_1.4-1 withr_2.2.0
[13] tidyselect_1.1.0 curl_4.3 compiler_3.6.3 git2r_0.26.1
[17] cli_2.1.0 rvest_0.3.6 gt_0.2.2 xml2_1.3.2
[21] labeling_0.3 sass_0.2.0 scales_1.1.0 checkmate_2.0.0
[25] quadprog_1.5-8 digest_0.6.27 rmarkdown_2.4 base64enc_0.1-3
[29] pkgconfig_2.0.3 htmltools_0.5.0 dbplyr_2.0.0 highr_0.8
[33] htmlwidgets_1.5.1 rlang_0.4.8 readxl_1.3.1 rstudioapi_0.11
[37] farver_2.0.3 generics_0.0.2 jsonlite_1.7.1 crosstalk_1.1.0.1
[41] Rcpp_1.0.4.6 Quandl_2.10.0 munsell_0.5.0 fansi_0.4.1
[45] lifecycle_0.2.0 stringi_1.5.3 whisker_0.4 yaml_2.2.1
[49] snakecase_0.11.0 grid_3.6.3 promises_1.1.0 crayon_1.3.4
[53] lattice_0.20-38 haven_2.3.1 hms_0.5.3 knitr_1.28
[57] ps_1.3.2 pillar_1.4.6 reprex_0.3.0 glue_1.4.2
[61] evaluate_0.14 modelr_0.1.8 vctrs_0.3.2 httpuv_1.5.2
[65] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 xfun_0.13
[69] skimr_2.1.1 janitor_2.0.1 broom_0.7.2 later_1.0.0
[73] ellipsis_0.3.1