Last updated: 2020-11-21

Checks: 7 0

Knit directory: r4ds_book/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200814) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 6e7b3db. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/images/
    Untracked:  code_snipp.txt
    Untracked:  data/at_health_facilities.csv
    Untracked:  data/infant_hiv.csv
    Untracked:  data/measurements.csv
    Untracked:  data/person.csv
    Untracked:  data/ranking.csv
    Untracked:  data/visited.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch15_functions.Rmd) and HTML (docs/ch15_functions.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 7ed0458 sciencificity 2020-11-10 Build site.
html 86457fa sciencificity 2020-11-10 Build site.
html 4879249 sciencificity 2020-11-09 Build site.
html e423967 sciencificity 2020-11-08 Build site.
html 0d223fb sciencificity 2020-11-08 Build site.
Rmd b67f0a7 sciencificity 2020-11-08 added ch13
html ecd1d8e sciencificity 2020-11-07 Build site.
Rmd 9440e66 sciencificity 2020-11-07 finished ch15

Functions

df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

Note the mistake in the copy and paste - we’re using df$a for the calculation for df$b!

Let’s isolate the main functionality that we are repeating.

(df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
#>  [1] 0.5358203 0.0000000 0.5941388 0.5494290 1.0000000 0.3947922 0.5101009
#>  [8] 0.6675033 0.6034676 0.3683384

That’s all good and well, but to write a function you should generalise the functionality as a test first.

x <- df$a

(x - min(x, na.rm = TRUE)) /
  (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
#>  [1] 0.5358203 0.0000000 0.5941388 0.5494290 1.0000000 0.3947922 0.5101009
#>  [8] 0.6675033 0.6034676 0.3683384

Even better, we’re using the range when we use max and min. So let’s re-write using that.

rng <- range(x, na.rm=TRUE)
(x - rng[1]) /
  (rng[2] - rng[1])
#>  [1] 0.5358203 0.0000000 0.5941388 0.5494290 1.0000000 0.3947922 0.5101009
#>  [8] 0.6675033 0.6034676 0.3683384

Let’s pull it all together in a function.

rescale01 <- function(x) {
  # rescales a vector to lie between 0 and 1
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) /(rng[2] - rng[1])
}

rescale01(c(0,5,10))
#> [1] 0.0 0.5 1.0

Key steps:

  1. Pick a name for the function.

  2. List the inputs, or arguments, to the function inside function. Example: function(x, y, z) .

  3. Place your code in body of the function, a { block that immediately follows function(...).

  4. Make it work with a small input.

  5. Check the function with a range of inputs.

rescale01(c(-10,0,10))
#> [1] 0.0 0.5 1.0

rescale01(c(1,2,3,NA,5))
#> [1] 0.00 0.25 0.50   NA 1.00

Simplifying the original calcs:

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

Now changes can be made in a single place.

x <- c(1:10, Inf)
rescale01(x)
#>  [1]   0   0   0   0   0   0   0   0   0   0 NaN
rescale01 <- function(x){
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) /(rng[2] - rng[1])
}

rescale01(x)
#>  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
#>  [8] 0.7777778 0.8888889 1.0000000       Inf

Intro - Exercises

  1. Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?

    rescale01_miss <- function(x){
      rng <- range(x, na.rm = FALSE)
      (x - rng[1]) / (rng[2] - rng[1])
    }
    rescale01_miss(c(1:5, NA, 10))
    #> [1] NA NA NA NA NA NA NA
    
    rescale01_miss <- function(x){
      rng <- range(x, na.rm = FALSE, finite = TRUE)
      (x - rng[1]) / (rng[2] - rng[1])
    }
    rescale01_miss(c(1:5, NA, 10))
    #> [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444        NA 1.0000000

    If any value is NA the range function returns NA as min and max of range. If finite = TRUE is set only the NA value returns NA.

    According to the help page: “If finite is TRUE, the minimum and maximum of all finite values is computed, i.e., finite = TRUE includes na.rm = TRUE.”

  2. In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.

    rescale01_inf <- function(x){
      rng <- range(x, na.rm = TRUE, finite = TRUE)
      x[x == -Inf] <- 0
      x[x == Inf] <- 1
      (x - rng[1]) / (rng[2] - rng[1])
    }
    
    rescale01_inf(c(1, Inf, 0, -Inf))
    #> [1] 1 1 0 0
  3. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?

    mean(is.na(x))
    
    x / sum(x, na.rm = TRUE)
    
    sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

    mean(is.na(x)) finds what percentage of the data is NA.

    x <- c(NA, 1:10, NA, 5)
    mean(is.na(x))
    #> [1] 0.1538462
    
    mean_na <- function(x){
      sum(x)/length(x)
    }
    mean_na(is.na(x))
    #> [1] 0.1538462

    x / sum(x, na.rm = TRUE) converts each x to a percentage it contributes towards 100%, not considering the NAs.

    contrib_to_one <- function(x) {
      x / sum(x, na.rm = TRUE)
    }
    
    contrib_to_one(x)
    #>  [1]         NA 0.01666667 0.03333333 0.05000000 0.06666667 0.08333333
    #>  [7] 0.10000000 0.11666667 0.13333333 0.15000000 0.16666667         NA
    #> [13] 0.08333333

    sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) finds how the deviation as a proportion of the mean. sd means standard deviation and is a measure of the variability in your data. The mean is the average of your data.

    deviation <- function(x, na.rm = TRUE) {
      sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
    }
    
    deviation(x)
    #> [1] 0.5273097
  4. Write your own functions to compute the variance and skewness of a numeric vector. Variance is defined as \[ \mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,} \] where \(\bar{x} = (\sum_i^n x_i) / n\) is the sample mean. Skewness is defined as \[ \mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.} \]

    
    variance <- function(x){
      # find non na length of vector x
      n <- sum(!is.na(x))
      avg <- mean(x, na.rm = TRUE)
      (1/(n-1))*(sum((x - avg)^2, na.rm=TRUE))
    }
    
    x <- c(1:10)
    variance(x)
    #> [1] 9.166667
    var(x, na.rm = TRUE)
    #> [1] 9.166667
    
    y <- c(1:5, 1:5)
    variance(y)
    #> [1] 2.222222
    var(y)
    #> [1] 2.222222
    
    z <- c(1:10, NA)
    variance(z)
    #> [1] 9.166667
    var(z, na.rm = TRUE)
    #> [1] 9.166667
    skew <- function(x){
        n <- sum(!is.na(x))
        avg <- mean(x, na.rm = TRUE) 
        summation <- sum((x - avg)^3, na.rm = TRUE)
        variance_val <- (variance(x))^(3/2)
        ((1/(n-2))*summation)/variance_val
    }
    
    a <- c(1:10, 1000)
    skew(a)
    #> [1] 3.014606
  5. Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

    both_na <- function(x, y){
      # find where x is NA - is.na(x) - returns a bunch of TRUE, FALSE
      # find where y is NA - is.na(y) - returns a bunch of TRUE, FALSE
      # where are they both NA - &
      # sum where they are both NA 
      # e.g. (T, F, T) & (T, F, F) == 1; only T in same place once
      sum(is.na(x) & is.na(y))
    }
    
    x <- c(1:10, NA, 12, NA)
    y <- c(1:9, 10, NA, 12, 13)
    both_na(x,y)
    #> [1] 1
    both_na(c(NA, 2, 4, 6, NA, 10, 12),
            c(NA, 1, 3, 5, NA, 9, 11))
    #> [1] 2
  6. What do the following functions do? Why are they useful even though they are so short?

    • is_directory() tells you whether a given path is a directory.

    • is_readable() tells you whether you have permission to read a file.

      is_directory <- function(x) file.info(x)$isdir
      is_directory("C:/Personal")
      #> [1] TRUE
      is_directory("C:/Work/Learning/how-containers-work.pdf")
      #> [1] FALSE
      is_readable <- function(x) file.access(x, 4) == 0
      is_readable("C:/Personal")
      #> C:/Personal 
      #>        TRUE
      is_readable("C:/Work/Learning/how-containers-work.pdf")
      #> C:/Work/Learning/how-containers-work.pdf 
      #>                                     TRUE
  7. Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.

    Repeat (x3)
    Little Bunny Foo Foo,
    Hopping through the forest,
    Scooping up the field mice,
    And bopping them on the head.
    
    (Spoken)
    Down came the Good Fairy, and she said,
    
    "Little Bunny Foo Foo,
    I don't want to see you,
    Scooping up the field mice
    And bopping them on the head."
    
    (Spoken)
    "I'll give you three (two / one) chance(s),
    And if you don't behave,
    I'm gonna turn you into a goon!"
    END Repeat
    
    The next day... or That evening... or Later that night...
    "I gave you three chances,
    And you didn't behave,
    And now I'm gonna turn you into a goon. POOF!"
    
    "And the moral of the story is: Hare today, goon tomorrow."
    
    little_bunny <- function(bunny = ""){
      str_glue("{bunny}Little Bunny Foo-Foo,\n\n")
    }
    
    hop <- function(bunny, where = "forest"){
      str_glue("{bunny}Hopping through the {where},\n\n")
    }
    
    scoop <- function(bunny, what = "field mice"){
      str_glue("{bunny}Scooping up the {what},\n\n")
    }
    
    bop <- function(bunny, where = "head"){
      str_glue("{bunny}And bopping them on the {where}.\n\n\n")
    }
    
    chances <- function(bunny, num = "three"){
      if(num == "one") {
        str_glue("{bunny}I'll give you {num} chance,\n\n")
      }
      else {
        str_glue("{bunny}I'll give you {num} chances,\n\n")
      }
    }
    
    static_line1 <- function(bunny){
      str_glue("{bunny}I don't want to see you,\n\n")
    }
    
    static_line2 <- function(bunny){
      str_glue("{bunny}And if you don't behave,\nI'm gonna turn you into a goon!\n\n\n")
    }
    
    static_line3 <- function(bunny){
      str_glue("{bunny}Down came the Good Fairy, and she said,\n\n")
    }
    
    
    foo_foo <- little_bunny()
    foo_foo %>% 
      hop(where = "forest") %>% 
      scoop(what = "field mice") %>% 
      bop(where = "head") %>% 
      static_line3() %>% 
      little_bunny() %>% 
      static_line1() %>% 
      scoop(what = "field mice") %>% 
      bop(where = "head") %>%
      chances(num = "three") %>% 
      static_line2() %>%
      little_bunny() %>% 
      hop(where = "forest") %>% 
      scoop(what = "field mice") %>% 
      bop(where = "head") %>% 
      static_line3()%>% 
      little_bunny() %>% 
      static_line1()%>% 
      scoop(what = "field mice") %>% 
      bop(where = "head") %>%
      chances(num = "two") %>% 
      static_line2() %>% 
      little_bunny() %>% 
      hop(where = "forest") %>% 
      scoop(what = "field mice") %>% 
      bop(where = "head") %>% 
      static_line3()%>% 
      little_bunny() %>% 
      static_line1() %>% 
      scoop(what = "field mice") %>% 
      bop(where = "head") %>%
      chances(num = "one") %>% 
      static_line2() %>%
      little_bunny() %>% 
      hop(where = "forest") %>% 
      scoop(what = "field mice") %>% 
      bop(where = "head") %>% 
      static_line3() %>%   
      str_glue("I gave you three chances,\nAnd you didn't behave,\nAnd now I'm gonna turn you into a goon. POOF!")
    #> Little Bunny Foo-Foo,
    #> Hopping through the forest,
    #> Scooping up the field mice,
    #> And bopping them on the head.
    #> 
    #> Down came the Good Fairy, and she said,
    #> Little Bunny Foo-Foo,
    #> I don't want to see you,
    #> Scooping up the field mice,
    #> And bopping them on the head.
    #> 
    #> I'll give you three chances,
    #> And if you don't behave,
    #> I'm gonna turn you into a goon!
    #> 
    #> Little Bunny Foo-Foo,
    #> Hopping through the forest,
    #> Scooping up the field mice,
    #> And bopping them on the head.
    #> 
    #> Down came the Good Fairy, and she said,
    #> Little Bunny Foo-Foo,
    #> I don't want to see you,
    #> Scooping up the field mice,
    #> And bopping them on the head.
    #> 
    #> I'll give you two chances,
    #> And if you don't behave,
    #> I'm gonna turn you into a goon!
    #> 
    #> Little Bunny Foo-Foo,
    #> Hopping through the forest,
    #> Scooping up the field mice,
    #> And bopping them on the head.
    #> 
    #> Down came the Good Fairy, and she said,
    #> Little Bunny Foo-Foo,
    #> I don't want to see you,
    #> Scooping up the field mice,
    #> And bopping them on the head.
    #> 
    #> I'll give you one chance,
    #> And if you don't behave,
    #> I'm gonna turn you into a goon!
    #> 
    #> Little Bunny Foo-Foo,
    #> Hopping through the forest,
    #> Scooping up the field mice,
    #> And bopping them on the head.
    #> 
    #> Down came the Good Fairy, and she said,
    #> I gave you three chances,
    #> And you didn't behave,
    #> And now I'm gonna turn you into a goon. POOF!

Human Readable Functions - Exercises

  1. Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.

    f1 <- function(string, prefix) {
      substr(string, 1, nchar(prefix)) == prefix
    }
    f2 <- function(x) {
      if (length(x) <= 1) return(NULL)
      x[-length(x)]
    }
    f3 <- function(x, y) {
      rep(y, length.out = length(x))
    }

    The first function f1 checks if the substring of string matches the prefix given.

    match_prefix <- function(string, prefix) {
      substr(string, 1, nchar(prefix)) == prefix
    }
    match_prefix("===The quick brown fox", "===")
    #> [1] TRUE
    match_prefix("```The quick brown fox", "===")
    #> [1] FALSE
    match_prefix(c("the man in the bowler hat",
                   "that quick brown fox",
                   "the one and only",
                   "that crafy ol' fox"),
                 "the")
    #> [1]  TRUE FALSE  TRUE FALSE

    The function removes the last item of vector x.

    remove_last <- function(x) {
      if (length(x) <= 1) return(NULL)
      x[-length(x)]
    }
    remove_last(c(1:5, 2, 30, 45, NA))
    #> [1]  1  2  3  4  5  2 30 45
    remove_last(c(NA, 2, 30, 45, 1:5))
    #> [1] NA  2 30 45  1  2  3  4

    The function repeats the y vector as many times as the length of vector x.

    rep_len <- function(x, y) {
      rep(y, length.out = length(x))
    }
    rep_len(c(1:10), 1)
    #>  [1] 1 1 1 1 1 1 1 1 1 1
    rep_len(c("a", "b", "abc", "abcd", "efghi"),
            c("a", "b", "abc"))
    #> [1] "a"   "b"   "abc" "a"   "b"
  2. Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.

  3. Compare and contrast rnorm() and MASS::mvrnorm(). How could you make them more consistent?

    rnorm() MASS::mvrnorm()
    univariate normal distribution multivariate normal distribution
    n, mean, sd n, mu, Sigma

    You could make mean / mu and sd / Sigma consistent by choosing one of these names. I would also make all names lower case so I would change Sigma to sigma.

  4. Make a case for why norm_r(), norm_d() etc would be better than rnorm(), dnorm(). Make a case for the opposite.

    • For norm_r(), norm_d() the fact that you can type norm and the autocomplete will give you a list of all norm_ functions is a win.

    • For rnorm(), dnorm() I would say it is more natural to say random normal distribution / density normal distribution than “normal random distribution” / “normal density distribution”.

Conditional execution

An if statement allows conditional execution of code.

if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}
  • To get help on if you need to surround it in backticks: ?`if`.

  • The condition must evaluate to either TRUE or FALSE. You will get a warning / error if it is a vector OR if it’s an NA.

  • Use || (or) and && (and) to combine multiple logical expressions. These operators are “short-circuiting”

  • NEVER use | or & in an if statement, since these are vectorised operations.

  • == is also vectorised, which means you may get more than one output!

    • Check length is 1 already.
    • Collapse with any() or all().
    • Use the non-vectorised identical(), BUT this is really strict and can result in unexpected output given how computers store numbers etc. Use dplyr::near() for comparisons.
    • Use is.na() for NA checks.
if (c(TRUE, FALSE)) {}
#> NULL
if (NA) {}
#> Error in if (NA) {: missing value where TRUE/FALSE needed
identical(0L, 0)
#> [1] FALSE
x <- sqrt(2) ^ 2
x
#> [1] 2
x == 2
#> [1] FALSE
x - 2
#> [1] 0.0000000000000004440892

Multiple if statements are also allowed.

if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # 
}

Another useful technique is the switch() function for many if ... else constructs.

perform_calc <- function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}
perform_calc(c(1,2,3), c(4,5,6), "plus")
#> [1] 5 7 9
perform_calc(c(1,2,3), c(4,5), "minus")
#> Warning in x - y: longer object length is not a multiple of shorter object
#> length
#> [1] -3 -3 -1
perform_calc(c(1,2,3), c(4,5,6), "power")
#> Error in perform_calc(c(1, 2, 3), c(4, 5, 6), "power"): Unknown op!

Conditions - Exercises

  1. What’s the difference between if and ifelse()? Carefully read the help and construct three examples that illustrate the key differences.

    if tests just one item, so it has to be used in a loop or with any() / all() if you want to ensure all in a vector meet a condition or any meet a condition.

    ifelse checks each item in the vector.

    If you read the help [?ifelse] it says use if/else construct for simple yes/no answers, and use ifelse for over a set of values.

    x <- c(1:5, 20, 25, 30)
        
        if(length(x) > 10){
          print(length(x))
          TRUE
        } else {
          print(length(x))
          FALSE
        }
        #> [1] 8
        #> [1] FALSE
        
        # You can use ifelse but kinda silly to use in this case
        ifelse(length(x)>10, length(x), "smaller than 10 items in vector")
        #> [1] "smaller than 10 items in vector"
        
        # here is a better use
        ifelse(x > 5, TRUE, FALSE)
        #> [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
        
        # DOES NOT work for if --- notice the warning that only
        # 1st is used
        if (x > 5) TRUE else FALSE
        #> Warning in if (x > 5) TRUE else FALSE: the condition has length > 1 and only the
        #> first element will be used
        #> [1] FALSE
  2. Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)

    greeting <- function(){
      time <- lubridate::now()
      hr <- lubridate::hour(time)
      if(hr >= 0 && hr < 12){
        "good morning"
      } else if (hr >= 12 && hr < 18){
        "good afternoon"
      } else {
        "good evening"
      }
    }
    greeting()
    #> [1] "good evening"
  3. Implement a fizzbuzz function. It takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.

    fizzbuzz <- function(x){
      if((x %% 3 == 0) && (x %% 5 == 0)){
        "fizzbuzz"
      } else if (x %% 3 == 0) {
        "fizz"
      } else if (x %% 5 == 0){
        "buzz"
      } else {
        x
      }
    }
    fizzbuzz(5)
    #> [1] "buzz"
    fizzbuzz(9)
    #> [1] "fizz"
    fizzbuzz(30)
    #> [1] "fizzbuzz"
    fizzbuzz(10)
    #> [1] "buzz"
  4. How could you use cut() to simplify this set of nested if-else statements?

    temp <- 31
    if (temp <= 0) {
      "freezing"
    } else if (temp <= 10) {
      "cold"
    } else if (temp <= 20) {
      "cool"
    } else if (temp <= 30) {
      "warm"
    } else {
      "hot"
    }
    #> [1] "hot"
    
    cut(temp, breaks = 10*(-3:10), 
        labels = c(rep("freezing", 3),
                   "cold",
                   "cool",
                   "warm",
                   rep("hot", 7)))
    #> [1] hot
    #> Levels: freezing cold cool warm hot
    
    table(cut(temp, breaks = 10*(-3:10), 
        labels = c(rep("freezing", 3),
                   "cold",
                   "cool",
                   "warm",
                   rep("hot", 7))))
    #> 
    #> freezing     cold     cool     warm      hot 
    #>        0        0        0        0        1

    How would you change the call to cut() if I’d used < instead of <=? What is the other chief advantage of cut() for this problem? (Hint: what happens if you have many values in temp?)

    You would use right = FALSE. cut allows vectors of values whereas if else constructs do not.

    NOTE: If you look at jrnold’s solutions you note that he uses -Inf and Inf for the breaks. This is much better than my hacky solution above that contains boundaries at the bottom and top.

  5. What happens if you use switch() with numeric values?

    Read more here.

    switch evaluates the EXPR against the list item number. Below we have 3 and this evaluates to list item == 3, hence 6.

    x <-  3
    switch(x, 
      "2",
      "4",
      "6",
      "8"
    )
    #> [1] "6"
  6. What does this switch() call do? What happens if x is “e”?

    If x is a or b, it returns “ab”; if x is c or d it returns cd. If x is e, no match is found and nothing is returned.

    The docs say to have an unnamed value as default after the named values for no match ones. This can be done as per the last way - note the test.

    x <-  "c"
    switch(x, 
      a = ,
      b = "ab",
      c = ,
      d = "cd"
    )
    
    x <- "e"
    switch(x, 
      a = ,
      b = "ab",
      c = ,
      d = "cd"
    )
    
    x <- "e"
    switch(x, 
      a = ,
      b = "ab",
      c = ,
      d = "cd"
      "test"
    )

    Experiment, then carefully read the documentation.

Default Arguments

The data to apply a function to should come first, followed by the arguments. Arguments should have default values as far as possible.

# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
#> [1] 0.4505862 0.5586331
mean_ci(x, conf = 0.99)
#> [1] 0.4336108 0.5756085
# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
#> [1] 0.4187221 0.5271327
mean_ci(x, conf = 0.99)
#> [1] 0.4016896 0.5441652

It’s good practice to check important preconditions, and throw an error (with stop()), if they are not met.

wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 1:3)
#> Error: `x` and `w` must be the same length

A useful function is the built-in stopifnot() which checks that each argument is TRUE, and produces a generic error message if not.

wt_mean <- function(x, w, na.rm = FALSE) {
  # I want na.rm to be logical: is.logical(na.rm)
  # I want length na.rm to be 1: length(na.rm) == 1
  # If it is NOT that STOP!
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 6:1, na.rm = "foo")
#> Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE

NOTE: When using stopifnot() you assert what should be true rather than checking for what might be wrong.

Dot-dot-dot (…)

Many functions in R take an arbitrary number of inputs by using a special argument: ... (pronounced dot-dot-dot). You can forward ... to other functions.

sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#> [1] 55
stringr::str_c("a", "b", "c", "d", "e", "f")
#> [1] "abcdef"
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
#> [1] "a, b, c, d, e, f, g, h, i, j"
rule <- function(..., pad = "-") {
  title <- paste0(...)
  # getOption("width") tells you how many chars can be printed
  # nchar(title) tells you how long title is
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
#> Important output -----------------------------------------------------------
x <- c(1, 2)
sum(x, na.mr = TRUE)
#> [1] 4

Notice the result above is incorrect as well, since it sums 1 + 2 + TRUE = 4, but it should be 3!

Func Arguments - Exercises

  1. What does commas(letters, collapse = "-") do? Why?

    It throws an error. The reason is that both letters and collapse are passed into commas in the ... argument.

    commas(letters, collapse = "-")
    #> Error in stringr::str_c(..., collapse = ", "): formal argument "collapse" matched by multiple actual arguments
  2. It’d be nice if you could supply multiple characters to the pad argument, e.g. rule("Title", pad = "-+"). Why doesn’t this currently work? How could you fix it?

    rule <- function(..., pad = "-") {
      title <- paste0(...)
      # getOption("width") tells you how many chars can be printed
      # nchar(title) tells you how long title is
      width <- getOption("width") - nchar(title) - 5
      cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
    }
    rule("Important output", pad = "-+")
    #> Important output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    The width of the resulting string is too long this way.

    rule <- function(..., pad = "-") {
          title <- paste0(...)
          # getOption("width") tells you how many chars can be printed
          # nchar(title) tells you how long title is
          width <- getOption("width") - nchar(title) - 5
          cat(title, " ", stringr::str_dup(pad, width/nchar(pad)), "\n", sep = "")
        }
        rule("Important output", pad = "-+")
        #> Important output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  3. What does the trim argument to mean() do? When might you use it?

    The trim argument removes / trims a certain percentage of the observations from each side. In the help page if we consider the examples, notice that the 0 and 50 are trimmed. So the values are sorted, and then the trimming occurs - this makes sense because we don’t want to just arbitrarily chop data, we want to remove outliers.

    In the second example we sort 5, 10, 11, …, 19, 20. Then 5 and 20 trimmed.

    x <- c(0:10, 50)
    (xm <- mean(x))
    #> [1] 8.75
    c(xm, mean(x, trim = 0.10))
    #> [1] 8.75 5.50
    mean(c(1:10))
    #> [1] 5.5
    mean(c(1:10)) == mean(x, trim = 0.10)
    #> [1] TRUE
    
    x <- c(10:20, 5)
    (xm <- mean(x))
    #> [1] 14.16667
    c(xm, mean(x, trim = 0.10))
    #> [1] 14.16667 14.50000
    mean(c(10:19))
    #> [1] 14.5
    mean(c(10:19)) == mean(x, trim = 0.10)
    #> [1] TRUE
  4. The default value for the method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default?

    The cor() function can use any one of those values, by default the first is used which is pearson.

Writing pipeable functions

If you want to write your own pipeable functions, think about the return value. E.g. for dplyr and tidyr the object type is the data frame.

There are two basic types of pipeable functions: transformations and side-effects.

  • With transformations, an object is passed to the function’s first argument and a modified object is returned.
  • With side-effects, the passed object is not transformed. Instead, the function performs an action on the object, like printing or plotting. These functions “invisibly” return the first argument, so that they can still be used in a pipeline.
show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  
  invisible(df) # return df invisibly
}
# df does not get printed but it is there
show_missings(mtcars)
#> Missing values: 0

It’s still there, it’s just not printed by default.

x <- show_missings(mtcars) 
#> Missing values: 0
class(x)
#> [1] "data.frame"
dim(x)
#> [1] 32 11
head(x)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We can use it in a pipe:

library(dplyr)
mtcars %>% 
  show_missings() %>% 
  mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>% 
  show_missings() 
#> Missing values: 0
#> Missing values: 18

What’s in a function?

Want to see what the code for a function is?

Type the function name with no parantheses.

E.g. lm, factorial

lm
#> function (formula, data, subset, weights, na.action, method = "qr", 
#>     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
#>     contrasts = NULL, offset, ...) 
#> {
#>     ret.x <- x
#>     ret.y <- y
#>     cl <- match.call()
#>     mf <- match.call(expand.dots = FALSE)
#>     m <- match(c("formula", "data", "subset", "weights", "na.action", 
#>         "offset"), names(mf), 0L)
#>     mf <- mf[c(1L, m)]
#>     mf$drop.unused.levels <- TRUE
#>     mf[[1L]] <- quote(stats::model.frame)
#>     mf <- eval(mf, parent.frame())
#>     if (method == "model.frame") 
#>         return(mf)
#>     else if (method != "qr") 
#>         warning(gettextf("method = '%s' is not supported. Using 'qr'", 
#>             method), domain = NA)
#>     mt <- attr(mf, "terms")
#>     y <- model.response(mf, "numeric")
#>     w <- as.vector(model.weights(mf))
#>     if (!is.null(w) && !is.numeric(w)) 
#>         stop("'weights' must be a numeric vector")
#>     offset <- model.offset(mf)
#>     mlm <- is.matrix(y)
#>     ny <- if (mlm) 
#>         nrow(y)
#>     else length(y)
#>     if (!is.null(offset)) {
#>         if (!mlm) 
#>             offset <- as.vector(offset)
#>         if (NROW(offset) != ny) 
#>             stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
#>                 NROW(offset), ny), domain = NA)
#>     }
#>     if (is.empty.model(mt)) {
#>         x <- NULL
#>         z <- list(coefficients = if (mlm) matrix(NA_real_, 0, 
#>             ncol(y)) else numeric(), residuals = y, fitted.values = 0 * 
#>             y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != 
#>             0) else ny)
#>         if (!is.null(offset)) {
#>             z$fitted.values <- offset
#>             z$residuals <- y - offset
#>         }
#>     }
#>     else {
#>         x <- model.matrix(mt, mf, contrasts)
#>         z <- if (is.null(w)) 
#>             lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
#>                 ...)
#>         else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, 
#>             ...)
#>     }
#>     class(z) <- c(if (mlm) "mlm", "lm")
#>     z$na.action <- attr(mf, "na.action")
#>     z$offset <- offset
#>     z$contrasts <- attr(x, "contrasts")
#>     z$xlevels <- .getXlevels(mt, mf)
#>     z$call <- cl
#>     z$terms <- mt
#>     if (model) 
#>         z$model <- mf
#>     if (ret.x) 
#>         z$x <- x
#>     if (ret.y) 
#>         z$y <- y
#>     if (!qr) 
#>         z$qr <- NULL
#>     z
#> }
#> <bytecode: 0x0000000027823e50>
#> <environment: namespace:stats>

To figure out what arguments a function takes use args(func_name).

args(lm)
#> function (formula, data, subset, weights, na.action, method = "qr", 
#>     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
#>     contrasts = NULL, offset, ...) 
#> NULL
args(dplyr::across)
#> function (.cols = everything(), .fns = NULL, ..., .names = NULL) 
#> NULL

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
#> [3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
#> [5] LC_TIME=English_South Africa.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] magrittr_1.5    flair_0.0.2     forcats_0.5.0   stringr_1.4.0  
#>  [5] dplyr_1.0.2     purrr_0.3.4     readr_1.4.0     tidyr_1.1.2    
#>  [9] tibble_3.0.3    ggplot2_3.3.2   tidyverse_1.3.0 workflowr_1.6.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.0 xfun_0.13        haven_2.3.1      colorspace_1.4-1
#>  [5] vctrs_0.3.2      generics_0.0.2   htmltools_0.5.0  yaml_2.2.1      
#>  [9] rlang_0.4.8      later_1.0.0      pillar_1.4.6     withr_2.2.0     
#> [13] glue_1.4.2       DBI_1.1.0        dbplyr_2.0.0     modelr_0.1.8    
#> [17] readxl_1.3.1     lifecycle_0.2.0  munsell_0.5.0    gtable_0.3.0    
#> [21] cellranger_1.1.0 rvest_0.3.6      evaluate_0.14    knitr_1.28      
#> [25] ps_1.3.2         httpuv_1.5.2     fansi_0.4.1      broom_0.7.2     
#> [29] Rcpp_1.0.4.6     promises_1.1.0   backports_1.1.6  scales_1.1.0    
#> [33] jsonlite_1.7.1   fs_1.5.0         hms_0.5.3        digest_0.6.27   
#> [37] stringi_1.5.3    rprojroot_1.3-2  grid_3.6.3       cli_2.1.0       
#> [41] tools_3.6.3      crayon_1.3.4     whisker_0.4      pkgconfig_2.0.3 
#> [45] ellipsis_0.3.1   xml2_1.3.2       reprex_0.3.0     lubridate_1.7.9 
#> [49] assertthat_0.2.1 rmarkdown_2.4    httr_1.4.2       rstudioapi_0.11 
#> [53] R6_2.4.1         git2r_0.26.1     compiler_3.6.3