Chapter 7 Defining your own functions

In this section we are going to learn some advanced concepts that are going to make you into a full-fledged R programmer. Before this chapter you only used whatever R came with, as well as the functions contained in packages. We did define some functions ourselves in Chapter 6 already, but without going into many details. In this chapter, we will learn about building functions ourselves, and do so in greater detail than what we did before.

7.1 Control flow

Knowing about control flow is essential to build your own functions. Without control flow statements, such as if-else statements or loops (or, in the case of pure functional programming languages, recursion), programming languages would be very limited.

7.1.1 If-else

Imagine you want a variable to be equal to a certain value if a condition is met. This is a typical problem that requires the if ... else ... construct. For instance:

a <- 4
b <- 5

Suppose that if a > b then f should be equal to 20, else f should be equal to 10. Using if ... else ... you can achieve this like so:

if (a > b) {
  f <- 20
    } else {
  f <- 10
}

Obviously, here f = 10. Another way to achieve this is by using the ifelse() function:

f <- ifelse(a > b, 20, 10)

if...else... and ifelse() might seem interchangeable, but they’re not. ifelse() is vectorized, while if...else.. is not. Let’s try the following:

ifelse(c(1,2,4) > c(3, 1, 0), "yes", "no")
## [1] "no"  "yes" "yes"

The result is a vector. Now, let’s see what happens if we use if...else... instead of ifelse():

if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no")
> Error in if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no") : 
  the condition has length > 1

This results in an error (in previous R version, only the first element of the vector would get used). We have already discussed this in Chapter 2, remember? If you want to make sure that such an expression evaluates to TRUE, then you need to use all():

ifelse(all(c(1,2,4) > c(3, 1, 0)), "all elements are greater", "not all elements are greater")
## [1] "not all elements are greater"

You may also remember the any() function:

ifelse(any(c(1,2,4) > c(3, 1, 0)), "at least one element is greater", "no element greater")
## [1] "at least one element is greater"

These are the basics. But sometimes, you might need to test for more complex conditions, which can lead to using nested if...else... constructs. These, however, can get messy:

if (10 %% 3 == 0) {
  print("10 is divisible by 3")
  } else if (10 %% 2 == 0) {
    print("10 is divisible by 2")
}
## [1] "10 is divisible by 2"

10 being obviously divisible by 2 and not 3, it is the second sentence that will be printed. The %% operator is the modulus operator, which gives the rest of the division of 10 by 2. In such cases, it is easier to use dplyr::case_when():

case_when(10 %% 3 == 0 ~ "10 is divisible by 3",
          10 %% 2 == 0 ~ "10 is divisible by 2")
## [1] "10 is divisible by 2"

We have already encountered this function in Chapter 4, inside a dplyr::mutate() call to create a new column.

Let’s now discuss loops.

7.1.2 For loops

For loops make it possible to repeat a set of instructions i times. For example, try the following:

for (i in 1:10){
  print("hello")
}
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"

It is also possible to do computations using for loops. Let’s compute the sum of the first 100 integers:

result <- 0
for (i in 1:100){
  result <- result + i
}

print(result)
## [1] 5050

result is equal to 5050, the expected result. What happened in that loop? First, we defined a variable called result and set it to 0. Then, when the loops starts, i equals 1, so we add result to 1, which is 1. Then, i equals 2, and again, we add result to i. But this time, result equals 1 and i equals 2, so now result equals 3, and we repeat this until i equals 100. If you know a programming language like C, this probably looks familiar. However, R is not C, and you should, if possible, avoid writing code that looks like this. You should always ask yourself the following questions:

  • Is there an inbuilt function to achieve what I need? In this case we have sum(), so we could use sum(seq(1, 100)).
  • Is there a way to use matrix algebra? This can sometimes make things easier, but it depends how comfortable you are with matrix algebra. This would be the solution with matrix algebra: rep(1, 100) %*% seq(1, 100).
  • Is there a way to use building blocks that are already available? For instance, suppose that sum() would not be a function available in R. Another way to solve this issue would be to use the following building blocks: +, which computes the sum of two numbers and Reduce(), which reduces a list of elements using an operator. Sounds complicated? Let’s see how Reduce() works. First, let me show you how I combine these two functions to achieve the same result as when using sum():
Reduce(`+`, seq(1, 100))
## [1] 5050

We will see how Reduce() works in greater detail in the next chapter, but what happened was something like this:

Reduce(`+`, seq(1, 100)) = 
 1 + Reduce(`+`, seq(2, 100)) = 
 1 + 2 + Reduce(`+`, seq(3, 100)) = 
 1 + 2 + 3 + Reduce(`+`, seq(4, 100)) = 
 ....

If you ask yourself these questions, it turns out that you only rarely actually need to write loops, but loops are still important, because sometimes there simply isn’t an alternative. Also, there are other situations where loops are also important, so I refer you to the following section of Hadley Wickham’s Advanced R for an in-depth discussion on situations where loops make more sense than using functions such as Reduce().

7.1.3 While loops

While loops are very similar to for loops. The instructions inside a while loop are repeated while a certain condition holds true. Let’s consider the sum of the first 100 integers again:

result <- 0
i <- 1
while (i<=100){
  result = result + i
  i = i + 1
}

print(result)
## [1] 5050

Here, we first set result and i to 0. Then, while i is less than, or equal to 100, we add i to result. Notice that there is one more line than in the for loop version of this code: we need to increment the value of i at each iteration, if not, i would stay equal to 1, and the condition would always be fulfilled, and the loop would run forever (not really, only until your computer runs out of memory, or until the heat death of the universe, whichever comes first).

Now that we know how to write loops, and know about if...else... constructs, we have (almost) all the ingredients to write our own functions.

7.2 Writing your own functions

As you have seen by now, R includes a very large amount of in-built functions, but also many more functions are available in packages. However, there will be a lot of situations where you will need to write your own. In this section we are going to learn how to write our own functions.

7.2.1 Declaring functions in R

Suppose you want to create the following function: \(f(x) = \dfrac{1}{\sqrt{x}}\). Writing this in R is quite simple:

my_function <- function(x){
  1/sqrt(x)
}

The argument of the function, x, gets passed to the function() function and the body of the function (more on that in the next Chapter) contains the function definition. Of course, you could define functions that use more than one input:

my_function <- function(x, y){
  1/sqrt(x + y)
}

or inputs with names longer than one character:

my_function <- function(argument1, argument2){
  1/sqrt(argument1 + argument2)
}

Functions written by the user get called just the same way as functions included in R:

my_function(1, 10)
## [1] 0.3015113

It is also possible to provide default values to the function’s arguments, which are values that are used if the user omits them:

my_function <- function(argument1, argument2 = 10){
1/sqrt(argument1 + argument2)
}
my_function(1)
## [1] 0.3015113

This is especially useful for functions with many arguments. Consider also the following example, where the function has a default method:

my_function <- function(argument1, argument2, method = "foo"){
  
  x <- argument1 + argument2
  
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
    }
}

my_function(10, 11)
## [1] 0.2182179
my_function(10, 11, "bar")
## [1] "this is a string"

As you see, depending on the “method” chosen, the returned result is either a numeric, or a string. What happens if the user provides a “method” that is neither “foo” nor “bar”?

my_function(10, 11, "spam")

As you can see nothing happens. It is possible to add safeguards to your function to avoid such situations:

my_function <- function(argument1, argument2, method = "foo"){
  
  if(!(method %in% c("foo", "bar"))){
    return("Method must be either 'foo' or 'bar'")
  }
  
  x <- argument1 + argument2
  
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
    }
}

my_function(10, 11)
## [1] 0.2182179
my_function(10, 11, "bar")
## [1] "this is a string"
my_function(10, 11, "foobar")
## [1] "Method must be either 'foo' or 'bar'"

Notice that I have used return() inside my first if statement. This is to immediately stop evaluation of the function and return a value. If I had omitted it, evaluation would have continued, as it is always the last expression that gets evaluated. Remove return() and run the function again, and see what happens. Later, we are going to learn how to add better safeguards to your functions and to avoid runtime errors.

While in general, it is a good idea to add comments to your functions to explain what they do, I would avoid adding comments to functions that do things that are very obvious, such as with this one. Function names should be of the form: function_name(). Always give your function very explicit names! In mathematics it is standard to give functions just one letter as a name, but I would advise against doing that in your code. Functions that you write are not special in any way; this means that R will treat them the same way, and they will work in conjunction with any other function just as if it was built-in into R.

They have one limitation though (which is shared with R’s native function): just like in math, they can only return one value. However, sometimes, you may need to return more than one value. To be able to do this, you must put your values in a list, and return the list of values. For example:

average_and_sd <- function(x){
c(mean(x), sd(x))
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237

You’re still returning a single object, but it’s a vector. You can also return a named list:

average_and_sd <- function(x){
list("mean_x" =  mean(x), "sd_x" = sd(x))
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## $mean_x
## [1] 7.166667
## 
## $sd_x
## [1] 4.262237

As described before, you can use return() at the end of your functions:

average_and_sd <- function(x){
  result <- c(mean(x), sd(x))
return(result)
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237

But this is only needed if you need to return a value early:

average_and_sd <- function(x){
if(any(is.na(x))){
    return(NA)
  } else {
    c(mean(x), sd(x))
    }
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237
average_and_sd(c(1, 3, NA, 9, 10, 12))
## [1] NA

If you need to use a function from a package inside your function use :::

my_sum <- function(a_vector){
  purrr::reduce(a_vector, `+`)
}

However, if you need to use more than one function, this can become tedious. A quick and dirty way of doing that, is to use library(package_name), inside the function:

my_sum <- function(a_vector){
  library(purrr)
  reduce(a_vector, `+`)
}

Loading the library inside the function has the advantage that you will be sure that the package upon which your function depends will be loaded. If the package is already loaded, it will not be loaded again, thus not impact performance, but if you forgot to load it at the beginning of your script, then, no worries, your function will load it the first time you use it! However, you should avoid doing this, because the resulting function is now not pure. It has a side effect, which is loading a library. This could result in problems, especially if several functions load several different packages that have functions with the same name. Depending on which function runs first, a function with the same name but coming from the same package will be available in the global environment. The very best way would be to write your own package and declare the packages upon which your functions depend as dependencies. This is something we are going to explore in Chapter 9.

You can put a lot of instructions inside a function, such as loops. Let’s create the function that returns Fionacci numbers.

7.2.2 Fibonacci numbers

The Fibonacci sequence is the following:

\[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\]

Each subsequent number is composed of the sum of the two preceding ones. In R, it is possible to define a function that returns the \(n^{th}\) fibonacci number:

my_fibo <- function(n){
 a <- 0
 b <- 1
 for (i in 1:n){
  temp <- b
  b <- a
  a <- a + temp
 }
 a
}

Inside the loop, we defined a variable called temp. Defining temporary variables is usually very useful. Let’s try to understand what happens inside this loop:

  • First, we assign the value 0 to variable a and value 1 to variable b.
  • We start a loop, that goes from 1 to n.
  • We assign the value inside of b to a temporary variable, called temp.
  • b becomes a.
  • We assign the sum of a and temp to a.
  • When the loop is finished, we return a.

What happens if we want the 3rd fibonacci number? At n = 1 we have first a = 0 and b = 1, then temp = 1, b = 0 and a = 0 + 1. Then n = 2. Now b = 0 and temp = 0. The previous result, a = 0 + 1 is now assigned to b, so b = 1. Then, a = 1 + 0. Finally, n = 3. temp = 1 (because b = 1), the previous result a = 1 is assigned to b and finally, a = 1 + 1. So the third fibonacci number equals 2. Reading this might be a bit confusing; I strongly advise you to run the algorithm on a sheet of paper, step by step.

The above algorithm is called an iterative algorithm, because it uses a loop to compute the result. Let’s look at another way to think about the problem, with a so-called recursive function:

fibo_recur <- function(n){
 if (n == 0 || n == 1){
   return(n)
   } else {
   fibo_recur(n-1) + fibo_recur(n-2)
   }
}

This algorithm should be easier to understand: if n = 0 or n = 1 the function should return n (0 or 1). If n is strictly bigger than 1, fibo_recur() should return the sum of fibo_recur(n-1) and fibo_recur(n-2). This version of the function is very much the same as the mathematical definition of the fibonacci sequence. So why not use only recursive algorithms then? Try to run the following:

system.time(my_fibo(30))
##    user  system elapsed 
##   0.007   0.000   0.007

The result should be printed very fast (the system.time() function returns the time that it took to execute my_fibo(30)). Let’s try with the recursive version:

system.time(fibo_recur(30))
##    user  system elapsed 
##   1.482   0.080   1.574

It takes much longer to execute! Recursive algorithms are very CPU demanding, so if speed is critical, it’s best to avoid recursive algorithms. Also, in fibo_recur() try to remove this line: if (n == 0 || n == 1) and try to run fibo_recur(5) and see what happens. You should get an error: this is because for recursive algorithms you need a stopping condition, or else, it would run forever. This is not the case for iterative algorithms, because the stopping condition is the last step of the loop.

So as you can see, for recursive relationships, for or while loops are the way to go in R, whether you’re writing these loops inside functions or not.

7.3 Exercises

Exercise 1

In this exercise, you will write a function to compute the sum of the n first integers. Combine the algorithm we saw in section about while loops and what you learned about functions in this section.

Exercise 2

Write a function called my_fact() that computes the factorial of a number n. Do it using a loop, using a recursive function, and using a functional:

Exercise 3

Write a function to find the roots of quadratic functions. Your function should take 3 arguments, a, b and c and return the two roots. Only consider the case where there are two real roots (delta > 0).

7.4 Functions that take functions as arguments: writing your own higher-order functions

Functions that take functions as arguments are very powerful and useful tools. Two very important functions, that we will discuss in chapter 8 are purrr::map() and purrr::reduce(). But you can also write your own! A very simple example would be the following:

my_func <- function(x, func){
  func(x)
}

my_func() is a very simple function that takes x and func() as arguments and that simply executes func(x). This might not seem very useful (after all, you could simply use func(x)!) but this is just for illustration purposes, in practice, your functions would be more useful than that! Let’s try to use my_func():

my_func(c(1, 8, 1, 0, 8), mean)
## [1] 3.6

As expected, this returns the mean of the given vector. But now suppose the following:

my_func(c(1, 8, 1, NA, 8), mean)
## [1] NA

Because one element of the list is NA, the whole mean is NA. mean() has a na.rm argument that you can set to TRUE to ignore the NAs in the vector. However, here, there is no way to provide this argument to the function mean()! Let’s see what happens when we try to:

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE)
Error in my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) :
  unused argument (na.rm = TRUE)

So what you could do is pass the value TRUE to the na.rm argument of mean() from your own function:

my_func <- function(x, func, remove_na){
  func(x, na.rm = remove_na)
}

my_func(c(1, 8, 1, NA, 8), mean, remove_na = TRUE)
## [1] 4.5

This is one solution, but mean() also has another argument called trim. What if some other user needs this argument? Should you also add it to your function? Surely there’s a way to avoid this problem? Yes, there is, and it by using the dots. The ... simply mean “any other argument as needed”, and it’s very easy to use:

my_func <- function(x, func, ...){
  func(x, ...)
}

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE)
## [1] 4.5

or, now, if you need the trim argument:

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE, trim = 0.1)
## [1] 4.5

The ... are very useful when writing higher-order functions such as my_func(), because it allows you to pass arguments down to the underlying functions.

7.5 Functions that return functions

The example from before, my_func() took three arguments, some x, a function func, and ... (dots). my_func() was a kind of wrapper that evaluated func on its arguments x and .... But sometimes this is not quite what you need or want. It is sometimes useful to write a function that returns a modified function. This type of function is called a function factory, as it builds functions. For instance, suppose that we want to time how long functions take to run. An idea would be to proceed like this:

tic <- Sys.time()
very_slow_function(x)
toc <- Sys.time()

running_time <- toc - tic

but if you want to time several functions, this gets very tedious. It would be much easier if functions would time themselves. We could achieve this by writing a wrapper, like this:

timed_very_slow_function <- function(...){

  tic <- Sys.time()
  result <- very_slow_function(x)
  toc <- Sys.time()

  running_time <- toc - tic

  list("result" = result,
       "running_time" = running_time)

}

The problem here is that we have to change each function we need to time. But thanks to the concept of function factories, we can write a function that does this for us:

time_f <- function(.f, ...){

  function(...){

    tic <- Sys.time()
    result <- .f(...)
    toc <- Sys.time()

    running_time <- toc - tic

    list("result" = result,
         "running_time" = running_time)

  }
}

time_f() is a function that returns a function, a function factory. Calling it on a function returns, as expected, a function:

t_mean <- time_f(mean)

t_mean
## function(...){
## 
##     tic <- Sys.time()
##     result <- .f(...)
##     toc <- Sys.time()
## 
##     running_time <- toc - tic
## 
##     list("result" = result,
##          "running_time" = running_time)
## 
##   }
## <environment: 0x562c5699a6b8>

This function can now be used like any other function:

output <- t_mean(seq(-500000, 500000))

output is a list of two elements, the first being simply the result of mean(seq(-500000, 500000)), and the other being the running time.

This approach is super flexible. For instance, imagine that there is an NA in the vector. This would result in the mean of this vector being NA:

t_mean(c(NA, seq(-500000, 500000)))
## $result
## [1] NA
## 
## $running_time
## Time difference of 0.006885529 secs

But because we use the ... in the definition of time_f(), we can now simply pass mean()’s option down to it:

t_mean(c(NA, seq(-500000, 500000)), na.rm = TRUE)
## $result
## [1] 0
## 
## $running_time
## Time difference of 0.01394773 secs

7.6 Functions that take columns of data as arguments

7.6.1 The enquo() - !!() approach

In many situations, you will want to write functions that look similar to this:

my_function(my_data, one_column_inside_data)

Such a function would be useful in situation where you have to apply a certain number of operations to columns for different data frames. For example if you need to create tables of descriptive statistics or graphs periodically, it might be very interesting to put these operations inside a function and then call the function whenever you need it, on the fresh batch of data.

However, if you try to write something like that, something that might seem unexpected, at first, will happen:

data(mtcars)

simple_function <- function(dataset, col_name){
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_speed = mean(speed))
}


simple_function(cars, "dist")
Error: unknown variable to group by : col_name

The variable col_name is passed to simple_function() as a string, but group_by() requires a variable name. So why not try to convert col_name to a name?

simple_function <- function(dataset, col_name){
  col_name <- as.name(col_name)
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_speed = mean(speed))
}


simple_function(cars, "dist")
Error: unknown variable to group by : col_name

This is because R is literally looking for the variable "dist" somewhere in the global environment, and not as a column of the data. R does not understand that you are refering to the column "dist" that is inside the dataset. So how can we make R understands what you mean?

To be able to do that, we need to use a framework that was introduced in the {tidyverse}, called tidy evaluation. This framework can be used by installing the {rlang} package. {rlang} is quite a technical package, so I will spare you the details. But you should at the very least take a look at the following documents here and here. The discussion can get complicated, but you don’t need to know everything about {rlang}. As you will see, knowing some of the capabilities {rlang} provides can be incredibly useful. Take a look at the code below:

simple_function <- function(dataset, col_name){
  col_name <- enquo(col_name)
  dataset %>%
    group_by(!!col_name) %>%
    summarise(mean_mpg = mean(mpg))
}


simple_function(mtcars, cyl)
## # A tibble: 3 × 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4     26.7
## 2     6     19.7
## 3     8     15.1

As you can see, the previous idea we had, which was using as.name() was not very far away from the solution. The solution, with {rlang}, consists in using enquo(), which (for our purposes), does something similar to as.name(). Now that col_name is (R programmers call it) quoted, or defused, we need to tell group_by() to evaluate the input as is. This is done with !!(), called the injection operator, which is another {rlang} function. I say it again; don’t worry if you don’t understand everything. Just remember to use enquo() on your column names and then !!() inside the {dplyr} function you want to use.

Let’s see some other examples:

simple_function <- function(dataset, col_name, value){
  col_name <- enquo(col_name)
  dataset %>%
    filter((!!col_name) == value) %>%
    summarise(mean_cyl = mean(cyl))
}


simple_function(mtcars, am, 1)
##   mean_cyl
## 1 5.076923

Notice that I’ve written:

filter((!!col_name) == value)

and not:

filter(!!col_name == value)

I have enclosed !!col_name inside parentheses. This is because operators such as == have precedence over !!, so you have to be explicit. Also, notice that I didn’t have to quote 1. This is because it’s standard variable, not a column inside the dataset. Let’s make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but maybe you’d like the mean of another variable?

simple_function <- function(dataset, filter_col, mean_col, value){
  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(mean((!!mean_col)))
}


simple_function(mtcars, am, cyl, 1)
##   mean(cyl)
## 1  5.076923

Notice that I had to quote mean_col too.

Using the ... that we discovered in the previous section, we can pass more than one column:

simple_function <- function(dataset, ...){
  col_vars <- quos(...)
  dataset %>%
    summarise_at(vars(!!!col_vars), funs(mean, sd))
}

Because these dots contain more than one variable, you have to use quos() instead of enquo(). This will put the arguments provided via the dots in a list. Then, because we have a list of columns, we have to use summarise_at(), which you should know if you did the exercices of Chapter 4. So if you didn’t do them, go back to them and finish them first. Doing the exercise will also teach you what vars() and funs() are. The last thing you have to pay attention to is to use !!!() if you used quos(). So 3 ! instead of only 2. This allows you to then do things like this:

simple_function(mtcars, am, cyl, mpg)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
##   am_mean cyl_mean mpg_mean     am_sd   cyl_sd   mpg_sd
## 1 0.40625   6.1875 20.09062 0.4989909 1.785922 6.026948

Using ... with !!!() allows you to write very flexible functions.

If you need to be even more general, you can also provide the summary functions as arguments of your function, but you have to rewrite your function a little bit:

simple_function <- function(dataset, cols, funcs){
  dataset %>%
    summarise_at(vars(!!!cols), funs(!!!funcs))
}

You might be wondering where the quos() went? Well because now we are passing two lists, a list of columns that we have to quote, and a list of functions, that we also have to quote, we need to use quos() when calling the function:

simple_function(mtcars, quos(am, cyl, mpg), quos(mean, sd, sum))
##   am_mean cyl_mean mpg_mean     am_sd   cyl_sd   mpg_sd am_sum cyl_sum mpg_sum
## 1 0.40625   6.1875 20.09062 0.4989909 1.785922 6.026948     13     198   642.9

This works, but I don’t think you’ll need to have that much flexibility; either the columns are variables, or the functions, but rarely both at the same time.

To conclude this function, I should also talk about as_label() which allows you to change the name of a variable, for instance if you want to call the resulting column mean_mpg when you compute the mean of the mpg column:

simple_function <- function(dataset, filter_col, mean_col, value){

  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  mean_name <- paste0("mean_", as_label(mean_col))
  
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(!!(mean_name) := mean((!!mean_col)))
}

Pay attention to the := operator in the last line. This is needed when using as_label().

7.6.2 Curly Curly, a simplified approach to enquo() and !!()

The previous section might have been a bit difficult to grasp, but there is a simplified way of doing it, which consists in using {{}}, introduced in {rlang} version 0.4.0. The suggested pronunciation of {{}} is curly-curly, but there is no consensus yet.

Let’s suppose that I need to write a function that takes a data frame, as well as a column from this data frame as arguments, just like before:

how_many_na <- function(dataframe, column_name){
  dataframe %>%
    filter(is.na(column_name)) %>%
    count()
}

Let’s try this function out on the starwars data:

data(starwars)

head(starwars)
## # A tibble: 6 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

As you can see, there are missing values in the hair_color column. Let’s try to count how many missing values are in this column:

how_many_na(starwars, hair_color)
Error: object 'hair_color' not found

Just as expected, this does not work. The issue is that the column is inside the dataframe, but when calling the function with hair_color as the second argument, R is looking for a variable called hair_color that does not exist. What about trying with "hair_color"?

how_many_na(starwars, "hair_color")
## # A tibble: 1 × 1
##       n
##   <int>
## 1     0

Now we get something, but something wrong!

One way to solve this issue, is to not use the filter() function, and instead rely on base R:

how_many_na_base <- function(dataframe, column_name){
  na_index <- is.na(dataframe[, column_name])
  nrow(dataframe[na_index, column_name])
}

how_many_na_base(starwars, "hair_color")
## [1] 5

This works, but not using the {tidyverse} at all is not always an option. For instance, the next function, which uses a grouping variable, would be difficult to implement without the {tidyverse}:

summarise_groups <- function(dataframe, grouping_var, column_name){
  dataframe %>%
    group_by(grouping_var) %>%  
    summarise(mean(column_name, na.rm = TRUE))
}

Calling this function results in the following error message, as expected:

Error: Column `grouping_var` is unknown

In the previous section, we solved the issue like so:

summarise_groups <- function(dataframe, grouping_var, column_name){

  grouping_var <- enquo(grouping_var)
  column_name <- enquo(column_name)
  mean_name <- paste0("mean_", as_label(column_name))

  dataframe %>%
    group_by(!!grouping_var) %>%  
    summarise(!!(mean_name) := mean(!!column_name, na.rm = TRUE))
}

The core of the function remained very similar to the version from before, but now one has to use the enquo()-!! syntax.

Now this can be simplified using the new {{}} syntax:

summarise_groups <- function(dataframe, grouping_var, column_name){

  dataframe %>%
    group_by({{grouping_var}}) %>%  
    summarise({{column_name}} := mean({{column_name}}, na.rm = TRUE))
}

Much easier and cleaner! You still have to use the := operator instead of = for the column name however, and if you want to modify the column names, for instance in this case return "mean_height" instead of height you have to keep using the enquo()-!! syntax.

7.7 Functions that use loops

It is entirely possible to put a loop inside a function. For example, consider the following function that return the square root of a number using Newton’s algorithm:

sqrt_newton <- function(a, init = 1, eps = 0.01){
    stopifnot(a >= 0)
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
    }
    init
}

This functions contains a while loop inside its body. Let’s see if it works:

sqrt_newton(16)
## [1] 4.000001

In the definition of the function, I wrote init = 1 and eps = 0.01 which means that this argument can be omitted and will have the provided value (0.01) as the default. You can then use this function as any other, for example with map():

map(c(16, 7, 8, 9, 12), sqrt_newton)
## [[1]]
## [1] 4.000001
## 
## [[2]]
## [1] 2.645767
## 
## [[3]]
## [1] 2.828469
## 
## [[4]]
## [1] 3.000092
## 
## [[5]]
## [1] 3.464616

This is what I meant before with “your functions are nothing special”. Once the function is defined, you can use it like any other base R function.

Notice the use of stopifnot() inside the body of the function. This is a way to return an error in case a condition is not fulfilled. We are going to learn more about this type of functions in the next chapter.

7.8 Anonymous functions

As the name implies, anonymous functions are functions that do not have a name. These are useful inside functions that have functions as arguments, such as purrr::map() or purrr::reduce():

map(c(1,2,3,4), function(x){1/sqrt(x)})
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.7071068
## 
## [[3]]
## [1] 0.5773503
## 
## [[4]]
## [1] 0.5

These anonymous functions get defined in a very similar way to regular functions, you just skip the name and that’s it. {tidyverse} functions also support formulas; these get converted to anonymous functions:

map(c(1,2,3,4), ~{1/sqrt(.)})
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.7071068
## 
## [[3]]
## [1] 0.5773503
## 
## [[4]]
## [1] 0.5

Using a formula instead of an anonymous function is less verbose; you use ~ instead of function(x) and a single dot . instead of x. What if you need an anonymous function that requires more than one argument? This is not a problem:

map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), function(x, y){(x**2)/y})
## [[1]]
## [1] 0.1111111
## 
## [[2]]
## [1] 0.5
## 
## [[3]]
## [1] 1.285714
## 
## [[4]]
## [1] 2.666667
## 
## [[5]]
## [1] 5

or, using a formula:

map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), ~{(.x**2)/.y})
## [[1]]
## [1] 0.1111111
## 
## [[2]]
## [1] 0.5
## 
## [[3]]
## [1] 1.285714
## 
## [[4]]
## [1] 2.666667
## 
## [[5]]
## [1] 5

Because you have now two arguments, a single dot could not work, so instead you use .x and .y to avoid confusion.

Since version 4.1, R introduced a short-hand for defining anonymous functions:

map(c(1,2,3,4), \(x)(1/sqrt(x)))
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.7071068
## 
## [[3]]
## [1] 0.5773503
## 
## [[4]]
## [1] 0.5

\(x) is supposed to look like this notation: \(\lambda(x)\). This is a notation comes from lambda calculus, where functions are defined like this:

\[ \lambda(x).1/sqrt(x) \]

which is equivalent to \(f(x) = 1/sqrt(x)\). You can use \(x) or function(x) interchangeably.

You now know a lot about writing your own functions. In the next chapter, we are going to learn about functional programming, the programming paradigm I described in the introduction of this book.

7.9 Exercises

Exercise 1

  • Create the following vector:

\[a = (1,6,7,8,8,9,2)\]

Using a for loop and a while loop, compute the sum of its elements. To avoid issues, use i as the counter inside the for loop, and j as the counter for the while loop.

  • How would you achieve that with a functional (a function that takes a function as an argument)?

Exercise 2

  • Let’s use a loop to get the matrix product of a matrix A and B. Follow these steps to create the loop:
  1. Create matrix A:

\[A = \left( \begin{array}{ccc} 9 & 4 & 12 \\ 5 & 0 & 7 \\ 2 & 6 & 8 \\ 9 & 2 & 9 \end{array} \right) \]

  1. Create matrix B:

\[B = \left( \begin{array}{cccc} 5 & 4 & 2 & 5 \\ 2 & 7 & 2 & 1 \\ 8 & 3 & 2 & 6 \\ \end{array} \right) \]

  1. Create a matrix C, with dimension 4x4 that will hold the result. Use this command: `C = matrix(rep(0,16), nrow = 4)}

  2. Using a for loop, loop over the rows of A first: `for(i in 1:nrow(A))}

  3. Inside this loop, loop over the columns of B: `for(j in 1:ncol(B))}

  4. Again, inside this loop, loop over the rows of B: `for(k in 1:nrow(B))}

  5. Inside this last loop, compute the result and save it inside C: `C[i,j] = C[i,j] + A[i,k] * B[k,j]}

  6. Now write a function that takes two matrices as arguments, and returns their product.

  • R has a built-in function to compute the dot product of 2 matrices. Which is it?

Exercise 3

  • Fizz Buzz: Print integers from 1 to 100. If a number is divisible by 3, print the word "Fizz" if it’s divisible by 5, print "Buzz". Use a for loop and if statements.

  • Write a function that takes an integer as arguments, and prints "Fizz" or "Buzz" up to that integer.

Exercise 4

  • Fizz Buzz 2: Same as above, but now add this third condition: if a number is both divisible by 3 and 5, print "FizzBuzz".

  • Write a function that takes an integer as argument, and prints Fizz, Buzz or FizzBuzz up to that integer.