class: center, middle, inverse, title-slide

.title[
# Programming Tools in Data Science
]
.subtitle[
## Lecture #10: Functional programming
]
.author[
### Samuel Orso
]
.date[
### 31 October 2024
]

---

# S3 OOP system

* Object-oriented programming (OOP) is one of the most popular programming paradigms.
* The type of an object is a **class** and a function implemented for a specific class is a **method**.
* It is mostly used for **polymorphism**: the function interface is separated from its implementation. In other words, the function behaves differently according to the class.
* This is related to the idea of **encapsulation**: the object interface is separated from its internal structure. In other words, the user doesn't need to worry about the details of an object. Encapsulation avoids spaghetti code (see [Toyota 2013 case](http://archive.wikiwix.com/cache/index2.php?url=https%3A%2F%2Fwww.usna.edu%2FAcResearch%2F_files%2Fdocuments%2FNASEC%2F2016%2FCYBER%2520-%2520Toyota%2520Unintended%2520Acceleration.pdf)).
* `R` has several OOP systems: S3, S4, R6, ...
* S3 is the first `R` OOP system; it is rather informal (easy to modify) and widespread.

---

# Functional programming

* Functional programming is a programming paradigm in which programs are built, under fairly strict rules, from simple functions that always return the same output for the same input (*pure* functions) and that do not modify existing data or state (*immutability*, or no side effects).
* Benefits are more maintainable, predictable, and scalable (parallel) code.
* There are several key concepts, including:
    + *Pure* function: always produces the same output for the same input and has no side effects.
    + *First-class* function: like any other data structure, functions can be passed as arguments to other functions, returned from other functions, and assigned to variables.
    + *Higher-order* function: functions that take one or more functions as arguments or return them as results.
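---

# Functional programming

A minimal sketch of the difference between a function with a side effect and a pure function. The names `counter`, `increment_counter` and `increment` are illustrative assumptions, not part of the lecture material.

``` r
# Impure: the result depends on, and modifies, a variable outside the function
counter <- 0
increment_counter <- function() {
  counter <<- counter + 1  # side effect: changes `counter` in the enclosing environment
  counter
}
increment_counter()  # returns 1
increment_counter()  # returns 2: same call, different output

# Pure: the result depends only on the argument and nothing else is modified
increment <- function(x) x + 1
increment(0)  # returns 1
increment(0)  # returns 1 again
```

The pure version can be called in any order, or in parallel, without changing its results.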
---

# Pure function

* A pure function always produces the same output for the same input.
* Is `rnorm` a *pure* function?

``` r
set.seed(123)
rnorm(10)
```

```
##  [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
##  [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197
```

---

# Pure function

* A pure function always produces the same output for the same input.
* Is `rnorm` a *pure* function?

``` r
set.seed(123)
rnorm(10)
```

```
##  [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
##  [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197
```

``` r
set.seed(124)
rnorm(10)
```

```
##  [1] -1.38507062  0.03832318 -0.76303016  0.21230614  1.42553797  0.74447982
##  [7]  0.70022940 -0.22935461  0.19709386  1.20715377
```

* Same argument (`10`), different output, because the result depends on the hidden state of the random number generator `\(\Rightarrow\)` `rnorm` is not a *pure* function.

---

# First-class function

* A function can be passed as an argument.

``` r
f <- function(g) g(rnorm(10))
f(sum)
```

```
## [1] -1.923609
```

``` r
f(max)
```

```
## [1] 1.675632
```

``` r
f(mean)
```

```
## [1] -0.182821
```

---

# First-class function

* A function can be returned from other functions.

``` r
# Define a function that returns another function
makeMultiplier <- function(factor) {
  # Define the inner function
  multiplier <- function(x) {
    return(x * factor)
  }
  # Return the inner function
  return(multiplier)
}

# Create a new function that multiplies by 5
timesFive <- makeMultiplier(5)

# Use the returned function
timesFive(10)
```

```
## [1] 50
```

- See [Function factories](https://adv-r.hadley.nz/function-factories.html).

---

# First-class function

* A function can be passed as an argument and returned from other functions.

``` r
# Define a function operator that takes a function as an argument
applyTwice <- function(func) {
  return(function(x) {
    return(func(func(x)))
  })
}

# Define a simple function
addTwo <- function(x) {
  return(x + 2)
}

# Use the function operator to create a new function
applyTwiceAddTwo <- applyTwice(addTwo)

# Apply the new function to a value
applyTwiceAddTwo(3)
```

```
## [1] 7
```

- See [Function operators](https://adv-r.hadley.nz/function-operators.html).

---

# Functional

- A *functional* is a higher-order function that takes a function as input and returns a vector (or list) as output.
- Functionals are frequently used in `R` as a more concise and expressive (though not necessarily faster) alternative to `for` loops.
- A `for` loop indicates iteration but not the specific operation to perform on each element, whereas functionals are specialized for specific tasks.
- Transitioning from `for` loops to functionals is often a matter of finding a functional that matches the basic structure of the loop.
- If there isn't an appropriate functional, it's advisable to stick with a `for` loop rather than trying to adapt an existing functional.
- After repeating the same loop several times, it might be worth creating a custom functional tailored to the task.
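---

# Functional

As a sketch of what a custom functional could look like, the hypothetical `roll_apply` below (a name and signature assumed for illustration only) wraps a loop over sliding windows so that the operation applied to each window becomes an argument.

``` r
# Hypothetical custom functional: apply `f` to each sliding window of width `k`
roll_apply <- function(x, k, f, ...) {
  n_windows <- length(x) - k + 1
  if (n_windows < 1) return(numeric(0))
  vapply(
    seq_len(n_windows),
    function(i) f(x[i:(i + k - 1)], ...),  # window starting at position i
    FUN.VALUE = numeric(1)                 # each window must yield one number
  )
}

x <- c(1, 3, 5, 7, 9)
roll_apply(x, k = 3, f = mean)  # rolling mean: 3 5 7
roll_apply(x, k = 2, f = max)   # rolling maximum: 3 5 7 9
```

The loop structure is written once, and only the function `f` changes from one use to the next.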
---

# Transitioning from `for` loops to functionals

.pull-left[

``` r
# Using a for loop to calculate the squares of numbers from 1 to n
n <- 5
result <- vector("list", n)
for (i in 1:n) {
  result[[i]] <- i^2
}
result
```

```
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
##
## [[4]]
## [1] 16
##
## [[5]]
## [1] 25
```
]

.pull-right[

``` r
# Using a functional approach with map
library(purrr)
sequence <- 1:n
squares <- map(sequence, ~ .^2)
squares
```

```
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
##
## [[4]]
## [1] 16
##
## [[5]]
## [1] 25
```
]

---

# Transitioning from `for` loops to functionals

.pull-left[

``` r
# Using a for loop to calculate the sum of numbers from 1 to n
n <- 5
result <- 0
for (i in 1:n) {
  result <- result + i
}
result
```

```
## [1] 15
```
]

.pull-right[

``` r
# Using a functional approach with Reduce
sequence <- 1:n
result <- Reduce(function(x, y) x + y, sequence)
result
```

```
## [1] 15
```
]

---

# `purrr::map()`

.pull-left[
- `map` takes a vector `v` and a function `f` as input and returns a list containing `f` evaluated at each element of `v`.

``` r
# Using map to calculate the exponential of a vector of numbers
map(1:2, exp)
```

```
## [[1]]
## [1] 2.718282
##
## [[2]]
## [1] 7.389056
```

``` r
# Or equivalently
lapply(1:2, exp)
```

```
## [[1]]
## [1] 2.718282
##
## [[2]]
## [1] 7.389056
```
]

.pull-right[
![](images/map.png)
]

---

# Returning atomic vectors

- `map` / `lapply` return a `list`, but you may want an atomic vector instead.
- For this task, use `map_lgl()`, `map_int()`, `map_dbl()`, `map_chr()` instead of `map`, and `vapply` or `sapply` instead of `lapply`.

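A short base `R` sketch of why the choice between `sapply` and `vapply` can matter: `sapply` guesses the output type from the results, whereas `vapply` enforces the declared one. The variable names are illustrative.

``` r
# `sapply` simplifies when it can ...
doubled <- sapply(1:3, function(i) i * 2)
doubled                      # numeric vector 2 4 6

# ... but falls back to a list when the input is empty
empty_sapply <- sapply(integer(0), function(i) i * 2)
class(empty_sapply)          # "list"

# `vapply` always returns the declared type, or stops with an error
empty_vapply <- vapply(integer(0), function(i) i * 2, FUN.VALUE = double(1))
class(empty_vapply)          # "numeric"
```

For code used inside functions or packages, the stricter `vapply` (or the typed `map_*()` variants) is usually the safer choice.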
.pull-left[
`purrr::map` approach

``` r
map_dbl(1:4, exp)
```

```
## [1]  2.718282  7.389056 20.085537 54.598150
```
]

.pull-right[
base `R` approach

``` r
sapply(1:4, exp)
```

```
## [1]  2.718282  7.389056 20.085537 54.598150
```

``` r
# Type of output must be specified for `vapply`
vapply(1:4, FUN=exp, FUN.VALUE=double(1))
```

```
## [1]  2.718282  7.389056 20.085537 54.598150
```
]

---

# Inline anonymous functions

There are situations where the function you would like to pass as an argument does not exist yet. Instead of defining and naming it separately, you can write it directly in the call as an _inline anonymous function_ (aka _lambda function_).

.pull-left[
`purrr::map` approach

``` r
map_int(1:4, function(x) if (x %% 2 == 0) return(x^2) else return(x^3))
```

```
## [1]  1  4 27 16
```

``` r
map_int(1:4, ~ if (.x %% 2 == 0) return(.x^2) else return(.x^3))
```

```
## [1]  1  4 27 16
```
]

.pull-right[
base `R` approach

``` r
sapply(1:4, function(x) if (x %% 2 == 0) return(x^2) else return(x^3))
```

```
## [1]  1  4 27 16
```

``` r
vapply(1:4, FUN=function(x) if (x %% 2 == 0) return(x^2) else return(x^3),
       FUN.VALUE=double(1))
```

```
## [1]  1  4 27 16
```
]

---

# Variants to `purrr::map()`

There are several variants of `map` in `purrr`. For example, `map2` iterates over two inputs at once. See [Map variants](https://adv-r.hadley.nz/functionals.html#map-variants) for others.

.pull-left[
`map2` takes two vectors `v1`, `v2` and a function `f` as input, plus some additional arguments, and returns a list containing `f` evaluated at each pair of elements of `v1` and `v2`.

``` r
# Using map2 to calculate weighted means
wt <- c(5, 5, 4, 1)/15
wtL <- list(wt1 = wt, wt2=wt, wt3 = wt)
x <- list(x1 = c(6, 4.5, 5, 4),
          x2 = c(5.5, 5, 4.5, 6),
          x3 = c(6, 6, 4, 4))
map2_dbl(x, wtL, weighted.mean)
```

```
##       x1       x2       x3
## 5.100000 5.100000 5.333333
```
]

.pull-right[
![](images/map2-arg.png)
]

---

# Variants to `purrr::map()`

- `pmap` generalizes `map` to any number of inputs.

.pull-left[

``` r
l1 <- as.list(1:3)
l2 <- as.list(4:6)
l3 <- as.list(7:9)

# Define a function that takes three arguments and calculates their sum
calculate_sum <- function(e1, e2, e3) e1 + e2 + e3

# Use pmap to apply the function element-wise to the lists
pmap(list(l1, l2, l3), calculate_sum)
```

```
## [[1]]
## [1] 12
##
## [[2]]
## [1] 15
##
## [[3]]
## [1] 18
```
]

.pull-right[
![](images/pmap-3.png)
]

---

# Variants to `sapply`

- Similar to `pmap`, `mapply` generalizes `sapply` to any number of inputs.
- There is also `Map`, which always returns a list, but it vectorizes over all arguments: it is not possible to supply extra non-vectorized input.

.pull-left[

``` r
# Using mapply to calculate weighted means
wt <- c(5, 5, 4, 1)/15
wtL <- list(wt1 = wt, wt2=wt, wt3 = wt)
x <- list(x1 = c(6, 4.5, 5, 4),
          x2 = c(5.5, 5, 4.5, 6),
          x3 = c(6, 6, 4, 4))
mapply(FUN = weighted.mean, x, wtL)
```

```
##       x1       x2       x3
## 5.100000 5.100000 5.333333
```
]

.pull-right[

``` r
# Using Map to calculate weighted means
wt <- c(5, 5, 4, 1)/15
wtL <- list(wt1 = wt, wt2=wt, wt3 = wt)
x <- list(x1 = c(6, 4.5, 5, 4),
          x2 = c(5.5, 5, 4.5, 6),
          x3 = c(6, 6, 4, 4))
Map(f = weighted.mean, x, wtL)
```

```
## $x1
## [1] 5.1
##
## $x2
## [1] 5.1
##
## $x3
## [1] 5.333333
```
]

---

# `outer` product

- `outer(X, Y, FUN, ...)` applies a vectorized `FUN` to every pair of elements of `X` and `Y` and returns an array (or matrix) with the same dimension as the outer product of `X` and `Y`.

``` r
outer(X = c("a","b","c"), Y = c("1", "2", "3", "4"), FUN = paste0)
```

```
##      [,1] [,2] [,3] [,4]
## [1,] "a1" "a2" "a3" "a4"
## [2,] "b1" "b2" "b3" "b4"
## [3,] "c1" "c2" "c3" "c4"
```

---

# Common Higher-Order functions in FPL

- *Higher-order* function: functions that take one or more functions as arguments or return them as results.
- `Reduce` uses a binary function `f` to successively combine the elements of a vector `x`, optionally starting from an initial value `init`.

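A brief sketch of the optional arguments of base `R`'s `Reduce`, namely `init` and `accumulate`. The variable names are illustrative.

``` r
# With `init`, the first call is f(init, x[1]) instead of f(x[1], x[2])
total <- Reduce(function(acc, y) acc + y, 1:4, init = 100)
total         # 110, i.e. 100 + 1 + 2 + 3 + 4

# `accumulate = TRUE` keeps every intermediate result
running_sum <- Reduce(`+`, 1:4, accumulate = TRUE)
running_sum   # 1 3 6 10

# `init` also defines the result for an empty input
empty_total <- Reduce(`+`, integer(0), init = 0)
empty_total   # 0
```

Supplying `init` is a simple way to make a reduction well defined when the input may be empty.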
.pull-left[

``` r
# Using a functional approach with Reduce
sequence <- 1:10
Reduce(function(x, y) x + y, sequence)
```

```
## [1] 55
```

``` r
# or shorter
Reduce(`+`, 1:10)
```

```
## [1] 55
```
]

.pull-right[
![](images/reduce.png)
]

---

# `R`'s vectorization

- Many functions are already _vectorized_ in `R`: they operate on all the elements of a vector at once, so no functional is needed.

.pull-left[

``` r
# Using map to calculate the exponential of a vector of numbers
map(1:2, exp)
```

```
## [[1]]
## [1] 2.718282
##
## [[2]]
## [1] 7.389056
```

``` r
# Or equivalently
lapply(1:2, exp)
```

```
## [[1]]
## [1] 2.718282
##
## [[2]]
## [1] 7.389056
```
]

.pull-right[

``` r
# Exp is already vectorized
exp(1:2)
```

```
## [1] 2.718282 7.389056
```
]

---

# Vectorizing a function

- There is a dedicated function `Vectorize` that vectorizes a function.

``` r
# return square if even, cube otherwise
# `purrr::map` approach
map_int(1:4, ~ if (.x %% 2 == 0) return(.x^2) else return(.x^3))
```

```
## [1]  1  4 27 16
```

``` r
f <- function(x) if (x %% 2 == 0) return(x^2) else return(x^3)
f(1:4)
```

```
## Error in if (x%%2 == 0) return(x^2) else return(x^3): the condition has length > 1
```

---

# Vectorizing a function

- There is a dedicated function `Vectorize` that vectorizes a function.

``` r
# return square if even, cube otherwise
f <- function(x) if (x %% 2 == 0) return(x^2) else return(x^3)
*vf <- Vectorize(FUN = f, vectorize.args = "x")
vf(1:4)
```

```
## [1]  1  4 27 16
```

---

# Vectorizing a function

- There is a dedicated function `Vectorize` that vectorizes a function.

``` r
# return square if even, cube otherwise
vf <- Vectorize(FUN = f, vectorize.args = "x")

# `ifelse` is a vectorized alternative to `if ... else`
g <- function(x) ifelse(x %% 2 == 0, x^2, x^3)
g(1:4)
```

```
## [1]  1  4 27 16
```

---

# Parallelism

* Benefits of FP are more maintainable, predictable, and scalable (*parallel*) code.
* Many problems are *embarrassingly parallel*: the task can be split with little (or no) effort into independent parallel subtasks.
* The `parallel` package comes with your `R` installation and offers parallelized versions of several `apply` functions.

``` r
# how many cores on the current host?
library(parallel)
detectCores()
```

```
## [1] 12
```

- This is the number of logical cores (hardware threads), not of physical cores. In short, hyper-threading allows one physical core to work on several threads simultaneously.

---

# Forking with `mclapply`

- `mclapply` is a parallelized version of `lapply` that uses forking (there are also `mcMap` and `mcmapply`).
- Forking is the Unix mechanism for creating a new child process that is an identical copy of the parent process, which allows several tasks to run concurrently. It is not available on Windows, where `mclapply` only works with `mc.cores = 1`.

``` r
measure_time <- function(x){
  t1 <- Sys.time()
  Sys.sleep(x)
  t2 <- Sys.time()
  difftime(t2,t1,units="secs")
}
```

---

# Forking with `mclapply`

``` r
t1 <- Sys.time()
mclapply(1:5, measure_time, mc.cores = 5)
```

```
## [[1]]
## Time difference of 1.001395 secs
##
## [[2]]
## Time difference of 2.002373 secs
##
## [[3]]
## Time difference of 3.002665 secs
##
## [[4]]
## Time difference of 4.004278 secs
##
## [[5]]
## Time difference of 5.005422 secs
```

``` r
t2 <- Sys.time()
sprintf("In total, it took %.1f seconds to run", difftime(t2,t1,units="secs"))
```

```
## [1] "In total, it took 5.0 seconds to run"
```

---

# Building a Socket Cluster with `parLapply`

- A socket enables communication between processes running concurrently on the computer. A socket cluster is an alternative to the forking mechanism that also works on Windows.

``` r
cl <- makeCluster(5)
t1 <- Sys.time()
parLapply(cl, 1:5, measure_time)
```

```
## [[1]]
## Time difference of 1.00124 secs
##
## [[2]]
## Time difference of 2.002342 secs
##
## [[3]]
## Time difference of 3.003498 secs
##
## [[4]]
## Time difference of 4.004308 secs
##
## [[5]]
## Time difference of 5.005362 secs
```

``` r
t2 <- Sys.time()
```

---

# Building a Socket Cluster with `parLapply`

``` r
stopCluster(cl)
sprintf("In total, it took %.1f seconds to run", difftime(t2,t1,units="secs"))
```

```
## [1] "In total, it took 5.0 seconds to run"
```

---

class: sydney-blue, center, middle

# Questions?

.pull-down[
<a href="https://ptds.samorso.ch/">
.white[website]
</a>

<a href="https://github.com/ptds2024/">
.white[GitHub]
</a>
]

---

# To go further

* See [Functionals](https://adv-r.hadley.nz/functionals.html), [Function factories](https://adv-r.hadley.nz/function-factories.html) and [Function operators](https://adv-r.hadley.nz/function-operators.html) chapters of [Advanced R](https://adv-r.hadley.nz/index.html) written by H. Wickham.
* See [`purrr` cheatsheet](https://maraaverick.rbind.io/banners/purrr_apply_cheatsheet_rstudio.png).
* See [Loop Functions](https://bookdown.org/rdpeng/rprogdatascience/loop-functions.html) and [Parallel Computation](https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html) chapters of [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/) written by R.D. Peng.
* See the article [Cleaner R Code with Functional Programming](https://towardsdatascience.com/cleaner-r-code-with-functional-programming-adc37931ef7a) by Tim Book.