class: center, middle, inverse, title-slide

.title[
# Programming Tools in Data Science
]
.subtitle[
## Lecture #4: Data Structure
]
.author[
### Samuel Orso
]
.date[
### 3 October 2024
]

---

# Data structures

<img src="images/lego-3388163_1280.png" width="1707" style="display: block; margin: auto;" />

---

# Primary types

- **Double** (or real): used to store real numbers. Examples: `-4`, `12.4532`, `6`.
- **Integer**: whole numbers, written with the suffix `L`. Examples: `2L`, `12L`.
- **Logical** (or boolean): written in full as `TRUE`, `FALSE`, or in short as `T`, `F`.
- **Character** (or string): examples: `"a"`, `"Bonjour"`.
- (**Complex** (complex numbers) and **Raw**)

Note that:

- Integer and double are both **numeric**.
- Values of all these types are stored as **atomic vectors**, a fundamental data structure.

---

# Reserved words

* **Missing values**: `NA` (Not Available) comes in several primary types: `NA_real_`, `NA_integer_`, `NA` (logical by default), `NA_complex_`, `NA_character_`.
* For real and complex numbers, `R` also has `Inf` for infinity and `NaN` (Not a Number).
* `NULL` for a null object.
* For control structures: `if`, `else`, `repeat`, `while`, `function`, `for`, `in`, `next`, `break`

---

# `typeof`

`typeof()` returns the (`R` internal) type or storage mode of any object.

```r
typeof(2)
```

```
## [1] "double"
```

```r
typeof("a")
```

```
## [1] "character"
```

```r
typeof(NA)
```

```
## [1] "logical"
```

```r
typeof(Inf)
```

```
## [1] "double"
```

---

# `is.type` primitive functions

* The `is.type` functions test whether an object is of a given type.
* There are `is.double`, `is.integer`, `is.numeric`, `is.logical`, `is.character`, `is.complex`, `is.na`, `is.nan`, `is.finite`, `is.infinite`, `is.null`, ...
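As a small illustration of the last few predicates (a sketch using a made-up vector `x`, not part of the original slides): `is.na()` is `TRUE` for both `NA` and `NaN`, whereas `is.nan()` is `TRUE` only for `NaN`.

```r
# Hypothetical vector mixing an ordinary number, NA, NaN and Inf
x <- c(1, NA, NaN, Inf)

is.na(x)        # FALSE  TRUE  TRUE FALSE: NA and NaN both count as missing
is.nan(x)       # FALSE FALSE  TRUE FALSE: only NaN
is.finite(x)    #  TRUE FALSE FALSE FALSE: only ordinary numbers
is.infinite(x)  # FALSE FALSE FALSE  TRUE: only Inf (or -Inf)
is.null(NULL)   # TRUE
```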
```r
is.double(2L)
```

```
## [1] FALSE
```

```r
is.double(2)
```

```
## [1] TRUE
```

```r
is.numeric(2)
```

```
## [1] TRUE
```

---

# `as.type` primitive functions

* The `as.type` functions coerce an object to a given type.
* There are `as.double`, `as.integer`, `as.numeric`, `as.logical`, `as.character`, `as.complex`, ...

```r
typeof(2L)
```

```
## [1] "integer"
```

```r
typeof(as.double(2L))
```

```
## [1] "double"
```

---

# Data structures

* A data structure can be **homogeneous** (accepts only one primary type) or **heterogeneous** (accepts multiple primary types).

| Dimension | Homogeneous | Heterogeneous |
| --- | --- | --- |
| 1 | Vector | List |
| 2 | Matrix | Data Frame |
| n | Array | |

---

# Example

| Name | Date of Birth | Place of Birth | Place of Residence | ATP Ranking | Prize Money | Win Percentage | Grand Slam Wins | Pro Since |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Novak Djokovic | 22 May 1987 | Belgrade, Serbia | Monte Carlo, Monaco | 1 | 135,259,120 | 82.7 | 16 | 2003 |
| Rafael Nadal | 3 June 1986 | Manacor, Spain | Manacor, Spain | 2 | 115,178,858 | 83.1 | 19 | 2001 |
| Roger Federer | 8 August 1981 | Basel, Switzerland | Wollerau, Switzerland | 3 | 126,840,700 | 82.2 | 20 | 1998 |
| Daniil Medvedev | 11 February 1996 | Moscow, Russia | Monte Carlo, Monaco | 4 | 8,171,483 | 63.6 | 0 | 2014 |
| Dominic Thiem | 3 September 1993 | Wiener Neustadt, Austria | Lichtenwörth, Austria | 5 | 18,588,662 | 64.3 | 0 | 2011 |

---

# Vector

A vector has three important properties:

- The **Type** of the objects contained in the vector, obtained with `typeof()`.
- The **Length** of the vector, i.e. the number of elements it contains, obtained with `length()`.
- **Attributes**, which are metadata attached to the vector. The functions `attr()` and `attributes()` can be used to store and retrieve attributes.
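
A minimal sketch of these three properties, using a hypothetical vector `heights` and a made-up attribute name `"unit"` (neither appears elsewhere in the slides):

```r
# Hypothetical example vector
heights <- c(1.88, 1.85, 1.98)

typeof(heights)      # type of the elements: "double"
length(heights)      # number of elements: 3
attributes(heights)  # NULL: a plain vector carries no attributes

# Attach one piece of metadata, then read it back
attr(heights, "unit") <- "m"
attr(heights, "unit")  # "m"
```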

---

# Creating vectors with `c()`

* `c()` is a generic function that combines arguments separated by `,` to form a vector.

```r
c(1, 2, 8, 10)
```

```
## [1]  1  2  8 10
```

* You can assign a value to a name via the `<-` operator

```r
grand_slam_win <- c(16, 19, 20, 0, 0)
grand_slam_win
```

```
## [1] 16 19 20  0  0
```

```r
typeof(grand_slam_win)
```

```
## [1] "double"
```

---

# Creating a vector with `vector()`

* `vector()` produces a vector of a given length and mode (primary type)

```r
grand_slam_win <- vector(mode = "integer", length = 5)
grand_slam_win
```

```
## [1] 0 0 0 0 0
```

---

# Subsetting

There are four main ways to subset a vector:

* **Positive Index**: *access* or *subset* the `\(i\)`-th element of a vector by simply using `grand_slam_win[i]` where `\(i\)` is a positive number between 1 and the length of the vector.

```r
grand_slam_win <- vector(mode = "integer", length = 5)
grand_slam_win[1] <- 16
grand_slam_win[2] <- 19
grand_slam_win[3] <- 20
grand_slam_win
```

```
## [1] 16 19 20  0  0
```

or equivalently

```r
grand_slam_win <- vector(mode = "integer", length = 5)
grand_slam_win[1:3] <- c(16, 19, 20)
grand_slam_win
```

```
## [1] 16 19 20  0  0
```

---

# Subsetting

* **Negative Index**: *remove* elements in a vector using negative indices.
* **Logical Indices**: give a logical vector of the same length, with `TRUE` to select an element and `FALSE` otherwise.

```r
grand_slam_win <- vector(mode = "integer", length = 5)
grand_slam_win[c(TRUE, TRUE, TRUE, FALSE, FALSE)] <- c(16, 19, 20)
grand_slam_win
```

```
## [1] 16 19 20  0  0
```

* **Names Index**: use names (or labels) instead of positive numbers.

---

# Coercion for homogeneous data structure

* A vector is a homogeneous data structure, meaning that it can only contain a single primary type.
* If multiple primary types are provided, then the following coercion rule applies:

`\begin{equation*} \text{logical} < \text{integer} < \text{double} < \text{character} \end{equation*}`

```r
# Logical + double
(mix_logic_int <- c(TRUE, 12, 0.5))
```

```
## [1]  1.0 12.0  0.5
```

```r
typeof(mix_logic_int)
```

```
## [1] "double"
```

---

# Attributes

* Attributes are used to store metadata on the object of interest.
* Individual attributes can be accessed and set via `attr()`

```r
attr(grand_slam_win, "date") <- "09-30-2019"
attr(grand_slam_win, "type") <- "Men, Singles"
```

* Attributes can be accessed and set all at once with `attributes` and `structure`

```r
attributes(grand_slam_win)
```

```
## $date
## [1] "09-30-2019"
## 
## $type
## [1] "Men, Singles"
```

---

# Adding labels

* You can add labels to a vector using `names`

```r
names(grand_slam_win) <- c("Novak Djokovic", "Rafael Nadal",
                           "Roger Federer", "Daniil Medvedev",
                           "Dominic Thiem")
attributes(grand_slam_win)
```

```
## $date
## [1] "09-30-2019"
## 
## $type
## [1] "Men, Singles"
## 
## $names
## [1] "Novak Djokovic"  "Rafael Nadal"    "Roger Federer"   "Daniil Medvedev"
## [5] "Dominic Thiem"
```

* Avoid using `attr(grand_slam_win, "names")` as it is more error-prone
* You can remove the names using `unname(grand_slam_win)` or `names(grand_slam_win) <- NULL`

---

# Working with dates

* The `as.Date()` function converts character strings into dates.
The typical syntax is of the form:

```r
as.Date(<vector of dates>, format = <your format>)
```

For example

```r
(players_dob <- as.Date(c("22 May 1987", "3 Jun 1986", "8 Aug 1981",
                          "11 Feb 1996", "3 Sep 1993"),
                        format = "%d %b %Y"))
```

```
## [1] "1987-05-22" "1986-06-03" "1981-08-08" "1996-02-11" "1993-09-03"
```

---

# Useful Functions with Vectors

* Common functions for vectors are `length()`, `sum()`, `mean()`, `order()` and `sort()`

```r
length(grand_slam_win)
```

```
## [1] 5
```

```r
sum(grand_slam_win)
```

```
## [1] 55
```

```r
mean(grand_slam_win)
```

```
## [1] 11
```

---

```r
order(grand_slam_win)
```

```
## [1] 4 5 1 2 3
```

```r
sort(grand_slam_win)
```

```
## Daniil Medvedev   Dominic Thiem  Novak Djokovic    Rafael Nadal   Roger Federer
##               0               0              16              19              20
```

---

# Creating sequences

Common ways to create sequences:

- **`from:to`**: This method is quite intuitive and very compact. For example:

```r
1:3
```

```
## [1] 1 2 3
```

```r
(x <- 3:1)
```

```
## [1] 3 2 1
```

```r
(y <- -1:-4)
```

```
## [1] -1 -2 -3 -4
```

---

- **`seq_len(n)`**: This function provides a simple way to generate a sequence from 1 to an arbitrary number `n`. Unlike `1:n`, it returns an empty vector when `n` is 0. A few examples:

```r
n <- 3
1:n
```

```
## [1] 1 2 3
```

```r
seq_len(n)
```

```
## [1] 1 2 3
```

```r
n <- 0
1:n
```

```
## [1] 1 0
```

```r
seq_len(n)
```

```
## integer(0)
```

---

- **`seq(a, b, by/length.out = d)`**: This function can be used to create more "complex" sequences. It can either be used to create a sequence from `a` to `b` by increments of `d` (using the option `by`) or of a total length of `d` (using the option `length.out`).
```r
(x <- seq(1, 2.8, by = 0.4))
```

```
## [1] 1.0 1.4 1.8 2.2 2.6
```

```r
(y <- seq(1, 2.8, length.out = 6))
```

```
## [1] 1.00 1.36 1.72 2.08 2.44 2.80
```

---

- **`rep()`**: create vectors with repeated values or sequences, for example:

```r
rep(c(1,2), times = 3, each = 1)
```

```
## [1] 1 2 1 2 1 2
```

```r
rep(c(1,2), times = 1, each = 3)
```

```
## [1] 1 1 1 2 2 2
```

```r
rep(c(1,2), times = 2, each = 2)
```

```
## [1] 1 1 2 2 1 1 2 2
```

---

# Exercise

1. Let `x <- 3 * seq_len(4)`. Select the 2nd element of `x` with (a) positive, (b) negative and (c) logical indices.
2. Sort `x` in descending order using (a) a sequence of positive indices and (b) the `sort` function.

---

# Matrix

<center>
<iframe src="https://giphy.com/embed/sULKEgDMX8LcI" width="480" height="225" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/sci-fi-matrix-cyberpunk-sULKEgDMX8LcI">via GIPHY</a></p></center>

---

# Matrix

* A matrix is a common data structure in `R`. It has two dimensions and stores a single (homogeneous) primary type.
* The `matrix()` function can be used to create a matrix from a vector:

```r
(M <- matrix(1:12, ncol = 4, nrow = 3))
```

```
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
```

* Several vectors can be combined by row with `rbind` or by column with `cbind`.
* Names can be assigned by column with `colnames` and by row with `rownames`.
* Subsetting is similar to vectors: `M[rows, cols]`

---

# Common matrix operations

* Verify `M` is a matrix: `is.matrix(M)`
* Return the dimensions of `M`: `dim(M)` or `nrow(M)` and `ncol(M)`
* Matrix transpose: `t(M)`
* Elementwise multiplication (assume matching dimensions): `A * B`
* Matrix multiplication (assume conformable dimensions): `A %*% B`
* Matrix inverse (assume `M` is square and invertible): `solve(M)`
* Column/row means and sums: `colMeans(M)`, `rowMeans(M)`, `colSums(M)`, `rowSums(M)`
* Extract diagonal elements: `diag(M)`
* Set diagonal elements: `diag(M) <- c(...)`

---

# List

* A list is a flexible and common **heterogeneous** data structure (it can contain objects of different types).

```r
# List elements
num_vec <- c(188, 140)
char_vec <- c("Height", "Weight", "Length")
logic_vec <- rep(TRUE, 8)
my_mat <- matrix(seq_len(10), nrow = 2, ncol = 5)

# List initialization
(my_list <- list(num_vec, char_vec, logic_vec, my_mat))
```

```
## [[1]]
## [1] 188 140
## 
## [[2]]
## [1] "Height" "Weight" "Length"
## 
## [[3]]
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 
## [[4]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

---

It is possible to name the elements of your list. For example:

```r
# List initialization with custom names
(my_list <- list(number = num_vec, character = char_vec,
                 logic = logic_vec, matrix = my_mat))
```

```
## $number
## [1] 188 140
## 
## $character
## [1] "Height" "Weight" "Length"
## 
## $logic
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 
## $matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

---

# Subsetting

Subsetting is very similar to what we have already discussed.
Indeed, we have

```r
# Extract the first and third element
my_list[c(1, 3)]
```

```
## $number
## [1] 188 140
## 
## $logic
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
```

```r
# Compare these two subsets
my_list[1]
```

```
## $number
## [1] 188 140
```

```r
my_list[[1]]
```

```
## [1] 188 140
```

---

# Subsetting

```r
# Using labels to subset
my_list$number
```

```
## [1] 188 140
```

```r
my_list$matrix
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

```r
my_list$matrix[,3] # Extract third column of matrix
```

```
## [1] 5 6
```

---

# Single `[` or double `[[`?

* Single `[` keeps the `list` class

```r
# type of object
typeof(my_list[1])
```

```
## [1] "list"
```

* Double `[[` simplifies to the class of the element

```r
is.matrix(my_list[4])
```

```
## [1] FALSE
```

```r
is.matrix(my_list[[4]])
```

```
## [1] TRUE
```

---

# Dataframe

* A data frame is a **heterogeneous** data structure used for storing data tables.

```r
### Creation
players <- c("Novak Djokovic", "Rafael Nadal", "Roger Federer",
             "Daniil Medvedev", "Dominic Thiem")
grand_slam_win <- c(16, 19, 20, 0, 0)
date_of_birth <- c("22 May 1987", "3 June 1986", "8 August 1981",
                   "11 February 1996", "3 September 1993")
tennis <- data.frame(date_of_birth, grand_slam_win)
dimnames(tennis)[[1]] <- players
tennis
```

```
##                    date_of_birth grand_slam_win
## Novak Djokovic       22 May 1987             16
## Rafael Nadal         3 June 1986             19
## Roger Federer      8 August 1981             20
## Daniil Medvedev 11 February 1996              0
## Dominic Thiem   3 September 1993              0
```

---

# Dataframe

* Each column has its own type. We can check if we have achieved our goal by using:

```r
is.data.frame(tennis)
```

```
## [1] TRUE
```

* We can also verify the structure

```r
str(tennis)
```

```
## 'data.frame': 5 obs. of 2 variables:
## $ date_of_birth : chr "22 May 1987" "3 June 1986" "8 August 1981" "11 February 1996" ...
## $ grand_slam_win: num 16 19 20 0 0
```

---

# Combination

* Data frames can be combined using `cbind()` (column bind) and `rbind()` (row bind):

```r
height <- c(1.88, 1.85, 1.85, 1.98, 1.85)
hand <- c("R","L","R","R","R")
(tennis <- cbind(tennis, data.frame(height, hand)))
```

```
##                    date_of_birth grand_slam_win height hand
## Novak Djokovic       22 May 1987             16   1.88    R
## Rafael Nadal         3 June 1986             19   1.85    L
## Roger Federer      8 August 1981             20   1.85    R
## Daniil Medvedev 11 February 1996              0   1.98    R
## Dominic Thiem   3 September 1993              0   1.85    R
```

---

# Subsetting

* A data frame can be subset like a matrix or like a list.

```r
# There are two ways to select columns from a data frame
# Like a list:
tennis[c("height", "date_of_birth")]
```

```
##                 height    date_of_birth
## Novak Djokovic    1.88      22 May 1987
## Rafael Nadal      1.85      3 June 1986
## Roger Federer     1.85    8 August 1981
## Daniil Medvedev   1.98 11 February 1996
## Dominic Thiem     1.85 3 September 1993
```

```r
# Like a matrix
tennis[, c("height", "date_of_birth")]
```

```
##                 height    date_of_birth
## Novak Djokovic    1.88      22 May 1987
## Rafael Nadal      1.85      3 June 1986
## Roger Federer     1.85    8 August 1981
## Daniil Medvedev   1.98 11 February 1996
## Dominic Thiem     1.85 3 September 1993
```

---

# Tibble

- Tibble is a modern alternative to the data frame championed by H. Wickham. He describes it (with a bit of humor) as a *lazy and surly* data frame.
- Tibble is lazy because it never coerces its input. For instance, it does not transform non-syntactic names:

```r
library(tibble) # you need to install an extra library (or package)
names(data.frame(`1` = 1:3, `?` = 4:6))
```

```
## [1] "X1" "X."
```

```r
names(tibble(`1` = 1:3, `?` = 4:6))
```

```
## [1] "1" "?"
```

---

- Tibble is surly: for example, it does not automatically recycle shorter columns (unless they are of length 1), whereas a data frame does so when the longer column's length is an integer multiple of the shorter one's:

```r
data.frame(a = 1:4, b = 1:2)
```

```
##   a b
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
```

```r
tibble(a = 1:4, b = 1:2)
```

```
## Error in `tibble()`:
## ! Tibble columns must have compatible sizes.
## • Size 4: Existing data.
## • Size 2: Column `b`.
## ℹ Only values of size one are recycled.
```

---

- Tibble allows you to refer to previously defined variables during creation:

```r
tibble(a = 1:3, b = 2 + 2 * a)
```

```
## # A tibble: 3 × 2
##       a     b
##   <int> <dbl>
## 1     1     4
## 2     2     6
## 3     3     8
```

---

- A last notable difference: by default, a tibble prints only its first 10 rows:

```r
set.seed(1)
tibble(x = rnorm(1e3), y = rexp(1e3))
```

```
## # A tibble: 1,000 × 2
##         x      y
##     <dbl>  <dbl>
##  1 -0.626 0.601
##  2  0.184 0.750
##  3 -0.836 0.167
##  4  1.60  2.21
##  5  0.330 0.0553
##  6 -0.820 0.727
##  7  0.487 0.583
##  8  0.738 0.198
##  9  0.576 3.54
## 10 -0.305 4.01
## # ℹ 990 more rows
```

<!-- --- -->
<!-- # Exercise -->
<!-- Using the following code: -->
<!-- ```{r, eval=F} -->
<!-- set.seed(1) -->
<!-- A <- matrix(rnorm(20), ncol = 2) -->
<!-- B <- matrix(rnorm(20), ncol = 2) -->
<!-- ``` -->
<!-- 1. What are the dimensions of `\(A\)` and `\(B\)`? Compute `\(A^TB\)` and `\(AB^T\)`. -->
<!-- 2. Combine `\(A\)` and `\(B\)` row-wise to create `\(C\)`. -->
<!-- 3. Let `\(D\)` be a copy of `\(C\)` centered around the mean -->
<!-- columnwise. -->
<!-- The unbiased estimator of the covariance matrix of -->
<!-- `\(C\)` is defined as `$$\frac{1}{n-1}D^TD,$$` -->
<!-- where `\(n\)` is the number of rows of `\(D\)`. -->
<!-- Compute this quantity. -->
<!-- Compare with `cov(C)`.
--> --- # To go further * More details and examples in the book [An Introduction to Statistical Programming Methods with R](https://smac-group.github.io/ds/section-data.html) * More advanced remarks in [Advanced R](https://adv-r.hadley.nz/index.html), Chapters 2, 3 and 4. --- class: sydney-blue, center, middle # Question ? .pull-down[ <a href="https://ptds.samorso.ch/"> .white[<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48z"></path></svg> website] </a> <a href="https://github.com/ptds2024/"> .white[<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 
2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> GitHub] </a> ]