Programming Tools in Data Science

class: center, middle, inverse, title-slide

.title[
# Programming Tools in Data Science
]
.subtitle[
## Lecture #4: Data Structures
]
.author[
### Samuel Orso
]
.date[
### 3 October 2024
]

---

# Data structures
<img src="images/lego-3388163_1280.png" width="1707" style="display: block; margin: auto;" />

---
# Primary types

- **Double** (or real): these are used to store real numbers. Examples: -4, 12.4532, 6.
- **Integer**: whole numbers written with the suffix `L`. Examples: `2L`, `12L`.
- **Logical** (or boolean); written in full `TRUE`, `FALSE`, or in short `T`, `F`.
- **Character** (or strings); examples: `"a"`, `"Bonjour"`.
- (**Complex** (complex number) and **Raw**)

Note that:
- Integer and double are both **numeric** types.
- Objects of all these types are **atomic vectors**, a fundamental data structure.

---
# Reserved words
* **Missing values**: `NA` (Not Available) exists in one variant per primary type: `NA_real_`, `NA_integer_`, `NA` (logical by default), `NA_complex_`, `NA_character_`.
* For real and complex numbers, `R` also has `Inf` for infinity and `NaN` (Not a Number).
* `NULL` for a null object.
* For control structures: `if`, `else`, `repeat`, `while`, `function`, `for`, `in`, `next`, `break`
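
As a minimal sketch of how these special values behave in practice (the vector `x` is an illustrative assumption, not part of the lecture data):

```r
# Each typed NA keeps its own primary type
typeof(NA_integer_)    # "integer"
typeof(NA_character_)  # "character"

# NA propagates through arithmetic; test for it with is.na(), not ==
x <- c(1, NA, 3)
is.na(x)               # FALSE TRUE FALSE
sum(x)                 # NA
sum(x, na.rm = TRUE)   # 4

# Inf and NaN arise from floating-point arithmetic
1 / 0                  # Inf
0 / 0                  # NaN
is.na(NaN)             # TRUE: NaN also counts as missing

# NULL is an object of length zero
length(NULL)           # 0
```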

---
# `typeof`
`typeof()` determines the (`R` internal) type or storage mode of any object.

```r
typeof(2)
```

```
## [1] "double"
```

```r
typeof("a")
```

```
## [1] "character"
```

```r
typeof(NA)
```

```
## [1] "logical"
```

```r
typeof(Inf)
```

```
## [1] "double"
```

---
# `is.type` primitive functions
* `is.type` functions test whether an object is of a given type
* There are `is.double`, `is.integer`, `is.numeric`, `is.logical`, `is.character`, `is.complex`, `is.na`, `is.nan`, `is.finite`, `is.infinite`, `is.null`, ...

```r
is.double(2L)
```

```
## [1] FALSE
```

```r
is.double(2)
```

```
## [1] TRUE
```

```r
is.numeric(2)
```

```
## [1] TRUE
```

---
# `as.type` primitive functions
* `as.type` functions coerce an object to a given type
* There are `as.double`, `as.integer`, `as.numeric`, `as.logical`, `as.character`, `as.complex`, `as.null`, ...

```r
typeof(2L)
```

```
## [1] "integer"
```

```r
typeof(as.double(2L))
```

```
## [1] "double"
```
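
A few more coercions, as a rough sketch of what to expect (the input values are illustrative assumptions). Note that a conversion that cannot succeed returns `NA` with a warning rather than an error:

```r
as.integer("12")                       # 12L: the string is parsed as an integer
as.integer(3.9)                        # 3L: the fractional part is truncated, not rounded
as.logical(c("TRUE", "false", "yes"))  # TRUE FALSE NA
as.numeric(c(TRUE, FALSE))             # 1 0
as.character(1.5)                      # "1.5"

# A failed conversion yields NA (the warning is silenced here on purpose)
suppressWarnings(as.numeric("abc"))    # NA
```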

---
# Data structures
* A data structure can be **homogeneous** (accepts only one primary type) or **heterogeneous** (accepts multiple primary types).

| Dimension | Homogeneous | Heterogeneous |   
| --- | --- | --- |   
| 1 | Vector | List |   
| 2 | Matrix | Data Frame |   
| n | Array | |

---
# Example
| Name | Date of Birth | Place of Birth | Place of Residence | ATP Ranking | Prize Money | Win Percentage | Grand Slam Wins | Pro Since |   
| --- | --- | --- | --- | --- | --- | --- | --- | --- |   
| Novak Djokovic | 22 May 1987 | Belgrade, Serbia | Monte Carlo, Monaco | 1 | 135,259,120 | 82.7 | 16 | 2003 |   
| Rafael Nadal | 3 June 1986 | Manacor, Spain | Manacor, Spain | 2 | 115,178,858 | 83.1 | 19 | 2001 |   
| Roger Federer | 8 August 1981 | Basel, Switzerland | Wollerau, Switzerland | 3 | 126,840,700 | 82.2 | 20 | 1998 |   
| Daniil Medvedev | 11 February 1996 | Moscow, Russia | Monte Carlo, Monaco | 4 | 8,171,483 | 63.6 | 0 | 2014 |   
| Dominic Thiem | 3 September 1993 | Wiener Neustadt, Austria | Lichtenwörth, Austria | 5 | 18,588,662 | 64.3 | 0 | 2011 |

---
# Vector
A vector has three important properties:

- The **Type** of objects contained in the vector (`typeof`)
- The **Length** of a vector indicates the number of elements in the vector. This information can be obtained using the function `length()`.
- **Attributes** are metadata attached to a vector. The functions `attr()` and `attributes()` can be used to store and retrieve attributes.

---
# Creating vectors with `c()`
* `c()` is a generic function that combines arguments separated by `,` to form a vector.

```r
c(1, 2, 8, 10)
```

```
## [1]  1  2  8 10
```

* You can assign a value to a name via the `<-` operator

```r
grand_slam_win <- c(16, 19, 20, 0, 0)
grand_slam_win
```

```
## [1] 16 19 20  0  0
```

```r
typeof(grand_slam_win)
```

```
## [1] "double"
```

---
# Creating a vector with `vector()`
* `vector` produces a vector of given length and mode (primary type)

```r
grand_slam_win <- vector(mode = "integer", length = 5)
grand_slam_win
```

```
## [1] 0 0 0 0 0
```

---
# Subsetting
There are four main ways to subset a vector:
* **Positive Index**: *access* or *subset* the `$i$`-th element of a vector by simply using `grand_slam_win[i]`, where `$i$` is a positive integer between 1 and the length of the vector.

```r
grand_slam_win <- vector(mode = "integer", length = 5)
grand_slam_win[1] <- 16
grand_slam_win[2] <- 19
grand_slam_win[3] <- 20
grand_slam_win
```

```
## [1] 16 19 20  0  0
```

or equivalently

```r
grand_slam_win <- vector(mode = "integer", length = 5)
grand_slam_win[1:3] <- c(16, 19, 20)
grand_slam_win
```

```
## [1] 16 19 20  0  0
```

---
# Subsetting
* **Negative Index**: *remove* elements from a vector using negative indices.
* **Logical Indices**: supply a logical vector of the same length as the vector, with `TRUE` to keep an element and `FALSE` to drop it.

```r
grand_slam_win <- vector(mode = "integer", length = 5)
grand_slam_win[c(TRUE, TRUE, TRUE, FALSE, FALSE)] <- c(16, 19, 20)
grand_slam_win
```

```
## [1] 16 19 20  0  0
```
* **Names Index**: use names (or labels) instead of positive numbers.
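
The negative and names indices can be sketched as follows. The shortened player labels are an assumption made for brevity, and `names()` is the base `R` function that attaches labels to a vector:

```r
grand_slam_win <- c(16, 19, 20, 0, 0)
# Hypothetical short labels, one per element
names(grand_slam_win) <- c("Djokovic", "Nadal", "Federer", "Medvedev", "Thiem")

# Negative indices drop the listed positions
grand_slam_win[-c(4, 5)]               # keeps Djokovic, Nadal and Federer

# Names select elements by label, in the order requested
grand_slam_win[c("Federer", "Nadal")]  # 20 19

# A comparison builds the logical index for you
grand_slam_win[grand_slam_win > 0]     # the three players with at least one win
```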

---
# Coercion for homogeneous data structure
* A vector is a homogeneous data structure, meaning that it can only contain a single primary type.
* If multiple primary types are provided, all elements are coerced to the most flexible type according to the following rule:

`\begin{equation*}
 \text{logical} < \text{integer} < \text{double} < \text{character}
\end{equation*}`

```r
# Logical + double
(mix_logic_int <- c(TRUE, 12, 0.5))
```

```
## [1]  1.0 12.0  0.5
```

```r
typeof(mix_logic_int)
```

```
## [1] "double"
```
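
The other steps of the coercion chain can be checked the same way; this is a small sketch with arbitrary example values:

```r
typeof(c(TRUE, 2L))        # "integer": logical is promoted to integer
typeof(c(2L, 0.5))         # "double": integer is promoted to double
typeof(c(0.5, "a"))        # "character": everything can become a string
c(TRUE, 2L, 0.5, "a")      # "TRUE" "2" "0.5" "a"

# A handy consequence: logicals count as 0/1 in arithmetic
sum(c(TRUE, FALSE, TRUE))  # 2
```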

---
# Attributes
* Attributes are used to store metadata on the object of interest.
* Individual attributes can be accessed and set via `attr()`

```r
attr(grand_slam_win, "date") <- "09-30-2019"
attr(grand_slam_win, "type") <- "Men, Singles"
```
* Attributes can be accessed and set all at once with `attributes` and `structure`

```r
attributes(grand_slam_win)
```

```
## $date
## [1] "09-30-2019"
## 
## $type
## [1] "Men, Singles"
```
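
As a sketch of the `structure()` route, the object below is created with both attributes in a single call (the name `wins` is an illustrative assumption). It also shows that custom attributes are usually dropped by subsetting and by most computations:

```r
# Create the vector and attach two attributes at once
wins <- structure(c(16, 19, 20), date = "09-30-2019", type = "Men, Singles")

attr(wins, "type")        # "Men, Singles"
names(attributes(wins))   # "date" "type"

# Custom attributes do not survive subsetting or summaries
attributes(wins[1])       # NULL
attributes(sum(wins))     # NULL
```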

---
# Adding labels
* You can add labels to a vector using `names`

```r
names(grand_slam_win) <- c("Novak Djokovic", "Rafael Nadal", "Roger Federer", "Daniil Medvedev", "Dominic Thiem")
attributes(grand_slam_win)
```

```
## $date
## [1] "09-30-2019"
## 
## $type
## [1] "Men, Singles"
## 
## $names
## [1] "Novak Djokovic"  "Rafael Nadal"    "Roger Federer"   "Daniil Medvedev"
## [5] "Dominic Thiem"
```
* Avoid using `attr(grand_slam_win, "names")` as it is more error-prone
* You can unname using `unname(grand_slam_win)` or `names(grand_slam_win) <- NULL`

---
# Working with dates
* The `as.Date()` function converts character strings into dates. The typical syntax is of the form:

```r
as.Date(<vector of dates>, format = <your format>)
```

For example

```r
(players_dob <- as.Date(c("22 May 1987", "3 Jun 1986", "8 Aug 1981", 
                         "11 Feb 1996", "3 Sep 1993"), 
                       format = "%d %b %Y"))
```

```
## [1] "1987-05-22" "1986-06-03" "1981-08-08" "1996-02-11" "1993-09-03"
```
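
Once converted, dates support formatting and arithmetic. The sketch below uses ISO-formatted strings (an assumption chosen because `"%Y-%m-%d"` is the default and does not depend on the locale, whereas month names matched by `%b` do):

```r
dob <- as.Date(c("1987-05-22", "1986-06-03"))  # ISO format needs no `format` argument

class(dob)                   # "Date"
typeof(dob)                  # "double": stored as days since 1970-01-01
format(dob, "%d/%m/%Y")      # "22/05/1987" "03/06/1986"

# Subtracting two dates gives a time difference in days
as.numeric(dob[1] - dob[2])  # 353
```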

---
# Useful Functions with Vectors

* Common functions for vectors are `length()`, `sum()`, `mean()`, `order()` and `sort()`.

```r
length(grand_slam_win)
```

```
## [1] 5
```

```r
sum(grand_slam_win)
```

```
## [1] 55
```

```r
mean(grand_slam_win)
```

```
## [1] 11
```

---

```r
order(grand_slam_win)
```

```
## [1] 4 5 1 2 3
```

```r
sort(grand_slam_win)
```

```
## Daniil Medvedev   Dominic Thiem  Novak Djokovic    Rafael Nadal   Roger Federer 
##               0               0              16              19              20
```
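
The difference between the two is that `order()` returns positions while `sort()` returns values. A small sketch, using two parallel vectors with shortened labels (an assumption for brevity), shows why the positions are useful:

```r
players <- c("Djokovic", "Nadal", "Federer", "Medvedev", "Thiem")  # hypothetical short labels
wins    <- c(16, 19, 20, 0, 0)

idx <- order(wins)   # 4 5 1 2 3: positions of the elements in increasing order
wins[idx]            # 0 0 16 19 20, the same values as sort(wins)
players[idx]         # the companion vector reordered consistently
```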

---
# Creating sequences

Common ways to create sequences:

- **`from:to`**: This method is quite intuitive and very compact. For example:

```r
1:3
```

```
## [1] 1 2 3
```

```r
(x <- 3:1)
```

```
## [1] 3 2 1
```

```r
(y <- -1:-4)
```

```
## [1] -1 -2 -3 -4
```

---

- **`seq_len(n)`**: This function provides a simple way to generate a sequence from 1 to an arbitrary number `n`. Unlike `1:n`, it returns an empty vector when `n` is 0. A few examples:

```r
n <- 3
1:n
```

```
## [1] 1 2 3
```

```r
seq_len(n)
```

```
## [1] 1 2 3
```

```r
n <- 0
1:n
```

```
## [1] 1 0
```

```r
seq_len(n)
```

```
## integer(0)
```
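
This matters most inside loops. In the sketch below (the counter names are illustrative assumptions), the `1:n` loop runs twice even though `n` is 0, while the `seq_len(n)` loop correctly does nothing:

```r
n <- 0

count_colon <- 0
for (i in 1:n) count_colon <- count_colon + 1
count_colon   # 2: the loop ran for i = 1 and i = 0

count_seq <- 0
for (i in seq_len(n)) count_seq <- count_seq + 1
count_seq     # 0: the loop body was never executed
```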

---
- **`seq(a, b, by/length.out = d)`**: This function can be used to create more "complex" sequences. It can either be used to create a sequence from `a` to `b` by increments of `d` (using the option `by`) or of a total length of `d` (using the option `length.out`).

```r
(x <- seq(1, 2.8, by = 0.4))
```

```
## [1] 1.0 1.4 1.8 2.2 2.6
```

```r
(y <- seq(1, 2.8, length.out = 6))
```

```
## [1] 1.00 1.36 1.72 2.08 2.44 2.80
```

---
- **`rep()`**: create vectors with repeated values or sequences, for example:

```r
rep(c(1,2), times = 3, each = 1)
```

```
## [1] 1 2 1 2 1 2
```

```r
rep(c(1,2), times = 1, each = 3)
```

```
## [1] 1 1 1 2 2 2
```

```r
rep(c(1,2), times = 2, each = 2)
```

```
## [1] 1 1 2 2 1 1 2 2
```

---
# Exercise
1. Let `x <- 3 * seq_len(4)`. Select the 2nd element of `x` with (a) positive, (b) negative and (c) logical indices.
2. Sort `x` in descending order using (a) a sequence of positive indices and (b) the `sort` function.

---
# Matrix
<center>
<iframe src="https://giphy.com/embed/sULKEgDMX8LcI" width="480" height="225" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/sci-fi-matrix-cyberpunk-sULKEgDMX8LcI">via GIPHY</a></p></center>

---
# Matrix

* A matrix is a common data structure in `R`. It has two dimensions and stores a single (homogeneous) primary type.
* The `matrix()` function can be used to create a matrix from a vector:

```r
(M <- matrix(1:12, ncol = 4,  nrow = 3))
```

```
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
```

* Several vectors can be combined by row with `rbind()` or by column with `cbind()`.
* Names can be assigned to columns with `colnames()` and to rows with `rownames()`.
* Subsetting is similar to vectors, with one index per dimension: `M[rows, cols]`
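
A brief sketch of these three points, using two small illustrative vectors `a` and `b` and a new matrix `P` (all hypothetical names):

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

(P <- rbind(a, b))               # 2 x 3: each vector becomes a row
dim(cbind(a, b))                 # 3 2: each vector becomes a column

colnames(P) <- c("x", "y", "z")  # rows are already named "a" and "b"
P["b", "y"]                      # 5: subset by names
P[1, ]                           # first row, returned as a named vector
P[, 2:3]                         # all rows, columns 2 and 3
```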

---
# Common matrix operations
* Verify `M` is a matrix: `is.matrix(M)`
* Return the dimensions of `M`: `dim(M)` or `nrow(M)` and `ncol(M)`
* Matrix transpose: `t(M)`
* Elementwise multiplication (assume identical dimensions): `A * B`
* Matrix multiplication (assume conformable dimensions): `A %*% B`
* Matrix inverse (assume `M` is square and invertible): `solve(M)`
* Column/row means and sums: `colMeans(M)`, `rowMeans(M)`, `colSums(M)`, `rowSums(M)`
* Extract the diagonal elements: `diag(M)`
* Set the diagonal elements: `diag(M) <- c(...)`
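
Some of these operations on a small invertible matrix, as a minimal sketch (the matrices `A` and `B` are illustrative assumptions):

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # filled column by column
B <- diag(2)                          # 2 x 2 identity matrix

A * B           # elementwise product: keeps only the diagonal of A
A %*% B         # matrix product: equals A

A_inv <- solve(A)
A %*% A_inv     # identity matrix, up to rounding error

dim(t(matrix(1:6, nrow = 2)))  # 3 2: the transpose swaps the dimensions
colSums(A)      # 3 4
diag(A)         # 2 3
```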

---
# List
* A list is a flexible and common **heterogeneous** data structure (contains different object types).

```r
# List elements
num_vec <- c(188, 140)
char_vec <- c("Height", "Weight", "Length")
logic_vec <- rep(TRUE, 8)
my_mat <- matrix(seq_len(10), nrow = 2, ncol = 5)

# List initialization 
(my_list <- list(num_vec, char_vec, logic_vec, my_mat))
```

```
## [[1]]
## [1] 188 140
## 
## [[2]]
## [1] "Height" "Weight" "Length"
## 
## [[3]]
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 
## [[4]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

---
It is possible to add named labels to the elements of your list. For example:

```r
# List initialization with custom names 
(my_list <- list(number = num_vec, character = char_vec, 
                 logic = logic_vec, matrix = my_mat))
```

```
## $number
## [1] 188 140
## 
## $character
## [1] "Height" "Weight" "Length"
## 
## $logic
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 
## $matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

---
# Subsetting
Subsetting is very similar to what we have already discussed. Indeed, we have

```r
# Extract the first and third element 
my_list[c(1, 3)]
```

```
## $number
## [1] 188 140
## 
## $logic
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
```

```r
# Compare these two subsets  
my_list[1]
```

```
## $number
## [1] 188 140
```

```r
my_list[[1]]
```

```
## [1] 188 140
```

---
# Subsetting

```r
# Using labels to subset 
my_list$number
```

```
## [1] 188 140
```

```r
my_list$matrix
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

```r
my_list$matrix[,3] # Extract third column of matrix 
```

```
## [1] 5 6
```

---
# Single `[` or double `[[`?
* Single `[` keeps the `list` class

```r
# type of object 
typeof(my_list[1])
```

```
## [1] "list"
```

* Double `[[` simplifies to the class of the element

```r
is.matrix(my_list[4])
```

```
## [1] FALSE
```

```r
is.matrix(my_list[[4]])
```

```
## [1] TRUE
```

---
# Data frame
* A data frame is a **heterogeneous** data structure used for storing data tables.

```r
### Creation
players <- c("Novak Djokovic", "Rafael Nadal", "Roger Federer",
             "Daniil Medvedev", "Dominic Thiem")
grand_slam_win <- c(16, 19, 20, 0, 0) 
date_of_birth <- c("22 May 1987", "3 June 1986", "8 August 1981",
                   "11 February 1996", "3 September 1993")
tennis <- data.frame(date_of_birth, grand_slam_win)
rownames(tennis) <- players  # use the player names as row labels
tennis
```

```
##                    date_of_birth grand_slam_win
## Novak Djokovic       22 May 1987             16
## Rafael Nadal         3 June 1986             19
## Roger Federer      8 August 1981             20
## Daniil Medvedev 11 February 1996              0
## Dominic Thiem   3 September 1993              0
```

---
# Data frame
* Each column has its own type. We can check that `tennis` is indeed a data frame by using:

```r
is.data.frame(tennis)
```

```
## [1] TRUE
```

* We can also verify the structure

```r
str(tennis)
```

```
## 'data.frame':	5 obs. of  2 variables:
##  $ date_of_birth : chr  "22 May 1987" "3 June 1986" "8 August 1981" "11 February 1996" ...
##  $ grand_slam_win: num  16 19 20 0 0
```

---
# Combination

* Data frames can be combined using `cbind()` (column bind) and `rbind()` (row bind):

```r
height <- c(1.88, 1.85, 1.85, 1.98, 1.85)
hand <- c("R","L","R","R","R")

(tennis <- cbind(tennis, data.frame(height, hand)))
```

```
##                    date_of_birth grand_slam_win height hand
## Novak Djokovic       22 May 1987             16   1.88    R
## Rafael Nadal         3 June 1986             19   1.85    L
## Roger Federer      8 August 1981             20   1.85    R
## Daniil Medvedev 11 February 1996              0   1.98    R
## Dominic Thiem   3 September 1993              0   1.85    R
```
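
`rbind()` works the same way for rows, provided both data frames have the same column names. A minimal sketch with two small illustrative data frames (`df1` and `df2` are hypothetical names):

```r
df1 <- data.frame(player = c("Novak Djokovic", "Rafael Nadal"),
                  grand_slam_win = c(16, 19))
df2 <- data.frame(player = "Roger Federer", grand_slam_win = 20)

# Stack the rows of df2 below those of df1
(both <- rbind(df1, df2))
nrow(both)   # 3
```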

---
# Subsetting

* A data frame can be subset like a matrix or like a list.

```r
# There are two ways to select columns from a data frame
# Like a list:
tennis[c("height", "date_of_birth")]
```

```
##                 height    date_of_birth
## Novak Djokovic    1.88      22 May 1987
## Rafael Nadal      1.85      3 June 1986
## Roger Federer     1.85    8 August 1981
## Daniil Medvedev   1.98 11 February 1996
## Dominic Thiem     1.85 3 September 1993
```

```r
# Like a matrix
tennis[, c("height", "date_of_birth")]
```
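
Rows and single columns follow the same logic. The sketch below rebuilds a reduced version of the table so that it runs on its own (passing the names through `row.names` is a choice made here for brevity):

```r
tennis <- data.frame(
  grand_slam_win = c(16, 19, 20, 0, 0),
  height = c(1.88, 1.85, 1.85, 1.98, 1.85),
  row.names = c("Novak Djokovic", "Rafael Nadal", "Roger Federer",
                "Daniil Medvedev", "Dominic Thiem")
)

tennis$height               # one column as a plain vector (list style)
tennis[["height"]]          # same result
tennis["Roger Federer", ]   # one row, still a data frame (matrix style)

# Logical row index combined with a column name
tennis[tennis$grand_slam_win > 0, "height"]   # 1.88 1.85 1.85
```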

---
# Tibble
- A tibble is a modern alternative to the data frame championed by H. Wickham. He describes it, with a bit of humor, as a *lazy and surly* data frame.
- A tibble is lazy because it never coerces its input. For instance, it does not transform non-syntactic names:

```r
library(tibble) # you need to install an extra library (or package)
names(data.frame(`1` = 1:3, `?` = 4:6))
```

```
## [1] "X1" "X."
```

```r
names(tibble(`1` = 1:3, `?` = 4:6))
```

```
## [1] "1" "?"
```
---
- A tibble is surly: for example, it does not automatically recycle shorter columns (unless they have length 1), whereas a data frame does so when the longest column length is an integer multiple of the shorter one:

```r
data.frame(a = 1:4, b = 1:2)
```

```
##   a b
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
```

```r
tibble(a = 1:4, b = 1:2)
```

```
## Error in `tibble()`:
## ! Tibble columns must have compatible sizes.
## • Size 4: Existing data.
## • Size 2: Column `b`.
## ℹ Only values of size one are recycled.
```
---
- A tibble allows you to refer to previously defined columns during creation:

```r
tibble(a = 1:3, b = 2 + 2 * a)
```

```
## # A tibble: 3 × 2
##       a     b
##   <int> <dbl>
## 1     1     4
## 2     2     6
## 3     3     8
```
---
- A last notable difference: by default, a tibble prints only the first 10 rows:

```r
set.seed(1)
tibble(x = rnorm(1e3), y = rexp(1e3))
```

```
## # A tibble: 1,000 × 2
##         x      y
##     <dbl>  <dbl>
##  1 -0.626 0.601 
##  2  0.184 0.750 
##  3 -0.836 0.167 
##  4  1.60  2.21  
##  5  0.330 0.0553
##  6 -0.820 0.727 
##  7  0.487 0.583 
##  8  0.738 0.198 
##  9  0.576 3.54  
## 10 -0.305 4.01  
## # ℹ 990 more rows
```

---
# To go further
* More details and examples in the book [An Introduction to Statistical Programming Methods with R](https://smac-group.github.io/ds/section-data.html)
* More advanced remarks in [Advanced R](https://adv-r.hadley.nz/index.html), Chapters 2, 3 and 4.

---
class: sydney-blue, center, middle

# Questions?

.pull-down[
<a href="https://ptds.samorso.ch/">
.white[<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg">  <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48z"></path></svg> website]
</a>

<a href="https://github.com/ptds2024/">
.white[<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg">  <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> GitHub]
</a>
]