+ - 0:00:00
Notes for current slide
Notes for next slide

R programming, a gentle introduction

M1 IREF, M1 ROAD

1 / 63

Laurent Bergé

BxSE, University of Bordeaux

Fall 2022

Data management:
  Example

2 / 63

Problem: Herfindahl

Question

Does high tech employment becomes more geographically concentrated over time?

The file “High_tech_Employment_Eurostat.txt” contains information on the spatial distribution of high technology employment in Europe for several years.

The aim of this exercise is to compute the Herfindahl index for each country and year.

For a country c and a year y, the Herfindahl measures the level of spatial concentration of an activity and is defined as:

hc,y=i=1ncsc,y,i2, with sc,y,i the share of national employment of country c in year y located in region i.

An Herfindahl of 1 would mean that all employment is concentrated in only one region.

  1. Import the data, and keep only observations at the NUTS2 level (roughly the regional level). NUTS2 level is a 4-characters geographical code of the form: XXYY with XX the country code and YY the NUTS2 code (e.g. FR17). Figure out and deal with data issues.
  2. Compute the Herfindahl index for each country and year.
  3. Plot the evolution of the Herfindahl for the EU5 (DE, BE, IT, FR, NL).
3 / 63

Data cleaning

library(data.table)
base_raw = fread("_DATA/_RAW/High_tech_Employment_Eurostat.txt")
# Selection & renaming
base_emp = base_raw[nchar(geo) == 4,
.(nuts2 = geo,
year,
emp = Employment_highTech)]
# we zeroe the NAs (ad hoc choice, maybe wrong)
base_emp[is.na(emp), emp := 0]
# we create the country
base_emp[, country := substr(nuts2, 1, 2)]
# total emp per country-year
base_emp[, emp_total_cy := sum(emp), by = .(country, year)]
# share
base_emp[, share := emp / emp_total_cy]
# the final Herfindahl DB
base_herf = base_emp[, .(herf = sum(share ** 2)),
by = .(country, year, emp_total = emp_total_cy)]
4 / 63

Graph

library(ggplot2)
ctry = c("FR", "BE", "DE", "IT", "NL")
ggplot(base_herf[country %in% ctry],
aes(x = year, y = herf, col = country)) +
geom_line() +
geom_point() +
theme_minimal()
5 / 63

Data management with R

6 / 63

Data management

All you need is....

data.table

7 / 63

Outline: data.table

  • import
  • subset
  • create simple variables
  • create complex variables
  • combine data sets
8 / 63

Importation: data.table

You can import rectangular data sets with fread:

library(data.table)
base_ht_emp = fread("data/High_tech_Employment_Eurostat.txt")
head(base_ht_emp)
#> geo year Employment_highTech
#> 1: AT 2008 161.0
#> 2: AT1 2008 80.6
#> 3: AT11 2008 3.5
#> 4: AT12 2008 34.1
#> 5: AT13 2008 43.0
#> 6: AT2 2008 32.8
9 / 63

Importation: data.table

You can import rectangular data sets with fread:

library(data.table)
base_ht_emp = fread("data/High_tech_Employment_Eurostat.txt")
head(base_ht_emp)
#> geo year Employment_highTech
#> 1: AT 2008 161.0
#> 2: AT1 2008 80.6
#> 3: AT11 2008 3.5
#> 4: AT12 2008 34.1
#> 5: AT13 2008 43.0
#> 6: AT2 2008 32.8

The function fread is nice because:

  • it guesses column types (based on a large random sample of rows)
  • it guesses the delimiter
  • it is fast
9 / 63

Data management with
data.table

10 / 63

Data management with data.table

Pros:

  • memory efficient
  • very fast
  • compact syntax
11 / 63

Data management with data.table

Pros:

  • memory efficient
  • very fast
  • compact syntax

Cons:

  • annoying startup message
  • non-explicit intricate syntax
  • non-standard R syntax
11 / 63

Data management with data.table

Pros:

  • memory efficient
  • very fast
  • compact syntax

Cons:

  • annoying startup message
  • non-explicit intricate syntax
  • non-standard R syntax

You can't have one without the other!

12 / 63

Creating a data.table

library(data.table)
dt = data.table(id = c("Mum", "Dad"), value = 1:2)
dt
#> id value
#> 1: Mum 1
#> 2: Dad 2
13 / 63

Creating a data.table

library(data.table)
dt = data.table(id = c("Mum", "Dad"), value = 1:2)
dt
#> id value
#> 1: Mum 1
#> 2: Dad 2

From existing data.frames:

df = iris[, 1:3] # new DF
dt = df
setDT(dt) # change the type to DT without recreating it in memory
dt = as.data.table(iris) # creates a copy
13 / 63

data.table syntax

DT[i, j, by]

i: row index   j: column index/values   by: aggregation

14 / 63

data.table syntax

DT[i, j, by]

i: row index   j: column index/values   by: aggregation

You can do most of what you need in data management with just that!

14 / 63

Difference with DF: Row selection

In a data.table, you don't need a comma after the index to select rows (differently from data.frames).

set.seed(1)
df = iris[sample(150, 5), c(1,5)]
dt = as.data.table(df)
# dt[i] is +/- equivalent to df[i, ]
dt[1]
#> Sepal.Length Species
#> 1: 5.8 versicolor
# look at the difference
head(df[1])
#> Sepal.Length
#> 68 5.8
#> 129 6.4
#> 43 4.4
#> 14 4.3
#> 51 7.0
15 / 63

Difference with DF: Column selection

Inside the brackets of a data.table, you need not use character vectors to reference column names:

# works => NON STANDARD R
dt[, Species]
#> [1] versicolor virginica setosa setosa versicolor
#> Levels: setosa versicolor virginica
# does not work => error message
df[, Species]
#> Error in `[.data.frame`(df, , Species): object 'Species' not found
16 / 63

Difference with DF: Ordering

Ordering the DT: use the names of the variables directly:

  • order(var_name_1, var_name_2, etc): use directly variable names
  • order(var_name_1, -var_name_2): put a minus in front of a variable to have a decreasing order
17 / 63

Difference with DF: Ordering

Ordering the DT: use the names of the variables directly:

  • order(var_name_1, var_name_2, etc): use directly variable names
  • order(var_name_1, -var_name_2): put a minus in front of a variable to have a decreasing order

Example

dt = as.data.table(iris)
head(dt[order(Sepal.Length)], 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 4.3 3.0 1.1 0.1 setosa
#> 2: 4.4 2.9 1.4 0.2 setosa
#> 3: 4.4 3.0 1.3 0.2 setosa
# decreasing order
head(dt[order(Species, -Sepal.Length)], 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.8 4.0 1.2 0.2 setosa
#> 2: 5.7 4.4 1.5 0.4 setosa
#> 3: 5.7 3.8 1.7 0.3 setosa
17 / 63

Difference with DF: Subsetting

Subsetting: use the variables names directly!

dt = as.data.table(iris)
head(dt[Species == "setosa" & Sepal.Length == 5.1])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 5.1 3.5 1.4 0.3 setosa
#> 3: 5.1 3.8 1.5 0.3 setosa
#> 4: 5.1 3.7 1.5 0.4 setosa
#> 5: 5.1 3.3 1.7 0.5 setosa
#> 6: 5.1 3.4 1.5 0.2 setosa

It is equivalent to:

df = iris
head(df[df$Species == "setosa" & df$Sepal.Length == 5.1, ])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 18 5.1 3.5 1.4 0.3 setosa
#> 20 5.1 3.8 1.5 0.3 setosa
#> 22 5.1 3.7 1.5 0.4 setosa
#> 24 5.1 3.3 1.7 0.5 setosa
#> 40 5.1 3.4 1.5 0.2 setosa
18 / 63

Creating new variables

19 / 63

Creating new variables: Canonical way

To create new variables, the syntax is:

dt[, c("vector", "of", "names") := list(stuff1, stuff2, stuff3)]

dt = as.data.table(iris)
dt[, c("id", "x") := list(1:nrow(dt), Sepal.Length**2)]
head(dt, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species id x
#> 1: 5.1 3.5 1.4 0.2 setosa 1 26.01
#> 2: 4.9 3.0 1.4 0.2 setosa 2 24.01
#> 3: 4.7 3.2 1.3 0.2 setosa 3 22.09

Keep in mind that this syntax is data.table specific.

20 / 63

Creating new variables: Too long syntax?

The canonical syntax to create new variables may seem too long:

dt[, c("id", "x") := list(1:nrow(dt), Sepal.Length**2)]
21 / 63

Creating new variables: Too long syntax?

The canonical syntax to create new variables may seem too long:

dt[, c("id", "x") := list(1:nrow(dt), Sepal.Length**2)]
Good news

data.table allows for shorthands!

21 / 63

Creating new variables: Shorthands

Starters:

  • you can replace list(stuff) by .(stuff): yes, a point is a function

  • there is an internal variable, which can be summoned with .N, reporting the number of observations

22 / 63

Creating new variables: Shorthands

Starters:

  • you can replace list(stuff) by .(stuff): yes, a point is a function

  • there is an internal variable, which can be summoned with .N, reporting the number of observations

Canonical call:

dt[, c("id", "x") := list(1:nrow(dt),
Sepal.Length**2)]

Use . for list:

dt[, c("id", "x") := .(1:nrow(dt),
Sepal.Length**2)]

Use .N:

dt[, c("id", "x") := .(1:.N,
Sepal.Length**2)]
22 / 63

Creating new variables: Shorthands

When there is only one variable to be created:

  • you can avoid the quotes in the left side
  • you can avoid the list in the right side
23 / 63

Creating new variables: Shorthands

When there is only one variable to be created:

  • you can avoid the quotes in the left side
  • you can avoid the list in the right side

Canonical call:

dt[, "id" := list(1:nrow(dt))]

Remove quotes:

dt[, id := list(1:nrow(dt))]

Use .N:

dt[, id := list(1:.N)]

Remove list:

dt[, id := 1:.N]
23 / 63

Creating new variables: Clumsy?

When you have 3+ variables to create, this syntax may be error prone since you need to match the order of the left side to the order of the right side.

dt[, c("id", "x") := .(1:.N, Sepal.Length**2)]
24 / 63

Creating new variables: Clumsy?

When you have 3+ variables to create, this syntax may be error prone since you need to match the order of the left side to the order of the right side.

dt[, c("id", "x") := .(1:.N, Sepal.Length**2)]
Good news

data.table allows to create variables differently!

24 / 63

Creating new variables: ":=" is an operator!

Question

Is the following code legit?

"+"(5, 3)
25 / 63

Creating new variables: ":=" is an operator!

Question

Is the following code legit?

"+"(5, 3)
Answer

Sure it is!

"+"(5, 3)
#> [1] 8

Weird, right?

25 / 63

Operators

26 / 63

Operators

The values +, *, ^, :, and := (and many others) are operators.

They are special symbols which require values on the left and on the right to work.

Each of them is in fact associated to a regular function which is defined in a regular way.

Something more or less of the form:

operator = function(a, b){ etc }

27 / 63

Operators

The values +, *, ^, :, and := (and many others) are operators.

They are special symbols which require values on the left and on the right to work.

Each of them is in fact associated to a regular function which is defined in a regular way.

Something more or less of the form:

operator = function(a, b){ etc }

So operators are in fact regular functions that you can summon like any other function!

27 / 63

Operators: Example

Let's create an operator that keeps the first n letters of a word.

"%k%" = function(x, n){
substr(x, 1, n)
}
"bonjour" %k% 3
#> [1] "bon"
28 / 63

Operators: Example

Let's create an operator that keeps the first n letters of a word.

"%k%" = function(x, n){
substr(x, 1, n)
}
"bonjour" %k% 3
#> [1] "bon"

And since they're no different from regular functions, you can summon them just like any function they're just quoted to avoid a parsing error:

"%k%"("bonjour", 3)
#> [1] "bon"
28 / 63

Creating variables: Functional form

You can use the function ":=" to create variables in the following way:

Canonical call:

dt[, c("id", "x") :=
list(1:nrow(dt), Sepal.Length**2)]

Functional call:

dt[, ":="(id = 1:.N,
x = Sepal.Length**2)]
29 / 63

Creating variables: Partial modification

Q: Say you want to modify a variable, but only for some observations. How do you do?

30 / 63

Creating variables: Partial modification

Q: Say you want to modify a variable, but only for some observations. How do you do?

A: Yes, you've got it! Just use the argument i.

Let's trim large values of Sepal.Length:

dt = as.data.table(iris)
summary(dt$Sepal.Length)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 4.300 5.100 5.800 5.843 6.400 7.900
cutoff = mean(dt$Sepal.Length) + 1.5 * sd(dt$Sepal.Length)
dt[Sepal.Length >= cutoff, Sepal.Length := cutoff]
summary(dt$Sepal.Length)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 4.300 5.100 5.800 5.812 6.400 7.085
30 / 63

Creating variables: The function set

The function set is specific to data.table and allows to create variables more programmatically.

The syntax is:

set(dt, i = NULL, j, value)

31 / 63

Creating variables: The function set

The function set is specific to data.table and allows to create variables more programmatically.

The syntax is:

set(dt, i = NULL, j, value)

Create a new variable:

dt = data.table(id = letters[1:3],
x = 1:3)
set(dt, j = "x2", value = dt$x ** 2)
dt
#> id x x2
#> 1: a 1 1
#> 2: b 2 4
#> 3: c 3 9

Note that you cannot use variables by reference any more, you need explicit values (here dt$x).

Modify an existing variable:

dt = data.table(id = letters[1:3],
x = 1:3)
set(dt, c(1L, 3L), j = "id",
value = c("aa", "cc"))
dt
#> id x
#> 1: aa 1
#> 2: b 2
#> 3: cc 3

Note that 1L, 3L means 1 and 3 in integer format (otherwise data.table complains).

31 / 63

Selecting variables

32 / 63

Extraction: Basic way

The most basic way is to insert a character vector of variables' names:

dt = as.data.table(iris)
head(dt[, c("Species", "Sepal.Width")], 2)
#> Species Sepal.Width
#> 1: setosa 3.5
#> 2: setosa 3.0
33 / 63

Extraction: Basic way

The most basic way is to insert a character vector of variables' names:

dt = as.data.table(iris)
head(dt[, c("Species", "Sepal.Width")], 2)
#> Species Sepal.Width
#> 1: setosa 3.5
#> 2: setosa 3.0
Very weird trick alert

You can negate character strings!

head(dt[, -c("Species", "Sepal.Width")], 2)
#> Sepal.Length Petal.Length Petal.Width
#> 1: 5.1 1.4 0.2
#> 2: 4.9 1.4 0.2
33 / 63

Extraction: Usual way

You can extract and modify the columns of a data.table by using:

dt[, list(newvar1 = fun(oldvars), etc)]

dt = as.data.table(iris)
new_dt = dt[, list(x = Sepal.Length, id = 1:nrow(dt))]
new_dt[1]
#> x id
#> 1: 5.1 1

As usual, shorthands apply:

new_dt = dt[, .(x = Sepal.Length, id = 1:.N)]
new_dt[1]
#> x id
#> 1: 5.1 1
34 / 63

Extraction: Simplification

When you select just one variable, you can omit the list.

This leads to return a vector.

head(dt[, Sepal.Length])
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4
35 / 63

Extraction: Simplification

When you select just one variable, you can omit the list.

This leads to return a vector.

head(dt[, Sepal.Length])
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4

If you want to return a data.table, use a list:

head(dt[, .(Sepal.Length)])
#> Sepal.Length
#> 1: 5.1
#> 2: 4.9
#> 3: 4.7
#> 4: 4.6
#> 5: 5.0
#> 6: 5.4
35 / 63

Using variables' names in vectors

Question

Does this work?

var = c("Species", "Sepal.Width")
dt[, var]
36 / 63

Using variables' names in vectors

Question

Does this work?

var = c("Species", "Sepal.Width")
dt[, var]
Answer

It does not.

var = c("Species", "Sepal.Width")
dt[, var]
#> Error in `[.data.table`(dt, , var) : ]
#> j (the 2nd argument inside ) is a single symbol but column name 'var' is not found. etc
36 / 63

Using variables' names in vectors

Question

Does this work?

var = c("Species", "Sepal.Width")
dt[, var]
Answer

It does not.

var = c("Species", "Sepal.Width")
dt[, var]
#> Error in `[.data.table`(dt, , var) : ]
#> j (the 2nd argument inside ) is a single symbol but column name 'var' is not found. etc

Q: How to make it work?

36 / 63

Base R's with mechanism

In R, you use the function with to attach the variables of a list (usually a data set) to the "search path", so that it works as if the variables of that data set were in the current environment.

37 / 63

Base R's with mechanism

In R, you use the function with to attach the variables of a list (usually a data set) to the "search path", so that it works as if the variables of that data set were in the current environment.

Does not work because the variables ex1 and ex2 were never created: so we rightfully get an error.

my_list = list(ex1 = 3, ex2 = 5)
ex1 + ex2
#> Error in eval(expr, envir, enclos): object 'ex1' not found
37 / 63

Base R's with mechanism

In R, you use the function with to attach the variables of a list (usually a data set) to the "search path", so that it works as if the variables of that data set were in the current environment.

Does not work because the variables ex1 and ex2 were never created: so we rightfully get an error.

my_list = list(ex1 = 3, ex2 = 5)
ex1 + ex2
#> Error in eval(expr, envir, enclos): object 'ex1' not found

Now works, because we ask the variables ex1 and ex2 to be searched within the list my_list.

my_list = list(ex1 = 3, ex2 = 5)
with(my_list, ex1 + ex2)
#> [1] 8
37 / 63

Base R's with mechanism: data.table

Link to data.table

What data.table does is identical to this with mechanism: the variables' names are directly fetched in the data.

Guess what: the operator [.data.table has a with argument.

38 / 63

Using the argument with in data.table

Let's use the argument with and see what it does:

dt = as.data.table(iris)
var = c("Species", "Sepal.Width")
head(dt[, var, with = FALSE])
#> Species Sepal.Width
#> 1: setosa 3.5
#> 2: setosa 3.0
#> 3: setosa 3.2
#> 4: setosa 3.1
#> 5: setosa 3.6
#> 6: setosa 3.9

Yeah it works!

39 / 63

The with argument

But as usual, there are shorthands. Just add two points before the name of the variable to indicate to data.table that the variable is not to be fetched in the data:

var = c("Species", "Sepal.Width")
head(dt[, ..var])
#> Species Sepal.Width
#> 1: setosa 3.5
#> 2: setosa 3.0
#> 3: setosa 3.2
#> 4: setosa 3.1
#> 5: setosa 3.6
#> 6: setosa 3.9
40 / 63

The with argument

But as usual, there are shorthands. Just add two points before the name of the variable to indicate to data.table that the variable is not to be fetched in the data:

var = c("Species", "Sepal.Width")
head(dt[, ..var])
#> Species Sepal.Width
#> 1: setosa 3.5
#> 2: setosa 3.0
#> 3: setosa 3.2
#> 4: setosa 3.1
#> 5: setosa 3.6
#> 6: setosa 3.9
To note

This is a super weird syntax trick, absolutely specific to data.table!

Explaining why it works is out of the scope of this course.1

40 / 63

Using with when creating variables

Remember the syntax to create variables? We can also use with = FALSE.

Does not work because var is not defined in dt:

dt = data.table(x = 1:3)
var = c("x2", "x3")
dt[, var := .(x**2, x**3)]
#> Error in `[.data.table`(dt, , `:=`(var, .(x^2, x^3))): Supplied 2 items to be assigned to 3 items of column 'var'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

41 / 63

Using with when creating variables

Remember the syntax to create variables? We can also use with = FALSE.

Does not work because var is not defined in dt:

dt = data.table(x = 1:3)
var = c("x2", "x3")
dt[, var := .(x**2, x**3)]
#> Error in `[.data.table`(dt, , `:=`(var, .(x^2, x^3))): Supplied 2 items to be assigned to 3 items of column 'var'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

Now works with with = FALSE, although deprecated:

dt = data.table(x = 1:3)
var = c("x2", "x3")
dt[, var := .(x**2, x**3), with = FALSE]
#> Warning in `[.data.table`(dt, , `:=`(var, .(x^2, x^3)), with = FALSE):
#> with=FALSE together with := was deprecated in v1.9.4 released Oct 2014. Please
#> wrap the LHS of := with parentheses; e.g., DT[,(myVar):=sum(b),by=a] to assign
#> to column name(s) held in variable myVar. See ?':=' for other examples. As
#> warned in 2014, this is now a warning.
dt
#> x x2 x3
#> 1: 1 1 1
#> 2: 2 4 8
#> 3: 3 9 27

41 / 63

Using with when creating variables

Remember the syntax to create variables? We can also use with = FALSE.

Does not work because var is not defined in dt:

dt = data.table(x = 1:3)
var = c("x2", "x3")
dt[, var := .(x**2, x**3)]
#> Error in `[.data.table`(dt, , `:=`(var, .(x^2, x^3))): Supplied 2 items to be assigned to 3 items of column 'var'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

And with the shorthand:

dt = data.table(x = 1:3)
var = c("x2", "x3")
dt[, (var) := .(x**2, x**3)]
dt
#> x x2 x3
#> 1: 1 1 1
#> 2: 2 4 8
#> 3: 3 9 27
42 / 63

Aggregation

43 / 63

Aggregation

What?

Applying operations by groups of observations is one of the most common data tasks:

  • average/min/max per country
  • whatever you want per whatever grouping
44 / 63

Aggregation

What?

Applying operations by groups of observations is one of the most common data tasks:

  • average/min/max per country
  • whatever you want per whatever grouping

How?

This can be done exquisitely easily in data.table:

Just use the argument by!

44 / 63

Aggregating data by groups

To aggregate data according to an identifier:

dt[, list(newvar1 = fun(oldvars), etc), by = list(identifiers)]

dt = as.data.table(iris)
agg_dt = dt[, list(mean_sl = mean(Sepal.Length),
n_obs = length(Sepal.Length)),
by = list(Species)]
agg_dt
#> Species mean_sl n_obs
#> 1: setosa 5.006 50
#> 2: versicolor 5.936 50
#> 3: virginica 6.588 50
45 / 63

Aggregating data: Shorthands

Shorthands (as usual):

  • list() .()
  • if only one variable in by=, you can omit list()
  • .N: the number of observations in the group
46 / 63

Aggregating data: Shorthands

Shorthands (as usual):

  • list() .()
  • if only one variable in by=, you can omit list()
  • .N: the number of observations in the group

Canonical call:

dt[, list(mean_sl = mean(Sepal.Length),
n_obs = length(Sepal.Length)),
by = list(Species)]

Using . for list:

dt[, .(mean_sl = mean(Sepal.Length),
n_obs = length(Sepal.Length)),
by = .(Species)]

Removing the list in by:

dt[, .(mean_sl = mean(Sepal.Length),
n_obs = length(Sepal.Length)),
by = Species]

Using .N:

dt[, .(mean_sl = mean(Sepal.Length),
n_obs = .N),
by = Species]
46 / 63

Aggregate: Exercize

The following data contains trade values in euros from exporting (Origin) countries to importing (Destination) countries.

# install.packages("fixest")
data(trade, package = "fixest")
head(trade)
#> Destination Origin Product Year dist_km Euros
#> 1 LU BE 1 2007 139.5719 2966697
#> 2 BE LU 1 2007 139.5719 6755030
#> 3 LU BE 2 2007 139.5719 57078782
#> 4 BE LU 2 2007 139.5719 7117406
#> 5 LU BE 3 2007 139.5719 17379821
#> 6 BE LU 3 2007 139.5719 2622254
47 / 63

Aggregate: Exercize

The following data contains trade values in euros from exporting (Origin) countries to importing (Destination) countries.

# install.packages("fixest")
data(trade, package = "fixest")
head(trade)
#> Destination Origin Product Year dist_km Euros
#> 1 LU BE 1 2007 139.5719 2966697
#> 2 BE LU 1 2007 139.5719 6755030
#> 3 LU BE 2 2007 139.5719 57078782
#> 4 BE LU 2 2007 139.5719 7117406
#> 5 LU BE 3 2007 139.5719 17379821
#> 6 BE LU 3 2007 139.5719 2622254
Exercize

Create the table containing the yearly total exportations for each exporting country.

47 / 63

Creating aggregate variables

Now say you want to create a new variable:

  • the max Petal.Length for each variety of flower
  • but you want to keep the same number of rows! In other words, you want the original data set with just an extra variable.
48 / 63

Creating aggregate variables

Now say you want to create a new variable:

  • the max Petal.Length for each variety of flower
  • but you want to keep the same number of rows! In other words, you want the original data set with just an extra variable.

Regular way to proceed:

  1. you aggregate the data at the species level
  2. you merge back the information to the original database
dt = as.data.table(iris)[, 3:5]
agg_dt = dt[, .(max_sl = max(Petal.Length)), by = Species]
res = merge(dt, agg_dt)
# looking at some obs
res[c(1:2, 51:52, 101:102)]
#> Species Petal.Length Petal.Width max_sl
#> 1: setosa 1.4 0.2 1.9
#> 2: setosa 1.4 0.2 1.9
#> 3: versicolor 4.7 1.4 5.1
#> 4: versicolor 4.5 1.5 5.1
#> 5: virginica 6.0 2.5 6.9
#> 6: virginica 5.1 1.9 6.9
48 / 63

Creating aggregate variables

But you can do it directly in one line with data.table!

49 / 63

Creating aggregate variables

But you can do it directly in one line with data.table!

Since it concerns the creation of a new variable, you must use :=:

dt[, [character vector of names] := [list of values], by = list(identifiers)]

49 / 63

Creating aggregate variables

But you can do it directly in one line with data.table!

Since it concerns the creation of a new variable, you must use :=:

dt[, [character vector of names] := [list of values], by = list(identifiers)]

We obtain the previous result with a single line:

dt = as.data.table(iris)[, 3:5]
dt[, max_sl := max(Petal.Length), by = Species]
# looking at some obs
dt[c(1:2, 51:52, 101:102)]
#> Petal.Length Petal.Width Species max_sl
#> 1: 1.4 0.2 setosa 1.9
#> 2: 1.4 0.2 setosa 1.9
#> 3: 4.7 1.4 versicolor 5.1
#> 4: 4.5 1.5 versicolor 5.1
#> 5: 6.0 2.5 virginica 6.9
#> 6: 5.1 1.9 virginica 6.9
49 / 63

Exercize: Aggregate variables

Remember the trade data:

# install.packages("fixest")
data(trade, package = "fixest")
head(trade, 3)
#> Destination Origin Product Year dist_km Euros
#> 1 LU BE 1 2007 139.5719 2966697
#> 2 BE LU 1 2007 139.5719 6755030
#> 3 LU BE 2 2007 139.5719 57078782
  • create the data set base_export containing the yearly total exports between each country-pair

  • add the variable share_export: it represents the Destination country represents in the yearly total exports of an Origin country.

For example, take France (FR) exports to Germany (DE) in 2010. If share_export = 50 this means that Germany receives 50% of all France exportations in 2010.

50 / 63

data.table: Summing up

  • selecting observations
  • selecting an modifying variables
  • creating new variables
  • creating aggregated measures
51 / 63

data.table: Summing up

  • selecting observations
  • selecting an modifying variables
  • creating new variables
  • creating aggregated measures

It's about 80% of your data management journey.

51 / 63

data.table: Summing up

  • selecting observations
  • selecting an modifying variables
  • creating new variables
  • creating aggregated measures

It's about 80% of your data management journey. The other 18% is merging.

51 / 63

Combining data sets

52 / 63

Merging different sets of information

Merging is the bread and butter of data management, most important and recurring operation.

First let's define data:

dtx = data.table(id = c("Al", "Jil", "Pablo", "Jules"))
dtx$performance = 1:4
dty = data.table(id = c("Al", "Francis", "Myriam"))
dty$age = c(34, 52, 29)
dtx
#> id performance
#> 1: Al 1
#> 2: Jil 2
#> 3: Pablo 3
#> 4: Jules 4
dty
#> id age
#> 1: Al 34
#> 2: Francis 52
#> 3: Myriam 29
53 / 63

Merging

Merging is an operation that combines information from different sources for a set of identifiers.

In our example we have a performance variable in table dtx and an age variable for table dty.

In both tables, the information is unique for a given id.

As we can see, the information is not exhaustive: Al is the only id to be in both tables.

54 / 63

Merging

You have 4 types of merging operations:

  1. inner join: only the identifiers that are present in both tables are kept.
  2. left join: all identifiers from the first table are kept. Information on identifiers of the second table that are not present in the first table are dropped.
  3. right join: explicit, same logic as left join
  4. outer join: All the information is kept. No identifier is lost.
55 / 63

Merging

This translates to the function merge() with the explicit:

  1. inner join: default
  2. left join: all.x = TRUE
  3. right join: all.y = TRUE
  4. outer join: all = TRUE (equiv. all.x = TRUE & all.y = TRUE)
56 / 63

Merging: Examples

merge(dtx, dty) # inner join
#> id performance age
#> 1: Al 1 34
merge(dtx, dty, all.x = TRUE) # left join
#> id performance age
#> 1: Al 1 34
#> 2: Jil 2 NA
#> 3: Jules 4 NA
#> 4: Pablo 3 NA
merge(dtx, dty, all.y = TRUE) # right join
#> id performance age
#> 1: Al 1 34
#> 2: Francis NA 52
#> 3: Myriam NA 29
57 / 63

Merging: Examples

merge(dtx, dty, all = TRUE) # outer join
#> id performance age
#> 1: Al 1 34
#> 2: Francis NA 52
#> 3: Jil 2 NA
#> 4: Jules 4 NA
#> 5: Myriam NA 29
#> 6: Pablo 3 NA
58 / 63

Merging: Question

dtx = data.table(id = c("Al", "Al",
"Pablo", "Jules"),
performance = 1:4)
dtx
#> id performance
#> 1: Al 1
#> 2: Al 2
#> 3: Pablo 3
#> 4: Jules 4
dty = data.table(id = c("Al", "Al",
"Myriam"),
age = c(34, 52, 29))
dty
#> id age
#> 1: Al 34
#> 2: Al 52
#> 3: Myriam 29
59 / 63

Merging: Question

dtx = data.table(id = c("Al", "Al",
"Pablo", "Jules"),
performance = 1:4)
dtx
#> id performance
#> 1: Al 1
#> 2: Al 2
#> 3: Pablo 3
#> 4: Jules 4
dty = data.table(id = c("Al", "Al",
"Myriam"),
age = c(34, 52, 29))
dty
#> id age
#> 1: Al 34
#> 2: Al 52
#> 3: Myriam 29
Question

What happens if?

merge(dtx, dty, by = "id")
59 / 63

Merging: Question

dtx = data.table(id = c("Al", "Al",
"Pablo", "Jules"),
performance = 1:4)
dtx
#> id performance
#> 1: Al 1
#> 2: Al 2
#> 3: Pablo 3
#> 4: Jules 4
dty = data.table(id = c("Al", "Al",
"Myriam"),
age = c(34, 52, 29))
dty
#> id age
#> 1: Al 34
#> 2: Al 52
#> 3: Myriam 29
Question

What happens if?

merge(dtx, dty, by = "id")
Answer

merge(dtx, dty, by = "id")
#> id performance age
#> 1: Al 1 34
#> 2: Al 1 52
#> 3: Al 2 34
#> 4: Al 2 52
59 / 63

Merging: Cartesian product

When merging data tables, there is a check for Cartesian product to avoid performing a very costly operation by mistake.

You can bypass this with the argument allow.cartesian.

dta = as.data.table(iris)[, .(Petal.Length, Species)]
dtb = as.data.table(iris)[, .(Sepal.Width, Species)]
dtab = merge(dta, dtb, by = "Species") # cartesian product => BLOCKED!
#> Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 7500 rows; more than 300 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
# now works => look at the size of the resulting data set!
# iris is 150 obs, 3 species of 50 obs.
# after merge we get: 3 * 50**2 obs.
dtab = merge(dta, dtb, by = "Species", allow.cartesian = TRUE)
dim(dtab)
#> [1] 7500 3
60 / 63

Merging: Keys of different names

You can use argument by to merge with identifiers of different names.

dtz = dty
names(dtz)[1] = "ID" # changing the name
merge(dtx, dtz, by.x = "id", by.y = "ID")
#> id performance age
#> 1: Al 1 34
#> 2: Al 1 52
#> 3: Al 2 34
#> 4: Al 2 52

However I strongly discourage that.

Better keeping consistent variable names across tables right from the start of a project. But it can be useful.

Of course an identifier can consist of more than one variable (e.g. in a panel it can be individual identifier + year identifier).

61 / 63

Exercise: Citation counts

Let's consider patent data from the USPTO. We want to count the number of times each patent gets cited by other patents.

Based on the text files:

  • uspto_sample.tsv
  • uspto_cites.csv

Create a table looking as follows:

with nb_cites the number of times patent identified by patent_id in uspto_sample.tsv has been cited by the patents in uspto_cites.csv.

62 / 63

Going further

The material we've just seen should take you a very long way!

However this was just an introduction: you can do many more things with data.table!

63 / 63

Stacking data: From wide to long

dt = data.table(id = 1:2, var1 = rnorm(2), var2 = rnorm(2))
dt_stacked = melt(dt, id.vars = "id")
dt_stacked
#> id variable value
#> 1: 1 var1 0.4874291
#> 2: 2 var1 0.7383247
#> 3: 1 var2 0.5757814
#> 4: 2 var2 -0.3053884
  • to add (if needed):
  • SD

Data management:
  Example

2 / 63
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow