Many functionalities depend on user-created functions ⇒ quality control depends on the goodwill (and time) of the programmer
It is not "point and click" ⇒ bigger learning cost
huge active community ⇒ many answers to your questions around
you can easily code whatever you want (great for simulation and complex data management)
it has some very (very) smart and convenient coding features
very well integrated document authoring system with RStudio: to create webpages, reports, presentations
Shiny: https://trackingpatentconcepts.shinyapps.io/shiny_diffusionApp/
you can create your own packages easily
A programming language is really a language:
Basic R programming
Data management
Statistics (basic)
Graphics
A big calculator
1 + 1
#> [1] 2
sqrt(4)
#> [1] 2
exp(log(2))
#> [1] 2
2**2
#> [1] 4
2^2
#> [1] 4
pi
#> [1] 3.141593
A programming Language
loops
conditions
coherent syntax
smart function definition
R likes vectors and matrices:
Every user-level function is documented. The quality of the documentation depends on the author of the function, though.
Type ?function to get the help page of the function.
The syntax rules you shall obey, there is no other way.
In concrete terms:
R is case sensitive ⇒ x1 is different from X1 and they will be treated as two different variables
every symbol has a specific meaning: You shall not use a parenthesis for a bracket! (≠{≠[
everything open must be closed (paren./bracket)
1 line = 1 instruction. And nothing more!
use one character sign for another and the code will break
Most errors are syntax errors.
If you think you have found a bug in the software, please refer to Rule 2.
If you knew R before this course, you must have heard of the tidyverse.
This is a set of functions, for data management and much more, created by talented developers, which changed the R paradigm by focusing on user-friendliness.
tidyverse introduced a lot of non-standard evaluation to make the typing much lighter and more intuitive.
The tradeoff is that it made programming with it more complex.
And if you're starting with R, it will give you terrible habits and expectations; and you won't be able to understand non-tidyverse users' code!
Before using non-standard evaluation, you should first properly understand what standard evaluation is!
exp(2)
#> [1] 7.389056
exp() is a (simple) function that takes only one numeric argument as input and gives one numeric output:
2 ⟶ exp ⟶ 7.389056
While there can be multiple arguments as inputs, the output must be a unique object!
round(pi, 2)
#> [1] 3.14
round() takes two arguments as inputs:
(pi, 2) ⟶ round ⟶ 3.14
function_name(arg1, arg2, ...., argN)
round(x = pi, digits = 1)
#> [1] 3.1
round(pi, 1)
#> [1] 3.1
round(digits = 1, pi)
#> [1] 3.1
round(pi)
#> [1] 3
round(pi, 4)
#> [1] 3.1416
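The calls above illustrate three ways R matches arguments: by name, by position, and through default values. Here is a minimal sketch with a user-defined function; `power` and its arguments `base` and `exponent` are hypothetical names introduced only for this illustration.

```r
# 'power' is a hypothetical function, defined only for this example
power = function(base, exponent = 2){
  base ^ exponent
}

by_position  = power(3, 2)                 # arguments matched by position
by_name      = power(exponent = 2, base = 3) # named arguments, any order
with_default = power(3)                    # 'exponent' takes its default value

print(c(by_position, by_name, with_default))
```

All three calls return the same value, just like `round(pi, 1)` and `round(digits = 1, pi)` do.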
The assignment operator is "<-"; you can also use "=":
x = 2
x <- 2
Here we use "=", like in any other programming language.
By googling on the internet, you can quickly find code snippets to do quite complex analysis. ML, spatial analysis, you name it.
Even if you're a bloody beginner.
What's the point of "knowing" how to do complex things if you don't know how to do basic things?
This will generate a lot of frustration when you'll want to perform simple things (export/loop/select/etc).
OK, in fact I just wanted to motivate a very tedious outline in the next slide! ;-)
loops and conditions
c() creates a vector:
x = c(1, 5, 6)
x
#> [1] 1 5 6
s = c("bonne", "nuit", "les petits")
s
#> [1] "bonne"      "nuit"       "les petits"
c() can also combine vectors:
y = c(x, 8, 9)
y
#> [1] 1 5 6 8 9
There are some tools to create regular vectors:
the colon ":"
rep()
seq()
The colon ":":
1:5
#> [1] 1 2 3 4 5
-2:2
#> [1] -2 -1  0  1  2
What is the result of 1:2+1?
1:2+1
#> [1] 2 3
1:(2+1)
#> [1] 1 2 3
rep() to replicate values or vectors:
rep(1, 3) # identical to rep(x = 1, times = 3)
#> [1] 1 1 1
rep(1:2, 3)
#> [1] 1 2 1 2 1 2
rep's argument each:
rep(1:2, each = 3)
#> [1] 1 1 1 2 2 2
times can be a vector of the same length as argument x:
rep(1:2, 2:3)
#> [1] 1 1 2 2 2
seq() to create regular sequences:
seq(1, 5, 1) # identical to seq(from = 1, to = 5, by = 1)
#> [1] 1 2 3 4 5
seq(1, 5, by = 2)
#> [1] 1 3 5
seq's argument length.out:
seq(1, 5, length.out = 7)
#> [1] 1.000000 1.666667 2.333333 3.000000 3.666667 4.333333 5.000000
Operations between scalars and vectors are term to term:
x = 1:3
x + 1
#> [1] 2 3 4
x * 2
#> [1] 2 4 6
x**3 # equivalent to x^3
#> [1]  1  8 27
log(x)
#> [1] 0.0000000 0.6931472 1.0986123
Operations between vectors are also term to term:
y = 10**(1:3) # equivalent to y = c(10, 100, 1000)
x + y
#> [1]   11  102 1003
x * y
#> [1]   10  200 3000
Some functions on vectors:
x = 1:4
length(x) # number of elements of x
#> [1] 4
mean(x)
#> [1] 2.5
sd(x)
#> [1] 1.290994
var(x)
#> [1] 1.666667
sum(x)
#> [1] 10
cumsum(x) # cumulative sum
#> [1]  1  3  6 10
diff(x) # next element - current element
#> [1] 1 1 1
Create the following vectors:
Say you have a vector x and you want to select specific elements from it.
Variable index represents the elements you want to select from x.
The syntax is as follows (note the square brackets):
x[index]
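The index can be of only three types, either: numeric (the positions of the elements), logical (TRUE/FALSE for each element), or character (the names of the elements, if the vector has names).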
Ex: You want to select the 4th and 5th elements:
x = 1:5
# two types of indexes yielding the same results:
index_nb = c(4, 5)
index_logic = c(FALSE, FALSE, FALSE, TRUE, TRUE)
x[index_nb]
#> [1] 4 5
x[index_logic]
#> [1] 4 5
With an index in number, you can take several times the same value from x:
x = 5:1
x[c(4, 4, 5, 1, 1)]
#> [1] 2 2 1 5 5
With a logical index, you just can't.
You can also use negative numbers to drop observations. If so, all numbers of the index must be negative:
x[-1] # drops first element
#> [1] 4 3 2 1
x[-(3:length(x))] # drops the third to the last element
#> [1] 5 4
If the vector has names, you can use a character vector as an index:
x = 1:5
names(x) = letters[1:5]
x
#> a b c d e 
#> 1 2 3 4 5 
x[c("b", "b", "d")]
#> b b d 
#> 2 2 4 
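To tie the index types together, here is a small sketch that selects the same two elements with a numeric, a logical and a character index. The vector and its names are made up for the illustration.

```r
# a named vector, made up for the example
x = c(a = 10, b = 20, c = 30, d = 40)

by_number  = x[c(2, 4)]                       # positions
by_logical = x[c(FALSE, TRUE, FALSE, TRUE)]   # one TRUE/FALSE per element
by_name    = x[c("b", "d")]                   # names

# the three subsets are the same object
same_1 = identical(by_number, by_logical)
same_2 = identical(by_number, by_name)
print(c(same_1, same_2))
```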
At first sight, the logical index looks impractical; however, it is the one you are going to use the most!
Why? Because logical operations on vectors yield logical vectors.
A logical vector is returned when you perform logical operations on a vector.
lower than: <, lower or equal: <=, greater than: >, greater or equal: >=, equal: ==, different: !=
To combine the results of several logical operations:
AND: & (ampersand), OR: | (pipe), NOT: !
x = 1:5
x > 2
#> [1] FALSE FALSE  TRUE  TRUE  TRUE
x != 5
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE
x > 2 & x != 5
#> [1] FALSE FALSE  TRUE  TRUE FALSE
!(x > 2 & x != 5)
#> [1]  TRUE  TRUE FALSE FALSE  TRUE
Use logical vectors to find an approximation of the probability that:
|x| > 3, with x ∼ N(0,1)
set.seed(1) # for replicability
x = rnorm(1000) # 1000 draws from N(0,1)
is_large = abs(x) > 3
sum(is_large) # number of times abs(x) > 3
#> [1] 3
x[is_large] # we see the 'large' elements
#> [1] -3.008049  3.810277  3.055742
For information: 2×Φ(−3)=0.0027, or in R terms:
2 * pnorm(-3)
#> [1] 0.002699796
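The count above becomes a probability estimate once divided by the number of draws. Since logicals behave like 0/1 numbers, `mean()` on a logical vector gives the share of TRUE directly. A short sketch, reusing the same seed and sample size as above (the names `share` and `exact` are introduced here for illustration):

```r
set.seed(1)               # same seed as above, for replicability
x = rnorm(1000)           # 1000 draws from N(0,1)
is_large = abs(x) > 3

share = mean(is_large)    # share of TRUE = estimated probability
exact = 2 * pnorm(-3)     # exact probability

cat("simulated:", share, "- exact:", exact, "\n")
```

With only 1000 draws the estimate is rough; increasing the number of draws should bring it closer to the exact value.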
Logicals are just 0/1-like numbers, you can use them in arithmetic operations.
a = c(TRUE, FALSE)
a
#> [1]  TRUE FALSE
a + 1 # logical is converted to numeric
#> [1] 2 1
Beware: logical operations have the lowest precedence (i.e. they come last.)
Example: we want to set the values of x to 0 if y is negative.
What's the result of:
x = 1:5
y = -2:2
x * y>0
#> [1] FALSE FALSE FALSE  TRUE  TRUE
Here x * y is computed first, and only then compared to 0. Use parentheses to get what we want:
x * (y > 0)
#> [1] 0 0 0 4 5
which() returns the indexes of a logical vector which are TRUE:
which( c(TRUE, FALSE, FALSE, TRUE) )
#> [1] 1 4
set.seed(1)
x = rnorm(6)
which(x > 0)
#> [1] 2 4 5
With a pen and paper:
Get the result of which(x > 0), but without using it.
Create x_abs, which is the absolute value of x = -3:3 (you'll only need subsetting).
The syntax of a for loop is:
for(index in vector){
  # do stuff
  # the variable 'index' will successively take
  # each value in 'vector'
}
for(i in c("Monique", "Esteban", "Francis")){
  # cat: function used to print messages on the console
  # \n: line break (otherwise everything is on one line)
  cat("Hello ", i, "!\n", sep = "")
}
#> Hello Monique!
#> Hello Esteban!
#> Hello Francis!
The syntax for a while loop:
while(condition){
  # do stuff
}
i = 2
while(i <= 100){
  cat(i, "^2 = ", i^2, "\n", sep = "")
  i = i**2
}
#> 2^2 = 4
#> 4^2 = 16
#> 16^2 = 256
cat("i out of the loop is", i)
#> i out of the loop is 256
Of course, try to avoid infinite loops!
Use break to escape a loop (either for or while):
a = 5
while(TRUE){
  a_next = a^a
  if(!is.finite(a_next)){
    break
  }
  a = a_next
}
cat(a, " is finite but ", a, "^", a, " = ", a_next, sep = "")
#> 3125 is finite but 3125^3125 = Inf
In a loop, to go to the next iteration, use next:
for(i in 1:100){
  if(i < 99){
    next
  }
  print(i)
}
#> [1] 99
#> [1] 100
The syntax for a condition is as follows:
if(condition_1){
  # stuff
} else if(condition_2){
  # stuff
} else {
  # stuff
}
Any condition MUST be of length 1! i.e. you cannot use logical vectors.
x = 1:5
# BAD:
if(x == 1){
  # meaningless! An error will be raised.
  # (In R < 4.2.0 it only led to a warning.)
}
# GOOD:
if(x[1] == 1){
  # Now it's clear what you test
}
if(any(x == 1)){
  # any() returns TRUE if there is at least
  # one TRUE in a logical vector
}
if(all(x == 1)){
  # all() returns TRUE if all the values are TRUE
}
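To make the length-1 rule concrete, the sketch below reduces a logical vector to a single TRUE/FALSE with any() and all() before it reaches the if. The variable names (`is_one`, `has_one`, `msg`, ...) are chosen for the illustration only.

```r
x = 1:5
is_one = x == 1            # logical vector of length 5

has_one      = any(is_one) # at least one TRUE?
all_one      = all(is_one) # only TRUE values?
all_positive = all(x > 0)

# the condition is now of length 1, so if() is happy
if(has_one){
  msg = "x contains a 1"
} else {
  msg = "no 1 in x"
}
print(msg)
```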
Say you want to test whether the 5th element of a vector is greater than 12:
if(x[5] > 12) print("ok")
If x is of length lower than 5 ⇒ problem.
Use a logical and:
if(length(x) >= 5 & x[5] > 12) print("ok")
It works, but not for the right reasons!
What do you think is going to happen?
x = 1:3
if(length(x) >= 5 & stop("This has been evaluated")) print("ok")
#> Error in eval(expr, envir, enclos): This has been evaluated
if(length(x) >= 5 & x[5] > 12) print("ok")
The previous code will:
evaluate length(x) >= 5
evaluate x[5] > 12
combine the two results with &
This is not what we want!
What do we want?
Evaluate length(x) >= 5:
if TRUE, evaluate x[5] > 12 and return its value
if FALSE, return FALSE
This means that we don't want x[5] to be evaluated if the length of x is lower than 5!
Use logical operators with short-circuit: && and ||:
if(length(x) >= 5 && x[5] > 12) print("ok")
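The difference between the two operators can be checked directly. In the sketch below, stop() plays the role of "code that must not be evaluated": with && it is never reached, with & it is. The helper `is_fifth_large` is a hypothetical function written for this example.

```r
x = 1:3

# &&: the left side is FALSE, so the right side is never evaluated
res_short = length(x) >= 5 && stop("never evaluated")

# &: both sides are evaluated, so stop() is triggered (caught here)
res_full = tryCatch(length(x) >= 5 & stop("evaluated"),
                    error = function(e) "error raised")

# hypothetical helper: safe test on the 5th element
is_fifth_large = function(x){
  length(x) >= 5 && x[5] > 12
}

print(res_short)
print(res_full)
print(is_fifth_large(1:3))
print(is_fifth_large(c(1, 2, 3, 4, 20)))
```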
& and | operators: element-wise, both sides are always evaluated
&& and || operators: for single values, the right side is evaluated only if needed
Now knowing how & works, why does this produce an error:
x = 1:3
if(x[5] > 12) print("ok")
#> Error in if (x[5] > 12) print("ok"): missing value where TRUE/FALSE needed
... but not this?
x = 1:3
if(length(x) >= 5 & x[5] > 12) print("ok")
Because of the way & deals with missing values.
& and NAs:
TRUE & NA
#> [1] NA
FALSE & NA
#> [1] FALSE
| and NAs:
TRUE | NA
#> [1] TRUE
FALSE | NA
#> [1] NA
Beware the behavior of NAs when subsetting!
set.seed(1)
x = rnorm(8)
x[c(3, 7)] = NA
y = round(rnorm(8), 1)
y[x > 0]
#> [1] -0.3   NA  0.4 -0.6   NA  0.0
Remember that FALSE & NA leads to FALSE? You need to use is.na to identify the NAs:
y[!is.na(x) & x > 0]
#> [1] -0.3  0.4 -0.6  0.0
NAs can be sneaky!
Say we generate two random variables: x following a uniform distribution, y following a Normal distribution, and we create z, equal to y to the power x.
Finally, we subset y on the values taken by z.
set.seed(1)
x = runif(8)
y = round(rnorm(8), 1)
z = y ^ x
# we want the y's for which z >= 0.7
y[z >= 0.7]
#> [1] 0.3  NA 0.7 0.6  NA 1.5
The right way is:
y[!is.na(z) & z >= 0.7]
#> [1] 0.3 0.7 0.6 1.5
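The same pitfall can be reproduced on a tiny hand-made example, which makes it easier to follow element by element. The sketch also shows an alternative not used above: which() only returns the positions that are TRUE, so NAs are dropped. The vectors are made up for the illustration.

```r
# hand-made vectors, z contains one NA
y = c(0.3, -1.2, 0.7, 2.0)
z = c(0.9,   NA, 0.8, 0.1)

naive   = y[z >= 0.7]                 # the NA in the index yields an NA
with_na = y[!is.na(z) & z >= 0.7]     # FALSE & NA is FALSE: NA removed
with_wh = y[which(z >= 0.7)]          # which() ignores NAs

print(naive)
print(with_na)
print(with_wh)
```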
names = c("Monique", "Esteban", "Francis")
for(i in 1:3){
  if(i == 2){
    text = "Not hello "
  } else {
    text = "Hello "
  }
  cat(text, names[i], "!\n", sep = "")
}
#> Hello Monique!
#> Not hello Esteban!
#> Hello Francis!
For single instruction loops and conditions, you can omit the brackets:
for(i in 1:100) if(i == 55) print("55 is reached")
#> [1] "55 is reached"
Although it looks like two operations are executed in the for loop, it is really only one (the if).
I do NOT advise using this shorthand ⇒ the code loses clarity.
Yet it is useful for quick-and-dirty (QnD) stuff.
With a pen and paper:
Compute the mean of x = 1:5 with a loop.
Compute the exponential of 1 with a loop. Recall that
$\exp(x) = \sum_{i=0}^{\infty} \frac{x^i}{i!}$.
Use the function factorial() and go only until i = 20.
Do the previous exercise without a loop.
Discover for which integer factorial becomes infinite. Do it twice: with a for and then a while loop.
Find the first divisor of 1234567 (use %% to get the remainder of the Euclidean division: 14 %% 5 yields 4, 8 %% 3 yields 2).
Now let's apply the solutions in RStudio.
To run current line: ctrl + enter
To comment / uncomment: ctrl + shift + c
To create sections, add "####" to the end of a comment (ex: # Section 1 ####).
ctrl + alt + up or down: duplicates the current line
ctrl + alt + click: duplicates the cursors
Create your own macros: https://rstudio.github.io/rstudioaddins/
To create a 2×2 matrix full of ones:
matrix(data = 1, nrow = 2, ncol = 2)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    1    1
Let's create a matrix with numbers from 1 to 4:
matrix(1:4, 2, 2)
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
R fills the matrix by columns. To fill it by row:
matrix(1:4, 2, 2, byrow = TRUE)
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    3    4
You can create matrices by "binding" vectors, using functions rbind and cbind:
rbind(1:3, 3:1) # row bind
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    3    2    1
cbind(1:3, 3:1) # column bind
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    2
#> [3,]    3    1
cbind(1:2, 2:1, 3:4, 4:3) # you can have any number of args
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    2    1    4    3
Like for vectors, all operations are term to term:
X = matrix(1:4, 2, 2)
X + 2
#>      [,1] [,2]
#> [1,]    3    5
#> [2,]    4    6
X ** 2
#>      [,1] [,2]
#> [1,]    1    9
#> [2,]    4   16
exp(X)
#>          [,1]     [,2]
#> [1,] 2.718282 20.08554
#> [2,] 7.389056 54.59815
Even when you multiply by a matrix:
X = matrix(1:4, 2, 2)
Y = matrix(1, 2, 2)
X * Y
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
Does vector × matrix multiplication work?
Y * 1:4
Yes it works:
Y * 1:4
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
Weird behavior, isn't it? When a vector multiplies a matrix, R first kind of transforms the vector into a matrix of the same dimensions before applying the operation.
Here's the logic for Y * 1:4:
Y is 2×2
1:4 is transformed into matrix(1:4, 2, 2)
the operation performed is Y * matrix(1:4, 2, 2)
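The logic just described can be verified in a couple of lines. This is only a check of the equivalence stated above; the names `recycled` and `explicit` are introduced for the illustration.

```r
Y = matrix(1, 2, 2)

recycled = Y * 1:4                  # the vector is used as is
explicit = Y * matrix(1:4, 2, 2)    # the vector reshaped by hand

# both computations give exactly the same matrix
same = identical(recycled, explicit)
print(same)
```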
To perform matrix multiplication, we need to use the following symbol: "%*%"
X = matrix(1:4, 2, 2)
Y = matrix(1, 2, 2)
X %*% Y
#>      [,1] [,2]
#> [1,]    4    4
#> [2,]    6    6
X %*% (1:2)
#>      [,1]
#> [1,]    7
#> [2,]   10
Now the vector MUST be of the appropriate dimensions:
X %*% 1:4
#> Error in X %*% 1:4: non-conformable arguments
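A short sketch contrasting the two kinds of product on the same matrix. It also shows that, with %*%, a plain vector is treated as a column when placed on the right and as a row when placed on the left. Variable names are chosen for the example only.

```r
X = matrix(1:4, 2, 2)

elementwise = X * X        # term-to-term product
matprod     = X %*% X      # matrix product

as_column = X %*% (1:2)    # the vector acts as a 2x1 column
as_row    = (1:2) %*% X    # the vector acts as a 1x2 row

print(elementwise)
print(matprod)
print(dim(as_column))
print(dim(as_row))
```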
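t(X): transpose of X
crossprod(X) is faster than t(X) %*% X
tcrossprod(X) is faster than X %*% t(X)
solve(X): inverse of X
rowSums(X): sum of each row
colSums(X): sum of each column
dim(X): dimensions of X
nrow() (ncol()): number of rows (columns)
Generate data (100 points) according to the following relation:
$y_i = 2 + 5 x_i + \epsilon_i$, with $x_i \sim N(0,1)$ and $\epsilon_i \sim N(0,1)$
Estimate the coefficients of the constant and of x. Recall that:
$\hat{\beta} = (X'X)^{-1} X'Y$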
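runif(n): n draws from a uniform distribution on [0, 1]
rnorm(n): n draws from a standard normal distribution
?Distributions: help page listing the available distributions
sample(n, k, replace = TRUE): draws k numbers among n with replacement.
set.seed(m), with m a number: sets the seed of the random number generator (when you launch R, the default is roughly like set.seed(Sys.time()), so that the seed is different every time.)
What happens if I do: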
x = 1:5
x[c(TRUE, FALSE)]
#> [1] 1 3 5
It works! Yet it's not good news...
If an operation requires a vector of length n and you give a vector of length m<n, R tries hard to make the second vector match length n, so that the operation works.
Basically, it usually replicates the vector until it fits.
You may not notice mistakes!
rep(1, 6) + 0:1
#> [1] 1 2 1 2 1 2
matrix(1:3, 3, 2)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    3
matrix(1:3, 2, 3)
#>      [,1] [,2] [,3]
#> [1,]    1    3    2
#> [2,]    2    1    3
matrix(1, 2, 2) + 0:1
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
matrix(1, 2, 2) + 0:2 # Now lengths don't match => warning
#> Warning in matrix(1, 2, 2) + 0:2: longer object length is not a multiple of
#> shorter object length
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    1
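Here is a sketch of how recycling can hide a mistake, together with one possible guard. The data (`prices`, `discount`) and the helper `safe_multiply` are hypothetical, written only for this illustration; the guard is one option among others, not a standard R function.

```r
prices   = c(10, 20, 30, 40)
discount = c(0.5, 1)          # mistake: should have 4 values

# no error, no warning: 'discount' is silently recycled
silent = prices * discount

# hypothetical helper: only accept equal lengths or a single value
safe_multiply = function(a, b){
  if(length(a) != length(b) && length(b) != 1){
    stop("lengths differ: ", length(a), " vs ", length(b))
  }
  a * b
}

guarded = tryCatch(safe_multiply(prices, discount),
                   error = function(e) conditionMessage(e))
print(silent)
print(guarded)
```

The unguarded product runs without complaint because 4 is a multiple of 2, which is exactly the case where no warning is issued.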
Three ways to extract subsets of a matrix:
X[index_row, index_column] ⇒ yields a matrix
X[index_vector] ⇒ yields a vector
X[index_matrix] ⇒ yields a vector
Both indexes index_row and index_column are similar to vector indexes: they can either be logical or numeric (or character).
Leaving an index empty means all rows/columns.
X = matrix(1:9, 3, 3, byrow = TRUE)
X[1, 2:3] # 1st line, two last columns => 1x2 mat.
#> [1] 2 3
# simplified by R to a vector
X[c(1,3), 2:3] # 1st & 3rd lines, two last cols => 2x2 mat.
#>      [,1] [,2]
#> [1,]    2    3
#> [2,]    8    9
X[1, ] # 1st line, all columns
#> [1] 1 2 3
X[X[, 1] <= 4, ] # all lines such that 1st element <= 4
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
You can use a logical/numeric vector going through all the elements of a matrix:
X = matrix(1:9, 3, 3, byrow = TRUE)
X[X<5]
#> [1] 1 4 2 3
X[5] # the 5th element of the matrix
#> [1] 5
Logic? The matrix is first transformed into a vector (column by column), then the subsetting is done.
Here is an example of what it does:
X_tmp = as.vector(X)
X_tmp[X_tmp < 5]
Sub-setting with a matrix index (index_matrix).
index_matrix: a two-column matrix, each row giving the (row, column) position of one element to extract:
X = matrix(1:9, 3, 3, byrow = TRUE)
# getting the diagonal:
index_matrix = cbind(1:3, 1:3) # 2 column matrix
X[index_matrix]
#> [1] 1 5 9
# getting the other diagonal
X[cbind(1:3, 3:1)]
#> [1] 3 5 7
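A matrix index works for assignment as well as for extraction. The sketch below checks the extraction against diag() and then sets the diagonal to zero; `idx`, `on_diag` and `X0` are names introduced for the example.

```r
X = matrix(1:9, 3, 3, byrow = TRUE)
idx = cbind(1:3, 1:3)      # (row, column) positions of the diagonal

on_diag = X[idx]           # extraction: same values as diag(X)

X0 = X
X0[idx] = 0L               # assignment: the diagonal is set to 0

print(on_diag)
print(X0)
```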
What is the type of y?
x = matrix(1:4, 2, 2)
y = x[, 1]
class(y)
#> [1] "integer"
It's a vector!
Subsets of matrices leading to something of dimension 1 (either row or column) lead to a conversion to vector. It may not be what you want!
And the help-page explaining this default behavior is super hard to find!
You need to type: ?"[".
There you find the solution:
x = matrix(1:4, 2, 2)
y = x[, 1, drop = FALSE]
class(y)
#> [1] "matrix" "array"
class(x[, 1])
#> [1] "integer"
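Looking at the dimensions makes the effect of drop = FALSE visible: the simplified vector has no dimensions at all, while the matrix version keeps its 2×1 shape and can be used in matrix products. A small sketch, with variable names chosen for the illustration:

```r
x = matrix(1:4, 2, 2)

v = x[, 1]                 # simplified to a vector
m = x[, 1, drop = FALSE]   # still a (2 x 1) matrix

print(dim(v))              # NULL: a vector has no dimensions
print(dim(m))

# the column matrix behaves as expected in matrix products
ss = t(m) %*% m            # 1 x 1 matrix: sum of squares of the column
print(ss)
```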
Let X be the 10×4 matrix such that $X_{ik} = i \times k$.
We want to create $\tilde{X}$ such that $\tilde{X}_{ik} = X_{ik} - \bar{X}_k$, with $\bar{X}_k$ the average of the kth column of X.
Do it using: i) colSums(), ii) matrix multiplication, iii) colMeans()
Lists are sort of vectors of objects that can be of any type.
a = list(1:5, c("je", "dors"), matrix(1:4, 2, 2))
a
#> [[1]]
#> [1] 1 2 3 4 5
#> 
#> [[2]]
#> [1] "je"   "dors"
#> 
#> [[3]]
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
As we can see, the list named a contains three elements: a numeric vector, a character vector and a matrix.
You can give names to the elements of a list ⇒ makes it clearer:
a = list(vec = 1:5, charvec = c("je", "dors"), mat = matrix(1:4, 2, 2))
a
#> $vec
#> [1] 1 2 3 4 5
#> 
#> $charvec
#> [1] "je"   "dors"
#> 
#> $mat
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
A list is not a vector. (Sorry for the pleonasm.)
You cannot perform regular operations with lists.
Why? Because it does not make sense.
a = list(1:5) # list made of just one vector
a * 2 # Error!
#> Error in a * 2: non-numeric argument to binary operator
To add an element to an existing list:
a = list() # empty list
a$x = 1:5 # creates the first element, named x
a[["y"]] = 2 # creates a 2nd element, named y
a[[3]] = 66 # creates a 3rd element, NOT named
Note that a$x = 1:5 is exactly equivalent to a[["x"]] = 1:5.
Two ways to extract elements from a list:
Methods returning a list:
a[index_vector] ⇒ it always returns a list
Methods extracting one single element from a list:
a[[index_or_name]]
a$name
a = list(numvec = 1:5, charvec = c("bon", "jour"))
a[1:2] # a list
#> $numvec
#> [1] 1 2 3 4 5
#> 
#> $charvec
#> [1] "bon"  "jour"
a[c(FALSE, TRUE)]
#> $charvec
#> [1] "bon"  "jour"
a["charvec"] # still a list
#> $charvec
#> [1] "bon"  "jour"
a["numvec"] * 2 # it's a list => error is raised
#> Error in a["numvec"] * 2: non-numeric argument to binary operator
Now assume you want to extract single elements to perform some operations with them:
a = list(numvec = 1:5, charvec = c("bon", "jour"))
a[["numvec"]] * 2 # we use the vector -- and not the list
#> [1]  2  4  6  8 10
a$numvec # other way
#> [1] 1 2 3 4 5
a[[1]] # yet another
#> [1] 1 2 3 4 5
When a list doesn't have names, you must use numeric/logical subsets.
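The difference between single and double square brackets is easiest to see by inspecting what each one returns. A minimal sketch, reusing the list from the example above (`sub_list` and `element` are names introduced here):

```r
a = list(numvec = 1:5, charvec = c("bon", "jour"))

sub_list = a["numvec"]     # single bracket: a list of length 1
element  = a[["numvec"]]   # double bracket: the vector itself

print(class(sub_list))
print(class(element))
print(length(sub_list))    # 1 element in the list
print(length(element))     # 5 values in the vector
```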
So far we've seen only vectors, matrices and lists.
Matrices are good for numeric applications, yet all their elements must be of the same type (in practice, mostly numeric data).
Lists are interesting but their content can be too heterogeneous.
What can we use for data analysis?
A data frame is a table with a given number of rows, each column being of a specific type.
data(iris) # internal R data used for examples
head(iris, 3) # first 3 rows
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
class(iris) # data.frame indeed
#> [1] "data.frame"
dim(iris) # dimension of the data
#> [1] 150   5
You can create data frames with... data.frame:
data.frame(id = 1:3, name = c("John", "Mary", "Tim"))
#>   id name
#> 1  1 John
#> 2  2 Mary
#> 3  3  Tim
To get general information on the variables of a data.frame, you can use the function summary:
summary(iris)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
The function head still works:
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
Data frames are a special beast.
They can be subsetted either like lists or like matrices.
iris[1:2, 2:3]
#>   Sepal.Width Petal.Length
#> 1         3.5          1.4
#> 2         3.0          1.4
iris[2:3, c("Sepal.Length", "Species")]
#>   Sepal.Length Species
#> 2          4.9  setosa
#> 3          4.7  setosa
head(iris[1], 2) # single square bracket: DF is returned
#>   Sepal.Length
#> 1          5.1
#> 2          4.9
head(iris[[1]]) # vector is returned
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4
head(iris[["Sepal.Length"]]) # vector is returned
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4
head(iris$Sepal.Length) # vector is returned
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4
As for matrices, if the subset of a data.frame leads to something of dimension 1, it is simplified into a vector.
You need to use the (hidden) argument drop to change the behavior:
class(iris[, 1])
#> [1] "numeric"
class(iris[, 1, drop = FALSE])
#> [1] "data.frame"
A data.frame must have column names. To both access and set them, use names().
df = iris
names(df)
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
names(df) = 1:ncol(df)
names(df)
#> [1] "1" "2" "3" "4" "5"
To access column names in a matrix, use colnames().
To access and set the row names of a data.frame, use row.names():
df = data.frame(age = c(32, 12, 27), other = 3:1)
row.names(df) = c("Lara", "Julia", "Paul")
df["Paul", ]
#>      age other
#> Paul  27     1
To access row names in matrices, use rownames() (no dot!).
df = data.frame(name = c("Lara", "Julia", "Paul"), age = c(32, 12, 27))
df$young = df$age <= 18
# initialize the variable with some value
df$ageSq_if_young = df$age
# modify it conditionally
df$ageSq_if_young[df$young] = df$age[df$young] ** 2
# Delete variable:
df$age = NULL
Use the function apply() to apply a function to the rows/columns of a matrix/data.frame.
# MARGIN: 1: row, 2: column
apply(iris[, 1:4], MARGIN = 2, FUN = median) # get the median for 4 first vars
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>         5.80         3.00         4.35         1.30 
head(apply(iris[, 1:4], 1, max)) # max value for each obs
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4
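As a sanity check on how MARGIN works, the sketch below computes column means with apply() and compares them with the dedicated function colMeans(). The variable names are chosen for this example.

```r
# column means, computed in two ways
means_apply = apply(iris[, 1:4], MARGIN = 2, FUN = mean)
means_fast  = colMeans(iris[, 1:4])

# same result, up to numerical precision
same = isTRUE(all.equal(means_apply, means_fast))
print(means_apply)
print(same)
```

apply() accepts any function through FUN, whereas colMeans() does one thing only.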
In your R programming journey, you may encounter a strange creature: factors.
class(iris$Species)
#> [1] "factor"
head(iris$Species)
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
Factors are in fact a way to encode categorical data:
they look like character vectors but are in fact integers
they're weird!
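To make the "integers in disguise" point concrete, here is a minimal sketch on a made-up factor f (not part of the original slides):

```r
f = factor(c("low", "high", "low", "mid"))

levels(f)        # the distinct labels, sorted alphabetically
as.integer(f)    # the integer codes stored under the hood
as.character(f)  # back to the labels

# The labels can be rebuilt from the codes and the levels
rebuilt = levels(f)[as.integer(f)]
identical(rebuilt, as.character(f))
```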
In R < 4.0.0, when creating a data.frame, the default of argument stringsAsFactors is TRUE!
Back in the day (check your R version!), upon the creation of a DF, character vectors were converted to factors:
df = data.frame(name = c("Lara", "Julia", "Paul"), age = c(32, 12, 27), stringsAsFactors = TRUE)
df$name
#> [1] Lara  Julia Paul 
#> Levels: Julia Lara Paul
class(df$name)
#> [1] "factor"
Illustration of the problem:
df = data.frame(name = c("Lara", "Julia", "Paul"), stringsAsFactors = TRUE)
age = c(Lara = 32, Julia = 12, Paul = 27) # named vector
What does the following yield?
df$age = age[df$name]
df
#>    name age
#> 1  Lara  12
#> 2 Julia  32
#> 3  Paul  27
WTF???
Why this behavior? Because factors are treated as integers!
Here's a sketch of what's done to the character variable df$name:
name_sorted_unik = sort(unique(df$name))
dict = 1:length(name_sorted_unik)
names(dict) = name_sorted_unik
df$name = dict[df$name]
unclass(df$name) # integers reflect the "character" order
#> [1] 2 1 3
#> attr(,"levels")
#> [1] "Julia" "Lara"  "Paul"
To have the appropriate result:
df$age = age[as.character(df$name)]
df # we had to reconvert into character
#>    name age
#> 1  Lara  32
#> 2 Julia  12
#> 3  Paul  27
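The same trap shows up when a factor holds numbers stored as text. A minimal sketch on a made-up factor f (an illustration, not from the original slides):

```r
f = factor(c("10", "5", "20"))

codes = as.numeric(f)                 # integer codes, NOT the original numbers
values = as.numeric(as.character(f))  # the original numbers

codes
values
```

As in the age example, going through as.character() first is what recovers the intended values.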
Factor variables can be useful in some contexts (we'll see which ones later).
Think functions.
Why think in functions? If you have a piece of code that you use at least twice, it is worth making a function out of it so that the next time, you just use one line of code. Direct productivity gain.
Big strength of R: it is very easy to create flexible functions. The cost of making functions is low.
Even if you think the problem you're dealing with is specific, try to see it as a special case of a broader context.
This way, you'll be able to create a broad function that'll be able to deal with your specific problem but also many others.
One drawback is that making broad functions requires abstract thinking. Yet it's usually worth the investment.
A function is just a set of instructions applied to objects given in input.
To create a function, the structure is as follows:
functionName = function(arg1, arg2){
  # the instructions to perform
  return(output) # the stuff to be returned
}
To lighten notation, R allows you to omit return() when returning something from a function.
By default, the value of the last evaluated expression is returned.
add1 = function(x){
  res = x + 1
  return(res) # return(x + 1) also works
}
add1_bis = function(x){
  x + 1
}
add1(1)
#> [1] 2
add1_bis(1)
#> [1] 2
If the function bumps into a return(), it stops right away and returns the object:
test = function(x){
  return(1)
  return(2)
  return(3)
}
test()
#> [1] 1
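Stopping at the first return() is mostly useful for early exits. A minimal sketch with a hypothetical function safeSqrt():

```r
# Hypothetical function: square root with an early exit for negative input
safeSqrt = function(x){
  if(x < 0){
    return(NA)   # leave immediately, the line below is never reached
  }
  sqrt(x)        # evaluated only when x >= 0
}

safeSqrt(-4)
safeSqrt(9)
```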
You can write functions with only one instruction in one line (no need for braces):
add1_ter = function(x) x + 1
add1_ter(2)
#> [1] 3
You can add default values to the arguments:
funName = function(arg1 = default1, arg2 = default2, arg3){
  # here arg1 and arg2 have default values
}
add1_quar = function(x = 0) x + 1
add1_quar()
#> [1] 1
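A default is only a fallback: the caller can still override it, by position or by name. A minimal sketch with a hypothetical function powerOf():

```r
# Hypothetical function: raise x to the power p, squaring by default
powerOf = function(x, p = 2) x^p

powerOf(3)             # p takes its default value
powerOf(3, p = 3)      # default overridden by name
powerOf(p = 3, x = 2)  # named arguments can come in any order
```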
Arguments need not be used in the function:
happy = function(x, y, z){
  print("I'm happy.")
}
happy()
#> [1] "I'm happy."
happy("I", "am not", "happy") # yields same result
#> [1] "I'm happy."
In the happy() function, no error is raised: the arguments x, y and z are missing, but they are never used.
When an argument is both missing and used ⇒ an error is raised:
funSquare = function(x) x**2
funSquare(2) # works
#> [1] 4
funSquare() # x is missing and used! Error
#> Error in funSquare(): argument "x" is missing, with no default
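When a missing argument should be handled rather than trigger an error, base R's missing() can test for it inside the function. A minimal sketch with a hypothetical function greet():

```r
# Hypothetical function: greet someone, with a fallback when name is absent
greet = function(name){
  if(missing(name)){
    return("Hello, whoever you are!")
  }
  paste0("Hello, ", name, "!")
}

greet()
greet("Lara")
```

In many cases a default value does the same job with less code; missing() is mainly useful when no sensible default exists.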
Create myCov(X), a function to compute the covariance of a matrix X. Recall that:
$V(X) = \frac{1}{n-1} \tilde{X}' \tilde{X}, \qquad \tilde{X}_{ik} = X_{ik} - \bar{X}_k$
Apply it to X = cbind(rnorm(100, sd = 2), rnorm(100, sd = 10)).
Create myOLS(y, x), a function that returns the coefficients of an OLS regression of vector y on vector x.
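One possible way to approach both exercises is sketched here, not as the reference solution. It assumes that myOLS() should include an intercept, which the exercise does not state, and it checks the results against the built-in cov() and lm().

```r
# One possible myCov(): center each column, then apply the formula
myCov = function(X){
  n = nrow(X)
  X_centered = sweep(X, 2, colMeans(X))  # subtract each column mean
  t(X_centered) %*% X_centered / (n - 1)
}

# One possible myOLS(): solve the normal equations (Z'Z) b = Z'y,
# where Z is x plus an intercept column (the intercept is an assumption)
myOLS = function(y, x){
  Z = cbind(1, x)
  drop(solve(crossprod(Z), crossprod(Z, y)))
}

X = cbind(rnorm(100, sd = 2), rnorm(100, sd = 10))
cov_ok = all.equal(myCov(X), cov(X))

y = 1 + 2 * X[, 1] + rnorm(100)  # made-up outcome, for the check only
ols_ok = all.equal(unname(myOLS(y, X[, 1])), unname(coef(lm(y ~ X[, 1]))))

cov_ok
ols_ok
```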
...
showDot = function(...){
  dots = list(...)
  print(dots)
}
showDot(arg1 = 1:5, "test stuff", b = "another", list(test_still = 2))
#> $arg1
#> [1] 1 2 3 4 5
#> 
#> [[2]]
#> [1] "test stuff"
#> 
#> $b
#> [1] "another"
#> 
#> [[4]]
#> [[4]]$test_still
#> [1] 2
plotCor = function(x, y){
  # linear regression
  reg = lm(y ~ x)
  # plotting the correlation...
  plot(x, y)
  # with the fit
  abline(reg)
}
plotCor(iris$Sepal.Length, iris$Sepal.Width)
What if I want to change other stuff in the plot?
With this definition, you can't.
plotCor = function(x, y, ...){
  # linear regression
  reg = lm(y ~ x)
  # plotting the correlation...
  plot(x, y, ...)
  # with the fit
  abline(reg)
}
plotCor(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species, xlab = "Sepal.Length")
How does it work?
all arguments that are NOT plotCor() arguments are gathered in ... and passed on to the plot function
in the example, the plot performed in the function is: plot(x, y, col = iris$Species, xlab = "Sepal.Length")
... allows you to create flexible functions, with a lightweight syntax
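The same forwarding mechanism works with any function, not only plot(). A minimal sketch without graphics, using a hypothetical wrapper centerStats() that hands its extra arguments to mean() and median():

```r
# Hypothetical wrapper: everything captured by ... is forwarded as is
centerStats = function(x, ...){
  c(mean = mean(x, ...), median = median(x, ...))
}

v = c(1, 2, NA, 9)

res_default = centerStats(v)              # NA propagates: na.rm was not given
res_narm = centerStats(v, na.rm = TRUE)   # na.rm travels through ...

res_default
res_narm
```

Note that centerStats() never mentions na.rm: it only relays whatever the caller supplies.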
x = 5
f1 = function() print(x)
# Does it work?
f1()
#> [1] 5
When R doesn't find a variable inside a function, it looks for it in the environment where the function was defined, and so on up to the global environment.
In the context of f1(), x is a global variable.
x = 1
f1 = function() print(x)
f2 = function() {
  x = 2
  f1()
}
f3 = function(){
  x = 3
  f4 = function() print(x)
  f4()
}
f1()
#> [1] 1
f2()
#> [1] 1
f3()
#> [1] 3
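The f3()/f4() case is what makes "function factories" possible: a function created inside another one keeps access to the variables of the place where it was defined. A minimal sketch with a hypothetical factory makeAdder():

```r
# Hypothetical factory: each returned function remembers its own `step`
makeAdder = function(step){
  function(x) x + step   # `step` is looked up where this function was defined
}

add2 = makeAdder(2)
add10 = makeAdder(10)

step = 100   # a global `step` does not interfere

add2(1)
add10(1)
```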
The functions lapply() and sapply() are very handy, you should have them in your toolbox.
# loops over each element of X and applies a function to it:
lapply(X = iris[, c(1, 5)], FUN = class)
#> $Sepal.Length
#> [1] "numeric"
#> 
#> $Species
#> [1] "factor"
# sapply does the same but coerces the result into a vector:
sapply(iris[, c(1, 5)], class)
#> Sepal.Length      Species 
#>    "numeric"     "factor" 
Let's numerically compute the probability that x > 0 when x ∼ N(0, 1), for a varying number of observations: 10, 20, 30, and so on up to 100.
myfun = function(n) mean(rnorm(n) > 0)
sapply(10 * 1:10, myfun)
#>  [1] 0.6000000 0.3500000 0.6000000 0.5500000 0.2800000 0.5166667 0.4714286
#>  [8] 0.5250000 0.4222222 0.4900000
Now let's get the variance of the results for 100 repetitions of the experiment:
myfun = function(n, nRepeat){
  var(replicate(nRepeat, mean(rnorm(n) > 0)))
}
sapply(10 * 1:10, myfun, nRepeat = 100)
#>  [1] 0.027485859 0.013802020 0.008594725 0.005618434 0.005365657 0.004513917
#>  [7] 0.003145970 0.002894003 0.002501359 0.003228071
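These simulated variances can be compared with theory: each indicator x > 0 is a Bernoulli draw with probability 0.5, so the mean of n of them has variance 0.25 / n. A minimal sketch of that comparison. The seed, the 1000 repetitions and the names simVar, n_obs, empirical and theoretical are choices made for this illustration.

```r
set.seed(1)  # arbitrary seed, only to make the run reproducible

n_obs = 10 * 1:10
simVar = function(n, nRepeat) var(replicate(nRepeat, mean(rnorm(n) > 0)))

empirical = sapply(n_obs, simVar, nRepeat = 1000)
theoretical = 0.25 / n_obs   # variance of the mean of n Bernoulli(0.5) draws

round(cbind(n = n_obs, empirical, theoretical), 4)
```

The two columns should be close, and both shrink roughly in proportion to 1/n.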