Nowadays, many applications use text data (NLP and ML).
Dealing with text in R is fairly easy (and usually efficient).
We will only cover the most basic (and usually most useful) text manipulations in R.
To concatenate several character strings (henceforth CS), use paste():
paste("hi", "everyone")
#> [1] "hi everyone"
godNames = c("Zeus", "Aphrodite")
paste("Hello holy", godNames) # returns a vector
#> [1] "Hello holy Zeus" "Hello holy Aphrodite"
# What's the result of:
paste(c("iphone", "Samsung"), 1:6)
#> [1] "iphone 1" "Samsung 2" "iphone 3" "Samsung 4" "iphone 5" "Samsung 6"
paste() has two useful arguments:
sep, the separator placed between two CS (default is " ")
collapse, which, if provided, glues the resulting vector of CS into a single CS, using the value of collapse as the glue
paste("Hello holy", godNames, collapse = " and ") # 1 CS only
#> [1] "Hello holy Zeus and Hello holy Aphrodite"
paste("Hello holy", godNames, sep = "....") # vector of length 2
#> [1] "Hello holy....Zeus" "Hello holy....Aphrodite"
paste("Hello holy", godNames, sep = "....", collapse = " and ")
#> [1] "Hello holy....Zeus and Hello holy....Aphrodite"
collapse is applied after sep, so the last call behaves like these two steps:
charvec_tmp = paste("Hello holy", godNames, sep = "....")
paste(charvec_tmp, collapse = " and ")
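The two-step reading of collapse can be checked directly. A minimal sketch, reusing godNames from the example above (the names one_step and two_steps are introduced here only for illustration):

```r
godNames = c("Zeus", "Aphrodite")

# one call: sep is applied first, then collapse
one_step = paste("Hello holy", godNames, sep = "....", collapse = " and ")

# two calls: build the vector, then glue its elements together
charvec_tmp = paste("Hello holy", godNames, sep = "....")
two_steps = paste(charvec_tmp, collapse = " and ")

identical(one_step, two_steps)
#> [1] TRUE
```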
paste0
By default, the function paste concatenates with a space between the character elements:
paste("20", "22")
#> [1] "20 22"
Use paste0 to concatenate with the empty string:
paste0("20", "22")
#> [1] "2022"
It avoids adding the argument sep = "" in paste.
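paste0() is handy for building names. A small sketch (the file names are made up for illustration):

```r
# paste0() is vectorized like paste(): the shorter arguments are recycled
fileNames = paste0("data_", 2020:2022, ".csv")
fileNames
#> [1] "data_2020.csv" "data_2021.csv" "data_2022.csv"

# same result with paste() and an explicit empty separator
identical(fileNames, paste("data_", 2020:2022, ".csv", sep = ""))
#> [1] TRUE
```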
Let df = iris be a copy of the iris data.
Create a unique ID for the observations of df.
The character ID should be of the form: [Species name]_[order of appearance].
# something that may be useful
table(iris$Species) # frequencies
#>
#>     setosa versicolor  virginica
#>         50         50         50
"Laurent" == "laurent"
#> [1] FALSE
"bergé" == "berge"
#> [1] FALSE
"Laurent, Bergé" == "Laurent Bergé"
#> [1] FALSE
As you can see, although the values convey the same information, they are treated as different.
When dealing with text data, you first need to format it to obtain meaningful comparisons.
To convert to ASCII, the easiest way is to use iconv():
iconv("Laurent Bergé, €™", to = "ASCII")
#> [1] NA
iconv("Laurent Bergé, €™", to = "ASCII//IGNORE")
#> [1] "Laurent Berg, "
iconv("Laurent Bergé, €™", to = "ASCII//TRANSLIT")
#> [1] "Laurent Berge, ?T"
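The argument to defines the behavior of iconv for characters that cannot be converted: "ASCII" returns NA, "ASCII//IGNORE" drops them, and "ASCII//TRANSLIT" replaces them with a close equivalent when one exists (the exact result depends on the platform).
tolower() and toupper() change the case:
foxDog = "The Brown Fox Jumps Over The Lazy Dog"
tolower(foxDog)
#> [1] "the brown fox jumps over the lazy dog"
toupper(foxDog)
#> [1] "THE BROWN FOX JUMPS OVER THE LAZY DOG"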
# to extract a subset of a CS:
substr(foxDog, start = 1, stop = 13)
#> [1] "The Brown Fox"
substr(foxDog, 26, nchar(foxDog))
#> [1] "The Lazy Dog"
# You can apply it directly to vectors
substr(rep(foxDog, 2), c(1, 26), c(13, nchar(foxDog)))
#> [1] "The Brown Fox" "The Lazy Dog"
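Because start and stop are vectorized, substr() combines well with nchar(). A short sketch extracting the last three characters of each element (the vector words is made up for illustration):

```r
words = c("playing", "curling", "law")
n = nchar(words)

# last three characters of each element
lastThree = substr(words, n - 2, n)
lastThree
#> [1] "ing" "ing" "law"
```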
To split a CS, use strsplit():
strsplit(foxDog, split = "Jumps Over")
#> [[1]]
#> [1] "The Brown Fox " " The Lazy Dog"
What do you notice?
# It can be applied to vectors:
text = c("Rumble thy bellyful!", "Spit, fire!", "Spout, rain!", "Nor rain, wind, thunder, fire are my daughters.")
strsplit(text, split = " ")
#> [[1]]
#> [1] "Rumble" "thy" "bellyful!"
#>
#> [[2]]
#> [1] "Spit," "fire!"
#>
#> [[3]]
#> [1] "Spout," "rain!"
#>
#> [[4]]
#> [1] "Nor" "rain," "wind," "thunder," "fire"
#> [6] "are" "my" "daughters."
You can apply strsplit() to vectors. Since the number of pieces can vary from one element to the next, returning a list is natural.
For a single CS, don't forget the double brackets: strsplit(text, split)[[1]].
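A short sketch of how the list returned by strsplit() can be handled, assuming a two-element subset of the vector above (the names text_split, nWords and firstWords are introduced here for illustration):

```r
text = c("Spit, fire!", "Nor rain, wind, thunder, fire are my daughters.")
text_split = strsplit(text, split = " ")

# one list element per input CS
length(text_split)
#> [1] 2

# number of pieces in each element
nWords = sapply(text_split, length)
nWords
#> [1] 2 8

# for a single CS, [[1]] gives back a plain character vector
firstWords = strsplit("Spit, fire!", split = " ")[[1]]
firstWords
#> [1] "Spit," "fire!"
```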
Let's look at this corpus:
textvec = c("The Brown Fox Jumps Over The Lazy Dog", "Nor rain, wind, thunder, fire are my daughters.", "When my information changes, I alter my conclusions.")
textvec_split = strsplit(textvec, " ")
The file stopwords_en.RData contains English stopwords (common words that usually carry no specific meaning).
Use the function load to open it.
The operator x %in% s asks whether the elements of a vector x belong to the set s.
5 %in% 1:5
#> [1] TRUE
"bonjour" %in% c("bonjour", "les", "gens")
#> [1] TRUE
c("bonjour", "au revoir") %in% c("bonjour", "les", "gens")
#> [1] TRUE FALSE
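Since %in% returns a logical vector, it can be used directly for subsetting. A minimal sketch on a toy vector (smallWords is a made-up set, not the content of the stopword file):

```r
words = c("the", "brown", "fox", "and", "the", "dog")
smallWords = c("the", "and", "of") # made-up set for illustration

# TRUE where the word belongs to the set
isSmall = words %in% smallWords
isSmall
#> [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

# negate the test to keep only the other words
kept = words[!isSmall]
kept
#> [1] "brown" "fox"   "dog"
```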
Use %in% to recreate the following vector of text without stopwords:
textvec = c("The Brown Fox Jumps Over The Lazy Dog", "Nor rain, wind, thunder, fire are my daughters.", "When my information changes, I alter my conclusions.")
Say you have the following sentence:
The king infringes the law on playing curling.
You want to stem the sentence, i.e. remove the "ing" endings to keep only the root of the words.
The function gsub() takes in a character string and replaces a string pattern with another string.
# the arguments are in the original order of gsub
gsub(pattern = "jour", replacement = " soir", x = "Bonjour")
#> [1] "Bon soir"
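gsub() replaces every match and is vectorized over x. A short sketch, which also shows the base R function sub() (not covered in these slides), which takes the same arguments but replaces only the first match:

```r
x = c("banana", "ananas")

# gsub() replaces every match, in each element of x
allReplaced = gsub("an", "AN", x)
allReplaced
#> [1] "bANANa" "ANANas"

# sub() replaces only the first match of each element
firstReplaced = sub("an", "AN", x)
firstReplaced
#> [1] "bANana" "ANanas"
```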
So let's stem the sentence with gsub.
Let's suppress all the "ing":
kingText = "The king infringes the law on playing curling."
gsub(pattern = "ing", replacement = "", x = kingText)
#> [1] "The k infres the law on play curl."
Hmm, this was too strong: infringes became infres. Let's give it another shot:
# a space is added after "ing"
gsub("ing ", " ", kingText)
#> [1] "The k infringes the law on play curling."
That's better. But unfortunately new problems pop up:
curling is no longer treated (it is followed by a period, not a space)
king became k and its meaning is completely lost
We can easily deal with both issues with regular expressions!
gsub("([[:alpha:]]{3,})ing\\b", "\\1", kingText)
#> [1] "The king infringes the law on play curl."
Regular expressions are extremely powerful tools to deal with text data.
Regular expressions are a language per se which takes time to master, but it's worth it.
Regular expressions can be used in many (all?) programming languages!
Learn regular expressions!
In this course I'll detail only a few important features.
For more detailed information, look at ?regexp or at the many existing regular expression tutorials.
In R, regex special sequences are written with two backslashes, \\, because the backslash itself must be escaped inside an R string.
\\b means a word boundary (the start or the end of a word), a word being a succession of letters or digits.
gsub("ing\\b", "", kingText) # now works for "curling."
#> [1] "The k infringes the law on play curl."
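\\b can also be placed before a pattern, so that surrounding a word with \\b on both sides matches it only as a whole word. A small sketch on a made-up sentence:

```r
pets = "cat concatenate bobcat cat."

# without boundaries, "cat" is replaced wherever it appears
anywhere = gsub("cat", "dog", pets)
anywhere
#> [1] "dog condogenate bobdog dog."

# with a boundary on each side, only the whole word "cat" is replaced
wholeWord = gsub("\\bcat\\b", "dog", pets)
wholeWord
#> [1] "dog concatenate bobcat dog."
```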
The special argument [] means: any character that matches what's inside the brackets.
gsub("[aeiouy]", "_", kingText)
#> [1] "Th_ k_ng _nfr_ng_s th_ l_w _n pl___ng c_rl_ng."
Any vowel is replaced with "_".
The special argument [:alpha:] works only inside brackets and means all the letters of the alphabet:
[[:alpha:]] is equiv. (for ASCII letters) to [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]
gsub("[[:alpha:]]", "_", kingText)
#> [1] "___ ____ _________ ___ ___ __ _______ _______."
Only non-letters are left untouched (the spaces and the period).
Other examples are [:digit:] and [:punct:].
You can put anything you want inside the brackets: e.g. [[:punct:]123 ] matches any punctuation character, a space, or the digits 1 to 3.
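Character classes are useful for the formatting step discussed earlier. A minimal sketch, assuming we only want to ignore case and punctuation (normalize_cs is a hypothetical helper, not part of base R, and accents are not handled here):

```r
# hypothetical helper: lowercase, then drop punctuation
normalize_cs = function(x) {
  x = tolower(x)                  # ignore case
  x = gsub("[[:punct:]]", "", x)  # drop punctuation
  x
}

"Laurent, Berge" == "laurent berge"
#> [1] FALSE

sameAfter = normalize_cs("Laurent, Berge") == normalize_cs("laurent berge")
sameAfter
#> [1] TRUE
```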
When you want a pattern to be matched several times:
{a,b} means: the previous pattern appears at least a times and at most b times
+ means: the previous pattern appears at least once (equiv. {1,})
* means: the previous pattern appears 0 or more times (equiv. {0,}). Yes, this is useful.
Question: what does the following do?
gsub("\\b[[:alpha:]]{1,3}\\b", "_", kingText)
#> [1] "_ king infringes _ _ _ playing curling."
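A few more uses of the quantifiers, as a sketch on made-up strings:

```r
# "+" : squeeze runs of spaces into a single space
squeezed = gsub(" +", " ", "too    many   spaces")
squeezed
#> [1] "too many spaces"

# "{2,}" : two or more consecutive "o"
zeros = gsub("o{2,}", "0", "so good, sooo goood")
zeros
#> [1] "so g0d, s0 g0d"

# "*" : the "u" may appear zero or more times
colors = gsub("colou*r", "color", "colour color colouur")
colors
#> [1] "color color color"
```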
"." means "any character".
Say you want to delete everything after the word king:
gsub("king.+", "king", kingText)
#> [1] "The king"
As you've seen, some characters have a special meaning in regular expressions, so if you want to match them literally, you have to escape them with \\.
Use "|" to mean OR.
text = "[my.text.in.brackets]"
gsub("[", "", text) # error
#> Warning in gsub("[", "", text): TRE pattern compilation error 'Missing ']''
#> Error in gsub("[", "", text): invalid regular expression '[', reason 'Missing ']''
gsub("\\[", "", text) # OK
#> [1] "my.text.in.brackets]"
gsub("\\[|\\.|\\]", " ", text) # pipe means "or"
#> [1] " my text in brackets "
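Forgetting to escape does not always raise an error: an unescaped "." silently matches every character. A short sketch, which also uses the fixed = TRUE argument of gsub() (it treats the pattern as plain text instead of a regular expression):

```r
x = "my.text.in.brackets"

# "." means any character: everything is removed
noEscape = gsub(".", "", x)
noEscape
#> [1] ""

# escaped: only the periods are removed
escaped = gsub("\\.", "", x)
escaped
#> [1] "mytextinbrackets"

# fixed = TRUE: the pattern is taken literally, no escaping needed
literal = gsub(".", "", x, fixed = TRUE)
literal
#> [1] "mytextinbrackets"
```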
In the replacement, the special argument \\1 refers to what was matched by the first group in between parentheses (\\2 to the second group, and so on).
Question: what does that do?
text = "abc123 x22 work 32"
gsub("([[:alpha:]]+)([[:digit:]]+)", "\\2\\1", text)
#> [1] "123abc 22x work 32"
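Another sketch of the same mechanism, on made-up names: two groups are captured and written back in reverse order.

```r
fullNames = c("Berge, Laurent", "Doe, Jane")

# group 1: last name, group 2: first name; the comma is not in a group, so it is dropped
swapped = gsub("([[:alpha:]]+), ([[:alpha:]]+)", "\\2 \\1", fullNames)
swapped
#> [1] "Laurent Berge" "Jane Doe"
```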
With all our new knowledge, you now understand how this works:
gsub("([[:alpha:]]{3,})ing\\b", "\\1", kingText)
#> [1] "The king infringes the law on play curl."
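To see each piece of this pattern at work, here is a sketch that wraps it in a function and applies it to single words (stem_ing is a hypothetical name; this is a toy stemmer, not a real stemming algorithm):

```r
# hypothetical helper wrapping the regex above
stem_ing = function(x) {
  # at least 3 letters (kept as group 1), then "ing" at the end of a word
  gsub("([[:alpha:]]{3,})ing\\b", "\\1", x)
}

words = c("playing", "king", "sing", "running", "infringes")
stemmed = stem_ing(words)
stemmed
#> [1] "play"      "king"      "sing"      "runn"      "infringes"
```

"king" and "sing" are protected by {3,}, "infringes" by \\b, while "running" becomes "runn": simple rules like this one have their limits.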
Create the following regular expressions:
s
s when a word is at least 3 letters long (without the s).
text = "These guys like rhymes."
To test whether a pattern is found in a CS, use grepl():
text = c("hello", "folks", "goodbye")
grepl("e", text)
#> [1] TRUE FALSE TRUE
To improve the speed for large vectors, use the argument perl = TRUE.
Other resources:
stringr provides user-friendly versions of base R functions
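As a small illustration of grepl() mentioned above: since it returns a logical vector, it can be used to subset. A sketch, reusing the same vector (hasE and doubleO are names introduced here):

```r
text = c("hello", "folks", "goodbye")

# keep only the elements containing an "e"
hasE = grepl("e", text)
text[hasE]
#> [1] "hello"   "goodbye"

# perl = TRUE switches the regex engine; for this pattern the syntax is unchanged
doubleO = grepl("o{2}", text, perl = TRUE)
doubleO
#> [1] FALSE FALSE  TRUE
```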