Text Mining

R/Pharma 2020 Text modeling workshop

Emil Hvitfeldt

2020-10-09

{animals} data package

https://github.com/emilhvitfeldt/animals

Toy dataset of over 500 animals

Contains

  • text variable with a medium-length text describing each animal
  • Multiple metrics such as diet and lifestyle

Goal

Show how we can turn text into numbers

  • Tokenization
  • stop words
  • n-grams
  • tf-idf
  • stemming
  • spaCy

Your turn #1

Explore a couple of text descriptions

library(tidyverse)
library(tidytext)
library(animals)

replace ___ with an integer

animals %>%
slice(___) %>%
pull(text)

Your turn #1 - Result

Explore a couple of text descriptions

library(tidyverse)
library(tidytext)
library(animals)

replace ___ with an integer

animals %>%
slice(1) %>%
pull(text)
## Aardvark Classification and Evolution
## Aardvarks are small pig-like mammals that are
found inhabiting a wide range of different
habitats throughout Africa, south of the Sahara.
They are mostly solitary and spend their days
sleeping in underground burrows to protect them
from the heat of the African sun, emerging in the
cooler evening to search for food. Their name
originates from the Afrikaans language in South
Africa and means Earth Pig, due to their long
snout and pig-like body. Aardvarks are unique
among animals as they are the only surviving
species in their animal family. Until recently it
was widely believed that they were most closely
related to other insectivores such as armadillos
and pangolins but this is not the case with their
closest living relatives actually thought to be
elephants.
## Aardvark Anatomy and Appearance
## Aardvarks have a unique appearance amongst
mammals (and indeed all animals) as they display
physical characteristics of a number of different
animal species. They have medium-sized, almost
hairless bodies and long snouts that make them
look distinctly pig-like at first, with thick
skin that both protects them from the hot sun and
also from being harmed by insect bites. They are
able to close their nostrils to stop dust and
insects from entering their nose. They have
tubular, rabbit-like ears that can stand on end
but can also be folded flat to prevent dirt from
entering them when they are underground.
Aardvarks have strong, claws on each of their
spade-like feet that along with the fact that
their hind legs are longer than their front legs,
makes them strong and capable diggers able to
excavate vast amounts of earth at an alarming
rate. Due to the fact that they spend most of
their lives underground or out hunting in the
dark at night, they have poor eyesight but are
able to easily navigate their surrounding using
their excellent sense of smell to both find prey
and to sense potential danger.
## Aardvark Distribution and Habitat
## Aardvarks are found in a wide variety of
different habitats throughout sub-Saharan Africa
from dry deserts to the moist rainforest regions.
The only stipulation (other than having good
access to plenty of food and water) is to have
good soil in which they can dig their extensive
burrows. Despite being highly skilled at digging
in sandy or clay soil types, rockier regions
prove more of a challenge to create their
underground homes so the aardvark will move to
another area where soil conditions are better
suited to digging. Their burrows can be up to 10
meters (33 ft) long in a home range that can be
anywhere from 2 to 5 kilometres square. Their
burrows often having multiple entrances and are
always left head first so they are able to
identify potential predators easily using their
keen sense of smell.
## Aardvark Behaviour and Lifestyle
## Aardvarks are mainly solitary animals that
come together only to mate and are never found in
large groups. They live in underground burrows to
protect them both from the hot daytime sun and
from predators. Aardvarks are nocturnal mammals,
only leaving the safety of the burrow under the
cover of night when they go in search of food and
water, often travelling several miles in order to
find the biggest termite mounds guided by their
excellent hearing and sense of smell. Despite
often having a large burrow comprised of an
extensive network of tunnels, aardvarks are also
known to be able to quickly excavate small
temporary burrows where they can protect
themselves quickly rather than having to return
to their original dwelling.
## Aardvark Reproduction and Life Cycles
## Aardvarks have specific mating seasons that
occur every year. Depending on the region in
which the aardvark lives young can be born either
in October to November, or May to June in other
areas. Known to have babies most years, female
aardvarks give birth to a single offspring after
a gestation period that usually lasts for around
7 months. Newborn aardvarks often weigh as little
as 2kg and are born with hairless, pink skin in
the safety of their mother's burrow. Baby
aardvarks spend the first two weeks of their
lives in the safety of the underground burrow
before beginning to venture out with their mother
under the cover of night. However, despite
accompanying their mother in search of food they
aren't weaned until they are around three months
old. Young aardvarks live with their mother in
her burrow until they are around six months old
when they move out to dig a burrow of their own.
Although their lifespan in the wild is not
entirely clear, aardvarks tend to live for more
than 20 years in captivity.
## Aardvark Diet and Prey
## The diet of aardvarks is mainly comprised of
ants and termites, with termites being their
preferred food source. Despite this though, they
are known to also eat other insects such as
beetles and insect larvae. Aardvarks are built to
be insectivores, with strong limbs and claws that
are capable of breaking into the harder outer
shell of termite mounds very efficiently. Once
they have broken into the mound they then use
their long, sticky tongue to harvest the insects
inside and eat them whole without chewing as they
are then ground down in their muscular stomachs.
One of the aardvarks most distinctive features is
the fact that they have columnar cheek-teeth that
serve no functional purpose at all. With some
larger ant species that need to be chewed they
use the incisors that are located towards the
back of their mouths. Aardvarks are also able to
use the same techniques to break into underground
ant nests.
## Aardvark Predators and Threats
## Despite the fact that aardvarks are nocturnal
animals that live in the safety of underground
burrows, they are threatened by a number of
different predators throughout their natural
environment. Lions, leopards, hyenas and large
snakes (most notably pythons) are the main
predators of aardvarks but this does vary
depending on where the aardvark lives. Their main
form of defence is to escape very quickly
underground however, they are also known to be
quite aggressive when threatened by these larger
animals. Aardvarks use their strong, sharp claws
to try and injure their attacker along with
kicking the threatening animal with their
powerful back legs. Aardvarks are also threatened
by humans who hunt them and destroy their natural
habitats.
## Aardvark Interesting Facts and Features
## Aardvarks use their long, sticky tongue to lap
up to 50,000 insects a night from inside termite
mounds or underground ant nests. Their worm-like
tongues can actually grow up to 30 cm in length
meaning they can reach more termites further into
the mound. Their love of insects has actually led
aardvarks also being known as Antbears!
Interestingly enough, aardvarks are also thought
to get almost all of the moisture they need from
their prey meaning that they actually have to
physically drink very little water. Aardvarks are
thought to be one of the world's most prolific
diggers with their strong limbs and claws and
shovel-like feet helping them to be able to shift
2ft of soil in just 15 seconds!
## Aardvark Relationship with Humans
## Due to the fact that they spend the daytime
hours hidden in the safety of their underground
burrows, only emerging under the cover of night
to hunt for food, aardvarks are very seldom seen
by many people. In some regions though, they are
hunted by people for food and are becoming
increasingly affected by expanding human
populations as more of their natural habitats
disappear to make way for growing settlements.
## Aardvark Conservation Status and Life Today
## Today, aardvarks are listed by the IUCN as a
species that is of Least Concern. Despite the
fact that population numbers of aardvarks most
certainly declined in some countries, in others,
their numbers remain stable and they are often
commonly found in both protected areas and
regions with suitable habitats. They are however
becoming increasingly affected by habitat loss in
both the form of deforestation and expanding
towns and villages. Due to their incredibly
elusive nature, exact population sizes are not
fully understood.

Your turn #1 - Result

Explore a couple of text descriptions

library(tidyverse)
library(tidytext)
library(animals)

replace ___ with an integer

animals %>%
slice(3) %>%
pull(text)
## Adelie Penguin Classification and Evolution
## The Adelie Penguin is the smallest and most
widely distributed species of Penguin in the
Southern Ocean and is one of only two species of
Penguin found on the Antarctic mainland (the
other being the much larger Emperor Penguin). The
Adelie Penguin was named in 1840 by French
explorer Jules Dumont d'Urville who named the
Penguin for his wife, Adelie. Adelie Penguins
have adapted well to life in the Antarctic as
these migratory Birds winter in the northern
pack-ice before returning south to the Antarctic
coast for the warmer summer months.
## Adelie Penguin Anatomy and Appearance
## The Adelie Penguin is one of the most easily
identifiable Penguin species with a blue-black
back and completely white chest and belly. The
head and beak of the Adelie Penguin are both
black, with a distinctive white ring around each
eye. The strong, pink feet of the Adelie Penguin
are tough and bumpy with nails that not only aid
the Adelie Penguin in climbing the rocky cliffs
to reach its nesting grounds, but also help to
push them along when they are sliding (rowing)
along the ice. Adelie Penguins also use their
webbed feet along with their small flippers to
propel them along when swimming in the cold
waters.
## Adelie Penguin Distribution and Habitat
## The Adelie Penguin is one of the southern-most
Birds in the world as it is found along the
Antarctic coastline and on the islands close to
it. During the winter months, the Adelie Penguins
migrate north where they inhabit large platforms
of ice and have better access to food. During the
warmer summer months, the Adelie Penguins return
south where they head for the coastal beaches in
search of ice-free ground on the rocky slopes
where they can build their nests. More than half
a million Adelie Penguins have formed one of the
largest animal colonies in the world on Ross
Island, an island formed by the activities of
four monstrous volcanoes in the Ross Sea.
## Adelie Penguin Behaviour and Lifestyle
## Like all species of Penguin, the Adelie
Penguin is a highly sociable animal, gathering in
large groups known as colonies, which often
number thousands of Penguin individuals. Although
Adelie Penguins are not known to be terribly
territorial, it is not uncommon for adults to
become aggressive over nesting sites, and have
even been known to steal rocks from the nests of
their neighbours. Adelie Penguins are also known
to hunt in groups as it is thought to reduce the
risk of being eaten by hungry predators. Adelie
Penguins are constantly interacting with one
another, with body language and specific eye
movements thought to be the most common forms of
communication.
## Adelie Penguin Reproduction and Life Cycles
## Adelie Penguins return to their breeding
grounds during the Antarctic summer months of
November and December. Their soft feet are well
designed for walking on land making the trek to
its nesting ground much easier as the Penguin
fasts during this time. Adelie Penguin pairs mate
for life in large colonies, with females laying
two eggs a couple of days apart into a nest built
from rocks. Both the male and female take it in
turns to incubate their eggs while the other goes
off to feed, for up to 10 days at a time. The
Adelie Penguin chicks have an egg-tooth which is
a bump on the top of their beaks, which helps
them to break out of the egg. Once hatched, the
parents still take it in turns to look after
their young while the other goes off to gather
food. After about a month, the chicks congregate
in groups called crèches and are able to fend for
themselves at sea when they are between 2 and 3
months old.
## Adelie Penguin Diet and Prey
## Adelie Penguins are strong and capable
swimmers, obtaining all of their food from the
sea. These Penguins primarily feed on krill which
are found throughout the Antarctic ocean, as well
as Molluscs, Squid and small Fish. The record of
fossilised eggshell accumulated in the Adelie
Penguin colonies over the last 38,000 years
reveals a sudden change from a Fish-based diet to
Krill that started two hundred years ago. This is
thought to be due to the decline of the Antarctic
Fur Seal in the late 1700s and Baleen Whales in
the twentieth century. The reduction of
competition from these predators has resulted in
there being an abundance of Krill, which the
Adelie Penguins are now able to exploit as an
easier source of food.
## Adelie Penguin Predators and Threats
## Adult Adelie Penguins have no land based
predators due to the uncompromising conditions
that they inhabit. In the water however, the
biggest threat to the Adelie Penguin is the
Leopard Seal, which is one of the southern-most
species of Seal and a dominant predator in the
Southern Ocean. These Penguins have learnt to
avoid these predators by swimming in large groups
and not walking on thin ice. The Killer Whale is
the other main predator of the Adelie Penguin,
although they normally hunt larger species of
Penguin further north. South Polar Skuas are
known to prey on the Adelie Penguin's eggs if
left unguarded, along with chicks that have
strayed from a group.
## Adelie Penguin Interesting Facts and Features
## Adelie Penguins inhabit one of the coldest
environments on Earth and so have a thick layer
of fat under their skin helping to keep them
warm. Their feathers help to insulate them and
provide a waterproof layer for extra protection.
The Adelie Penguin is a highly efficient hunter
and is able to eat up to 2kg of food per day,
with a breeding colony thought to consume around
9,000 tonnes of food over 24 hours. The flippers
of the Adelie Penguin make them fantastic at
swimming and they can dive to depths of 175
meters in search of food. Adelie Penguins do not
have teeth as such but instead have tooth-shaped
barbs on their tongue and on the roof of their
mouths. These barbs do not exist for chewing but
instead assist the Penguin to swallow slippery
prey.
## Adelie Penguin Relationship with Humans
## A visit to the Adelie Penguin colonies has
long since been on the programme for tourists to
the Antarctic, who marvel at the vast numbers of
them nesting on the beaches and hunting in the
surrounding waters. This has meant that Adelie
Penguins are one of the most well-known of all
Penguin species today. Early explorers however,
also hunted the Penguins both for their meat and
their eggs in order to survive in such
uncompromising conditions.
## Adelie Penguin Conservation Status and Life
Today
## Despite having been confined to living on
coastal Antarctica, Adelie Penguins are one of
the most common and widespread Penguins in the
southern hemisphere. With more than 2.5 million
breeding pairs found throughout southern
Antarctica, the Adelie Penguin has adapted well
to its polar habitat. Scientists have also been
known to use Adelie Penguin nesting patterns as
indicators of climate change, noticing that they
are able to nest on beaches that were previously
covered in ice. The Adelie Penguin is listed as
Least Concern.

💫 TOKENIZATION 💫

Tokenization

  • The process of splitting text into smaller pieces of text (tokens)
  • Most common token == word, but sometimes we tokenize in a different way
  • An essential part of most text analyses
  • Many options to take into consideration

Tokenization

We can use unnest_tokens() from {tidytext} to turn text into words

animals %>%
select(text) %>%
unnest_tokens(word, text)
## # A tibble: 10,316 x 1
## word
## <chr>
## 1 aardvark
## 2 classification
## 3 and
## 4 evolution
## 5 aardvarks
## 6 are
## 7 small
## 8 pig
## 9 like
## 10 mammals
## # … with 10,306 more rows

Your turn #2

We can look at the most frequent tokens by using count()

animals %>%
unnest_tokens(output = ___, input = text) %>%
count(___, sort = TRUE)

Your turn #2 - results

We can look at the most frequent tokens by using count()

animals %>%
unnest_tokens(output = word, input = text)
## # A tibble: 10,316 x 4
## diet lifestyle mean_weight word
## <chr> <chr> <dbl> <chr>
## 1 Omnivore Nocturnal 70 aardvark
## 2 Omnivore Nocturnal 70 classification
## 3 Omnivore Nocturnal 70 and
## 4 Omnivore Nocturnal 70 evolution
## 5 Omnivore Nocturnal 70 aardvarks
## 6 Omnivore Nocturnal 70 are
## 7 Omnivore Nocturnal 70 small
## 8 Omnivore Nocturnal 70 pig
## 9 Omnivore Nocturnal 70 like
## 10 Omnivore Nocturnal 70 mammals
## # … with 10,306 more rows

Your turn #2 - results

We can look at the most frequent tokens by using count()

animals %>%
unnest_tokens(output = word, input = text) %>%
count(word, sort = TRUE)
## # A tibble: 1,774 x 2
## word n
## <chr> <int>
## 1 the 682
## 2 and 370
## 3 of 336
## 4 to 329
## 5 african 278
## 6 in 220
## 7 are 166
## 8 is 164
## 9 a 150
## 10 their 137
## # … with 1,764 more rows
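
To make the tokenize-and-count step more concrete, here is a minimal base R sketch on a made-up two-sentence corpus. The `docs` vector is an assumption for illustration and is not taken from {animals}. The code is a rough stand-in for the idea behind unnest_tokens() followed by count(), not a description of how {tidytext} is implemented.

```r
# Hypothetical two-document corpus (not from the {animals} package)
docs <- c("Aardvarks are small pig-like mammals.",
          "Aardvarks are nocturnal mammals.")

# Lowercase everything, then split on any run of non-letter characters
tokens <- unlist(strsplit(tolower(docs), "[^a-z]+"))

# Drop empty strings that can appear when a text starts with a non-letter
tokens <- tokens[tokens != ""]

# Tabulate the tokens and sort by frequency, most frequent first
counts <- sort(table(tokens), decreasing = TRUE)
head(counts)
```

Even in this tiny example the top of the table is shared between a content word ("aardvarks") and a function word ("are"), which is the same pattern the real output shows at scale.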

Tokenization: whitespace

token_example
## [1] "Their name originates from the Afrikaans language in South Africa and
means Earth Pig, due to their long snout and pig-like body."
strsplit(token_example, "\\s")
## [[1]]
## [1] "Their" "name" "originates" "from" "the" "Afrikaans"
## [7] "language" "in" "South" "Africa" "and" "means"
## [13] "Earth" "Pig," "due" "to" "their" "long"
## [19] "snout" "and" "pig-like" "body."

Tokenization: tokenizers package

token_example
## [1] "Their name originates from the Afrikaans language in South Africa and
means Earth Pig, due to their long snout and pig-like body."
library(tokenizers)
tokenize_words(token_example)
## [[1]]
## [1] "their" "name" "originates" "from" "the" "afrikaans"
## [7] "language" "in" "south" "africa" "and" "means"
## [13] "earth" "pig" "due" "to" "their" "long"
## [19] "snout" "and" "pig" "like" "body"

Tokenization: spaCy library

token_example
## [1] "Their name originates from the Afrikaans language in South Africa and
means Earth Pig, due to their long snout and pig-like body."
library(spacyr)
spacy_tokenize(token_example)
## $text1
## [1] "Their" "name" "originates" "from" "the" "Afrikaans"
## [7] "language" "in" "South" "Africa" "and" "means"
## [13] "Earth" "Pig" "," "due" "to" "their"
## [19] "long" "snout" "and" "pig" "-" "like"
## [25] "body" "."

whitespace

## [[1]]
## [1] "Their" "name" "originates" "from" "the" "Afrikaans"
## [7] "language" "in" "South" "Africa" "and" "means"
## [13] "Earth" "Pig," "due" "to" "their" "long"
## [19] "snout" "and" "pig-like" "body."

spaCy library

## $text1
## [1] "Their" "name" "originates" "from" "the" "Afrikaans"
## [7] "language" "in" "South" "Africa" "and" "means"
## [13] "Earth" "Pig" "," "due" "to" "their"
## [19] "long" "snout" "and" "pig" "-" "like"
## [25] "body" "."

Tokenization considerations

  • Should we turn UPPERCASE letters to lowercase?
  • How should we handle punctuation⁉️
  • What about non-word characters inside words?
  • Should compound words be split or multi-word ideas be kept together?
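
The effect of these choices can be sketched in base R. The three variants below are illustrative assumptions (the names `ws`, `keep_hyphen`, and `split_hyphen` are made up for this example) and do not correspond to any particular tokenizer package.

```r
x <- "Earth Pig, due to their long snout and pig-like body."

# Option 1: split on whitespace only; case and punctuation are kept,
# so "Pig," and "body." carry their punctuation with them
ws <- strsplit(x, "\\s+")[[1]]

# Option 2: lowercase, drop every character that is not a letter,
# whitespace, or hyphen, then split on whitespace; "pig-like" stays whole
clean <- gsub("[^a-z\\s-]", "", tolower(x), perl = TRUE)
keep_hyphen <- strsplit(clean, "\\s+")[[1]]

# Option 3: lowercase and split on any run of non-letters,
# which also breaks "pig-like" into "pig" and "like"
split_hyphen <- strsplit(tolower(x), "[^a-z]+")[[1]]

length(ws)            # 10 tokens
length(keep_hyphen)   # 10 tokens
length(split_hyphen)  # 11 tokens
```

Note that under option 3 the token "pig" is counted twice, while under option 1 "Pig," and "pig-like" are two unrelated tokens. Which behaviour is preferable depends on the analysis.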

Tokenization is typically much easier for English text than for other languages.

N-grams

A sequence of n sequential tokens

  • Captures words that appear together often
  • Can detect negations ("not happy")
  • Larger cardinality (many more distinct tokens than single words)

N-grams for n = 1, 2, 3

tokenize_ngrams("due to their long snout and pig-like body.", n = 1)
## [[1]]
## [1] "due" "to" "their" "long" "snout" "and" "pig" "like" "body"
tokenize_ngrams("due to their long snout and pig-like body.", n = 2)
## [[1]]
## [1] "due to" "to their" "their long" "long snout" "snout and" "and pig"
## [7] "pig like" "like body"
tokenize_ngrams("due to their long snout and pig-like body.", n = 3)
## [[1]]
## [1] "due to their" "to their long" "their long snout" "long snout and"
## [5] "snout and pig" "and pig like" "pig like body"
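
For intuition, the sliding-window idea behind n-grams can be written in a few lines of base R. The helper `ngrams()` below is a hypothetical function written for this example; it is not part of {tokenizers} and only approximates what tokenize_ngrams() does after the text has already been split into word tokens.

```r
# Hypothetical helper: build n-grams from a character vector of tokens
ngrams <- function(tokens, n) {
  # Too few tokens to form even one n-gram
  if (length(tokens) < n) return(character(0))
  # Each n-gram starts at position i and covers tokens i, ..., i + n - 1
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens <- c("due", "to", "their", "long", "snout", "and", "pig", "like", "body")

ngrams(tokens, 2)  # 8 bigrams, starting with "due to"
ngrams(tokens, 3)  # 7 trigrams, starting with "due to their"
```

A sequence of k tokens yields k - n + 1 n-grams, which matches the 9, 8, and 7 results shown above for n = 1, 2, 3.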

Tokenization

See Chapter 2 for more!

🛑 STOP WORDS 🛑

Stop words

library(stopwords)
stopwords(language = "en", source = "snowball")
## [1] "i" "me" "my" "myself" "we" "our"
## [7] "ours" "ourselves" "you" "your" "yours" "yourself"
## [13] "yourselves" "he" "him" "his" "himself" "she"
## [19] "her" "hers" "herself" "it" "its" "itself"
## [25] "they" "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that" "these"
## [37] "those" "am" "is" "are" "was" "were"
## [43] "be" "been" "being" "have" "has" "had"
## [49] "having" "do" "does" "did" "doing" "would"
## [55] "should" "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've" "you've"
## [67] "we've" "they've" "i'd" "you'd" "he'd" "she'd"
## [73] "we'd" "they'd" "i'll" "you'll" "he'll" "she'll"
## [79] "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
## [85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot"
## [97] "couldn't" "mustn't" "let's" "that's" "who's" "what's"
## [103] "here's" "there's" "when's" "where's" "why's" "how's"
## [109] "a" "an" "the" "and" "but" "if"
## [115] "or" "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about" "against"
## [127] "between" "into" "through" "during" "before" "after"
## [133] "above" "below" "to" "from" "up" "down"
## [139] "in" "out" "on" "off" "over" "under"
## [145] "again" "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all" "any"
## [157] "both" "each" "few" "more" "most" "other"
## [163] "some" "such" "no" "nor" "not" "only"
## [169] "own" "same" "so" "than" "too" "very"
## [175] "will"

Stop words

library(tidytext)
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows

Your turn #3

Unscramble this pipe to

  • tokenize text into tokens
  • remove stop words
  • count most frequent tokens
unnest_tokens(word, text) %>%
count(word, sort = TRUE)
animals %>%
anti_join(stop_words) %>%

Your turn #3 - result

animals %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word") %>%
count(word, sort = TRUE)
## # A tibble: 1,471 x 2
## word n
## <chr> <int>
## 1 african 278
## 2 elephant 89
## 3 civet 84
## 4 bush 63
## 5 forest 58
## 6 clawed 51
## 7 adelie 47
## 8 palm 46
## 9 frog 42
## 10 penguin 40
## # … with 1,461 more rows

funky stop words quiz #1

  • he's
  • she's
  • himself
  • herself

"she's" doesn't appear in the SMART list

funky stop words quiz #2

  • owl
  • bee
  • fify
  • system1

"fify" went undetected in the scikit-learn stop word list for 3 years (2012 to 2015)

funky stop words quiz #3

  • substantially
  • successfully
  • sufficiently
  • statistically

"statistically" doesn't appear in the Stopwords ISO list

Stop words

  • Stop words are context specific
  • Stop word lexicons can have bias
  • You can create your own stop word list
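
A custom stop word list is just a character vector. The sketch below uses made-up tokens and a made-up list (both are assumptions for illustration) to show how a domain-specific word such as "african" could be treated as a stop word in a corpus where it appears in almost every animal name.

```r
# Hypothetical tokens from a single description
tokens <- c("the", "african", "elephant", "is", "an", "african", "animal")

# Hypothetical custom list: a few generic stop words plus one
# domain-specific word that carries little information in this corpus
custom_stop_words <- c("the", "is", "an", "african")

# Keep only the tokens that are not in the custom list
kept <- tokens[!tokens %in% custom_stop_words]
kept
```

With a tibble of tokens, the same list could be wrapped in a one-column data frame and passed to anti_join() in place of stop_words.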

🛑 LOOK AT YOUR STOP WORDS 🛑

See Chapter 3 for more! 🛑

🌷 STEMMING 🌷

Stemming

Some words are similar (teacher, teachers, teachings, teach) but will be counted separately

Stemming is the act of removing characters from the end of a word to get the "stem" of the word

This task is HIGHLY language dependent
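
As a rough illustration of "removing characters from the end of a word", here is a deliberately naive suffix stripper in base R. The function `naive_stem()` is a hypothetical toy written for this example. It is not the Porter algorithm and not what SnowballC::wordStem() does; it only handles a handful of English suffixes.

```r
# Hypothetical toy stemmer: strip one trailing English suffix, if present
naive_stem <- function(words) {
  sub("(ings|ing|ers|er|s)$", "", words)
}

naive_stem(c("teacher", "teachers", "teachings", "teach"))
# All four words reduce to "teach"

naive_stem("is")
# Over-stemming: returns "i"
```

The second call shows why real stemmers need more careful, language-specific rules: a blunt rule that works for "teachers" also mangles short words such as "is".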

Stemming

library(SnowballC)
animals %>%
unnest_tokens(word, text) %>%
mutate(word_stem = wordStem(word)) %>%
select(word, word_stem)
## # A tibble: 10,316 x 2
## word word_stem
## <chr> <chr>
## 1 aardvark aardvark
## 2 classification classif
## 3 and and
## 4 evolution evolut
## 5 aardvarks aardvark
## 6 are ar
## 7 small small
## 8 pig pig
## 9 like like
## 10 mammals mammal
## # … with 10,306 more rows

See Chapter 4 for more! 🌷

spaCy - more advanced preprocessing

library(spacyr)
spacy_parse(animals$text)
## # A tibble: 11,362 x 7
## doc_id sentence_id token_id token lemma pos entity
## <chr> <int> <int> <chr> <chr> <chr> <chr>
## 1 text1 1 1 "Aardvark" "aardvark" PROPN "ORG_B"
## 2 text1 1 2 "Classification" "classification" PROPN "ORG_I"
## 3 text1 1 3 "and" "and" CCONJ "ORG_I"
## 4 text1 1 4 "Evolution" "evolution" PROPN "ORG_I"
## 5 text1 1 5 "\n" "\n" SPACE "ORG_I"
## 6 text1 1 6 "Aardvarks" "aardvarks" PROPN "ORG_I"
## 7 text1 1 7 "are" "be" VERB ""
## 8 text1 1 8 "small" "small" ADJ ""
## 9 text1 1 9 "pig" "pig" NOUN ""
## 10 text1 1 10 "-" "-" PUNCT ""
## # … with 11,352 more rows