https://github.com/emilhvitfeldt/animals
Toy dataset of over 500 animals
Contains a text variable with medium-length descriptions of the animals, plus diet and lifestyle variables
Show how we can turn text into numbers
Explore a couple of text descriptions
library(tidyverse)
library(tidytext)
library(animals)
# replace ___ with an integer
animals %>% slice(___) %>% pull(text)
animals %>% slice(1) %>% pull(text)
## Aardvark Classification and Evolution
## Aardvarks are small pig-like mammals that are found inhabiting a wide range of different habitats throughout Africa, south of the Sahara. They are mostly solitary and spend their days sleeping in underground burrows to protect them from the heat of the African sun, emerging in the cooler evening to search for food. Their name originates from the Afrikaans language in South Africa and means Earth Pig, due to their long snout and pig-like body. Aardvarks are unique among animals as they are the only surviving species in their animal family. Until recently it was widely believed that they were most closely related to other insectivores such as armadillos and pangolins but this is not the case with their closest living relatives actually thought to be elephants.
## Aardvark Anatomy and Appearance
## Aardvarks have a unique appearance amongst mammals (and indeed all animals) as they display physical characteristics of a number of different animal species. They have medium-sized, almost hairless bodies and long snouts that make them look distinctly pig-like at first, with thick skin that both protects them from the hot sun and also from being harmed by insect bites. They are able to close their nostrils to stop dust and insects from entering their nose. They have tubular, rabbit-like ears that can stand on end but can also be folded flat to prevent dirt from entering them when they are underground. Aardvarks have strong, claws on each of their spade-like feet that along with the fact that their hind legs are longer than their front legs, makes them strong and capable diggers able to excavate vast amounts of earth at an alarming rate. Due to the fact that they spend most of their lives underground or out hunting in the dark at night, they have poor eyesight but are able to easily navigate their surrounding using their excellent sense of smell to both find prey and to sense potential danger.
## Aardvark Distribution and Habitat
## Aardvarks are found in a wide variety of different habitats throughout sub-Saharan Africa from dry deserts to the moist rainforest regions. The only stipulation (other than having good access to plenty of food and water) is to have good soil in which they can dig their extensive burrows. Despite being highly skilled at digging in sandy or clay soil types, rockier regions prove more of a challenge to create their underground homes so the aardvark will move to another area where soil conditions are better suited to digging. Their burrows can be up to 10 meters (33 ft) long in a home range that can be anywhere from 2 to 5 kilometres square. Their burrows often having multiple entrances and are always left head first so they are able to identify potential predators easily using their keen sense of smell.
## Aardvark Behaviour and Lifestyle
## Aardvarks are mainly solitary animals that come together only to mate and are never found in large groups. They live in underground burrows to protect them both from the hot daytime sun and from predators. Aardvarks are nocturnal mammals, only leaving the safety of the burrow under the cover of night when they go in search of food and water, often travelling several miles in order to find the biggest termite mounds guided by their excellent hearing and sense of smell. Despite often having a large burrow comprised of an extensive network of tunnels, aardvarks are also known to be able to quickly excavate small temporary burrows where they can protect themselves quickly rather than having to return to their original dwelling.
## Aardvark Reproduction and Life Cycles
## Aardvarks have specific mating seasons that occur every year. Depending on the region in which the aardvark lives young can be born either in October to November, or May to June in other areas. Known to have babies most years, female aardvarks give birth to a single offspring after a gestation period that usually lasts for around 7 months. Newborn aardvarks often weigh as little as 2kg and are born with hairless, pink skin in the safety of their mother's burrow. Baby aardvarks spend the first two weeks of their lives in the safety of the underground burrow before beginning to venture out with their mother under the cover of night. However, despite accompanying their mother in search of food they aren't weaned until they are around three months old. Young aardvarks live with their mother in her burrow until they are around six months old when they move out to dig a burrow of their own. Although their lifespan in the wild is not entirely clear, aardvarks tend to live for more than 20 years in captivity.
## Aardvark Diet and Prey
## The diet of aardvarks is mainly comprised of ants and termites, with termites being their preferred food source. Despite this though, they are known to also eat other insects such as beetles and insect larvae. Aardvarks are built to be insectivores, with strong limbs and claws that are capable of breaking into the harder outer shell of termite mounds very efficiently. Once they have broken into the mound they then use their long, sticky tongue to harvest the insects inside and eat them whole without chewing as they are then ground down in their muscular stomachs. One of the aardvarks most distinctive features is the fact that they have columnar cheek-teeth that serve no functional purpose at all. With some larger ant species that need to be chewed they use the incisors that are located towards the back of their mouths. Aardvarks are also able to use the same techniques to break into underground ant nests.
## Aardvark Predators and Threats
## Despite the fact that aardvarks are nocturnal animals that live in the safety of underground burrows, they are threatened by a number of different predators throughout their natural environment. Lions, leopards, hyenas and large snakes (most notably pythons) are the main predators of aardvarks but this does vary depending on where the aardvark lives. Their main form of defence is to escape very quickly underground however, they are also known to be quite aggressive when threatened by these larger animals. Aardvarks use their strong, sharp claws to try and injure their attacker along with kicking the threatening animal with their powerful back legs. Aardvarks are also threatened by humans who hunt them and destroy their natural habitats.
## Aardvark Interesting Facts and Features
## Aardvarks use their long, sticky tongue to lap up to 50,000 insects a night from inside termite mounds or underground ant nests. Their worm-like tongues can actually grow up to 30 cm in length meaning they can reach more termites further into the mound. Their love of insects has actually led aardvarks also being known as Antbears! Interestingly enough, aardvarks are also thought to get almost all of the moisture they need from their prey meaning that they actually have to physically drink very little water. Aardvarks are thought to be one of the world's most prolific diggers with their strong limbs and claws and shovel-like feet helping them to be able to shift 2ft of soil in just 15 seconds!
## Aardvark Relationship with Humans
## Due to the fact that they spend the daytime hours hidden in the safety of their underground burrows, only emerging under the cover of night to hunt for food, aardvarks are very seldom seen by many people. In some regions though, they are hunted by people for food and are becoming increasingly affected by expanding human populations as more of their natural habitats disappear to make way for growing settlements.
## Aardvark Conservation Status and Life Today
## Today, aardvarks are listed by the IUCN as a species that is of Least Concern. Despite the fact that population numbers of aardvarks most certainly declined in some countries, in others, their numbers remain stable and they are often commonly found in both protected areas and regions with suitable habitats. They are however becoming increasingly affected by habitat loss in both the form of deforestation and expanding towns and villages. Due to their incredibly elusive nature, exact population sizes are not fully understood.
animals %>% slice(3) %>% pull(text)
## Adelie Penguin Classification and Evolution
## The Adelie Penguin is the smallest and most widely distributed species of Penguin in the Southern Ocean and is one of only two species of Penguin found on the Antarctic mainland (the other being the much larger Emperor Penguin). The Adelie Penguin was named in 1840 by French explorer Jules Dumont d'Urville who named the Penguin for his wife, Adelie. Adelie Penguins have adapted well to life in the Antarctic as these migratory Birds winter in the northern pack-ice before returning south to the Antarctic coast for the warmer summer months.
## Adelie Penguin Anatomy and Appearance
## The Adelie Penguin is one of the most easily identifiable Penguin species with a blue-black back and completely white chest and belly. The head and beak of the Adelie Penguin are both black, with a distinctive white ring around each eye. The strong, pink feet of the Adelie Penguin are tough and bumpy with nails that not only aid the Adelie Penguin in climbing the rocky cliffs to reach its nesting grounds, but also help to push them along when they are sliding (rowing) along the ice. Adelie Penguins also use their webbed feet along with their small flippers to propel them along when swimming in the cold waters.
## Adelie Penguin Distribution and Habitat
## The Adelie Penguin is one of the southern-most Birds in the world as it is found along the Antarctic coastline and on the islands close to it. During the winter months, the Adelie Penguins migrate north where they inhabit large platforms of ice and have better access to food. During the warmer summer months, the Adelie Penguins return south where they head for the coastal beaches in search of ice-free ground on the rocky slopes where they can build their nests. More than half a million Adelie Penguins have formed one of the largest animal colonies in the world on Ross Island, an island formed by the activities of four monstrous volcanoes in the Ross Sea.
## Adelie Penguin Behaviour and Lifestyle
## Like all species of Penguin, the Adelie Penguin is a highly sociable animal, gathering in large groups known as colonies, which often number thousands of Penguin individuals. Although Adelie Penguins are not known to be terribly territorial, it is not uncommon for adults to become aggressive over nesting sites, and have even been known to steal rocks from the nests of their neighbours. Adelie Penguins are also known to hunt in groups as it is thought to reduce the risk of being eaten by hungry predators. Adelie Penguins are constantly interacting with one another, with body language and specific eye movements thought to be the most common forms of communication.
## Adelie Penguin Reproduction and Life Cycles
## Adelie Penguins return to their breeding grounds during the Antarctic summer months of November and December. Their soft feet are well designed for walking on land making the trek to its nesting ground much easier as the Penguin fasts during this time. Adelie Penguin pairs mate for life in large colonies, with females laying two eggs a couple of days apart into a nest built from rocks. Both the male and female take it in turns to incubate their eggs while the other goes off to feed, for up to 10 days at a time. The Adelie Penguin chicks have an egg-tooth which is a bump on the top of their beaks, which helps them to break out of the egg. Once hatched, the parents still take it in turns to look after their young while the other goes off to gather food. After about a month, the chicks congregate in groups called crèches and are able to fend for themselves at sea when they are between 2 and 3 months old.
## Adelie Penguin Diet and Prey
## Adelie Penguins are strong and capable swimmers, obtaining all of their food from the sea. These Penguins primarily feed on krill which are found throughout the Antarctic ocean, as well as Molluscs, Squid and small Fish. The record of fossilised eggshell accumulated in the Adelie Penguin colonies over the last 38,000 years reveals a sudden change from a Fish-based diet to Krill that started two hundred years ago. This is thought to be due to the decline of the Antarctic Fur Seal in the late 1700s and Baleen Whales in the twentieth century. The reduction of competition from these predators has resulted in there being an abundance of Krill, which the Adelie Penguins are now able to exploit as an easier source of food.
## Adelie Penguin Predators and Threats
## Adult Adelie Penguins have no land based predators due to the uncompromising conditions that they inhabit. In the water however, the biggest threat to the Adelie Penguin is the Leopard Seal, which is one of the southern-most species of Seal and a dominant predator in the Southern Ocean. These Penguins have learnt to avoid these predators by swimming in large groups and not walking on thin ice. The Killer Whale is the other main predator of the Adelie Penguin, although they normally hunt larger species of Penguin further north. South Polar Skuas are known to prey on the Adelie Penguin's eggs if left unguarded, along with chicks that have strayed from a group.
## Adelie Penguin Interesting Facts and Features
## Adelie Penguins inhabit one of the coldest environments on Earth and so have a thick layer of fat under their skin helping to keep them warm. Their feathers help to insulate them and provide a waterproof layer for extra protection. The Adelie Penguin is a highly efficient hunter and is able to eat up to 2kg of food per day, with a breeding colony thought to consume around 9,000 tonnes of food over 24 hours. The flippers of the Adelie Penguin make them fantastic at swimming and they can dive to depths of 175 meters in search of food. Adelie Penguins do not have teeth as such but instead have tooth-shaped barbs on their tongue and on the roof of their mouths. These barbs do not exist for chewing but instead assist the Penguin to swallow slippery prey.
## Adelie Penguin Relationship with Humans
## A visit to the Adelie Penguin colonies has long since been on the programme for tourists to the Antarctic, who marvel at the vast numbers of them nesting on the beaches and hunting in the surrounding waters. This has meant that Adelie Penguins are one of the most well-known of all Penguin species today. Early explorers however, also hunted the Penguins both for their meat and their eggs in order to survive in such uncompromising conditions.
## Adelie Penguin Conservation Status and Life Today
## Despite having been confined to living on coastal Antarctica, Adelie Penguins are one of the most common and widespread Penguins in the southern hemisphere. With more than 2.5 million breeding pairs found throughout southern Antarctica, the Adelie Penguin has adapted well to its polar habitat. Scientists have also been known to use Adelie Penguin nesting patterns as indicators of climate change, noticing that they are able to nest on beaches that were previously covered in ice. The Adelie Penguin is listed as Least Concern.
We can use unnest_tokens() from {tidytext} to turn text into words
animals %>% select(text) %>% unnest_tokens(word, text)
## # A tibble: 10,316 x 1
##    word          
##    <chr>         
##  1 aardvark      
##  2 classification
##  3 and           
##  4 evolution     
##  5 aardvarks     
##  6 are           
##  7 small         
##  8 pig           
##  9 like          
## 10 mammals       
## # … with 10,306 more rows
We can look at the most frequent tokens by using count()
animals %>% unnest_tokens(output = ___, input = text) %>% count(___, sort = TRUE)
animals %>% unnest_tokens(output = word, input = text)
## # A tibble: 10,316 x 4
##    diet     lifestyle mean_weight word          
##    <chr>    <chr>           <dbl> <chr>         
##  1 Omnivore Nocturnal          70 aardvark      
##  2 Omnivore Nocturnal          70 classification
##  3 Omnivore Nocturnal          70 and           
##  4 Omnivore Nocturnal          70 evolution     
##  5 Omnivore Nocturnal          70 aardvarks     
##  6 Omnivore Nocturnal          70 are           
##  7 Omnivore Nocturnal          70 small         
##  8 Omnivore Nocturnal          70 pig           
##  9 Omnivore Nocturnal          70 like          
## 10 Omnivore Nocturnal          70 mammals       
## # … with 10,306 more rows
animals %>% unnest_tokens(output = word, input = text) %>% count(word, sort = TRUE)
## # A tibble: 1,774 x 2
##    word        n
##    <chr>   <int>
##  1 the       682
##  2 and       370
##  3 of        336
##  4 to        329
##  5 african   278
##  6 in        220
##  7 are       166
##  8 is        164
##  9 a         150
## 10 their     137
## # … with 1,764 more rows
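The same two steps can be tried on a made-up corpus small enough to count by hand. This is only a sketch: the `toy` tibble and its `id` column are assumptions for illustration and are not part of {animals}.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus (not from the animals package)
toy <- tibble(
  id = 1:2,
  text = c("The aardvark digs. The aardvark eats ants.",
           "Penguins eat krill.")
)

# One row per token; unnest_tokens() lowercases and strips punctuation by default
toy_tokens <- toy %>%
  unnest_tokens(output = word, input = text)

# Token counts over the whole corpus, most frequent first
toy_counts <- toy_tokens %>%
  count(word, sort = TRUE)

# Token counts per document: one number per (document, word) pair
toy_counts_by_doc <- toy_tokens %>%
  count(id, word, sort = TRUE)

toy_counts
toy_counts_by_doc
```

Counting by `id` and `word` rather than `word` alone is what turns each document into its own set of numbers, which is the form a model would eventually need.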
token_example
## [1] "Their name originates from the Afrikaans language in South Africa and means Earth Pig, due to their long snout and pig-like body."
strsplit(token_example, "\\s")
## [[1]]
##  [1] "Their"      "name"       "originates" "from"       "the"        "Afrikaans" 
##  [7] "language"   "in"         "South"      "Africa"     "and"        "means"     
## [13] "Earth"      "Pig,"       "due"        "to"         "their"      "long"      
## [19] "snout"      "and"        "pig-like"   "body."
library(tokenizers)
tokenize_words(token_example)
## [[1]]
##  [1] "their"      "name"       "originates" "from"       "the"        "afrikaans" 
##  [7] "language"   "in"         "south"      "africa"     "and"        "means"     
## [13] "earth"      "pig"        "due"        "to"         "their"      "long"      
## [19] "snout"      "and"        "pig"        "like"       "body"
library(spacyr)
spacy_tokenize(token_example)
## $text1
##  [1] "Their"      "name"       "originates" "from"       "the"        "Afrikaans" 
##  [7] "language"   "in"         "South"      "Africa"     "and"        "means"     
## [13] "Earth"      "Pig"        ","          "due"        "to"         "their"     
## [19] "long"       "snout"      "and"        "pig"        "-"          "like"      
## [25] "body"       "."
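To see how much the choice of tokenizer matters, the two approaches that need no external setup can be compared directly. This is a sketch: the slides never show how `token_example` is created, so assigning it as a plain string is an assumption, and `ws_tokens` and `word_tokens` are illustrative names.

```r
library(tokenizers)

# Assumed definition: the sentence printed on the slides, stored as one string
token_example <- "Their name originates from the Afrikaans language in South Africa and means Earth Pig, due to their long snout and pig-like body."

# Whitespace splitting keeps case and leaves punctuation attached ("Pig,", "body.")
ws_tokens <- strsplit(token_example, "\\s")[[1]]

# tokenize_words() lowercases, drops punctuation and splits "pig-like" in two
word_tokens <- tokenize_words(token_example)[[1]]

length(ws_tokens)    # 22 tokens, matching the strsplit() output
length(word_tokens)  # 23 tokens, matching the tokenize_words() output

# Whitespace tokens that do not survive unchanged under tokenize_words()
setdiff(ws_tokens, word_tokens)
```

Every token returned by `setdiff()` would become a different feature depending on the tokenizer, so the tokenizer is effectively a modelling choice.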
n-grams: n sequential tokens
tokenize_ngrams("due to their long snout and pig-like body.", n = 1)
## [[1]]
## [1] "due"   "to"    "their" "long"  "snout" "and"   "pig"   "like"  "body"
tokenize_ngrams("due to their long snout and pig-like body.", n = 2)
## [[1]]
## [1] "due to"     "to their"   "their long" "long snout" "snout and"  "and pig"   
## [7] "pig like"   "like body"
tokenize_ngrams("due to their long snout and pig-like body.", n = 3)
## [[1]]
## [1] "due to their"     "to their long"    "their long snout" "long snout and"  
## [5] "snout and pig"    "and pig like"     "pig like body"
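The idea behind n-grams is a window of width n sliding over the token vector one step at a time. Here is a minimal base R sketch of that idea; `make_ngrams()` is a hypothetical helper written for illustration and is not how {tokenizers} implements `tokenize_ngrams()`.

```r
# Hypothetical helper: slide a window of width n over a character vector of tokens
make_ngrams <- function(tokens, n) {
  # Fewer than n tokens means no complete window fits
  if (length(tokens) < n) {
    return(character(0))
  }
  # One window starts at each of the first length(tokens) - n + 1 positions
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# The word tokens of the example sentence
tokens <- c("due", "to", "their", "long", "snout", "and", "pig", "like", "body")

make_ngrams(tokens, 2)  # 8 bigrams: "due to", "to their", ..., "like body"
make_ngrams(tokens, 3)  # 7 trigrams: "due to their", ..., "pig like body"
```

A vector of k tokens always yields k - n + 1 n-grams, so larger n means fewer but more specific tokens.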
library(stopwords)
stopwords(language = "en", source = "snowball")
##   [1] "i"          "me"         "my"         "myself"     "we"         "our"       
##   [7] "ours"       "ourselves"  "you"        "your"       "yours"      "yourself"  
##  [13] "yourselves" "he"         "him"        "his"        "himself"    "she"       
##  [19] "her"        "hers"       "herself"    "it"         "its"        "itself"    
##  [25] "they"       "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"       "these"     
##  [37] "those"      "am"         "is"         "are"        "was"        "were"      
##  [43] "be"         "been"       "being"      "have"       "has"        "had"       
##  [49] "having"     "do"         "does"       "did"        "doing"      "would"     
##  [55] "should"     "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"       "you've"    
##  [67] "we've"      "they've"    "i'd"        "you'd"      "he'd"       "she'd"     
##  [73] "we'd"       "they'd"     "i'll"       "you'll"     "he'll"      "she'll"    
##  [79] "we'll"      "they'll"    "isn't"      "aren't"     "wasn't"     "weren't"   
##  [85] "hasn't"     "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"      "cannot"    
##  [97] "couldn't"   "mustn't"    "let's"      "that's"     "who's"      "what's"    
## [103] "here's"     "there's"    "when's"     "where's"    "why's"      "how's"     
## [109] "a"          "an"         "the"        "and"        "but"        "if"        
## [115] "or"         "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"      "against"   
## [127] "between"    "into"       "through"    "during"     "before"     "after"     
## [133] "above"      "below"      "to"         "from"       "up"         "down"      
## [139] "in"         "out"        "on"         "off"        "over"       "under"     
## [145] "again"      "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"        "any"       
## [157] "both"       "each"       "few"        "more"       "most"       "other"     
## [163] "some"       "such"       "no"         "nor"        "not"        "only"      
## [169] "own"        "same"       "so"         "than"       "too"        "very"      
## [175] "will"
library(tidytext)
stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # … with 1,139 more rows
Unscramble this pipe to remove stop words and count the remaining words
unnest_tokens(word, text) %>%
count(word, sort = TRUE)
animals %>%
anti_join(stop_words) %>%
animals %>% unnest_tokens(word, text) %>% anti_join(stop_words, by = "word") %>% count(word, sort = TRUE)
## # A tibble: 1,471 x 2
##    word         n
##    <chr>    <int>
##  1 african    278
##  2 elephant    89
##  3 civet       84
##  4 bush        63
##  5 forest      58
##  6 clawed      51
##  7 adelie      47
##  8 palm        46
##  9 frog        42
## 10 penguin     40
## # … with 1,461 more rows
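What anti_join() does here is easiest to see on a handful of tokens. In this sketch, `toy_tokens` and `toy_stop_words` are hypothetical stand-ins for the tokenized animals data and for stop_words.

```r
library(dplyr)

# Hypothetical tokens, one per row, as unnest_tokens() would produce them
toy_tokens <- tibble(
  word = c("the", "aardvark", "is", "a", "nocturnal", "mammal")
)

# Hypothetical three-word stop list standing in for tidytext::stop_words
toy_stop_words <- tibble(word = c("the", "is", "a"))

# anti_join() keeps the rows of toy_tokens with no match in toy_stop_words
kept <- toy_tokens %>%
  anti_join(toy_stop_words, by = "word")

kept$word  # "aardvark" "nocturnal" "mammal"

# The same filtering without a join
kept_filter <- toy_tokens %>%
  filter(!word %in% toy_stop_words$word)
```

Passing `by = "word"` explicitly avoids the message that anti_join() prints when it has to guess the join column.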
"she's" doesn't appear in the SMART list
"fify" was left undetected for 3 years (2012 to 2015) in scikit-learn
"statistically" doesn't appear in the Stopwords ISO list
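Quirks like these can be checked before committing to a stop word list. The sketch below assumes the "smart" and "stopwords-iso" sources of {stopwords} correspond to the SMART and Stopwords ISO lists mentioned here; the words in `words_to_check` are just the examples from this section plus "the", and results may vary with the installed package version.

```r
library(stopwords)

# Three English stop word lists available through {stopwords}
lists <- list(
  snowball = stopwords(language = "en", source = "snowball"),
  smart    = stopwords(language = "en", source = "smart"),
  iso      = stopwords(language = "en", source = "stopwords-iso")
)

# The lists differ considerably in size
lengths(lists)

# Which lists contain each word? One row per word, one column per list
words_to_check <- c("she's", "statistically", "the")
membership <- sapply(lists, function(stop_list) words_to_check %in% stop_list)
rownames(membership) <- words_to_check
membership
```

A quick membership table like this makes it easier to spot surprising omissions before the list is used in anti_join().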
Some words are similar (teacher, teachers, teachings, teach) but will be counted separately
Stemming is the act of removing characters from the end of a word to get the "stem" of the word
This task is HIGHLY language dependent
library(SnowballC)
animals %>% unnest_tokens(word, text) %>% mutate(word_stem = wordStem(word)) %>% select(word, word_stem)
## # A tibble: 10,316 x 2
##    word           word_stem
##    <chr>          <chr>    
##  1 aardvark       aardvark 
##  2 classification classif  
##  3 and            and      
##  4 evolution      evolut   
##  5 aardvarks      aardvark 
##  6 are            ar       
##  7 small          small    
##  8 pig            pig      
##  9 like           like     
## 10 mammals        mammal   
## # … with 10,306 more rows
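wordStem() also works on a plain character vector, which makes it easy to try the "teacher" words and to see the language dependence directly. This is a sketch; the choice of "german" as the contrasting language is only an illustration.

```r
library(SnowballC)

# The similar words from the stemming example, with the default Porter stemmer
words <- c("teacher", "teachers", "teachings", "teach")
data.frame(word = words, stem = wordStem(words))

# Four words taken from the animals output
animal_stems <- wordStem(c("aardvarks", "mammals", "classification", "evolution"))
animal_stems  # "aardvark" "mammal" "classif" "evolut"

# Stemming rules are language specific: list the languages SnowballC supports
getStemLanguages()

# The same token run through the rules of two different languages
wordStem("teachers", language = "english")
wordStem("teachers", language = "german")
```

Stems such as "classif" and "evolut" are not real words, which is fine as long as they are only used as keys for counting.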
library(spacyr)
spacy_parse(animals$text)
## # A tibble: 11,362 x 7
##    doc_id sentence_id token_id token            lemma            pos   entity 
##    <chr>        <int>    <int> <chr>            <chr>            <chr> <chr>  
##  1 text1            1        1 "Aardvark"       "aardvark"       PROPN "ORG_B"
##  2 text1            1        2 "Classification" "classification" PROPN "ORG_I"
##  3 text1            1        3 "and"            "and"            CCONJ "ORG_I"
##  4 text1            1        4 "Evolution"      "evolution"      PROPN "ORG_I"
##  5 text1            1        5 "\n"             "\n"             SPACE "ORG_I"
##  6 text1            1        6 "Aardvarks"      "aardvarks"      PROPN "ORG_I"
##  7 text1            1        7 "are"            "be"             VERB  ""     
##  8 text1            1        8 "small"          "small"          ADJ   ""     
##  9 text1            1        9 "pig"            "pig"            NOUN  ""     
## 10 text1            1       10 "-"              "-"              PUNCT ""     
## # … with 11,352 more rows