class: center, middle, title-slide # Text Mining ## R/Pharma 2020 Text modeling workshop ### Emil Hvitfeldt ### 2020-10-09 --- # {animals} data package https://github.com/emilhvitfeldt/animals Toy dataset of over 500 animals Contains - `text` variable with medium long text descripting the animals - Multiple metrics such as `diet` and `lifestyle` --- # Goal Show how we can turn **text** into **numbers** -- - Tokenization - stop words - n-grams - tf-idf - stemming - spacy --- # Your turn #1 .pull-left[ Explore a couple of `text` descriptions ```r library(tidyverse) library(tidytext) library(animals) ``` replace `___` with an interger ```r animals %>% slice(___) %>% pull(text) ``` ]
02
:
30
--- # Your turn #1 - Result .pull-left[ Explore a couple of `text` descriptions ```r library(tidyverse) library(tidytext) library(animals) ``` replace `___` with an interger ```r animals %>% slice(1) %>% pull(text) ``` ] .pull-right[ ``` ## Aardvark Classification and Evolution ## Aardvarks are small pig-like mammals that are found inhabiting a wide range of different habitats throughout Africa, south of the Sahara. They are mostly solitary and spend their days sleeping in underground burrows to protect them from the heat of the African sun, emerging in the cooler evening to search for food. Their name originates from the Afrikaans language in South Africa and means Earth Pig, due to their long snout and pig-like body. Aardvarks are unique among animals as they are the only surviving species in their animal family. Until recently it was widely believed that they were most closely related to other insectivores such as armadillos and pangolins but this is not the case with their closest living relatives actually thought to be elephants. ## Aardvark Anatomy and Appearance ## Aardvarks have a unique appearance amongst mammals (and indeed all animals) as they display physical characteristics of a number of different animal species. They have medium-sized, almost hairless bodies and long snouts that make them look distinctly pig-like at first, with thick skin that both protects them from the hot sun and also from being harmed by insect bites. They are able to close their nostrils to stop dust and insects from entering their nose. They have tubular, rabbit-like ears that can stand on end but can also be folded flat to prevent dirt from entering them when they are underground. Aardvarks have strong, claws on each of their spade-like feet that along with the fact that their hind legs are longer than their front legs, makes them strong and capable diggers able to excavate vast amounts of earth at an alarming rate. Due to the fact that they spend most of their lives underground or out hunting in the dark at night, they have poor eyesight but are able to easily navigate their surrounding using their excellent sense of smell to both find prey and to sense potential danger. ## Aardvark Distribution and Habitat ## Aardvarks are found in a wide variety of different habitats throughout sub-Saharan Africa from dry deserts to the moist rainforest regions. The only stipulation (other than having good access to plenty of food and water) is to have good soil in which they can dig their extensive burrows. Despite being highly skilled at digging in sandy or clay soil types, rockier regions prove more of a challenge to create their underground homes so the aardvark will move to another area where soil conditions are better suited to digging. Their burrows can be up to 10 meters (33 ft) long in a home range that can be anywhere from 2 to 5 kilometres square. Their burrows often having multiple entrances and are always left head first so they are able to identify potential predators easily using their keen sense of smell. ## Aardvark Behaviour and Lifestyle ## Aardvarks are mainly solitary animals that come together only to mate and are never found in large groups. They live in underground burrows to protect them both from the hot daytime sun and from predators. Aardvarks are nocturnal mammals, only leaving the safety of the burrow under the cover of night when they go in search of food and water, often travelling several miles in order to find the biggest termite mounds guided by their excellent hearing and sense of smell. Despite often having a large burrow comprised of an extensive network of tunnels, aardvarks are also known to be able to quickly excavate small temporary burrows where they can protect themselves quickly rather than having to return to their original dwelling. ## Aardvark Reproduction and Life Cycles ## Aardvarks have specific mating seasons that occur every year. Depending on the region in which the aardvark lives young can be born either in October to November, or May to June in other areas. Known to have babies most years, female aardvarks give birth to a single offspring after a gestation period that usually lasts for around 7 months. Newborn aardvarks often weigh as little as 2kg and are born with hairless, pink skin in the safety of their mother's burrow. Baby aardvarks spend the first two weeks of their lives in the safety of the underground burrow before beginning to venture out with their mother under the cover of night. However, despite accompanying their mother in search of food they aren't weaned until they are around three months old. Young aardvarks live with their mother in her burrow until they are around six months old when they move out to dig a burrow of their own. Although their lifespan in the wild is not entirely clear, aardvarks tend to live for more than 20 years in captivity. ## Aardvark Diet and Prey ## The diet of aardvarks is mainly comprised of ants and termites, with termites being their preferred food source. Despite this though, they are known to also eat other insects such as beetles and insect larvae. Aardvarks are built to be insectivores, with strong limbs and claws that are capable of breaking into the harder outer shell of termite mounds very efficiently. Once they have broken into the mound they then use their long, sticky tongue to harvest the insects inside and eat them whole without chewing as they are then ground down in their muscular stomachs. One of the aardvarks most distinctive features is the fact that they have columnar cheek-teeth that serve no functional purpose at all. With some larger ant species that need to be chewed they use the incisors that are located towards the back of their mouths. Aardvarks are also able to use the same techniques to break into underground ant nests. ## Aardvark Predators and Threats ## Despite the fact that aardvarks are nocturnal animals that live in the safety of underground burrows, they are threatened by a number of different predators throughout their natural environment. Lions, leopards, hyenas and large snakes (most notably pythons) are the main predators of aardvarks but this does vary depending on where the aardvark lives. Their main form of defence is to escape very quickly underground however, they are also known to be quite aggressive when threatened by these larger animals. Aardvarks use their strong, sharp claws to try and injure their attacker along with kicking the threatening animal with their powerful back legs. Aardvarks are also threatened by humans who hunt them and destroy their natural habitats. ## Aardvark Interesting Facts and Features ## Aardvarks use their long, sticky tongue to lap up to 50,000 insects a night from inside termite mounds or underground ant nests. Their worm-like tongues can actually grow up to 30 cm in length meaning they can reach more termites further into the mound. Their love of insects has actually led aardvarks also being known as Antbears! Interestingly enough, aardvarks are also thought to get almost all of the moisture they need from their prey meaning that they actually have to physically drink very little water. Aardvarks are thought to be one of the world's most prolific diggers with their strong limbs and claws and shovel-like feet helping them to be able to shift 2ft of soil in just 15 seconds! ## Aardvark Relationship with Humans ## Due to the fact that they spend the daytime hours hidden in the safety of their underground burrows, only emerging under the cover of night to hunt for food, aardvarks are very seldom seen by many people. In some regions though, they are hunted by people for food and are becoming increasingly affected by expanding human populations as more of their natural habitats disappear to make way for growing settlements. ## Aardvark Conservation Status and Life Today ## Today, aardvarks are listed by the IUCN as a species that is of Least Concern. Despite the fact that population numbers of aardvarks most certainly declined in some countries, in others, their numbers remain stable and they are often commonly found in both protected areas and regions with suitable habitats. They are however becoming increasingly affected by habitat loss in both the form of deforestation and expanding towns and villages. Due to their incredibly elusive nature, exact population sizes are not fully understood. ``` ] --- # Your turn #1 - Result .pull-left[ Explore a couple of `text` descriptions ```r library(tidyverse) library(tidytext) library(animals) ``` replace `___` with an interger ```r animals %>% slice(3) %>% pull(text) ``` ] .pull-right[ ``` ## Adelie Penguin Classification and Evolution ## The Adelie Penguin is the smallest and most widely distributed species of Penguin in the Southern Ocean and is one of only two species of Penguin found on the Antarctic mainland (the other being the much larger Emperor Penguin). The Adelie Penguin was named in 1840 by French explorer Jules Dumont d'Urville who named the Penguin for his wife, Adelie. Adelie Penguins have adapted well to life in the Antarctic as these migratory Birds winter in the northern pack-ice before returning south to the Antarctic coast for the warmer summer months. ## Adelie Penguin Anatomy and Appearance ## The Adelie Penguin is one of the most easily identifiable Penguin species with a blue-black back and completely white chest and belly. The head and beak of the Adelie Penguin are both black, with a distinctive white ring around each eye. The strong, pink feet of the Adelie Penguin are tough and bumpy with nails that not only aid the Adelie Penguin in climbing the rocky cliffs to reach its nesting grounds, but also help to push them along when they are sliding (rowing) along the ice. Adelie Penguins also use their webbed feet along with their small flippers to propel them along when swimming in the cold waters. ## Adelie Penguin Distribution and Habitat ## The Adelie Penguin is one of the southern-most Birds in the world as it is found along the Antarctic coastline and on the islands close to it. During the winter months, the Adelie Penguins migrate north where they inhabit large platforms of ice and have better access to food. During the warmer summer months, the Adelie Penguins return south where they head for the coastal beaches in search of ice-free ground on the rocky slopes where they can build their nests. More than half a million Adelie Penguins have formed one of the largest animal colonies in the world on Ross Island, an island formed by the activities of four monstrous volcanoes in the Ross Sea. ## Adelie Penguin Behaviour and Lifestyle ## Like all species of Penguin, the Adelie Penguin is a highly sociable animal, gathering in large groups known as colonies, which often number thousands of Penguin individuals. Although Adelie Penguins are not known to be terribly territorial, it is not uncommon for adults to become aggressive over nesting sites, and have even been known to steal rocks from the nests of their neighbours. Adelie Penguins are also known to hunt in groups as it is thought to reduce the risk of being eaten by hungry predators. Adelie Penguins are constantly interacting with one another, with body language and specific eye movements thought to be the most common forms of communication. ## Adelie Penguin Reproduction and Life Cycles ## Adelie Penguins return to their breeding grounds during the Antarctic summer months of November and December. Their soft feet are well designed for walking on land making the trek to its nesting ground much easier as the Penguin fasts during this time. Adelie Penguin pairs mate for life in large colonies, with females laying two eggs a couple of days apart into a nest built from rocks. Both the male and female take it in turns to incubate their eggs while the other goes off to feed, for up to 10 days at a time. The Adelie Penguin chicks have an egg-tooth which is a bump on the top of their beaks, which helps them to break out of the egg. Once hatched, the parents still take it in turns to look after their young while the other goes off to gather food. After about a month, the chicks congregate in groups called crèches and are able to fend for themselves at sea when they are between 2 and 3 months old. ## Adelie Penguin Diet and Prey ## Adelie Penguins are strong and capable swimmers, obtaining all of their food from the sea. These Penguins primarily feed on krill which are found throughout the Antarctic ocean, as well as Molluscs, Squid and small Fish. The record of fossilised eggshell accumulated in the Adelie Penguin colonies over the last 38,000 years reveals a sudden change from a Fish-based diet to Krill that started two hundred years ago. This is thought to be due to the decline of the Antarctic Fur Seal in the late 1700s and Baleen Whales in the twentieth century. The reduction of competition from these predators has resulted in there being an abundance of Krill, which the Adelie Penguins are now able to exploit as an easier source of food. ## Adelie Penguin Predators and Threats ## Adult Adelie Penguins have no land based predators due to the uncompromising conditions that they inhabit. In the water however, the biggest threat to the Adelie Penguin is the Leopard Seal, which is one of the southern-most species of Seal and a dominant predator in the Southern Ocean. These Penguins have learnt to avoid these predators by swimming in large groups and not walking on thin ice. The Killer Whale is the other main predator of the Adelie Penguin, although they normally hunt larger species of Penguin further north. South Polar Skuas are known to prey on the Adelie Penguin's eggs if left unguarded, along with chicks that have strayed from a group. ## Adelie Penguin Interesting Facts and Features ## Adelie Penguins inhabit one of the coldest environments on Earth and so have a thick layer of fat under their skin helping to keep them warm. Their feathers help to insulate them and provide a waterproof layer for extra protection. The Adelie Penguin is a highly efficient hunter and is able to eat up to 2kg of food per day, with a breeding colony thought to consume around 9,000 tonnes of food over 24 hours. The flippers of the Adelie Penguin make them fantastic at swimming and they can dive to depths of 175 meters in search of food. Adelie Penguins do not have teeth as such but instead have tooth-shaped barbs on their tongue and on the roof of their mouths. These barbs do not exist for chewing but instead assist the Penguin to swallow slippery prey. ## Adelie Penguin Relationship with Humans ## A visit to the Adelie Penguin colonies has long since been on the programme for tourists to the Antarctic, who marvel at the vast numbers of them nesting on the beaches and hunting in the surrounding waters. This has meant that Adelie Penguins are one of the most well-known of all Penguin species today. Early explorers however, also hunted the Penguins both for their meat and their eggs in order to survive in such uncompromising conditions. ## Adelie Penguin Conservation Status and Life Today ## Despite having been confined to living on coastal Antarctica, Adelie Penguins are one of the most common and widespread Penguins in the southern hemisphere. With more than 2.5 million breeding pairs found throughout southern Antarctica, the Adelie Penguin has adapted well to its polar habitat. Scientists have also been known to use Adelie Penguin nesting patterns as indicators of climate change, noticing that they are able to nest on beaches that were previously covered in ice. The Adelie Penguin is listed as Least Concern. ## ``` ] --- class: middle, center, inverse # 💫 TOKENIZATION 💫 --- # Tokenization - The process of splitting text in smaller pieces of text (_tokens_) -- - Most common token == word, but sometimes we tokenize in a different way -- - An essential part of most text analyses -- - Many options to take into consideration --- # Tokenization We can use `unnest_tokens()` from {tidytext} to turn `text` into `word`s .pull-left[ ```r animals %>% select(text) %>% unnest_tokens(word, text) ``` ] .pull-right[ ``` ## # A tibble: 10,316 x 1 ## word ## <chr> ## 1 aardvark ## 2 classification ## 3 and ## 4 evolution ## 5 aardvarks ## 6 are ## 7 small ## 8 pig ## 9 like ## 10 mammals ## # … with 10,306 more rows ``` ] --- # Your turn #2 We can look at the most frequent tokens by using `count()` ```r animals %>% unnest_tokens(output = ___, input = text) %>% count(___, sort = TRUE) ```
02
:
30
--- # Your turn #2 - results We can look at the most frequent tokens by using `count()` ```r animals %>% unnest_tokens(output = word, input = text) ``` ``` ## # A tibble: 10,316 x 4 ## diet lifestyle mean_weight word ## <chr> <chr> <dbl> <chr> ## 1 Omnivore Nocturnal 70 aardvark ## 2 Omnivore Nocturnal 70 classification ## 3 Omnivore Nocturnal 70 and ## 4 Omnivore Nocturnal 70 evolution ## 5 Omnivore Nocturnal 70 aardvarks ## 6 Omnivore Nocturnal 70 are ## 7 Omnivore Nocturnal 70 small ## 8 Omnivore Nocturnal 70 pig ## 9 Omnivore Nocturnal 70 like ## 10 Omnivore Nocturnal 70 mammals ## # … with 10,306 more rows ``` --- # Your turn #2 - results We can look at the most frequent tokens by using `count()` ```r animals %>% unnest_tokens(output = word, input = text) %>% count(word, sort = TRUE) ``` ``` ## # A tibble: 1,774 x 2 ## word n ## <chr> <int> ## 1 the 682 ## 2 and 370 ## 3 of 336 ## 4 to 329 ## 5 african 278 ## 6 in 220 ## 7 are 166 ## 8 is 164 ## 9 a 150 ## 10 their 137 ## # … with 1,764 more rows ``` --- # Tokenization: whitespace ```r token_example ``` ``` ## [1] "Their name originates from the Afrikaans language in South Africa and means Earth Pig, due to their long snout and pig-like body." ``` -- ```r strsplit(token_example, "\\s") ``` ``` ## [[1]] ## [1] "Their" "name" "originates" "from" "the" "Afrikaans" ## [7] "language" "in" "South" "Africa" "and" "means" ## [13] "Earth" "Pig," "due" "to" "their" "long" ## [19] "snout" "and" "pig-like" "body." ``` --- # Tokenization: [tokenizers](https://docs.ropensci.org/tokenizers/) package ```r token_example ``` ``` ## [1] "Their name originates from the Afrikaans language in South Africa and means Earth Pig, due to their long snout and pig-like body." ``` ```r library(tokenizers) tokenize_words(token_example) ``` ``` ## [[1]] ## [1] "their" "name" "originates" "from" "the" "afrikaans" ## [7] "language" "in" "south" "africa" "and" "means" ## [13] "earth" "pig" "due" "to" "their" "long" ## [19] "snout" "and" "pig" "like" "body" ``` --- # Tokenization: [spaCy](https://spacy.io/) library ```r token_example ``` ``` ## [1] "Their name originates from the Afrikaans language in South Africa and means Earth Pig, due to their long snout and pig-like body." ``` ```r library(spacyr) spacy_tokenize(token_example) ``` ``` ## $text1 ## [1] "Their" "name" "originates" "from" "the" "Afrikaans" ## [7] "language" "in" "South" "Africa" "and" "means" ## [13] "Earth" "Pig" "," "due" "to" "their" ## [19] "long" "snout" "and" "pig" "-" "like" ## [25] "body" "." ``` --- .pull-left[ # whitespace ``` ## [[1]] ## [1] "Their" "name" "originates" "from" "the" "Afrikaans" ## [7] "language" "in" "South" "Africa" "and" "means" ## [13] "Earth" "Pig," "due" "to" "their" "long" ## [19] "snout" "and" "pig-like" "body." ``` ] .pull-right[ # spaCy library ``` ## $text1 ## [1] "Their" "name" "originates" "from" "the" "Afrikaans" ## [7] "language" "in" "South" "Africa" "and" "means" ## [13] "Earth" "Pig" "," "due" "to" "their" ## [19] "long" "snout" "and" "pig" "-" "like" ## [25] "body" "." ``` ] --- # Tokenization considerations - Should we turn UPPERCASE letters to lowercase? -- - How should we handle punctuation⁉️ -- - What about non-word characters _inside_ words? -- - Should compound words be split or multi-word ideas be kept together? --- class: inverse, right, middle ## Tokenization for English text is typically **much easier** than other languages. --- # N-grams ## A sequence of `n` sequential tokens -- - Captures words that appear together often -- - Can detect negations ("not happy") -- - Larger cardinality --- ### N-grams for n = 1, 2, 3 ```r tokenize_ngrams("due to their long snout and pig-like body.", n = 1) ``` ``` ## [[1]] ## [1] "due" "to" "their" "long" "snout" "and" "pig" "like" "body" ``` ```r tokenize_ngrams("due to their long snout and pig-like body.", n = 2) ``` ``` ## [[1]] ## [1] "due to" "to their" "their long" "long snout" "snout and" "and pig" ## [7] "pig like" "like body" ``` ```r tokenize_ngrams("due to their long snout and pig-like body.", n = 3) ``` ``` ## [[1]] ## [1] "due to their" "to their long" "their long snout" "long snout and" ## [5] "snout and pig" "and pig like" "pig like body" ``` --- class: middle # Tokenization See [Chapter 2](https://smltar.com/tokenization.html) for more! --- class: inverse, center, middle # 🛑 STOP WORDS 🛑 --- # Stop words ```r library(stopwords) stopwords(language = "en", source = "snowball") ``` ``` ## [1] "i" "me" "my" "myself" "we" "our" ## [7] "ours" "ourselves" "you" "your" "yours" "yourself" ## [13] "yourselves" "he" "him" "his" "himself" "she" ## [19] "her" "hers" "herself" "it" "its" "itself" ## [25] "they" "them" "their" "theirs" "themselves" "what" ## [31] "which" "who" "whom" "this" "that" "these" ## [37] "those" "am" "is" "are" "was" "were" ## [43] "be" "been" "being" "have" "has" "had" ## [49] "having" "do" "does" "did" "doing" "would" ## [55] "should" "could" "ought" "i'm" "you're" "he's" ## [61] "she's" "it's" "we're" "they're" "i've" "you've" ## [67] "we've" "they've" "i'd" "you'd" "he'd" "she'd" ## [73] "we'd" "they'd" "i'll" "you'll" "he'll" "she'll" ## [79] "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't" ## [85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" ## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot" ## [97] "couldn't" "mustn't" "let's" "that's" "who's" "what's" ## [103] "here's" "there's" "when's" "where's" "why's" "how's" ## [109] "a" "an" "the" "and" "but" "if" ## [115] "or" "because" "as" "until" "while" "of" ## [121] "at" "by" "for" "with" "about" "against" ## [127] "between" "into" "through" "during" "before" "after" ## [133] "above" "below" "to" "from" "up" "down" ## [139] "in" "out" "on" "off" "over" "under" ## [145] "again" "further" "then" "once" "here" "there" ## [151] "when" "where" "why" "how" "all" "any" ## [157] "both" "each" "few" "more" "most" "other" ## [163] "some" "such" "no" "nor" "not" "only" ## [169] "own" "same" "so" "than" "too" "very" ## [175] "will" ``` --- # Stop words ```r library(tidytext) stop_words ``` ``` ## # A tibble: 1,149 x 2 ## word lexicon ## <chr> <chr> ## 1 a SMART ## 2 a's SMART ## 3 able SMART ## 4 about SMART ## 5 above SMART ## 6 according SMART ## 7 accordingly SMART ## 8 across SMART ## 9 actually SMART ## 10 after SMART ## # … with 1,139 more rows ``` --- # Your turn #3 Unscramble this pipe to - tokenize text tokens - remove stop words - count most frequent tokens ```r unnest_tokens(word, text) %>% count(word, sort = TRUE) animals %>% anti_join(stop_words) %>% ```
02
:
30
--- # Your turn #3 - result .pull-left[ ```r animals %>% unnest_tokens(word, text) %>% anti_join(stop_words, by = "word") %>% count(word, sort = TRUE) ``` ] .pull-right[ ``` ## # A tibble: 1,471 x 2 ## word n ## <chr> <int> ## 1 african 278 ## 2 elephant 89 ## 3 civet 84 ## 4 bush 63 ## 5 forest 58 ## 6 clawed 51 ## 7 adelie 47 ## 8 palm 46 ## 9 frog 42 ## 10 penguin 40 ## # … with 1,461 more rows ``` ] --- # funky stop words quiz #1 .pull-left[ - he's - she's - himself - herself ]
00
:
30
--- # funky stop words quiz #1 .pull-left[ - he's - .orange[she's] - himself - herself ] .pull-right[ .orange[she's] doesn't appear in the SMART list ] --- # funky stop words quiz #2 .pull-left[ - owl - bee - fify - system1 ]
00
:
30
--- # funky stop words quiz #2 .pull-left[ - owl - bee - .orange[fify] - system1 ] .pull-right[ .orange[fify] was left undetected for 3 years (2012 to 2015) in scikit-learn ] --- # funky stop words quiz #3 .pull-left[ - substantially - successfully - sufficiently - statistically ]
00
:
30
--- # funky stop words quiz #3 .pull-left[ - substantially - successfully - sufficiently - .orange[statistically] ] .pull-right[ .orange[statistically] doesn't appear in the Stopwords ISO list ] --- class: middle # Stop words -- - Stop words are context specific -- - Stop word lexicons can have bias -- - You can create your own stop word list --- class: inverse, center, middle # 🛑 LOOK AT YOUR STOP WORDS 🛑 --- class: middle ### See [Chapter 3](https://smltar.com/stopwords.html) for more! 🛑 --- class: inverse, center, middle # 🌷 STEMMING 🌷 --- # Stemming -- Some words are .orange[similar] (teacher, teachers, teachings, teach) but will be counted .orange[separately] -- Stemming is the act of removing characters from the end of a word to get the "stem" of the word -- This task is HIGHLY language dependent --- # Stemming .pull-left[ ```r library(SnowballC) animals %>% unnest_tokens(word, text) %>% mutate(word_stem = wordStem(word)) %>% select(word, word_stem) ``` ] .pull-right[ ``` ## # A tibble: 10,316 x 2 ## word word_stem ## <chr> <chr> ## 1 aardvark aardvark ## 2 classification classif ## 3 and and ## 4 evolution evolut ## 5 aardvarks aardvark ## 6 are ar ## 7 small small ## 8 pig pig ## 9 like like ## 10 mammals mammal ## # … with 10,306 more rows ``` ] --- class: middle ### See [Chapter 4](https://smltar.com/stemming.html) for more! 🌷 --- ## Spacy - more advanced preprocessing ```r library(spacyr) spacy_parse(animals$text) ``` ``` ## # A tibble: 11,362 x 7 ## doc_id sentence_id token_id token lemma pos entity ## <chr> <int> <int> <chr> <chr> <chr> <chr> ## 1 text1 1 1 "Aardvark" "aardvark" PROPN "ORG_B" ## 2 text1 1 2 "Classification" "classification" PROPN "ORG_I" ## 3 text1 1 3 "and" "and" CCONJ "ORG_I" ## 4 text1 1 4 "Evolution" "evolution" PROPN "ORG_I" ## 5 text1 1 5 "\n" "\n" SPACE "ORG_I" ## 6 text1 1 6 "Aardvarks" "aardvarks" PROPN "ORG_I" ## 7 text1 1 7 "are" "be" VERB "" ## 8 text1 1 8 "small" "small" ADJ "" ## 9 text1 1 9 "pig" "pig" NOUN "" ## 10 text1 1 10 "-" "-" PUNCT "" ## # … with 11,352 more rows ```