class: center, middle, title-slide # Modeling ## R/Pharma 2020 Text modeling workshop ### Emil Hvitfeldt ### 2020-10-09 --- # Text as data Let's take a look at the data again ```r library(tidyverse) library(animals) glimpse(animals) ``` ``` ## Rows: 610 ## Columns: 4 ## $ text <chr> "Aardvark Classification and Evolution\nAardvarks are sma… ## $ diet <chr> "Omnivore", "Unknown", "Carnivore", "Unknown", "Unknown",… ## $ lifestyle <chr> "Nocturnal", NA, "Diurnal", NA, NA, "Diurnal", "Nocturnal… ## $ mean_weight <dbl> 70.0000, NA, 4.5000, NA, NA, 4500.0000, 2.9500, 0.1225, 1… ``` --- # Text as data ```r animals %>% sample_n(1) %>% pull(text) ``` ``` ## [1] "The guppy (also known as the millionfish) is a small colourful species of freshwater tropical fish that is found naturally in the rivers and lakes of South America. There are nearly 300 different types of guppy spread throughout Barbados, Brazil, Guyana, Netherlands Antilles, Trinidad and Tobago, and Venezuela.\nThe guppy is one of the most popular types of aquarium tropical fish in the world as they are small, colourful and easier to keep than many other species of fish. The guppy generally lives from 3 to 5 years old in captivity and slightly less in the wild.\nThe guppy has been introduced to most other countries mainly as a method of mosquito prevention as the guppy eats the mosquito larva before they are able to fly, therefore slowing down the spread of malaria.\nThe guppy is an extremely colourful fish and often displays elaborate patterns on its tail fin. The female guppy and the male guppy can be identified quite easily as the female guppy has a small, patterned tail where the tail of the male guppy is much longer and generally has fewer markings. The female guppy also tends to be larger in size than the male guppy.\nThe guppy gives birth to live young, meaning that the eggs are first incubated inside the female guppy and hatch there too. The incubation period of the guppy is about a month after which the female guppy can give birth to up to 100 baby guppies, which are called fry. As soon as they are born, the guppy fry are able to eat and swim around freely. The guppy fry are also able to sense and avoid danger which is important when around older guppies as they often eat the fry. The guppy fry have matured in adult guppies within a couple of months.\nAfter mating just once with a male guppy, the female guppy is able to give birth numerous times. The female guppy stores the sperm of the male guppy inside her and just hours after giving birth to her fry, the female guppy is ready to become pregnant again and will do so using the stored sperm (hence why the guppy is often called the millionfish).\nThe guppy is an omnivorous animal and eats a wide range of organic matter that is available in the water. The guppy mainly feeds on algae and brine shrimp, and often eat particles of food from the water that have been left by a larger fish.\nThe guppy has many natural predators in the wild (and in tanks) mainly due to their small size and their elaborate fins often attract unwanted attention. Birds such as kingfishers and larger fish are the primary predators of the guppy, so naturally, guppies that are kept in a tank should be kept with other very small fish to prevent them from being eaten." 
``` --- # Text as data - Text like this can be used for **supervised** or **predictive** modeling -- - We can build both regression and classification models with text data -- - We can use the ways language exhibits organization to create features for modeling --- # Modeling Packages ```r library(tidymodels) library(textrecipes) ``` - [tidymodels](https://www.tidymodels.org/) is a collection of packages for modeling and machine learning using tidyverse principles - [textrecipes](https://textrecipes.tidymodels.org/) extends the recipes package to handle text preprocessing --- # Modeling workflow ![](images/tidymodels-textrecipes.png) --- .pull-left[ # Modeling task You are a new zookeeper and your job is to figure out what to feed each animal based on a written description and some other metrics that are available to you ] -- .pull-right[ ## Question Is this a realistic problem to have? Would it be a good use case for machine learning?
] --- # Notes - This could be better handled by an expert - Disastrous results with misclassifications - Overlap between classes - Has an "unknown" field - Uses both numeric/factor variables and text variables --- # Class imbalance <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="700px" style="display: block; margin: auto;" /> --- class: inverse, right, middle # Let's approach this as a **multiclass classification task** --- # Data splitting The testing set is a precious resource which can be used only once <img src="index_files/figure-html/all-split-1.png" width="700px" style="display: block; margin: auto;" /> --- # Data splitting ## With {rsample} ```r set.seed(1234) animals_split <- initial_split(animals) animals_training <- training(animals_split) animals_testing <- testing(animals_split) ``` --- # Your turn #4 Specify the stratification variable `diet` and extract the `training()` and `testing()` datasets ```r set.seed(1234) animals_split <- initial_split(animals, strata = ___) animals_split animals_training <- ___(animals_split) animals_testing <- ___(animals_split) ``` --- # Your turn #4 - results Setting `strata = diet` makes sure that the proportions of `diet` are preserved in the split ```r set.seed(1234) animals_split <- initial_split(animals, strata = diet) animals_split ``` ``` ## <Analysis/Assess/Total> ## <459/151/610> ``` ```r animals_training <- training(animals_split) animals_testing <- testing(animals_split) ``` --- class: inverse # What mistake have we made already?
--- class: inverse # What mistake have we made already? ### We did EDA on the whole dataset ### By not restricting it to the training set => data leakage --- # Feature selection checklist -- - Is it ethical to use this variable? (or even legal?) -- - Will this variable be available at prediction time? -- - Does this variable contribute to explainability? --- ## Our Variables Response: `diet` Categorical: `lifestyle` Numeric: `mean_weight` Text: `text` --- # Categorical: lifestyle -- ```r animals_training %>% count(lifestyle) %>% ggplot(aes(lifestyle, n)) + geom_col() ``` <img src="index_files/figure-html/unnamed-chunk-13-1.png" width="700px" style="display: block; margin: auto;" /> --- # How should we deal with lifestyle? - Missing values - Many categories
--- class: center # {recipes} Flexible and reproducible preprocessing framework ![:scale 30%](images/recipes.png) --- class: middle # How to build a recipe 1. Start the recipe() 2. Define the variables involved 3. Describe preprocessing step-by-step --- # recipe() Creates a recipe for a set of variables ```r recipe(response ~ ., data = data_set) ``` --- # recipe() Creates a recipe for a set of variables ```r recipe(diet ~ ., data = animals_training) ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ``` --- # step_*() Complete list at https://recipes.tidymodels.org/reference/ --- # lifestyle steps `step_unknown()` will replace missing values with an `unknown` level ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% prep() rec_spec ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Training data contained 459 data points and 163 incomplete rows. ## ## Operations: ## ## Unknown factor level assignment for lifestyle [trained] ``` --- # lifestyle steps `step_other()` will pool together low-frequency values --- ## Your turn #5 Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(___, threshold = ___) rec_spec %>% tidy(2) ```
--- ## Your turn #5 - result Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.05) rec_spec %>% prep() %>% tidy() ``` ``` ## # A tibble: 2 x 6 ## number operation type trained skip id ## <int> <chr> <chr> <lgl> <lgl> <chr> ## 1 1 step unknown TRUE FALSE unknown_G81f6 ## 2 2 step other TRUE FALSE other_cJJkI ``` --- ## Your turn #5 - result Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.05) rec_spec %>% prep() %>% tidy(2) ``` ``` ## # A tibble: 4 x 3 ## terms retained id ## <chr> <chr> <chr> ## 1 lifestyle Diurnal other_4BR8s ## 2 lifestyle Herd other_4BR8s ## 3 lifestyle Solitary other_4BR8s ## 4 lifestyle unknown other_4BR8s ``` --- ## Your turn #5 - result Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.1) rec_spec %>% prep() %>% tidy(2) ``` ``` ## # A tibble: 2 x 3 ## terms retained id ## <chr> <chr> <chr> ## 1 lifestyle Solitary other_EG0Sj ## 2 lifestyle unknown other_EG0Sj ``` --- ## Your turn #5 - result Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.01) rec_spec %>% prep() %>% tidy(2) ``` ``` ## # A tibble: 11 x 3 ## terms retained id ## <chr> <chr> <chr> ## 1 lifestyle Colony other_wEbxL ## 2 lifestyle Crepuscular other_wEbxL ## 3 lifestyle Diurnal other_wEbxL ## 4 lifestyle Flock other_wEbxL ## 5 lifestyle Group other_wEbxL ## 6 lifestyle Herd other_wEbxL ## 7 lifestyle Nocturnal other_wEbxL ## 8 lifestyle Pack other_wEbxL ## 9 lifestyle Solitary other_wEbxL ## 10 lifestyle Troop other_wEbxL ## 11 lifestyle unknown other_wEbxL ``` --- ## Dummifying Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.01) %>% step_dummy(lifestyle) rec_spec ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Unknown factor level assignment for lifestyle ## Collapsing factor levels for lifestyle ## Dummy variables from lifestyle ``` --- # Numeric - mean_weight ```r animals_training %>% ggplot(aes(mean_weight)) + geom_histogram() ``` <img src="index_files/figure-html/unnamed-chunk-25-1.png" width="700px" style="display: block; margin: auto;" /> --- # Numeric - mean_weight ```r animals_training %>% ggplot(aes(mean_weight)) + geom_histogram() + scale_x_log10() ``` <img src="index_files/figure-html/unnamed-chunk-26-1.png" width="700px" style="display: block; margin: auto;" /> --- # Logging and mean imputation ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.01) %>% step_dummy(lifestyle) %>% step_log(mean_weight) %>% step_meanimpute(mean_weight) ``` --- class: middle # Text preprocessing workflow - turn text into tokens - modify/filter tokens - count tokens --- ## Text preprocessing workflow - tokenize .pull-left[ ```r recipe(diet ~ ., data = animals_training) %>% # Tokenize to words step_tokenize(text) ``` ] .pull-right[ ``` ## Data Recipe ## 
## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Tokenization for text ``` ] --- ## Text preprocessing workflow - modify tokens .pull-left[ ```r recipe(diet ~ ., data = animals_training) %>% # Tokenize to words step_tokenize(text) %>% # Remove stopwords step_stopwords(text) %>% # Remove less frequent words step_tokenfilter(text, max_tokens = 100) ``` ] .pull-right[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Tokenization for text ## Stop word removal for text ## Text filtering for text ``` ] --- ## Text preprocessing workflow - count tokens .pull-left[ ```r recipe(diet ~ ., data = animals_training) %>% # Tokenize to words step_tokenize(text) %>% # Remove stopwords step_stopwords(text) %>% # Remove less frequent words step_tokenfilter(text, max_tokens = 100) %>% # Calculate term frequencies step_tf(text) ``` ] .pull-right[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Tokenization for text ## Stop word removal for text ## Text filtering for text ## Term frequency with text ``` ] --- ### Your turn #6 play around with the arguments in step_tokenfilter() and see what results we get ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ```
--- ### Your turn #6 - result play around with the arguments in step_tokenfilter() and see what results we get ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ``` --- ### Your turn #6 - result ``` ## # A tibble: 459 x 103 ## lifestyle mean_weight diet tf_text_able tf_text_africa tf_text_african ## <fct> <dbl> <fct> <dbl> <dbl> <dbl> ## 1 Nocturnal 70 Omni… 7 3 1 ## 2 <NA> NA Unkn… 0 1 1 ## 3 Diurnal 4.5 Carn… 4 0 0 ## 4 <NA> NA Unkn… 2 0 0 ## 5 Diurnal 4500 Herb… 3 6 67 ## 6 Nocturnal 2.95 Omni… 1 2 55 ## 7 Diurnal 1950 Herb… 3 3 59 ## 8 Crepuscu… 2.95 Omni… 2 3 44 ## 9 Diurnal 3.5 Carn… 0 3 47 ## 10 Crepuscu… 26.5 Carn… 1 4 51 ## # … with 449 more rows, and 97 more variables: tf_text_along <dbl>, ## # tf_text_also <dbl>, tf_text_although <dbl>, tf_text_animal <dbl>, ## # tf_text_animals <dbl>, tf_text_appearance <dbl>, tf_text_areas <dbl>, ## # tf_text_around <dbl>, tf_text_bear <dbl>, tf_text_birds <dbl>, ## # tf_text_birth <dbl>, tf_text_black <dbl>, tf_text_body <dbl>, ## # tf_text_breed <dbl>, tf_text_can <dbl>, tf_text_common <dbl>, ## # tf_text_despite <dbl>, tf_text_diet <dbl>, tf_text_different <dbl>, ## # tf_text_dog <dbl>, tf_text_due <dbl>, tf_text_eat <dbl>, ## # tf_text_eggs <dbl>, tf_text_elephant <dbl>, tf_text_even <dbl>, ## # tf_text_fact <dbl>, tf_text_feet <dbl>, tf_text_female <dbl>, ## # tf_text_females <dbl>, tf_text_fish <dbl>, tf_text_food <dbl>, ## # tf_text_found <dbl>, tf_text_fur <dbl>, tf_text_generally <dbl>, ## # tf_text_habitat <dbl>, tf_text_however <dbl>, tf_text_human <dbl>, ## # tf_text_humans <dbl>, tf_text_hunt <dbl>, tf_text_hunting <dbl>, ## # tf_text_including <dbl>, tf_text_insects <dbl>, tf_text_just <dbl>, ## # tf_text_known <dbl>, tf_text_large <dbl>, tf_text_larger <dbl>, ## # tf_text_life <dbl>, tf_text_like <dbl>, tf_text_live <dbl>, ## # tf_text_long <dbl>, tf_text_make <dbl>, tf_text_male <dbl>, ## # tf_text_males <dbl>, tf_text_many <dbl>, tf_text_may <dbl>, ## # tf_text_means <dbl>, tf_text_monkey <dbl>, tf_text_months <dbl>, ## # tf_text_much <dbl>, tf_text_name <dbl>, tf_text_natural <dbl>, ## # tf_text_number <dbl>, tf_text_often <dbl>, tf_text_old <dbl>, ## # tf_text_one <dbl>, tf_text_penguin <dbl>, tf_text_people <dbl>, ## # tf_text_population <dbl>, tf_text_populations <dbl>, ## # tf_text_predators <dbl>, tf_text_prey <dbl>, tf_text_range <dbl>, ## # tf_text_rhino <dbl>, tf_text_sea <dbl>, tf_text_size <dbl>, ## # tf_text_small <dbl>, tf_text_smaller <dbl>, tf_text_south <dbl>, ## # tf_text_species <dbl>, tf_text_tend <dbl>, tf_text_thought <dbl>, ## # tf_text_three <dbl>, tf_text_throughout <dbl>, tf_text_tiger <dbl>, ## # tf_text_time <dbl>, tf_text_today <dbl>, tf_text_trees <dbl>, ## # tf_text_two <dbl>, tf_text_usually <dbl>, tf_text_water <dbl>, ## # tf_text_well <dbl>, tf_text_white <dbl>, tf_text_wild <dbl>, ## # tf_text_world <dbl>, tf_text_year <dbl>, tf_text_years <dbl>, ## # tf_text_young <dbl> ``` --- ### Your turn #6 - result If we don't filter the tokens then we get a very large number of columns ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% #step_tokenfilter(text, max_tokens = 100) %>% step_tf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ``` --- ### Your turn #6 - result ``` ## # A tibble: 459 x 14,347 ## lifestyle mean_weight diet tf_text_0.2 tf_text_0.24 tf_text_0.5 ## <fct> 
<dbl> <fct> <dbl> <dbl> <dbl> ## 1 Nocturnal 70 Omni… 0 0 0 ## 2 <NA> NA Unkn… 0 0 0 ## 3 Diurnal 4.5 Carn… 0 0 0 ## 4 <NA> NA Unkn… 0 0 0 ## 5 Diurnal 4500 Herb… 0 0 0 ## 6 Nocturnal 2.95 Omni… 0 0 0 ## 7 Diurnal 1950 Herb… 0 0 0 ## 8 Crepuscu… 2.95 Omni… 0 0 0 ## 9 Diurnal 3.5 Carn… 0 0 0 ## 10 Crepuscu… 26.5 Carn… 0 0 0 ## # … with 449 more rows, and 14,341 more variables: tf_text_0.5cm <dbl>, ## # tf_text_0.6 <dbl>, tf_text_0.7 <dbl>, tf_text_0.88 <dbl>, ## # tf_text_0.9 <dbl>, tf_text_011 <dbl>, tf_text_014 <dbl>, ## # tf_text_053 <dbl>, tf_text_07 <dbl>, tf_text_071 <dbl>, ## # tf_text_1 <dbl>, `tf_text_1,000` <dbl>, `tf_text_1,000,000` <dbl>, ## # `tf_text_1,000kg` <dbl>, `tf_text_1,000km` <dbl>, ## # `tf_text_1,037` <dbl>, `tf_text_1,100` <dbl>, `tf_text_1,175` <dbl>, ## # `tf_text_1,200` <dbl>, `tf_text_1,200,000` <dbl>, ## # `tf_text_1,215` <dbl>, `tf_text_1,300` <dbl>, `tf_text_1,300m` <dbl>, ## # `tf_text_1,360` <dbl>, `tf_text_1,400` <dbl>, `tf_text_1,440` <dbl>, ## # `tf_text_1,500` <dbl>, `tf_text_1,500m` <dbl>, `tf_text_1,600` <dbl>, ## # `tf_text_1,700` <dbl>, `tf_text_1,760` <dbl>, `tf_text_1,800` <dbl>, ## # tf_text_1.1 <dbl>, tf_text_1.17 <dbl>, tf_text_1.2 <dbl>, ## # tf_text_1.3 <dbl>, tf_text_1.4 <dbl>, tf_text_1.5 <dbl>, ## # tf_text_1.5cm <dbl>, tf_text_1.5kg <dbl>, tf_text_1.5m <dbl>, ## # tf_text_1.6 <dbl>, tf_text_1.7 <dbl>, tf_text_1.8 <dbl>, ## # tf_text_1.9 <dbl>, tf_text_10 <dbl>, `tf_text_10,000` <dbl>, ## # tf_text_10.4 <dbl>, tf_text_10.5 <dbl>, tf_text_100 <dbl>, ## # `tf_text_100,000` <dbl>, tf_text_1000 <dbl>, tf_text_100cm <dbl>, ## # tf_text_100ft <dbl>, tf_text_100kg <dbl>, tf_text_100km <dbl>, ## # tf_text_100m <dbl>, tf_text_102 <dbl>, tf_text_103 <dbl>, ## # tf_text_104 <dbl>, tf_text_105 <dbl>, tf_text_107 <dbl>, ## # tf_text_10cm <dbl>, tf_text_10kg <dbl>, tf_text_10km <dbl>, ## # tf_text_11 <dbl>, `tf_text_11,000` <dbl>, tf_text_110 <dbl>, ## # `tf_text_110,000` <dbl>, tf_text_110cm <dbl>, tf_text_112 <dbl>, ## # tf_text_114 <dbl>, tf_text_116 <dbl>, tf_text_119 <dbl>, ## # tf_text_11mph <dbl>, tf_text_12 <dbl>, `tf_text_12,000` <dbl>, ## # tf_text_12.6 <dbl>, tf_text_120 <dbl>, `tf_text_120,000` <dbl>, ## # tf_text_1200 <dbl>, tf_text_120g <dbl>, tf_text_124 <dbl>, ## # tf_text_124.2 <dbl>, tf_text_125 <dbl>, tf_text_126 <dbl>, ## # `tf_text_127,000` <dbl>, tf_text_129 <dbl>, tf_text_12th <dbl>, ## # tf_text_13 <dbl>, `tf_text_13,000` <dbl>, `tf_text_13,800` <dbl>, ## # tf_text_13.2 <dbl>, tf_text_13.5 <dbl>, tf_text_130 <dbl>, ## # `tf_text_130,000` <dbl>, tf_text_1300 <dbl>, tf_text_132 <dbl>, ## # tf_text_135 <dbl>, tf_text_13cm <dbl>, … ``` --- ### Your turn #7 Swap `step_tf()` with `step_tfidf()` and see the change ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ```
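---

### Aside: what does tf-idf do?

Term frequency alone treats every word the same, so words that appear in almost every description carry as much weight as words specific to a few animals. tf-idf down-weights the former and up-weights the latter. Below is a minimal sketch of the idea in base R on a made-up document-term matrix; textrecipes' exact weighting and smoothing may differ slightly.

```r
# Toy document-term counts: 3 documents, 3 terms (made-up numbers)
counts <- matrix(
  c(2, 0, 1,
    0, 1, 1,
    3, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("guppy", "tiger", "animal"))
)

tf    <- counts / rowSums(counts)                 # term frequency within each document
idf   <- log(nrow(counts) / colSums(counts > 0))  # rarer terms get a larger weight
tfidf <- sweep(tf, 2, idf, `*`)                   # "animal" occurs in every document, so its idf (and tf-idf) is 0

tfidf
```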
--- ### Your turn #7 - result Swap `step_tf()` with `step_tfidf()` and see the change ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tfidf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ``` --- ### Your turn #7 - result ``` ## # A tibble: 459 x 103 ## lifestyle mean_weight diet tfidf_text_able tfidf_text_afri… ## <fct> <dbl> <fct> <dbl> <dbl> ## 1 Nocturnal 70 Omni… 0.0433 0.0252 ## 2 <NA> NA Unkn… 0 0.0310 ## 3 Diurnal 4.5 Carn… 0.0249 0 ## 4 <NA> NA Unkn… 0.0329 0 ## 5 Diurnal 4500 Herb… 0.0127 0.0344 ## 6 Nocturnal 2.95 Omni… 0.00595 0.0162 ## 7 Diurnal 1950 Herb… 0.0163 0.0221 ## 8 Crepuscu… 2.95 Omni… 0.0114 0.0231 ## 9 Diurnal 3.5 Carn… 0 0.0175 ## 10 Crepuscu… 26.5 Carn… 0.00401 0.0218 ## # … with 449 more rows, and 98 more variables: tfidf_text_african <dbl>, ## # tfidf_text_along <dbl>, tfidf_text_also <dbl>, ## # tfidf_text_although <dbl>, tfidf_text_animal <dbl>, ## # tfidf_text_animals <dbl>, tfidf_text_appearance <dbl>, ## # tfidf_text_areas <dbl>, tfidf_text_around <dbl>, ## # tfidf_text_bear <dbl>, tfidf_text_birds <dbl>, ## # tfidf_text_birth <dbl>, tfidf_text_black <dbl>, ## # tfidf_text_body <dbl>, tfidf_text_breed <dbl>, tfidf_text_can <dbl>, ## # tfidf_text_common <dbl>, tfidf_text_despite <dbl>, ## # tfidf_text_diet <dbl>, tfidf_text_different <dbl>, ## # tfidf_text_dog <dbl>, tfidf_text_due <dbl>, tfidf_text_eat <dbl>, ## # tfidf_text_eggs <dbl>, tfidf_text_elephant <dbl>, ## # tfidf_text_even <dbl>, tfidf_text_fact <dbl>, tfidf_text_feet <dbl>, ## # tfidf_text_female <dbl>, tfidf_text_females <dbl>, ## # tfidf_text_fish <dbl>, tfidf_text_food <dbl>, tfidf_text_found <dbl>, ## # tfidf_text_fur <dbl>, tfidf_text_generally <dbl>, ## # tfidf_text_habitat <dbl>, tfidf_text_however <dbl>, ## # tfidf_text_human <dbl>, tfidf_text_humans <dbl>, ## # tfidf_text_hunt <dbl>, tfidf_text_hunting <dbl>, ## # tfidf_text_including <dbl>, tfidf_text_insects <dbl>, ## # tfidf_text_just <dbl>, tfidf_text_known <dbl>, ## # tfidf_text_large <dbl>, tfidf_text_larger <dbl>, ## # tfidf_text_life <dbl>, tfidf_text_like <dbl>, tfidf_text_live <dbl>, ## # tfidf_text_long <dbl>, tfidf_text_make <dbl>, tfidf_text_male <dbl>, ## # tfidf_text_males <dbl>, tfidf_text_many <dbl>, tfidf_text_may <dbl>, ## # tfidf_text_means <dbl>, tfidf_text_monkey <dbl>, ## # tfidf_text_months <dbl>, tfidf_text_much <dbl>, ## # tfidf_text_name <dbl>, tfidf_text_natural <dbl>, ## # tfidf_text_number <dbl>, tfidf_text_often <dbl>, ## # tfidf_text_old <dbl>, tfidf_text_one <dbl>, tfidf_text_penguin <dbl>, ## # tfidf_text_people <dbl>, tfidf_text_population <dbl>, ## # tfidf_text_populations <dbl>, tfidf_text_predators <dbl>, ## # tfidf_text_prey <dbl>, tfidf_text_range <dbl>, ## # tfidf_text_rhino <dbl>, tfidf_text_sea <dbl>, tfidf_text_size <dbl>, ## # tfidf_text_small <dbl>, tfidf_text_smaller <dbl>, ## # tfidf_text_south <dbl>, tfidf_text_species <dbl>, ## # tfidf_text_tend <dbl>, tfidf_text_thought <dbl>, ## # tfidf_text_three <dbl>, tfidf_text_throughout <dbl>, ## # tfidf_text_tiger <dbl>, tfidf_text_time <dbl>, ## # tfidf_text_today <dbl>, tfidf_text_trees <dbl>, tfidf_text_two <dbl>, ## # tfidf_text_usually <dbl>, tfidf_text_water <dbl>, ## # tfidf_text_well <dbl>, tfidf_text_white <dbl>, tfidf_text_wild <dbl>, ## # tfidf_text_world <dbl>, tfidf_text_year <dbl>, ## # tfidf_text_years <dbl>, tfidf_text_young <dbl> ``` --- ### Your turn #8 Insert `step_ngram()` into recipe after tokenization. 
Play around with `num_tokens = ` and `min_num_tokens = ` ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tfidf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ``` ``` ## # A tibble: 459 x 103 ## lifestyle mean_weight diet tfidf_text_able tfidf_text_afri… ## <fct> <dbl> <fct> <dbl> <dbl> ## 1 Nocturnal 70 Omni… 0.0433 0.0252 ## 2 <NA> NA Unkn… 0 0.0310 ## 3 Diurnal 4.5 Carn… 0.0249 0 ## 4 <NA> NA Unkn… 0.0329 0 ## 5 Diurnal 4500 Herb… 0.0127 0.0344 ## 6 Nocturnal 2.95 Omni… 0.00595 0.0162 ## 7 Diurnal 1950 Herb… 0.0163 0.0221 ## 8 Crepuscu… 2.95 Omni… 0.0114 0.0231 ## 9 Diurnal 3.5 Carn… 0 0.0175 ## 10 Crepuscu… 26.5 Carn… 0.00401 0.0218 ## # … with 449 more rows, and 98 more variables: tfidf_text_african <dbl>, ## # tfidf_text_along <dbl>, tfidf_text_also <dbl>, ## # tfidf_text_although <dbl>, tfidf_text_animal <dbl>, ## # tfidf_text_animals <dbl>, tfidf_text_appearance <dbl>, ## # tfidf_text_areas <dbl>, tfidf_text_around <dbl>, ## # tfidf_text_bear <dbl>, tfidf_text_birds <dbl>, ## # tfidf_text_birth <dbl>, tfidf_text_black <dbl>, ## # tfidf_text_body <dbl>, tfidf_text_breed <dbl>, tfidf_text_can <dbl>, ## # tfidf_text_common <dbl>, tfidf_text_despite <dbl>, ## # tfidf_text_diet <dbl>, tfidf_text_different <dbl>, ## # tfidf_text_dog <dbl>, tfidf_text_due <dbl>, tfidf_text_eat <dbl>, ## # tfidf_text_eggs <dbl>, tfidf_text_elephant <dbl>, ## # tfidf_text_even <dbl>, tfidf_text_fact <dbl>, tfidf_text_feet <dbl>, ## # tfidf_text_female <dbl>, tfidf_text_females <dbl>, ## # tfidf_text_fish <dbl>, tfidf_text_food <dbl>, tfidf_text_found <dbl>, ## # tfidf_text_fur <dbl>, tfidf_text_generally <dbl>, ## # tfidf_text_habitat <dbl>, tfidf_text_however <dbl>, ## # tfidf_text_human <dbl>, tfidf_text_humans <dbl>, ## # tfidf_text_hunt <dbl>, tfidf_text_hunting <dbl>, ## # tfidf_text_including <dbl>, tfidf_text_insects <dbl>, ## # tfidf_text_just <dbl>, tfidf_text_known <dbl>, ## # tfidf_text_large <dbl>, tfidf_text_larger <dbl>, ## # tfidf_text_life <dbl>, tfidf_text_like <dbl>, tfidf_text_live <dbl>, ## # tfidf_text_long <dbl>, tfidf_text_make <dbl>, tfidf_text_male <dbl>, ## # tfidf_text_males <dbl>, tfidf_text_many <dbl>, tfidf_text_may <dbl>, ## # tfidf_text_means <dbl>, tfidf_text_monkey <dbl>, ## # tfidf_text_months <dbl>, tfidf_text_much <dbl>, ## # tfidf_text_name <dbl>, tfidf_text_natural <dbl>, ## # tfidf_text_number <dbl>, tfidf_text_often <dbl>, ## # tfidf_text_old <dbl>, tfidf_text_one <dbl>, tfidf_text_penguin <dbl>, ## # tfidf_text_people <dbl>, tfidf_text_population <dbl>, ## # tfidf_text_populations <dbl>, tfidf_text_predators <dbl>, ## # tfidf_text_prey <dbl>, tfidf_text_range <dbl>, ## # tfidf_text_rhino <dbl>, tfidf_text_sea <dbl>, tfidf_text_size <dbl>, ## # tfidf_text_small <dbl>, tfidf_text_smaller <dbl>, ## # tfidf_text_south <dbl>, tfidf_text_species <dbl>, ## # tfidf_text_tend <dbl>, tfidf_text_thought <dbl>, ## # tfidf_text_three <dbl>, tfidf_text_throughout <dbl>, ## # tfidf_text_tiger <dbl>, tfidf_text_time <dbl>, ## # tfidf_text_today <dbl>, tfidf_text_trees <dbl>, tfidf_text_two <dbl>, ## # tfidf_text_usually <dbl>, tfidf_text_water <dbl>, ## # tfidf_text_well <dbl>, tfidf_text_white <dbl>, tfidf_text_wild <dbl>, ## # tfidf_text_world <dbl>, tfidf_text_year <dbl>, ## # tfidf_text_years <dbl>, tfidf_text_young <dbl> ```
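---

### Aside: what is an n-gram?

An n-gram is a sequence of n consecutive tokens: the bigrams of "the guppy eats algae" are "the guppy", "guppy eats", and "eats algae". With `min_num_tokens = 1` and `num_tokens = 2`, `step_ngram()` keeps both the single words and the bigrams. A quick way to see this outside of a recipe is the tokenizers package (a sketch for illustration; this is not necessarily what `step_ngram()` calls internally):

```r
library(tokenizers)

# n_min and n play the role of min_num_tokens and num_tokens in step_ngram()
tokenize_ngrams("the guppy eats algae", n = 2, n_min = 1)
```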
--- ### Your turn #8 - result Insert `step_ngram()` into recipe after tokenization. Play around with `num_tokens = ` and `min_num_tokens = ` ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_ngram(text, min_num_tokens = 1, num_tokens = 2) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tfidf(text) rec_spec %>% prep() %>% bake(new_data = NULL) %>% names() ``` --- ### Your turn #8 - result ``` ## [1] "lifestyle" "mean_weight" ## [3] "diet" "tfidf_text_able" ## [5] "tfidf_text_able_to" "tfidf_text_african" ## [7] "tfidf_text_along" "tfidf_text_also" ## [9] "tfidf_text_although" "tfidf_text_and_the" ## [11] "tfidf_text_animal" "tfidf_text_animals" ## [13] "tfidf_text_are_also" "tfidf_text_areas" ## [15] "tfidf_text_around" "tfidf_text_as_a" ## [17] "tfidf_text_as_the" "tfidf_text_birds" ## [19] "tfidf_text_black" "tfidf_text_body" ## [21] "tfidf_text_can" "tfidf_text_common" ## [23] "tfidf_text_diet" "tfidf_text_different" ## [25] "tfidf_text_due" "tfidf_text_due_to" ## [27] "tfidf_text_eat" "tfidf_text_eggs" ## [29] "tfidf_text_even" "tfidf_text_female" ## [31] "tfidf_text_females" "tfidf_text_fish" ## [33] "tfidf_text_food" "tfidf_text_for_the" ## [35] "tfidf_text_found" "tfidf_text_found_in" ## [37] "tfidf_text_from_the" "tfidf_text_habitat" ## [39] "tfidf_text_have_been" "tfidf_text_however" ## [41] "tfidf_text_humans" "tfidf_text_in_a" ## [43] "tfidf_text_in_the" "tfidf_text_including" ## [45] "tfidf_text_is_a" "tfidf_text_is_the" ## [47] "tfidf_text_it_is" "tfidf_text_known" ## [49] "tfidf_text_known_to" "tfidf_text_large" ## [51] "tfidf_text_like" "tfidf_text_live" ## [53] "tfidf_text_long" "tfidf_text_male" ## [55] "tfidf_text_males" "tfidf_text_many" ## [57] "tfidf_text_may" "tfidf_text_months" ## [59] "tfidf_text_much" "tfidf_text_name" ## [61] "tfidf_text_natural" "tfidf_text_of_a" ## [63] "tfidf_text_of_the" "tfidf_text_of_their" ## [65] "tfidf_text_often" "tfidf_text_old" ## [67] "tfidf_text_on_the" "tfidf_text_one" ## [69] "tfidf_text_one_of" "tfidf_text_penguin" ## [71] "tfidf_text_population" "tfidf_text_predators" ## [73] "tfidf_text_prey" "tfidf_text_range" ## [75] "tfidf_text_sea" "tfidf_text_size" ## [77] "tfidf_text_small" "tfidf_text_species" ## [79] "tfidf_text_species_of" "tfidf_text_such_as" ## [81] "tfidf_text_that_are" "tfidf_text_that_they" ## [83] "tfidf_text_the_water" "tfidf_text_the_wild" ## [85] "tfidf_text_the_world" "tfidf_text_there_are" ## [87] "tfidf_text_they_are" "tfidf_text_thought" ## [89] "tfidf_text_throughout" "tfidf_text_time" ## [91] "tfidf_text_to_be" "tfidf_text_to_the" ## [93] "tfidf_text_today" "tfidf_text_two" ## [95] "tfidf_text_up_to" "tfidf_text_water" ## [97] "tfidf_text_well" "tfidf_text_white" ## [99] "tfidf_text_wild" "tfidf_text_with_the" ## [101] "tfidf_text_world" "tfidf_text_years" ## [103] "tfidf_text_young" ``` --- ### Your turn #8 - result Order matters! 
```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_ngram(text, min_num_tokens = 1, num_tokens = 2) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tfidf(text) rec_spec %>% prep() %>% bake(new_data = NULL) %>% names() ``` --- ### Your turn #8 - result ``` ## [1] "lifestyle" "mean_weight" ## [3] "diet" "tfidf_text_able" ## [5] "tfidf_text_africa" "tfidf_text_african" ## [7] "tfidf_text_along" "tfidf_text_also" ## [9] "tfidf_text_although" "tfidf_text_animal" ## [11] "tfidf_text_animals" "tfidf_text_appearance" ## [13] "tfidf_text_areas" "tfidf_text_around" ## [15] "tfidf_text_bear" "tfidf_text_birds" ## [17] "tfidf_text_birth" "tfidf_text_black" ## [19] "tfidf_text_body" "tfidf_text_breed" ## [21] "tfidf_text_can" "tfidf_text_common" ## [23] "tfidf_text_despite" "tfidf_text_diet" ## [25] "tfidf_text_different" "tfidf_text_dog" ## [27] "tfidf_text_due" "tfidf_text_eat" ## [29] "tfidf_text_eggs" "tfidf_text_elephant" ## [31] "tfidf_text_even" "tfidf_text_fact" ## [33] "tfidf_text_feet" "tfidf_text_female" ## [35] "tfidf_text_females" "tfidf_text_fish" ## [37] "tfidf_text_food" "tfidf_text_found" ## [39] "tfidf_text_fur" "tfidf_text_generally" ## [41] "tfidf_text_habitat" "tfidf_text_however" ## [43] "tfidf_text_human" "tfidf_text_humans" ## [45] "tfidf_text_hunt" "tfidf_text_hunting" ## [47] "tfidf_text_including" "tfidf_text_insects" ## [49] "tfidf_text_just" "tfidf_text_known" ## [51] "tfidf_text_large" "tfidf_text_larger" ## [53] "tfidf_text_life" "tfidf_text_like" ## [55] "tfidf_text_live" "tfidf_text_long" ## [57] "tfidf_text_make" "tfidf_text_male" ## [59] "tfidf_text_males" "tfidf_text_many" ## [61] "tfidf_text_may" "tfidf_text_means" ## [63] "tfidf_text_monkey" "tfidf_text_months" ## [65] "tfidf_text_much" "tfidf_text_name" ## [67] "tfidf_text_natural" "tfidf_text_number" ## [69] "tfidf_text_often" "tfidf_text_old" ## [71] "tfidf_text_one" "tfidf_text_penguin" ## [73] "tfidf_text_people" "tfidf_text_population" ## [75] "tfidf_text_populations" "tfidf_text_predators" ## [77] "tfidf_text_prey" "tfidf_text_range" ## [79] "tfidf_text_rhino" "tfidf_text_sea" ## [81] "tfidf_text_size" "tfidf_text_small" ## [83] "tfidf_text_smaller" "tfidf_text_south" ## [85] "tfidf_text_species" "tfidf_text_tend" ## [87] "tfidf_text_thought" "tfidf_text_three" ## [89] "tfidf_text_throughout" "tfidf_text_tiger" ## [91] "tfidf_text_time" "tfidf_text_today" ## [93] "tfidf_text_trees" "tfidf_text_two" ## [95] "tfidf_text_usually" "tfidf_text_water" ## [97] "tfidf_text_well" "tfidf_text_white" ## [99] "tfidf_text_wild" "tfidf_text_world" ## [101] "tfidf_text_year" "tfidf_text_years" ## [103] "tfidf_text_young" ``` --- # Final recipe ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_novel(lifestyle) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.01) %>% step_dummy(lifestyle) %>% step_log(mean_weight) %>% step_meanimpute(mean_weight) %>% step_tokenize(text) %>% step_tokenfilter(text, max_tokens = tune()) %>% step_tfidf(text) ``` Also, what does `tune()` mean here? 🤔 --- class: inverse, right, middle ## What kind of **models** work well for text? --- # Text models Remember that text data is sparse! 😮 -- - Regularized linear models (glmnet) - Support vector machines - naive Bayes - Tree-based models like random forest? --- # Text models Remember that text data is sparse! 😮 - Regularized linear models (glmnet) - Support vector machines - naive Bayes - Tree-based models like random forest? 
🙅 --- class: inverse, right, middle # Does text data have to be **sparse**? --- >### You shall know a word by the company it keeps. #### [💬 John Rupert Firth](https://en.wikiquote.org/wiki/John_Rupert_Firth) -- Learn more about word embeddings: - in [Chapter 5](https://smltar.com/embeddings.html) - at [juliasilge.github.io/why-r-webinar/](https://juliasilge.github.io/why-r-webinar/) --- # To specify a model in tidymodels 1\. Pick a **model** 2\. Set the **mode** (if needed) 3\. Set the **engine** --- background-image: url(https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/parsnip.png) background-size: cover .footnote[ Art by [Allison Horst](https://github.com/allisonhorst/stats-illustrations) ] --- # To specify a model in tidymodels All available models are listed at <https://tidymodels.org/find/parsnip> <iframe src="https://tidymodels.org/find/parsnip" width="100%" height="400px"></iframe> --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r svm_rbf() %>% set_mode(mode = "regression") ``` ``` ## Radial Basis Function Support Vector Machine Specification (regression) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r svm_rbf() %>% set_mode(mode = "classification") ``` ``` ## Radial Basis Function Support Vector Machine Specification (classification) ``` --- class: middle # `set_engine()` The same model can be implemented by multiple computational engines ```r svm_rbf() %>% set_engine("kernlab") ``` ``` ## Radial Basis Function Support Vector Machine Specification (unknown) ## ## Computational engine: kernlab ``` --- class: middle # `set_engine()` The same model can be implemented by multiple computational engines ```r svm_rbf() %>% set_engine("liquidSVM") ``` ``` ## Radial Basis Function Support Vector Machine Specification (unknown) ## ## Computational engine: liquidSVM ``` --- # What makes a model? ```r lasso_spec <- multinom_reg(penalty = tune(), mixture = 1) %>% set_mode("classification") %>% set_engine("glmnet") lasso_spec ``` ``` ## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = tune() ## mixture = 1 ## ## Computational engine: glmnet ``` -- It's `tune()` again! 😟 --- ## Parameters and... hyperparameters? - Some model parameters can be learned from data during fitting/training -- - Some CANNOT 😱 -- - These are **hyperparameters** of a model, and we estimate them by training lots of models with different hyperparameters and comparing them --- # A grid of possible hyperparameters .pull-left[ ```r param_grid <- grid_regular( penalty(range = c(-4, 0)), max_tokens(range = c(100, 500)), levels = c(penalty = 50, max_tokens = 4) ) param_grid ``` ] .pull-right[ ``` ## # A tibble: 200 x 2 ## penalty max_tokens ## <dbl> <int> ## 1 0.0001 100 ## 2 0.000121 100 ## 3 0.000146 100 ## 4 0.000176 100 ## 5 0.000212 100 ## 6 0.000256 100 ## 7 0.000309 100 ## 8 0.000373 100 ## 9 0.000450 100 ## 10 0.000543 100 ## # … with 190 more rows ``` ] --- class: inverse, right, middle # How can we **compare** and **evaluate** these different models? 
--- background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg) background-size: 60% --- # Spend your data budget ```r set.seed(123) animals_folds <- vfold_cv(animals_training, v = 5, strata = diet) animals_folds ``` ``` ## # 5-fold cross-validation using stratification ## # A tibble: 5 x 2 ## splits id ## <list> <chr> ## 1 <split [365/94]> Fold1 ## 2 <split [367/92]> Fold2 ## 3 <split [367/92]> Fold3 ## 4 <split [368/91]> Fold4 ## 5 <split [369/90]> Fold5 ``` --- class: middle, center, inverse # ✨ CROSS-VALIDATION ✨ --- background-image: url(images/cross-validation/Slide2.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide3.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide4.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide5.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide6.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide7.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide8.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide9.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide10.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide11.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- class: inverse, right, middle # Spend your data wisely to create **simulated** validation sets --- class: inverse, right, middle # Now we have **resamples**, **features**, plus a **model** --- .pull-left[ ## Create a workflow ```r wf_spec <- workflow() %>% add_recipe(rec_spec) %>% add_model(lasso_spec) wf_spec ``` ] .pull-right[ ``` ## ══ Workflow ══════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: multinom_reg() ## ## ── Preprocessor ────────────────────────────────────────────────────────── ## 9 Recipe Steps ## ## ● step_novel() ## ● step_unknown() ## ● step_other() ## ● step_dummy() ## ● step_log() ## ● step_meanimpute() ## ● step_tokenize() ## ● step_tokenfilter() ## ● step_tfidf() ## ## ── Model ───────────────────────────────────────────────────────────────── ## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = tune() ## mixture = 1 ## ## Computational engine: glmnet ``` ] --- class: inverse, right, middle # What is a `workflow()`? --- ## Time to tune! ⚡ ```r set.seed(42) lasso_rs <- tune_grid( wf_spec, resamples = animals_folds, grid = param_grid, control = control_grid(save_pred = TRUE, verbose = TRUE) ) ``` --- ## Time to tune! 
⚡ ``` ## # Tuning results ## # 5-fold cross-validation using stratification ## # A tibble: 5 x 5 ## splits id .metrics .notes .predictions ## <list> <chr> <list> <list> <list> ## 1 <split [365/9… Fold1 <tibble [400 × … <tibble [0 × … <tibble [18,800 × … ## 2 <split [367/9… Fold2 <tibble [400 × … <tibble [0 × … <tibble [18,400 × … ## 3 <split [367/9… Fold3 <tibble [400 × … <tibble [0 × … <tibble [18,400 × … ## 4 <split [368/9… Fold4 <tibble [400 × … <tibble [0 × … <tibble [18,200 × … ## 5 <split [369/9… Fold5 <tibble [400 × … <tibble [0 × … <tibble [18,000 × … ``` --- # Look at the tuning results 👀 ```r collect_metrics(lasso_rs) ``` ``` ## # A tibble: 400 x 8 ## penalty max_tokens .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.0001 100 accuracy multiclass 0.565 5 0.0157 Recipe1_Mo… ## 2 0.0001 100 roc_auc hand_till 0.806 5 0.00639 Recipe1_Mo… ## 3 0.000121 100 accuracy multiclass 0.565 5 0.0157 Recipe1_Mo… ## 4 0.000121 100 roc_auc hand_till 0.807 5 0.00638 Recipe1_Mo… ## 5 0.000146 100 accuracy multiclass 0.565 5 0.0157 Recipe1_Mo… ## 6 0.000146 100 roc_auc hand_till 0.808 5 0.00633 Recipe1_Mo… ## 7 0.000176 100 accuracy multiclass 0.565 5 0.0157 Recipe1_Mo… ## 8 0.000176 100 roc_auc hand_till 0.809 5 0.00638 Recipe1_Mo… ## 9 0.000212 100 accuracy multiclass 0.569 5 0.0174 Recipe1_Mo… ## 10 0.000212 100 roc_auc hand_till 0.810 5 0.00669 Recipe1_Mo… ## # … with 390 more rows ``` --- # Look at the tuning results 👀 ```r autoplot(lasso_rs) ``` <img src="index_files/figure-html/unnamed-chunk-64-1.png" width="700px" style="display: block; margin: auto;" /> --- # Look at the tuning results 👀 ```r lasso_rs %>% show_best("roc_auc") ``` ``` ## # A tibble: 5 x 8 ## penalty max_tokens .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.0339 500 roc_auc hand_till 0.926 5 0.00728 Recipe4_Model… ## 2 0.0409 500 roc_auc hand_till 0.925 5 0.00718 Recipe4_Model… ## 3 0.0281 500 roc_auc hand_till 0.925 5 0.00828 Recipe4_Model… ## 4 0.0233 500 roc_auc hand_till 0.924 5 0.00870 Recipe4_Model… ## 5 0.0193 500 roc_auc hand_till 0.924 5 0.00928 Recipe4_Model… ``` --- # Your turn #9 Run your own model and see what results you can find! One possible variation is sketched on the next slide.
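---

# Your turn #9 - one possibility

One possible variation (a sketch, not the "right" answer; the object names `enet_spec` and `enet_rs` are just for this example): keep the same recipe, folds, and grid, but swap the pure lasso (`mixture = 1`) for an elastic net.

```r
# An elastic net mixes the lasso and ridge penalties
enet_spec <- multinom_reg(penalty = tune(), mixture = 0.5) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

enet_rs <- tune_grid(
  wf_spec %>% update_model(enet_spec),  # reuse the workflow, swap the model
  resamples = animals_folds,
  grid = param_grid
)

show_best(enet_rs, "roc_auc")
```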
--- # The **best** 🥇 hyperparameters ```r best_roc_auc <- select_best(lasso_rs, "roc_auc") best_roc_auc ``` ``` ## # A tibble: 1 x 3 ## penalty max_tokens .config ## <dbl> <int> <chr> ## 1 0.0339 500 Recipe4_Model032 ``` --- # Evaluate the best model 📐 ```r collect_predictions(lasso_rs, parameters = best_roc_auc) ``` ``` ## # A tibble: 459 x 11 ## id .pred_Carnivore .pred_Herbivore .pred_Omnivore .pred_Unknown ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Fold1 0.124 0.0715 0.121 0.684 ## 2 Fold1 0.821 0.0533 0.113 0.0127 ## 3 Fold1 0.351 0.330 0.278 0.0402 ## 4 Fold1 0.0496 0.0430 0.0915 0.816 ## 5 Fold1 0.205 0.0415 0.0768 0.677 ## 6 Fold1 0.468 0.179 0.305 0.0492 ## 7 Fold1 0.352 0.196 0.420 0.0325 ## 8 Fold1 0.149 0.0840 0.181 0.586 ## 9 Fold1 0.829 0.108 0.0556 0.00688 ## 10 Fold1 0.139 0.636 0.206 0.0192 ## # … with 449 more rows, and 6 more variables: .row <int>, ## # max_tokens <int>, penalty <dbl>, .pred_class <fct>, diet <fct>, ## # .config <chr> ``` --- ## Evaluate the best model 📏 ```r collect_predictions(lasso_rs, parameters = best_roc_auc) %>% roc_curve(truth = diet, .pred_Carnivore:.pred_Unknown) %>% autoplot() ``` --- ## Evaluate the best model 📏 <img src="index_files/figure-html/unnamed-chunk-70-1.png" width="700px" style="display: block; margin: auto;" /> --- ## Evaluate the best model 📏 ```r collect_predictions(lasso_rs, parameters = best_roc_auc) %>% group_by(id) %>% roc_curve(truth = diet, .pred_Carnivore:.pred_Unknown) %>% autoplot() ``` --- ## Evaluate the best model 📏 <img src="index_files/figure-html/unnamed-chunk-72-1.png" width="700px" style="display: block; margin: auto;" /> --- # Update the workflow We can update our workflow with the best performing hyperparameters. ```r wf_spec_final <- finalize_workflow(wf_spec, best_roc_auc) ``` This workflow is ready to go! It can now be applied to new data. --- class: inverse, right, middle # How is our model **thinking**? --- ## Variable importance ```r library(vip) wf_spec_final %>% fit(animals_training) %>% pull_workflow_fit() %>% vi() %>% filter(!str_detect(Variable, "tfidf")) %>% filter(Importance != 0) ``` ``` ## # A tibble: 4 x 3 ## Variable Importance Sign ## <chr> <dbl> <chr> ## 1 lifestyle_Pack 1.04 POS ## 2 lifestyle_Crepuscular 0.649 POS ## 3 lifestyle_Solitary 0.0422 POS ## 4 lifestyle_unknown -0.578 NEG ``` --- ## Variable importance ```r vi_data <- wf_spec_final %>% fit(animals_training) %>% pull_workflow_fit() %>% vi() %>% mutate(Variable = str_remove_all(Variable, "tfidf_text_")) %>% filter(Importance != 0) ``` --- ## Variable importance ```r vi_data ``` ``` ## # A tibble: 69 x 3 ## Variable Importance Sign ## <chr> <dbl> <chr> ## 1 carnivores 552. POS ## 2 carnivorous 461. POS ## 3 prey 210. POS ## 4 hunt 176. POS ## 5 numbers 170. POS ## 6 population 115. POS ## 7 ocean 110. POS ## 8 near 91.6 POS ## 9 feet 87.8 POS ## 10 hatch 86.1 POS ## # … with 59 more rows ``` --- # Final fit We will now use `last_fit()` to **fit** our model one last time on our training data and **evaluate** it on our testing data. 
```r final_fit <- last_fit( wf_spec_final, animals_split ) ``` --- class: inverse, right, middle # Notice that this is the **first** and **only** time we have used our **testing data** --- # Evaluate on the **test** data 📐 ```r final_fit %>% collect_metrics() ``` ``` ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy multiclass 0.768 ## 2 roc_auc hand_till 0.912 ``` --- ```r final_fit %>% collect_predictions() %>% conf_mat(truth = diet, .pred_class) %>% autoplot(type = "heatmap") ``` <img src="index_files/figure-html/unnamed-chunk-79-1.png" width="700px" style="display: block; margin: auto;" /> --- class: center, middle # Thanks! ##[smltar.com](https://smltar.com/) <img style="border-radius: 50%;" src="https://github.com/EmilHvitfeldt.png" width="150px"/> ###
[EmilHvitfeldt](https://github.com/EmilHvitfeldt/) ###
[@Emil_Hvitfeldt](https://twitter.com/Emil_Hvitfeldt) ###
[emilhvitfeldt](https://linkedin.com/in/emilhvitfeldt/) ###
[www.hvitfeldt.me](https://www.hvitfeldt.me)