class: center, middle, title-slide # Modeling ## R/Pharma 2020 Text modeling workshop ### Emil Hvitfeldt ### 2020-10-09 --- # Text as data Let's take a look at the data again ```r library(tidyverse) library(animals) glimpse(animals) ``` ``` ## Rows: 610 ## Columns: 4 ## $ text <chr> "Aardvark Classification and Evolution\nAardvarks are sma… ## $ diet <chr> "Omnivore", "Unknown", "Carnivore", "Unknown", "Unknown",… ## $ lifestyle <chr> "Nocturnal", NA, "Diurnal", NA, NA, "Diurnal", "Nocturnal… ## $ mean_weight <dbl> 70.0000, NA, 4.5000, NA, NA, 4500.0000, 2.9500, 0.1225, 1… ``` --- # Text as data ```r animals %>% sample_n(1) %>% pull(text) ``` ``` ## [1] "The guppy (also known as the millionfish) is a small colourful species of freshwater tropical fish that is found naturally in the rivers and lakes of South America. There are nearly 300 different types of guppy spread throughout Barbados, Brazil, Guyana, Netherlands Antilles, Trinidad and Tobago, and Venezuela.\nThe guppy is one of the most popular types of aquarium tropical fish in the world as they are small, colourful and easier to keep than many other species of fish. The guppy generally lives from 3 to 5 years old in captivity and slightly less in the wild.\nThe guppy has been introduced to most other countries mainly as a method of mosquito prevention as the guppy eats the mosquito larva before they are able to fly, therefore slowing down the spread of malaria.\nThe guppy is an extremely colourful fish and often displays elaborate patterns on its tail fin. The female guppy and the male guppy can be identified quite easily as the female guppy has a small, patterned tail where the tail of the male guppy is much longer and generally has fewer markings. The female guppy also tends to be larger in size than the male guppy.\nThe guppy gives birth to live young, meaning that the eggs are first incubated inside the female guppy and hatch there too. The incubation period of the guppy is about a month after which the female guppy can give birth to up to 100 baby guppies, which are called fry. As soon as they are born, the guppy fry are able to eat and swim around freely. The guppy fry are also able to sense and avoid danger which is important when around older guppies as they often eat the fry. The guppy fry have matured in adult guppies within a couple of months.\nAfter mating just once with a male guppy, the female guppy is able to give birth numerous times. The female guppy stores the sperm of the male guppy inside her and just hours after giving birth to her fry, the female guppy is ready to become pregnant again and will do so using the stored sperm (hence why the guppy is often called the millionfish).\nThe guppy is an omnivorous animal and eats a wide range of organic matter that is available in the water. The guppy mainly feeds on algae and brine shrimp, and often eat particles of food from the water that have been left by a larger fish.\nThe guppy has many natural predators in the wild (and in tanks) mainly due to their small size and their elaborate fins often attract unwanted attention. Birds such as kingfishers and larger fish are the primary predators of the guppy, so naturally, guppies that are kept in a tank should be kept with other very small fish to prevent them from being eaten." 
``` --- # Text as data - Text like this can be used for **supervised** or **predictive** modeling -- - We can build both regression and classification models with text data -- - We can use the ways language exhibits organization to create features for modeling --- # Modeling Packages ```r library(tidymodels) library(textrecipes) ``` - [tidymodels](https://www.tidymodels.org/) is a collection of packages for modeling and machine learning using tidyverse principles - [textrecipes](https://textrecipes.tidymodels.org/) extends the recipes package to handle text preprocessing --- # Modeling workflow ![](images/tidymodels-textrecipes.png) --- .pull-left[ # Modeling task You are a new zookeeper and your job is to figure out what to feed each animal based on a written description and some other metrics that are available to you ] -- .pull-right[ ## Question Is this a realistic problem to have? Would it be a good use case for machine learning?
] --- # Notes - This could be better handled by an expert - Disastrous results with misclassifications - Overlap between classes - Has an "unknown" field - Uses both numeric/factor variables and text variables --- # Class imbalance <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="700px" style="display: block; margin: auto;" /> --- class: inverse, right, middle # Let's approach this as a **multiclass classification task** --- # Data splitting The testing set is a precious resource which can be used only once <img src="index_files/figure-html/all-split-1.png" width="700px" style="display: block; margin: auto;" /> --- # Data splitting ## With {rsample} ```r set.seed(1234) animals_split <- initial_split(animals) animals_training <- training(animals_split) animals_testing <- testing(animals_split) ``` --- # Your turn #4 Specify the stratification variable `diet` and extract the `training()` and `testing()` datasets ```r set.seed(1234) animals_split <- initial_split(animals, strata = ___) animals_split animals_training <- ___(animals_split) animals_testing <- ___(animals_split) ``` --- # Your turn #4 - results Setting `strata = diet` makes sure that the proportions of `diet` are preserved in the split ```r set.seed(1234) animals_split <- initial_split(animals, strata = diet) animals_split ``` ``` ## <Analysis/Assess/Total> ## <459/151/610> ``` ```r animals_training <- training(animals_split) animals_testing <- testing(animals_split) ``` --- class: inverse # What mistake have we made already?
--- class: inverse # What mistake have we made already? ### We did EDA on the whole dataset ### By not restricting it to the training set => data leakage --- # Feature selection checklist -- - Is it ethical to use this variable? (or even legal?) -- - Will this variable be available at prediction time? -- - Does this variable contribute to explainability? --- ## Our Variables Response: `diet` Categorical: `lifestyle` Numeric: `mean_weight` Text: `text` --- # Categorical: lifestyle -- ```r animals_training %>% count(lifestyle) %>% ggplot(aes(lifestyle, n)) + geom_col() ``` <img src="index_files/figure-html/unnamed-chunk-13-1.png" width="700px" style="display: block; margin: auto;" /> --- # How should we deal with lifestyle? - Missing values - Many categories
--- class: center # {recipes} Flexible and reproducible preprocessing framework ![:scale 30%](images/recipes.png) --- class: middle # How to build a recipe 1. Start the recipe() 2. Define the variables involved 3. Describe preprocessing step-by-step --- # recipe() Creates a recipe for a set of variables ```r recipe(response ~ ., data = data_set) ``` --- # recipe() Creates a recipe for a set of variables ```r recipe(diet ~ ., data = animals_training) ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ``` --- # step_*() Complete list at https://recipes.tidymodels.org/reference/ --- # lifestyle steps `step_unknown()` will replace missing values with an `unknown` level ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% prep() rec_spec ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Training data contained 459 data points and 163 incomplete rows. ## ## Operations: ## ## Unknown factor level assignment for lifestyle [trained] ``` --- # lifestyle steps `step_other()` will pool together low-frequency values --- ## Your turn #5 Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(___, threshold = ___) rec_spec %>% tidy(2) ```
--- ## Your turn #5 - result Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.05) rec_spec %>% prep() %>% tidy() ``` ``` ## # A tibble: 2 x 6 ## number operation type trained skip id ## <int> <chr> <chr> <lgl> <lgl> <chr> ## 1 1 step unknown TRUE FALSE unknown_G81f6 ## 2 2 step other TRUE FALSE other_cJJkI ``` --- ## Your turn #5 - result Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.05) rec_spec %>% prep() %>% tidy(2) ``` ``` ## # A tibble: 4 x 3 ## terms retained id ## <chr> <chr> <chr> ## 1 lifestyle Diurnal other_4BR8s ## 2 lifestyle Herd other_4BR8s ## 3 lifestyle Solitary other_4BR8s ## 4 lifestyle unknown other_4BR8s ``` --- ## Your turn #5 - result Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.1) rec_spec %>% prep() %>% tidy(2) ``` ``` ## # A tibble: 2 x 3 ## terms retained id ## <chr> <chr> <chr> ## 1 lifestyle Solitary other_EG0Sj ## 2 lifestyle unknown other_EG0Sj ``` --- ## Your turn #5 - result Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.01) rec_spec %>% prep() %>% tidy(2) ``` ``` ## # A tibble: 11 x 3 ## terms retained id ## <chr> <chr> <chr> ## 1 lifestyle Colony other_wEbxL ## 2 lifestyle Crepuscular other_wEbxL ## 3 lifestyle Diurnal other_wEbxL ## 4 lifestyle Flock other_wEbxL ## 5 lifestyle Group other_wEbxL ## 6 lifestyle Herd other_wEbxL ## 7 lifestyle Nocturnal other_wEbxL ## 8 lifestyle Pack other_wEbxL ## 9 lifestyle Solitary other_wEbxL ## 10 lifestyle Troop other_wEbxL ## 11 lifestyle unknown other_wEbxL ``` --- ## Dummifying Specify variables and play around with the threshold, default is 0.05 ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.01) %>% step_dummy(lifestyle) rec_spec ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Unknown factor level assignment for lifestyle ## Collapsing factor levels for lifestyle ## Dummy variables from lifestyle ``` --- # Numeric - mean_weight ```r animals_training %>% ggplot(aes(mean_weight)) + geom_histogram() ``` <img src="index_files/figure-html/unnamed-chunk-25-1.png" width="700px" style="display: block; margin: auto;" /> --- # Numeric - mean_weight ```r animals_training %>% ggplot(aes(mean_weight)) + geom_histogram() + scale_x_log10() ``` <img src="index_files/figure-html/unnamed-chunk-26-1.png" width="700px" style="display: block; margin: auto;" /> --- # Logging and mean imputation ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.01) %>% step_dummy(lifestyle) %>% step_log(mean_weight) %>% step_meanimpute(mean_weight) ``` --- class: middle # Text preprocessing workflow - turn text into tokens - modify/filter tokens - count tokens --- ## Text preprocessing workflow - tokenize .pull-left[ ```r recipe(diet ~ ., data = animals_training) %>% # Tokenize to words step_tokenize(text) ``` ] .pull-right[ ``` ## Data Recipe ## 
## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Tokenization for text ``` ] --- ## Text preprocessing workflow - modify tokens .pull-left[ ```r recipe(diet ~ ., data = animals_training) %>% # Tokenize to words step_tokenize(text) %>% # Remove stopwords step_stopwords(text) %>% # Remove less frequent words step_tokenfilter(text, max_tokens = 100) ``` ] .pull-right[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Tokenization for text ## Stop word removal for text ## Text filtering for text ``` ] --- ## Text preprocessing workflow - count tokens .pull-left[ ```r recipe(diet ~ ., data = animals_training) %>% # Tokenize to words step_tokenize(text) %>% # Remove stopwords step_stopwords(text) %>% # Remove less frequent words step_tokenfilter(text, max_tokens = 100) %>% # Calculate term frequencies step_tf(text) ``` ] .pull-right[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Tokenization for text ## Stop word removal for text ## Text filtering for text ## Term frequency with text ``` ] --- ### Your turn #6 play around with the arguments in step_tokenfilter() and see what results we get ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ```
--- ### Your turn #6 - result play around with the arguments in step_tokenfilter() and see what results we get ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ``` --- ### Your turn #6 - result ``` ## # A tibble: 459 x 103 ## lifestyle mean_weight diet tf_text_able tf_text_africa tf_text_african ## <fct> <dbl> <fct> <dbl> <dbl> <dbl> ## 1 Nocturnal 70 Omni… 7 3 1 ## 2 <NA> NA Unkn… 0 1 1 ## 3 Diurnal 4.5 Carn… 4 0 0 ## 4 <NA> NA Unkn… 2 0 0 ## 5 Diurnal 4500 Herb… 3 6 67 ## 6 Nocturnal 2.95 Omni… 1 2 55 ## 7 Diurnal 1950 Herb… 3 3 59 ## 8 Crepuscu… 2.95 Omni… 2 3 44 ## 9 Diurnal 3.5 Carn… 0 3 47 ## 10 Crepuscu… 26.5 Carn… 1 4 51 ## # … with 449 more rows, and 97 more variables: tf_text_along <dbl>, ## # tf_text_also <dbl>, tf_text_although <dbl>, tf_text_animal <dbl>, ## # tf_text_animals <dbl>, tf_text_appearance <dbl>, tf_text_areas <dbl>, ## # tf_text_around <dbl>, tf_text_bear <dbl>, tf_text_birds <dbl>, ## # tf_text_birth <dbl>, tf_text_black <dbl>, tf_text_body <dbl>, ## # tf_text_breed <dbl>, tf_text_can <dbl>, tf_text_common <dbl>, ## # tf_text_despite <dbl>, tf_text_diet <dbl>, tf_text_different <dbl>, ## # tf_text_dog <dbl>, tf_text_due <dbl>, tf_text_eat <dbl>, ## # tf_text_eggs <dbl>, tf_text_elephant <dbl>, tf_text_even <dbl>, ## # tf_text_fact <dbl>, tf_text_feet <dbl>, tf_text_female <dbl>, ## # tf_text_females <dbl>, tf_text_fish <dbl>, tf_text_food <dbl>, ## # tf_text_found <dbl>, tf_text_fur <dbl>, tf_text_generally <dbl>, ## # tf_text_habitat <dbl>, tf_text_however <dbl>, tf_text_human <dbl>, ## # tf_text_humans <dbl>, tf_text_hunt <dbl>, tf_text_hunting <dbl>, ## # tf_text_including <dbl>, tf_text_insects <dbl>, tf_text_just <dbl>, ## # tf_text_known <dbl>, tf_text_large <dbl>, tf_text_larger <dbl>, ## # tf_text_life <dbl>, tf_text_like <dbl>, tf_text_live <dbl>, ## # tf_text_long <dbl>, tf_text_make <dbl>, tf_text_male <dbl>, ## # tf_text_males <dbl>, tf_text_many <dbl>, tf_text_may <dbl>, ## # tf_text_means <dbl>, tf_text_monkey <dbl>, tf_text_months <dbl>, ## # tf_text_much <dbl>, tf_text_name <dbl>, tf_text_natural <dbl>, ## # tf_text_number <dbl>, tf_text_often <dbl>, tf_text_old <dbl>, ## # tf_text_one <dbl>, tf_text_penguin <dbl>, tf_text_people <dbl>, ## # tf_text_population <dbl>, tf_text_populations <dbl>, ## # tf_text_predators <dbl>, tf_text_prey <dbl>, tf_text_range <dbl>, ## # tf_text_rhino <dbl>, tf_text_sea <dbl>, tf_text_size <dbl>, ## # tf_text_small <dbl>, tf_text_smaller <dbl>, tf_text_south <dbl>, ## # tf_text_species <dbl>, tf_text_tend <dbl>, tf_text_thought <dbl>, ## # tf_text_three <dbl>, tf_text_throughout <dbl>, tf_text_tiger <dbl>, ## # tf_text_time <dbl>, tf_text_today <dbl>, tf_text_trees <dbl>, ## # tf_text_two <dbl>, tf_text_usually <dbl>, tf_text_water <dbl>, ## # tf_text_well <dbl>, tf_text_white <dbl>, tf_text_wild <dbl>, ## # tf_text_world <dbl>, tf_text_year <dbl>, tf_text_years <dbl>, ## # tf_text_young <dbl> ``` --- ### Your turn #6 - result If we don't filter the tokens then we get a very large number of columns ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% #step_tokenfilter(text, max_tokens = 100) %>% step_tf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ``` --- ### Your turn #6 - result ``` ## # A tibble: 459 x 14,347 ## lifestyle mean_weight diet tf_text_0.2 tf_text_0.24 tf_text_0.5 ## <fct> 
<dbl> <fct> <dbl> <dbl> <dbl> ## 1 Nocturnal 70 Omni… 0 0 0 ## 2 <NA> NA Unkn… 0 0 0 ## 3 Diurnal 4.5 Carn… 0 0 0 ## 4 <NA> NA Unkn… 0 0 0 ## 5 Diurnal 4500 Herb… 0 0 0 ## 6 Nocturnal 2.95 Omni… 0 0 0 ## 7 Diurnal 1950 Herb… 0 0 0 ## 8 Crepuscu… 2.95 Omni… 0 0 0 ## 9 Diurnal 3.5 Carn… 0 0 0 ## 10 Crepuscu… 26.5 Carn… 0 0 0 ## # … with 449 more rows, and 14,341 more variables: tf_text_0.5cm <dbl>, ## # tf_text_0.6 <dbl>, tf_text_0.7 <dbl>, tf_text_0.88 <dbl>, ## # tf_text_0.9 <dbl>, tf_text_011 <dbl>, tf_text_014 <dbl>, ## # tf_text_053 <dbl>, tf_text_07 <dbl>, tf_text_071 <dbl>, ## # tf_text_1 <dbl>, `tf_text_1,000` <dbl>, `tf_text_1,000,000` <dbl>, ## # `tf_text_1,000kg` <dbl>, `tf_text_1,000km` <dbl>, ## # `tf_text_1,037` <dbl>, `tf_text_1,100` <dbl>, `tf_text_1,175` <dbl>, ## # `tf_text_1,200` <dbl>, `tf_text_1,200,000` <dbl>, ## # `tf_text_1,215` <dbl>, `tf_text_1,300` <dbl>, `tf_text_1,300m` <dbl>, ## # `tf_text_1,360` <dbl>, `tf_text_1,400` <dbl>, `tf_text_1,440` <dbl>, ## # `tf_text_1,500` <dbl>, `tf_text_1,500m` <dbl>, `tf_text_1,600` <dbl>, ## # `tf_text_1,700` <dbl>, `tf_text_1,760` <dbl>, `tf_text_1,800` <dbl>, ## # tf_text_1.1 <dbl>, tf_text_1.17 <dbl>, tf_text_1.2 <dbl>, ## # tf_text_1.3 <dbl>, tf_text_1.4 <dbl>, tf_text_1.5 <dbl>, ## # tf_text_1.5cm <dbl>, tf_text_1.5kg <dbl>, tf_text_1.5m <dbl>, ## # tf_text_1.6 <dbl>, tf_text_1.7 <dbl>, tf_text_1.8 <dbl>, ## # tf_text_1.9 <dbl>, tf_text_10 <dbl>, `tf_text_10,000` <dbl>, ## # tf_text_10.4 <dbl>, tf_text_10.5 <dbl>, tf_text_100 <dbl>, ## # `tf_text_100,000` <dbl>, tf_text_1000 <dbl>, tf_text_100cm <dbl>, ## # tf_text_100ft <dbl>, tf_text_100kg <dbl>, tf_text_100km <dbl>, ## # tf_text_100m <dbl>, tf_text_102 <dbl>, tf_text_103 <dbl>, ## # tf_text_104 <dbl>, tf_text_105 <dbl>, tf_text_107 <dbl>, ## # tf_text_10cm <dbl>, tf_text_10kg <dbl>, tf_text_10km <dbl>, ## # tf_text_11 <dbl>, `tf_text_11,000` <dbl>, tf_text_110 <dbl>, ## # `tf_text_110,000` <dbl>, tf_text_110cm <dbl>, tf_text_112 <dbl>, ## # tf_text_114 <dbl>, tf_text_116 <dbl>, tf_text_119 <dbl>, ## # tf_text_11mph <dbl>, tf_text_12 <dbl>, `tf_text_12,000` <dbl>, ## # tf_text_12.6 <dbl>, tf_text_120 <dbl>, `tf_text_120,000` <dbl>, ## # tf_text_1200 <dbl>, tf_text_120g <dbl>, tf_text_124 <dbl>, ## # tf_text_124.2 <dbl>, tf_text_125 <dbl>, tf_text_126 <dbl>, ## # `tf_text_127,000` <dbl>, tf_text_129 <dbl>, tf_text_12th <dbl>, ## # tf_text_13 <dbl>, `tf_text_13,000` <dbl>, `tf_text_13,800` <dbl>, ## # tf_text_13.2 <dbl>, tf_text_13.5 <dbl>, tf_text_130 <dbl>, ## # `tf_text_130,000` <dbl>, tf_text_1300 <dbl>, tf_text_132 <dbl>, ## # tf_text_135 <dbl>, tf_text_13cm <dbl>, … ``` --- ### Your turn #7 Swap `step_tf()` with `step_tfidf()` and see the change ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ```
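---

### Aside: what does tf-idf do?

Term frequency alone treats every word the same, so words that appear in almost every description carry as much weight as words specific to a few animals. tf-idf down-weights the former and up-weights the latter. Below is a minimal sketch of the idea in base R on a made-up document-term matrix; textrecipes' exact weighting and smoothing may differ slightly.

```r
# Toy document-term counts: 3 documents, 3 terms (made-up numbers)
counts <- matrix(
  c(2, 0, 1,
    0, 1, 1,
    3, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("guppy", "tiger", "animal"))
)

tf    <- counts / rowSums(counts)                 # term frequency within each document
idf   <- log(nrow(counts) / colSums(counts > 0))  # rarer terms get a larger weight
tfidf <- sweep(tf, 2, idf, `*`)                   # "animal" occurs in every document, so its idf (and tf-idf) is 0

tfidf
```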
--- ### Your turn #7 - result Swap `step_tf()` with `step_tfidf()` and see the change ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tfidf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ``` --- ### Your turn #7 - result ``` ## # A tibble: 459 x 103 ## lifestyle mean_weight diet tfidf_text_able tfidf_text_afri… ## <fct> <dbl> <fct> <dbl> <dbl> ## 1 Nocturnal 70 Omni… 0.0433 0.0252 ## 2 <NA> NA Unkn… 0 0.0310 ## 3 Diurnal 4.5 Carn… 0.0249 0 ## 4 <NA> NA Unkn… 0.0329 0 ## 5 Diurnal 4500 Herb… 0.0127 0.0344 ## 6 Nocturnal 2.95 Omni… 0.00595 0.0162 ## 7 Diurnal 1950 Herb… 0.0163 0.0221 ## 8 Crepuscu… 2.95 Omni… 0.0114 0.0231 ## 9 Diurnal 3.5 Carn… 0 0.0175 ## 10 Crepuscu… 26.5 Carn… 0.00401 0.0218 ## # … with 449 more rows, and 98 more variables: tfidf_text_african <dbl>, ## # tfidf_text_along <dbl>, tfidf_text_also <dbl>, ## # tfidf_text_although <dbl>, tfidf_text_animal <dbl>, ## # tfidf_text_animals <dbl>, tfidf_text_appearance <dbl>, ## # tfidf_text_areas <dbl>, tfidf_text_around <dbl>, ## # tfidf_text_bear <dbl>, tfidf_text_birds <dbl>, ## # tfidf_text_birth <dbl>, tfidf_text_black <dbl>, ## # tfidf_text_body <dbl>, tfidf_text_breed <dbl>, tfidf_text_can <dbl>, ## # tfidf_text_common <dbl>, tfidf_text_despite <dbl>, ## # tfidf_text_diet <dbl>, tfidf_text_different <dbl>, ## # tfidf_text_dog <dbl>, tfidf_text_due <dbl>, tfidf_text_eat <dbl>, ## # tfidf_text_eggs <dbl>, tfidf_text_elephant <dbl>, ## # tfidf_text_even <dbl>, tfidf_text_fact <dbl>, tfidf_text_feet <dbl>, ## # tfidf_text_female <dbl>, tfidf_text_females <dbl>, ## # tfidf_text_fish <dbl>, tfidf_text_food <dbl>, tfidf_text_found <dbl>, ## # tfidf_text_fur <dbl>, tfidf_text_generally <dbl>, ## # tfidf_text_habitat <dbl>, tfidf_text_however <dbl>, ## # tfidf_text_human <dbl>, tfidf_text_humans <dbl>, ## # tfidf_text_hunt <dbl>, tfidf_text_hunting <dbl>, ## # tfidf_text_including <dbl>, tfidf_text_insects <dbl>, ## # tfidf_text_just <dbl>, tfidf_text_known <dbl>, ## # tfidf_text_large <dbl>, tfidf_text_larger <dbl>, ## # tfidf_text_life <dbl>, tfidf_text_like <dbl>, tfidf_text_live <dbl>, ## # tfidf_text_long <dbl>, tfidf_text_make <dbl>, tfidf_text_male <dbl>, ## # tfidf_text_males <dbl>, tfidf_text_many <dbl>, tfidf_text_may <dbl>, ## # tfidf_text_means <dbl>, tfidf_text_monkey <dbl>, ## # tfidf_text_months <dbl>, tfidf_text_much <dbl>, ## # tfidf_text_name <dbl>, tfidf_text_natural <dbl>, ## # tfidf_text_number <dbl>, tfidf_text_often <dbl>, ## # tfidf_text_old <dbl>, tfidf_text_one <dbl>, tfidf_text_penguin <dbl>, ## # tfidf_text_people <dbl>, tfidf_text_population <dbl>, ## # tfidf_text_populations <dbl>, tfidf_text_predators <dbl>, ## # tfidf_text_prey <dbl>, tfidf_text_range <dbl>, ## # tfidf_text_rhino <dbl>, tfidf_text_sea <dbl>, tfidf_text_size <dbl>, ## # tfidf_text_small <dbl>, tfidf_text_smaller <dbl>, ## # tfidf_text_south <dbl>, tfidf_text_species <dbl>, ## # tfidf_text_tend <dbl>, tfidf_text_thought <dbl>, ## # tfidf_text_three <dbl>, tfidf_text_throughout <dbl>, ## # tfidf_text_tiger <dbl>, tfidf_text_time <dbl>, ## # tfidf_text_today <dbl>, tfidf_text_trees <dbl>, tfidf_text_two <dbl>, ## # tfidf_text_usually <dbl>, tfidf_text_water <dbl>, ## # tfidf_text_well <dbl>, tfidf_text_white <dbl>, tfidf_text_wild <dbl>, ## # tfidf_text_world <dbl>, tfidf_text_year <dbl>, ## # tfidf_text_years <dbl>, tfidf_text_young <dbl> ``` --- ### Your turn #8 Insert `step_ngram()` into recipe after tokenization. 
Play around with `num_tokens = ` and `min_num_tokens = ` ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tfidf(text) rec_spec %>% prep() %>% bake(new_data = NULL) ``` ``` ## # A tibble: 459 x 103 ## lifestyle mean_weight diet tfidf_text_able tfidf_text_afri… ## <fct> <dbl> <fct> <dbl> <dbl> ## 1 Nocturnal 70 Omni… 0.0433 0.0252 ## 2 <NA> NA Unkn… 0 0.0310 ## 3 Diurnal 4.5 Carn… 0.0249 0 ## 4 <NA> NA Unkn… 0.0329 0 ## 5 Diurnal 4500 Herb… 0.0127 0.0344 ## 6 Nocturnal 2.95 Omni… 0.00595 0.0162 ## 7 Diurnal 1950 Herb… 0.0163 0.0221 ## 8 Crepuscu… 2.95 Omni… 0.0114 0.0231 ## 9 Diurnal 3.5 Carn… 0 0.0175 ## 10 Crepuscu… 26.5 Carn… 0.00401 0.0218 ## # … with 449 more rows, and 98 more variables: tfidf_text_african <dbl>, ## # tfidf_text_along <dbl>, tfidf_text_also <dbl>, ## # tfidf_text_although <dbl>, tfidf_text_animal <dbl>, ## # tfidf_text_animals <dbl>, tfidf_text_appearance <dbl>, ## # tfidf_text_areas <dbl>, tfidf_text_around <dbl>, ## # tfidf_text_bear <dbl>, tfidf_text_birds <dbl>, ## # tfidf_text_birth <dbl>, tfidf_text_black <dbl>, ## # tfidf_text_body <dbl>, tfidf_text_breed <dbl>, tfidf_text_can <dbl>, ## # tfidf_text_common <dbl>, tfidf_text_despite <dbl>, ## # tfidf_text_diet <dbl>, tfidf_text_different <dbl>, ## # tfidf_text_dog <dbl>, tfidf_text_due <dbl>, tfidf_text_eat <dbl>, ## # tfidf_text_eggs <dbl>, tfidf_text_elephant <dbl>, ## # tfidf_text_even <dbl>, tfidf_text_fact <dbl>, tfidf_text_feet <dbl>, ## # tfidf_text_female <dbl>, tfidf_text_females <dbl>, ## # tfidf_text_fish <dbl>, tfidf_text_food <dbl>, tfidf_text_found <dbl>, ## # tfidf_text_fur <dbl>, tfidf_text_generally <dbl>, ## # tfidf_text_habitat <dbl>, tfidf_text_however <dbl>, ## # tfidf_text_human <dbl>, tfidf_text_humans <dbl>, ## # tfidf_text_hunt <dbl>, tfidf_text_hunting <dbl>, ## # tfidf_text_including <dbl>, tfidf_text_insects <dbl>, ## # tfidf_text_just <dbl>, tfidf_text_known <dbl>, ## # tfidf_text_large <dbl>, tfidf_text_larger <dbl>, ## # tfidf_text_life <dbl>, tfidf_text_like <dbl>, tfidf_text_live <dbl>, ## # tfidf_text_long <dbl>, tfidf_text_make <dbl>, tfidf_text_male <dbl>, ## # tfidf_text_males <dbl>, tfidf_text_many <dbl>, tfidf_text_may <dbl>, ## # tfidf_text_means <dbl>, tfidf_text_monkey <dbl>, ## # tfidf_text_months <dbl>, tfidf_text_much <dbl>, ## # tfidf_text_name <dbl>, tfidf_text_natural <dbl>, ## # tfidf_text_number <dbl>, tfidf_text_often <dbl>, ## # tfidf_text_old <dbl>, tfidf_text_one <dbl>, tfidf_text_penguin <dbl>, ## # tfidf_text_people <dbl>, tfidf_text_population <dbl>, ## # tfidf_text_populations <dbl>, tfidf_text_predators <dbl>, ## # tfidf_text_prey <dbl>, tfidf_text_range <dbl>, ## # tfidf_text_rhino <dbl>, tfidf_text_sea <dbl>, tfidf_text_size <dbl>, ## # tfidf_text_small <dbl>, tfidf_text_smaller <dbl>, ## # tfidf_text_south <dbl>, tfidf_text_species <dbl>, ## # tfidf_text_tend <dbl>, tfidf_text_thought <dbl>, ## # tfidf_text_three <dbl>, tfidf_text_throughout <dbl>, ## # tfidf_text_tiger <dbl>, tfidf_text_time <dbl>, ## # tfidf_text_today <dbl>, tfidf_text_trees <dbl>, tfidf_text_two <dbl>, ## # tfidf_text_usually <dbl>, tfidf_text_water <dbl>, ## # tfidf_text_well <dbl>, tfidf_text_white <dbl>, tfidf_text_wild <dbl>, ## # tfidf_text_world <dbl>, tfidf_text_year <dbl>, ## # tfidf_text_years <dbl>, tfidf_text_young <dbl> ```
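---

### Aside: what is an n-gram?

An n-gram is a sequence of n consecutive tokens: the bigrams of "the guppy eats algae" are "the guppy", "guppy eats", and "eats algae". With `min_num_tokens = 1` and `num_tokens = 2`, `step_ngram()` keeps both the single words and the bigrams. A quick way to see this outside of a recipe is the tokenizers package (a sketch for illustration; this is not necessarily what `step_ngram()` calls internally):

```r
library(tokenizers)

# n_min and n play the role of min_num_tokens and num_tokens in step_ngram()
tokenize_ngrams("the guppy eats algae", n = 2, n_min = 1)
```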
--- ### Your turn #8 - result Insert `step_ngram()` into recipe after tokenization. Play around with `num_tokens = ` and `min_num_tokens = ` ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_ngram(text, min_num_tokens = 1, num_tokens = 2) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tfidf(text) rec_spec %>% prep() %>% bake(new_data = NULL) %>% names() ``` --- ### Your turn #8 - result ``` ## [1] "lifestyle" "mean_weight" ## [3] "diet" "tfidf_text_able" ## [5] "tfidf_text_able_to" "tfidf_text_african" ## [7] "tfidf_text_along" "tfidf_text_also" ## [9] "tfidf_text_although" "tfidf_text_and_the" ## [11] "tfidf_text_animal" "tfidf_text_animals" ## [13] "tfidf_text_are_also" "tfidf_text_areas" ## [15] "tfidf_text_around" "tfidf_text_as_a" ## [17] "tfidf_text_as_the" "tfidf_text_birds" ## [19] "tfidf_text_black" "tfidf_text_body" ## [21] "tfidf_text_can" "tfidf_text_common" ## [23] "tfidf_text_diet" "tfidf_text_different" ## [25] "tfidf_text_due" "tfidf_text_due_to" ## [27] "tfidf_text_eat" "tfidf_text_eggs" ## [29] "tfidf_text_even" "tfidf_text_female" ## [31] "tfidf_text_females" "tfidf_text_fish" ## [33] "tfidf_text_food" "tfidf_text_for_the" ## [35] "tfidf_text_found" "tfidf_text_found_in" ## [37] "tfidf_text_from_the" "tfidf_text_habitat" ## [39] "tfidf_text_have_been" "tfidf_text_however" ## [41] "tfidf_text_humans" "tfidf_text_in_a" ## [43] "tfidf_text_in_the" "tfidf_text_including" ## [45] "tfidf_text_is_a" "tfidf_text_is_the" ## [47] "tfidf_text_it_is" "tfidf_text_known" ## [49] "tfidf_text_known_to" "tfidf_text_large" ## [51] "tfidf_text_like" "tfidf_text_live" ## [53] "tfidf_text_long" "tfidf_text_male" ## [55] "tfidf_text_males" "tfidf_text_many" ## [57] "tfidf_text_may" "tfidf_text_months" ## [59] "tfidf_text_much" "tfidf_text_name" ## [61] "tfidf_text_natural" "tfidf_text_of_a" ## [63] "tfidf_text_of_the" "tfidf_text_of_their" ## [65] "tfidf_text_often" "tfidf_text_old" ## [67] "tfidf_text_on_the" "tfidf_text_one" ## [69] "tfidf_text_one_of" "tfidf_text_penguin" ## [71] "tfidf_text_population" "tfidf_text_predators" ## [73] "tfidf_text_prey" "tfidf_text_range" ## [75] "tfidf_text_sea" "tfidf_text_size" ## [77] "tfidf_text_small" "tfidf_text_species" ## [79] "tfidf_text_species_of" "tfidf_text_such_as" ## [81] "tfidf_text_that_are" "tfidf_text_that_they" ## [83] "tfidf_text_the_water" "tfidf_text_the_wild" ## [85] "tfidf_text_the_world" "tfidf_text_there_are" ## [87] "tfidf_text_they_are" "tfidf_text_thought" ## [89] "tfidf_text_throughout" "tfidf_text_time" ## [91] "tfidf_text_to_be" "tfidf_text_to_the" ## [93] "tfidf_text_today" "tfidf_text_two" ## [95] "tfidf_text_up_to" "tfidf_text_water" ## [97] "tfidf_text_well" "tfidf_text_white" ## [99] "tfidf_text_wild" "tfidf_text_with_the" ## [101] "tfidf_text_world" "tfidf_text_years" ## [103] "tfidf_text_young" ``` --- ### Your turn #8 - result Order matters! 
```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_ngram(text, min_num_tokens = 1, num_tokens = 2) %>% step_tokenfilter(text, max_tokens = 100) %>% step_tfidf(text) rec_spec %>% prep() %>% bake(new_data = NULL) %>% names() ``` --- ### Your turn #8 - result ``` ## [1] "lifestyle" "mean_weight" ## [3] "diet" "tfidf_text_able" ## [5] "tfidf_text_africa" "tfidf_text_african" ## [7] "tfidf_text_along" "tfidf_text_also" ## [9] "tfidf_text_although" "tfidf_text_animal" ## [11] "tfidf_text_animals" "tfidf_text_appearance" ## [13] "tfidf_text_areas" "tfidf_text_around" ## [15] "tfidf_text_bear" "tfidf_text_birds" ## [17] "tfidf_text_birth" "tfidf_text_black" ## [19] "tfidf_text_body" "tfidf_text_breed" ## [21] "tfidf_text_can" "tfidf_text_common" ## [23] "tfidf_text_despite" "tfidf_text_diet" ## [25] "tfidf_text_different" "tfidf_text_dog" ## [27] "tfidf_text_due" "tfidf_text_eat" ## [29] "tfidf_text_eggs" "tfidf_text_elephant" ## [31] "tfidf_text_even" "tfidf_text_fact" ## [33] "tfidf_text_feet" "tfidf_text_female" ## [35] "tfidf_text_females" "tfidf_text_fish" ## [37] "tfidf_text_food" "tfidf_text_found" ## [39] "tfidf_text_fur" "tfidf_text_generally" ## [41] "tfidf_text_habitat" "tfidf_text_however" ## [43] "tfidf_text_human" "tfidf_text_humans" ## [45] "tfidf_text_hunt" "tfidf_text_hunting" ## [47] "tfidf_text_including" "tfidf_text_insects" ## [49] "tfidf_text_just" "tfidf_text_known" ## [51] "tfidf_text_large" "tfidf_text_larger" ## [53] "tfidf_text_life" "tfidf_text_like" ## [55] "tfidf_text_live" "tfidf_text_long" ## [57] "tfidf_text_make" "tfidf_text_male" ## [59] "tfidf_text_males" "tfidf_text_many" ## [61] "tfidf_text_may" "tfidf_text_means" ## [63] "tfidf_text_monkey" "tfidf_text_months" ## [65] "tfidf_text_much" "tfidf_text_name" ## [67] "tfidf_text_natural" "tfidf_text_number" ## [69] "tfidf_text_often" "tfidf_text_old" ## [71] "tfidf_text_one" "tfidf_text_penguin" ## [73] "tfidf_text_people" "tfidf_text_population" ## [75] "tfidf_text_populations" "tfidf_text_predators" ## [77] "tfidf_text_prey" "tfidf_text_range" ## [79] "tfidf_text_rhino" "tfidf_text_sea" ## [81] "tfidf_text_size" "tfidf_text_small" ## [83] "tfidf_text_smaller" "tfidf_text_south" ## [85] "tfidf_text_species" "tfidf_text_tend" ## [87] "tfidf_text_thought" "tfidf_text_three" ## [89] "tfidf_text_throughout" "tfidf_text_tiger" ## [91] "tfidf_text_time" "tfidf_text_today" ## [93] "tfidf_text_trees" "tfidf_text_two" ## [95] "tfidf_text_usually" "tfidf_text_water" ## [97] "tfidf_text_well" "tfidf_text_white" ## [99] "tfidf_text_wild" "tfidf_text_world" ## [101] "tfidf_text_year" "tfidf_text_years" ## [103] "tfidf_text_young" ``` --- # Final recipe ```r rec_spec <- recipe(diet ~ ., data = animals_training) %>% step_novel(lifestyle) %>% step_unknown(lifestyle) %>% step_other(lifestyle, threshold = 0.01) %>% step_dummy(lifestyle) %>% step_log(mean_weight) %>% step_meanimpute(mean_weight) %>% step_tokenize(text) %>% step_tokenfilter(text, max_tokens = tune()) %>% step_tfidf(text) ``` Also, what does `tune()` mean here? 🤔 --- class: inverse, right, middle ## What kind of **models** work well for text? --- # Text models Remember that text data is sparse! 😮 -- - Regularized linear models (glmnet) - Support vector machines - naive Bayes - Tree-based models like random forest? --- # Text models Remember that text data is sparse! 😮 - Regularized linear models (glmnet) - Support vector machines - naive Bayes - Tree-based models like random forest? 
🙅 --- class: inverse, right, middle # Does text data have to be **sparse**? --- >### You shall know a word by the company it keeps. #### [💬 John Rupert Firth](https://en.wikiquote.org/wiki/John_Rupert_Firth) -- Learn more about word embeddings: - in [Chapter 5](https://smltar.com/embeddings.html) - at [juliasilge.github.io/why-r-webinar/](https://juliasilge.github.io/why-r-webinar/) --- # To specify a model in tidymodels 1\. Pick a **model** 2\. Set the **mode** (if needed) 3\. Set the **engine** --- background-image: url(https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/parsnip.png) background-size: cover .footnote[ Art by [Allison Horst](https://github.com/allisonhorst/stats-illustrations) ] --- # To specify a model in tidymodels All available models are listed at <https://tidymodels.org/find/parsnip> <iframe src="https://tidymodels.org/find/parsnip" width="100%" height="400px"></iframe> --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r svm_rbf() %>% set_mode(mode = "regression") ``` ``` ## Radial Basis Function Support Vector Machine Specification (regression) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r svm_rbf() %>% set_mode(mode = "classification") ``` ``` ## Radial Basis Function Support Vector Machine Specification (classification) ``` --- class: middle # `set_engine()` The same model can be implemented by multiple computational engines ```r svm_rbf() %>% set_engine("kernlab") ``` ``` ## Radial Basis Function Support Vector Machine Specification (unknown) ## ## Computational engine: kernlab ``` --- class: middle # `set_engine()` The same model can be implemented by multiple computational engines ```r svm_rbf() %>% set_engine("liquidSVM") ``` ``` ## Radial Basis Function Support Vector Machine Specification (unknown) ## ## Computational engine: liquidSVM ``` --- # What makes a model? ```r lasso_spec <- multinom_reg(penalty = tune(), mixture = 1) %>% set_mode("classification") %>% set_engine("glmnet") lasso_spec ``` ``` ## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = tune() ## mixture = 1 ## ## Computational engine: glmnet ``` -- It's `tune()` again! 😟 --- ## Parameters and... hyperparameters? - Some model parameters can be learned from data during fitting/training -- - Some CANNOT 😱 -- - These are **hyperparameters** of a model, and we estimate them by training lots of models with different hyperparameters and comparing them --- # A grid of possible hyperparameters .pull-left[ ```r param_grid <- grid_regular( penalty(range = c(-4, 0)), max_tokens(range = c(100, 500)), levels = c(penalty = 50, max_tokens = 4) ) param_grid ``` ] .pull-right[ ``` ## # A tibble: 200 x 2 ## penalty max_tokens ## <dbl> <int> ## 1 0.0001 100 ## 2 0.000121 100 ## 3 0.000146 100 ## 4 0.000176 100 ## 5 0.000212 100 ## 6 0.000256 100 ## 7 0.000309 100 ## 8 0.000373 100 ## 9 0.000450 100 ## 10 0.000543 100 ## # … with 190 more rows ``` ] --- class: inverse, right, middle # How can we **compare** and **evaluate** these different models? 
--- background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg) background-size: 60% --- # Spend your data budget ```r set.seed(123) animals_folds <- vfold_cv(animals_training, v = 5, strata = diet) animals_folds ``` ``` ## # 5-fold cross-validation using stratification ## # A tibble: 5 x 2 ## splits id ## <list> <chr> ## 1 <split [365/94]> Fold1 ## 2 <split [367/92]> Fold2 ## 3 <split [367/92]> Fold3 ## 4 <split [368/91]> Fold4 ## 5 <split [369/90]> Fold5 ``` --- class: middle, center, inverse # ✨ CROSS-VALIDATION ✨ --- background-image: url(images/cross-validation/Slide2.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide3.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide4.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide5.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide6.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide7.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide8.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide9.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide10.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide11.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- class: inverse, right, middle # Spend your data wisely to create **simulated** validation sets --- class: inverse, right, middle # Now we have **resamples**, **features**, plus a **model** --- .pull-left[ ## Create a workflow ```r wf_spec <- workflow() %>% add_recipe(rec_spec) %>% add_model(lasso_spec) wf_spec ``` ] .pull-right[ ``` ## ══ Workflow ══════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: multinom_reg() ## ## ── Preprocessor ────────────────────────────────────────────────────────── ## 9 Recipe Steps ## ## ● step_novel() ## ● step_unknown() ## ● step_other() ## ● step_dummy() ## ● step_log() ## ● step_meanimpute() ## ● step_tokenize() ## ● step_tokenfilter() ## ● step_tfidf() ## ## ── Model ───────────────────────────────────────────────────────────────── ## Multinomial Regression Model Specification (classification) ## ## Main Arguments: ## penalty = tune() ## mixture = 1 ## ## Computational engine: glmnet ``` ] --- class: inverse, right, middle # What is a `workflow()`? --- ## Time to tune! ⚡ ```r set.seed(42) lasso_rs <- tune_grid( wf_spec, resamples = animals_folds, grid = param_grid, control = control_grid(save_pred = TRUE, verbose = TRUE) ) ``` --- ## Time to tune! 
⚡ ``` ## # Tuning results ## # 5-fold cross-validation using stratification ## # A tibble: 5 x 5 ## splits id .metrics .notes .predictions ## <list> <chr> <list> <list> <list> ## 1 <split [365/9… Fold1 <tibble [400 × … <tibble [0 × … <tibble [18,800 × … ## 2 <split [367/9… Fold2 <tibble [400 × … <tibble [0 × … <tibble [18,400 × … ## 3 <split [367/9… Fold3 <tibble [400 × … <tibble [0 × … <tibble [18,400 × … ## 4 <split [368/9… Fold4 <tibble [400 × … <tibble [0 × … <tibble [18,200 × … ## 5 <split [369/9… Fold5 <tibble [400 × … <tibble [0 × … <tibble [18,000 × … ``` --- # Look at the tuning results 👀 ```r collect_metrics(lasso_rs) ``` ``` ## # A tibble: 400 x 8 ## penalty max_tokens .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.0001 100 accuracy multiclass 0.565 5 0.0157 Recipe1_Mo… ## 2 0.0001 100 roc_auc hand_till 0.806 5 0.00639 Recipe1_Mo… ## 3 0.000121 100 accuracy multiclass 0.565 5 0.0157 Recipe1_Mo… ## 4 0.000121 100 roc_auc hand_till 0.807 5 0.00638 Recipe1_Mo… ## 5 0.000146 100 accuracy multiclass 0.565 5 0.0157 Recipe1_Mo… ## 6 0.000146 100 roc_auc hand_till 0.808 5 0.00633 Recipe1_Mo… ## 7 0.000176 100 accuracy multiclass 0.565 5 0.0157 Recipe1_Mo… ## 8 0.000176 100 roc_auc hand_till 0.809 5 0.00638 Recipe1_Mo… ## 9 0.000212 100 accuracy multiclass 0.569 5 0.0174 Recipe1_Mo… ## 10 0.000212 100 roc_auc hand_till 0.810 5 0.00669 Recipe1_Mo… ## # … with 390 more rows ``` --- # Look at the tuning results 👀 ```r autoplot(lasso_rs) ``` <img src="index_files/figure-html/unnamed-chunk-64-1.png" width="700px" style="display: block; margin: auto;" /> --- # Look at the tuning results 👀 ```r lasso_rs %>% show_best("roc_auc") ``` ``` ## # A tibble: 5 x 8 ## penalty max_tokens .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.0339 500 roc_auc hand_till 0.926 5 0.00728 Recipe4_Model… ## 2 0.0409 500 roc_auc hand_till 0.925 5 0.00718 Recipe4_Model… ## 3 0.0281 500 roc_auc hand_till 0.925 5 0.00828 Recipe4_Model… ## 4 0.0233 500 roc_auc hand_till 0.924 5 0.00870 Recipe4_Model… ## 5 0.0193 500 roc_auc hand_till 0.924 5 0.00928 Recipe4_Model… ``` --- # Your turn #9 Run your own model and see what results you can find! One possible variation is sketched on the next slide.
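---

# Your turn #9 - one possibility

One possible variation (a sketch, not the "right" answer; the object names `enet_spec` and `enet_rs` are just for this example): keep the same recipe, folds, and grid, but swap the pure lasso (`mixture = 1`) for an elastic net.

```r
# An elastic net mixes the lasso and ridge penalties
enet_spec <- multinom_reg(penalty = tune(), mixture = 0.5) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

enet_rs <- tune_grid(
  wf_spec %>% update_model(enet_spec),  # reuse the workflow, swap the model
  resamples = animals_folds,
  grid = param_grid
)

show_best(enet_rs, "roc_auc")
```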
--- # The **best** 🥇 hyperparameters ```r best_roc_auc <- select_best(lasso_rs, "roc_auc") best_roc_auc ``` ``` ## # A tibble: 1 x 3 ## penalty max_tokens .config ## <dbl> <int> <chr> ## 1 0.0339 500 Recipe4_Model032 ``` --- # Evaluate the best model 📐 ```r collect_predictions(lasso_rs, parameters = best_roc_auc) ``` ``` ## # A tibble: 459 x 11 ## id .pred_Carnivore .pred_Herbivore .pred_Omnivore .pred_Unknown ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Fold1 0.124 0.0715 0.121 0.684 ## 2 Fold1 0.821 0.0533 0.113 0.0127 ## 3 Fold1 0.351 0.330 0.278 0.0402 ## 4 Fold1 0.0496 0.0430 0.0915 0.816 ## 5 Fold1 0.205 0.0415 0.0768 0.677 ## 6 Fold1 0.468 0.179 0.305 0.0492 ## 7 Fold1 0.352 0.196 0.420 0.0325 ## 8 Fold1 0.149 0.0840 0.181 0.586 ## 9 Fold1 0.829 0.108 0.0556 0.00688 ## 10 Fold1 0.139 0.636 0.206 0.0192 ## # … with 449 more rows, and 6 more variables: .row <int>, ## # max_tokens <int>, penalty <dbl>, .pred_class <fct>, diet <fct>, ## # .config <chr> ``` --- ## Evaluate the best model 📏 ```r collect_predictions(lasso_rs, parameters = best_roc_auc) %>% roc_curve(truth = diet, .pred_Carnivore:.pred_Unknown) %>% autoplot() ``` --- ## Evaluate the best model 📏 <img src="index_files/figure-html/unnamed-chunk-70-1.png" width="700px" style="display: block; margin: auto;" /> --- ## Evaluate the best model 📏 ```r collect_predictions(lasso_rs, parameters = best_roc_auc) %>% group_by(id) %>% roc_curve(truth = diet, .pred_Carnivore:.pred_Unknown) %>% autoplot() ``` --- ## Evaluate the best model 📏 <img src="index_files/figure-html/unnamed-chunk-72-1.png" width="700px" style="display: block; margin: auto;" /> --- # Update the workflow We can update our workflow with the best performing hyperparameters. ```r wf_spec_final <- finalize_workflow(wf_spec, best_roc_auc) ``` This workflow is ready to go! It can now be applied to new data. --- class: inverse, right, middle # How is our model **thinking**? --- ## Variable importance ```r library(vip) wf_spec_final %>% fit(animals_training) %>% pull_workflow_fit() %>% vi() %>% filter(!str_detect(Variable, "tfidf")) %>% filter(Importance != 0) ``` ``` ## # A tibble: 4 x 3 ## Variable Importance Sign ## <chr> <dbl> <chr> ## 1 lifestyle_Pack 1.04 POS ## 2 lifestyle_Crepuscular 0.649 POS ## 3 lifestyle_Solitary 0.0422 POS ## 4 lifestyle_unknown -0.578 NEG ``` --- ## Variable importance ```r vi_data <- wf_spec_final %>% fit(animals_training) %>% pull_workflow_fit() %>% vi() %>% mutate(Variable = str_remove_all(Variable, "tfidf_text_")) %>% filter(Importance != 0) ``` --- ## Variable importance ```r vi_data ``` ``` ## # A tibble: 69 x 3 ## Variable Importance Sign ## <chr> <dbl> <chr> ## 1 carnivores 552. POS ## 2 carnivorous 461. POS ## 3 prey 210. POS ## 4 hunt 176. POS ## 5 numbers 170. POS ## 6 population 115. POS ## 7 ocean 110. POS ## 8 near 91.6 POS ## 9 feet 87.8 POS ## 10 hatch 86.1 POS ## # … with 59 more rows ``` --- # Final fit We will now use `last_fit()` to **fit** our model one last time on our training data and **evaluate** it on our testing data. 
```r final_fit <- last_fit( wf_spec_final, animals_split ) ``` --- class: inverse, right, middle # Notice that this is the **first** and **only** time we have used our **testing data** --- # Evaluate on the **test** data 📐 ```r final_fit %>% collect_metrics() ``` ``` ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy multiclass 0.768 ## 2 roc_auc hand_till 0.912 ``` --- ```r final_fit %>% collect_predictions() %>% conf_mat(truth = diet, .pred_class) %>% autoplot(type = "heatmap") ``` <img src="index_files/figure-html/unnamed-chunk-79-1.png" width="700px" style="display: block; margin: auto;" /> --- class: center, middle # Thanks! ##[smltar.com](https://smltar.com/) <img style="border-radius: 50%;" src="https://github.com/EmilHvitfeldt.png" width="150px"/> ###
[EmilHvitfeldt](https://github.com/EmilHvitfeldt/) ###
[@Emil_Hvitfeldt](https://twitter.com/Emil_Hvitfeldt) ###
[emilhvitfeldt](https://linkedin.com/in/emilhvitfeldt/) ###
[www.hvitfeldt.me](https://www.hvitfeldt.me)