Optimizing Intent Classification
Pretrained Embeddings: Intent Classifier Sklearn
What is it?
Uses spaCy
Uses word embeddings (vector representations of words)
Similar words are converted to similar numeric vectors
Trains a linear SVM, optimized with grid search (hyperparameter tuning to find the best model settings) - see the sketch after this list
Research paper: word2vec
Depends on language (choose spaCy's "en" models for English)
Can customize word embeddings (look into Facebook's fastText)
Then link the custom model to spaCy (see the spaCy guide)
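A minimal sketch of what this pipeline does under the hood, assuming the en_core_web_md spaCy model is installed; the messages, intents, and C grid are illustrative, not Rasa's actual defaults:

```python
import numpy as np
import spacy
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Load a spaCy model that ships with pretrained word vectors.
nlp = spacy.load("en_core_web_md")

# Toy training data, made up for illustration.
messages = ["hi there", "hello", "bye", "see you later"]
intents = ["greet", "greet", "goodbye", "goodbye"]

# Each message is featurized as the average of its word vectors.
X = np.array([nlp(text).vector for text in messages])

# Linear SVM tuned with grid search over the regularization
# strength C (this C grid is an example, not Rasa's default).
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [1, 2, 5, 10]},
                      cv=2)
search.fit(X, intents)

print(search.predict(np.array([nlp("hey").vector])))  # most likely ['greet']
```

Because the embeddings are pretrained, even these few examples give the SVM something to generalize from.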
Pros and Cons
Not much data needed
Pretrained word embeddings
Pretrained embeddings are not available for every language
Do not cover domain-specific words (product names, acronyms)
Supervised Embeddings: Intent Classifier TensorFlow Embedding
What is it?
Pipeline name in Rasa: EmbeddingIntentClassifier
Trains word embeddings from scratch
Typically used with CountVectorsFeaturizer
Counts how often each distinct word from the training data appears in a message
Then feeds this count vector to the intent classifier
A second count vector is created for the intent label (this is what enables multiple-intent support)
Configuring CountVectorsFeaturizer (recommended featurizer for this classifier):
Change the analyzer property to "char" (uses character n-gram counts instead of word token counts)
More robust against typos
Note: bag-of-words uses unigrams; n-grams let you detect multiple words together (with bigrams you can detect "United States", not just "United") - see the sketch below
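A quick sketch of both analyzer settings using scikit-learn's CountVectorizer, which the CountVectorsFeaturizer builds on; the messages are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = ["book a flight", "boook a flight"]  # second message has a typo

# Word counts (with bigrams): the typo "boook" shares almost no
# features with "book", and bigrams like 'book a' let the model see
# word pairs (e.g. "united states" rather than just "united").
word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 2))
print(word_vec.fit(messages).get_feature_names_out())

# Character n-gram counts: "book" and "boook" now share many
# features ('bo', 'oo', 'ok', 'boo', 'ook'), so typos hurt less.
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))
print(char_vec.fit(messages).get_feature_names_out())
```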
Pros and Cons
Supports multiple intents per message
Needs more training data (it learns embeddings from scratch)
Language independent
Common Problems
Lack of Training Data
Try Chatito (generates training examples from a simple template DSL)
Try Rasa Core interactive learning
Out-of-Vocabulary Words
Typos, words you did not think of
If using EmbeddingIntentClassifier:
More training data
Add examples that include OOV_token (out-of-vocabulary token)
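One way to make OOV_token useful is to replace rare words in the training data with it, so the classifier actually sees the token during training (the CountVectorsFeaturizer also exposes an OOV_token option). The helper below is a hypothetical preprocessing sketch, not a Rasa API:

```python
from collections import Counter

# Hypothetical helper, for illustration only.
def replace_rare_words(messages, min_count=2, oov="OOV_token"):
    """Replace words seen fewer than min_count times with the OOV
    token, so the model learns a behaviour for unknown words."""
    counts = Counter(word for msg in messages for word in msg.split())
    return [" ".join(w if counts[w] >= min_count else oov for w in msg.split())
            for msg in messages]

train = ["where is my order", "where is my parcel", "track my order"]
print(replace_rare_words(train))
# ['where is my order', 'where is my OOV_token', 'OOV_token my order']
```

At prediction time, unseen words map to the same token, so the model already knows what to do with them.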
Similar Intents
E.g. you have 2 intents: provide_name and provide_date
provide_name: "It is Laura"
provide_date: "It is on Monday"
These 2 are similar from an NLU perspective - combine them into one intent called inform and extract the values as entities (see the example below)
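Assuming the actual values are extracted as entities, the merged intent could look like this in Rasa's markdown training-data format (the entity names name and date are illustrative):

```md
## intent:inform
- It is [Laura](name)
- It is on [Monday](date)
```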
Skewed Data
Maintain a roughly equal number of examples per intent (a quick check is sketched below)
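A quick way to spot skew, assuming the training examples are available as (text, intent) pairs (made-up data):

```python
from collections import Counter

# Made-up (text, intent) pairs.
examples = [("hi", "greet"), ("hello", "greet"),
            ("hey there", "greet"), ("bye", "goodbye")]

print(Counter(intent for _, intent in examples))
# Counter({'greet': 3, 'goodbye': 1}) -> heavily skewed towards 'greet'
```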