Key Word Finder
High-Level Approach
Text pre-processing
noise removal
normalisation
Data Exploration
Word cloud to understand frequently used words
Top 20 single words, bi-grams, tri-grams
Convert text to vector of word counts
Convert text to vector of term frequencies
Sort terms in descending order of term frequency to identify the top N keywords (see the sketch below)
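A minimal exploration sketch in Python, assuming the raw text is already loaded into a string; the sample text and the optional wordcloud usage are illustrative:

```python
from collections import Counter
import re

# Hypothetical sample corpus; load your own documents here in practice
text = "Machine learning helps machines learn. Deep learning is a branch of machine learning."

# Noise removal and normalisation: lowercase and keep only alphabetic tokens
tokens = re.findall(r"[a-z]+", text.lower())

def top_ngrams(tokens, n, k=20):
    """Return the k most frequent n-grams as (ngram, count) pairs."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

print(top_ngrams(tokens, 1))  # top single words
print(top_ngrams(tokens, 2))  # top bi-grams
print(top_ngrams(tokens, 3))  # top tri-grams

# Optional: the wordcloud package can render a word cloud from the same tokens
# from wordcloud import WordCloud
# WordCloud().generate(" ".join(tokens)).to_file("cloud.png")
```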
Normalization
Handling multiple occurrences of the same word, e.g. learning, learn, learned, learner will all just become "learn" (see the sketch after this list)
Stemming
Removes suffixes
Lemmatization
Works based on the root (lemma) of the word (more advanced)
Removing stopwords
Stop words include a large number of prepositions, pronouns, conjunctions, etc. in sentences; these must be removed
Default list available in the Python nltk library
May want to add more
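A minimal normalisation sketch with nltk; the example words and the extra stop words are illustrative, and the nltk corpora (stopwords, wordnet) must be downloaded first:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads (uncomment on first run)
# nltk.download("stopwords")
# nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Default English stop word list, extended with corpus-specific words
stop_words = set(stopwords.words("english"))
stop_words.update({"course", "lecture"})  # hypothetical additions

words = ["learning", "learn", "learned", "learner", "the", "of"]
words = [w for w in words if w not in stop_words]  # removes "the", "of"

# Stemming strips suffixes, e.g. learning/learned -> learn
print([stemmer.stem(w) for w in words])

# Lemmatization maps each word to its dictionary root (a POS hint helps with verbs)
print([lemmatizer.lemmatize(w, pos="v") for w in words])
```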
Convert text to an ML-friendly format
Has two parts:
Tokenisation
Converting continuous text into a list of words
Vectorisation
list of words converted to matrix of integers
We use the bag-of-words model (sketched after this list)
max_df: ignores terms with a document frequency higher than the given threshold (corpus-specific stop words); ensures overly common words are excluded
max_features: determines the number of columns in the matrix
ngram_range: we want to look at single words, two words (bi-grams), and three words (tri-grams)
Returns an encoded vector with the length of the entire vocabulary
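A minimal bag-of-words sketch using scikit-learn's CountVectorizer; the corpus and parameter values below are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; each string is one document
docs = [
    "machine learning helps machines learn from data",
    "deep learning is a branch of machine learning",
    "keyword extraction finds important terms in text",
]

vectorizer = CountVectorizer(
    max_df=0.9,           # ignore terms in more than 90% of documents (corpus-specific stop words)
    max_features=1000,    # caps the number of columns in the matrix
    ngram_range=(1, 3),   # single words, bi-grams, and tri-grams
    stop_words="english", # built-in English stop word list
)

# Tokenisation + vectorisation: rows are documents, columns are term counts
X = vectorizer.fit_transform(docs)
print(X.shape)                             # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
```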
Converting to matrix of integers
Using TF-IDF vectoriser
Downsides of using mere word counts from CountVectorizer:
Large counts of certain common words may dilute the impact of more context-specific words in the corpus
TF-IDF overcomes this; it penalizes words that appear several times across documents
Scores words based on context, not just frequency. Two components to TF-IDF:
TF: Term Frequency = (frequency of term in document) / (total # of terms in document)
IDF: Inverse Document Frequency = log((total # of documents) / (# of documents containing the term))
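A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer, reusing the same illustrative corpus and sorting terms by score to surface the top N keywords:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning helps machines learn from data",
    "deep learning is a branch of machine learning",
    "keyword extraction finds important terms in text",
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)

# Sum TF-IDF scores per term across the corpus, then sort in descending order
scores = np.asarray(X.sum(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()

top_n = 10
for idx in scores.argsort()[::-1][:top_n]:
    print(f"{terms[idx]}: {scores[idx]:.3f}")
```

Note that scikit-learn's TF-IDF uses a smoothed IDF by default, so scores will differ slightly from the plain log formula above.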