from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

label_encoder = preprocessing.LabelEncoder()
# the y that is being passed in is a pandas Series of categorical data
y = label_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# one-hot encode the integer class labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# prints the classes of y
print(label_encoder.classes_)
Preparing Text Input
Minimizing noise
To minimize noise in our text, we clean it by removing punctuation, numbers, and excess whitespace.
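The cleaning step described above could look roughly like the sketch below. The helper name `clean_text` and the exact regular expressions are assumptions for illustration, not necessarily the code used in this project; it keeps only ASCII letters and whitespace.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical helper: strip punctuation, digits, and extra whitespace."""
    # Replace every character that is not an ASCII letter or whitespace with a space
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Collapse runs of whitespace into a single space and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(clean_text("Hello,   world!! 123 times..."))  # Hello world times
```

Replacing unwanted characters with a space, rather than deleting them, avoids gluing two words together when they are separated only by punctuation.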
Convert text into suitable input format for model
Our model only understands numeric values, so we have to convert our textual input into vectors.
We use pretrained word embeddings to create these vectors.
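To make the idea concrete, here is a minimal sketch of how a lookup table of word embeddings turns a piece of text into a numeric array. The table `toy_embeddings`, its values, and the helper `text_to_vectors` are made up for illustration, and mapping unknown words to a zero vector is an assumption rather than something this post prescribes.

```python
import numpy as np

# Toy 3-dimensional embedding table with made-up values
# (the GloVe file used in this post has 300 dimensions per word)
toy_embeddings = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "movie": np.array([0.4, 0.5, 0.6], dtype="float32"),
}


def text_to_vectors(text, embeddings, dim):
    """Hypothetical helper: one row per whitespace-separated token."""
    # Assumption: words missing from the table are represented by a zero vector
    vectors = [embeddings.get(tok, np.zeros(dim, dtype="float32")) for tok in text.split()]
    if not vectors:
        return np.zeros((0, dim), dtype="float32")
    return np.stack(vectors)


matrix = text_to_vectors("good movie tonight", toy_embeddings, dim=3)
print(matrix.shape)  # (3, 3): three tokens, three dimensions each
```

The last row is all zeros because "tonight" is not in the toy table; with a real pretrained vocabulary far fewer words fall into that case.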
Creating vectors for each word provided by GloVe
import numpy as np
from tqdm.notebook import tqdm

# maps each word to its pretrained embedding vector
embeddings_dict = {}
# Download glove.6B.300d.txt from the GloVe website
with open(f"{ROOT}/glove.6B.300d.txt", "r", encoding="utf8") as glove_file:
    for line in tqdm(glove_file):
        # each line holds a word followed by its vector components
        records = line.split()
        word = records[0]
        vector_dim = np.asarray(records[1:], dtype="float32")
        embeddings_dict[word] = vector_dim