Natural Language Processing
NLP note
Tokenization
Tokenization is the process of dividing sentences into pieces called tokens.
sentences
- I love my dog
- I love my cat
- I like your hair
without Keras API
token_idx = {}
for sentence in sentences:
    for word in sentence.split():
        word = word.lower()
        if word not in token_idx:
            token_idx[word] = len(token_idx) + 1
result = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5, 'like': 6, 'your': 7, 'hair': 8}
with Keras API
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
token_list = tokenizer.word_index
result = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5, 'like': 6, 'your': 7, 'hair': 8}
The results are the same with and without the Keras API.
However, the results differ when a sentence contains a quotation mark, an exclamation mark, or another special character.
sentences
- I love my dog!
- I love my dog
- I like your hair
without Keras API
result = {'i': 1, 'love': 2, 'my': 3, 'dog!': 4, 'dog': 5, 'like': 6, 'your': 7, 'hair': 8}
with Keras API
result = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'like': 5, 'your': 6, 'hair': 7}
The Keras API knows that 'dog!' and 'dog' are the same word because it strips punctuation by default.
On the other hand, the result changes without the Keras API, so a few more lines have to be added.
without Keras API
token_idx = {}
for sentence in sentences:
    for word in sentence.split():
        # strip punctuation before indexing
        for char in '!?,.;:~#$%^&*':
            word = word.replace(char, '')
        word = word.lower()
        if word not in token_idx:
            token_idx[word] = len(token_idx) + 1
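With this cleanup, the plain-Python result matches the Keras output shown above:
result = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'like': 5, 'your': 6, 'hair': 7}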
One-hot encoding
It refers to splitting the column which contains numerical categorical data to many columns depending on the number of categories present in that column. Each column contains “0” or “1” corresponding to which column it has been placed. (geeksforgeeks)
without Keras API
import numpy as np

max_length = 10
# one-hot tensor of shape (number of sentences, max sentence length, vocabulary size + 1)
result = np.zeros(shape=(len(sentences), max_length, max(token_idx.values()) + 1))
for i, sentence in enumerate(sentences):
    for j, word in list(enumerate(sentence.split()))[:max_length]:
        index = token_idx.get(word.lower())
        result[i, j, index] = 1
with Keras API
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words = 1000)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)  # each sentence as a list of word indices
one_hot_results = tokenizer.texts_to_matrix(sentences, mode='binary')  # binary one-hot matrix per sentence
sequences = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 6, 7, 8]]
Suppose there are some new words in the test set that were not seen in the training set. In order not to lose the length of the sequence, we simply use oov_token='<OOV>'.
tokenizer = Tokenizer(num_words = 1000, oov_token='<OOV>')
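As a quick check (the test sentence and the exact indices are illustrative and depend on the fitted tokenizer), a sequence keeps its length because the unseen word maps to the <OOV> index:
test_sentences = ['I really love my dog']  # 'really' did not appear in the training sentences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
# with oov_token='<OOV>' the unseen word becomes index 1 instead of being dropped,
# e.g. [[2, 1, 3, 4, 5]] rather than [[2, 3, 4, 5]]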
For images, most of them are the same size, so we do not need to do anything.
However, sentences in NLP usually have different lengths, so we need to use padding.
sentences
- I love my dog
- I think I love my cat a lot
- I really like your hair style
from keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences)  # pads every sequence to the length of the longest one (maxlen can be set)
padded = [[ 0  0  0  0  2  3  4  5], [ 2  6  2  3  4  7  8  9], [ 0  0  2 10 11 12 13 14]]
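pad_sequences also takes parameters that control the padding length and where padding or truncation happens; a small sketch (parameter values chosen only for illustration):
padded_post = pad_sequences(sequences,
                            maxlen=6,           # fix every sequence to length 6
                            padding='post',     # add the zeros at the end instead of the front
                            truncating='post')  # drop words from the end if a sentence is too long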
Embedding
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers. (Wikipedia)
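In Keras, this mapping is provided by the Embedding layer; a minimal sketch (vocabulary size, vector length, and sequence length are illustrative):
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(input_dim=1000,   # vocabulary size
                    output_dim=16,    # length of each word vector
                    input_length=8))  # length of each (padded) input sequence
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])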
word2vec
word2vec has two model architectures, CBOW (continuous bag-of-words) and Skip-Gram. It represents each word as a distributed vector that carries meaning, rather than a meaningless one-hot encoded vector.
However, computing the softmax over the whole vocabulary becomes expensive when there are many words, which is why tricks such as negative sampling are used.
For example
- cat + cute = kitten
- dog + cute = puppy
with Keras API
import keras

X = []
Y = []
for row in indexed_corpus:  # indexed_corpus: sentences already converted to lists of word indices
    # generate (target, context) pairs together with negative samples and their labels
    x, y = keras.preprocessing.sequence.skipgrams(sequence=row,
                                                  vocabulary_size=vocab_size,
                                                  window_size=window_size,
                                                  negative_samples=1.0,
                                                  shuffle=True)
    X = X + list(x)
    Y = Y + list(y)
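The (X, Y) pairs can then be used to train a small skip-gram style model; a minimal sketch, assuming vocab_size is defined and using an embedding size of 100 (this model is not part of the original note):
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Dot, Flatten, Dense

embedding_dim = 100

target_in = Input(shape=(1,))
context_in = Input(shape=(1,))
embedding = Embedding(vocab_size, embedding_dim)

target_vec = Flatten()(embedding(target_in))    # (batch, embedding_dim)
context_vec = Flatten()(embedding(context_in))

# dot product of the two word vectors -> probability that the pair is a real (word, context) pair
similarity = Dot(axes=1)([target_vec, context_vec])
output = Dense(1, activation='sigmoid')(similarity)

skipgram_model = Model([target_in, context_in], output)
skipgram_model.compile(optimizer='adam', loss='binary_crossentropy')

pairs = np.array(X)   # shape (num_pairs, 2): [target, context]
labels = np.array(Y)
skipgram_model.fit([pairs[:, 0].reshape(-1, 1), pairs[:, 1].reshape(-1, 1)],
                   labels, epochs=5, batch_size=256)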
GloVe
GloVe performs matrix factorization on a word co-occurrence matrix to find embedding vectors that minimize its objective function.
load a pre-trained GloVe embedding
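The matrix-building snippet below assumes embeddings_index has already been built from a downloaded GloVe text file; a minimal sketch of that step (the file name is an assumption):
import numpy as np

embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:  # path to the downloaded GloVe file (assumed)
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs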
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the GloVe index stay all-zeros
            embedding_matrix[i] = embedding_vector

# load the pre-trained matrix into the model's Embedding layer (assumed to be the first layer)
model.layers[0].set_weights([embedding_matrix])
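If the pre-trained vectors should stay fixed during training, the Embedding layer can also be frozen (a common choice, not stated in the original note):
model.layers[0].trainable = False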
Reference
- TensorFlow