Natural Language Processing


NLP note

Tokenization

Tokenization is the process of splitting a sentence into smaller pieces (tokens), such as words.

sentence

  • I love my dog
  • I love my cat
  • I like your hair

without Keras API

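sentences = ['I love my dog', 'I love my cat', 'I like your hair']
token_idx = {}

for sentence in sentences:
  for word in sentence.split():
    word = word.lower()  # lowercase before the lookup so 'I' and 'i' share one index
    if word not in token_idx:
      token_idx[word] = len(token_idx) + 1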

result = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5, 'like': 6, 'your': 7, 'hair': 8}

with Keras API

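from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words = 100)  # cap the vocabulary used when encoding texts
tokenizer.fit_on_texts(sentences)       # build the word index from the sentences
token_list = tokenizer.word_index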

result = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5, 'like': 6, 'your': 7, 'hair': 8}

Both approaches give the same result here.

However, the results differ when a sentence contains a quotation mark, an exclamation mark, or another special character.

sentence

  • I love my dog!
  • I love my dog
  • I like your hair

without Keras API

result = {'i': 1, 'love': 2, 'my': 3, 'dog!': 4, 'dog': 5, 'like': 6, 'your': 7, 'hair': 8}

with Keras API

result = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'like': 5, 'your': 6, 'hair': 7}

The Keras Tokenizer strips punctuation before it builds the index, so 'dog!' and 'dog' become the same token.
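To make that behaviour concrete, here is a rough pure-Python imitation of what the Tokenizer does by default: lowercase the text, replace the characters listed in its `filters` argument with spaces, split on whitespace, and number the words by frequency. The helper name `build_word_index` is made up for this note, and this is only a sketch of the idea, not the Keras implementation.

```python
from collections import Counter

# Same characters as the default value of the Keras Tokenizer `filters` argument
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(sentences):
    """Hypothetical helper: lowercase, drop filtered characters, rank words by frequency."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().translate(table).split())
    # most_common() keeps first-seen order among words with equal counts
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ['I love my dog!', 'I love my dog', 'I like your hair']
word_index = build_word_index(sentences)
print(word_index)
```

Because the index is ranked by frequency, the most common word always gets index 1, no matter where it first appears.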

On the other hand, the result of the version without the Keras API does change, so it needs a few more lines to strip the special characters.

without Keras API

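token_list = {}

for sentence in sentences:
  for word in sentence.split():
    # strip special characters and lowercase before the lookup
    for char in '!?,.;:~#$%^&*':
      word = word.replace(char, '')
    word = word.lower()
    # skip tokens that were nothing but special characters
    if word and word not in token_list:
      token_list[word] = len(token_list) + 1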

One-hot encoding

It refers to splitting the column which contains numerical categorical data to many columns depending on the number of categories present in that column. Each column contains “0” or “1” corresponding to which column it has been placed. (geeksforgeeks)

without Keras API

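import numpy as np

max_length = 10  # keep at most the first 10 words of each sentence

# index 0 is never assigned to a word, so the last axis needs max index + 1 slots
result = np.zeros(shape=(len(sentences),
                         max_length, max(token_list.values()) + 1))

for i, sentence in enumerate(sentences):
  for j, word in list(enumerate(sentence.split()))[:max_length]:
    # normalize the word the same way as when token_list was built
    index = token_list.get(word.lower().strip('!?,.;:~#$%^&*'))
    if index is not None:
      result[i, j, index] = 1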

with Keras API

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words = 1000)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

one_hot_results = tokenizer.texts_to_matrix(sentences, mode='binary')

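With the second set of sentences and a tokenizer created with oov_token='<OOV>' (introduced below; the OOV token takes index 1), this gives:

sequences = [[2, 3, 4, 5], [2, 3, 4, 5], [2, 6, 7, 8]]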
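Note that `texts_to_matrix(mode='binary')` returns one row per sentence, with a 1 in every column whose word index occurs in that sentence. That is a bag-of-words vector per sentence, not a separate one-hot vector per word as in the manual version. A small pure-Python stand-in, using the sequences shown above (the function name `texts_to_binary_matrix` is made up for this note and only imitates the idea):

```python
def texts_to_binary_matrix(sequences, num_words):
    """Hypothetical stand-in for texts_to_matrix(mode='binary')."""
    matrix = [[0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for idx in seq:
            if idx < num_words:  # indices at or above num_words are ignored
                row[idx] = 1
    return matrix

sequences = [[2, 3, 4, 5], [2, 3, 4, 5], [2, 6, 7, 8]]
matrix = texts_to_binary_matrix(sequences, num_words=9)
for row in matrix:
    print(row)
```

Word order is lost in this representation: any sentence made of the same words produces the same row.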

Suppose the test set contains words that never appeared in the training set. To keep the sequence length unchanged, pass oov_token='<OOV>' so that unseen words map to a placeholder token instead of being dropped.

tokenizer = Tokenizer(num_words = 1000, oov_token='<OOV>')
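The effect of the OOV token can be sketched without Keras. The word index below is a small hand-written example with '<OOV>' at index 1, and `to_sequence` is a hypothetical helper for this note, not a Keras function.

```python
# Hand-written word index for illustration; '<OOV>' sits at index 1
word_index = {'<OOV>': 1, 'i': 2, 'love': 3, 'my': 4, 'dog': 5}

def to_sequence(sentence, word_index, oov_token='<OOV>'):
    """Hypothetical helper: map each word to its index, unseen words to the OOV index."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in sentence.lower().split()]

seen = to_sequence('I love my dog', word_index)
unseen = to_sequence('I love my manatee', word_index)
print(seen)    # [2, 3, 4, 5]
print(unseen)  # [2, 3, 4, 1]
```

Both sequences have four entries, so the unseen word 'manatee' still occupies a position instead of silently disappearing.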

Images in a dataset usually share the same size, so they need no extra handling.
Sentences in NLP, however, usually differ in length, so we need to pad them to a common length.

sentence

  • I love my dog
  • I think I love my cat a lot
  • I really like your hair style

from keras.preprocessing.sequence import pad_sequences

# pads at the front up to the longest sentence by default; maxlen sets a fixed length
padded = pad_sequences(sequences)

padded = [[0, 0, 0, 0, 2, 3, 4, 5], [2, 6, 2, 3, 4, 7, 8, 9], [0, 0, 2, 10, 11, 12, 13, 14]]
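For reference, the padding logic itself is short. The function below is a pure-Python sketch that imitates the `maxlen`, `padding`, `truncating`, and `value` arguments of `pad_sequences` (it returns plain lists rather than a NumPy array, and the name `pad` is made up for this note).

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Sketch of pad_sequences: bring every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' from the end
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

sequences = [[2, 3, 4, 5], [2, 6, 2, 3, 4, 7, 8, 9], [2, 10, 11, 12, 13, 14]]
pre = pad(sequences)
post = pad(sequences, maxlen=6, padding='post', truncating='post')
print(pre)
print(post)
```

With `padding='post'` the zeros go at the end, and with `maxlen=6` the second sentence loses its last two words.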

Embedding

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers. (Wikipedia)

word2vec

word2vec comes in two variants: CBOW (continuous bag-of-words) and skip-gram. It represents each word as a dense, distributed vector that captures meaning, rather than as a one-hot vector that carries none.
However, computing the softmax over the whole vocabulary becomes expensive when the vocabulary is large, which is why approximations such as negative sampling are commonly used.

For example

  • cat + cute = kitten
  • dog + cute = puppy
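A toy version of this arithmetic: add two word vectors and look for the nearest remaining word by cosine similarity. The three-dimensional vectors below are hand-made for illustration (roughly "feline", "canine", "cute"); they are not learned embeddings, and real word2vec vectors only approximate such relations.

```python
import math

# Hand-made toy vectors, NOT learned embeddings
vectors = {
    'cat':    [1.0, 0.0, 0.0],
    'dog':    [0.0, 1.0, 0.0],
    'cute':   [0.0, 0.0, 1.0],
    'kitten': [1.0, 0.0, 1.0],
    'puppy':  [0.0, 1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_to_sum(word_a, word_b):
    """Return the word closest to vectors[word_a] + vectors[word_b], excluding the inputs."""
    query = [x + y for x, y in zip(vectors[word_a], vectors[word_b])]
    candidates = [w for w in vectors if w not in (word_a, word_b)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(nearest_to_sum('cat', 'cute'))  # kitten
print(nearest_to_sum('dog', 'cute'))  # puppy
```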

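with Keras API

import keras

# indexed_corpus is a list of integer-encoded sentences;
# vocab_size and window_size must be defined beforehand
X = []  # (target, context) word-index pairs
Y = []  # labels: 1 for a real pair, 0 for a negative sample
for row in indexed_corpus:
    x, y = keras.preprocessing.sequence.skipgrams(sequence=row, vocabulary_size=vocab_size, window_size=window_size,
                    negative_samples=1.0, shuffle=True, categorical=False, sampling_table=None, seed=None)
    X.extend(x)
    Y.extend(y)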
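To see what a skip-gram pair is, here is a pure-Python sketch that builds only the positive (target, context) pairs for one sentence. It leaves out the negative samples and shuffling that the Keras function adds, and the name `skipgram_pairs` is made up for this note.

```python
def skipgram_pairs(sequence, window_size):
    """Pair every word with each neighbour at most window_size positions away."""
    pairs = []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs

pairs = skipgram_pairs([1, 2, 3, 4], window_size=1)
print(pairs)  # [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)]
```

A skip-gram model is then trained to tell such real pairs apart from randomly drawn ones.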

GloVe

GloVe learns embedding vectors by factorizing the word co-occurrence matrix, minimizing a weighted least-squares objective function.

Load pre-trained GloVe embeddings

# embeddings_index maps each word to its GloVe vector (parsed from the GloVe file beforehand)
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        # words missing from GloVe keep an all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

# copy the matrix into the model's first layer (the Embedding layer)
model.layers[0].set_weights([embedding_matrix])
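The snippet above assumes that `embeddings_index` already exists. In the GloVe text files, each line holds a word followed by its vector components, separated by spaces, so the dictionary can be built roughly as below. The two sample lines and their numbers are made up so the sketch runs without downloading anything, and `load_glove` is a hypothetical helper name; with a real file you would pass an opened file handle instead of the `StringIO` object.

```python
import io

def load_glove(handle):
    """Hypothetical helper: parse 'word v1 v2 ...' lines into a word -> vector dict."""
    embeddings_index = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        embeddings_index[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings_index

# Made-up stand-in for a GloVe file with 3-dimensional vectors
fake_file = io.StringIO('dog 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n')
embeddings_index = load_glove(fake_file)
print(embeddings_index['dog'])    # [0.1, 0.2, 0.3]
print(embeddings_index.get('xyz'))  # None
```

Words that are absent from the file return `None` from `.get()`, which is why the loop above checks for it before filling a row.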

Reference

  • TensorFlow
