4.2.3. FastText#

A common problem in Natural Language Processing (NLP) tasks is capturing the context in which a word is used. A single word with the same spelling and pronunciation (a homonym) can appear in multiple contexts, and word embeddings are one way of addressing this problem.

FastText is a library created by the Facebook Research Team for efficient learning of word representations, like Word2Vec (link to previous chapter) or GloVe (link to previous chapter), and for sentence classification; it produces a type of static word embedding (link to previous chapter). If you want, you can read the official fastText paper.

Note

FastText differs from Word2Vec (link to previous chapter) in that Word2Vec treats every single word as the smallest unit whose vector representation is to be found, whereas FastText assumes a word to be formed by character n-grams.

For example:

the word sunny is composed of [sun, sunn, sunny], [sunny, unny, nny], etc., where \(n\) can range from 1 to the length of the word.

Examples of different length character n-grams are given below:


Word      Length (n)   Character n-grams
eating    3            <ea, eat, ati, tin, ing, ng>
eating    4            <eat, eati, atin, ting, ing>
eating    5            <eati, eatin, ating, ting>
eating    6            <eatin, eating, ating>

Thus, FastText works well with rare words: even if a word was not seen during training, it can be broken down into character n-grams to obtain its embedding. Word2Vec (link to previous chapter) and GloVe (link to previous chapter) both fail to provide any vector representation for words that are not in the model dictionary. This is a huge advantage of this method.
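To make this concrete, here is a minimal sketch (not the official fastText implementation, which additionally hashes the n-grams into a fixed number of buckets) of how boundary-marked character n-grams can be extracted from a word:

def char_ngrams(word, n):
    # fastText wraps the word in boundary symbols < and > before slicing
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("eating", 3))
# ['<ea', 'eat', 'ati', 'tin', 'ing', 'ng>']  -- matches the n=3 row of the table above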

FastText model from the Python gensim library#

To train your own embeddings, you can either use the official CLI tool or use the fasttext implementation available in gensim.

You can install and import the gensim library, and then use it to extract the most similar words from a model downloaded from FastText, or to train your own model as shown below.

Assume we use the same corpus as we used in the GloVe (link to previous chapter) model.

Import essential libraries#

from gensim.models.fasttext import FastText
documents = ['this is the first document',
             'this document is the second document',
             'this is the third one',
             'is this the first document']

Tokenize

We will first tokenize the above documents list (extract words from the sentences).

word_tokens = []

for document in documents:
    words = []
    for word in document.split(" "):
        words.append(word)
    word_tokens.append(words)

word_tokens
[['this', 'is', 'the', 'first', 'document'],
 ['this', 'document', 'is', 'the', 'second', 'document'],
 ['this', 'is', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']]
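As an aside, gensim also provides a small helper, simple_preprocess, that lowercases and tokenizes text in one step; for this toy corpus it produces the same token lists as the manual split above:

from gensim.utils import simple_preprocess

# one list of lowercased word tokens per document
word_tokens = [simple_preprocess(document) for document in documents]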

Defining values for parameters#

The hyperparameters used in this model are:

  • size: Dimensionality of the word vectors.

  • window: Maximum distance between the current and predicted word within a sentence.

  • min_count: The model ignores all words with total frequency lower than this.

  • sample: The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).

  • workers: Use this many worker threads to train the model (faster training on multicore machines).

  • sg: Training algorithm: skip-gram if sg=1, otherwise CBOW.

  • iter: Number of iterations (epochs) over the corpus.

embedding_size = 300
window_size = 2
min_word = 1
down_sampling = 1e-2

Let’s train a Gensim fastText word embedding model with our own custom data:

fast_Text_model = FastText(word_tokens,
                           size=embedding_size,
                           window=window_size,
                           min_count=min_word,
                           sample=down_sampling,
                           workers=4,
                           sg=1,
                           iter=100)
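Note that the parameter names above follow the gensim 3.x API. In gensim 4.0 and later, size was renamed to vector_size and iter to epochs; a roughly equivalent call for newer gensim versions would be:

fast_Text_model = FastText(word_tokens,
                           vector_size=embedding_size,   # gensim >= 4.0 name for size
                           window=window_size,
                           min_count=min_word,
                           sample=down_sampling,
                           workers=4,
                           sg=1,
                           epochs=100)                    # gensim >= 4.0 name for iter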

Explore Gensim fastText model#

# Check the word embedding for a particular word

fast_Text_model.wv['document'].shape
(300,)
# Check the top 5 most similar words for a given word with gensim fastText

fast_Text_model.wv.most_similar('first', topn=5)
[('document', 0.9611383676528931),
 ('this', 0.9607083797454834),
 ('the', 0.9569987058639526),
 ('third', 0.956832766532898),
 ('is', 0.9551167488098145)]
# Check the similarity score between two words

fast_Text_model.wv.similarity('second', 'first')
0.9406048
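Because fastText composes a word vector from its character n-grams, it can also return an embedding for a word that never occurred in the training corpus. A small illustration with a deliberate misspelling (documnt) that is not in our toy vocabulary:

# fastText builds the vector from subword n-grams;
# a plain Word2Vec model would raise a KeyError for this unseen word
fast_Text_model.wv['documnt'].shape

fast_Text_model.wv.similarity('documnt', 'document')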

FastText models from the official CLI tool#

Building the fastText command-line tool#

To build the fastText command-line tool from the official repository, run the following:

!git clone https://github.com/facebookresearch/fastText.git
Cloning into 'fastText'...
remote: Enumerating objects: 3930, done.
remote: Counting objects: 100% (944/944), done.
remote: Compressing objects: 100% (140/140), done.
remote: Total 3930 (delta 854), reused 804 (delta 804), pack-reused 2986
Receiving objects: 100% (3930/3930), 8.24 MiB | 2.34 MiB/s, done.
Resolving deltas: 100% (2505/2505), done.
%cd fastText
!make
!cp fasttext ../
%cd ..
!./fasttext
usage: fasttext <command> <args>

The commands supported by fasttext are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  test-label              print labels with precision and recall scores
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  print-ngrams            print ngrams given a trained model and word
  nn                      query for nearest neighbors
  analogies               query for analogies
  dump                    dump arguments,dictionary,input/output vectors

If everything was built correctly, you should see the list of available commands for fastText as the output.

fastText can learn word representations with either the Skipgram or the CBOW model. Below we see how to use both methods to learn vector representations for a sample text file file.txt.

Skipgram

./fasttext skipgram -input file.txt -output model

CBOW

./fasttext cbow -input file.txt -output model

Let us break down the parameters in the commands above:

./fasttext – invokes the fastText binary.

skipgram/cbow – specifies whether skipgram or cbow is used to create the word representations.

-input – the parameter specifying that the following word is the name of the file used for training. This argument should be used as is.

file.txt – a sample text file over which we wish to train the skipgram or cbow model. Change this to the name of your own text file.

-output – the parameter specifying that the following word is the name of the model being created. This argument is to be used as is.

model – the name of the model created.

Running the above command will create two files named model.bin and model.vec.

model.bin contains the model parameters, dictionary and the hyperparameters and can be used to compute word vectors.

model.vec is a text file containing the word vectors, one word vector per line.

!./fasttext skipgram -input file.txt -output model
Read 0M words
Number of words:  39
Number of labels: 0
Progress: 100.0% words/sec/thread:    5233 lr:  0.000000 avg.loss:  4.118974 ETA:   0h 0m 0s
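With the model trained, you can also query word vectors directly from model.bin using the print-word-vectors command, which reads query words from standard input; thanks to subword n-grams this works even for words that are not in the training file (the query word below is just an example):

!echo "word" | ./fasttext print-word-vectors model.bin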

Finding similar words#

You can also find the words most similar to a given word. This functionality is provided by the nn command. Let’s see how we can find the most similar words to “happy”.

!./fasttext nn model.bin
Query word? happy
with 0.133388
skip-gram 0.0966653
can 0.0563167
to 0.0551814
and 0.046508
more 0.0456839
Word2vec 0.0375318
are 0.0350034
for 0.0350024
which 0.0321014
Query word? wrd
skip-gram 0.201936
words 0.199973
syntactic 0.164848
similar 0.164541
a 0.154628
more 0.152884
to 0.145891
word 0.141979
can 0.137356
word2vec 0.128606
Query word? ^C
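The analogies command listed earlier works the same way: it loads model.bin and then prompts interactively for a word triplet A - B + C (for example, the classic berlin germany france query from the official fastText tutorial). On a corpus as small as file.txt the answers will not be meaningful, but the invocation is:

!./fasttext analogies model.bin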

Explore further

Code Source for Official CLI tool section

Code Source for gensim part