# 4.2.3. FastText

A common problem in Natural Language Processing (NLP) tasks is capturing the context in which a word is used. A single word with the same spelling and pronunciation (a `homonym`) can be used in multiple contexts, and word embeddings are one potential solution to this problem.

**FastText** is a library created by the Facebook Research Team for efficient learning of word representations, like [Word2Vec](https://pythonandml.github.io/dlbook/content/word_embeddings/word2vec.html) (link to previous chapter) or [GloVe](https://pythonandml.github.io/dlbook/content/word_embeddings/glove.html) (link to previous chapter), and for sentence classification. It is a type of [static word embedding](https://pythonandml.github.io/dlbook/content/word_embeddings/static_word_embeddings.html) (link to previous chapter). If you want, you can read the official [fastText paper](https://arxiv.org/pdf/1607.04606.pdf).

:::{note}
`FastText` differs in that [Word2Vec](https://pythonandml.github.io/dlbook/content/word_embeddings/word2vec.html) (link to previous chapter) treats every **single word** as the **smallest unit** whose vector representation is to be found, whereas `FastText` treats a word as being formed of **character n-grams**.

For example:

The word `sunny` is composed of character n-grams such as `[sun, sunn, sunny]`, `[sunny, unny, nny]`, etc., where $n$ can range from 1 to the length of the word.
:::

**Examples of character n-grams of different lengths are given below:**

Here `<` and `>` are special boundary symbols that mark the beginning and the end of the word.

![](images/fasttext_3_grams_list.png)

[Image Source](https://amitness.com/2020/06/fasttext-embeddings/)

<table>
<thead>
<tr>
<th>Word</th>
<th>Length(n)</th>
<th>Character n-grams</th>
</tr>
</thead>
<tbody>
<tr>
<td>eating</td>
<td>3</td>
<td>&lt;ea, eat, ati, tin, ing, ng&gt;</td>
</tr>
<tr>
<td>eating</td>
<td>4</td>
<td>&lt;eat, eati, atin, ting, ing&gt;</td>
</tr>
<tr>
<td>eating</td>
<td>5</td>
<td>&lt;eati, eatin, ating, ting&gt;</td>
</tr>
<tr>
<td>eating</td>
<td>6</td>
<td>&lt;eatin, eating, ating&gt;</td>
</tr>
</tbody>
</table>
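The rows of the table above can be reproduced with a few lines of Python. This is only a minimal sketch: `char_ngrams` is a helper name chosen for illustration and is not part of the fastText library.

```python
def char_ngrams(word, n):
    """Return the character n-grams of `word` after adding boundary markers."""
    marked = f"<{word}>"  # '<' and '>' mark the start and end of the word
    # Slide a window of width n over the marked word.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

for n in (3, 4, 5, 6):
    print(n, char_ngrams("eating", n))
```

For `n = 3` this gives `['<ea', 'eat', 'ati', 'tin', 'ing', 'ng>']`, matching the first row of the table.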

Because of this, **FastText** works well with rare words. Even if a word wasn't seen during training, it can be broken down into character n-grams to get its embedding. By contrast, [Word2Vec](https://pythonandml.github.io/dlbook/content/word_embeddings/word2vec.html) (link to previous chapter) and [GloVe](https://pythonandml.github.io/dlbook/content/word_embeddings/glove.html) (link to previous chapter) both fail to provide any vector representation for words that are not in the model dictionary, so this ability to handle out-of-vocabulary words is a major advantage of FastText.
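To make the idea concrete, here is a toy sketch of how a vector for an unseen word could be assembled from n-gram vectors. Everything in it is an assumption made for illustration: the table of n-gram vectors is random rather than trained, `zlib.crc32` is a stand-in for the hash function the real library uses, and the sizes `DIM` and `BUCKETS` are far smaller than in practice.

```python
import random
import zlib

DIM = 4          # toy vector size (assumption; real models use e.g. 100-300)
BUCKETS = 1000   # toy number of hash buckets (assumption)

# Stand-in for a trained n-gram embedding table: one random vector per bucket.
rng = random.Random(0)
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]

def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of the boundary-marked word for min_n <= n <= max_n."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)]

def oov_vector(word):
    """Average the vectors of the buckets that the word's n-grams hash to."""
    grams = char_ngrams(word)
    rows = [ngram_table[zlib.crc32(g.encode("utf-8")) % BUCKETS] for g in grams]
    return [sum(column) / len(rows) for column in zip(*rows)]

# 'documents' never has to appear in a vocabulary to get a vector.
print(oov_vector("documents"))
```

Because `document` and `documents` share most of their n-grams, their vectors are built largely from the same rows of the table, which is why related word forms end up close together.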

### FastText model from the Python gensim library

To train your own embeddings, you can either use the [official CLI tool](https://fasttext.cc/docs/en/unsupervised-tutorial.html) or use the fastText implementation available in gensim.

Here we install and import the gensim library, train a FastText model with it, and then use the trained model to look up word vectors and find the most similar words.

Assume we use the same corpus as we used in the [GloVe](https://pythonandml.github.io/dlbook/content/word_embeddings/glove.html) (link to previous chapter) model.

#### Import essential libraries

In [1]:
from gensim.models.fasttext import FastText

In [3]:
documents = ['this is the first document',
             'this document is the second document',
             'this is the third one',
             'is this the first document']

**Tokenize**

We will first tokenize the above documents list (extract words from the sentences).


In [4]:
word_tokens = []

for document in documents:
    words = []
    for word in document.split(" "):
        words.append(word)
    word_tokens.append(words)

word_tokens

[['this', 'is', 'the', 'first', 'document'],
 ['this', 'document', 'is', 'the', 'second', 'document'],
 ['this', 'is', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']]

#### Defining values for parameters

The hyperparameters used in this model are:

* `size`: Dimensionality of the word vectors (named `vector_size` in gensim 4.0 and later).
* `window`: Maximum distance between the current word and the predicted word within a sentence.
* `min_count`: The model ignores all words with total frequency lower than this.
* `sample`: The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).
* `workers`: Number of worker threads used to train the model (faster training on multicore machines).
* `sg`: Training algorithm: skip-gram if sg=1, otherwise CBOW.
* `iter`: Number of iterations (epochs) over the corpus (named `epochs` in gensim 4.0 and later).
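To see what the window size means for skip-gram training (`sg=1`), the sketch below lists the (center word, context word) pairs that a window of 2 produces for one tokenized sentence. It is a simplified illustration: `skipgram_pairs` is a hypothetical helper, and real implementations such as gensim may additionally shrink the window at random for each position.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every position in `tokens`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # leftmost context index
        hi = min(len(tokens), i + window + 1)   # one past the rightmost index
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ['this', 'is', 'the', 'first', 'document']
for pair in skipgram_pairs(tokens, window=2)[:6]:
    print(pair)
```

With skip-gram the model learns to predict each context word from the center word; with CBOW (`sg=0`) the direction is reversed and the context words jointly predict the center word.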

In [9]:
embedding_size = 300
window_size = 2
min_word = 1
down_sampling = 1e-2

Let’s train a Gensim fastText word embedding model on our own custom data:

In [10]:
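# Note: gensim 4.0+ uses `vector_size` and `epochs`;
# older releases (gensim 3.x) call these `size` and `iter`.
fast_Text_model = FastText(word_tokens,
                      vector_size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      workers=4,
                      sg=1,
                      epochs=100)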

#### Explore Gensim fastText model

In [13]:
# Check the word embedding for a particular word

fast_Text_model.wv['document'].shape

(300,)

In [14]:
# Check the top 5 most similar words for a given word with gensim fastText

fast_Text_model.wv.most_similar('first', topn=5)

[('document', 0.9611383676528931),
 ('this', 0.9607083797454834),
 ('the', 0.9569987058639526),
 ('third', 0.956832766532898),
 ('is', 0.9551167488098145)]

In [16]:
# Check the similarity score between two words

fast_Text_model.wv.similarity('second', 'first')

0.9406048
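The scores returned by `most_similar` and `similarity` are cosine similarities between word vectors. The following plain-Python sketch shows the computation on made-up 3-dimensional vectors; `toy_vectors` and both function names are hypothetical and only mirror the idea, not gensim's actual implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, made up for illustration only.
toy_vectors = {
    "first":  [0.9, 0.1, 0.3],
    "second": [0.8, 0.2, 0.4],
    "one":    [-0.5, 0.7, 0.1],
}

def most_similar(word, vectors, topn=5):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

print(most_similar("first", toy_vectors, topn=2))
```

A score close to 1 means the two vectors point in nearly the same direction, while a negative score means they point in roughly opposite directions.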

### FastText models from Official CLI tool

#### Building the fastText command-line tool

To build the `fasttext` binary from source, clone the repository and compile it with `make`:


In [19]:
!git clone https://github.com/facebookresearch/fastText.git

Cloning into 'fastText'...
remote: Enumerating objects: 3930, done.
remote: Counting objects: 100% (944/944), done.
remote: Compressing objects: 100% (140/140), done.
remote: Total 3930 (delta 854), reused 804 (delta 804), pack-reused 2986
Receiving objects: 100% (3930/3930), 8.24 MiB | 2.34 MiB/s, done.
Resolving deltas: 100% (2505/2505), done.


In [None]:
%cd fastText
!make
!cp fasttext ../
%cd ..

In [27]:
!./fasttext

usage: fasttext <command> <args>

The commands supported by fasttext are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  test-label              print labels with precision and recall scores
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  print-ngrams            print ngrams given a trained model and word
  nn                      query for nearest neighbors
  analogies               query for analogies
  dump                    dump arguments,dictionary,input/output vectors



If everything was installed correctly, you should see the list of available FastText commands as the output.

FastText can learn word representations using either the **Skipgram** or the **CBOW** model. We will see how to use both methods to learn vector representations for a sample text file [file.txt](https://github.com/pythonandml/dlbook/blob/main/content/word_embeddings/datasets/file.txt).

**Skipgram**

> ./fasttext skipgram -input file.txt -output model

**CBOW**

> ./fasttext cbow -input file.txt -output model

Let us go through each part of the commands above.

`./fasttext` – Invokes the FastText binary.

`skipgram/cbow` – Specifies whether the skipgram or the cbow model is used to learn the word representations.

`-input` – The name of the option; the word that follows it is the name of the file used for training. This argument should be used as is.

`file.txt` – A sample text file on which we wish to train the skipgram or cbow model. Change this to the name of your own text file.

`-output` – The name of the option; the word that follows it is the name of the model being created. This argument should be used as is.

`model` – The name given to the model being created.

Running the above command will create two files named `model.bin` and `model.vec`. 

**model.bin** contains the model parameters, dictionary and the hyperparameters and can be used to compute word vectors. 

**model.vec** is a text file that contains the word vectors for one word per line.
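Since `model.vec` is plain text, it can be read without FastText itself. The sketch below assumes the usual layout of such files: a header line with the number of words and the vector dimension, followed by one word and its space-separated values per line. `load_vec` is a hypothetical helper, and the sample data is made up so that the example runs without a trained model.

```python
import io

def load_vec(handle):
    """Parse the textual .vec format into a {word: [floats]} dictionary."""
    # Header line: "<number of words> <vector dimension>"
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")  # rstrip drops the trailing space/newline
        if not parts[0]:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vectors

# Made-up sample with 2 words of dimension 3, in place of a real model.vec.
sample = io.StringIO("2 3\nthe 0.1 -0.2 0.3 \ndocument 0.4 0.5 -0.6 \n")
vectors = load_vec(sample)
print(vectors["document"])  # [0.4, 0.5, -0.6]
```

For a real file, the same function could be called as `load_vec(open("model.vec", encoding="utf-8"))`. Note that only `model.bin` keeps the n-gram information needed to build vectors for out-of-vocabulary words.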

In [33]:
!./fasttext skipgram -input file.txt -output model

Read 0M words
Number of words:  39
Number of labels: 0
Progress: 100.0% words/sec/thread:    5233 lr:  0.000000 avg.loss:  4.118974 ETA:   0h 0m 0s


#### Print word vectors of a word

In order to get the word vectors for a word or set of words, save them in a text file. For example, here is a sample text file named [queries.txt](https://github.com/pythonandml/dlbook/blob/main/content/word_embeddings/datasets/queries.txt) that contains some random words. 

> This is a sample document whose word vectors I want to calculate per line.

We will get the vector representation of these words using the model we trained above.

In [34]:
!./fasttext print-word-vectors model.bin < queries.txt

This 0.001488 -0.00088977 0.001269 -0.0026087 -0.0030958 0.0006547 0.00033601 -0.0025968 0.000359 0.00016549 -0.0057352 -0.00076519 -0.0029626 -0.0015348 -0.0021845 -0.00076492 -0.0018045 -0.0022352 -0.0020006 -7.1538e-05 0.002477 0.00073409 0.00051157 0.001207 -0.00080267 -0.0021785 -0.0020973 0.0010278 -0.0014228 6.9732e-05 -0.003072 -0.00084118 0.0032873 0.00042565 0.0031798 -0.00062606 0.0024818 -0.001486 0.0021485 -0.0011063 0.00080187 -0.00066582 -0.0023849 0.001545 0.001557 -0.0035687 -0.0018026 -0.00066306 -0.0023016 0.00095573 -0.00026604 0.0051665 -0.00076879 -0.00067671 8.395e-05 0.00076354 -0.0011667 0.0014229 -3.6249e-06 0.0022986 -0.0022901 -0.00097346 -0.0013723 0.003118 0.0011788 -0.0015126 0.0011676 -0.0016475 -5.6458e-05 -0.0019892 -0.00063471 3.005e-05 0.00016219 0.00065521 -0.0020108 -0.0019475 -0.0012301 -0.0041074 0.00050122 0.0010706 -0.0025458 -0.0028248 0.0022962 0.00011252 -0.0022692 0.0010398 -0.00054782 -0.0036951 0.0012078 -0.0026392 0.0013318 0.0013911 0.0

To check word vectors for a single word without saving into a file, you can do:



In [35]:
!echo "word" | ./fasttext print-word-vectors model.bin

word -0.00025409 0.00096744 0.0049497 -0.00032588 0.0016006 0.0024075 -0.0038169 0.00021031 -0.0016116 0.0022946 -0.0059559 -0.0072992 -0.0038031 0.0026737 0.00026646 -0.0010559 0.0021195 -0.0073557 -0.00035755 0.0013653 -0.00034896 -0.00020773 -0.0024795 0.0027182 0.0052009 -0.0042148 -0.00073169 -0.00045605 -0.0018099 -0.00032808 0.0017144 0.0022335 0.0037804 0.0069224 -0.0022954 -0.00083255 -0.00029555 -0.00019974 -6.4909e-05 0.00029662 0.0027022 0.0015674 0.00036462 0.0013067 0.0042851 -0.00037913 0.0046911 0.002086 -0.0019307 0.0038958 -0.0070587 0.0081234 0.00021799 -0.0070302 0.001991 0.0071041 0.0045719 0.0010437 0.0032133 0.0057495 -0.0034484 -0.0053813 -0.0056952 0.0021939 -0.0020668 -0.0011078 -0.0030795 0.00056493 -0.0071458 -0.00065485 -0.0020349 0.00079504 0.0030109 0.0022615 0.00083377 0.0028798 0.00045376 -0.0075627 0.001661 0.0010637 0.0041424 -0.0021542 0.0065332 0.00084069 0.005215 0.0038575 0.001616 -0.0031342 0.00034762 0.0019841 -0.0013023 0.00014848 0.004405 0.00

#### Finding similar words

You can also find the words most similar to a given word. This functionality is provided by the `nn` (nearest neighbors) command, which prompts for query words interactively. Let’s see how we can find the most similar words to “happy”.

In [36]:
!./fasttext nn model.bin

Query word? happy
with 0.133388
skip-gram 0.0966653
can 0.0563167
to 0.0551814
and 0.046508
more 0.0456839
Word2vec 0.0375318
are 0.0350034
for 0.0350024
which 0.0321014
Query word? wrd
skip-gram 0.201936
words 0.199973
syntactic 0.164848
similar 0.164541
a 0.154628
more 0.152884
to 0.145891
word 0.141979
can 0.137356
word2vec 0.128606
Query word? ^C


**Explore further**

[Code Source for Official CLI tool section](https://www.analyticsvidhya.com/blog/2017/07/word-representations-text-classification-using-fasttext-nlp-facebook/)

[Code Source for gensim part](https://thinkinfi.com/fasttext-word-embeddings-python-implementation/)