{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Q_ptvt03X9l_" }, "source": [ "# 4.2.3. FastText\n", "\n", "A common problem in Natural Language Processing (NLP) tasks is capturing the context in which a word is used. Words with the same spelling and pronunciation (`homonyms`) can appear in multiple contexts, and one potential solution to this problem is to learn word embeddings.\n", "\n", "**FastText** is a library created by the Facebook Research Team for efficient learning of word representations, like [Word2Vec](https://pythonandml.github.io/dlbook/content/word_embeddings/word2vec.html) (link to previous chapter) or [GloVe](https://pythonandml.github.io/dlbook/content/word_embeddings/glove.html) (link to previous chapter), and for sentence classification. Its embeddings are a type of [static word embedding](https://pythonandml.github.io/dlbook/content/word_embeddings/static_word_embeddings.html) (link to previous chapter). For more detail, see the official [fastText paper](https://arxiv.org/pdf/1607.04606.pdf).\n", "\n", ":::{note}\n", "`FastText` differs in that [Word2Vec](https://pythonandml.github.io/dlbook/content/word_embeddings/word2vec.html) (link to previous chapter) treats every **single word** as the **smallest unit** whose vector representation is to be found, whereas `FastText` assumes that a word is formed from **character n-grams**.\n", "\n", "For example:\n", "\n", "the word `sunny` is composed of n-grams such as `[sun, sunn, sunny]`, `[sunny, unny, nny]`, etc., where $n$ can range from 1 to the length of the word.\n", ":::\n", "\n", "**Examples of character n-grams of different lengths are given below. The `<` and `>` symbols mark the beginning and end of the word:**\n", "\n", "\n", "\n", "[Image Source](https://amitness.com/2020/06/fasttext-embeddings/)\n", "\n", "
| Word | Length (n) | Character n-grams |\n", "
| --- | --- | --- |\n", "
| eating | 3 | `<ea`, `eat`, `ati`, `tin`, `ing`, `ng>` |\n", "
| eating | 4 | `<eat`, `eati`, `atin`, `ting`, `ing>` |\n", "
| eating | 5 | `<eati`, `eatin`, `ating`, `ting>` |\n", "
| eating | 6 | `<eatin`, `eating`, `ating>` |\n", "