How to Train a Word2vec Model
In this article, I'll show you how to train a Word2vec model using custom data. You can then use the trained model to perform tasks such as similarity search or powering a recommendation engine.
What is Word2vec?
Word2vec is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. See this documentation to learn more about Word2vec and how you can use it.
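To make the idea concrete, here is a minimal sketch of what a trained model gives you: each word maps to a dense vector, and related words end up with similar vectors. The model path and the query words below are placeholders; we train the actual model later in this article.
# Illustrative sketch only: assumes a model has already been trained and saved
# (we do that later in this article) and that the query words are in its vocabulary.
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")  # placeholder path
v1 = model.wv["science"]   # a dense vector, e.g. 100 floats
v2 = model.wv["physics"]
# Cosine similarity: related words end up with nearby vectors.
similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(similarity)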
Prepare The Training Data
To train a Word2vec model, you will need data. The data is typically a list of lists of tokens: we will call this collection sentences, and each sentence is a list of tokens. Most of the work, therefore, goes into preparing the sentences. For large datasets, Word2vec's documentation recommends streaming the sentences directly from disk or the network, which keeps memory usage low. To achieve this, we will create a Python generator that streams one sentence at a time from disk. See the Python wiki to learn more about generators and how to use them.
Using a generator is memory efficient because we never load the whole dataset into memory at once. Instead, we read chunks of 1,000 rows and tokenize and yield one row at a time. Note that the generator function is wrapped inside a custom iterable class: a generator can only be iterated over once, and the class wrapper lets the Word2vec model iterate over the dataset as many times as it needs. The example below illustrates the difference.
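Here is a small, self-contained illustration. The book titles and the SentenceIterable class are made up for this example; the real Corpus class later in the article does the same thing with the actual dataset.
# What the training data looks like: a list of lists of tokens.
sentences = [
    ["the", "great", "gatsby", "f", "scott", "fitzgerald"],
    ["pride", "and", "prejudice", "jane", "austen"],
]

# A plain generator is exhausted after one pass, but Word2vec iterates over
# the corpus several times: once to build the vocabulary and once per epoch.
def sentence_generator():
    for sentence in sentences:
        yield sentence

gen = sentence_generator()
print(len(list(gen)))  # 2
print(len(list(gen)))  # 0 (the generator is used up)

# Wrapping the generator in a class with a dunder iter method gives
# Word2vec a fresh generator every time it starts a new pass over the data.
class SentenceIterable:
    def __iter__(self):
        return sentence_generator()

iterable = SentenceIterable()
print(len(list(iterable)))  # 2
print(len(list(iterable)))  # 2 (a new generator is created for each pass)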
Initial setup
Import the necessary packages and set up the Django environment.
import os
import sys
import re
import time
import multiprocessing
import pandas as pd
from gensim.models import Word2Vec
from transformers import BertTokenizer
from tqdm import tqdm
# Setup the django environment
MAIN_DIR = os.path.dirname(os.path.dirname(__file__))
sys.path.append(MAIN_DIR)
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "PROJECT.settings")
import django
django.setup()
Streaming the data
# This class encapsulates all the logic required to stream one tokenized
# sentence at a time.
class Corpus:
    def __init__(self) -> None:
        # Pre-compile the regex for better performance.
        self.cleaner = re.compile("[^a-zA-Z0-9]")
        # Initialize the BertTokenizer. We use the BERT tokenizer to
        # split each sentence into a list of tokens.
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.path = os.path.join(MAIN_DIR, "BooksDatasetClean.csv")

    def __iter__(self):
        # The presence of a dunder iter method on our class means
        # that you can iterate over an object instantiated from this class.
        # The dunder iter method iterates over the iterator object returned
        # by calling our generator function and yields the elements in the
        # iterable. We wrap tqdm around our iterator object to get progress
        # updates. See tqdm's documentation to learn more about it. (https://tqdm.github.io/)
        for sentence in tqdm(self.sentences(), desc="Processing sentences"):
            yield sentence

    # Define a sentence generator called sentences. A generator looks
    # like a normal python function, but with one key difference:
    # a generator yields items one at a time instead of returning
    # the entire collection like a function would.
    def sentences(self, chunk_size=1000):
        # Read a chunk of the dataset from disk and start a for loop
        # to iterate over the chunk.
        for chunk in pd.read_csv(self.path, chunksize=chunk_size):
            # Use the dataframe's iterrows method to lazily iterate over
            # each row in the dataframe.
            for _, row in chunk.iterrows():
                # Extract attributes from each row.
                title = row["Title"]
                authors = row["Authors"]
                description = row["Description"]
                category = row["Category"]
                publisher = row["Publisher"]
                # Create a sentence from the attributes.
                sentence = f"{title} {authors} {description} {category} {publisher}"
                # Clean the sentence with the pre-compiled regex.
                sentence = self.cleaner.sub(" ", sentence)
                # Tokenize the sentence with the tokenizer instance.
                tokenized_sentence = self.tokenizer.tokenize(sentence)
                # Yield the tokenized sentence.
                yield tokenized_sentence
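Before training, it can be worth sanity-checking the corpus by pulling a single tokenized sentence from it. This quick check assumes BooksDatasetClean.csv exists at the expected path and has the columns used above.
# Pull the first tokenized sentence from the corpus to verify everything works.
corpus = Corpus()
first_sentence = next(iter(corpus))
print(first_sentence[:10])  # the first few WordPiece tokens of the first book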
Training the model
After preparing the data, we can now use the tokenized sentences to train the Word2vec model.
if __name__ == "__main__":
    # Save the current time. This is used to track how long it takes the
    # script to run.
    start_time = time.time()
    # Instantiate an iterable object from the custom iterable class
    # called Corpus.
    corpus_iterable = Corpus()
    # This is where the actual training happens. To train the model, we
    # instantiate the Word2Vec class and pass parameters to the
    # class's constructor. The parameters are as follows:
    w2v = Word2Vec(
        # A list of lists of tokens
        sentences=corpus_iterable,
        # Dimensionality of the word vectors
        vector_size=100,
        # Maximum distance between the current word and the predicted
        # word within a sentence
        window=5,
        # Number of worker threads used to train the model
        workers=multiprocessing.cpu_count(),
        # Number of iterations over the corpus
        epochs=10,
        # The number of times a word has to appear in the corpus to be
        # included in the model's vocabulary
        min_count=4,
    )
    # Save two versions of the model to disk: the full binary model and the
    # plain-text word vectors.
    model_name = "word2vec.model"
    path = os.path.join(MAIN_DIR, model_name)
    w2v.save(path)
    w2v.wv.save_word2vec_format(f"{path}.txt", binary=False)
    end_time = time.time()
    duration = end_time - start_time
    print(f"Completed in {duration:.2f} seconds.")
Conclusion
Training a Word2vec model involves preparing the dataset, streaming it efficiently from disk, and passing the data to the model's constructor. Finally, the model can be saved to disk and later loaded by any application that needs it.
