This post is intended to help readers understand the algorithm behind our recommendation engine, which is powered by Collaborative Filtering (CF). Though there are many CF algorithms, we chose item2vec, a variant of the popular word2vec algorithm primarily used for Natural Language Processing (NLP).
What is NLP?
Computers understand programming languages, which are precise, unambiguous and highly structured. Human speech, however, is not always precise — it is often ambiguous and its linguistic structure can depend on many complex variables, including slang, regional dialects, and social context. Natural Language Processing (NLP) is focused on enabling computers to understand and process human languages.
How often do you use search engines in a day? If you feel you depend a lot on search engines like Google, then you are kind of dependent on NLP as well.
When you type a few words into the search box, you are immediately prompted with relevant options to auto-complete your search. This is just one of many applications of NLP in our daily lives; other examples include language translation and more.
NLP uses word embeddings to map words or phrases from a vocabulary into a mathematical form called vectors. This representation has two important and advantageous properties:
Dimensionality Reduction: It is a more efficient representation.
Contextual Similarity: It is a more expressive representation.
Don’t scratch your head. Things will become clearer shortly.
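As a rough sketch of these two properties (the vocabulary and vector values below are made up for illustration, not taken from any trained model):

```python
import numpy as np

# Hypothetical 5-word vocabulary: a one-hot encoding needs one dimension per word,
# so it grows with vocabulary size and says nothing about meaning.
one_hot = {
    "king":  np.array([1, 0, 0, 0, 0]),
    "queen": np.array([0, 1, 0, 0, 0]),
    "apple": np.array([0, 0, 1, 0, 0]),
}

# Made-up 3-dimensional embeddings: more compact (dimensionality reduction),
# and related words sit close together in the space (contextual similarity).
embedding = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.75, 0.70, 0.12]),
    "apple": np.array([0.05, 0.10, 0.90]),
}

print(np.linalg.norm(embedding["king"] - embedding["queen"]))  # small distance: related words
print(np.linalg.norm(embedding["king"] - embedding["apple"]))  # large distance: unrelated words
```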
We chose one particular NLP technique to recommend relevant products.
Why We Chose Word2vec as the NLP Technique
Word2vec is an NLP technique that produces word embeddings.
Say you’re given sentences such as:
Beijing is China’s capital city
Madrid, Spain’s central capital, is a city of elegant boulevards
Tokyo is the enormous and wealthy capital city of Japan
A cursory glance at the three sentences above reveals a pattern: the words city and capital tend to appear together with country names. Extending this to many similar sentences can yield the relationship shown below:
[Figure: country and capital-city word vectors, with arrows showing the shared country-to-capital relationship]
Word2vec is a shallow neural network that trains on sentences to learn such patterns by turning words into mathematical objects. This mathematical object, commonly known as a vector, has size (magnitude) and direction, shown as arrows above. Since we have vectors, we can now perform mathematical operations on them, such as addition, subtraction, or measuring the similarity between them.
Word2vec gained popularity by transforming words into vectors that support mathematical operations like these:
King – Man + Woman = Queen
Moscow – Russia + France = Paris
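Here is a toy illustration of that arithmetic with hand-picked two-dimensional vectors (a real word2vec model learns vectors with hundreds of dimensions from large text corpora):

```python
import numpy as np

# Hand-picked 2-dimensional vectors chosen to mimic the famous analogy;
# a trained word2vec model learns much higher-dimensional vectors from text.
vectors = {
    "king":  np.array([0.90, 0.80]),
    "man":   np.array([0.50, 0.10]),
    "woman": np.array([0.45, 0.15]),
    "queen": np.array([0.85, 0.85]),
}

# "King - Man + Woman" produces a new vector...
result = vectors["king"] - vectors["man"] + vectors["woman"]

# ...and the nearest known vector to it is "queen".
# (Real implementations usually exclude the query words from the candidates.)
closest = min(vectors, key=lambda word: np.linalg.norm(result - vectors[word]))
print(closest)  # queen
```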
If word2vec is able to capture this contextual information from sentences, why can’t it be extended to user behavior, where sentences could represent the items the user has purchased/viewed?
This is exactly what item2vec, the word2vec variant we use, does.
Using item2vec, we arrange user purchases into sentences, maintaining the sequence of items purchased, and then use a neural network to transform those item sequences into vectors (mathematical objects). Now it is possible to perform mathematical operations on the items in a sequence to find similar items to recommend.
For example, consider two users' watch histories:
User A’s watch history can be represented as [Game of Thrones, Avengers, Captain Marvel].
User B’s watch history can be represented as [Avengers, Captain Marvel, House of Cards].
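As a sketch of how such item sequences can be turned into vectors, here is gensim's Word2Vec used as a stand-in for item2vec; the data and parameters below are illustrative and not our production setup:

```python
from gensim.models import Word2Vec

# Each user's history becomes one "sentence" of items (toy data from the example above).
item_sequences = [
    ["Game of Thrones", "Avengers", "Captain Marvel"],   # User A
    ["Avengers", "Captain Marvel", "House of Cards"],    # User B
]

# Skip-gram Word2Vec over item sequences; these hyperparameters are only examples.
model = Word2Vec(
    sentences=item_sequences,
    vector_size=16,   # size of the learned item vectors
    window=2,         # how many neighbouring items count as "context"
    min_count=1,      # keep items even if they appear only once
    sg=1,             # use the skip-gram variant
    seed=42,
)

# Items that appear in similar contexts end up with similar vectors.
print(model.wv.most_similar("Avengers", topn=2))
```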
To get from these item sequences to recommendations, we do four things:
Create a training set from the historical user purchase item sequences. The training set consists of pairs of a single item input and a related item output.
Turn these items from strings into numerical representations called vectors.
Feed the vectorized training set into a Neural Network, which learns optimized vector representations for the items.
Use the optimized vector representations to determine how similar items are, using measures such as Pearson correlation and cosine similarity (see the sketch after this list).
We can then rank our item recommendations accordingly.
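A minimal sketch of that last step, assuming we already have learned item vectors (the vectors below are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means "pointing the same way".
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson_correlation(a, b):
    # Pearson correlation between the components of the two vectors.
    return np.corrcoef(a, b)[0, 1]

# Made-up item vectors standing in for the embeddings learned in the previous step.
item_vectors = {
    "Avengers":       np.array([0.90, 0.10, 0.30]),
    "Captain Marvel": np.array([0.85, 0.15, 0.35]),
    "House of Cards": np.array([0.10, 0.90, 0.20]),
}

query = "Avengers"
scores = {
    item: cosine_similarity(item_vectors[query], vector)
    for item, vector in item_vectors.items()
    if item != query
}

# Rank candidate recommendations from most to least similar to the query item.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['Captain Marvel', 'House of Cards']
print(pearson_correlation(item_vectors["Avengers"], item_vectors["Captain Marvel"]))
```

Cosine similarity compares the direction of two vectors, while Pearson correlation is equivalent to the cosine similarity of mean-centred vectors, so the two measures often produce similar rankings.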
To create the training set, we generate input-output pairs from items that appear next to each other in the item sequences. For example:
Item Sequence: Game of Thrones → Avengers → Captain Marvel
Training input-output pairs: (Game of Thrones, Avengers), (Avengers, Game of Thrones), (Avengers, Captain Marvel), (Captain Marvel, Avengers)

Item Sequence: Avengers → Captain Marvel → House of Cards
Training input-output pairs: (Avengers, Captain Marvel), (Captain Marvel, Avengers), (Captain Marvel, House of Cards), (House of Cards, Captain Marvel)
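A small helper along these lines could generate those pairs from the item sequences (the function name and window size of 1 are illustrative assumptions, not our production code):

```python
def generate_training_pairs(sequences, window=1):
    """Create (input_item, context_item) pairs from ordered item sequences."""
    pairs = []
    for sequence in sequences:
        for i, item in enumerate(sequence):
            # Every item within `window` positions of the current item is a context item.
            for j in range(max(0, i - window), min(len(sequence), i + window + 1)):
                if j != i:
                    pairs.append((item, sequence[j]))
    return pairs

sequences = [
    ["Game of Thrones", "Avengers", "Captain Marvel"],
    ["Avengers", "Captain Marvel", "House of Cards"],
]
for pair in generate_training_pairs(sequences):
    print(pair)
# ('Game of Thrones', 'Avengers'), ('Avengers', 'Game of Thrones'),
# ('Avengers', 'Captain Marvel'), ('Captain Marvel', 'Avengers'), ...
```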
We can create the training dataset in the same manner for other users. If, in the dataset, we get more cases of (Avengers, Captain Marvel)