Every now and then, you need embeddings when training machine learning models. But what exactly is an embedding, and why do we use it?

Basically, an embedding maps one representation into a space with a different number of dimensions. That doesn’t make things much clearer, does it?

So, let’s consider an example: we want to train a recommender system on a movie database (a typical Netflix use case). We have many movies and information about the ratings that users have given them.

Assume we have 10 users and 100 movies. When training a machine learning model, we need to represent the users and movies as numbers, so we could just create vectors of size 10 for users and size 100 for movies respectively.

For user 4, the vector would be all zeros except for a 1 at the fourth position (0-based index 3). This is called a sparse vector, as most of its entries are 0.

[0 0 0 1 0 0 0 0 ...] - user number 4
len([0 0 0 1 0 0 0 ...]) = 10
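
As a quick sketch in plain Python (assuming, as above, that user 4 sits at 0-based index 3), such a one-hot vector can be built like this:

```python
def one_hot(index, length):
    """Return a sparse one-hot vector with a 1 at the given position."""
    vector = [0] * length
    vector[index] = 1
    return vector

# User 4 sits at index 3 (0-based), in a population of 10 users.
user_4 = one_hot(3, 10)
print(user_4)       # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(len(user_4))  # 10
```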

Using this representation, we can identify our users and our movies, but the representation doesn’t really tell us much about our problem. Ideally, we would want to learn something about the relations between users and movies: is user 2 similar to user 4? Is movie 15 similar to movie 25? And, even more importantly: would user 4 like movie 15, given their ratings of other movies?

Embeddings can help us in this case, because we can map a user and a movie to a vector with several “properties”, rather than just a single number. You can think of these “properties” as dimensions of interest; for example, one dimension could capture whether a movie is action-packed or quiet. The same dimension for a user could then capture whether that user tends to like action-packed movies or not.

So instead of having a big sparse vector for a user or a movie, we want to have a representation of the interests of the user or the type of the movie. An embedding can do that for us as it is like a lookup table to map from a value to a content vector.

If we assume that our movies can be described along 3 different dimensions, and thus the preferences of our users along the same 3 dimensions, we want a 3-dimensional vector per user and per movie:

-> Mapping from [0 0 0 1 0 0 0 0 ...] or user index 3 to [0.54 0.23 -0.4]

Embeddings are basically a big lookup table mapping from values to vectors. The cool thing is that the resulting vectors can in turn be used as parameters in our model, so we can adjust them during training to create embeddings that better capture the meaning of our data.
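
To make the lookup-table view concrete, here is a small sketch in plain Python with made-up numbers: looking up a row of the table directly gives the same result as multiplying the one-hot vector with the whole table.

```python
# A hypothetical lookup table: 10 users, 3 embedding dimensions (made-up numbers).
table = [[0.1 * i, 0.2 * i, -0.1 * i] for i in range(10)]

# One-hot vector for user index 3.
one_hot = [0] * 10
one_hot[3] = 1

# One-hot vector times table (a matrix-vector product) picks out row 3.
via_matmul = [sum(o * row[d] for o, row in zip(one_hot, table)) for d in range(3)]
print(via_matmul == table[3])  # True
```

This is why an embedding layer is often described as a lookup table: the lookup is just a cheap shortcut for that matrix multiplication.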

Over time, gradient descent adjusts the embeddings to better represent the interests of our users. We initialize the embeddings at random, then use the user and movie embeddings to predict the rating a user would give each movie. Since we have real ratings from users, we can compute an error from the deviation of our predictions, use gradient descent to reduce that error by tweaking the embeddings, and repeat. In the end, the user embeddings model users’ interests, the movie embeddings model movie characteristics, and we can find movies a user might like by looking for movies whose embeddings are similar to those of movies the user rated highly.
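
The loop just described can be sketched in PyTorch as follows. Note the assumptions: the rating triples are made up for illustration, and predicting a rating as the dot product of the user and movie embeddings is one simple modelling choice among many:

```python
import torch
from torch.nn import Embedding

torch.manual_seed(0)  # for reproducibility

num_users, num_movies, embedding_dimensions = 10, 100, 3
user_embedding = Embedding(num_users, embedding_dimensions)
movie_embedding = Embedding(num_movies, embedding_dimensions)

# Hypothetical observed ratings: (user, movie, rating) triples, made up here.
users = torch.tensor([0, 1, 2, 3])
movies = torch.tensor([10, 25, 7, 42])
ratings = torch.tensor([5.0, 3.0, 1.0, 4.0])

optimizer = torch.optim.SGD(
    list(user_embedding.parameters()) + list(movie_embedding.parameters()), lr=0.1
)

losses = []
for step in range(100):
    # Predict each rating as the dot product of the user and movie embeddings.
    predictions = (user_embedding(users) * movie_embedding(movies)).sum(dim=1)
    loss = torch.nn.functional.mse_loss(predictions, ratings)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss went from {losses[0]:.3f} to {losses[-1]:.3f}")
```

After training, the rows of the embedding tables are no longer random: they have been pushed toward values whose dot products reproduce the observed ratings.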

Code often tells us more than plain text, so let’s take a look at a PyTorch example of how to make such an embedding (the learning part is not explained here):

import torch
from torch.nn import Embedding

num_users = 10
num_movies = 100
embedding_dimensions = 3

user_embedding = Embedding(num_users, embedding_dimensions)
movies_embedding = Embedding(num_movies, embedding_dimensions)

Getting the representation of our first user (user number 0):

user_embedding(torch.tensor([0]))

tensor([[ 0.8759,  1.3393, -0.8696]], grad_fn=<EmbeddingBackward>)

This is how easy it is to use embeddings in PyTorch.

There are many other use cases for embeddings. In language models, for example, embeddings map words to semantically meaningful vectors, so we can measure how similar two words are. A famous example with word vectors: take the embedding vector of the word “king”, subtract the vector of the word “man”, and add the vector of the word “woman”, and you end up approximately at the vector of the word “queen”.
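
As a toy sketch of that arithmetic (with tiny, hand-crafted 2-dimensional vectors invented purely for illustration, not real word2vec output):

```python
# Toy, hand-crafted 2-dimensional "word vectors" (made up for illustration):
# the first dimension roughly encodes royalty, the second roughly encodes gender.
vectors = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

# king - man + woman ...
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

# ... lands closest (by squared distance) to the vector for "queen".
closest = min(vectors, key=lambda w: sum((r - v) ** 2 for r, v in zip(result, vectors[w])))
print(closest)  # queen
```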