For most NLP problems, especially text-based ones, word embedding is the first step in implementing any machine learning algorithm.
What is word embedding?
Word embedding is the process of turning human language into word vectors that a machine can understand. A word vector is a way to represent a word as a set of numbers that can be used to train an ML algorithm. Example: I love you -> [[1, 2, 3], [1, 4, 3], [2, 3, 4]]
Easiest "Word Embedding": one-hot encodingWe can take one-hot encoding as the easiest word embedding approach. The basic idea is to give a unique id for all words in the dictionary from 0 to n, then create a zero vector having the same length as dictionary. For each vector corresponding to each word, change the value to one for the location that index equals to id of the word. For example, I love you can change to [ [1, 0, 0], [0, 1, 0], [0, 0, 1] ]. in this example, we use one hot encoding to change word "I" from a string "I" to a word vector [1, 0, 0]. in other word [1, 0, 0] represent word 'I'. Well, one hot encoding is easy, but it has some drawbacks: The vector dimension increase with the size of the dictionary. in previous example, we embedded 'I' to [1, 0, 0] since our dictionary only contains 3 words("I", "love" and "you"). if we add a new world "she" in our dictionary, word vector will become to 4 dimensions. Vector for 'i' probably will change to "[1, 0, 0, 0]".Anther problem of using one-hot is that word lose some information in this embedding process. like world "SXM" has the same meaning of "Sirius XM", or"King" has some similarity with word "queen" as well. by using one-hot. we will lose all of those information
A new way to represent a word
Essentially, a word is a representation of some object, and an object can be described by features. For example, "king" represents a male human being who has a lot of power and sits in the highest class of society, and it is not a food. We can represent all words by choosing different features to describe them; in theory, we can describe every word if we choose the correct features. Here's an example:
This gives us a new way to do word embedding, instead of one-hot encoding: use features to build the word vector. For example, based on the table above, we can create a new vector to represent "king" as [-0.95, 0.93, 0.7, 0.02]. What is good about this? First, the dimension depends on how many features we choose instead of on the length of the dictionary. Second, we can measure the similarity between different words; for example, "king" and "queen" will have a lot in common at the feature level. Even more, you can do arithmetic on these vectors: if we subtract "queen" from "king", we get a result very close to subtracting "woman" from "man". The following image may give you better intuition:
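To make the feature idea concrete, here is a small sketch with made-up feature vectors (the numbers and feature meanings are only illustrative, since the real table is an image; only the "king" row matches the example in the text):

```python
import numpy as np

# Made-up 4-feature vectors, e.g. something like (royalty, male, power, food).
king  = np.array([-0.95,  0.93, 0.70, 0.02])
queen = np.array([-0.95, -0.93, 0.69, 0.01])
man   = np.array([-0.01,  0.94, 0.03, 0.02])
woman = np.array([-0.02, -0.95, 0.02, 0.01])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; close to 1 means similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - queen and man - woman point in almost the same direction,
# because both differences mostly capture the "male" feature.
print(cosine_similarity(king - queen, man - woman))   # close to 1.0
```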
We call the table in the previous section the embedding matrix, which can be used to create word vectors. Here's the process:
- We create a one-hot vector for each word in the dictionary, using the order in which the word appears in the dictionary. For the word "king", it will be [0, 0, 1, 0, 0, 0, 0]; let's call it Ov.
- Let's use W to denote our embedding matrix.
- Multiplying the one-hot vector by the embedding matrix W gives us the word vector for "king" (see the sketch after this list):
- V = Ov * W
- For the word "king", we get the word vector [-0.95, 0.93, 0.7, 0.02].
- This process is the definition of word embedding.
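Here is a small numeric sketch of that lookup, assuming a made-up 7-word dictionary and 4 features; only the "king" row of W matches the example above:

```python
import numpy as np

vocab = ["a", "the", "king", "queen", "man", "woman", "apple"]
W = np.array([                      # embedding matrix: one row per word, one column per feature
    [ 0.01,  0.02,  0.00,  0.00],
    [ 0.00,  0.01,  0.01,  0.00],
    [-0.95,  0.93,  0.70,  0.02],   # king
    [-0.95, -0.93,  0.69,  0.01],   # queen
    [-0.01,  0.94,  0.03,  0.02],   # man
    [-0.02, -0.95,  0.02,  0.01],   # woman
    [ 0.00,  0.01,  0.00,  0.95],   # apple
])

Ov = np.zeros(len(vocab))
Ov[vocab.index("king")] = 1         # one-hot vector for "king": [0, 0, 1, 0, 0, 0, 0]

V = Ov @ W                          # V = Ov * W picks out the row of W for "king"
print(V)                            # [-0.95  0.93  0.7   0.02]
```

In practice nobody runs this multiplication literally; since the one-hot vector just selects one row, libraries implement it as a row lookup in W.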
Well, it's impossible to choose features manually... but we can use a machine learning algorithm to do this for us! First, we need to prepare a lot of data for training purposes; it will be a bunch of sentences collected from the real world. Then, we use an ML algorithm to train on it!
Algorithm 1: Skip-gram
For example, we get one training sentence like this:
Then we train a model to predict the words that appear around a specific word. For example, given the word "orange", we want to predict a word that appears within a window of six words around it, such as "cereal"; so the input will be "orange" and the output will be "cereal". We can build a model like this (only using the first half to simplify the problem):
Predicting which words appear around a given word is a very hard problem to tackle. BUT, we don't need this model to have very good performance; we only want the parameters trained inside this model, because they form the embedding matrix we are looking for!
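Here is a toy skip-gram sketch in PyTorch, under the assumption of a one-sentence corpus and tiny dimensions (a simplified classroom version, not the exact word2vec implementation):

```python
import torch
import torch.nn as nn

# Toy corpus and sizes, just for illustration.
corpus = "i want a glass of orange juice to go along with my cereal".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
EMBED_DIM, WINDOW = 4, 3

# Build (center word, nearby word) training pairs.
pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if j != i:
            pairs.append((word_to_id[center], word_to_id[corpus[j]]))

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # the matrix we actually want
        self.output = nn.Linear(embed_dim, vocab_size)        # predicts the nearby word

    def forward(self, center_ids):
        return self.output(self.embedding(center_ids))

model = SkipGram(len(vocab), EMBED_DIM)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([t for _, t in pairs])
for _ in range(200):                      # a few training steps on the toy corpus
    optimizer.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    optimizer.step()

# We throw away the classifier and keep only the trained embedding matrix.
embedding_matrix = model.embedding.weight.data
print(embedding_matrix.shape)             # (vocab_size, EMBED_DIM)
```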
Algorithm 2: Skip-gram with Negative Sampling
The problem with skip-gram is that it has too many outputs (one per word in the dictionary), which has a huge impact on training time, so we can modify the task as follows:
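One way to sketch that modification: turn the big softmax into a binary classifier over (word, candidate) pairs, where real context words get label 1 and a few randomly sampled words get label 0. The corpus, window size, and k below are toy values, not word2vec's defaults:

```python
import random
import torch
import torch.nn as nn

corpus = "i want a glass of orange juice to go along with my cereal".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
EMBED_DIM, WINDOW, K = 4, 3, 3

examples = []  # (center_id, candidate_id, label)
for i, center in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if j == i:
            continue
        examples.append((word_to_id[center], word_to_id[corpus[j]], 1.0))   # real context word
        for _ in range(K):                                                  # k random negative words
            examples.append((word_to_id[center], random.randrange(len(vocab)), 0.0))

center_emb = nn.Embedding(len(vocab), EMBED_DIM)    # the embedding matrix we keep
context_emb = nn.Embedding(len(vocab), EMBED_DIM)   # separate embeddings for the candidates
params = list(center_emb.parameters()) + list(context_emb.parameters())
optimizer = torch.optim.SGD(params, lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()                    # binary "is this a real context word?" loss

centers = torch.tensor([c for c, _, _ in examples])
candidates = torch.tensor([t for _, t, _ in examples])
labels = torch.tensor([y for _, _, y in examples])
for _ in range(200):
    optimizer.zero_grad()
    # score = dot product of the two embeddings; sigmoid(score) estimates the label
    scores = (center_emb(centers) * context_emb(candidates)).sum(dim=1)
    loss = loss_fn(scores, labels)
    loss.backward()
    optimizer.step()
```

Each training example now needs only k + 1 binary decisions instead of a softmax over the whole dictionary, which is where the training-time savings come from.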
Algorithm 3: CBOW
CBOW is similar to skip-gram, but where skip-gram uses a word to predict its context, CBOW uses the context to predict the word itself.
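A matching CBOW sketch under the same toy assumptions: average the embeddings of the surrounding words and predict the word in the middle:

```python
import torch
import torch.nn as nn

corpus = "i want a glass of orange juice to go along with my cereal".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
EMBED_DIM, WINDOW = 4, 2

contexts, centers = [], []
for i in range(WINDOW, len(corpus) - WINDOW):
    window = corpus[i - WINDOW:i] + corpus[i + 1:i + WINDOW + 1]   # words around position i
    contexts.append([word_to_id[w] for w in window])
    centers.append(word_to_id[corpus[i]])                          # the word to predict

embedding = nn.Embedding(len(vocab), EMBED_DIM)
output = nn.Linear(EMBED_DIM, len(vocab))
optimizer = torch.optim.SGD(list(embedding.parameters()) + list(output.parameters()), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

context_ids = torch.tensor(contexts)          # shape: (num_examples, 2 * WINDOW)
center_ids = torch.tensor(centers)
for _ in range(200):
    optimizer.zero_grad()
    avg = embedding(context_ids).mean(dim=1)  # average the context-word embeddings
    loss = loss_fn(output(avg), center_ids)   # predict the center word from the average
    loss.backward()
    optimizer.step()
```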




