Procs
proc embedding[T; Idx: byte or char or SomeInteger](vocab_id: Tensor[Idx]; weight: Tensor[T]): Tensor[T]
Returns the rows of the weight embedding matrix selected by the indices in vocab_id, i.e. the embeddings of the part of the global vocabulary that is present.
The main use case is natural language processing: words (or characters, or groups of words) are first encoded into arbitrary integers, which are then used to index the weight embedding matrix (a sketch of this encoding step follows below).
During training, words that are related become close in some dimensions of the embedding.
For example, to encode a text containing 10000 distinct words into 300-dimensional vectors, we need a [10000, 300] embedding matrix.
Make sure to add an index representing <UNKNOWN> words (words encountered at test time that did not exist in the training vocabulary).
If working with variable-length sequences, <START>, <STOP> and <PAD> "words" are also useful.
In summary, it is a lookup table that maps words to meanings in a high-dimensional space, and it can be trained.
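As a sketch of that encoding step (the vocabulary, token names and indices below are hypothetical, not part of this proc's API):

  import std/tables

  # Hypothetical vocabulary: special "words" first, then ordinary words.
  let vocab = {"<PAD>": 0, "<START>": 1, "<STOP>": 2, "<UNKNOWN>": 3,
               "the": 4, "cat": 5, "sat": 6}.toTable

  proc encode(words: seq[string]): seq[int] =
    ## Maps each word to its vocabulary index, falling back
    ## to the <UNKNOWN> index for words outside the vocabulary.
    for w in words:
      result.add vocab.getOrDefault(w, vocab["<UNKNOWN>"])

  echo encode(@["the", "cat", "jumped"])  # @[4, 5, 3]

The resulting integers are what vocab_id is expected to contain.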
Input:
- A tensor of vocabulary indices, either:
  - a vector of shape [batch_size], or
  - a matrix of shape [batch_size, seq_len]
- A weight matrix that maps those indices to the embedding space, of shape [vocabulary_size, embedding_size]
Result:
- Depending on the shape of the input indices:
  - a tensor of shape [batch_size, embedding_size] for a vector input, or
  - a tensor of shape [batch_size, seq_len, embedding_size] for a matrix input
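A minimal usage sketch, assuming the Arraymancer package is installed; the sizes and the random weights are illustrative only:

  import arraymancer

  let
    # Embedding matrix of shape [vocabulary_size, embedding_size] = [7, 4],
    # filled with random values in [0, 1) for illustration; in a network
    # this would be a learnable parameter updated by backpropagation.
    weight = randomTensor[float32](7, 4, 1.0'f32)
    # A sentence of 3 words, already encoded as vocabulary indices.
    vocabId = [4, 5, 3].toTensor

  let embedded = embedding(vocabId, weight)
  echo embedded.shape  # [3, 4]: one 4-dimensional embedding per index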