Rohan Paul (@rohanpaul_ai)
2024-08-09 | ❤️ 311 | 🔁 46
LLM Basics - Binary Quantization 🔥
🧵 A thread - 1/n 👇
The concept itself isn't new, but what's reignited interest is the recent announcement from @cohere regarding their support for int8 and binary embeddings in their Cohere embed v3.
📌 First, in essence, embeddings are numerical representations of more complex objects, like text, images, audio, etc. Specifically, the objects are represented as n-dimensional vectors.
After transforming the complex objects, you can determine their similarity by calculating the similarity of the respective embeddings! This is crucial for many use cases: it serves as the backbone for recommendation systems, retrieval, one-shot or few-shot learning, outlier detection, similarity search, paraphrase detection, clustering, classification, and much more.
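To make "calculating the similarity of the respective embeddings" concrete, here is a minimal sketch of cosine similarity, the metric most commonly used for comparing embeddings (the thread doesn't specify a metric, so this is an illustrative assumption):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    # Ranges from -1 (opposite) to 1 (identical direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two toy "embeddings" (real ones have hundreds or thousands of dimensions).
v1 = [0.1, 0.3, -0.2]
v2 = [0.2, 0.1, -0.1]
print(cosine_similarity(v1, v2))
```

Retrieval and clustering systems then rank or group objects by this score.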
📌 Binary Quantization for embeddings
Unlike quantization in models where you reduce the precision of weights, quantization for embeddings refers to a post-processing step for the embeddings themselves. In particular, binary quantization refers to the conversion of the float32 values in an embedding to 1-bit values, resulting in a 32x reduction in memory and storage usage.
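The 32x figure follows directly from the bit widths: each float32 value takes 32 bits, while a binary value takes 1 bit. A back-of-envelope sketch (the 1024-dimensional size is an assumption for illustration):

```python
dims = 1024                      # assumed embedding dimensionality for illustration
float32_bytes = dims * 4         # float32 = 4 bytes (32 bits) per dimension
binary_bytes = dims // 8         # 1 bit per dimension, packed 8 per byte

print(float32_bytes)                   # 4096 bytes per float32 embedding
print(binary_bytes)                    # 128 bytes per binary embedding
print(float32_bytes // binary_bytes)   # 32x reduction
```

At scale this matters: a billion such float32 embeddings need roughly 4 TB, versus about 128 GB after binary quantization.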
✨ Binary quantization example
Vector embeddings are usually generated by embedding models, such as Cohere's embed v3, and a single vector embedding will look like the following:
[0.056, -0.128, -0.029, 0.047, …, 0.135]
To quantize float32 embeddings to binary, we simply threshold the normalized embeddings at 0.
That is, because these embeddings contain small values centered around zero, you can map each dimension to a single bit:
1: if the value is greater than or equal to 0.
0: if the value is less than 0.
So you get something like this:
[1, 0, 0, …, 1]
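The thresholding rule above can be sketched in a few lines of Python (using the example values from this thread; real embeddings are much longer):

```python
def binarize(embedding):
    # Threshold each float32 value at 0: >= 0 maps to 1, < 0 maps to 0.
    return [1 if v >= 0 else 0 for v in embedding]

# The (truncated) example embedding from this thread.
emb = [0.056, -0.128, -0.029, 0.047, 0.135]
print(binarize(emb))  # -> [1, 0, 0, 1, 1]
```

After binarization, similarity is typically computed with Hamming distance (counting differing bits), which is far cheaper than float dot products.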
