Machine Learning

6 chunks

Curse of Dimensionality

The curse of dimensionality is a cluster of phenomena that make geometric, statistical, and search algorithms degrade as the number of features grows. Volume grows exponentially, samples become sparse, and pairwise distances concentrate, so nearest-neighbor and density-based methods lose discriminative power.

93%
0

HNSW (Hierarchical Navigable Small World)

HNSW is a graph-based algorithm for {{approximate nearest neighbor}} search introduced by Malkov and Yashunin in 2016. It builds a multi-layer proximity graph in which higher layers contain exponentially fewer nodes with longer-range edges. Searches start at the top layer and descend greedily, giving logarithmic expected complexity at high recall.

92%
3

Product Quantization

Product quantization (PQ) is a lossy vector compression technique that splits each vector into subvectors and encodes each with a small learned codebook. It shrinks high-dimensional embeddings by 90–97% while still supporting approximate distance computation in the compressed domain, making it a backbone of large-scale ANN indexes.

92%
2

Vector Embeddings

Vector embeddings are dense numeric representations of items — words, sentences, images, users, products — in a continuous space where geometric proximity reflects semantic similarity. They turn similarity into distance computations, enabling search, clustering, recommendation, and retrieval over unstructured data.

92%
2

Locality-Sensitive Hashing

Locality-sensitive hashing (LSH) is a family of probabilistic techniques for {{approximate nearest neighbor}} search and near-duplicate detection. Hash functions are chosen so that similar items collide with high probability and dissimilar items collide rarely, turning a similarity search into a bucket lookup with sublinear query time.

92%
2

Approximate Nearest Neighbor Search

Approximate nearest neighbor (ANN) search finds vectors close to a query in high-dimensional space without guaranteeing the exact best match. Exact k-NN suffers from the curse of dimensionality, so ANN trades a small loss of recall for major gains in latency and memory. Major families include tree-based methods, locality-sensitive hashing, graph indexes such as HNSW, and product quantization, implemented in libraries like FAISS, Annoy, hnswlib, and ScaNN. ANN underpins semantic search, recommendation, deduplication, and RAG retrieval.

91%
2