Machine Learning
Curse of Dimensionality
The curse of dimensionality is a cluster of phenomena that make geometric, statistical, and search algorithms degrade as the number of features grows. Volume grows exponentially, samples become sparse, and pairwise distances concentrate, so nearest-neighbor and density-based methods lose discriminative power.
HNSW (Hierarchical Navigable Small World)
HNSW is a graph-based algorithm for {{approximate nearest neighbor}} search introduced by Malkov and Yashunin in 2016. It builds a multi-layer proximity graph in which higher layers contain exponentially fewer nodes with longer-range edges. Searches start at the top layer and descend greedily, giving logarithmic expected complexity at high recall.
Product Quantization
Product quantization (PQ) is a lossy vector compression technique that splits each vector into subvectors and encodes each with a small learned codebook. It shrinks high-dimensional embeddings by 90–97% while still supporting approximate distance computation in the compressed domain, making it a backbone of large-scale ANN indexes.
Vector Embeddings
Vector embeddings are dense numeric representations of items — words, sentences, images, users, products — in a continuous space where geometric proximity reflects semantic similarity. They turn similarity into distance computations, enabling search, clustering, recommendation, and retrieval over unstructured data.
Locality-Sensitive Hashing
Locality-sensitive hashing (LSH) is a family of probabilistic techniques for {{approximate nearest neighbor}} search and near-duplicate detection. Hash functions are chosen so that similar items collide with high probability and dissimilar items collide rarely, turning a similarity search into a bucket lookup with sublinear query time.
Approximate Nearest Neighbor Search
Approximate nearest neighbor (ANN) search finds vectors close to a query in high-dimensional space without guaranteeing the exact best match. Exact k-NN suffers from the curse of dimensionality, so ANN trades a small loss of recall for major gains in latency and memory. Major families include tree-based methods, locality-sensitive hashing, graph indexes such as HNSW, and product quantization, implemented in libraries like FAISS, Annoy, hnswlib, and ScaNN. ANN underpins semantic search, recommendation, deduplication, and RAG retrieval.