The field moves fast. New architectures, alignment techniques, quantization formats, and sampling strategies appear every few weeks, each with its own acronym and a dozen conflicting definitions on the internet. This is my personal reference: 129 terms, one table, sorted A to Z, followed by a handful of short Python sketches for the mechanics that are easier to see in code than in prose.
| # | Term | Meaning | Example |
|---|---|---|---|
| 1 | Abliterated | A model that has had its refusal behaviors removed via targeted weight edits, without full retraining. | Base model + abliteration responds to prompts a standard instruct model declines. |
| 2 | Agent | An LLM connected to a runtime and tools so it can plan, execute actions, and iterate on results. | An LLM writes a Python script; an Agent runs it, crashes your server, and autonomously orders a Jimmy John’s sandwich to cope with the stress. |
| 3 | Agentic RAG | RAG extended with autonomous multi-step reasoning, tool calls, and iterative retrieval. | A system that doesn’t just retrieve the manual for a Breville Bambino Plus, but iterates on tool calls to buy endless espresso accessories you didn’t know you needed. |
| 4 | ANN Search | Searching a vector index for approximate nearest neighbors rather than doing exact brute-force comparison. | HNSW on 10M embeddings returns top-10 matches in milliseconds vs. minutes for exact search. |
| 5 | Attention Mechanism | A neural module that computes a weighted sum of value vectors based on query-key similarity for every token pair. | In “The cat sat on the mat,” high attention between “cat” and “sat” links subject and verb. (See sketch after the table.) |
| 6 | Auto-regressive | A generation mode where each new token is predicted conditioned on all previously generated tokens. | “Today’s weather is sunny and” → model predicts “warm.” |
| 7 | AWQ (Activation-aware Weight Quantization) | A quantization method that preserves weights feeding high-activation channels to minimize quality loss at low bit widths. | LLaMA-3-70B-AWQ at 4-bit fits on a 48 GB GPU while retaining near-FP16 quality. |
| 8 | Beam Search | A deterministic decoding strategy that tracks B partial sequences in parallel and returns the highest-probability complete sequence. | Beam width 4 expands 4 hypotheses per step and prunes back to the best 4 after each token. |
| 9 | BERT | An encoder-only transformer pre-trained on masked token prediction and next-sentence prediction. | bert-base-uncased encodes a sentence into a 768-dim vector used for classification and retrieval. |
| 10 | BM25 | A sparse keyword ranking function based on term frequency and inverse document frequency. | Query “LLM inference” scores documents by exact token overlap; a key component of hybrid search pipelines. |
| 11 | BPE (Byte Pair Encoding) | A tokenization algorithm that iteratively merges the most frequent adjacent character pair into a new subword token. | “unhappiness” → [“un”, “happi”, “ness”] depending on the trained vocabulary. (See sketch after the table.) |
| 12 | Chain-of-Thought (CoT) | A prompting technique that asks the model to produce step-by-step reasoning before its final answer. | “Let’s think step by step: 23 × 17 = 20×17 + 3×17 = 340 + 51 = 391.” |
| 13 | Chunking | Splitting a document into smaller segments before embedding for retrieval. | A 20-page PDF (~12K tokens) split into 512-token chunks with 50-token overlap produces roughly 26 retrievable units. (See sketch after the table.) |
| 14 | CLIP / SigLIP | Vision-language models that learn aligned image and text embeddings via contrastive training. | A photo of a dog and the caption “golden retriever” map to nearby vectors in a shared embedding space. |
| 15 | Context Window | The maximum number of tokens an LLM can process in a single forward pass, covering both prompt and output. | GPT-4 Turbo: 128K tokens; Claude 3: 200K tokens; Gemini 1.5 Pro: 1M tokens. |
| 16 | Contextual Embeddings | Embeddings where a token’s vector depends on its surrounding context, not just the token itself. | “bank” in “river bank” and “bank account” gets distinct vectors from models like BERT. |
| 17 | Contrastive Search | A decoding method that balances model confidence with diversity by penalizing candidates similar to recent token representations. | At alpha=0.6, k=4, the model stays coherent while avoiding repetitive phrase loops. |
| 18 | Conversational AI | AI systems optimized for sustained, multi-turn dialogue that tracks context across turns. | ChatGPT, Claude, and Gemini are the primary consumer conversational AI products. |
| 19 | Cosine Similarity | The dot product of two unit-norm vectors; measures directional similarity independent of magnitude. | Embedding(“Paris”) and Embedding(“capital of France”) have cosine similarity around 0.92. (See sketch after the table.) |
| 20 | Cross-Attention | Attention where queries come from one sequence and keys/values come from another. | In T5, the decoder attends to encoder representations via cross-attention to generate the target sequence. |
| 21 | Cross-Encoder Reranker | A model that jointly encodes a query-document pair into a single sequence and produces a relevance score. | A cross-encoder beats a bi-encoder on relevance but is ~100× slower; used as a second-stage reranker. |
| 22 | Decoder-only Transformer | A transformer that generates text autoregressively, each token attending only to preceding tokens via causal masking. | GPT-4, LLaMA, and Mistral are all decoder-only transformers. |
| 23 | DPO (Direct Preference Optimization) | A fine-tuning method that optimizes preference learning via a supervised loss, skipping the explicit reward model. | DPO trains on (prompt, chosen, rejected) triples and is cheaper and more stable than PPO-based RLHF. |
| 24 | Distillation | Training a smaller student model to match the output distribution of a larger teacher model. | DistilBERT (66M params) is a student of BERT-base (110M) that retains 97% of downstream performance. |
| 25 | DRY (Don’t Repeat Yourself) Sampling | A sampler that penalizes tokens that would continue any n-gram pattern already present in the context. | If “once upon a” appeared earlier, any token that would recreate that sequence is exponentially penalized. |
| 26 | Dynamic Temperature Sampling | A variant of temperature sampling that adjusts the temperature value at each step based on local entropy. | Temperature rises in high-uncertainty regions to encourage exploration and falls when the model is confident. |
| 27 | Embeddings | Dense numerical vectors representing words, sentences, or other data in a semantic space. | king − man + woman ≈ queen in Word2Vec embedding space. |
| 28 | Encoder-Decoder Transformer | A transformer with both an encoder to represent input and a decoder to generate output. | T5 and BART are encoder-decoder models used for translation and summarization. |
| 29 | Encoder-only Transformer | A transformer that produces bidirectional representations of its full input, with no causal masking. | BERT and RoBERTa are encoder-only models used for classification and retrieval. |
| 30 | Entropy | A measure of uncertainty in the model’s probability distribution over the vocabulary at a given step. | High entropy: many tokens have similar probability. Low entropy: model is confident about the next token. |
| 31 | Epsilon Cutoff | A sampler that removes any token whose probability falls below a fixed absolute threshold ε. | With epsilon=0.0001, any token with probability below 0.01% is masked out regardless of context. |
| 32 | Eta Cutoff | A sampler that scales the filtering threshold relative to the distribution’s entropy, adapting to model confidence. | When the model is uncertain (high entropy), eta cutoff is lenient; when confident, it is strict. |
| 33 | Explainability | The ability of an AI system to surface the sources or reasoning behind its answer. | A RAG system citing the exact retrieved passage it used to generate a response is highly explainable. |
| 34 | Few-Shot Learning | Providing a small number of example input-output pairs in the prompt to guide the model’s response format. | “Hello → Bonjour; Goodbye → Au revoir; Good morning → ?” teaches translation by demonstration. |
| 35 | Fine-Tuning | Continuing training of a pre-trained model on a smaller, task-specific dataset. | A base LLaMA-3 model fine-tuned on medical Q&A produces more accurate clinical responses. |
| 36 | Flash Attention | A fused attention kernel that avoids materializing the full N×N attention matrix in memory, reducing memory I/O. | Flash Attention 2 enables training on sequences 8× longer for the same GPU memory budget. |
| 37 | Foundation Model | A large model trained on broad data that can be adapted to many downstream tasks via fine-tuning or prompting. | GPT-4, LLaMA-3, and Gemini are foundation models; domain-specific variants are built on top of them. |
| 38 | Frequency Penalty | A sampler that subtracts a penalty from each token’s logit proportional to how many times that token has appeared in the output. | Token “the” appearing 5 times gets logit reduced by 5×λ, progressively discouraging its reuse. (See sketch after the table.) |
| 39 | Function Calling / Tool Use | The capability for an LLM to emit structured calls to external functions or APIs with typed parameters. | {"name": "get_weather", "parameters": {"location": "San Francisco", "unit": "celsius"}} |
| 40 | GGUF | A container format for quantized model weights optimized for local inference with llama.cpp. | Llama-3-70B-Instruct.Q4_K_M.gguf is a 4-bit quantized file that runs on 48 GB of unified RAM. |
| 41 | GPT (Generative Pre-trained Transformer) | A family of decoder-only transformers pre-trained on causal language modeling, developed by OpenAI. | GPT-3 (175B) demonstrated few-shot learning; GPT-3.5 powered the original ChatGPT. |
| 42 | GPTQ | A one-shot post-training quantization method that minimizes weight error layer by layer using approximate second-order (Hessian) information. | GPTQ at INT4 requires ~35 GB for LLaMA-3-70B, compared to ~140 GB at FP16. |
| 43 | GQA (Grouped Query Attention) | An attention variant that groups query heads to share a single set of key-value heads, shrinking KV cache size. | LLaMA-3-70B uses 8 KV heads for 64 query heads via GQA, cutting KV cache memory by 8×. |
| 44 | Greedy Sampling | Decoding strategy that always selects the single highest-probability token at each step; fully deterministic. | “Paris is the capital of” with greedy decoding reliably produces “France” with no randomness. |
| 45 | Grounding | Connecting LLM outputs to verifiable external facts, documents, or databases to reduce hallucinations. | RAG grounds responses by injecting retrieved passages as context before the model generates its answer. |
| 46 | Hallucination | An LLM-generated response that is factually incorrect or fabricated, delivered with apparent confidence. | In July 2023, ChatGPT stated Will Smith had never assaulted anyone, despite the 2022 Oscars incident. |
| 47 | HNSW (Hierarchical Navigable Small World) | A graph-based ANN index that builds multi-layer proximity graphs for fast approximate nearest-neighbor search. | Qdrant and Weaviate use HNSW internally; query latency is typically under 5 ms on millions of vectors. |
| 48 | Hugging Face | The primary open-source platform for sharing models, datasets, and ML tooling. | The transformers library, Model Hub, and Spaces are its core products. |
| 49 | HumanEval | A coding benchmark of 164 hand-crafted Python problems, scored by pass@k (fraction solved in k attempts). | GPT-4 scored ~67% pass@1 at launch; GPT-3.5 scored ~48%. |
| 50 | Hybrid Search | Retrieval that combines dense vector similarity search with sparse BM25 keyword matching and merges scores. | “XPS-9530 GPU upgrade” benefits from keyword matching; “good laptop for gaming” benefits from semantics. |
| 51 | Inference | Running a trained model on new input to produce an output; the opposite of training. | Sending a prompt to the Claude API and receiving a completion is inference. |
| 52 | Instruction Tuning | Fine-tuning a base model on (instruction, response) pairs to make it follow natural-language directions. | LLaMA-3-70B base → LLaMA-3-70B-Instruct via instruction tuning on curated Q&A data. |
| 53 | Knowledge Graph | A structured graph of entities as nodes and typed relationships as edges. | (Tim Cook) -[IS_CEO_OF]→ (Apple Inc.) -[MAKES]→ (iPhone) is a three-node knowledge graph fragment. |
| 54 | KV Cache | Storing computed key and value tensors from prior tokens to avoid recomputation on each generation step. | Without a KV cache, each of 1,000 generation steps recomputes attention over the entire growing prefix; with it, only the newest token’s keys/values are computed. (See sketch after the table.) |
| 55 | LangChain | A framework for composing LLM calls, retrieval, memory, and tool use into chains and agents. | A LangChain RAG chain: retrieve docs → inject into prompt → call LLM → return answer. |
| 56 | Large Language Model (LLM) | A neural network with billions of parameters trained on massive text corpora to understand and generate language. | GPT-4, Claude 3, Gemini 1.5, and LLaMA-3 are the most widely used LLMs as of 2025. |
| 57 | llama.cpp | A C/C++ runtime for local LLM inference supporting GGUF quantized models on CPU and GPU. | ./llama-cli -m Llama-3.gguf -p "Hello" -n 100 runs inference entirely on a laptop CPU. |
| 58 | LLMOps | Operational practices for deploying, monitoring, versioning, and scaling LLMs in production. | LLMOps covers prompt management, A/B testing, cost tracking, latency monitoring, and model rollback. |
| 59 | Locally Typical Sampling | A sampler that keeps tokens whose information content is close to the expected entropy of the distribution. | Unlike Top-P, locally typical sampling may include a low-probability token if it matches the model’s typical surprise level. |
| 60 | Logits | The raw, unnormalized scores the model produces for each vocabulary token before any softmax normalization. | A 128K-token vocabulary produces 128K logit values per generation step; softmax converts them to probabilities. |
| 61 | LoRA (Low-Rank Adaptation) | A PEFT method that freezes original weights W and trains two small matrices A and B so the adapted weight is W′ = W + BA. | A 7B model fine-tuned with LoRA at rank 16 trains ~4M parameters instead of all 7B. (See sketch after the table.) |
| 62 | LSTM (Long Short-Term Memory) | A recurrent neural network with gating mechanisms (input, forget, output gates) to capture long-range dependencies. | LSTMs were the dominant NLP sequence model before transformers; still used in some speech and time-series tasks. |
| 63 | Mamba | A state-space model architecture using selective state updates to achieve linear-time sequence modeling. | Mamba offers up to 5× higher inference throughput than a same-size transformer and has been evaluated on sequences up to 1M tokens. |
| 64 | Matryoshka Embeddings | Embeddings trained so the first N dimensions independently form a useful subspace at any truncation length. | A 1024-dim Matryoshka embedding truncated to 256 dims loses modest accuracy while saving 75% of storage. |
| 65 | Maximal Marginal Relevance (MMR) | A reranking technique that balances relevance to the query with diversity among the selected results. | MMR with diversity_bias=0.3 returns retrieved chunks that are all relevant but not near-duplicates of each other. |
| 66 | Min-P Sampling | A sampler that removes any token whose probability is below a fraction of the top token’s probability. | With min_p=0.1 and top token at p=0.5, any token with p < 0.05 is filtered out; pairs well with high temperature. |
| 67 | Mirostat Sampling | A sampler that dynamically adjusts temperature each step to maintain a target surprise level (perplexity). | Setting Mirostat’s target τ=3 keeps generation at a consistent creativity level regardless of context length. |
| 68 | Mixture of Experts (MoE) | An architecture where each token is routed to a sparse subset of “expert” feed-forward layers, keeping active compute constant as total params grow. | Mixtral 8x22B has 141B total parameters but activates only ~39B per token via top-2 routing across 8 experts. |
| 69 | MMLU (Massive Multitask Language Understanding) | A benchmark covering 57 academic subjects from STEM to humanities, scored as multiple-choice accuracy. | GPT-4 scored ~86% on MMLU; LLaMA-3-70B scored ~82%. |
| 70 | Model Drift | Degradation in a deployed model’s performance over time as real-world data distribution shifts from training data. | A spam classifier trained in 2020 loses accuracy as spammers adopt vocabulary absent from its training set. |
| 71 | Multi-Head Attention | Running H independent attention computations on projected subspaces in parallel, then concatenating their outputs. | A transformer with 32 attention heads can simultaneously track syntax, coreference, and entity relationships. |
| 72 | Multimodal Embeddings | Embeddings that map different modalities (image, text, audio) into a shared vector space for cross-modal retrieval. | CLIP enables image search via text: “red running shoes” retrieves photos of red sneakers without metadata. |
| 73 | n-gram | A contiguous sequence of n tokens from a text. | “once upon a” is a 3-gram; DRY sampling uses n-gram matching to detect and suppress repetitive loops. |
| 74 | One-Shot Prompting | Including exactly one input-output example pair in the prompt before the actual query. | “Sentiment: ‘I loved it’ → Positive. Sentiment: ‘Terrible experience’ → ?” |
| 75 | Open Source LLM | An LLM whose weights are publicly released, enabling local deployment, fine-tuning, and customization. | LLaMA-3, Mistral, Gemma, and Phi-4 are prominent open-source LLMs with commercial-friendly licenses. |
| 76 | Orchestration | Coordinating LLM calls, tool invocations, memory, and routing logic to complete multi-step tasks. | A LangGraph agent orchestrates: (1) retrieve, (2) reason, (3) call external API, (4) synthesize answer. |
| 77 | PEFT (Parameter-Efficient Fine-Tuning) | A family of techniques that fine-tune a small subset of parameters while keeping the base model mostly frozen. | LoRA is the most popular PEFT method; QLoRA extends it to quantized base models. |
| 78 | Perplexity | Exponential of the average negative log-likelihood; lower values mean the model is more confident about the text. | A well-tuned model achieves perplexity ~4 on Wikipedia text; a random model would score near the vocabulary size. (See sketch after the table.) |
| 79 | Position Interpolation (PI) | Extending a model’s effective context window by interpolating positional encodings between trained positions. | A LLaMA-2 model trained at 4K context can be extended to 16K via PI with short fine-tuning. |
| 80 | Positional Encoding | A signal added to token embeddings to communicate each token’s position, since attention itself is position-agnostic. | Original transformers use sinusoidal encodings; modern LLMs use RoPE or ALiBi. |
| 81 | PPO (Proximal Policy Optimization) | A reinforcement learning algorithm that constrains policy updates to a trust region, used to optimize against a reward model in RLHF. | PPO is the RL step in the classic RLHF pipeline: SFT → reward model → PPO fine-tuning. |
| 82 | Pre-training | Training a model from scratch on large unlabeled corpora using a self-supervised objective. | GPT-4 pre-training consumed trillions of tokens and a reported $100M+ in compute. |
| 83 | Presence Penalty | A sampler that subtracts a flat logit penalty from any token that has appeared at least once in the current generation. | Even a token used once gets logit reduced by presence_penalty, discouraging any reuse. |
| 84 | Prompt Engineering | Designing and refining input prompts to steer an LLM toward desired output style, format, or factual accuracy. | Adding “Think step by step” to a math prompt significantly improves accuracy on complex reasoning tasks. |
| 85 | QLoRA | LoRA applied to a 4-bit quantized base model, enabling fine-tuning of 70B models on a single high-memory GPU. | QLoRA lets you fine-tune LLaMA-3-70B on a single 48 GB GPU; standard full fine-tuning would require a multi-node A100 cluster. |
| 86 | Quadratic Sampling | A sampler that applies a smooth quadratic remapping to logits to redistribute probability mass toward middle-ranked tokens. | Quadratic sampling softens sharp probability peaks without imposing a hard cutoff like top-K or top-P. |
| 87 | Quantization | Reducing numerical precision of model weights (FP32 → FP16 → INT8 → INT4) to shrink memory and accelerate inference. | A 70B model at FP16 needs ~140 GB VRAM; the same model at INT4 fits in ~35 GB. |
| 88 | Quantization-Aware Training (QAT) | Training with simulated quantization noise so the model learns to be robust to reduced precision at inference time. | Google uses QAT for Gemma models to maintain quality at INT4 in production serving. |
| 89 | RAG (Retrieval-Augmented Generation) | An architecture that retrieves relevant document chunks and injects them into the LLM prompt before generation. | Query “what is our refund policy?” → retrieve policy chunk → LLM generates a grounded answer. (See sketch after the table.) |
| 90 | RAG Evaluation | Measuring both retrieval quality (are retrieved chunks relevant?) and generation quality (is the response grounded?). | UMBRELA and AutoNuggetizer are open-source metrics for rigorous RAG pipeline evaluation. |
| 91 | RAG Sprawl | Unchecked proliferation of redundant RAG pipelines across an organization with no central governance. | A company discovers 12 separate teams each built their own RAG stack pulling from the same document store. |
| 92 | Repetition Penalty | A sampler that divides positive logits and multiplies negative logits for any token seen in the prompt or prior output. | repetition_penalty=1.2 makes repeating any context token harder without breaking coherence. |
| 93 | Reranking | A second-stage scoring pass that reorders initial retrieval candidates using a more accurate but slower model. | A bi-encoder retrieves top-100 candidates in ~5 ms; a cross-encoder reranks them in ~200 ms. |
| 94 | Reward Model | A model trained on (prompt, winning response, losing response) triples to predict human preference scores. | The reward model’s output score is the optimization target during the PPO phase of RLHF. |
| 95 | RLHF (Reinforcement Learning from Human Feedback) | A training pipeline that collects human preference comparisons, trains a reward model, then uses RL to maximize it. | ChatGPT’s helpful alignment came from RLHF applied on top of a GPT-3.5 SFT checkpoint. |
| 96 | RNN (Recurrent Neural Network) | A neural network that passes a hidden state sequentially from one time step to the next. | LSTMs (a type of RNN) were the dominant NLP sequence model before transformers arrived in 2017. |
| 97 | RoPE (Rotary Positional Embedding) | A positional encoding that applies a rotation matrix to query and key vectors based on their absolute position index. | LLaMA, Mistral, and Gemma all use RoPE; it generalizes better to longer sequences than learned absolute encodings. |
| 98 | Sampling | Selecting the next token by drawing from the model’s output probability distribution rather than always taking the argmax. | Temperature, top-k, top-p, and min-p all modify the distribution before sampling occurs. |
| 99 | Self-Attention | An attention mechanism where each token in a sequence attends to all other tokens in the same sequence. | Self-attention resolves coreference: “it” in “The trophy didn’t fit because it was too big” links back to “trophy.” |
| 100 | Self-Supervised Learning | Learning from unlabeled data by constructing supervision signals from the data itself. | GPT pre-training is self-supervised: the target label is the next token, which already exists in the corpus. |
| 101 | Semantic Search | Retrieval based on meaning via dense embeddings rather than keyword overlap. | “Best laptop for machine learning” retrieves results about high-VRAM GPUs even without those exact words. |
| 102 | Sentence Embeddings | Embeddings encoding an entire sentence or paragraph into a single fixed-size vector. | Sentence-BERT encodes “Paris is the capital of France” into a vector close to “The French capital is Paris.” |
| 103 | SentencePiece | A language-agnostic tokenization library that trains BPE or unigram subword vocabularies from raw text. | Gemma, LLaMA-2, and Mistral ship with SentencePiece tokenizers trained on their respective corpora; LLaMA-3 switched to a tiktoken-style BPE tokenizer. |
| 104 | Softmax | A function that converts a vector of real numbers (logits) into a valid probability distribution summing to 1. | Logits [2.0, 1.0, 0.5] → softmax → approximately [0.63, 0.23, 0.14]. (See sketch after the table.) |
| 105 | Speculative Decoding | Generating draft tokens with a fast small model and verifying them in parallel with a slower large model. | A 1B draft model proposes 5 tokens; the 70B verifier accepts 3-4 and regenerates only the rejected ones. |
| 106 | State Space Models (SSMs) | Sequence models based on linear recurrence equations that scale linearly, not quadratically, with sequence length. | Mamba is the leading SSM; it matches transformer quality on some benchmarks while scaling linearly. |
| 107 | Step-back Prompting (STP) | A technique asking the model to first state the underlying principles before answering the specific question. | “What physics principles apply here? Now use them to answer: what happens when you drop a feather in a vacuum?” |
| 108 | Supervised Fine-Tuning (SFT) | The initial fine-tuning step on labeled (instruction, response) data, typically the first stage before preference alignment. | SFT converts a raw pre-trained base model into an instruction-following model before RLHF. |
| 109 | System Prompt | A special instruction block prepended to the conversation context that sets the model’s persona and constraints. | “You are a helpful assistant. Always respond in JSON. Do not discuss competitor products.” |
| 110 | T5 (Text-to-Text Transfer Transformer) | An encoder-decoder model that frames every NLP task as a text-to-text problem. | T5 frames translation as: input “translate English to German: Hello” → output “Hallo.” |
| 111 | Tail-Free Sampling | A sampler that identifies the probability distribution’s tail via its second derivative and removes tokens past that break point. | TFS automatically cuts the long tail of unlikely tokens without requiring a fixed K or P hyperparameter. |
| 112 | Temperature | A scalar applied to logits before softmax that controls the sharpness of the output probability distribution. | t=0.1 → near-deterministic; t=0.7 → balanced; t=1.5 → highly creative but error-prone. |
| 113 | Token | The atomic unit of text processed by an LLM; typically a subword fragment from the model’s learned vocabulary. | “Tokenization” → 2-3 tokens such as [“Token”, “ization”] depending on the tokenizer. |
| 114 | Tokenization | Converting raw text into a sequence of integer token IDs using a learned subword vocabulary. | “How am I doing today?” → 6 tokens: [“How”, “ am”, “ I”, “ doing”, “ today”, “?”] with GPT-2 BPE. |
| 115 | Top-A Sampling | A sampler that filters tokens below a threshold proportional to the square of the top token’s probability. | Top-A is effectively Min-P with a squared threshold; it predates Min-P and is more aggressive at high confidence. |
| 116 | Top-K Sampling | A sampler that restricts token selection to the K highest-probability tokens, masking all others. | Vocab=50K, top_k=50: only the top 50 tokens are eligible; the remaining 49,950 are zeroed out. |
| 117 | Top-N-Sigma Sampling | A sampler that filters tokens more than N standard deviations below the maximum logit value. | top_n_sigma=2 adapts automatically: strict when the model is confident, lenient when the distribution is flat. |
| 118 | Top-P (Nucleus) Sampling | A sampler that keeps the smallest set of tokens whose cumulative probability mass exceeds threshold P. | top_p=0.9 selects tokens until their combined probability reaches 90%; may be 5 tokens or 500 depending on confidence. (See sketch after the table.) |
| 119 | Transformer | A neural network architecture built on stacked self-attention and feed-forward layers that processes sequences in parallel. | “Attention Is All You Need” (2017) introduced the transformer; GPT and BERT are direct descendants. |
| 120 | TruthfulQA | A benchmark of 817 questions designed to probe common misconceptions, scored by how often the model gives truthful answers. | “What happens if you swallow gum?” tests whether a model repeats folk wisdom or states the accurate fact. |
| 121 | Unsloth | A library that accelerates LoRA fine-tuning by 2-5× via custom Triton kernels and memory-efficient backward passes. | Unsloth fine-tunes LLaMA-3-8B on a 24 GB GPU at roughly 2× the throughput of standard Hugging Face PEFT. |
| 122 | Vector Database | A database optimized for storing and querying high-dimensional embedding vectors via approximate nearest-neighbor search. | Qdrant, Pinecone, Milvus, Weaviate, and pgvector are common vector databases for RAG applications. |
| 123 | vLLM | A high-throughput LLM serving framework using PagedAttention to manage KV cache as virtual memory pages. | vLLM achieves 10-24× higher throughput than naive serving for concurrent requests via continuous batching. |
| 124 | VRAM (Video RAM) | On-board GPU memory; the primary bottleneck for training and running large models locally. | A 70B FP16 model needs ~140 GB VRAM; a consumer RTX 4090 has 24 GB. |
| 125 | Weights | The learned numerical parameters of a neural network, updated during training to minimize loss. | LLaMA-3-70B stores 70 billion floating-point weight values across its layer matrices. |
| 126 | WordPiece | A subword tokenization algorithm, developed for BERT, that builds vocabulary by maximizing corpus likelihood. | “unaffable” → [“un”, “##aff”, “##able”] using BERT’s WordPiece tokenizer. |
| 127 | XTC (eXclude Top Choices) Sampling | A sampler that occasionally removes all top-probability tokens except the lowest-scoring qualifying one, forcing unconventional output. | When activated, XTC removes the top 5 probable tokens and keeps only the 6th, injecting controlled surprise. |
| 128 | YaRN (Yet Another RoPE extensioN) | A method that rescales the RoPE frequency basis to extend a model’s effective context window beyond its training length. | YaRN with a 4× scale factor extends LLaMA-2-13B from 4K to 16K context with minimal perplexity degradation. |
| 129 | Zero-Shot Learning | Querying an LLM for a task with no examples, relying entirely on knowledge from pre-training. | “Translate to French: Hello world” with no examples → the model applies its trained language knowledge directly. |
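
A few of these mechanics are easier to see in code than in prose, so some minimal Python sketches follow. They are toy illustrations of the definitions above, not production implementations, and every concrete value in them is invented. First, the pipeline the Logits, Softmax, and Temperature rows describe:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, t):
    # t < 1 sharpens the distribution, t > 1 flattens it.
    return [x / t for x in logits]

logits = [2.0, 1.0, 0.5]                          # hypothetical raw model scores
print(softmax(logits))                            # ≈ [0.63, 0.23, 0.14]
print(softmax(apply_temperature(logits, 0.5)))    # sharper: top token gains mass
print(softmax(apply_temperature(logits, 1.5)))    # flatter: mass spreads out
```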
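
The truncation samplers (Top-K, Top-P, Min-P) all mask part of the distribution before sampling. A sketch over a toy five-token distribution; the helper names are mine:

```python
import random

def top_k_filter(probs, k):
    # Keep only the k highest-probability token indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_p_filter(probs, p):
    # Keep the smallest high-probability prefix whose cumulative mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return keep

def min_p_filter(probs, min_p):
    # Keep tokens with probability >= min_p * the top token's probability.
    cutoff = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= cutoff}

def sample_from(probs, keep):
    # Renormalize over the survivors, then draw one token index.
    idx = sorted(keep)
    return random.choices(idx, weights=[probs[i] for i in idx], k=1)[0]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]   # toy next-token distribution
print(top_k_filter(probs, 2))          # {0, 1}
print(top_p_filter(probs, 0.9))        # {0, 1, 2, 3}: cumulative mass 0.95 >= 0.9
print(min_p_filter(probs, 0.25))       # {0, 1, 2}: cutoff is 0.25 * 0.5 = 0.125
print(sample_from(probs, top_k_filter(probs, 2)))
```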
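
Frequency and presence penalties, following the commonly documented formula (a per-count penalty plus a flat hit for any token already used); the penalty values here are arbitrary:

```python
def penalize(logits, generated_ids, freq_penalty=0.0, pres_penalty=0.0):
    # Count how many times each token id has already been generated.
    counts = {}
    for t in generated_ids:
        counts[t] = counts.get(t, 0) + 1
    out = list(logits)
    for t, c in counts.items():
        out[t] -= c * freq_penalty   # frequency: grows with each repetition
        out[t] -= pres_penalty       # presence: flat hit after first appearance
    return out

logits = [3.0, 1.5, 0.2]
print(penalize(logits, [0, 0, 0, 1], freq_penalty=0.5, pres_penalty=0.4))
# token 0 used 3x: 3.0 - 3*0.5 - 0.4 = 1.1; token 1 used once: 1.5 - 0.5 - 0.4 = 0.6
```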
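
Perplexity as defined in its row: the exponential of the mean negative log-likelihood. The per-token probabilities below are invented:

```python
import math

def perplexity(token_probs):
    # token_probs: probability the model assigned to each actual next token.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.5, 0.25, 0.25, 0.5]))   # ≈ 2.83
print(perplexity([0.9, 0.95, 0.9, 0.99]))   # ≈ 1.07, a confident model
```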
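
Cosine similarity straight from its definition (dot product over the product of norms); the 3-dim vectors are stand-ins for real embeddings with hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0: same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0: orthogonal
```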
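
Scaled dot-product self-attention in NumPy, without a causal mask (encoder-style). The shapes are the point; the random weights are placeholders, not trained values:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model). Project into queries, keys, values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Similarity of every token's query with every token's key.
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16            # e.g. the 6 tokens of "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (6, 16)
```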
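
The KV cache idea: each decode step computes keys and values for the newest token only and reuses everything already cached. A toy single-head version; real runtimes cache per layer and per head:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x):
    # x: embedding of the newest token only. Its key/value are appended to
    # the cache; all earlier keys/values are reused, never recomputed.
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x @ Wq
    scores = K @ q / np.sqrt(d)                    # attend over all cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

for _ in range(5):
    decode_step(rng.normal(size=d))
print(len(k_cache))   # 5: one cached K/V pair per processed token
```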
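
One BPE training iteration, merging the most frequent adjacent pair, on a toy three-word corpus. This mirrors only the merge rule, not a full tokenizer trainer:

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word as a tuple of characters.
words = {tuple("hug"): 10, tuple("pug"): 5, tuple("hugs"): 5}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", list(words))   # first merge: ('u', 'g'), seen 20 times
```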
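
Fixed-size chunking with overlap, matching the Chunking row's ~12K-token example; token ids are stubbed as integers and chunk_tokens is a name of my own:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    # Slide a window of chunk_size tokens, stepping by (chunk_size - overlap).
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = list(range(12_000))        # a hypothetical 12,000-token document
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]))  # 26 chunks of up to 512 tokens each
```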
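
The LoRA update W′ = W + BA, with B zero-initialized so training starts from the frozen model exactly. A NumPy sketch whose dimensions mimic a 1024-wide layer at rank 16; this is the math, not the peft library's API:

```python
import numpy as np

rng = np.random.default_rng(2)
d, rank = 1024, 16

W = rng.normal(size=(d, d))              # frozen pre-trained weight
A = rng.normal(size=(rank, d)) * 0.01    # trainable, rank x d
B = np.zeros((d, rank))                  # trainable, zero-initialized

def lora_forward(x):
    # W is never modified; only A and B would receive gradient updates.
    return W @ x + B @ (A @ x)           # W'x = Wx + B(Ax)

x = rng.normal(size=d)
print(np.allclose(lora_forward(x), W @ x))       # True: B starts at zero, so W' == W
print(A.size + B.size, "trainable vs", W.size, "frozen")   # 32768 vs 1048576
```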
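
Finally, the RAG loop end to end. embed() and call_llm() are toy stand-ins (bag-of-words counts and a stub string), not any real library's API; only the retrieve → augment → generate shape matters:

```python
def embed(text: str) -> dict:
    # Toy "embedding": bag-of-words counts. Real systems use dense vectors.
    vec = {}
    for word in text.lower().split():
        word = word.strip(".,:?!")
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a: dict, b: dict) -> float:
    # Unnormalized word-overlap score; a stand-in for cosine similarity.
    return sum(count * b.get(word, 0) for word, count in a.items())

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: similarity(q, embed(d)), reverse=True)[:top_k]

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (an API request in practice).
    return f"<completion grounded in: {prompt[:50]}...>"

docs = [
    "Refund policy: refunds are issued within 30 days of purchase.",
    "Shipping policy: orders arrive in 5 to 7 business days.",
    "Support hours: chat support is available around the clock.",
]
question = "What is our refund policy?"
context = "\n".join(retrieve(question, docs))      # picks the refund document
print(call_llm(f"Context:\n{context}\n\nQuestion: {question}"))
```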
