The field moves fast. New architectures, alignment techniques, quantization formats, and sampling strategies appear every few weeks, each with its own acronym and a dozen conflicting definitions on the internet. This is my personal reference: 129 terms, one table, sorted A to Z, followed by a handful of short Python sketches for the mechanics that are easier to see in code than in prose.
| # | Term | Meaning | Example |
|---|---|---|---|
| 1 | Abliterated | A model that has had its refusal behaviors removed via targeted weight edits, without full retraining. | Base model + abliteration responds to prompts a standard instruct model declines. |
| 2 | Agent | An LLM connected to a runtime and tools so it can plan, execute actions, and iterate on results. | An LLM writes a Python script; an Agent runs it, crashes your server, and autonomously orders a Jimmy John’s sandwich to cope with the stress. |
| 3 | Agentic RAG | RAG extended with autonomous multi-step reasoning, tool calls, and iterative retrieval. | A system that doesn’t just retrieve the manual for a Breville Bambino Plus, but iterates on tool calls to buy endless espresso accessories you didn’t know you needed. |
| 4 | ANN Search | Searching a vector index for approximate nearest neighbors rather than doing exact brute-force comparison. | HNSW on 10M embeddings returns top-10 matches in milliseconds vs. minutes for exact search. |
| 5 | Attention Mechanism | A neural module that computes a weighted sum of value vectors based on query-key similarity for every token pair. | In “The cat sat on the mat,” high attention between “cat” and “sat” links subject and verb. (See sketch after the table.) |
| 6 | Auto-regressive | A generation mode where each new token is predicted conditioned on all previously generated tokens. | “Today’s weather is sunny and” → model predicts “warm.” |
| 7 | AWQ (Activation-aware Weight Quantization) | A quantization method that preserves weights feeding high-activation channels to minimize quality loss at low bit widths. | LLaMA-3-70B-AWQ at 4-bit fits on a 48 GB GPU while retaining near-FP16 quality. |
| 8 | Beam Search | A deterministic decoding strategy that tracks B partial sequences in parallel and returns the highest-probability complete sequence. | Beam width 4 expands 4 hypotheses per step and prunes back to the best 4 after each token. |
| 9 | BERT | An encoder-only transformer pre-trained on masked token prediction and next-sentence prediction. | bert-base-uncased encodes a sentence into a 768-dim vector used for classification and retrieval. |
| 10 | BM25 | A sparse keyword ranking function based on term frequency and inverse document frequency. | Query “LLM inference” scores documents by exact token overlap; a key component of hybrid search pipelines. |
| 11 | BPE (Byte Pair Encoding) | A tokenization algorithm that iteratively merges the most frequent adjacent character pair into a new subword token. | “unhappiness” → [“un”, “happi”, “ness”] depending on the trained vocabulary. (See sketch after the table.) |
| 12 | Chain-of-Thought (CoT) | A prompting technique that asks the model to produce step-by-step reasoning before its final answer. | “Let’s think step by step: 23 × 17 = 20×17 + 3×17 = 340 + 51 = 391.” |
| 13 | Chunking | Splitting a document into smaller segments before embedding for retrieval. | A 20-page PDF (~12K tokens) split into 512-token chunks with 50-token overlap produces roughly 26 retrievable units. (See sketch after the table.) |
| 14 | CLIP / SigLIP | Vision-language models that learn aligned image and text embeddings via contrastive training. | A photo of a dog and the caption “golden retriever” map to nearby vectors in a shared embedding space. |
| 15 | Context Window | The maximum number of tokens an LLM can process in a single forward pass, covering both prompt and output. | GPT-4 Turbo: 128K tokens; Claude 3: 200K tokens; Gemini 1.5 Pro: 1M tokens. |
| 16 | Contextual Embeddings | Embeddings where a token’s vector depends on its surrounding context, not just the token itself. | “bank” in “river bank” and “bank account” gets distinct vectors from models like BERT. |
| 17 | Contrastive Search | A decoding method that balances model confidence with diversity by penalizing candidates similar to recent token representations. | At alpha=0.6, k=4, the model stays coherent while avoiding repetitive phrase loops. |
| 18 | Conversational AI | AI systems optimized for sustained, multi-turn dialogue that tracks context across turns. | ChatGPT, Claude, and Gemini are the primary consumer conversational AI products. |
| 19 | Cosine Similarity | The dot product of two unit-norm vectors; measures directional similarity independent of magnitude. | Embedding(“Paris”) and Embedding(“capital of France”) have cosine similarity around 0.92. (See sketch after the table.) |
| 20 | Cross-Attention | Attention where queries come from one sequence and keys/values come from another. | In T5, the decoder attends to encoder representations via cross-attention to generate the target sequence. |
| 21 | Cross-Encoder Reranker | A model that jointly encodes a query-document pair into a single sequence and produces a relevance score. | A cross-encoder beats a bi-encoder on relevance but is ~100× slower; used as a second-stage reranker. |
| 22 | Decoder-only Transformer | A transformer that generates text autoregressively, each token attending only to preceding tokens via causal masking. | GPT-4, LLaMA, and Mistral are all decoder-only transformers. |
| 23 | DPO (Direct Preference Optimization) | A fine-tuning method that optimizes preference learning via a supervised loss, skipping the explicit reward model. | DPO trains on (prompt, chosen, rejected) triples and is cheaper and more stable than PPO-based RLHF. |
| 24 | Distillation | Training a smaller student model to match the output distribution of a larger teacher model. | DistilBERT (66M params) is a student of BERT-base (110M) that retains 97% of downstream performance. |
| 25 | DRY (Don’t Repeat Yourself) Sampling | A sampler that penalizes tokens that would continue any n-gram pattern already present in the context. | If “once upon a” appeared earlier, any token that would recreate that sequence is exponentially penalized. |
| 26 | Dynamic Temperature Sampling | A variant of temperature sampling that adjusts the temperature value at each step based on local entropy. | Temperature rises in high-uncertainty regions to encourage exploration and falls when the model is confident. |
| 27 | Embeddings | Dense numerical vectors representing words, sentences, or other data in a semantic space. | king − man + woman ≈ queen in Word2Vec embedding space. |
| 28 | Encoder-Decoder Transformer | A transformer with both an encoder to represent input and a decoder to generate output. | T5 and BART are encoder-decoder models used for translation and summarization. |
| 29 | Encoder-only Transformer | A transformer that produces bidirectional representations of its full input, with no causal masking. | BERT and RoBERTa are encoder-only models used for classification and retrieval. |
| 30 | Entropy | A measure of uncertainty in the model’s probability distribution over the vocabulary at a given step. | High entropy: many tokens have similar probability. Low entropy: model is confident about the next token. |
| 31 | Epsilon Cutoff | A sampler that removes any token whose probability falls below a fixed absolute threshold ε. | With epsilon=0.0001, any token with probability below 0.01% is masked out regardless of context. |
| 32 | Eta Cutoff | A sampler that scales the filtering threshold relative to the distribution’s entropy, adapting to model confidence. | When the model is uncertain (high entropy), eta cutoff is lenient; when confident, it is strict. |
| 33 | Explainability | The ability of an AI system to surface the sources or reasoning behind its answer. | A RAG system citing the exact retrieved passage it used to generate a response is highly explainable. |
| 34 | Few-Shot Learning | Providing a small number of example input-output pairs in the prompt to guide the model’s response format. | “Hello → Bonjour; Goodbye → Au revoir; Good morning → ?” teaches translation by demonstration. |
| 35 | Fine-Tuning | Continuing training of a pre-trained model on a smaller, task-specific dataset. | A base LLaMA-3 model fine-tuned on medical Q&A produces more accurate clinical responses. |
| 36 | Flash Attention | A fused attention kernel that avoids materializing the full N×N attention matrix in memory, reducing memory I/O. | Flash Attention 2 enables training on sequences 8× longer for the same GPU memory budget. |
| 37 | Foundation Model | A large model trained on broad data that can be adapted to many downstream tasks via fine-tuning or prompting. | GPT-4, LLaMA-3, and Gemini are foundation models; domain-specific variants are built on top of them. |
| 38 | Frequency Penalty | A sampler that subtracts a penalty from each token’s logit proportional to how many times that token has appeared in the output. | Token “the” appearing 5 times gets logit reduced by 5×λ, progressively discouraging its reuse. (See sketch after the table.) |
| 39 | Function Calling / Tool Use | The capability for an LLM to emit structured calls to external functions or APIs with typed parameters. | {"name": "get_weather", "parameters": {"location": "San Francisco", "unit": "celsius"}} |
| 40 | GGUF | A container format for quantized model weights optimized for local inference with llama.cpp. | Llama-3-70B-Instruct.Q4_K_M.gguf is a 4-bit quantized file that runs on 48 GB of unified RAM. |
| 41 | GPT (Generative Pre-trained Transformer) | A family of decoder-only transformers pre-trained on causal language modeling, developed by OpenAI. | GPT-3 (175B) demonstrated few-shot learning; GPT-3.5 powered the original ChatGPT. |
| 42 | GPTQ | A one-shot post-training quantization method that minimizes weight error layer by layer using approximate second-order (Hessian) information. | GPTQ at INT4 requires ~35 GB for LLaMA-3-70B, compared to ~140 GB at FP16. |
| 43 | GQA (Grouped Query Attention) | An attention variant that groups query heads to share a single set of key-value heads, shrinking KV cache size. | LLaMA-3-70B uses 8 KV heads for 64 query heads via GQA, cutting KV cache memory by 8×. |
| 44 | Greedy Sampling | Decoding strategy that always selects the single highest-probability token at each step; fully deterministic. | “Paris is the capital of” with greedy decoding reliably produces “France” with no randomness. |
| 45 | Grounding | Connecting LLM outputs to verifiable external facts, documents, or databases to reduce hallucinations. | RAG grounds responses by injecting retrieved passages as context before the model generates its answer. |
| 46 | Hallucination | An LLM-generated response that is factually incorrect or fabricated, delivered with apparent confidence. | In July 2023, ChatGPT stated Will Smith had never assaulted anyone, despite the 2022 Oscars incident. |
| 47 | HNSW (Hierarchical Navigable Small World) | A graph-based ANN index that builds multi-layer proximity graphs for fast approximate nearest-neighbor search. | Qdrant and Weaviate use HNSW internally; query latency is typically under 5 ms on millions of vectors. |
| 48 | Hugging Face | The primary open-source platform for sharing models, datasets, and ML tooling. | The transformers library, Model Hub, and Spaces are its core products. |
| 49 | HumanEval | A coding benchmark of 164 hand-crafted Python problems, scored by pass@k (fraction solved in k attempts). | GPT-4 scored ~67% pass@1 at launch; GPT-3.5 scored ~48%. |
| 50 | Hybrid Search | Retrieval that combines dense vector similarity search with sparse BM25 keyword matching and merges scores. | “XPS-9530 GPU upgrade” benefits from keyword matching; “good laptop for gaming” benefits from semantics. |
| 51 | Inference | Running a trained model on new input to produce an output; the opposite of training. | Sending a prompt to the Claude API and receiving a completion is inference. |
| 52 | Instruction Tuning | Fine-tuning a base model on (instruction, response) pairs to make it follow natural-language directions. | LLaMA-3-70B base → LLaMA-3-70B-Instruct via instruction tuning on curated Q&A data. |
| 53 | Knowledge Graph | A structured graph of entities as nodes and typed relationships as edges. | (Tim Cook) -[IS_CEO_OF]→ (Apple Inc.) -[MAKES]→ (iPhone) is a three-node knowledge graph fragment. |
| 54 | KV Cache | Storing computed key and value tensors from prior tokens to avoid recomputation on each generation step. | Without a KV cache, each of 1,000 generation steps recomputes attention over the entire growing prefix; with it, only the newest token’s keys/values are computed. (See sketch after the table.) |
| 55 | LangChain | A framework for composing LLM calls, retrieval, memory, and tool use into chains and agents. | A LangChain RAG chain: retrieve docs → inject into prompt → call LLM → return answer. |
| 56 | Large Language Model (LLM) | A neural network with billions of parameters trained on massive text corpora to understand and generate language. | GPT-4, Claude 3, Gemini 1.5, and LLaMA-3 are the most widely used LLMs as of 2025. |
| 57 | llama.cpp | A C/C++ runtime for local LLM inference supporting GGUF quantized models on CPU and GPU. | ./llama-cli -m Llama-3.gguf -p "Hello" -n 100 runs inference entirely on a laptop CPU. |
| 58 | LLMOps | Operational practices for deploying, monitoring, versioning, and scaling LLMs in production. | LLMOps covers prompt management, A/B testing, cost tracking, latency monitoring, and model rollback. |
| 59 | Locally Typical Sampling | A sampler that keeps tokens whose information content is close to the expected entropy of the distribution. | Unlike Top-P, locally typical sampling may include a low-probability token if it matches the model’s typical surprise level. |
| 60 | Logits | The raw, unnormalized scores the model produces for each vocabulary token before any softmax normalization. | A 128K-token vocabulary produces 128K logit values per generation step; softmax converts them to probabilities. |
| 61 | LoRA (Low-Rank Adaptation) | A PEFT method that freezes original weights W and trains two small matrices A and B so the adapted weight is W′ = W + BA. | A 7B model fine-tuned with LoRA at rank 16 trains ~4M parameters instead of all 7B. (See sketch after the table.) |
| 62 | LSTM (Long Short-Term Memory) | A recurrent neural network with gating mechanisms (input, forget, output gates) to capture long-range dependencies. | LSTMs were the dominant NLP sequence model before transformers; still used in some speech and time-series tasks. |
| 63 | Mamba | A state-space model architecture using selective state updates to achieve linear-time sequence modeling. | Mamba offers up to 5× higher inference throughput than a same-size transformer and has been evaluated on sequences up to 1M tokens. |
| 64 | Matryoshka Embeddings | Embeddings trained so the first N dimensions independently form a useful subspace at any truncation length. | A 1024-dim Matryoshka embedding truncated to 256 dims loses modest accuracy while saving 75% of storage. |
| 65 | Maximal Marginal Relevance (MMR) | A reranking technique that balances relevance to the query with diversity among the selected results. | MMR with diversity_bias=0.3 returns retrieved chunks that are all relevant but not near-duplicates of each other. |
| 66 | Min-P Sampling | A sampler that removes any token whose probability is below a fraction of the top token’s probability. | With min_p=0.1 and top token at p=0.5, any token with p < 0.05 is filtered out; pairs well with high temperature. |
| 67 | Mirostat Sampling | A sampler that dynamically adjusts temperature each step to maintain a target surprise level (perplexity). | Setting Mirostat’s target τ=3 keeps generation at a consistent creativity level regardless of context length. |
| 68 | Mixture of Experts (MoE) | An architecture where each token is routed to a sparse subset of “expert” feed-forward layers, keeping active compute constant as total params grow. | Mixtral 8x22B has 141B total parameters but activates only ~39B per token via top-2 routing across 8 experts. |
| 69 | MMLU (Massive Multitask Language Understanding) | A benchmark covering 57 academic subjects from STEM to humanities, scored as multiple-choice accuracy. | GPT-4 scored ~86% on MMLU; LLaMA-3-70B scored ~82%. |
| 70 | Model Drift | Degradation in a deployed model’s performance over time as real-world data distribution shifts from training data. | A spam classifier trained in 2020 loses accuracy as spammers adopt vocabulary absent from its training set. |
| 71 | Multi-Head Attention | Running H independent attention computations on projected subspaces in parallel, then concatenating their outputs. | A transformer with 32 attention heads can simultaneously track syntax, coreference, and entity relationships. |
| 72 | Multimodal Embeddings | Embeddings that map different modalities (image, text, audio) into a shared vector space for cross-modal retrieval. | CLIP enables image search via text: “red running shoes” retrieves photos of red sneakers without metadata. |
| 73 | n-gram | A contiguous sequence of n tokens from a text. | “once upon a” is a 3-gram; DRY sampling uses n-gram matching to detect and suppress repetitive loops. |
| 74 | One-Shot Prompting | Including exactly one input-output example pair in the prompt before the actual query. | “Sentiment: ‘I loved it’ → Positive. Sentiment: ‘Terrible experience’ → ?” |
| 75 | Open Source LLM | An LLM whose weights are publicly released, enabling local deployment, fine-tuning, and customization. | LLaMA-3, Mistral, Gemma, and Phi-4 are prominent open-source LLMs with commercial-friendly licenses. |
| 76 | Orchestration | Coordinating LLM calls, tool invocations, memory, and routing logic to complete multi-step tasks. | A LangGraph agent orchestrates: (1) retrieve, (2) reason, (3) call external API, (4) synthesize answer. |
| 77 | PEFT (Parameter-Efficient Fine-Tuning) | A family of techniques that fine-tune a small subset of parameters while keeping the base model mostly frozen. | LoRA is the most popular PEFT method; QLoRA extends it to quantized base models. |
| 78 | Perplexity | Exponential of the average negative log-likelihood; lower values mean the model is more confident about the text. | A well-tuned model achieves perplexity ~4 on Wikipedia text; a random model would score near the vocabulary size. (See sketch after the table.) |
| 79 | Position Interpolation (PI) | Extending a model’s effective context window by interpolating positional encodings between trained positions. | A LLaMA-2 model trained at 4K context can be extended to 16K via PI with short fine-tuning. |
| 80 | Positional Encoding | A signal added to token embeddings to communicate each token’s position, since attention itself is position-agnostic. | Original transformers use sinusoidal encodings; modern LLMs use RoPE or ALiBi. |
| 81 | PPO (Proximal Policy Optimization) | A reinforcement learning algorithm that constrains policy updates to a trust region, used to optimize against a reward model in RLHF. | PPO is the RL step in the classic RLHF pipeline: SFT → reward model → PPO fine-tuning. |
| 82 | Pre-training | Training a model from scratch on large unlabeled corpora using a self-supervised objective. | GPT-4 pre-training consumed trillions of tokens and a reported $100M+ in compute. |
| 83 | Presence Penalty | A sampler that subtracts a flat logit penalty from any token that has appeared at least once in the current generation. | Even a token used once gets logit reduced by presence_penalty, discouraging any reuse. |
| 84 | Prompt Engineering | Designing and refining input prompts to steer an LLM toward desired output style, format, or factual accuracy. | Adding “Think step by step” to a math prompt significantly improves accuracy on complex reasoning tasks. |
| 85 | QLoRA | LoRA applied to a 4-bit quantized base model, enabling fine-tuning of 70B models on a single high-memory GPU. | QLoRA lets you fine-tune LLaMA-3-70B on a single 48 GB GPU; standard full fine-tuning would require a multi-node A100 cluster. |
| 86 | Quadratic Sampling | A sampler that applies a smooth quadratic remapping to logits to redistribute probability mass toward middle-ranked tokens. | Quadratic sampling softens sharp probability peaks without imposing a hard cutoff like top-K or top-P. |
| 87 | Quantization | Reducing numerical precision of model weights (FP32 → FP16 → INT8 → INT4) to shrink memory and accelerate inference. | A 70B model at FP16 needs ~140 GB VRAM; the same model at INT4 fits in ~35 GB. |
| 88 | Quantization-Aware Training (QAT) | Training with simulated quantization noise so the model learns to be robust to reduced precision at inference time. | Google uses QAT for Gemma models to maintain quality at INT4 in production serving. |
| 89 | RAG (Retrieval-Augmented Generation) | An architecture that retrieves relevant document chunks and injects them into the LLM prompt before generation. | Query “what is our refund policy?” → retrieve policy chunk → LLM generates a grounded answer. (See sketch after the table.) |
| 90 | RAG Evaluation | Measuring both retrieval quality (are retrieved chunks relevant?) and generation quality (is the response grounded?). | UMBRELA and AutoNuggetizer are open-source metrics for rigorous RAG pipeline evaluation. |
| 91 | RAG Sprawl | Unchecked proliferation of redundant RAG pipelines across an organization with no central governance. | A company discovers 12 separate teams each built their own RAG stack pulling from the same document store. |
| 92 | Repetition Penalty | A sampler that divides positive logits and multiplies negative logits for any token seen in the prompt or prior output. | repetition_penalty=1.2 makes repeating any context token harder without breaking coherence. |
| 93 | Reranking | A second-stage scoring pass that reorders initial retrieval candidates using a more accurate but slower model. | A bi-encoder retrieves top-100 candidates in ~5 ms; a cross-encoder reranks them in ~200 ms. |
| 94 | Reward Model | A model trained on (prompt, winning response, losing response) triples to predict human preference scores. | The reward model’s output score is the optimization target during the PPO phase of RLHF. |
| 95 | RLHF (Reinforcement Learning from Human Feedback) | A training pipeline that collects human preference comparisons, trains a reward model, then uses RL to maximize it. | ChatGPT’s helpful alignment came from RLHF applied on top of a GPT-3.5 SFT checkpoint. |
| 96 | RNN (Recurrent Neural Network) | A neural network that passes a hidden state sequentially from one time step to the next. | LSTMs (a type of RNN) were the dominant NLP sequence model before transformers arrived in 2017. |
| 97 | RoPE (Rotary Positional Embedding) | A positional encoding that applies a rotation matrix to query and key vectors based on their absolute position index. | LLaMA, Mistral, and Gemma all use RoPE; it generalizes better to longer sequences than learned absolute encodings. |
| 98 | Sampling | Selecting the next token by drawing from the model’s output probability distribution rather than always taking the argmax. | Temperature, top-k, top-p, and min-p all modify the distribution before sampling occurs. |
| 99 | Self-Attention | An attention mechanism where each token in a sequence attends to all other tokens in the same sequence. | Self-attention resolves coreference: “it” in “The trophy didn’t fit because it was too big” links back to “trophy.” |
| 100 | Self-Supervised Learning | Learning from unlabeled data by constructing supervision signals from the data itself. | GPT pre-training is self-supervised: the target label is the next token, which already exists in the corpus. |
| 101 | Semantic Search | Retrieval based on meaning via dense embeddings rather than keyword overlap. | “Best laptop for machine learning” retrieves results about high-VRAM GPUs even without those exact words. |
| 102 | Sentence Embeddings | Embeddings encoding an entire sentence or paragraph into a single fixed-size vector. | Sentence-BERT encodes “Paris is the capital of France” into a vector close to “The French capital is Paris.” |
| 103 | SentencePiece | A language-agnostic tokenization library that trains BPE or unigram subword vocabularies from raw text. | Gemma, LLaMA-2, and Mistral ship with SentencePiece tokenizers trained on their respective corpora; LLaMA-3 switched to a tiktoken-style BPE tokenizer. |
| 104 | Softmax | A function that converts a vector of real numbers (logits) into a valid probability distribution summing to 1. | Logits [2.0, 1.0, 0.5] → softmax → approximately [0.63, 0.23, 0.14]. (See sketch after the table.) |
| 105 | Speculative Decoding | Generating draft tokens with a fast small model and verifying them in parallel with a slower large model. | A 1B draft model proposes 5 tokens; the 70B verifier accepts 3-4 and regenerates only the rejected ones. |
| 106 | State Space Models (SSMs) | Sequence models based on linear recurrence equations that scale linearly, not quadratically, with sequence length. | Mamba is the leading SSM; it matches transformer quality on some benchmarks while scaling linearly. |
| 107 | Step-back Prompting (STP) | A technique asking the model to first state the underlying principles before answering the specific question. | “What physics principles apply here? Now use them to answer: what happens when you drop a feather in a vacuum?” |
| 108 | Supervised Fine-Tuning (SFT) | The initial fine-tuning step on labeled (instruction, response) data, typically the first stage before preference alignment. | SFT converts a raw pre-trained base model into an instruction-following model before RLHF. |
| 109 | System Prompt | A special instruction block prepended to the conversation context that sets the model’s persona and constraints. | “You are a helpful assistant. Always respond in JSON. Do not discuss competitor products.” |
| 110 | T5 (Text-to-Text Transfer Transformer) | An encoder-decoder model that frames every NLP task as a text-to-text problem. | T5 frames translation as: input “translate English to German: Hello” → output “Hallo.” |
| 111 | Tail-Free Sampling | A sampler that identifies the probability distribution’s tail via its second derivative and removes tokens past that break point. | TFS automatically cuts the long tail of unlikely tokens without requiring a fixed K or P hyperparameter. |
| 112 | Temperature | A scalar applied to logits before softmax that controls the sharpness of the output probability distribution. | t=0.1 → near-deterministic; t=0.7 → balanced; t=1.5 → highly creative but error-prone. |
| 113 | Token | The atomic unit of text processed by an LLM; typically a subword fragment from the model’s learned vocabulary. | “Tokenization” → 2-3 tokens such as [“Token”, “ization”] depending on the tokenizer. |
| 114 | Tokenization | Converting raw text into a sequence of integer token IDs using a learned subword vocabulary. | “How am I doing today?” → 6 tokens: [“How”, “ am”, “ I”, “ doing”, “ today”, “?”] with GPT-2 BPE. |
| 115 | Top-A Sampling | A sampler that filters tokens below a threshold proportional to the square of the top token’s probability. | Top-A is effectively Min-P with a squared threshold; it predates Min-P and is more aggressive at high confidence. |
| 116 | Top-K Sampling | A sampler that restricts token selection to the K highest-probability tokens, masking all others. | Vocab=50K, top_k=50: only the top 50 tokens are eligible; the remaining 49,950 are zeroed out. |
| 117 | Top-N-Sigma Sampling | A sampler that filters tokens more than N standard deviations below the maximum logit value. | top_n_sigma=2 adapts automatically: strict when the model is confident, lenient when the distribution is flat. |
| 118 | Top-P (Nucleus) Sampling | A sampler that keeps the smallest set of tokens whose cumulative probability mass exceeds threshold P. | top_p=0.9 selects tokens until their combined probability reaches 90%; may be 5 tokens or 500 depending on confidence. (See sketch after the table.) |
| 119 | Transformer | A neural network architecture built on stacked self-attention and feed-forward layers that processes sequences in parallel. | “Attention Is All You Need” (2017) introduced the transformer; GPT and BERT are direct descendants. |
| 120 | TruthfulQA | A benchmark of 817 questions designed to probe common misconceptions, scored by how often the model gives truthful answers. | “What happens if you swallow gum?” tests whether a model repeats folk wisdom or states the accurate fact. |
| 121 | Unsloth | A library that accelerates LoRA fine-tuning by 2-5× via custom Triton kernels and memory-efficient backward passes. | Unsloth fine-tunes LLaMA-3-8B on a 24 GB GPU at roughly 2× the throughput of standard Hugging Face PEFT. |
| 122 | Vector Database | A database optimized for storing and querying high-dimensional embedding vectors via approximate nearest-neighbor search. | Qdrant, Pinecone, Milvus, Weaviate, and pgvector are common vector databases for RAG applications. |
| 123 | vLLM | A high-throughput LLM serving framework using PagedAttention to manage KV cache as virtual memory pages. | vLLM achieves 10-24× higher throughput than naive serving for concurrent requests via continuous batching. |
| 124 | VRAM (Video RAM) | On-board GPU memory; the primary bottleneck for training and running large models locally. | A 70B FP16 model needs ~140 GB VRAM; a consumer RTX 4090 has 24 GB. |
| 125 | Weights | The learned numerical parameters of a neural network, updated during training to minimize loss. | LLaMA-3-70B stores 70 billion floating-point weight values across its layer matrices. |
| 126 | WordPiece | A subword tokenization algorithm, developed for BERT, that builds vocabulary by maximizing corpus likelihood. | “unaffable” → [“un”, “##aff”, “##able”] using BERT’s WordPiece tokenizer. |
| 127 | XTC (eXclude Top Choices) Sampling | A sampler that occasionally removes all top-probability tokens except the lowest-scoring qualifying one, forcing unconventional output. | When activated, XTC removes the top 5 probable tokens and keeps only the 6th, injecting controlled surprise. |
| 128 | YaRN (Yet Another RoPE extensioN) | A method that rescales the RoPE frequency basis to extend a model’s effective context window beyond its training length. | YaRN with a 4× scale factor extends LLaMA-2-13B from 4K to 16K context with minimal perplexity degradation. |
| 129 | Zero-Shot Learning | Querying an LLM for a task with no examples, relying entirely on knowledge from pre-training. | “Translate to French: Hello world” with no examples → the model applies its trained language knowledge directly. |
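
A few of these mechanics are easier to see in code than in prose, so some minimal Python sketches follow. They are toy illustrations of the definitions above, not production implementations, and every concrete value in them is invented. First, the pipeline the Logits, Softmax, and Temperature rows describe:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, t):
    # t < 1 sharpens the distribution, t > 1 flattens it.
    return [x / t for x in logits]

logits = [2.0, 1.0, 0.5]                          # hypothetical raw model scores
print(softmax(logits))                            # ≈ [0.63, 0.23, 0.14]
print(softmax(apply_temperature(logits, 0.5)))    # sharper: top token gains mass
print(softmax(apply_temperature(logits, 1.5)))    # flatter: mass spreads out
```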
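
The truncation samplers (Top-K, Top-P, Min-P) all mask part of the distribution before sampling. A sketch over a toy five-token distribution; the helper names are mine:

```python
import random

def top_k_filter(probs, k):
    # Keep only the k highest-probability token indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_p_filter(probs, p):
    # Keep the smallest high-probability prefix whose cumulative mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return keep

def min_p_filter(probs, min_p):
    # Keep tokens with probability >= min_p * the top token's probability.
    cutoff = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= cutoff}

def sample_from(probs, keep):
    # Renormalize over the survivors, then draw one token index.
    idx = sorted(keep)
    return random.choices(idx, weights=[probs[i] for i in idx], k=1)[0]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]   # toy next-token distribution
print(top_k_filter(probs, 2))          # {0, 1}
print(top_p_filter(probs, 0.9))        # {0, 1, 2, 3}: cumulative mass 0.95 >= 0.9
print(min_p_filter(probs, 0.25))       # {0, 1, 2}: cutoff is 0.25 * 0.5 = 0.125
print(sample_from(probs, top_k_filter(probs, 2)))
```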
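
Frequency and presence penalties, following the commonly documented formula (a per-count penalty plus a flat hit for any token already used); the penalty values here are arbitrary:

```python
def penalize(logits, generated_ids, freq_penalty=0.0, pres_penalty=0.0):
    # Count how many times each token id has already been generated.
    counts = {}
    for t in generated_ids:
        counts[t] = counts.get(t, 0) + 1
    out = list(logits)
    for t, c in counts.items():
        out[t] -= c * freq_penalty   # frequency: grows with each repetition
        out[t] -= pres_penalty       # presence: flat hit after first appearance
    return out

logits = [3.0, 1.5, 0.2]
print(penalize(logits, [0, 0, 0, 1], freq_penalty=0.5, pres_penalty=0.4))
# token 0 used 3x: 3.0 - 3*0.5 - 0.4 = 1.1; token 1 used once: 1.5 - 0.5 - 0.4 = 0.6
```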
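
Perplexity as defined in its row: the exponential of the mean negative log-likelihood. The per-token probabilities below are invented:

```python
import math

def perplexity(token_probs):
    # token_probs: probability the model assigned to each actual next token.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.5, 0.25, 0.25, 0.5]))   # ≈ 2.83
print(perplexity([0.9, 0.95, 0.9, 0.99]))   # ≈ 1.07, a confident model
```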
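
Cosine similarity straight from its definition (dot product over the product of norms); the 3-dim vectors are stand-ins for real embeddings with hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0: same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0: orthogonal
```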
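
Scaled dot-product self-attention in NumPy, without a causal mask (encoder-style). The shapes are the point; the random weights are placeholders, not trained values:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model). Project into queries, keys, values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Similarity of every token's query with every token's key.
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16            # e.g. the 6 tokens of "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (6, 16)
```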
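
The KV cache idea: each decode step computes keys and values for the newest token only and reuses everything already cached. A toy single-head version; real runtimes cache per layer and per head:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x):
    # x: embedding of the newest token only. Its key/value are appended to
    # the cache; all earlier keys/values are reused, never recomputed.
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x @ Wq
    scores = K @ q / np.sqrt(d)                    # attend over all cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

for _ in range(5):
    decode_step(rng.normal(size=d))
print(len(k_cache))   # 5: one cached K/V pair per processed token
```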
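
One BPE training iteration, merging the most frequent adjacent pair, on a toy three-word corpus. This mirrors only the merge rule, not a full tokenizer trainer:

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word as a tuple of characters.
words = {tuple("hug"): 10, tuple("pug"): 5, tuple("hugs"): 5}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", list(words))   # first merge: ('u', 'g'), seen 20 times
```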
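
Fixed-size chunking with overlap, matching the Chunking row's ~12K-token example; token ids are stubbed as integers and chunk_tokens is a name of my own:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    # Slide a window of chunk_size tokens, stepping by (chunk_size - overlap).
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = list(range(12_000))        # a hypothetical 12,000-token document
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]))  # 26 chunks of up to 512 tokens each
```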
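
The LoRA update W′ = W + BA, with B zero-initialized so training starts from the frozen model exactly. A NumPy sketch whose dimensions mimic a 1024-wide layer at rank 16; this is the math, not the peft library's API:

```python
import numpy as np

rng = np.random.default_rng(2)
d, rank = 1024, 16

W = rng.normal(size=(d, d))              # frozen pre-trained weight
A = rng.normal(size=(rank, d)) * 0.01    # trainable, rank x d
B = np.zeros((d, rank))                  # trainable, zero-initialized

def lora_forward(x):
    # W is never modified; only A and B would receive gradient updates.
    return W @ x + B @ (A @ x)           # W'x = Wx + B(Ax)

x = rng.normal(size=d)
print(np.allclose(lora_forward(x), W @ x))       # True: B starts at zero, so W' == W
print(A.size + B.size, "trainable vs", W.size, "frozen")   # 32768 vs 1048576
```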
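
Finally, the RAG loop end to end. embed() and call_llm() are toy stand-ins (bag-of-words counts and a stub string), not any real library's API; only the retrieve → augment → generate shape matters:

```python
def embed(text: str) -> dict:
    # Toy "embedding": bag-of-words counts. Real systems use dense vectors.
    vec = {}
    for word in text.lower().split():
        word = word.strip(".,:?!")
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a: dict, b: dict) -> float:
    # Unnormalized word-overlap score; a stand-in for cosine similarity.
    return sum(count * b.get(word, 0) for word, count in a.items())

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: similarity(q, embed(d)), reverse=True)[:top_k]

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (an API request in practice).
    return f"<completion grounded in: {prompt[:50]}...>"

docs = [
    "Refund policy: refunds are issued within 30 days of purchase.",
    "Shipping policy: orders arrive in 5 to 7 business days.",
    "Support hours: chat support is available around the clock.",
]
question = "What is our refund policy?"
context = "\n".join(retrieve(question, docs))      # picks the refund document
print(call_llm(f"Context:\n{context}\n\nQuestion: {question}"))
```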
