Snehal Patel

I love to build things ✨

Large Language Models are beautiful.

April 18, 2026

You have text. You want the machine to continue it, answer it, summarize it. But machines do not understand language. They predict.

So you count. N-gram models look at the last few words and guess the next one from frequencies in a corpus. “The cat sat on the ___” learns to produce “mat.” But anything beyond a few words of context is invisible. Language has long dependencies. Counting does not.
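
Counting, as a sketch in Python. The corpus is a toy, purely for illustration:

```python
from collections import Counter, defaultdict

# A toy bigram model: count which word follows which, then predict the most
# frequent continuation. The corpus is made up for illustration.
corpus = "the cat sat on the mat . the cat sat on the rug .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict(prev_word):
    # Most frequent word seen right after prev_word -- no context beyond that.
    return counts[prev_word].most_common(1)[0][0]

print(predict("the"))  # "cat" -- it followed "the" more often than "mat" or "rug"
```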

So you use RNNs. A hidden state flows forward through the sequence, carrying memory of what came before. Now, in theory, the model can remember arbitrarily long context. In practice, gradients vanish. By the twentieth word the first is forgotten.
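
Roughly what that recurrence looks like, with random toy weights and illustrative shapes:

```python
import numpy as np

# A vanilla RNN, schematically: one hidden state updated once per token,
# strictly in order. Weights and shapes are illustrative only.
hidden, dim = 8, 4
W_hh = np.random.randn(hidden, hidden) * 0.1
W_xh = np.random.randn(hidden, dim) * 0.1
b = np.zeros(hidden)

h = np.zeros(hidden)
for x in np.random.randn(20, dim):          # 20 tokens, processed one after another
    h = np.tanh(W_hh @ h + W_xh @ x + b)    # all memory lives in this one vector
# Gradients from token 20 back to token 1 pass through 20 of these tanh steps
# and tend to shrink toward zero -- the vanishing-gradient problem above.
```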

So you use LSTMs. Gated memory cells decide what to keep and what to forget. Long dependencies survive. But every token still waits for the one before it. Training is sequential. You cannot parallelize what is inherently a for-loop.
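
A single gated step, sketched. Shapes are illustrative, nothing is tuned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step, schematically: gates decide what to forget from the cell state,
# what to write into it, and what to expose as the hidden state.
def lstm_step(h, c, x, W, bias):
    z = W @ np.concatenate([h, x]) + bias        # one projection, split four ways
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c = f * c + i * np.tanh(g)                   # keep some old memory, add some new
    h = o * np.tanh(c)                           # expose a filtered view of the cell
    return h, c
```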

So you use attention. Instead of squeezing everything through one hidden state, every token looks at every other token and decides how much weight to give each one. No more bottleneck. Meaning comes from relationships.
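
Scaled dot-product attention, the whole trick, as a sketch:

```python
import numpy as np

# Scaled dot-product self-attention, schematically: every token scores every
# other token, softmaxes the scores, and takes a weighted average of the values.
def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq, seq) pairwise relevance
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax over each row
    return weights @ V                               # each output mixes all values

X = np.random.randn(5, 16)        # 5 toy tokens, 16 dims each
out = attention(X, X, X)          # self-attention: Q, K, V come from the same tokens
```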

So you use Transformers. Drop recurrence entirely. Stack self-attention and feed-forward layers. Train on GPUs at scale because every position computes in parallel. The architecture that powers everything that came after.
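
One block, sketched on top of the attention() function above. Layer norm and multiple heads are omitted so the shape of the idea stays visible:

```python
import numpy as np

# One Transformer block, schematically: attention mixes information across tokens,
# the feed-forward layer transforms each token on its own, and residual connections
# keep the stack trainable. Uses attention() from the sketch above.
def transformer_block(X, W1, W2):
    X = X + attention(X, X, X)                   # every position computed in parallel
    return X + np.maximum(X @ W1, 0.0) @ W2      # position-wise feed-forward (ReLU)
```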

But training from scratch on every task is wasteful. Sentiment, translation, and QA all need the same fundamental understanding of language.

So you pre-train. Take a Transformer, feed it the internet, and ask it to predict the next token. It learns grammar, facts, reasoning patterns, coding conventions, all from one objective. Then you fine-tune on your task and the base knowledge transfers.
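
The objective itself is small enough to sketch, in toy numpy with no real model behind it:

```python
import numpy as np

# The pre-training objective, schematically: at every position, the model's
# distribution over the vocabulary is scored against the token that actually
# came next. One objective, applied to everything you can feed it.
def next_token_loss(logits, token_ids):
    # logits: (seq, vocab) model outputs; token_ids: (seq,) the real text
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # position t predicts token t+1
    picked = probs[np.arange(len(token_ids) - 1), token_ids[1:]]
    return -np.mean(np.log(picked))
```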

Small models learn patterns. But some abilities only show up at scale.

So you scale. More parameters, more data, more compute. The scaling laws are remarkably clean. Loss drops predictably as you grow. And somewhere along the curve, abilities emerge that smaller models simply did not have.
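
Roughly the shape of that curve, with constants in the ballpark of what Kaplan et al. (2020) report. Treat them as illustrative, not gospel:

```python
# Loss as a power law in parameter count N, schematically.
def predicted_loss(N, alpha=0.076, N_c=8.8e13):
    return (N_c / N) ** alpha

for N in (1e8, 1e9, 1e10, 1e11):
    print(f"{N:.0e} params -> loss ~ {predicted_loss(N):.2f}")
```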

But a pre-trained model completes text. Ask it a question and it rambles. Give it an instruction and it might continue the instruction rather than follow it.

So you instruction-tune. Curate examples of prompts and desired responses. Train the model to treat prompts as tasks. Now it follows directions instead of playing autocomplete.
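
What that data looks like, schematically. The field names are just a convention made up here, not a standard:

```python
# Instruction-tuning data: pairs of a prompt and the response you want.
# Training is still next-token prediction -- only now on the response given the prompt.
examples = [
    {"prompt": "Summarize: The meeting moved from Tuesday to Thursday at 3pm.",
     "response": "The meeting is now on Thursday at 3pm."},
    {"prompt": "Translate to French: Good morning.",
     "response": "Bonjour."},
]
```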

But helpful is not the same as aligned. The model will confidently lie, produce toxic output, or help with things it should not.

So you use RLHF. Humans rank pairs of responses. You train a reward model on those preferences and use reinforcement learning to push outputs toward what humans actually want. The model becomes not just capable but aligned.
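
The reward-model piece, sketched as a pairwise logistic loss:

```python
import numpy as np

# The reward-model step of RLHF, schematically: for each human-ranked pair,
# push the reward of the chosen response above the rejected one with a
# logistic (Bradley-Terry) loss. The policy is then optimized against that reward.
def preference_loss(reward_chosen, reward_rejected):
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected)))))
```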

Your context window is 4K tokens. A contract is thirty thousand. A codebase is millions. You cannot fit them in.

So you extend context. RoPE, ALiBi, YaRN, positional interpolation. Techniques that stretch attention to 128K or a million tokens without training a new model from scratch. Whole books fit in one prompt.
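
One of those techniques, RoPE, sketched at its core: a rotation per dimension pair, with position interpolation amounting to rescaling the positions so they stay inside the range the model was trained on:

```python
import numpy as np

# Rotary position embeddings (RoPE), schematically: each pair of query/key
# dimensions is rotated by an angle proportional to the token's position.
def rope(x, position, base=10000.0):
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per dimension pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin              # rotate each (x1, x2) pair
    out[1::2] = x1 * sin + x2 * cos
    return out
```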

But long context means expensive inference. Every new token attends to every previous one. Cost grows quadratically.

So you use a KV cache. Store the keys and values from previous tokens and reuse them. Each new token becomes cheap to generate. Latency drops, throughput rises.
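
A cache, sketched in a few lines:

```python
import numpy as np

# A KV cache, schematically: keep the keys and values of every token generated
# so far, so each new token attends against the cache instead of re-encoding
# the whole prefix from scratch.
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        K, V = np.stack(self.keys), np.stack(self.values)
        scores = K @ q / np.sqrt(len(q))          # new query against all cached keys
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                              # attention output for the new token
```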

Your 70 billion parameter model needs 140GB of memory in FP16. Your GPU has 24.

So you quantize. Cast weights from 16 bits down to 8, 4, or even 2. GPTQ, AWQ, GGUF, bitsandbytes. The same model fits on a single consumer card, and at 8 or 4 bits the quality loss is surprisingly small.
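
The core move, sketched as naive per-tensor symmetric quantization. Real schemes like GPTQ and AWQ work per group and correct for the rounding error; this only shows the idea:

```python
import numpy as np

# Map float weights to a few bits with one shared scale; dequantize at compute time.
def quantize(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# The arithmetic from above: 70e9 params * 2 bytes (FP16) ~ 140 GB,
# while 4-bit weights are ~ 35 GB before overheads.
```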

You want to fine-tune on your domain, but updating 70 billion parameters takes a data center you do not own.

So you use LoRA. Freeze the base model. Train tiny low-rank adapters that nudge behavior in a new direction. You fine-tune on one GPU, ship a 50MB file instead of a 140GB checkpoint, and hot-swap adapters across tasks.
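
The trick, sketched with illustrative dimensions and scaling:

```python
import numpy as np

# LoRA, schematically: the pretrained weight W stays frozen; only a low-rank
# update B @ A is trained.
d, r = 4096, 16
W = np.random.randn(d, d)              # frozen pretrained weight
A = np.random.randn(r, d) * 0.01       # trainable
B = np.zeros((d, r))                   # trainable, zero-initialized so training starts from W

def lora_forward(x, alpha=32):
    return W @ x + (alpha / r) * (B @ (A @ x))   # base output plus a low-rank nudge

# Trainable parameters per layer: 2 * d * r ~ 131K, versus d * d ~ 16.8M for W itself.
```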

You want more capacity without paying for it on every token. Bigger models are better but also slower and more expensive.

So you use Mixture of Experts. Every layer has dozens of expert sub-networks. A router picks two per token. Total parameters balloon into the trillions but each forward pass only activates a slice. Capacity without the bill.
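
Top-2 routing, sketched. Expert count and gating here are illustrative, not any one model's recipe:

```python
import numpy as np

# A small router scores every expert, only the two best run for this token,
# and their outputs are blended by the router's softmax weights.
def moe_forward(x, router_W, experts):
    logits = router_W @ x
    top2 = np.argsort(logits)[-2:]                       # two experts per token
    gates = np.exp(logits[top2] - logits[top2].max())
    gates /= gates.sum()                                 # softmax over the winners
    return sum(g * experts[i](x) for g, i in zip(gates, top2))

experts = [lambda x, W=np.random.randn(64, 64): W @ x for _ in range(8)]
out = moe_forward(np.random.randn(64), np.random.randn(8, 64), experts)
```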

Your model was trained last year. It does not know what happened this week. It does not know your company’s internal docs. And it hallucinates when it does not know.

So you use RAG. Embed the query, retrieve relevant documents from a vector store, stuff them into context. The model answers from evidence instead of imagination.
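
The whole pipeline fits in a sketch. Here embed() and generate() stand in for whichever embedding model and LLM you actually use:

```python
import numpy as np

# Embed the query, rank documents by cosine similarity, put the best into the prompt.
def rag_answer(query, docs, embed, generate, k=3):
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    q = embed(query)
    scores = [cosine(q, embed(d)) for d in docs]
    top = [docs[i] for i in np.argsort(scores)[-k:]]
    context = "\n\n".join(top)
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```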

The model is smart but it cannot check the weather, query a database, or send an email. Text in, text out, and the real world is outside.

So you give it tools. Define functions the model can call. It emits structured JSON, your code executes it, the result goes back into context. The LLM becomes a controller, not just a text generator.
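
The loop between model and code, sketched. The tool and its schema are made up for illustration, not any particular provider's API:

```python
import json

# The model emits JSON naming a function, your code executes it,
# and the result is appended back into the context.
tools = {"get_weather": lambda city: {"city": city, "temp_c": 21}}

model_output = '{"tool": "get_weather", "arguments": {"city": "Pune"}}'
call = json.loads(model_output)
result = tools[call["tool"]](**call["arguments"])
# `result` goes back into the conversation so the model can write its final answer.
```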

But a single tool call is not enough. Real tasks require planning, branching, retrying, using the output of one step as input to the next.

So you build agents. The model loops. Observe, reason, act, observe again. ReAct, function calling, orchestration graphs. A single prompt becomes a multi-step workflow the model navigates itself.
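
A minimal loop, sketched. llm() and run_tool() are stand-ins for your model call and your tool executor:

```python
# A ReAct-style agent loop, schematically: the model alternates between deciding
# on an action and reading the result, until it declares it is done.
def run_agent(task, llm, run_tool, max_steps=10):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))            # model reasons about the next action
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()   # model decided it is finished
        observation = run_tool(step)              # act in the world
        history += [step, f"Observation: {observation}"]   # feed the result back in
    return "Stopped: step budget exhausted."
```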

Your model is fluent but shallow. It pattern-matches its way through math and logic. Hard reasoning breaks.

So you use test-time compute. Chain-of-thought at first. Then models trained to generate long internal reasoning traces before answering. o1, R1, and the reasoning family. Give the model more tokens to think and accuracy climbs on problems rote generation cannot solve.
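
At its simplest, this is just a prompt that buys the model room to work. Reasoning models bake the same behavior in during training rather than relying on the prompt:

```python
# Chain-of-thought: ask for the reasoning before the answer, so the model spends
# tokens working instead of guessing.
prompt = (
    "A train leaves at 3:40pm and the trip takes 85 minutes. When does it arrive?\n"
    "Think step by step, then state the final answer."
)
# 85 minutes = 1h25m, so 3:40pm + 1:25 = 5:05pm -- the intermediate tokens are
# where that arithmetic actually happens.
```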

Text is one modality. Users have images, audio, video, screenshots, documents with layout.

So you go multimodal. Shared token spaces for text and vision. Models that read a chart and explain it, watch a video and summarize it, listen to audio and respond. One model, many inputs, one generation pipeline.

Attention is quadratic. Even with caching, a million tokens is a million tokens and the math does not care.

So you explore alternatives. State space models like Mamba. Linear attention. Hybrid architectures. Constant memory per token, linear time complexity, competitive quality. The Transformer is no longer the only option on the table.
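
The generic linear recurrence behind those models, sketched. This is the shape of the idea, not Mamba's selective version:

```python
import numpy as np

# A fixed-size state updated once per token: constant memory, linear time
# in sequence length, no pairwise attention scores.
def ssm_scan(xs, A, B, C):
    state = np.zeros(A.shape[0])
    outputs = []
    for x in xs:                        # one cheap update per token
        state = A @ state + B @ x
        outputs.append(C @ state)
    return np.stack(outputs)
```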