Running Ideogram 4 Locally: Quantized Inference and Structured JSON Captions

June 3, 2026

Ideogram 4 · Text-to-Image · Diffusion Transformer · Flow Matching · Quantization · Structured Prompting · Local Inference

The thing that finally got me to sit down with Ideogram 4 was the text rendering. Most image models treat text in a scene as decorative blur: you ask for a neon sign in Japanese and you get neon-sign-shaped noise. Ideogram has been good at in-image text for a while, and when they open-sourced the weights last week I wanted to know whether “good at text” survived the quantization-to-run-on-one-GPU trip. It did. The prompting interface they shipped with it, a structured JSON schema with bounding boxes, per-element palettes, and a strict key-ordering verifier, is novel in ways I didn’t expect. Below: what the model is, how to run it locally, and a walkthrough of the notebook I built to control every inference parameter interactively.

Run it yourself: the full notebook, setup guide, and README live at spate141/latent-lab/ideogram4.

What Ideogram 4 is

Ideogram 4 is Ideogram’s first open-weight text-to-image model, released on June 3, 2026. It is a 9.3B parameter Diffusion Transformer (DiT) trained from scratch, not a fine-tune or distillation of any existing model. Every design decision in the architecture and prompting interface is theirs.

The core pipeline is a single-stream DiT: text and image latent tokens are concatenated into one sequence and processed jointly through 34 transformer blocks, each modulated by AdaLN parameters computed from the flow-matching timestep embedding. The text encoder is Qwen3-VL-8B-Instruct, a vision-language model whose hidden states from a sparse set of layers (0, 3, …, 33, 35) feed into the conditioning signal. The sampler is Euler flow-matching with a logit-normal timestep schedule and asymmetric CFG; a VAE decodes the final latents into a PIL image.
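To make the sampler description a little more concrete, here is a toy sketch of Euler flow-matching over a non-uniform time grid. It is not Ideogram’s sampler: the grid construction (quantiles of a logit-normal distribution shaped by mu and std), the convention that t = 1 is pure noise, and the analytic `toy_velocity` standing in for the DiT are all my assumptions, and CFG is left out entirely.

```python
import math
from statistics import NormalDist


def logit_normal_grid(num_steps, mu=0.0, std=1.0):
    """Times from 1.0 (noise) down to 0.0 (data), spaced by logit-normal quantiles.

    Assumption: one plausible reading of a "logit-normal schedule",
    not the schedule Ideogram 4 actually ships.
    """
    normal = NormalDist(mu, std)
    inner = []
    for i in range(1, num_steps):
        z = normal.inv_cdf(i / num_steps)         # Gaussian quantile
        inner.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid maps it into (0, 1)
    return [1.0] + inner[::-1] + [0.0]


def toy_velocity(x, t, target):
    """Stand-in for the DiT: exact velocity of the straight path from noise to target."""
    return [(xi - ti) / t for xi, ti in zip(x, target)]


def euler_sample(noise, target, num_steps, mu=0.0, std=1.0):
    """Integrate dx/dt = v(x, t) from t = 1 to t = 0 with explicit Euler steps."""
    grid = logit_normal_grid(num_steps, mu, std)
    x = list(noise)
    for t, t_next in zip(grid, grid[1:]):
        v = toy_velocity(x, t, target)
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x


sample = euler_sample(noise=[2.0, -1.0], target=[0.5, 0.25], num_steps=12, mu=0.5, std=1.75)
print(sample)  # lands on the target, up to floating-point error
```

Because the toy velocity field is exact, Euler recovers the target regardless of the grid; with a learned velocity the spacing of the grid decides where the step budget is spent.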

At 9.3B parameters, it delivers the best text rendering of any open-weight release, ahead of models nearly 10× its size. The headline capabilities:

Multilingual in-image text rendering. Logos, signage, captions, watermarks, multi-line text: all generated at high fidelity directly from the prompt. Beats Qwen-Image (20B), FLUX.2 [dev] (32B), and HunyuanImage 3.0 (80B MoE) in blind typography evaluations.
Structured JSON prompting. The model was trained on a JSON caption schema that provides fine-grained control over composition, style, color palette, and spatial layout via bounding boxes. Plain text works too, but matching the training format closes the distribution gap.
Native 2K resolution. Any height/width that’s a multiple of 16, from 256 to 2048 pixels per side, aspect ratios up to 6:1.
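The resolution rules in the last item are easy to check before a request ever reaches the pipeline. A minimal sketch: the helper name and the choice to round down are mine, while the limits (multiples of 16, 256 to 2048 per side, at most 6:1) come from the list above.

```python
def snap_resolution(width, height, multiple=16, lo=256, hi=2048, max_aspect=6.0):
    """Clamp each side to [lo, hi], round down to a multiple of 16, check the aspect ratio."""
    def snap(value):
        clamped = max(lo, min(hi, int(value)))
        return (clamped // multiple) * multiple

    w, h = snap(width), snap(height)
    if max(w, h) / min(w, h) > max_aspect:
        raise ValueError(f"{w}x{h} exceeds the {max_aspect}:1 aspect-ratio limit")
    return w, h


print(snap_resolution(1000, 700))  # (992, 688)
```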

On third-party benchmarks: Design Arena (design-focused Elo leaderboard) ranks Ideogram 4 first among all open-weight models and top-5 overall, trailing only proprietary models from OpenAI and Google. ContraLabs ran a blind typography eval with ten professional designers: Ideogram 4 was picked first 47.9% of the time, compared to 30.0% for Gemini Nano Banana 2, 15.5% for FLUX.2 [max], and 15.0% for Grok Imagine 1.0. On LMArena (general text-to-image), Ideogram is the top-ranked open-weight lab.

Where the technical resources live

Everything is public:

GitHub: ideogram-oss/ideogram4 — model code, inference script, prompting guide, and the open-source magic-prompt system prompts.
HuggingFace: two gated model repos: ideogram-ai/ideogram-4-nf4 (bitsandbytes 4-bit, CUDA only, ~10 GB VRAM) and ideogram-ai/ideogram-4-fp8 (weight-only float8, any hardware, ~13 GB VRAM).
Technical blog post: ideogram.ai/blog/ideogram-4.0 covers the design choices behind the architecture and the benchmarks in depth.
API: if you don’t have the hardware, the hosted model is available at developer.ideogram.ai including a free magic-prompt expansion endpoint.
In-repo docs: docs/model_architecture.md, docs/inference.md, and docs/prompting.md are worth reading; the prompting guide in particular is the most useful reference for building JSON captions.

Stack: Ideogram 4 nf4 (9.3B, bitsandbytes 4-bit), PyTorch 2.12 + CUDA 13.0, transformers, accelerate, ipywidgets, psutil, Jupyter Notebook, Python 3.12 in a venv.

Getting it running locally

The full setup is documented in README.md in the repo. The condensed version:

1. Create the venv and install PyTorch with CUDA support first. The pyproject.toml dependency torch>=2.11 doesn’t pin a CUDA build, so a plain pip install -e . can pull the CPU wheel. Solve this by installing torch explicitly before anything else:

python3 -m venv .venv && source .venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
# Expected: True  13.0

2. Install the package and notebook dependencies:

pip install -e .
pip install jupyter ipywidgets ipykernel
python -m ipykernel install --user --name ideogram4-venv --display-name "Python (ideogram4)"

That last command registers the venv as a named Jupyter kernel so the notebook’s import ideogram4 finds the package you just installed.

3. Accept the HuggingFace model gate. Both repos are gated. Visit ideogram-ai/ideogram-4-nf4 in a browser and click Agree and access repository. Then authenticate locally:

hf auth login    # paste a token with Read scope

4. The first pipeline load downloads ~12–15 GB: the DiT transformer (~5–6 GB), Qwen3-VL text encoder (~5–6 GB), and VAE (~0.5 GB). They land in ~/.cache/huggingface/hub/ and every subsequent run loads from disk. After that first download, the entire workflow (loading, generating, iterating) is fully offline.

Walking through the notebook

The full notebook is on GitHub: ideogram4_playground.ipynb

Here is what each section does.

1. Device and quantization detection

The notebook picks the right weight variant automatically:

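import torch
# Prefer CUDA, then Apple Silicon (MPS), then CPU.
DEVICE = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')
# nf4 depends on bitsandbytes (CUDA only); fp8 loads on any device.
QUANTIZATION = 'nf4' if DEVICE == 'cuda' else 'fp8'
WEIGHTS_REPO = f'ideogram-ai/ideogram-4-{QUANTIZATION}'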

nf4 needs bitsandbytes and only runs on CUDA; fp8 is weight-only quantization that works anywhere. The hardware summary cell prints the GPU name, total VRAM, and CUDA version so you know what you’re working with before loading anything.

2. HF cache check and pipeline load

Before loading, a cell walks the HF cache directory and reports which model variant is already on disk and how large it is, versus which would trigger a download. On my machine:

HF cache    : /mnt/d/HF_CACHE/hub
✅ cached ideogram-ai/ideogram-4-nf4  (32.2 GB on disk)
   not downloaded ideogram-ai/ideogram-4-fp8

Then the pipeline loads once per session:

import torch
from ideogram4 import Ideogram4Pipeline, Ideogram4PipelineConfig

# Load once per session; later pipe(...) calls reuse the weights already on the device.
pipe = Ideogram4Pipeline.from_pretrained(
    config=Ideogram4PipelineConfig(weights_repo=WEIGHTS_REPO),
    device=DEVICE,
    dtype=torch.bfloat16,
)

This is the slow cell: 20–40 seconds on first load while PyTorch maps the quantized weights to GPU memory. After that you call pipe(...) as many times as you want without reloading.

3. Interactive parameter panel

The panel exposes every argument pipe() accepts as a live widget: resolution (256–2048 px, step 16), sampler preset, num_steps, guidance_scale, mu, std, seed, raise_on_caption_issues, and a guidance_schedule textarea for per-step CFG weights.

The three sampler presets bundle the parameters that matter most:

| Preset | Steps | mu | std | Notes |
| --- | --- | --- | --- | --- |
| `V4_QUALITY_48` | 48 | 0.0 | 1.5 | Best quality, default |
| `V4_DEFAULT_20` | 20 | 0.0 | 1.75 | Good speed/quality balance |
| `V4_TURBO_12` | 12 | 0.5 | 1.75 | Fastest |

Selecting a preset auto-fills num_steps, mu, std, and the full per-step guidance schedule, then locks those widgets. Switching to Custom unlocks everything for manual tuning.
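Stripped of the widgets, the preset logic is a lookup plus a lock flag. A rough sketch using the values from the table; the `SamplerPreset` class and `resolve_sampler` function are hypothetical names, and the per-step guidance schedules are omitted because I’m not reproducing them here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SamplerPreset:
    num_steps: int
    mu: float
    std: float


# Values copied from the preset table; guidance schedules intentionally left out.
PRESETS = {
    "V4_QUALITY_48": SamplerPreset(num_steps=48, mu=0.0, std=1.5),
    "V4_DEFAULT_20": SamplerPreset(num_steps=20, mu=0.0, std=1.75),
    "V4_TURBO_12": SamplerPreset(num_steps=12, mu=0.5, std=1.75),
}


def resolve_sampler(preset_name, custom=None):
    """Return (params, locked): presets lock their values, 'Custom' leaves them editable."""
    if preset_name == "Custom":
        return dict(custom or {}), False
    return asdict(PRESETS[preset_name]), True


print(resolve_sampler("V4_TURBO_12"))
# ({'num_steps': 12, 'mu': 0.5, 'std': 1.75}, True)
```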

4. The `generate()` helper

read_params() snapshots the current widget state into a kwargs dict. generate() wraps pipe(), times the call, prints progress, saves the PNG, and displays it inline:

t0 = time.perf_counter()
images = pipe(prompt, **params)
elapsed = time.perf_counter() - t0
print(f'Done in {elapsed:.1f}s  ({elapsed / params["num_steps"]:.2f}s/step)')

Any widget value can be overridden for a single call without touching the sliders: generate(prompt, width=512, height=512).
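The override behaviour comes down to a dictionary merge in which the call-site keywords win. A simplified sketch, not the notebook’s actual helper: here `pipe` and the widget snapshot are passed in explicitly, and `fake_pipe` is a stand-in so the example runs without a GPU.

```python
import time


def generate(pipe, prompt, widget_params, **overrides):
    """Call pipe with the widget values, letting keyword overrides win for this call only."""
    params = {**widget_params, **overrides}  # widget_params itself is left untouched
    t0 = time.perf_counter()
    images = pipe(prompt, **params)
    elapsed = time.perf_counter() - t0
    print(f"Done in {elapsed:.1f}s  ({elapsed / params['num_steps']:.2f}s/step)")
    return images, params


def fake_pipe(prompt, **params):
    """Stand-in for the real pipeline: just echoes the resolution it was asked for."""
    return [f"{params['width']}x{params['height']}"]


widget_params = {"width": 1024, "height": 1024, "num_steps": 48}
images, used = generate(fake_pipe, "a red kite", widget_params, width=512, height=512)
print(images)                  # ['512x512']
print(widget_params["width"])  # 1024
```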

5. Structured JSON captions

This is where Ideogram 4 diverges from every other text-to-image model I’ve worked with. Ideogram trained on a rigid JSON schema, and the repo ships CaptionVerifier, a pure-Python class that validates key ordering, bounding box ranges, hex color format, and encoding. No weights needed: verify a caption offline before spending GPU time on it.

The schema has three top-level fields: high_level_description, style_description, and compositional_deconstruction. Key order matters. For a photographic scene, style_description must go aesthetics → lighting → photo → medium → color_palette, in that order. elements within compositional_deconstruction follow a per-type order: obj elements are type → bbox → desc → color_palette; text elements are type → bbox → text → desc → color_palette. Python 3.7+ dict insertion order is sufficient. Write your keys in the right order and the verifier passes.
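The ordering rule is simple to illustrate with plain dicts. The sketch below is not the repo’s CaptionVerifier; `in_expected_order` is a hypothetical helper that checks relative key order only, and the element values are placeholders rather than anything from the notebook.

```python
import json

# Per-type key order for elements, as described in the paragraph above.
ELEMENT_ORDER = {
    "obj": ["type", "bbox", "desc", "color_palette"],
    "text": ["type", "bbox", "text", "desc", "color_palette"],
}


def in_expected_order(mapping, expected):
    """True when the keys of mapping that appear in expected keep expected's order."""
    present = [key for key in mapping if key in expected]
    return present == [key for key in expected if key in mapping]


# Hypothetical text element with placeholder values.
sign = {
    "type": "text",
    "bbox": [100, 200, 300, 800],
    "text": "RAMEN",
    "desc": "pink neon sign above the stall",
    "color_palette": ["#FF2D95"],
}
shuffled = {"bbox": sign["bbox"], "type": "text", "text": "RAMEN"}

round_tripped = json.loads(json.dumps(sign))  # a JSON round trip keeps insertion order
print(in_expected_order(round_tripped, ELEMENT_ORDER["text"]))  # True
print(in_expected_order(shuffled, ELEMENT_ORDER["text"]))       # False
```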

Bounding boxes are [y_min, x_min, y_max, x_max] in normalized 0–1000 coordinates. Per-element color_palette arrays let you steer individual sign colors independently of the overall image palette.
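If you think in pixels, a small converter saves mental arithmetic. A sketch under two assumptions of mine: the helper name is hypothetical, and coordinates are rounded to integers, which the text above doesn’t specify.

```python
def pixel_box_to_bbox(left, top, right, bottom, image_width, image_height):
    """Convert a pixel rectangle to [y_min, x_min, y_max, x_max] on a 0-1000 grid."""
    if not (0 <= left < right <= image_width and 0 <= top < bottom <= image_height):
        raise ValueError("box must be non-empty and lie inside the image")
    return [
        round(1000 * top / image_height),
        round(1000 * left / image_width),
        round(1000 * bottom / image_height),
        round(1000 * right / image_width),
    ]


# Central half of a 1536x1024 canvas.
print(pixel_box_to_bbox(384, 256, 1152, 768, image_width=1536, image_height=1024))
# [250, 250, 750, 750]
```

Note the y-before-x order in the result: it is the opposite of the (left, top, right, bottom) convention most imaging libraries use.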

6. The Neon Ramen Alley example

To test text rendering, the example prompt is a rain-soaked cyberpunk night-market alley with 10 elements: four obj (ramen stall, hooded customer, steam, lanterns) and six text spanning two languages (English neon, Japanese kanji, a chalk price board). The notebook serializes Japanese characters as literals with json.dumps(..., ensure_ascii=False) because the verifier warns on \uXXXX escapes when the raw text has no literal non-ASCII characters.

ramen_prompt = json.dumps(ramen_caption, separators=(',', ':'), ensure_ascii=False)

verifier = CaptionVerifier()
issues = verifier.verify_raw(ramen_prompt)
# Caption verified — 4,779 chars, 10 elements, 19 non-ASCII chars ✓
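The difference `ensure_ascii=False` makes is easy to see on a one-element example using only the standard `json` module; the element below is a placeholder, not part of the ramen caption.

```python
import json

element = {"type": "text", "text": "ラーメン"}

escaped = json.dumps(element, separators=(",", ":"))  # default: ensure_ascii=True
literal = json.dumps(element, separators=(",", ":"), ensure_ascii=False)

non_ascii = sum(1 for ch in literal if ord(ch) > 127)
print(escaped)    # the katakana are written as \uXXXX escape sequences
print(literal)    # {"type":"text","text":"ラーメン"}
print(non_ascii)  # 4
```

Both strings decode to the same object; only the serialized form the model and verifier see is different.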

A clean verifier result before generation means the caption conforms to the schema the model was trained on, so the prompt stays close to the training distribution.

ramen_image = generate(
    ramen_prompt,
    output_path='outputs/neon_ramen_alley.png',
    width=1536,
    height=1024,
)

Neon Ramen Alley — generated locally by Ideogram 4 nf4

7. Magic prompt (optional)

If you’d rather describe a scene in plain English and let a model translate it into the JSON schema, the repo ships a MagicPrompt interface with three backends:

| Key | Backend | Env var |
| --- | --- | --- |
| `ideogram-4-v1` | Ideogram hosted API (free) | `IDEOGRAM_API_KEY` |
| `claude-opus-v1` | Claude Opus 4.8 via OpenRouter | `MAGIC_PROMPT_API_KEY` |
| `claude-sonnet-v1` | Claude Sonnet 4.6 via OpenRouter | `MAGIC_PROMPT_API_KEY` |

magic = MAGIC_PROMPTS['ideogram-4-v1'](api_key=os.environ['IDEOGRAM_API_KEY'])
caption = magic.expand('a golden retriever on a skateboard', aspect_ratio='3:2')
# caption is now a verified JSON string; pass it straight to generate()

This is the only part of the workflow that touches an external API. Image generation still runs locally.
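Before picking a backend it helps to know which keys are actually set. A small sketch built only from the table above; `usable_backends` is a hypothetical helper and says nothing about how the repo’s MagicPrompt classes read their credentials.

```python
import os

# Backend key -> environment variable, copied from the table above.
BACKEND_ENV_VARS = {
    "ideogram-4-v1": "IDEOGRAM_API_KEY",
    "claude-opus-v1": "MAGIC_PROMPT_API_KEY",
    "claude-sonnet-v1": "MAGIC_PROMPT_API_KEY",
}


def usable_backends(env=os.environ):
    """Return the backend keys whose API-key variable is set to a non-empty value."""
    return [key for key, var in BACKEND_ENV_VARS.items() if env.get(var)]


print(usable_backends({"IDEOGRAM_API_KEY": "example-key"}))  # ['ideogram-4-v1']
```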

8. Memory panel

After loading a 9.3B quantized model and generating a few images, you want to see what you’re holding in memory. A show_memory() helper renders a dark-themed HTML panel with four gradient bars: VRAM allocated (model weights), VRAM reserved (PyTorch allocator pool), RAM process RSS, and RAM system pressure. Numbers turn red past 85%. Call it any time: after the pipeline loads, after a 2K generation, after clearing the cache, to spot leaks.
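The rendering side is plain string formatting. A stripped-down sketch of one bar, with a hypothetical `memory_bar` helper and colors of my own choosing; in a notebook the result would be passed to IPython’s `display(HTML(...))`, with the numbers coming from calls such as `torch.cuda.memory_allocated()` and `psutil.virtual_memory()`.

```python
def memory_bar(label, used_gb, total_gb, warn_at=0.85):
    """Render one usage bar as an HTML snippet; the number turns red past warn_at."""
    fraction = min(used_gb / total_gb, 1.0)
    color = "#ff5555" if fraction > warn_at else "#e0e0e0"
    return (
        f'<div style="background:#1e1e1e;padding:4px">{label} '
        f'<span style="color:{color}">{used_gb:.1f} / {total_gb:.1f} GB</span>'
        f'<div style="width:{fraction:.0%};height:8px;'
        f'background:linear-gradient(90deg,#4caf50,#ff9800)"></div></div>'
    )


html = memory_bar("VRAM allocated", 10.0, 24.0)
print(html)
```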

A few design decisions worth calling out

Auto device and quantization selection

nf4 and fp8 differ in more than memory: nf4 requires bitsandbytes, which needs CUDA, and won’t load on MPS or CPU at all. Tying device detection to quantization choice means the same notebook works on an RTX 4090, a MacBook with an M-series chip, and a CPU-only box with no configuration changes. If you want to force a specific variant, the constants are at the top of the cell.

Preset ↔ widget locking

The sampler parameters (num_steps, mu, std, guidance_schedule) are only meaningful as a bundle. Picking 48 steps with a V4_TURBO_12-style schedule defeats the purpose of either preset. The on_preset_change observer fills all four values from PRESETS[name] and disables the individual widgets so you can’t produce a hybrid configuration. Selecting “Custom” re-enables everything and clears the schedule textarea. The guidance_schedule is the subtlest parameter: a tuple of floats in loop-index order (index 0 is the final polish step), serialized as a comma-separated string to fit in a text area.
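The textarea round trip for the schedule can be sketched in a few lines. Both function names are hypothetical, the weights in the example are placeholders, and two behaviours are my assumptions: that a schedule carries exactly one weight per step, and that an empty textarea means no schedule at all.

```python
def parse_guidance_schedule(text, num_steps):
    """Parse '4, 3.5, ...' into a tuple of floats, one weight per sampling step.

    Assumption: an empty textarea means "no schedule" and yields None.
    """
    values = tuple(float(part) for part in text.split(",") if part.strip())
    if not values:
        return None
    if len(values) != num_steps:
        raise ValueError(f"expected {num_steps} weights, got {len(values)}")
    return values


def format_guidance_schedule(values):
    """Serialize a schedule back into the comma-separated form shown in the textarea."""
    return ", ".join(f"{v:g}" for v in values)


schedule = parse_guidance_schedule("4, 3.5, 3", num_steps=3)
print(schedule)                            # (4.0, 3.5, 3.0)
print(format_guidance_schedule(schedule))  # 4, 3.5, 3
```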

CaptionVerifier as a pre-flight check

Running CaptionVerifier().verify_raw(prompt) before calling generate() costs near-zero time (pure Python, no GPU) and catches schema violations (wrong key order, malformed bounding boxes, lowercase hex colors, character encoding issues) before you spend 30+ seconds on a generation that might come out degraded or raise an exception mid-run. Treating the verifier as a unit test you run on every hand-written caption is the right mental model: a green verifier means the caption matches the format the model was trained on.
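The gate itself can be a few lines. In this sketch `preflight` is a hypothetical wrapper and `toy_verify` is a stand-in that only flags lowercase hex colors; I’m assuming the real `verify_raw` returns an iterable of issues that is empty when the caption is clean.

```python
import re


def preflight(prompt, verify):
    """Run a verifier callable on the caption and refuse to continue if it reports issues."""
    issues = list(verify(prompt) or [])
    if issues:
        raise ValueError("caption failed verification: " + "; ".join(map(str, issues)))
    return prompt


def toy_verify(raw):
    """Stand-in for CaptionVerifier().verify_raw: flags only lowercase hex colors."""
    pattern = r"#[0-9A-Fa-f]*[a-f][0-9A-Fa-f]*"
    return [f"lowercase hex color {match}" for match in re.findall(pattern, raw)]


good = preflight('{"color_palette":["#FF2D95"]}', toy_verify)  # passes through unchanged
try:
    preflight('{"color_palette":["#ff2d95"]}', toy_verify)
except ValueError as err:
    print(err)  # caption failed verification: lowercase hex color #ff2d95
```

With the real verifier you would pass `CaptionVerifier().verify_raw` as `verify` and call `generate()` only on the returned prompt.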

The memory panel

torch.cuda.memory_allocated() reports the memory occupied by live tensors, which at idle is mostly the model weights; torch.cuda.memory_reserved() reports everything PyTorch’s caching allocator has claimed from the driver, whether or not it is currently in use. The gap between the two is the allocator’s pool, held to avoid repeated cudaMalloc calls. On a 24 GB card with the nf4 model loaded, allocated sits around 10 GB and reserved around 12 GB, leaving ~12 GB headroom for activations during generation. At 2048×2048 that headroom is tight; at 1536×1024 (the example above) it’s comfortable. Seeing those numbers before you push resolution helps avoid the mid-generation OOM.
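The arithmetic behind those numbers is worth writing down once. A sketch with a hypothetical `summarize_vram` helper that takes raw byte counts; in the notebook they would come from `torch.cuda.memory_allocated()`, `torch.cuda.memory_reserved()`, and `torch.cuda.get_device_properties(0).total_memory`, and the example values below simply mirror the 24 GB scenario above.

```python
GIB = 2**30


def summarize_vram(allocated_bytes, reserved_bytes, total_bytes):
    """Split device memory into live tensors, allocator slack, and remaining headroom (GiB)."""
    return {
        "allocated": allocated_bytes / GIB,                      # live tensors
        "reserved": reserved_bytes / GIB,                        # held by the caching allocator
        "pool_slack": (reserved_bytes - allocated_bytes) / GIB,  # reserved but unused
        "headroom": (total_bytes - reserved_bytes) / GIB,        # not yet claimed by PyTorch
    }


summary = summarize_vram(10 * GIB, 12 * GIB, 24 * GIB)
print(summary)
# {'allocated': 10.0, 'reserved': 12.0, 'pool_slack': 2.0, 'headroom': 12.0}
```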

The JSON schema gives leverage on the kinds of scenes that usually defeat text-to-image models. A chalk menu board with specific Japanese items at specific prices, rendered legibly on a rain-slicked surface with independently colored neon signs: you wouldn’t bother trying that in plain text. The bounding box coordinates feel verbose to write by hand, but once you internalize the 0–1000 normalized range and the [y_min, x_min, y_max, x_max] order, the layout control is deterministic in a way that prose prompting never is.

The full notebook (ipywidgets panel, generate() helper, CaptionVerifier workflow, memory panel, and the complete Neon Ramen Alley caption) is at spate141/latent-lab/ideogram4. Clone the repo, follow the setup steps in README.md, accept the HF gate, and you’re a single jupyter notebook command away from your first local Ideogram 4 generation.

Snehal Patel
