Snehal Patel

I love to build things ✨

Kokoro-82M: Running a Local Text-to-Speech Model on One GPU

I wanted to add audio narration to my blog posts on snehal.ai. The idea was simple: drop in the text of a post, get a clean .wav out, ship it with the post. The execution took longer than expected because most open-weight TTS models that sounded good enough were either too large for my RTX 4090, too slow to be practical, or produced artifacts that made long-form narration unusable. A few pipelines seemed promising and then quietly died on 3,000-word posts. None of them gave me something I could run repeatedly and trust.

Kokoro was the first that worked. 82 million parameters, Apache-2.0 weights, and voices clear enough to publish. The model downloads in under a minute, loads in seconds, and streams audio chunk by chunk so it never blocks on long documents.

Run it yourself: the notebook and setup guide are at spate141/latent-lab/kokoro.


What Kokoro is

Kokoro-82M is an open-weight text-to-speech model released by hexgrad. 82 million parameters, Apache-2.0 license, phoneme-based synthesis. The architecture combines StyleTTS 2 with iSTFTNet, routing text to waveform through a neural vocoder pipeline in a single forward pass.

The pipeline breaks into four stages:

input text
  -> misaki G2P: text -> phonemes -> input_ids (espeak-ng as OOD fallback)
  -> CustomAlbert (PL-BERT) + TextEncoder: contextual phoneme representations, fed to prosody + decoder
  -> ProsodyPredictor: predicts duration, F0, energy, conditioned on the voice ref_s tensor
  -> iSTFTNet decoder -> 24 kHz WAV

misaki G2P converts text to phonemes. It’s a standalone library (pip install misaki) with espeak-ng as an out-of-dictionary fallback for English words not in its lexicon. The model receives phoneme sequences; raw text never reaches the neural network.
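The dictionary-first, fallback-second lookup can be sketched in a few lines. This is a toy illustration and not misaki's API: TOY_LEXICON, its placeholder phoneme strings, and toy_fallback are all made up for the example.

```python
from typing import Callable, List, Optional

# Hypothetical mini-lexicon; the phoneme strings are placeholders, not misaki output.
TOY_LEXICON = {"hello": "h-eh-l-ow", "world": "w-er-l-d"}


def toy_fallback(word: str) -> str:
    """Stand-in for a rule-based fallback such as espeak-ng: spell the word out."""
    return "-".join(word)


def to_phonemes(text: str,
                fallback: Optional[Callable[[str], str]] = toy_fallback) -> List[str]:
    """Look each word up in the lexicon; use the fallback for unknown words."""
    phonemes = []
    for word in text.lower().split():
        if word in TOY_LEXICON:
            phonemes.append(TOY_LEXICON[word])
        elif fallback is not None:
            phonemes.append(fallback(word))
        # With no fallback, an out-of-dictionary word is simply dropped.
    return phonemes
```

Calling to_phonemes("Hello kokoro", fallback=None) returns only the entry for "hello", which is the kind of silent gap a fallback is there to prevent.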

CustomAlbert (PL-BERT style) encodes the phoneme sequence into contextual representations. A linear projection maps those into the hidden dimension shared with the rest of the model. The TextEncoder builds on top to produce the per-phoneme features that feed the decoder.

ProsodyPredictor takes the encoder output and a voice reference tensor (ref_s) and predicts duration, fundamental frequency (F0), and energy for each phoneme. Each voice reference is a fixed learned embedding that distinguishes one voice from another. To switch voices, you load a different .pt file.

iSTFTNet Decoder takes the prosody-conditioned sequence and synthesizes the waveform via inverse short-time Fourier transform. One forward pass produces audio. Generation is fast even on CPU.

Model weights are ~327 MB, pulled once from hexgrad/Kokoro-82M on Hugging Face and cached in ~/.cache/huggingface/.
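As a rough sanity check on that download size: 82 million parameters stored as 32-bit floats come to about 328 MB. The fp32 storage is my assumption; the figures given here are only the parameter count and the file size.

```python
PARAMS = 82_000_000   # parameter count, from the model name
BYTES_PER_PARAM = 4   # assumption: weights stored as 32-bit floats

size_mb = PARAMS * BYTES_PER_PARAM / 1_000_000
print(f"{size_mb:.0f} MB")  # prints "328 MB"
```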


Voices, languages, and speed

Kokoro ships with roughly 50 preset voices organized by accent and gender. The voice code encodes both in its two-character prefix: the first character is the language/accent and the second is the gender (af = American female, am = American male, bf = British female, bm = British male). The rest, after the underscore, is the voice name. The voices I use most:

Code         Gender   Accent
af_sky       Female   American
af_heart     Female   American
af_bella     Female   American
am_michael   Male     American
am_adam      Male     American
bf_emma      Female   British
bm_george    Male     British

Full list and audio samples: hexgrad/Kokoro-82M/SAMPLES.md
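The naming scheme is regular enough to parse mechanically. Here is a minimal sketch; describe_voice is a hypothetical helper (not part of kokoro) and it only knows the four English prefixes listed above.

```python
from typing import Tuple

ACCENTS = {"a": "American", "b": "British"}  # only the English prefixes listed above
GENDERS = {"f": "Female", "m": "Male"}


def describe_voice(code: str) -> Tuple[str, str, str]:
    """Split a voice code like 'af_sky' into (name, gender, accent)."""
    prefix, _, name = code.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice code: {code!r}")
    accent, gender = prefix[0], prefix[1]
    if accent not in ACCENTS or gender not in GENDERS:
        raise ValueError(f"unknown prefix in voice code: {code!r}")
    return name, GENDERS[gender], ACCENTS[accent]
```

For example, describe_voice("bm_george") gives ("george", "Male", "British"), matching the table row.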

Voices are .pt tensors you can load directly and, if you want, blend between two by mixing the tensors before passing them to the pipeline:

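import torch

# Each voice file holds one embedding tensor.
v1 = torch.load('af_sky.pt', weights_only=True)
v2 = torch.load('af_heart.pt', weights_only=True)
blended = 0.6 * v1 + 0.4 * v2  # 60% af_sky, 40% af_heart
generator = pipeline(text, voice=blended)  # pipeline: an existing KPipeline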
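The same idea generalizes to any number of voices. Below is a minimal sketch of a weighted blend that normalizes the weights so they always sum to 1. blend is a hypothetical helper, not part of kokoro. It works with anything that supports scalar multiplication and addition, so it should apply to same-shaped torch tensors, but the example uses plain floats so it runs without torch.

```python
def blend(voices, weights):
    """Weighted average of voice embeddings; weights are normalized to sum to 1."""
    if not voices or len(voices) != len(weights):
        raise ValueError("need at least one voice and one weight per voice")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mixed = voices[0] * (weights[0] / total)
    for voice, weight in zip(voices[1:], weights[1:]):
        mixed = mixed + voice * (weight / total)
    return mixed


# Floats stand in for embedding tensors here.
print(blend([1.0, 3.0], [1, 1]))  # prints 2.0
```

Normalizing means blend([v1, v2], [3, 2]) is the same 60/40 mix as the snippet above, without having to make the weights add up by hand.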

The lang_code controls the G2P language and must match the first character of the voice code ('a' for American English, 'b' for British English, 'e' for Spanish, 'f' for French, 'h' for Hindi, 'i' for Italian, 'j' for Japanese, 'p' for Brazilian Portuguese, 'z' for Mandarin). Japanese and Mandarin need pip install misaki[ja] or misaki[zh], respectively.
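Written out as a lookup table, the language codes look like this. It is a sketch for reference: LANG_CODES, EXTRAS, and lang_code_for are names I made up, and the table only covers the codes listed in this post.

```python
LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin",
}

# Languages that need an extra misaki install.
EXTRAS = {"j": "misaki[ja]", "z": "misaki[zh]"}


def lang_code_for(voice: str) -> str:
    """Return the lang_code implied by a voice code's first character."""
    code = voice[:1]
    if code not in LANG_CODES:
        raise ValueError(f"no known lang_code for voice {voice!r}")
    return code
```

A lookup like EXTRAS.get(lang_code_for(voice)) tells you whether the chosen voice needs one of the optional installs before the pipeline is created.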

Speed is a float multiplier passed directly to the pipeline (speed=1.0 is normal, 0.8 is slower, 1.2 is faster). No separate spectrogram stretching or pitch shifting: the prosody predictor bakes it in.


Where the technical resources live

  • GitHub: hexgrad/kokoro: inference library, KPipeline API, misaki integration
  • HuggingFace: hexgrad/Kokoro-82M: weights, config, voice .pt files, samples
  • G2P library: hexgrad/misaki: the phonemizer Kokoro uses
  • Audio samples: SAMPLES.md: one sample per voice so you can audition before picking
  • espeak-ng: espeak-ng releases: system-level OOD fallback; worth installing even though the pipeline runs without it

Stack: kokoro 0.9.4, PyTorch 2.12 + CUDA 13.0, soundfile, ipywidgets 8.1.8, JupyterLab, uv venv, Python 3.12, RTX 4090.


Getting it running locally

The notebook must live inside the cloned Kokoro source tree so that kokoro can be installed in editable mode:

git clone https://github.com/hexgrad/kokoro.git
cd kokoro

Create the venv and install everything with uv:

uv venv --python 3.12 .venv
source .venv/bin/activate

uv pip install "kokoro>=0.9.4" soundfile jupyterlab ipykernel ipywidgets numpy

python -m ipykernel install --user --name kokoro-venv --display-name "Python (kokoro-venv)"

Install espeak-ng at the system level (Ubuntu: sudo apt-get install espeak-ng; macOS: brew install espeak-ng). The pipeline runs without it but will silently skip words it doesn’t recognize.

First run downloads the model weights (~327 MB) to ~/.cache/huggingface/. Every subsequent run loads from disk; after that first pull, the whole workflow is offline.
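To confirm the weights are cached, and to make sure later runs never touch the network, something like the following may help. dir_size_mb is a hypothetical helper. HF_HUB_OFFLINE is the huggingface_hub environment variable for offline mode, and it only makes sense after the first download has completed.

```python
import os
from pathlib import Path


def dir_size_mb(path: Path) -> float:
    """Total size of regular files under path, in MB (10^6 bytes).

    Symlinks are skipped so files that the Hugging Face cache links
    from its snapshot folders are not counted twice.
    """
    total = sum(
        f.stat().st_size
        for f in path.rglob("*")
        if f.is_file() and not f.is_symlink()
    )
    return total / 1_000_000


# Set before importing kokoro; the cache must already hold the weights.
os.environ.setdefault("HF_HUB_OFFLINE", "1")
```

Calling dir_size_mb(Path.home() / ".cache" / "huggingface") after the first run should report at least the ~327 MB of model weights, plus whatever else is already in the cache.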

Launch JupyterLab and open generate_audio.ipynb with the “Python (kokoro-venv)” kernel.


Walking through the notebook

The notebook has three sections: Imports, Configuration, and Generate Audio.

1. Imports and device detection

The imports cell loads everything and picks the compute device:

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# MPS (Apple Silicon) also supported via PYTORCH_ENABLE_MPS_FALLBACK=1

CUDA is used automatically when available; no extra flags needed on a standard Linux + NVIDIA setup.
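If you also want Apple Silicon picked up automatically, the selection can be extended along these lines. This is a sketch, not the notebook's code: pick_device is a hypothetical helper, and the try/except only exists so the snippet still runs where torch is not installed.

```python
def pick_device(cuda_available: bool, mps_available: bool = False) -> str:
    """Prefer CUDA, then Apple's MPS backend, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    DEVICE = pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
except ImportError:
    # torch is missing: fall back to CPU so the sketch still runs.
    DEVICE = pick_device(False)

print(DEVICE)
```

On MPS, PYTORCH_ENABLE_MPS_FALLBACK=1 should be set in the environment before torch is imported.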

2. Configuration

The configuration section has two cells. The first is the voice dropdown, an ipywidgets.Dropdown with all 14 English voices, default af_sky, rendered as a live widget in the notebook so you can switch voice without touching code:

voice_widget = widgets.Dropdown(
    options=[('af_sky (American Female)', 'af_sky'), ...],
    value='af_sky',
    description='Voice:',
)
display(voice_widget)

The second cell sets the text and output paths:

TEXT = "Paste or assign your text here."
OUTPUT_DIR = 'outputs'
OUTPUT_FILENAME = 'my_audio'
SPEED = 1.0

3. Generate audio

The generate cell initializes the pipeline, synthesizes in chunks with a live progress bar, stitches everything together, saves the file, and plays it inline:

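pipeline = KPipeline(lang_code=LANG_CODE, repo_id='hexgrad/Kokoro-82M')

audio_chunks = []  # one audio array per synthesized chunk
for i, (graphemes, phonemes, audio) in enumerate(pipeline(
    TEXT, voice=voice_widget.value, speed=SPEED, split_pattern=r'\n+',
)):
    if audio is None:
        continue  # nothing was synthesized for this chunk
    audio_chunks.append(audio)
    # update progress bar...

# Stitch the chunks together, save at 24 kHz, and play inline.
# base_name is the stem of OUTPUT_FILENAME (e.g. 'my_audio').
full_audio = np.concatenate(audio_chunks)
sf.write(f'{OUTPUT_DIR}/{voice_widget.value}_{base_name}.wav', full_audio, 24000)
display(Audio(data=full_audio, rate=24000))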

The progress bar uses ipywidgets.IntProgress with a bar.max = max(bar.max, i + 2) trick: the bar always leaves one slot of headroom so it never hits 100% mid-run. When synthesis finishes, it snaps to full and turns green. Nothing is printed during generation; the label widget shows chunk count, accumulated audio duration, and elapsed wall time in-place.
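The label text is simple arithmetic: accumulated samples divided by the 24 kHz sample rate gives audio seconds. Here is a minimal sketch; status_label and its exact wording are my own, not the notebook's.

```python
SAMPLE_RATE = 24_000  # Kokoro's output sample rate in Hz


def status_label(chunks: int, total_samples: int, elapsed_s: float) -> str:
    """Format chunk count, accumulated audio duration, and elapsed wall time."""
    audio_s = total_samples / SAMPLE_RATE
    return f"{chunks} chunks | {audio_s:.1f}s audio | {elapsed_s:.1f}s elapsed"


print(status_label(3, 72_000, 1.5))  # prints "3 chunks | 3.0s audio | 1.5s elapsed"
```

In a notebook, assigning this string to the label widget's value on every chunk updates the text in place instead of printing a new line.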

Output files are named <voice>_<stem>.wav. Text narrated by af_sky with output filename my_audio becomes af_sky_my_audio.wav. The voice is baked into the filename so you can compare voices without overwriting.
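The naming rule can be sketched with pathlib. output_path is a hypothetical helper, and taking the stem of the filename is my reading of <stem> above.

```python
from pathlib import Path


def output_path(output_dir: str, voice: str, filename: str) -> Path:
    """Build <output_dir>/<voice>_<stem>.wav, creating the directory if needed."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    stem = Path(filename).stem  # 'my_audio.wav' and 'my_audio' both give 'my_audio'
    return directory / f"{voice}_{stem}.wav"
```

With output_dir 'outputs', voice 'af_sky', and filename 'my_audio', this yields outputs/af_sky_my_audio.wav.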


A few design decisions worth calling out

Streaming + unknown-length progress

Calling a KPipeline returns a generator that yields one audio chunk per paragraph (the notebook splits on newlines). The chunk count is unknown before generation starts, so a progress bar with a fixed maximum can't work. The bar.max = max(bar.max, i + 2) pattern keeps the maximum one slot ahead of the current index, so the bar never reaches 100% mid-run.
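The headroom pattern is easy to check without a notebook. The sketch below uses Bar, a made-up stand-in for ipywidgets.IntProgress with only .value and .max, and records how full the bar is after each chunk.

```python
class Bar:
    """Minimal stand-in for ipywidgets.IntProgress (only .value and .max)."""

    def __init__(self):
        self.value = 0
        self.max = 1


def run_with_headroom(chunks, bar):
    """Advance the bar once per chunk while keeping max one slot ahead."""
    fractions = []
    for i, _ in enumerate(chunks):
        bar.max = max(bar.max, i + 2)  # always leave one slot of headroom
        bar.value = i + 1
        fractions.append(bar.value / bar.max)
    bar.value = bar.max  # finished: snap the bar to full
    return fractions


bar = Bar()
fractions = run_with_headroom(iter(["p1", "p2", "p3"]), bar)
print(fractions)  # three values, all below 1.0
```

Because chunks can be any iterable, including a generator of unknown length, the bar never needs the total up front.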

Voice prefix encodes lang_code

The lang_code must match the voice; otherwise the text is phonemized with the wrong language's G2P rules. The notebook derives lang_code from the first character of the voice code ('af_sky'[0] == 'a'), so you only pick the voice and the language follows. The only way to mismatch is to pass a voice code you invented.


Kokoro sits below the quality ceiling of the best commercial TTS systems. It’s small, fast, and offline after a one-time download, with voices good enough to publish and a Python API that stays out of the way. For long-form narration on a single GPU, it’s the first pipeline I’ve trusted across a whole blog.

The notebook (generate_audio.ipynb) and setup guide (NOTEBOOK_README.md) are at spate141/latent-lab/kokoro. Clone the Kokoro repo, copy the files in, follow the setup steps, and you’re one jupyter lab command away from your first narration.