The Complete Picture
I've spent the last few months going deep into neural audio codecs - not just reading papers, but actually building a speech enhancement model on top of DAC (Descript Audio Codec) at my current role. So a lot of what's written here comes from running into these things in real pipelines, not just textbook reading.
This article is going to be a detailed read. I'll walk you through the audio fundamentals you need to understand first, and then four fundamentally different approaches to encoding audio into a compact representation - RVQ-based codecs (DAC, EnCodec), FSQ-based codecs (NanoCodec), semantic-aware codecs (Mimi), and continuous VAE-based approaches (like what DiTAR uses). By the end you'll understand not just WHAT each does, but WHY they exist and WHEN you'd pick one over the other.
Let's get into it.
Audio Fundamentals (You Need This First)
Before we talk about neural codecs, you need to understand what audio actually IS as data. If you already know sampling, spectrograms, and mel filters, skip ahead. But if you're coming from NLP or vision and thinking "audio is just another modality" - slow down. Audio has properties that make it fundamentally different from text or images, and these properties directly explain WHY codecs are designed the way they are.
Audio is a 1D signal sampled over time
When you record speech at 16kHz, you're capturing 16,000 amplitude values per second. That's it. Just a long list of numbers.
Why different sample rates? The Nyquist theorem: you need to sample at at least 2× the highest frequency you care about. Human speech mostly lives below 8kHz, so 16kHz is enough for speech. Music has content up to ~20kHz, so you need 44.1kHz.
From my personal experience working with speech data for model training, I didn't feel much of a difference between 16kHz, 24kHz, and 48kHz for speech. But for music? 16kHz sounds noticeably worse. This matters for codec design - speech codecs can get away with lower sample rates than music codecs.
The time-frequency tradeoff
Raw waveform is in the time domain. But a lot of what's interesting about audio lives in the frequency domain. This is where the Short-Time Fourier Transform (STFT) comes in.
STFT takes a waveform and tells you: at each moment in time, which frequencies are present and how loud they are.
You slide a window across the audio (say 25ms wide, hopping 10ms at a time), apply the Fourier Transform to each window, and you get a spectrogram - a 2D representation where:
Raw waveform: [16000 samples] for 1 second
After STFT: [frequency_bins × time_frames] = e.g. [513 × 100]
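As a quick sanity check, here's what those shapes look like with scipy's STFT (a sketch; the 1024-sample window and 10ms hop are illustrative values chosen to match the [513 × ~100] example):

```python
import numpy as np
from scipy.signal import stft

sr = 16000                      # 16 kHz sample rate
wave = np.random.randn(sr)      # 1 second of (fake) audio: 16000 samples

# 1024-sample FFT window, 160-sample hop (10 ms at 16 kHz)
f, t, Zxx = stft(wave, fs=sr, nperseg=1024, noverlap=1024 - 160)

print(Zxx.shape)   # (513, ~101): frequency_bins x time_frames
```

Note that `Zxx` is complex: its magnitude and its phase are exactly the two components discussed next.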
Now here's the thing. The spectrogram has magnitude AND phase. Magnitude tells you how loud each frequency is. Phase tells you where in its cycle each frequency is. Magnitude is perceptually meaningful - your ears care about it. Phase? Your ears are mostly insensitive to it.
This is exactly why mel spectrograms became popular for TTS. Take the magnitude spectrogram, apply mel filterbanks (which mimic how the human ear perceives frequency - we're more sensitive to low frequencies than high), and you get a compact representation that captures what matters perceptually.
80 mel bins × 100 time frames = 8000 values to represent 1 second of audio. Way more compact than 16000 raw samples.
MFCCs - taking it one step further
Mel-Frequency Cepstral Coefficients - what you get when you take the mel spectrogram and apply a Discrete Cosine Transform (DCT) to decorrelate the mel bins. You typically keep only the first 13 coefficients.
Mel spectrogram [80 × 100] → DCT → MFCCs [13 × 100]
MFCCs capture the "shape" of the spectrum - which roughly corresponds to the shape of the vocal tract. This is why they were THE feature for speech recognition for decades before deep learning. I used MFCCs in my SyncNet work for audio-visual synchronization scoring.
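The DCT step itself is tiny. A sketch with a synthetic stand-in for the mel spectrogram (a real pipeline would compute the mel part with a library like librosa):

```python
import numpy as np
from scipy.fft import dct

# Pretend this is a log-mel spectrogram: 80 mel bins x 100 frames
mel = np.random.randn(80, 100)

# DCT along the mel axis decorrelates the bins;
# keeping only the first 13 coefficients gives the classic MFCCs
mfcc = dct(mel, axis=0, norm='ortho')[:13]

print(mfcc.shape)   # (13, 100)
```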
So why not just use mel spectrograms for everything?
This is the key question. If mel spectrograms are so compact and perceptually meaningful, why did we need neural codecs?
Two reasons:
1. Phase reconstruction problem. Mel spectrograms throw away phase. To get audio back, you need a vocoder (Griffin-Lim, WaveNet, HiFi-GAN) that "invents" plausible phase. This works okay but introduces artifacts. There's a quality ceiling you can't get past because the phase information is gone forever.
2. Not discrete. Modern language models (GPT, Llama) work with discrete tokens. Mel spectrograms are continuous. You can't do next-token prediction on continuous values using standard cross-entropy loss and softmax. You'd need regression (MSE loss), which tends to produce blurry/averaged outputs.
Neural audio codecs solve both problems: they encode audio into discrete tokens that can be decoded back to audio near-losslessly. No phase reconstruction needed. And the tokens plug directly into a language model for next-token prediction.
That's the paradigm shift. From continuous lossy representations to discrete near-lossless tokens.
Vector Quantization - the bridge between continuous and discrete
Before we get to the specific codecs, you need to understand Vector Quantization (VQ). This is the fundamental operation that converts continuous representations into discrete tokens.
Given a continuous vector, find the closest entry in a fixed codebook and return its index.
Codebook: 1024 entries, each 8-dimensional
Entry 0: [0.1, -0.3, 0.5, 0.2, -0.1, 0.4, 0.3, -0.2]
Entry 1: [0.4, 0.1, -0.2, 0.6, 0.3, -0.5, 0.1, 0.7]
...
Entry 1023: [-0.2, 0.5, 0.1, -0.4, 0.6, 0.2, -0.3, 0.1]
Input vector: [0.38, 0.12, -0.19, 0.55, 0.28, -0.48, 0.09, 0.65]
Find closest: Entry 1 (smallest Euclidean distance)
Output: index = 1
That's it. You've converted a continuous 8-dim vector into a single integer. You can transmit that integer (10 bits for 1024 entries) instead of the full vector (8 × 32 bits = 256 bits). Massive compression.
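That lookup is a one-liner. A minimal sketch of the quantize step (random codebook and toy input, matching the 1024-entry, 8-dim example above):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 8))   # 1024 entries, each 8-dim
x = rng.standard_normal(8)                  # continuous input vector

# Euclidean distance to every codebook entry, keep the nearest
dists = np.linalg.norm(codebook - x, axis=1)
index = int(np.argmin(dists))               # the discrete token
quantized = codebook[index]                 # what the decoder receives

print(index)    # a single integer in [0, 1023]
```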
The problem? One codebook with 1024 entries can't capture all the nuance of audio. You'd need millions of entries to get good quality, which defeats the purpose of compression. This is where the different codec approaches diverge - each one has a different strategy for getting high-quality reconstruction with manageable codebook sizes.
Now let's look at the four approaches.
Why Do We Even Need Audio Codecs for ML?
Before 2022, if you wanted to build a text-to-speech system, you'd generate mel spectrograms and then use a vocoder (like HiFi-GAN) to convert them back to audio. The problem? Mel spectrograms throw away phase information. You literally can't reconstruct the original audio perfectly from a mel spectrogram - the vocoder has to "guess" the phase. This works okay, but there's a ceiling on quality.
Neural audio codecs changed this completely. The idea is simple:
Take raw audio → compress it into a sequence of discrete tokens (like words in a sentence) → decompress back to audio with near-perfect reconstruction.
Now instead of generating mel spectrograms, your TTS model can generate these tokens directly. And since the codec can reconstruct audio from tokens almost perfectly, the quality ceiling goes way up.
The field shifted from "generate a lossy representation and hope the vocoder fills in the gaps" to "generate exact tokens that the codec decoder can faithfully reconstruct." That's the paradigm shift.
The question then becomes: HOW do you compress audio into tokens? And this is where things get interesting, because there are fundamentally different philosophies.
RVQ Based Codecs (DAC, EnCodec)
This is where it all started. EnCodec came from Meta in 2022, and DAC (Descript Audio Codec) followed in 2023. I've worked extensively with DAC at 16kHz with 12 codebooks for my speech enhancement work, so I'll explain this one in detail.
The WHAT
RVQ stands for Residual Vector Quantization. The "residual" part is the key insight.
Imagine you have a continuous audio feature vector - let's say it's 64-dimensional. You want to represent it using a discrete token (an index into a codebook). But one codebook can't capture all the detail. So you do this:
Original signal: [0.5, -0.3, 0.8, ...] (64-dim continuous vector)
Codebook 1 (1024 entries, each 64-dim):
Find closest entry → index 42 → codeword [0.4, -0.2, 0.7, ...]
Residual = original - codeword = [0.1, -0.1, 0.1, ...]
Codebook 2 (1024 entries, each 64-dim):
Quantize the RESIDUAL → index 107 → codeword [0.08, -0.09, 0.11, ...]
Residual = previous residual - codeword = [0.02, -0.01, -0.01, ...]
Codebook 3:
Quantize the even smaller residual → index 5
...
Keep going for 12 codebooks.
So Codebook 1 captures the coarse structure (fundamental frequency, energy). Codebook 2 captures what CB1 missed. Codebook 3 captures what CB1+CB2 missed. And so on. Each subsequent codebook handles finer and finer details.
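The residual loop is worth seeing in code. A toy sketch, not the real DAC quantizer: the codebooks here are random (with a zero entry per stage so the error can never grow, and a shrinking scale so later stages cover smaller residuals), whereas a real codec learns them during training:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_entries, n_codebooks = 64, 1024, 12

def make_codebook(scale):
    cb = rng.standard_normal((n_entries, dim)) * scale
    cb[0] = 0.0            # zero entry guarantees the residual never grows
    return cb

# Later codebooks target smaller residuals, so shrink their scale
codebooks = [make_codebook(0.5 ** k) for k in range(n_codebooks)]

x = rng.standard_normal(dim)
residual, indices, recon = x.copy(), [], np.zeros(dim)
for cb in codebooks:
    # Quantize the current residual, accumulate the chosen codeword
    idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
    indices.append(idx)
    recon += cb[idx]
    residual -= cb[idx]

print(indices)                       # 12 integers = the RVQ token stack
print(np.linalg.norm(x - recon))     # reconstruction error after 12 stages
```

The key property to notice: `recon` is the SUM of one codeword per stage, and each stage only makes sense given the stages before it.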
Now here's what this looks like for DAC specifically:
The Architecture
The encoder and decoder are purely convolutional - no transformers, no attention. Just stacked conv layers with residual blocks. The magic is all in the RVQ bottleneck.
Training
DAC trains with multiple losses simultaneously:
1. Reconstruction loss: MSE between original and reconstructed waveform + multi-scale STFT loss (compares frequency content at multiple resolutions)
2. Adversarial loss: Multi-Period Discriminator (MPD) and Multi-Scale Discriminator (MSD) - same discriminators from HiFi-GAN. The decoder has to fool these discriminators into thinking the reconstructed audio is real.
3. Commitment loss: standard VQ loss that prevents the encoder output from drifting too far from codebook entries.
The codec is trained separately from any downstream model and then frozen. This is important - when I built my speech enhancement model on top of DAC, the DAC encoder and decoder were completely frozen. Only my model's weights were trained.
And here's something I verified myself - DAC generalizes remarkably well even to languages it was never trained on. I ran the first systematic evaluation of DAC on Indian language speech (Hindi, Tamil, Telugu, Bengali, Kannada) and it achieved a mean PESQ of 4.473, STOI of 0.992, and speaker similarity of 0.987 across all five languages. Near-perfect reconstruction on languages the model never saw during training. More on this experiment later in this piece.
The Problem With RVQ
Here's the thing that bit me in practice. Because codebooks are hierarchical (CB2 depends on CB1's residual, CB3 depends on CB1+CB2's residual), you can't predict all 12 codebooks independently. You HAVE to go in order.
For my Genhancer model, this meant during inference I had to generate codebooks sequentially, each one conditioned on the ones before it. That's 12 forward passes through the model. Or you use tricks like delay patterns (MusicGen style) or MaskGIT-style parallel decoding, but the fundamental dependency is still there.
Also - codebook collapse. Sometimes during training, certain codebook entries never get used. The codebook "collapses" to only using a subset of its entries. This wastes capacity. DAC and EnCodec both deal with this using techniques like EMA updates and codebook resets, but it's a real problem.
FSQ Based Codecs (NanoCodec / LFSC from Koel-TTS)
This is the newer approach from NVIDIA, released in 2024. I read the Koel-TTS paper in detail for my Sarvam interview prep, and honestly this is elegant.
The Key Insight
Instead of learning codebook entries (which can collapse), just... don't learn them. Round each dimension to a fixed set of levels.
FSQ: Each dimension gets fixed levels. No learning needed.
Dimension 1: round to one of [8 levels]
Dimension 2: round to one of [7 levels]
Dimension 3: round to one of [6 levels]
Dimension 4: round to one of [6 levels]
Total combinations: 8 × 7 × 6 × 6 = 2016 unique codes per codebook
No codebook to learn. No collapse possible. The quantization is deterministic.
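FSQ really is just rounding. A sketch using the [8, 7, 6, 6] levels from above (this assumes the encoder has already squashed each dimension into [-1, 1], e.g. with a tanh):

```python
import numpy as np

levels = np.array([8, 7, 6, 6])     # fixed number of levels per dimension

def fsq_quantize(z):
    """Map each dim of z (in [-1, 1]) to one of its fixed levels."""
    idx = np.round((z + 1) / 2 * (levels - 1)).astype(int)
    # Combine the per-dim indices into one code (mixed-radix number)
    code = 0
    for i, l in zip(idx, levels):
        code = code * l + i
    return idx, code

idx, code = fsq_quantize(np.array([0.0, 1.0, -1.0, 0.5]))
print(idx)    # per-dimension indices: [4 6 0 4]
print(code)   # one of 8*7*6*6 = 2016 unique codes
```

No distance search, no learned table - the "codebook" is implicit in the rounding grid, which is why collapse is impossible.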
Why This Changes Everything for TTS
The MASSIVE difference: FSQ codebooks are INDEPENDENT. CB1 and CB2 and CB3 all quantize independent dimensions of the same feature vector. There is no residual relationship.
So what does this mean? At each decoder timestep, your TTS model can predict all 8 codebooks IN PARALLEL with a single linear layer:
One forward pass. 8 indices. Done.
Compare this to RVQ where you need sequential prediction across codebooks. For production TTS with streaming requirements, this parallel prediction is a massive win.
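A sketch of what that single-pass prediction head looks like (toy numpy with random weights; a real model would use a trained linear layer on the decoder's hidden state, and the 8 × 2016 sizes follow the FSQ example above):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_codebooks, codebook_size = 64, 8, 2016

h = rng.standard_normal(hidden_dim)   # decoder state at one timestep
W = rng.standard_normal((hidden_dim, n_codebooks * codebook_size))

# One matmul -> logits for all 8 codebooks at once
logits = (h @ W).reshape(n_codebooks, codebook_size)
indices = logits.argmax(axis=-1)      # 8 independent argmaxes

print(indices.shape)   # (8,) - every codebook predicted in parallel
```

Contrast with RVQ, where codebook k's prediction must wait for codebooks 1..k-1.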
NanoCodec Specifics
NVIDIA released this on HuggingFace (nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps). It's open source and you can pip install it.
Now look at that frame rate. 21.5 Hz. Compare to DAC at 86 Hz. That's 4x fewer frames per second.
Why does this matter? If you're doing autoregressive TTS, you generate one frame at a time. At 86 Hz (DAC), 10 seconds of audio = 860 autoregressive steps. At 21.5 Hz (NanoCodec), 10 seconds = 215 steps. 4x fewer steps = 4x faster generation. For streaming TTS where you need sub-250ms latency, this is the difference between "works" and "doesn't work."
The architecture is similar to DAC - convolutional encoder, convolutional decoder (HiFi-GAN based), same discriminators (MPD + multi-band multi-scale STFT + WavLM discriminator). The only difference is FSQ instead of RVQ in the bottleneck, and the architecture is designed for lower frame rates.
One thing I should note - when I ran my Indic codec evaluation, I couldn't include NanoCodec due to protobuf/onnx dependency conflicts. This is a gap worth filling since NanoCodec is the codec behind Koel-TTS and MagpieTTS, and its FSQ architecture is fundamentally different from the RVQ codecs I was able to test.
Mimi (Semantic Distillation + RVQ)
Mimi came from Kyutai in 2024, and it's the codec that powers Moshi - the first real-time full-duplex voice conversation model. Mimi is special because it solved a problem that had been annoying the field for years.
The Problem Mimi Solved
Before Mimi, if you wanted to build a speech language model (like AudioLM or VALL-E), you needed TWO separate token streams: semantic tokens from a self-supervised model (think HuBERT or w2v-BERT) for linguistic content, and acoustic tokens from a neural codec for audio detail.
Two encoders. Two token streams. Complex merging strategies. AudioLM had a whole hierarchical pipeline for this.
Mimi's insight: what if we just distill the semantic information INTO the codec during training?
How Semantic Distillation Works
During training, Mimi does something extra alongside the normal reconstruction loss:
Same audio → frozen WavLM encoder → semantic features [T, D]
Same audio → Mimi encoder → Mimi features → after RVQ codebook 1 → CB1 output [T, D]
Extra loss: cosine_similarity(CB1_output, WavLM_features)
This forces codebook 1 to learn representations that are similar to WavLM's semantic features. After training:
One codec. One encoder. One token stream. Both semantic and acoustic information.
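The distillation term is just a cosine similarity between two feature sequences. A numpy sketch with random stand-ins for the WavLM and codebook-1 features (the real loss compares actual model outputs, possibly after a projection):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 100, 512

teacher = rng.standard_normal((T, D))   # frozen WavLM features
student = rng.standard_normal((T, D))   # Mimi codebook-1 output

def distill_loss(a, b, eps=1e-8):
    """1 - mean per-frame cosine similarity (0 when a == b)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(a * b, axis=-1)))

print(distill_loss(teacher, teacher))   # ~0: identical features
print(distill_loss(teacher, student))   # ~1: unrelated random features
```

Minimizing this pulls codebook 1's outputs toward WavLM's semantic space while the other codebooks stay free to capture acoustics.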
The Transformer Bottleneck
Another thing that makes Mimi different from DAC/EnCodec - it has transformers inside the codec.
DAC and EnCodec are purely convolutional. Mimi adds transformer layers before and after the quantization step. These capture long-range temporal dependencies that convolutions miss.
And critically - these transformers are CAUSAL. Each frame only attends to current and past frames, not future. This makes Mimi fully streamable with 80ms latency. You can process audio chunk by chunk in real-time.
Mimi By The Numbers
12.5 Hz frame rate at 1.1 kbps. For context, EnCodec operates at 75 Hz and 6 kbps. Mimi is 6x more compressed in frame rate AND 6x more compressed in bitrate. And it still sounds good because the transformer layers compensate for the aggressive compression.
This is what enabled Moshi to do full-duplex conversation at 200ms latency. At 12.5 Hz with 8 codebooks, one second of audio is just 12.5 × 8 = 100 tokens. Two audio streams (user + assistant) = 200 tokens per second. That's manageable for a 7B transformer. At EnCodec's 75 Hz, it would be 1200 tokens per second - way too many.
The Indic Language Problem
Now here's something I've been thinking about. Mimi distills from WavLM, which is trained on English data. WavLM's representations encode English phonetic categories.
What happens when you encode Hindi speech through Mimi? Codebook 1 tries to capture semantics using English-trained representations. Hindi has retroflex consonants (ट, ड) vs dental consonants (त, द) that English doesn't distinguish. These might map to the same semantic token in codebook 1, destroying information that can never be recovered.
This isn't theoretical - the DualCodec paper (Interspeech 2025) showed that SpeechTokenizer (which uses HuBERT distillation, the same idea as Mimi) gets 83.2% WER on Chinese. Catastrophic failure. Because HuBERT can't capture Mandarin tonal distinctions.
The fix? Use a multilingual teacher model like mHuBERT-147 (trained on 147 languages) instead of English-only WavLM. Or use IndicWav2Vec for Indian languages specifically.
Continuous VAE (DiTAR Style)
Now we get to the approach that says - why quantize at all?
DiTAR (ByteDance, 2025) uses a VAE instead of a discrete codec. No codebooks. No quantization. No information loss from discretization.
The Architecture
24kHz waveform → Conv Encoder → μ, σ (mean and variance)
→ z = μ + σ × ε (reparameterization trick)
→ continuous latent vectors [400 frames, 64-dim]
→ BigVGAN Decoder → reconstructed 24kHz waveform
The latent space is Gaussian - each frame is a 64-dimensional continuous vector. At ~40 Hz frame rate, 10 seconds of audio gives you 400 frames of 64-dimensional vectors.
Training losses are the standard VAE combination: waveform reconstruction plus a KL regularization term that keeps the latent distribution close to a Gaussian prior. The VAE is trained as Phase 1, then frozen. The downstream TTS model (DiTAR's language model + LocDiT) operates on these continuous vectors.
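The reparameterization trick and the KL term fit in a few lines. A toy numpy sketch with shapes matching the 400-frame, 64-dim example above (random stand-ins for the encoder's outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 400, 64                                # ~10 s at ~40 Hz, 64-dim latents

mu = rng.standard_normal((T, D)) * 0.1        # encoder mean
logvar = rng.standard_normal((T, D)) * 0.1    # encoder log-variance

# z = mu + sigma * eps: sampling stays differentiable w.r.t. mu and sigma
eps = rng.standard_normal((T, D))
z = mu + np.exp(0.5 * logvar) * eps

# KL(q(z|x) || N(0, I)), summed over dims, averaged over frames
kl = -0.5 * np.mean(np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1))

print(z.shape)    # (400, 64) continuous latents - no quantization anywhere
print(kl >= 0)    # KL is always non-negative
```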
Why Go Continuous?
The argument is simple: quantization is lossy. Every time you round a continuous value to a discrete codebook entry, you lose information. This "quantization error" accumulates across codebooks and limits reconstruction quality.
With a VAE, there's no quantization step. The latent vectors are continuous. The decoder gets exactly the right information to reconstruct the audio.
The downside? You can't use standard next-token prediction (softmax over a vocabulary) because there is no vocabulary. You need a different generation approach. DiTAR uses flow matching - a diffusion-like process that generates continuous vectors from noise. This requires multiple denoising steps (10 Euler steps in DiTAR) per frame, which is slower than discrete token prediction.
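To make the Euler-step idea concrete, here's a toy sketch. Under a straight-line path from noise to data, the ideal velocity field is (x1 - x)/(1 - t), and 10 Euler steps walk noise exactly onto the target. This closed form is purely illustrative - in DiTAR the velocity is a learned network conditioned on context, not something you can write down:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64

x1 = rng.standard_normal(D)          # the "data" latent we want to generate
x = rng.standard_normal(D)           # start from pure noise

def velocity(x, t):
    # Ideal velocity field for the straight-line path toward x1.
    # A real model replaces this with a trained v_theta(x, t, conditioning).
    return (x1 - x) / (1.0 - t)

# 10 Euler steps from t=0 toward t=1 (as in DiTAR)
n_steps = 10
dt = 1.0 / n_steps
t = 0.0
for _ in range(n_steps):
    x = x + dt * velocity(x, t)
    t += dt

print(np.allclose(x, x1))   # True: with the ideal field, Euler lands on x1
```

With a learned, imperfect field, more steps generally mean better samples - hence the speed/quality knob that discrete token prediction doesn't have.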
Continuous vs Discrete - The Tradeoff
The field is genuinely split right now. Koel-TTS (NVIDIA, EMNLP 2025) and likely Sarvam's Bulbul V3 use discrete tokens. DiTAR (ByteDance) and Pocket TTS/CALM (Kyutai) use continuous latents. Both camps claim superior results.
My personal take: for production streaming TTS with low latency requirements, discrete (especially FSQ) wins today. For highest-quality offline generation where latency doesn't matter, continuous wins. The gap is closing though - Pocket TTS showed you can do continuous generation in just 1 step using consistency models, running faster than real-time on CPU.
The Indic Benchmark: Testing These Codecs on Indian Languages
So remember I said nobody had evaluated these codecs on non-English speech? I decided to actually do it. I ran the first systematic evaluation of neural audio codecs on Indian language speech data - five languages, three codecs, five metrics.
The Setup
Five metrics: PESQ (perceptual quality), STOI (intelligibility), speaker similarity via ECAPA-TDNN embeddings, WER via language-specific Whisper models, and exact ASR transcript match rate. The idea was to test every angle - does the audio sound good? Is it still intelligible? Does it preserve speaker identity? Can downstream ASR still understand it?
Hindi source data was 48kHz studio quality. Tamil, Telugu, Bengali, and Kannada were 16kHz. This is worth flagging as a confounding variable - Hindi's higher scores might partially reflect better source quality, not just phonological similarity to English.
PESQ: Perceptual Quality
DAC at 4.473 mean PESQ is near-perfect (the scale tops out at 4.5). EnCodec at 2.797 is decent but a noticeable degradation. SNAC at 1.921 is genuinely poor - Bengali drops to 1.645. That's bad enough that you'd hear the artifacts clearly.
STOI: Intelligibility
DAC at 0.992 - nearly identical intelligibility. SNAC loses 17% intelligibility on Bengali (0.752). For a codec that's supposed to reconstruct audio faithfully, losing that much intelligibility is a serious problem.
Speaker Similarity: Voice Preservation
This one really tells the story. DAC preserves speaker identity almost perfectly at 0.987. EnCodec is acceptable at 0.902. But SNAC? 0.720 mean, and Bengali at 0.632 - that's losing nearly 37% of speaker identity. If you're building a voice cloning or speaker-conditioned TTS pipeline for Indian languages and using SNAC, the voice that comes out the other end won't sound like the person you put in. That's a dealbreaker.
WER: ASR Degradation
I evaluated WER using language-specific fine-tuned Whisper models. I could only get reliable results for Hindi and Tamil - the Telugu, Bengali, and Kannada ASR models produced hallucinated text, infinite repetition loops, or outputs in the wrong script entirely. Rather than report misleading numbers, I'm reporting WER only where ASR was reliable.
DAC introduces only 17.8% WER on Hindi - minimal degradation. SNAC on Tamil? 82.9% WER. The reconstructed audio is nearly completely unintelligible to downstream ASR. Tamil is consistently 2-2.4x harder than Hindi across all codecs, likely reflecting Tamil's agglutinative morphology and less representation in codec training data.
What This Tells Us
The codec ranking is unambiguous: DAC >> EnCodec >> SNAC for Indian languages. This holds across all five languages and all metrics.
DAC generalizes remarkably well to Indian languages despite being trained primarily on English. SNAC is not recommended for Indian language applications in its current form - it introduces significant artifacts that degrade both perceived quality and downstream ASR performance.
Hindi consistently scores highest across all codecs. This may reflect that Hindi's phonological features are closer to English (the dominant training language), or that the Hindi dataset had higher recording quality (48kHz studio vs 16kHz for others). Bengali and Telugu consistently score lowest - their distinct phonological features (retroflex consonants, aspiration patterns) appear to be less well-captured by English-trained codecs.
A few important caveats: this is 10 samples per language, not hundreds. PESQ was designed for English speech - the absolute numbers might not perfectly reflect what a native speaker perceives. And I couldn't test NanoCodec (dependency issues) or Mimi (not packaged for easy evaluation) - both of which would be very interesting to benchmark given their architectural differences.
But the signal is clear enough. If you're building speech AI for Indian languages today and you need to pick a codec - use DAC. Not SNAC. Not EnCodec. DAC.
The Big Picture
Let's put it all together. The trend across these four approaches is clear: lower frame rates, lower bitrates, and representations designed to plug straight into language models.
And as I showed in the Indic experiment above - evaluating these codecs on non-English speech reveals dramatic differences. DAC generalizes remarkably well (PESQ 4.473 across five Indian languages), while SNAC nearly collapses (82.9% WER on Tamil). The codec you pick matters enormously, and evaluating only on English hides these failure modes entirely.
The open questions that remain: do these codecs faithfully reconstruct retroflex consonants? Aspirated sounds? Code-switched Hinglish? These need minimal pair evaluations that go beyond aggregate metrics. And we need to test NanoCodec, Mimi, and the newer codecs on Indic data too. This evaluation was a starting point - 10 samples per language, three codecs. The full picture needs 22 languages, phoneme-class-specific analysis, and subjective listening tests with native speakers. But the signal is already clear enough to make production decisions today.
If you found this useful, feel free to ping me to ask anything related here:
Thanks to Claude for helping me organize my thoughts on this. All the understanding comes from actually working with these systems - the writing just needed a push.