The Complete Picture
I've spent the last few months going deep into neural audio codecs - not just reading papers, but actually building a speech enhancement model on top of DAC (Descript Audio Codec) at my current role. So a lot of what's written here comes from running into these things in real pipelines, not just textbook reading.
This article is going to be a detailed read. I'll walk you through the audio fundamentals you need to understand first, and then four fundamentally different approaches to encoding audio into a compact representation - RVQ-based codecs (DAC, EnCodec), FSQ-based codecs (NanoCodec), semantic-aware codecs (Mimi), and continuous VAE-based approaches (like what DiTAR uses). By the end you'll understand not just WHAT each does, but WHY they exist and WHEN you'd pick one over the other.
Let's get into it.
Audio Fundamentals (You Need This First)
Before we talk about neural codecs, you need to understand what audio actually IS as data. If you already know sampling, spectrograms, and mel filters, skip ahead. But if you're coming from NLP or vision and thinking "audio is just another modality" - slow down. Audio has properties that make it fundamentally different from text or images, and these properties directly explain WHY codecs are designed the way they are.
Audio is a 1D signal sampled over time
When you record speech at 16kHz, you're capturing 16,000 amplitude values per second. That's it. Just a long list of numbers.
Why different sample rates? The Nyquist theorem: you need to sample at at least 2× the highest frequency you care about. Human speech mostly lives below 8kHz, so 16kHz is enough for speech. Music has content up to ~20kHz, so you need 44.1kHz.
From my personal experience working with speech data for model training, I didn't feel much of a difference between 16kHz, 24kHz, and 48kHz for speech. But for music? 16kHz sounds noticeably worse. This matters for codec design - speech codecs can get away with lower sample rates than music codecs.
The time-frequency tradeoff
Raw waveform is in the time domain. But a lot of what's interesting about audio lives in the frequency domain. This is where the Short-Time Fourier Transform (STFT) comes in.
STFT takes a waveform and tells you: at each moment in time, which frequencies are present and how loud they are.
You slide a window across the audio (say 25ms wide, hopping 10ms at a time), apply the Fourier Transform to each window, and you get a spectrogram - a 2D representation where:
Raw waveform: [16000 samples] for 1 second
After STFT: [frequency_bins × time_frames] = e.g. [513 × 100]
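As a quick sanity check, here's what those shapes look like with scipy's STFT (a sketch; the 1024-sample window and 10ms hop are illustrative values chosen to match the [513 × ~100] example):

```python
import numpy as np
from scipy.signal import stft

sr = 16000                      # 16 kHz sample rate
wave = np.random.randn(sr)      # 1 second of (fake) audio: 16000 samples

# 1024-sample FFT window, 160-sample hop (10 ms at 16 kHz)
f, t, Zxx = stft(wave, fs=sr, nperseg=1024, noverlap=1024 - 160)

print(Zxx.shape)   # (513, ~101): frequency_bins x time_frames
```

Note that `Zxx` is complex: its magnitude and its phase are exactly the two components discussed next.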
Now here's the thing. The spectrogram has magnitude AND phase. Magnitude tells you how loud each frequency is. Phase tells you where in its cycle each frequency is. Magnitude is perceptually meaningful - your ears care about it. Phase? Your ears are mostly insensitive to it.
This is exactly why mel spectrograms became popular for TTS. Take the magnitude spectrogram, apply mel filterbanks (which mimic how the human ear perceives frequency - we're more sensitive to low frequencies than high), and you get a compact representation that captures what matters perceptually.
80 mel bins × 100 time frames = 8000 values to represent 1 second of audio. Way more compact than 16000 raw samples.
MFCCs - taking it one step further
Mel-Frequency Cepstral Coefficients - what you get when you take the mel spectrogram and apply a Discrete Cosine Transform (DCT) to decorrelate the mel bins. You typically keep only the first 13 coefficients.
Mel spectrogram [80 × 100] → DCT → MFCCs [13 × 100]
MFCCs capture the "shape" of the spectrum - which roughly corresponds to the shape of the vocal tract. This is why they were THE feature for speech recognition for decades before deep learning. I used MFCCs in my SyncNet work for audio-visual synchronization scoring.
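The DCT step itself is tiny. A sketch with a synthetic stand-in for the mel spectrogram (a real pipeline would compute the mel part with a library like librosa):

```python
import numpy as np
from scipy.fft import dct

# Pretend this is a log-mel spectrogram: 80 mel bins x 100 frames
mel = np.random.randn(80, 100)

# DCT along the mel axis decorrelates the bins;
# keeping only the first 13 coefficients gives the classic MFCCs
mfcc = dct(mel, axis=0, norm='ortho')[:13]

print(mfcc.shape)   # (13, 100)
```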
So why not just use mel spectrograms for everything?
This is the key question. If mel spectrograms are so compact and perceptually meaningful, why did we need neural codecs?
Two reasons:
1. Phase reconstruction problem. Mel spectrograms throw away phase. To get audio back, you need a vocoder (Griffin-Lim, WaveNet, HiFi-GAN) that "invents" plausible phase. This works okay but introduces artifacts. There's a quality ceiling you can't get past because the phase information is gone forever.
2. Not discrete. Modern language models (GPT, Llama) work with discrete tokens. Mel spectrograms are continuous. You can't do next-token prediction on continuous values using standard cross-entropy loss and softmax. You'd need regression (MSE loss), which tends to produce blurry/averaged outputs.
Neural audio codecs solve both problems: they encode audio into discrete tokens that can be decoded back to audio near-losslessly. No phase reconstruction needed. And the tokens plug directly into a language model for next-token prediction.
That's the paradigm shift. From continuous lossy representations to discrete near-lossless tokens.
Vector Quantization - the bridge between continuous and discrete
Before we get to the specific codecs, you need to understand Vector Quantization (VQ). This is the fundamental operation that converts continuous representations into discrete tokens.
Given a continuous vector, find the closest entry in a fixed codebook and return its index.
Codebook: 1024 entries, each 8-dimensional
Entry 0: [0.1, -0.3, 0.5, 0.2, -0.1, 0.4, 0.3, -0.2]
Entry 1: [0.4, 0.1, -0.2, 0.6, 0.3, -0.5, 0.1, 0.7]
...
Entry 1023: [-0.2, 0.5, 0.1, -0.4, 0.6, 0.2, -0.3, 0.1]
Input vector: [0.38, 0.12, -0.19, 0.55, 0.28, -0.48, 0.09, 0.65]
Find closest: Entry 1 (smallest Euclidean distance)
Output: index = 1
That's it. You've converted a continuous 8-dim vector into a single integer. You can transmit that integer (10 bits for 1024 entries) instead of the full vector (8 × 32 bits = 256 bits). Massive compression.
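That lookup is a one-liner. A minimal sketch of the quantize step (random codebook and toy input, matching the 1024-entry, 8-dim example above):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 8))   # 1024 entries, each 8-dim
x = rng.standard_normal(8)                  # continuous input vector

# Euclidean distance to every codebook entry, keep the nearest
dists = np.linalg.norm(codebook - x, axis=1)
index = int(np.argmin(dists))               # the discrete token
quantized = codebook[index]                 # what the decoder receives

print(index)    # a single integer in [0, 1023]
```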
The problem? One codebook with 1024 entries can't capture all the nuance of audio. You'd need millions of entries to get good quality, which defeats the purpose of compression. This is where the different codec approaches diverge - each one has a different strategy for getting high-quality reconstruction with manageable codebook sizes.
Now let's look at the four approaches.
Why Do We Even Need Audio Codecs for ML?
Before 2022, if you wanted to build a text-to-speech system, you'd generate mel spectrograms and then use a vocoder (like HiFi-GAN) to convert them back to audio. The problem? Mel spectrograms throw away phase information. You literally can't reconstruct the original audio perfectly from a mel spectrogram - the vocoder has to "guess" the phase. This works okay, but there's a ceiling on quality.
Neural audio codecs changed this completely. The idea is simple:
Take raw audio → compress it into a sequence of discrete tokens (like words in a sentence) → decompress back to audio with near-perfect reconstruction.
Now instead of generating mel spectrograms, your TTS model can generate these tokens directly. And since the codec can reconstruct audio from tokens almost perfectly, the quality ceiling goes way up.
The field shifted from "generate a lossy representation and hope the vocoder fills in the gaps" to "generate exact tokens that the codec decoder can faithfully reconstruct." That's the paradigm shift.
The question then becomes: HOW do you compress audio into tokens? And this is where things get interesting, because there are fundamentally different philosophies.
RVQ Based Codecs (DAC, EnCodec)
This is where it all started. EnCodec came from Meta in 2022, and DAC (Descript Audio Codec) followed in 2023. I've worked extensively with DAC at 16kHz with 12 codebooks for my speech enhancement work, so I'll explain this one in detail.
The WHAT
RVQ stands for Residual Vector Quantization. The "residual" part is the key insight.
Imagine you have a continuous audio feature vector - let's say it's 64-dimensional. You want to represent it using a discrete token (an index into a codebook). But one codebook can't capture all the detail. So you do this:
Original signal: [0.5, -0.3, 0.8, ...] (64-dim continuous vector)
Codebook 1 (1024 entries, each 64-dim):
Find closest entry → index 42 → codeword [0.4, -0.2, 0.7, ...]
Residual = original - codeword = [0.1, -0.1, 0.1, ...]
Codebook 2 (1024 entries, each 64-dim):
Quantize the RESIDUAL → index 107 → codeword [0.08, -0.09, 0.11, ...]
Residual = previous residual - codeword = [0.02, -0.01, -0.01, ...]
Codebook 3:
Quantize the even smaller residual → index 5
...
Keep going for 12 codebooks.
So Codebook 1 captures the coarse structure (fundamental frequency, energy). Codebook 2 captures what CB1 missed. Codebook 3 captures what CB1+CB2 missed. And so on. Each subsequent codebook handles finer and finer details.
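The residual loop is worth seeing in code. A toy sketch, not the real DAC quantizer: the codebooks here are random (with a zero entry per stage so the error can never grow, and a shrinking scale so later stages cover smaller residuals), whereas a real codec learns them during training:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_entries, n_codebooks = 64, 1024, 12

def make_codebook(scale):
    cb = rng.standard_normal((n_entries, dim)) * scale
    cb[0] = 0.0            # zero entry guarantees the residual never grows
    return cb

# Later codebooks target smaller residuals, so shrink their scale
codebooks = [make_codebook(0.5 ** k) for k in range(n_codebooks)]

x = rng.standard_normal(dim)
residual, indices, recon = x.copy(), [], np.zeros(dim)
for cb in codebooks:
    # Quantize the current residual, accumulate the chosen codeword
    idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
    indices.append(idx)
    recon += cb[idx]
    residual -= cb[idx]

print(indices)                       # 12 integers = the RVQ token stack
print(np.linalg.norm(x - recon))     # reconstruction error after 12 stages
```

The key property to notice: `recon` is the SUM of one codeword per stage, and each stage only makes sense given the stages before it.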
Now here's what this looks like for DAC specifically:
The Architecture
The encoder and decoder are purely convolutional - no transformers, no attention. Just stacked conv layers with residual blocks. The magic is all in the RVQ bottleneck.
Training
DAC trains with multiple losses simultaneously:
1. Reconstruction loss: MSE between original and reconstructed waveform + multi-scale STFT loss (compares frequency content at multiple resolutions)
2. Adversarial loss: Multi-Period Discriminator (MPD) and Multi-Scale Discriminator (MSD) - same discriminators from HiFi-GAN. The decoder has to fool these discriminators into thinking the reconstructed audio is real.
3. Commitment loss: standard VQ loss that prevents the encoder output from drifting too far from codebook entries.
The codec is trained separately from any downstream model and then frozen. This is important - when I built my speech enhancement model on top of DAC, the DAC encoder and decoder were completely frozen. Only my model's weights were trained.
And here's something I verified myself - DAC generalizes remarkably well even to languages it was never trained on. I ran the first systematic evaluation of DAC on Indian language speech (Hindi, Tamil, Telugu, Bengali, Kannada) and it achieved a mean PESQ of 4.473, STOI of 0.992, and speaker similarity of 0.987 across all five languages. Near-perfect reconstruction on languages the model never saw during training. More on this experiment later in this piece.
The Problem With RVQ
Here's the thing that bit me in practice. Because codebooks are hierarchical (CB2 depends on CB1's residual, CB3 depends on CB1+CB2's residual), you can't predict all 12 codebooks independently. You HAVE to go in order.
For my Genhancer model, this meant during inference I had to generate codebooks sequentially, each one conditioned on the ones before it. That's 12 forward passes through the model. Or you use tricks like delay patterns (MusicGen style) or MaskGIT-style parallel decoding, but the fundamental dependency is still there.
Also - codebook collapse. Sometimes during training, certain codebook entries never get used. The codebook "collapses" to only using a subset of its entries. This wastes capacity. DAC and EnCodec both deal with this using techniques like EMA updates and codebook resets, but it's a real problem.
FSQ Based Codecs (NanoCodec / LFSC from Koel-TTS)
This is the newer approach from NVIDIA, released in 2024. I read the Koel-TTS paper in detail for my Sarvam interview prep, and honestly this is elegant.
The Key Insight
Instead of learning codebook entries (which can collapse), just... don't learn them. Round each dimension to a fixed set of levels.
FSQ: Each dimension gets fixed levels. No learning needed.
Dimension 1: round to one of [8 levels]
Dimension 2: round to one of [7 levels]
Dimension 3: round to one of [6 levels]
Dimension 4: round to one of [6 levels]
Total combinations: 8 × 7 × 6 × 6 = 2016 unique codes per codebook
No codebook to learn. No collapse possible. The quantization is deterministic.
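FSQ really is just rounding. A sketch using the [8, 7, 6, 6] levels from above (this assumes the encoder has already squashed each dimension into [-1, 1], e.g. with a tanh):

```python
import numpy as np

levels = np.array([8, 7, 6, 6])     # fixed number of levels per dimension

def fsq_quantize(z):
    """Map each dim of z (in [-1, 1]) to one of its fixed levels."""
    idx = np.round((z + 1) / 2 * (levels - 1)).astype(int)
    # Combine the per-dim indices into one code (mixed-radix number)
    code = 0
    for i, l in zip(idx, levels):
        code = code * l + i
    return idx, code

idx, code = fsq_quantize(np.array([0.0, 1.0, -1.0, 0.5]))
print(idx)    # per-dimension indices: [4 6 0 4]
print(code)   # one of 8*7*6*6 = 2016 unique codes
```

No distance search, no learned table - the "codebook" is implicit in the rounding grid, which is why collapse is impossible.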
Why This Changes Everything for TTS
The MASSIVE difference: FSQ codebooks are INDEPENDENT. CB1 and CB2 and CB3 all quantize independent dimensions of the same feature vector. There is no residual relationship.
So what does this mean? At each decoder timestep, your TTS model can predict all 8 codebooks IN PARALLEL with a single linear layer:
One forward pass. 8 indices. Done.
Compare this to RVQ where you need sequential prediction across codebooks. For production TTS with streaming requirements, this parallel prediction is a massive win.
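A sketch of what that single-pass prediction head looks like (toy numpy with random weights; a real model would use a trained linear layer on the decoder's hidden state, and the 8 × 2016 sizes follow the FSQ example above):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_codebooks, codebook_size = 64, 8, 2016

h = rng.standard_normal(hidden_dim)   # decoder state at one timestep
W = rng.standard_normal((hidden_dim, n_codebooks * codebook_size))

# One matmul -> logits for all 8 codebooks at once
logits = (h @ W).reshape(n_codebooks, codebook_size)
indices = logits.argmax(axis=-1)      # 8 independent argmaxes

print(indices.shape)   # (8,) - every codebook predicted in parallel
```

Contrast with RVQ, where codebook k's prediction must wait for codebooks 1..k-1.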
NanoCodec Specifics
NVIDIA released this on HuggingFace (nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps). It's open source and you can pip install it.
Now look at that frame rate. 21.5 Hz. Compare to DAC at 86 Hz. That's 4x fewer frames per second.
Why does this matter? If you're doing autoregressive TTS, you generate one frame at a time. At 86 Hz (DAC), 10 seconds of audio = 860 autoregressive steps. At 21.5 Hz (NanoCodec), 10 seconds = 215 steps. 4x fewer steps = 4x faster generation. For streaming TTS where you need sub-250ms latency, this is the difference between "works" and "doesn't work."
The architecture is similar to DAC - convolutional encoder, convolutional decoder (HiFi-GAN based), same discriminators (MPD + multi-band multi-scale STFT + WavLM discriminator). The only difference is FSQ instead of RVQ in the bottleneck, and the architecture is designed for lower frame rates.
One thing I should note - when I ran my Indic codec evaluation, I couldn't include NanoCodec due to protobuf/onnx dependency conflicts. This is a gap worth filling since NanoCodec is the codec behind Koel-TTS and MagpieTTS, and its FSQ architecture is fundamentally different from the RVQ codecs I was able to test.
Mimi (Semantic Distillation + RVQ)
Mimi came from Kyutai in 2024, and it's the codec that powers Moshi - the first real-time full-duplex voice conversation model. Mimi is special because it solved a problem that had been annoying the field for years.
The Problem Mimi Solved
Before Mimi, if you wanted to build a speech language model (like AudioLM or VALL-E), you needed TWO separate token streams: semantic tokens from a self-supervised model (think HuBERT or w2v-BERT) for linguistic content, and acoustic tokens from a neural codec for audio detail.
Two encoders. Two token streams. Complex merging strategies. AudioLM had a whole hierarchical pipeline for this.
Mimi's insight: what if we just distill the semantic information INTO the codec during training?
How Semantic Distillation Works
During training, Mimi does something extra alongside the normal reconstruction loss:
Same audio → frozen WavLM encoder → semantic features [T, D]
Same audio → Mimi encoder → Mimi features → after RVQ codebook 1 → CB1 output [T, D]
Extra loss: cosine_similarity(CB1_output, WavLM_features)
This forces codebook 1 to learn representations that are similar to WavLM's semantic features. After training:
One codec. One encoder. One token stream. Both semantic and acoustic information.
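The distillation term is just a cosine similarity between two feature sequences. A numpy sketch with random stand-ins for the WavLM and codebook-1 features (the real loss compares actual model outputs, possibly after a projection):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 100, 512

teacher = rng.standard_normal((T, D))   # frozen WavLM features
student = rng.standard_normal((T, D))   # Mimi codebook-1 output

def distill_loss(a, b, eps=1e-8):
    """1 - mean per-frame cosine similarity (0 when a == b)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(a * b, axis=-1)))

print(distill_loss(teacher, teacher))   # ~0: identical features
print(distill_loss(teacher, student))   # ~1: unrelated random features
```

Minimizing this pulls codebook 1's outputs toward WavLM's semantic space while the other codebooks stay free to capture acoustics.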
The Transformer Bottleneck
Another thing that makes Mimi different from DAC/EnCodec - it has transformers inside the codec.
DAC and EnCodec are purely convolutional. Mimi adds transformer layers before and after the quantization step. These capture long-range temporal dependencies that convolutions miss.
And critically - these transformers are CAUSAL. Each frame only attends to current and past frames, not future. This makes Mimi fully streamable with 80ms latency. You can process audio chunk by chunk in real-time.
Mimi By The Numbers
12.5 Hz frame rate at 1.1 kbps. For context, EnCodec operates at 75 Hz and 6 kbps. Mimi is 6x more compressed in frame rate AND 6x more compressed in bitrate. And it still sounds good because the transformer layers compensate for the aggressive compression.
This is what enabled Moshi to do full-duplex conversation at 200ms latency. At 12.5 Hz with 8 codebooks, one second of audio is just 12.5 × 8 = 100 tokens. Two audio streams (user + assistant) = 200 tokens per second. That's manageable for a 7B transformer. At EnCodec's 75 Hz, it would be 1200 tokens per second - way too many.
The Indic Language Problem
Now here's something I've been thinking about. Mimi distills from WavLM, which is trained on English data. WavLM's representations encode English phonetic categories.
What happens when you encode Hindi speech through Mimi? Codebook 1 tries to capture semantics using English-trained representations. Hindi has retroflex consonants (ट, ड) vs dental consonants (त, द) that English doesn't distinguish. These might map to the same semantic token in codebook 1, destroying information that can never be recovered.
This isn't theoretical - the DualCodec paper (Interspeech 2025) showed that SpeechTokenizer (which uses HuBERT distillation, the same idea as Mimi) gets 83.2% WER on Chinese. Catastrophic failure. Because HuBERT can't capture Mandarin tonal distinctions.
The fix? Use a multilingual teacher model like mHuBERT-147 (trained on 147 languages) instead of English-only WavLM. Or use IndicWav2Vec for Indian languages specifically.
Continuous VAE (DiTAR Style)
Now we get to the approach that says - why quantize at all?
DiTAR (ByteDance, 2025) uses a VAE instead of a discrete codec. No codebooks. No quantization. No information loss from discretization.
The Architecture
24kHz waveform → Conv Encoder → μ, σ (mean and variance)
→ z = μ + σ × ε (reparameterization trick)
→ continuous latent vectors [400 frames, 64-dim]
→ BigVGAN Decoder → reconstructed 24kHz waveform
The latent space is Gaussian - each frame is a 64-dimensional continuous vector. At ~40 Hz frame rate, 10 seconds of audio gives you 400 frames of 64-dimensional vectors.
Training losses are the standard VAE combination: waveform reconstruction plus a KL regularization term that keeps the latent distribution close to a Gaussian prior. The VAE is trained as Phase 1, then frozen. The downstream TTS model (DiTAR's language model + LocDiT) operates on these continuous vectors.
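The reparameterization trick and the KL term fit in a few lines. A toy numpy sketch with shapes matching the 400-frame, 64-dim example above (random stand-ins for the encoder's outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 400, 64                                # ~10 s at ~40 Hz, 64-dim latents

mu = rng.standard_normal((T, D)) * 0.1        # encoder mean
logvar = rng.standard_normal((T, D)) * 0.1    # encoder log-variance

# z = mu + sigma * eps: sampling stays differentiable w.r.t. mu and sigma
eps = rng.standard_normal((T, D))
z = mu + np.exp(0.5 * logvar) * eps

# KL(q(z|x) || N(0, I)), summed over dims, averaged over frames
kl = -0.5 * np.mean(np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1))

print(z.shape)    # (400, 64) continuous latents - no quantization anywhere
print(kl >= 0)    # KL is always non-negative
```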
Why Go Continuous?
The argument is simple: quantization is lossy. Every time you round a continuous value to a discrete codebook entry, you lose information. This "quantization error" accumulates across codebooks and limits reconstruction quality.
With a VAE, there's no quantization step. The latent vectors are continuous. The decoder gets exactly the right information to reconstruct the audio.
The downside? You can't use standard next-token prediction (softmax over a vocabulary) because there is no vocabulary. You need a different generation approach. DiTAR uses flow matching - a diffusion-like process that generates continuous vectors from noise. This requires multiple denoising steps (10 Euler steps in DiTAR) per frame, which is slower than discrete token prediction.
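To make the Euler-step idea concrete, here's a toy sketch. Under a straight-line path from noise to data, the ideal velocity field is (x1 - x)/(1 - t), and 10 Euler steps walk noise exactly onto the target. This closed form is purely illustrative - in DiTAR the velocity is a learned network conditioned on context, not something you can write down:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64

x1 = rng.standard_normal(D)          # the "data" latent we want to generate
x = rng.standard_normal(D)           # start from pure noise

def velocity(x, t):
    # Ideal velocity field for the straight-line path toward x1.
    # A real model replaces this with a trained v_theta(x, t, conditioning).
    return (x1 - x) / (1.0 - t)

# 10 Euler steps from t=0 toward t=1 (as in DiTAR)
n_steps = 10
dt = 1.0 / n_steps
t = 0.0
for _ in range(n_steps):
    x = x + dt * velocity(x, t)
    t += dt

print(np.allclose(x, x1))   # True: with the ideal field, Euler lands on x1
```

With a learned, imperfect field, more steps generally mean better samples - hence the speed/quality knob that discrete token prediction doesn't have.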
Continuous vs Discrete - The Tradeoff
The field is genuinely split right now. Koel-TTS (NVIDIA, EMNLP 2025) and likely Sarvam's Bulbul V3 use discrete tokens. DiTAR (ByteDance) and Pocket TTS/CALM (Kyutai) use continuous latents. Both camps claim superior results.
My personal take: for production streaming TTS with low latency requirements, discrete (especially FSQ) wins today. For highest-quality offline generation where latency doesn't matter, continuous wins. The gap is closing though - Pocket TTS showed you can do continuous generation in just 1 step using consistency models, running faster than real-time on CPU.
The Indic Benchmark: Testing These Codecs on Indian Languages
So remember I said nobody had evaluated these codecs on non-English speech? I decided to actually do it. I ran the first systematic evaluation of neural audio codecs on Indian language speech data - five languages, three codecs, five metrics.
The Setup
Five metrics: PESQ (perceptual quality), STOI (intelligibility), speaker similarity via ECAPA-TDNN embeddings, WER via language-specific Whisper models, and exact ASR transcript match rate. The idea was to test every angle - does the audio sound good? Is it still intelligible? Does it preserve speaker identity? Can downstream ASR still understand it?
Hindi source data was 48kHz studio quality. Tamil, Telugu, Bengali, and Kannada were 16kHz. This is worth flagging as a confounding variable - Hindi's higher scores might partially reflect better source quality, not just phonological similarity to English.
PESQ: Perceptual Quality
DAC at 4.473 mean PESQ is near-perfect (the scale tops out at 4.5). EnCodec at 2.797 is decent but a noticeable degradation. SNAC at 1.921 is genuinely poor - Bengali drops to 1.645. That's bad enough that you'd hear the artifacts clearly.
STOI: Intelligibility
DAC at 0.992 - nearly identical intelligibility. SNAC loses 17% intelligibility on Bengali (0.752). For a codec that's supposed to reconstruct audio faithfully, losing that much intelligibility is a serious problem.
Speaker Similarity: Voice Preservation
This one really tells the story. DAC preserves speaker identity almost perfectly at 0.987. EnCodec is acceptable at 0.902. But SNAC? 0.720 mean, and Bengali at 0.632 - that's losing nearly 37% of speaker identity. If you're building a voice cloning or speaker-conditioned TTS pipeline for Indian languages and using SNAC, the voice that comes out the other end won't sound like the person you put in. That's a dealbreaker.
WER: ASR Degradation
I evaluated WER using language-specific fine-tuned Whisper models. I could only get reliable results for Hindi and Tamil - the Telugu, Bengali, and Kannada ASR models produced hallucinated text, infinite repetition loops, or outputs in the wrong script entirely. Rather than report misleading numbers, I'm reporting WER only where ASR was reliable.
DAC introduces only 17.8% WER on Hindi - minimal degradation. SNAC on Tamil? 82.9% WER. The reconstructed audio is nearly completely unintelligible to downstream ASR. Tamil is consistently 2-2.4x harder than Hindi across all codecs, likely reflecting Tamil's agglutinative morphology and less representation in codec training data.
What This Tells Us
The codec ranking is unambiguous: DAC >> EnCodec >> SNAC for Indian languages. This holds across all five languages and all metrics.
DAC generalizes remarkably well to Indian languages despite being trained primarily on English. SNAC is not recommended for Indian language applications in its current form - it introduces significant artifacts that degrade both perceived quality and downstream ASR performance.
Hindi consistently scores highest across all codecs. This may reflect that Hindi's phonological features are closer to English (the dominant training language), or that the Hindi dataset had higher recording quality (48kHz studio vs 16kHz for others). Bengali and Telugu consistently score lowest - their distinct phonological features (retroflex consonants, aspiration patterns) appear to be less well-captured by English-trained codecs.
A few important caveats: this is 10 samples per language, not hundreds. PESQ was designed for English speech - the absolute numbers might not perfectly reflect what a native speaker perceives. And I couldn't test NanoCodec (dependency issues) or Mimi (not packaged for easy evaluation) - both of which would be very interesting to benchmark given their architectural differences.
But the signal is clear enough. If you're building speech AI for Indian languages today and you need to pick a codec - use DAC. Not SNAC. Not EnCodec. DAC.
The Big Picture
Let's put it all together. The trend across these four approaches is clear: lower frame rates, lower bitrates, and representations designed to plug straight into language models.
And as I showed in the Indic experiment above - evaluating these codecs on non-English speech reveals dramatic differences. DAC generalizes remarkably well (PESQ 4.473 across five Indian languages), while SNAC nearly collapses (82.9% WER on Tamil). The codec you pick matters enormously, and evaluating only on English hides these failure modes entirely.
The open questions that remain: do these codecs faithfully reconstruct retroflex consonants? Aspirated sounds? Code-switched Hinglish? These need minimal pair evaluations that go beyond aggregate metrics. And we need to test NanoCodec, Mimi, and the newer codecs on Indic data too. This evaluation was a starting point - 10 samples per language, three codecs. The full picture needs 22 languages, phoneme-class-specific analysis, and subjective listening tests with native speakers. But the signal is already clear enough to make production decisions today.
If you found this useful, feel free to ping me to ask anything related here:
Thanks to Claude for helping me organize my thoughts on this. All the understanding comes from actually working with these systems - the writing just needed a push.