Research Scientist @ Invideo

Aman Agrawal

Research Scientist Speech Enhancement Audio Signal Processing Distributed Training LLM Systems ASR & TTS

Building speech enhancement systems and training models at scale. IIT Roorkee. 25,000+ views on Medium.

First principles, pen & paper, Jupyter notebooks — before frameworks.

I document everything I learn — from paper breakdowns to model internals. See my research.


01 About

Aman Agrawal

Research Scientist at Invideo with hands-on experience in audio signal processing, speech enhancement, and large-scale distributed training. Built and pre-trained GenHencer from scratch, reaching STOI 0.91 on LibriTTS-R with perceptual quality validated through structured listening tests.

Deeply interested in the latest developments in ASR (Whisper, Wav2Vec2, HuBERT) and TTS systems. Strong foundation in mathematics, ML, and Generative AI with in-depth knowledge of LLMs. Published 4 articles in Towards Data Science with 25,000+ views.

Indian Institute of Technology Roorkee

B.Tech · JEE Advanced AIR 6851 · 2021 – 2025

Audio Signal Processing Speech Enhancement ASR TTS Distributed Training LLMs Audio-Visual Sync RAG Systems DSP

02 Experience

Jul 2025 – Present Full Time

Research Scientist

Invideo

  • Helped build GenHencer, a speech enhancement model, contributing to the architecture design and implementing key components of the training pipeline.
  • Designed and implemented a synthetic noisy-clean data generation pipeline using RIR convolution and SNR-controlled noise mixing to simulate real-world acoustic environments.
  • Scaled multi-node distributed training (PyTorch DDP) on a large GPU cluster to pre-train GenHencer on LibriTTS-R, achieving STOI 0.91 with perceptual quality validated through structured listening tests.
  • Leveraged audio signal processing knowledge (STFT, spectral analysis, DSP fundamentals) to debug model behavior, guide experimentation, and evaluate output quality.
PyTorch Audio DSP PyTorch DDP SLURM Wav2Vec2 DAC
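The SNR-controlled mixing step above fits in a few lines: scale the noise so the clean-to-noise power ratio hits a target decibel value, then sum. A minimal sketch, with illustrative function names rather than the production pipeline:

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`,
    then add it to `clean`. Inputs are equal-length float sample lists."""
    clean_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum(s * s for s in noise) / len(noise)
    # Target noise power for the requested SNR: P_noise = P_clean / 10^(SNR/10)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]

# Example: corrupt one second of a 440 Hz "clean" tone with uniform noise at 10 dB SNR
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1, 1) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

A real pipeline would additionally convolve the clean speech with a room impulse response (RIR) before mixing, but the SNR control itself is exactly this gain computation.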
May 2025 – Jul 2025 Internship

Machine Learning Engineer Intern

Pascal AI Labs · Bangalore (Remote)

  • Designed and developed an LLM evaluation platform enabling finance domain experts to run iterative prompt experiments and analyze model responses against internal golden datasets.
  • Streamlined the experimentation workflow with intuitive comparison tools for LLM outputs, golden answers, and historical experiment results.
  • Benchmarked fintech software against the industry-standard FinBench dataset; increased normal RAG accuracy by 11% and advanced RAG accuracy by 10% through research-driven system enhancements.
LLMs RAG Evaluation FinBench Prompt Engineering

03 Research

I go deep into the papers I study — custom diagrams, full pipeline breakdowns, and training analysis.

SyncNet: Audio-Visual Sync — Full Technical Breakdown

21-page deep dive into the SyncNet paper with custom diagrams, full inference pipeline walkthrough with exact tensor shapes, training analysis with W&B logs, and dataloader bottleneck identification. From architecture to contrastive loss to confidence scores.

Audio-Visual Sync Contrastive Loss MFCC CNN W&B

videoEra — Audio-Visual Research Knowledge Base

Structured repository covering audio fundamentals (sound physics, Fourier transforms, aliasing, reverb), paper breakdowns (HuBERT, Wav2Vec2, Whisper, SyncNet), Descript Audio Codec documentation, spectral & time-domain analysis notebooks, and an audio model visualizer.

Fourier Transforms Spectral Analysis HuBERT Wav2Vec2 Whisper DAC
Repository Structure
videoEra/
├── docs/
│   ├── audio_fundamentals/
│   │   ├── aliasing.md
│   │   ├── fourier_intuition.md
│   │   ├── sound_physics_and_perception.md
│   │   └── sound_reverb.md
│   ├── audio_codecs/
│   │   ├── README.md
│   │   └── descript_audio_codec.md
│   └── papers/
│       ├── hubert.pdf
│       ├── syncnet.pdf
│       ├── wav2vec2.pdf
│       └── whisper.pdf
├── notebooks/
│   ├── spectral_analysis/
│   └── time_domain/
├── projects/
│   ├── audio_model_visualizer/
│   └── syncnet/
└── tutorials/

Audio Preprocessing Notes — Sound to Mel Spectrograms

76 pages of handwritten notes covering the full audio preprocessing pipeline from first principles — sound physics, sampling, Fourier transforms, STFT, filter banks, and mel spectrograms. Written while building deep domain expertise during research work at Invideo.

Handwritten 76 Pages Mel Spectrograms STFT Filter Banks
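The last step of that pipeline, warping linear frequency onto the mel scale, comes down to two formulas. The HTK-style constants below are a common convention, assumed here rather than taken from the notes:

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Mel filter banks place triangle centres evenly in mel space, which spaces
# them roughly linearly below ~1 kHz and logarithmically above, matching
# how pitch perception compresses high frequencies.
step = hz_to_mel(8000) / 9
centres_hz = [mel_to_hz(i * step) for i in range(10)]
```

Evenly spaced mel centres converted back to Hz bunch together at low frequencies and spread out at high ones, which is the whole point of the filter bank.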

04 Projects

GISS

GitHub Issues Semantic Search · 2023

Concept-based issue retrieval using NLP embeddings and FAISS. Robust data pipeline via GitHub REST API with selective comment filtering. Contextual chatbot powered by Google Gemini 1.5 with RAG for conversational, issue-focused solutions.

Watch Demo
FAISS Embeddings Gemini RAG REST API
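The retrieval core of a project like this reduces to nearest-neighbour search over embedding vectors. A dependency-free cosine-similarity sketch standing in for the FAISS index (vectors and function names are illustrative, not GISS internals):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, issue_vecs, top_k=2):
    """Return indices of the top_k issue embeddings most similar to the query."""
    ranked = sorted(range(len(issue_vecs)),
                    key=lambda i: cosine(query_vec, issue_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy 3-d "embeddings" for three issues; a real pipeline encodes issue text
# with a sentence embedding model and indexes the vectors in FAISS.
issues = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.8, 0.2, 0.1]]
hits = search([1.0, 0.0, 0.0], issues, top_k=2)
```

FAISS replaces the brute-force `sorted` scan with an optimized index, but the ranking criterion is the same.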

05 Tech Stack

Languages

Python C++

ML & NLP

PyTorch Transformers spaCy OpenAI Hugging Face LangChain RAG Pipelines

Audio & DSP

LibROSA torchaudio Mel Spectrograms MFCCs Fourier Transforms Wav2Vec2 DAC

Distributed & Infra

PyTorch DDP SLURM Linux/Bash Weights & Biases

Dev Tools

FastAPI Git Postman

06 Writing

25,000+ views — 4 publications in Towards Data Science + self-hosted technical blog

Fourier Transform
Towards Data Science Signal Processing

Everything You Need to Know About Fourier Transform

An intuitive deep dive — from winding machines and the centre of mass to frequency decomposition of sound.

2026 · ~18 min
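The winding-machine picture maps directly onto the discrete Fourier transform: wrap the signal around the origin at a test frequency and take the centre of mass. A minimal sketch of that idea (mine, not code from the article):

```python
import cmath
import math

def winding_centre_of_mass(signal, freq, sample_rate):
    """Wind `signal` around the origin at `freq` Hz and return the centre
    of mass -- this is one normalized DFT coefficient."""
    n = len(signal)
    return sum(
        x * cmath.exp(-2j * math.pi * freq * t / sample_rate)
        for t, x in enumerate(signal)
    ) / n

sr = 800
signal = [math.sin(2 * math.pi * 50 * t / sr) for t in range(sr)]  # 1 s of a 50 Hz tone
# The centre of mass drifts far from the origin only at the true frequency
peak = abs(winding_centre_of_mass(signal, 50, sr))  # large for the matching frequency
off = abs(winding_centre_of_mass(signal, 37, sr))   # near zero elsewhere
```

At the matching frequency the wound-up signal piles onto one side of the circle; at any other whole-cycle frequency it cancels symmetrically.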
Aliasing in Audio
Towards Data Science Audio · DSP

Aliasing in Audio — From Wagon Wheels to Waveforms

Why spinning wheels go backward, why cheap recordings sound harsh, and why it all traces back to the Nyquist theorem.

2026 · ~15 min
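Nyquist in one experiment: sample a tone above half the sampling rate and it becomes indistinguishable from a lower one. A toy illustration of the effect (my own numbers, not the article's):

```python
import math

def sample_tone(freq_hz, sample_rate, n):
    """Return n samples of a sine wave at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

sr = 1000  # Nyquist limit is 500 Hz
# A 900 Hz tone sampled at 1 kHz aliases onto |900 - 1000| = 100 Hz,
# up to a sign flip: sin(2*pi*900*t/1000) == -sin(2*pi*100*t/1000)
high = sample_tone(900, sr, 32)
aliased = [-x for x in sample_tone(100, sr, 32)]
```

The two sample sequences agree to floating-point precision, so no amount of post-processing can recover the original 900 Hz tone: the information is gone at sampling time.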
SyncNet
Towards Data Science Audio-Visual

SyncNet Paper Easily Explained

Full technical breakdown of audio-visual synchronisation — how SyncNet learns to match lip movements with speech.

2025 · ~12 min
Bessel's Correction
Towards Data Science Statistics

Bessel's Correction — Why n-1?

Why we divide by n-1 for sample variance — the mathematical proof and intuition behind this statistical correction.

2024 · ~8 min
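The claim checks out empirically: averaging many small-sample variances, the divide-by-n estimator undershoots the true variance by a factor of (n-1)/n, while dividing by n-1 lands on target. A quick simulation (illustrative, not from the article):

```python
import random

random.seed(42)

def variance(xs, ddof):
    """Sample variance with divisor len(xs) - ddof."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

# True variance of Uniform(0, 1) is 1/12 ~= 0.0833
trials, n = 20000, 5
biased = unbiased = 0.0
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    biased += variance(sample, ddof=0) / trials    # divide by n
    unbiased += variance(sample, ddof=1) / trials  # divide by n-1
# biased converges near (n-1)/n * 1/12 ~= 0.0667; unbiased near 0.0833
```

The shortfall comes from measuring deviations around the sample mean, which is always closer to the data than the true mean is; Bessel's correction exactly compensates.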
Blog Neural Codecs · Indic

Neural Audio Codecs — Dissected, Compared, and Benchmarked

Four codec architectures from first principles with the first systematic evaluation on five Indian languages.

Mar 2026 · ~25 min
Fourier Transform
Blog Signal Processing

Understanding the Fourier Transform: An Intuitive Approach

From winding machines to the centre of mass — building deep intuition for frequency decomposition.

Mar 2026 · ~18 min
Aliasing
Blog Signal Processing

Understanding the Foundational Distortion of Digital Audio

From first principles — why spinning wheels go backward and why it traces back to one elegant rule.

Feb 2026 · ~15 min
Blog Audio · Codecs

The Grey Bar — Sound, Data & Childhood Curiosity

A story about sound waves, compression, and the invisible machinery behind every song you've ever streamed.

Feb 2026 · ~12 min
RAG Paper
Towards AI RAG · NLP

RAG for Knowledge-Intensive NLP Tasks

Deep dive into Retrieval-Augmented Generation — how retrieval enhances language model generation.

2024 · ~10 min
MLE
Medium Math · ML

Maximum Likelihood Estimation

The fundamental concept of MLE — a cornerstone in parameter estimation for machine learning.

2024 · ~8 min
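The core idea of MLE shows up cleanly in the Bernoulli case: the likelihood of k heads in n flips is maximized at p = k/n, and a grid search over candidate parameters confirms what calculus predicts. A sketch with made-up data:

```python
import math

def log_likelihood(p, successes, n):
    """Bernoulli log-likelihood of observing `successes` heads in `n` flips."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

successes, n = 7, 10
# Scan candidate parameters; the maximum lands at the sample proportion 0.7
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, successes, n))
```

Working with the log-likelihood rather than the likelihood itself is standard practice: it turns products into sums and avoids numerical underflow without moving the maximum.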

07 Achievements

IMAGINE Hackathon — Top 20

Secured Top 20 among 72 teams (15,000+ registrations) at PIWOT 2025, organized by PAN IIT Alumni Association in Mumbai, representing IIT Roorkee.

Basketball Gold Medal

Gold medalist in men's basketball at the Inter-Hostel Sports General Championship (10 teams), IIT Roorkee.

08 Get In Touch

Open to research collaborations, full-time opportunities in AI/ML, or just a conversation about what's next in audio ML and LLMs.

Say Hello