Research Scientist @ Invideo

Aman Agrawal

Research Scientist Speech Enhancement Audio Signal Processing Distributed Training LLM Systems ASR & TTS

Building speech enhancement systems and training models at scale. IIT Roorkee. 25,000+ views on Medium.

First principles, pen & paper, Jupyter notebooks — before frameworks.

I document everything I learn — from paper breakdowns to model internals. See my research.


01 About

Aman Agrawal

Research Scientist at Invideo with hands-on experience in audio signal processing, speech enhancement, and large-scale distributed training. Built and pre-trained GenHencer from scratch, reaching STOI 0.91 on LibriTTS-R with perceptual quality validated through structured listening tests.

Deeply interested in the latest developments in ASR (Whisper, Wav2Vec2, HuBERT) and TTS systems. Strong foundation in mathematics, ML, and Generative AI with in-depth knowledge of LLMs. Published 4 articles in Towards Data Science with 25,000+ views.

Indian Institute of Technology Roorkee

B.Tech · JEE Advanced AIR 6851 · 2021 – 2025

Audio Signal Processing Speech Enhancement ASR TTS Distributed Training LLMs Audio-Visual Sync RAG Systems DSP

02 Experience

Jul 2025 – Present Full Time

Research Scientist

Invideo

  • Helped build GenHencer, a speech enhancement model, contributing to the architecture design and implementing key components of the training pipeline.
  • Designed and implemented a synthetic noisy-clean data generation pipeline using RIR convolution and SNR-controlled noise mixing to simulate real-world acoustic environments.
  • Scaled multi-node distributed training (PyTorch DDP) on a large GPU cluster to pre-train GenHencer on LibriTTS-R, achieving STOI 0.91 with perceptual quality validated through structured listening tests.
  • Leveraged audio signal processing knowledge (STFT, spectral analysis, DSP fundamentals) to debug model behavior, guide experimentation, and evaluate output quality.
PyTorch Audio DSP PyTorch DDP SLURM Wav2Vec2 DAC
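The SNR-controlled mixing step above fits in a few lines: scale the noise so the clean-to-noise power ratio hits a target decibel value, then sum. A minimal sketch, with illustrative function names rather than the production pipeline:

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`,
    then add it to `clean`. Inputs are equal-length float sample lists."""
    clean_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum(s * s for s in noise) / len(noise)
    # Target noise power for the requested SNR: P_noise = P_clean / 10^(SNR/10)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]

# Example: corrupt one second of a 440 Hz "clean" tone with uniform noise at 10 dB SNR
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1, 1) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

A real pipeline would additionally convolve the clean speech with a room impulse response (RIR) before mixing, but the SNR control itself is exactly this gain computation.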
May 2025 – Jul 2025 Internship

Machine Learning Engineer Intern

Pascal AI Labs · Bangalore (Remote)

  • Designed and developed an LLM evaluation platform enabling finance domain experts to run iterative prompt experiments and analyze model responses against internal golden datasets.
  • Streamlined the experimentation workflow with intuitive comparison tools for LLM outputs, golden answers, and historical experiment results.
  • Benchmarked fintech software against the industry-standard FinBench dataset; increased normal RAG accuracy by 11% and advanced RAG accuracy by 10% through research-driven system enhancements.
LLMs RAG Evaluation FinBench Prompt Engineering

03 Research

I go deep into the papers I study — custom diagrams, full pipeline breakdowns, and training analysis.

SyncNet: Audio-Visual Sync — Full Technical Breakdown

21-page deep dive into the SyncNet paper with custom diagrams, full inference pipeline walkthrough with exact tensor shapes, training analysis with W&B logs, and dataloader bottleneck identification. From architecture to contrastive loss to confidence scores.

Audio-Visual Sync Contrastive Loss MFCC CNN W&B

videoEra — Audio-Visual Research Knowledge Base

Structured repository covering audio fundamentals (sound physics, Fourier transforms, aliasing, reverb), paper breakdowns (HuBERT, Wav2Vec2, Whisper, SyncNet), Descript Audio Codec documentation, spectral & time-domain analysis notebooks, and an audio model visualizer.

Fourier Transforms Spectral Analysis HuBERT Wav2Vec2 Whisper DAC
Repository Structure
videoEra/
├── docs/
│   ├── audio_fundamentals/
│   │   ├── aliasing.md
│   │   ├── fourier_intuition.md
│   │   ├── sound_physics_and_perception.md
│   │   └── sound_reverb.md
│   ├── audio_codecs/
│   │   ├── README.md
│   │   └── descript_audio_codec.md
│   └── papers/
│       ├── hubert.pdf
│       ├── syncnet.pdf
│       ├── wav2vec2.pdf
│       └── whisper.pdf
├── notebooks/
│   ├── spectral_analysis/
│   └── time_domain/
├── projects/
│   ├── audio_model_visualizer/
│   └── syncnet/
└── tutorials/

Audio Preprocessing Notes — Sound to Mel Spectrograms

76 pages of handwritten notes covering the full audio preprocessing pipeline from first principles — sound physics, sampling, Fourier transforms, STFT, filter banks, and mel spectrograms. Written while building deep domain expertise during research work at Invideo.

Handwritten 76 Pages Mel Spectrograms STFT Filter Banks
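The last step of that pipeline, warping linear frequency onto the mel scale, comes down to two formulas. The HTK-style constants below are a common convention, assumed here rather than taken from the notes:

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Mel filter banks place triangle centres evenly in mel space, which spaces
# them roughly linearly below ~1 kHz and logarithmically above, matching
# how pitch perception compresses high frequencies.
step = hz_to_mel(8000) / 9
centres_hz = [mel_to_hz(i * step) for i in range(10)]
```

Evenly spaced mel centres converted back to Hz bunch together at low frequencies and spread out at high ones, which is the whole point of the filter bank.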

04 Projects

GISS

GitHub Issues Semantic Search · 2023

Concept-based issue retrieval using NLP embeddings and FAISS. Robust data pipeline via GitHub REST API with selective comment filtering. Contextual chatbot powered by Google Gemini 1.5 with RAG for conversational, issue-focused solutions.

Watch Demo
FAISS Embeddings Gemini RAG REST API
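The retrieval core of a project like this reduces to nearest-neighbour search over embedding vectors. A dependency-free cosine-similarity sketch standing in for the FAISS index (vectors and function names are illustrative, not GISS internals):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, issue_vecs, top_k=2):
    """Return indices of the top_k issue embeddings most similar to the query."""
    ranked = sorted(range(len(issue_vecs)),
                    key=lambda i: cosine(query_vec, issue_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy 3-d "embeddings" for three issues; a real pipeline encodes issue text
# with a sentence embedding model and indexes the vectors in FAISS.
issues = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.8, 0.2, 0.1]]
hits = search([1.0, 0.0, 0.0], issues, top_k=2)
```

FAISS replaces the brute-force `sorted` scan with an optimized index, but the ranking criterion is the same.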

05 Tech Stack

Languages

Python C++

ML & NLP

PyTorch Transformers spaCy OpenAI Hugging Face LangChain RAG Pipelines

Audio & DSP

LibROSA torchaudio Mel Spectrograms MFCCs Fourier Transforms Wav2Vec2 DAC

Distributed & Infra

PyTorch DDP SLURM Linux/Bash Weights & Biases

Dev Tools

FastAPI Git Postman

06 Writing

25,000+ views — 4 publications in Towards Data Science + self-hosted technical blog

Fourier Transform
Towards Data Science Signal Processing

Everything You Need to Know About Fourier Transform

An intuitive deep dive — from winding machines and the centre of mass to frequency decomposition of sound.

2026 · ~18 min
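The winding-machine picture maps directly onto the discrete Fourier transform: wrap the signal around the origin at a test frequency and take the centre of mass. A minimal sketch of that idea (mine, not code from the article):

```python
import cmath
import math

def winding_centre_of_mass(signal, freq, sample_rate):
    """Wind `signal` around the origin at `freq` Hz and return the centre
    of mass -- this is one normalized DFT coefficient."""
    n = len(signal)
    return sum(
        x * cmath.exp(-2j * math.pi * freq * t / sample_rate)
        for t, x in enumerate(signal)
    ) / n

sr = 800
signal = [math.sin(2 * math.pi * 50 * t / sr) for t in range(sr)]  # 1 s of a 50 Hz tone
# The centre of mass drifts far from the origin only at the true frequency
peak = abs(winding_centre_of_mass(signal, 50, sr))  # large for the matching frequency
off = abs(winding_centre_of_mass(signal, 37, sr))   # near zero elsewhere
```

At the matching frequency the wound-up signal piles onto one side of the circle; at any other whole-cycle frequency it cancels symmetrically.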
Aliasing in Audio
Towards Data Science Audio · DSP

Aliasing in Audio — From Wagon Wheels to Waveforms

Why spinning wheels go backward, why cheap recordings sound harsh, and why it all traces back to the Nyquist theorem.

2026 · ~15 min
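Nyquist in one experiment: sample a tone above half the sampling rate and it becomes indistinguishable from a lower one. A toy illustration of the effect (my own numbers, not the article's):

```python
import math

def sample_tone(freq_hz, sample_rate, n):
    """Return n samples of a sine wave at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

sr = 1000  # Nyquist limit is 500 Hz
# A 900 Hz tone sampled at 1 kHz aliases onto |900 - 1000| = 100 Hz,
# up to a sign flip: sin(2*pi*900*t/1000) == -sin(2*pi*100*t/1000)
high = sample_tone(900, sr, 32)
aliased = [-x for x in sample_tone(100, sr, 32)]
```

The two sample sequences agree to floating-point precision, so no amount of post-processing can recover the original 900 Hz tone: the information is gone at sampling time.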
SyncNet
Towards Data Science Audio-Visual

SyncNet Paper Easily Explained

Full technical breakdown of audio-visual synchronisation — how SyncNet learns to match lip movements with speech.

2025 · ~12 min
Bessel's Correction
Towards Data Science Statistics

Bessel's Correction — Why n-1?

Why we divide by n-1 for sample variance — the mathematical proof and intuition behind this statistical correction.

2024 · ~8 min
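The claim checks out empirically: averaging many small-sample variances, the divide-by-n estimator undershoots the true variance by a factor of (n-1)/n, while dividing by n-1 lands on target. A quick simulation (illustrative, not from the article):

```python
import random

random.seed(42)

def variance(xs, ddof):
    """Sample variance with divisor len(xs) - ddof."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

# True variance of Uniform(0, 1) is 1/12 ~= 0.0833
trials, n = 20000, 5
biased = unbiased = 0.0
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    biased += variance(sample, ddof=0) / trials    # divide by n
    unbiased += variance(sample, ddof=1) / trials  # divide by n-1
# biased converges near (n-1)/n * 1/12 ~= 0.0667; unbiased near 0.0833
```

The shortfall comes from measuring deviations around the sample mean, which is always closer to the data than the true mean is; Bessel's correction exactly compensates.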
Blog Neural Codecs · Indic

Neural Audio Codecs — Dissected, Compared, and Benchmarked

Four codec architectures from first principles with the first systematic evaluation on five Indian languages.

Mar 2026 · ~25 min
Fourier Transform
Blog Signal Processing

Understanding the Fourier Transform: An Intuitive Approach

From winding machines to the centre of mass — building deep intuition for frequency decomposition.

Mar 2026 · ~18 min
Aliasing
Blog Signal Processing

Understanding the Foundational Distortion of Digital Audio

From first principles — why spinning wheels go backward and why it traces back to one elegant rule.

Feb 2026 · ~15 min
Blog Audio · Codecs

The Grey Bar — Sound, Data & Childhood Curiosity

A story about sound waves, compression, and the invisible machinery behind every song you've ever streamed.

Feb 2026 · ~12 min
RAG Paper
Towards AI RAG · NLP

RAG for Knowledge-Intensive NLP Tasks

Deep dive into Retrieval-Augmented Generation — how retrieval enhances language model generation.

2024 · ~10 min
MLE
Medium Math · ML

Maximum Likelihood Estimation

The fundamental concept of MLE — a cornerstone in parameter estimation for machine learning.

2024 · ~8 min
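The core idea of MLE shows up cleanly in the Bernoulli case: the likelihood of k heads in n flips is maximized at p = k/n, and a grid search over candidate parameters confirms what calculus predicts. A sketch with made-up data:

```python
import math

def log_likelihood(p, successes, n):
    """Bernoulli log-likelihood of observing `successes` heads in `n` flips."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

successes, n = 7, 10
# Scan candidate parameters; the maximum lands at the sample proportion 0.7
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, successes, n))
```

Working with the log-likelihood rather than the likelihood itself is standard practice: it turns products into sums and avoids numerical underflow without moving the maximum.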

07 Achievements

IMAGINE Hackathon — Top 20

Secured Top 20 among 72 teams (15,000+ registrations) at PIWOT 2025, organized by PAN IIT Alumni Association in Mumbai, representing IIT Roorkee.

Basketball Gold Medal

Gold medalist in men's basketball at the Inter-Hostel Sports General Championship (10 teams), IIT Roorkee.

08 Get In Touch

Open to research collaborations, full-time opportunities in AI/ML, or just a conversation about what's next in audio ML and LLMs.

Say Hello