In 2021, I won a Best Student Paper Award at INTERSPEECH for work on end-to-end speech translation, and an Outstanding Student Paper Award at IEEE ICASSP for speech quality assessment. I thought I was ready for industry. I was wrong about almost everything that mattered.

Five gaps between research and production:

  • Latency: offline to real-time
  • Data: clean to noisy
  • Systems: the model is 5% of the system
  • Users: you ship to users, not reviewers
  • Teams: researchers to ML engineers

The next four years took me through Comcast Applied AI, Spectrum Labs, Numerade, and AWS. Each one taught me something academia never mentioned. Not because professors are hiding secrets, but because the problems of production ML are fundamentally different from the problems of research ML, and you can't learn them from papers.

I'm not bitter about my research years. They gave me the foundation to think rigorously about hard problems. But the transition was harder than it needed to be, and as of 2024, roughly 70% of AI PhDs go to industry, up from 20% two decades ago. That's a lot of researchers about to hit the same wall. Here's what I wish someone had told me.


Gap 1: Latency budgets kill elegant models

My INTERSPEECH paper optimized transformer architectures for end-to-end speech translation. We improved BLEU scores by encoding inductive biases into the attention mechanism. Technically interesting, but the model took 2.3 seconds per inference on a V100.

At Comcast, the latency budget for our ASR system was 200 milliseconds. Not 2.3 seconds. Not even 500 milliseconds. Two hundred. Serving millions of cable box voice queries meant every millisecond cost real money and real user patience. My elegant transformer modifications? Useless. We shipped a wav2vec 2.0 system with aggressive quantization and temporal early exiting that hit 30% WER reduction within the latency envelope.

The lesson: in research, you optimize for accuracy. In production, you optimize for accuracy under constraints. Latency, memory, cost, throughput. The model that wins the benchmark rarely wins the deployment. I've seen this pattern repeat at every company since. At Numerade, we deployed LLMs via vLLM and had to choose 10x throughput over marginally better generation quality. At AWS, I've seen customers with models that score beautifully on eval sets but can't serve 100 concurrent requests without burning through their compute budget.
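What "optimizing under constraints" means in practice is that latency becomes a hard gate, not a line in a results table. A minimal sketch of that gate, using only the standard library (the budget value and the model stand-ins are illustrative, not the actual Comcast system):

```python
import time
import statistics

LATENCY_BUDGET_MS = 200  # a hard serving budget, not an aspiration

def p95_latency_ms(model_fn, inputs, warmup=3):
    """Measure per-request latency and return the 95th percentile in ms."""
    for x in inputs[:warmup]:          # warm caches before measuring
        model_fn(x)
    samples = []
    for x in inputs:
        t0 = time.perf_counter()
        model_fn(x)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.quantiles(samples, n=20)[18]  # 95th percentile cut

def fits_budget(model_fn, inputs, budget_ms=LATENCY_BUDGET_MS):
    """Gate a candidate model on tail latency, not on benchmark accuracy."""
    return p95_latency_ms(model_fn, inputs) <= budget_ms
```

Note the p95, not the mean: production latency conversations are almost always about the tail, because the tail is what users notice.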


Gap 2: Production data looks nothing like your dataset

Academic datasets are clean. They're curated. They have balanced classes, consistent formatting, and known distributions. I spent months working with LibriSpeech, Common Voice, and carefully constructed evaluation sets. The data was the one thing I never worried about.

At Spectrum Labs, we processed billions of messages daily across 12+ languages for trust and safety classification. The data was a nightmare. Misspellings, code-switching, slang that evolved weekly, adversarial evasion (users intentionally obfuscating harmful content), encoding issues, and distribution shifts that made last month's model measurably worse this month. We achieved 98% accuracy, but maintaining it was a full-time job. Not because the model degraded, but because the world the model operated in never stopped changing.

Google's "Hidden Technical Debt in Machine Learning Systems" paper nailed this in 2015: ML systems have all the maintenance problems of traditional software plus an entire additional dimension of data-dependency debt. The data pipeline, feature extraction, validation, and monitoring infrastructure dwarfs the model itself. In academia, the dataset is a given. In production, the dataset is the product.
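A concrete piece of that monitoring infrastructure is drift detection: comparing what the model sees in production against what it was trained on. Here is one simple, standard approach, the population stability index over token (or label) frequencies; the thresholds in the comment are the common rule of thumb, not Spectrum Labs' actual pipeline:

```python
import math
from collections import Counter

def psi(expected_tokens, observed_tokens, eps=1e-6):
    """Population Stability Index between a reference sample and live traffic.
    Rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 time to retrain."""
    vocab = set(expected_tokens) | set(observed_tokens)
    e_counts, o_counts = Counter(expected_tokens), Counter(observed_tokens)
    e_total, o_total = len(expected_tokens), len(observed_tokens)
    score = 0.0
    for tok in vocab:
        e = e_counts[tok] / e_total + eps   # eps guards against log(0)
        o = o_counts[tok] / o_total + eps
        score += (o - e) * math.log(o / e)
    return score
```

Run this weekly against a frozen reference sample and you have an early-warning signal for the "last month's model is measurably worse" problem, before accuracy dashboards catch it.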


Gap 3: The model is 5% of the system

The most cited figure from Google's technical debt paper is the diagram showing that ML model code is a tiny fraction of a real-world ML system. Google's own ML crash course estimates the model accounts for roughly 5% of the total production codebase. The other 95% is data collection, feature engineering, serving infrastructure, monitoring, configuration management, pipeline orchestration, and testing.

At Numerade, I built an AI video generation system from scratch: image generation, animation, text-to-speech, avatar synthesis. The models were maybe three weeks of work. The pipeline that stitched them together, handled failures gracefully, cached intermediate results, managed GPU memory across stages, and served the output reliably? Three months. And that pipeline is what actually shipped. Nobody uses a model. They use a system that happens to contain a model.
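The shape of that three-month pipeline is worth sketching. This is a toy version under my own assumptions (in-memory cache, exponential backoff; the real system persisted intermediates and managed GPU memory), but it shows where the engineering time goes: not in `fn`, but in everything wrapped around it.

```python
import hashlib
import pickle
import time

class Stage:
    """One pipeline step with a memo cache and simple retry-with-backoff."""
    def __init__(self, name, fn, retries=2, backoff_s=0.1):
        self.name, self.fn = name, fn
        self.retries, self.backoff_s = retries, backoff_s
        self._cache = {}

    def _key(self, payload):
        return hashlib.sha256(pickle.dumps(payload)).hexdigest()

    def run(self, payload):
        key = self._key(payload)
        if key in self._cache:                  # reuse intermediate results
            return self._cache[key]
        for attempt in range(self.retries + 1):
            try:
                result = self.fn(payload)
                self._cache[key] = result
                return result
            except Exception:
                if attempt == self.retries:
                    raise                       # surface after final retry
                time.sleep(self.backoff_s * (2 ** attempt))

def run_pipeline(stages, payload):
    """Thread a payload through stages; each stage caches and retries itself."""
    for stage in stages:
        payload = stage.run(payload)
    return payload
```

Swap in model calls for the stage functions (script generation, TTS, avatar synthesis) and the structure is the same: the models are interchangeable; the caching, retries, and failure surfacing are the product.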

In my research days, I spent 90% of my time on the model and 10% on everything else. In production, those percentages are inverted. The sooner a researcher internalizes this, the faster they'll be effective in industry.


Gap 4: You ship to users, not reviewers

Academic success is measured by peer reviewers. Three to five experts read your paper, evaluate your methodology, check your baselines, and judge your novelty. The feedback loop is 3-6 months. The audience is a few hundred specialists at a conference.

Production success is measured by users. Millions of them. Who don't read your methodology section. Who don't care about your baselines. Who want the thing to work, right now, every time. The feedback loop is seconds. If the model is slow, they leave. If it hallucinates, they lose trust. If it breaks, they file a support ticket and your on-call rotation starts at 3am.

After dozens of production deployments, I can tell you the conversations that matter aren't about architecture innovations or BLEU score improvements. They're about: "What happens when the model is wrong?" "How do we know it's degrading?" "What's the fallback?" "Can we explain this to our compliance team?" These are the questions that determine whether a project ships or dies. No conference review ever asked me to answer them.
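"What's the fallback?" has a code shape, and it's small. A sketch of the pattern, with hypothetical function names (the confidence threshold and the fallback are placeholders for whatever your product can safely degrade to, a cached answer, a simpler model, a "try again" message):

```python
def serve(model_fn, fallback_fn, request, min_confidence=0.7):
    """Answer with the model only when it is healthy and confident;
    otherwise degrade gracefully instead of failing the user."""
    try:
        answer, confidence = model_fn(request)
    except Exception:
        return fallback_fn(request)   # model down should not mean product down
    if confidence < min_confidence:
        return fallback_fn(request)   # low confidence routes to the safe path
    return answer
```

The point isn't the ten lines; it's that someone decided, before launch, what the user sees when the model can't be trusted.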

The mindset shift is profound. In research, you publish and move on. In production, you ship and then you live with it. Every deployment is a commitment to maintaining, monitoring, and improving something indefinitely. That changes how you make decisions about everything from model architecture to API design.


Gap 5: The team matters more than the architecture

My research was largely solo or in small groups of 2-3 collaborators, each with deep expertise in the same narrow domain. We communicated in math notation and paper citations. Disagreements were settled by experiments.

At Numerade, I built and managed a team of ML engineers and interns, coordinating across data, infrastructure, and product. The work is fundamentally collaborative in a way that research rarely is.

The architecture of the team constrains the architecture of the system. Conway's Law, the observation that systems mirror the communication structure of the organizations that build them, isn't a theory. It's a description of reality I've observed at each company I've worked at. If the ML team doesn't talk to the infrastructure team, the model won't deploy well. If the data engineers don't understand the model's requirements, the training pipeline will produce garbage. If the product team doesn't understand the model's limitations, they'll promise capabilities that don't exist.

My best work in industry hasn't been my best technical work. It's been the work where I translated between groups, helping researchers speak to engineers, helping engineers speak to product managers, helping product managers speak to customers. That translation layer is where most production ML projects fail, and it's a skill that no amount of paper-writing develops.


Advice for researchers entering industry

If I could go back and tell 2021-me five things:

  • Learn to love constraints. The elegance of a production system isn't in the model. It's in how gracefully the system handles the real world's messiness within a tight budget of latency, cost, and reliability.
  • Build the pipeline, not just the model. Before your first day, set up a training pipeline, a serving endpoint, a monitoring dashboard, and an A/B test. These skills matter more than any architecture innovation.
  • Read the technical debt paper. Sculley et al., NeurIPS 2015. I think it's the most practically useful paper in ML, and it's barely about models at all.
  • Practice explaining your work to non-experts. Your manager's manager doesn't know what attention is. Your customer definitely doesn't. The ability to explain complex systems simply is the most underrated skill in industry.
  • Ship something small, soon. The gap between "I understand this" and "I shipped this" is where all the real learning happens. Don't wait for the perfect project. Deploy a model, break it, fix it, deploy it again.
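On the A/B test point: the core of one is smaller than researchers expect. A minimal sketch of deterministic bucketing, the piece that makes an experiment reproducible and debuggable (experiment name and split are illustrative):

```python
import hashlib

def ab_bucket(user_id, experiment="model_v2", treatment_pct=10):
    """Deterministically assign a user to control or treatment.
    The same user always lands in the same bucket for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < treatment_pct else "control"
```

Hashing on `experiment:user_id` rather than `user_id` alone means a user's bucket in one experiment doesn't correlate with their bucket in the next, which is exactly the kind of detail that never comes up in a paper and always comes up in a launch review.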

My papers opened doors. My production experience kept them open. The research gave me the vocabulary and the intuition. But the five gaps above (latency, data, systems, users, teams) had to be learned the hard way. If you're making the transition, expect it to be uncomfortable. That discomfort is the learning.