Epic! for Kids · February 2023 – May 2024

ML Infrastructure Rescue

Production ML platform ownership across cost, search, recommendations, and reliability.

Role: Senior Research Engineer

ML infra · Kubernetes · Search · Recommendations · Cost optimization

Executive summary

Took ownership of production ML systems after layoffs and reduced cost, complexity, and operational risk.

10x ML platform cost reduction
100x Kubernetes pod usage reduction
99% spot instance error reduction
50% Docker build-time reduction

Problem and constraints

The ML platform needed ownership across discovery, recommendations, search, Docker builds, Kubernetes usage, spot instance stability, and product experiments.

Keep production systems running
Reduce operational cost
Improve search relevance
Support backend/frontend/analytics needs

Architecture

01 ML services

02 Docker build pipeline

03 Kubernetes deployment

04 Spot compute

05 Elasticsearch autocomplete

06 Recommendations

07 A/B testing

Decision Theater

Decision fork

Scale existing infrastructure vs simplify it

Cost and reliability problems were symptoms of complexity, not just capacity.

Scale existing pattern

Pros

Less migration work

Cons

Preserves cost and fragility

Simplify usage

Pros

Lower cost
Smaller failure surface

Cons

Requires deeper investigation

Chosen: Simplify infrastructure. Reducing unnecessary infrastructure can be more powerful than tuning it.

Evaluation and reliability

Measured search relevance against the prior autocomplete solution.
Tracked infrastructure cost and operational error reduction.
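
One way to compare a new autocomplete against a prior one is an offline score over labeled prefixes. The sketch below uses mean reciprocal rank (MRR), which is an assumption: the case study does not say which relevance metric was used. The function names, the canned suggestion lists, and the example titles are all hypothetical stand-ins, not the production systems or real catalog data.

```python
from typing import Callable, Dict, List


def reciprocal_rank(suggestions: List[str], expected: str) -> float:
    """Return 1/rank of the expected item in the suggestions, or 0.0 if absent."""
    for rank, suggestion in enumerate(suggestions, start=1):
        if suggestion == expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    suggest: Callable[[str], List[str]],
    labeled_queries: Dict[str, str],
) -> float:
    """Average reciprocal rank over a set of (prefix -> expected result) labels."""
    if not labeled_queries:
        return 0.0
    total = sum(
        reciprocal_rank(suggest(prefix), expected)
        for prefix, expected in labeled_queries.items()
    )
    return total / len(labeled_queries)


# Hypothetical stand-ins for the two systems being compared.
def prior_autocomplete(prefix: str) -> List[str]:
    canned = {"dino": ["Dinner Time", "Dinosaur Facts"], "spa": ["Space Cats"]}
    return canned.get(prefix, [])


def new_autocomplete(prefix: str) -> List[str]:
    canned = {"dino": ["Dinosaur Facts", "Dinner Time"], "spa": ["Space Cats"]}
    return canned.get(prefix, [])


# Made-up labels: the result a user typing each prefix is assumed to want.
labeled = {"dino": "Dinosaur Facts", "spa": "Space Cats"}

prior_mrr = mean_reciprocal_rank(prior_autocomplete, labeled)
new_mrr = mean_reciprocal_rank(new_autocomplete, labeled)
print(f"prior MRR={prior_mrr:.2f}, new MRR={new_mrr:.2f}")
```

Scoring both systems against the same labeled set keeps the comparison fair. An offline score like this would normally be paired with an online A/B test before replacing the prior solution.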

Observability and debugging

Used production metrics and error behavior to prioritize high-impact infrastructure fixes.
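
A minimal sketch of this kind of prioritization, assuming error events can be exported as (component, message) pairs: count errors per component and rank components by their share of the total. The event format, component names, and messages below are hypothetical and are not taken from the actual platform.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_error_sources(
    events: Iterable[Tuple[str, str]],
) -> List[Tuple[str, int, float]]:
    """Rank components by error count; return (component, count, share of total)."""
    counts = Counter(component for component, _message in events)
    total = sum(counts.values())
    # most_common() yields nothing for an empty input, so no division by zero.
    return [(component, n, n / total) for component, n in counts.most_common()]


# Hypothetical sample of production error events.
sample_events = [
    ("spot-compute", "node preempted"),
    ("spot-compute", "node preempted"),
    ("docker-build", "cache miss"),
    ("spot-compute", "pod evicted"),
]

ranked = rank_error_sources(sample_events)
for component, count, share in ranked:
    print(f"{component}: {count} errors ({share:.0%})")
```

Ranking by share of total errors points first at the component where a single fix removes the most failures. A real prioritization would likely also weigh cost and user impact, not just raw counts.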

Reflection

Senior ML ownership often means cleaning up cost, reliability, and product feedback loops, not only training models.

This case study uses sanitized architecture and representative examples. It excludes confidential prompts, customer data, proprietary datasets, private implementation details, and internal traces.