Epic! for Kids · February 2023 – May 2024
ML Infrastructure Rescue
Production ML platform ownership across cost, search, recommendations, and reliability.
Role: Senior Research Engineer
Executive summary
Took ownership of production ML systems after layoffs and reduced cost, complexity, and operational risk.
Problem and constraints
The ML platform needed a single owner across discovery, recommendations, search, Docker builds, Kubernetes usage, spot instance stability, and product experiments, under the following constraints:
- Keep production systems running
- Reduce operational cost
- Improve search relevance
- Support backend/frontend/analytics needs
Architecture
Decision Theater
Decision fork
Scale the existing infrastructure vs. simplify it
Cost and reliability problems were symptoms of complexity, not just capacity.
Chosen: Simplify infrastructure. Removing unnecessary infrastructure can cut cost and failure modes more than tuning what is already there.
Evaluation and reliability
- Measured search relevance against the prior autocomplete solution.
- Tracked infrastructure cost and the reduction in operational errors.
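One way a relevance comparison against a prior autocomplete solution could be set up is sketched below. This is a minimal illustration, not the method used in this project: the case study does not name its relevance metric, so mean reciprocal rank (MRR) is an assumption, and the queries, titles, and both search functions are hypothetical placeholders.

```python
from typing import Callable, Dict, List

# Hypothetical labeled set: query -> the single result a user actually wanted.
LABELED_QUERIES: Dict[str, str] = {
    "dino": "Dinosaur Facts",
    "space": "Space Explorers",
}

# Hypothetical canned outputs standing in for two real search backends.
BASELINE_RESULTS: Dict[str, List[str]] = {
    "dino": ["Dinner Time", "Dinosaur Facts"],
    "space": ["Spaceship Art"],
}
CANDIDATE_RESULTS: Dict[str, List[str]] = {
    "dino": ["Dinosaur Facts", "Dinner Time"],
    "space": ["Spaceship Art", "Space Explorers"],
}


def baseline_search(query: str) -> List[str]:
    """Stand-in for the prior autocomplete solution (assumed interface)."""
    return BASELINE_RESULTS.get(query, [])


def candidate_search(query: str) -> List[str]:
    """Stand-in for the replacement search (assumed interface)."""
    return CANDIDATE_RESULTS.get(query, [])


def reciprocal_rank(results: List[str], relevant: str) -> float:
    """Return 1/position of the relevant item, or 0.0 if it is absent."""
    for position, item in enumerate(results, start=1):
        if item == relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    search: Callable[[str], List[str]], labeled: Dict[str, str]
) -> float:
    """Average the reciprocal rank over every labeled query."""
    if not labeled:
        return 0.0
    total = sum(reciprocal_rank(search(q), rel) for q, rel in labeled.items())
    return total / len(labeled)


if __name__ == "__main__":
    print("baseline MRR: ", mean_reciprocal_rank(baseline_search, LABELED_QUERIES))
    print("candidate MRR:", mean_reciprocal_rank(candidate_search, LABELED_QUERIES))
```

Scoring both systems on the same labeled queries keeps the comparison fair. In practice the labeled set would come from real query logs or click data rather than hand-written examples.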
Observability and debugging
- Used production metrics and error behavior to prioritize high-impact infrastructure fixes.
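Prioritizing fixes from metrics and error behavior can be sketched as a simple scoring pass. This is an illustrative assumption, not the process used in this project: the issue names, the numbers, and the equal weighting of error volume and cost are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Issue:
    """One candidate infrastructure fix with its observed impact (hypothetical fields)."""
    name: str
    weekly_errors: int
    monthly_cost_usd: float


# Hypothetical figures for illustration only; not real production data.
ISSUES: List[Issue] = [
    Issue("idle-node-pool", weekly_errors=0, monthly_cost_usd=400.0),
    Issue("slow-docker-builds", weekly_errors=5, monthly_cost_usd=1000.0),
    Issue("spot-interruptions", weekly_errors=40, monthly_cost_usd=200.0),
]


def prioritize(issues: List[Issue], error_weight: float = 0.5) -> List[Issue]:
    """Rank issues by a weighted mix of normalized error volume and cost.

    Each signal is scaled by the largest value in the list so that errors
    and dollars are comparable; error_weight sets the balance between them.
    """
    if not issues:
        return []
    # Fall back to 1 when every value is zero to avoid dividing by zero.
    max_errors = max(i.weekly_errors for i in issues) or 1
    max_cost = max(i.monthly_cost_usd for i in issues) or 1.0

    def score(issue: Issue) -> float:
        error_part = issue.weekly_errors / max_errors
        cost_part = issue.monthly_cost_usd / max_cost
        return error_weight * error_part + (1.0 - error_weight) * cost_part

    return sorted(issues, key=score, reverse=True)


if __name__ == "__main__":
    for issue in prioritize(ISSUES):
        print(issue.name)
```

The weighting is a judgment call. Raising error_weight favors reliability fixes, and lowering it favors cost reductions.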
Reflection
Senior ML ownership often means cleaning up cost, reliability, and product feedback loops, not only training models.
This case study uses sanitized architecture and representative examples. It excludes confidential prompts, customer data, proprietary datasets, private implementation details, and internal traces.