Running ML Inference Services in Shared Hosting Environments



I had the pleasure of presenting ‘Running ML Inference Services in Shared Hosting Environments’ at the MLOps: Machine Learning in Production Bay Area Virtual Conference. The presentation was based on the six years of experience the Nextdoor CoreML team has gained productionizing and operating 30+ real-time ML microservices.



Abstract

Running an ML inference layer in a shared hosting environment (ECS, Kubernetes, etc.) comes with a number of non-obvious pitfalls that significantly impact latency and throughput. In this talk, we describe how Nextdoor’s ML team ran into these issues, traced them to their sources, and fixed them, ultimately cutting latency by a factor of 4, increasing throughput 3x, and improving resource utilization (CPU 10% -> 50%) while maintaining performance. The main points of concern are request queue management and OpenMP parameter tuning.
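To make the OpenMP tuning point concrete, here is a minimal sketch (not the talk's exact configuration): in a shared environment the OpenMP runtime defaults to one thread per host core, which oversubscribes a container that is only allotted a fraction of those cores. One common mitigation is to cap the thread count at the container's CPU quota. The cgroup paths below assume cgroup v1 and are illustrative only.

```python
import os
import multiprocessing

def container_cpu_limit() -> int:
    # Derive the container's CPU allotment from its cgroup v1 CFS quota;
    # fall back to the host core count if no quota is set (e.g. on bare metal).
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0:
            return max(1, quota // period)
    except OSError:
        pass
    return multiprocessing.cpu_count()

# Cap OpenMP (and BLAS) thread pools at the container's CPU share rather than
# the host's core count. These must be set before importing numpy/torch/etc.
limit = container_cpu_limit()
os.environ["OMP_NUM_THREADS"] = str(limit)
os.environ["OPENBLAS_NUM_THREADS"] = str(limit)
```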


What You’ll Learn

  1. Why your load balancing algorithm matters
  2. The importance of request queue timeouts for service recovery (see the sketch after this list)
  3. What resources are actually being shared in a shared hosting environment
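
As a hedged illustration of the second point above (a sketch with a hypothetical asyncio queue, not the talk's actual implementation): shedding requests that have already waited longer than the caller's timeout lets an overloaded service drain its backlog and recover, instead of burning CPU on answers nobody is still waiting for.

```python
import asyncio
import time

QUEUE_TIMEOUT_S = 0.2  # assumed budget: drop work the caller has likely abandoned
request_queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # bounded queue sheds load early

async def worker(run_inference):
    # Single consumer pulling queued requests; stale entries are rejected so the
    # backlog cannot grow without bound after a latency spike.
    while True:
        enqueued_at, payload, reply = await request_queue.get()
        if time.monotonic() - enqueued_at > QUEUE_TIMEOUT_S:
            reply.set_exception(TimeoutError("request expired in queue"))
            continue
        reply.set_result(await run_inference(payload))

async def handle(payload):
    # Producer side: enqueue with a timestamp and fail fast if the queue is full.
    reply = asyncio.get_running_loop().create_future()
    try:
        request_queue.put_nowait((time.monotonic(), payload, reply))
    except asyncio.QueueFull:
        raise TimeoutError("service overloaded")
    return await reply
```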