Running ML Inference Services in Shared Hosting Environments



I had the pleasure of presenting ‘Running ML Inference Services in Shared Hosting Environments’ at the MLOps: Machine Learning in Production Bay Area Virtual Conference. The presentation was based on the six years of experience the Nextdoor CoreML team has gained productionizing and operating 30+ real-time ML microservices.



Abstract

Running an ML inference layer in a shared hosting environment (ECS, Kubernetes, etc.) comes with a number of non-obvious pitfalls that significantly impact latency and throughput. In this talk, we describe how Nextdoor’s ML team ran into these issues, traced them to their sources, and fixed them, ultimately cutting latency by a factor of 4, increasing throughput 3x, and improving resource utilization (CPU 10% -> 50%) while maintaining performance. The main points of concern are request queue management and OpenMP parameter tuning.
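To make the OpenMP tuning point concrete, here is a minimal sketch (not the talk's exact configuration): in a shared environment the OpenMP runtime defaults to one thread per host core, which oversubscribes a container that is only allotted a fraction of those cores. One common mitigation is to cap the thread count at the container's CPU quota. The cgroup paths below assume cgroup v1 and are illustrative only.

```python
import os
import multiprocessing

def container_cpu_limit() -> int:
    # Derive the container's CPU allotment from its cgroup v1 CFS quota;
    # fall back to the host core count if no quota is set (e.g. on bare metal).
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0:
            return max(1, quota // period)
    except OSError:
        pass
    return multiprocessing.cpu_count()

# Cap OpenMP (and BLAS) thread pools at the container's CPU share rather than
# the host's core count. These must be set before importing numpy/torch/etc.
limit = container_cpu_limit()
os.environ["OMP_NUM_THREADS"] = str(limit)
os.environ["OPENBLAS_NUM_THREADS"] = str(limit)
```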


What You’ll Learn

  1. Why your load balancing algorithm matters
  2. The importance of request queue timeouts for service recovery (see the sketch after this list)
  3. What resources are actually being shared in a shared hosting environment
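
As a hedged illustration of the second point above (a sketch with a hypothetical asyncio queue, not the talk's actual implementation): shedding requests that have already waited longer than the caller's timeout lets an overloaded service drain its backlog and recover, instead of burning CPU on answers nobody is still waiting for.

```python
import asyncio
import time

QUEUE_TIMEOUT_S = 0.2  # assumed budget: drop work the caller has likely abandoned
request_queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # bounded queue sheds load early

async def worker(run_inference):
    # Single consumer pulling queued requests; stale entries are rejected so the
    # backlog cannot grow without bound after a latency spike.
    while True:
        enqueued_at, payload, reply = await request_queue.get()
        if time.monotonic() - enqueued_at > QUEUE_TIMEOUT_S:
            reply.set_exception(TimeoutError("request expired in queue"))
            continue
        reply.set_result(await run_inference(payload))

async def handle(payload):
    # Producer side: enqueue with a timestamp and fail fast if the queue is full.
    reply = asyncio.get_running_loop().create_future()
    try:
        request_queue.put_nowait((time.monotonic(), payload, reply))
    except asyncio.QueueFull:
        raise TimeoutError("service overloaded")
    return await reply
```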