SMOOTH: Scalable Multitask Offloading with Backbone Sharing
Wei Geng, Xiang Su, Nitinder Mohan, Jörg Ott, Pan Hui
2026 IFIP Networking Conference (IFIP Networking), Lugano, Switzerland, May 2026
Abstract
Intelligent mobile applications are often constrained by limited on-device hardware and by the latency and bandwidth overhead of cloud-based offloading. Offloading computation-intensive deep learning (DL) tasks to edge servers can mitigate these challenges. However, existing systems struggle to scale as the numbers of tasks and users grow. This limitation stems from a fundamental conflict: Requests in edge settings are typically sparse, heterogeneous, and demand immediate processing to minimize latency. However, GPUs operate most efficiently on dense, homogeneous batches, a condition that edge traffic rarely provides. Most existing approaches, including cloud-oriented solutions like Multi-Instance GPU (MIG), fail to resolve this tension because edge devices typically lack the advanced features available in high-end cloud GPUs. This paper contributes SMOOTH, a scalable multitask offloading system that introduces a Sparsity-to-Density Abstraction to resolve this conflict. By decomposing deep learning models into a shared backbone and lightweight task-specific heads, SMOOTH transforms sparse, heterogeneous request streams into dense, homogeneous computation blocks amenable to efficient batching. This allows highly efficient cross-task batching without prohibitive accumulation delays. To further optimize responsiveness, SMOOTH employs a compute-lightweight (O(1)), sparsity-aware adaptive scheduler that dynamically balances inference throughput and end-to-end (E2E) latency based on queue dynamics. Our evaluations demonstrate up to 2.21x higher throughput, 45% lower memory usage, and up to 82% reduced latency across diverse arrival patterns compared to baselines.