Performance Isolation in Cloud Data Center Networks

Tuesday, February 18, 2020
2:00 PM to 3:00 PM
EER 3.646
Free and open to the public

Despite decades of work, it is still possible today for co-located applications to hurt each other's network performance. In this talk, I discuss the underlying difficulties in enforcing performance isolation at line rate across the different levels of the network. Some of these difficulties are in software. Scheduling strategies in the operating system were designed for a uniprocessor system with complete global knowledge and ample processing power, whereas today’s systems are highly parallel and must run at rates that allow only thousands of CPU cycles per packet, which makes multi-core coordination a challenge. Other difficulties are in hardware: the simple scheduling approaches used in NICs, inefficient interfaces between the operating system and the NIC, and limited state in network switches.
Based on these difficulties, I describe three approaches to improving performance isolation at all levels of the network, from the host operating system down to the switches. First, Titan, an extension to the Linux networking stack, is a new packet scheduler for Linux that systematically addresses unfairness, that is, the differences in performance that flows receive. Titan can reduce unfairness by 58% compared with the best-performing Linux configuration. Next, Loom is a new NIC design that moves all per-flow scheduling decisions out of the OS and into the NIC. Loom can enforce complex hierarchical network performance isolation policies while also enabling applications to fully utilize network bandwidth with low CPU overheads. Finally, Nimble is a new switch scheduler that uses commodity programmable switches to provide scalable in-network performance isolation. Because Nimble uses counters to approximate the behavior of a switch with a large number of virtual queues, it can scale to tens of thousands of rate limiters. Nimble also enables MT-DCTCP, a new multi-tenant congestion control algorithm.
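To illustrate why counters can stand in for queues, the following Python sketch models a generic "virtual queue" rate limiter: a byte counter that drains at a configured rate and decides, per packet, whether to forward, ECN-mark (in the style of DCTCP's queue-depth threshold), or drop. This is not Nimble's actual design, which the abstract does not describe; the class name, parameters, and thresholds are hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class VirtualQueue:
    """Hypothetical per-tenant virtual queue (illustrative, not Nimble's design).

    The only state is a byte counter and a timestamp; no packets are buffered.
    """
    rate_bytes_per_s: float  # assumed drain rate enforced for this tenant
    limit_bytes: int         # assumed virtual depth above which packets are dropped
    mark_bytes: int          # assumed virtual depth above which packets are ECN-marked
    backlog: float = 0.0     # bytes currently "queued" in the virtual queue
    last_ns: int = 0         # time of the previous update, in nanoseconds

    def on_packet(self, now_ns: int, size: int) -> str:
        # Drain the counter by the bytes the configured rate could have sent
        # since the last packet, never going below empty.
        elapsed_s = (now_ns - self.last_ns) / 1e9
        self.backlog = max(0.0, self.backlog - self.rate_bytes_per_s * elapsed_s)
        self.last_ns = now_ns
        # Drop if admitting this packet would overflow the virtual queue.
        if self.backlog + size > self.limit_bytes:
            return "drop"
        self.backlog += size
        # Mark once the virtual backlog exceeds the marking threshold.
        return "mark" if self.backlog > self.mark_bytes else "forward"


# Example: one limiter per tenant, keyed by a hypothetical tenant ID.
limiters = {tenant: VirtualQueue(1000.0, 3000, 2000) for tenant in range(3)}
print(limiters[0].on_packet(0, 1500))  # forward
print(limiters[0].on_packet(0, 1500))  # mark
print(limiters[0].on_packet(0, 1500))  # drop
```

Because each limiter in this sketch is just a counter and a timestamp, keeping tens of thousands of them requires far less per-tenant state than provisioning tens of thousands of physical queues, which is the intuition behind using counters to emulate many virtual queues.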


Brent Stephens

University of Illinois at Chicago

Brent Stephens is an Assistant Professor at the University of Illinois at Chicago. Before that, he was a postdoc at the University of Wisconsin-Madison working with Professors Aditya Akella and Michael Swift. After growing up in Portland, OR, he attended Rice University, where he worked with Professors Alan L. Cox and Scott Rixner as he completed his PhD and MS in Computer Science after receiving a BS in Electrical Engineering. Throughout his career, Brent has won notable competitive awards, including the NSF CAREER Award, the Google Faculty Research Award, the NSF CRII Award, and an IBM Ph.D. Fellowship.