Catch M(oor)e If You Can: Agile Hardware/Software Co-Design for Hyperscale Cloud Systems

ECE Seminar

Location: GDC 2.216 & Zoom
Sagar Karandikar
UC Berkeley

Transformative technologies like generative AI, machine learning, and big-data analytics are driving exponential growth in demand for hyperscale cloud compute infrastructure, while the breakdown of classical hardware scaling (e.g., Moore's Law) is hampering growth in compute supply. Building domain-specific hardware can address this supply-demand gap. However, keeping pace with exponential demand requires developing new hardware rapidly, with confidence that performance and efficiency gains will compound in the context of a complete system. Both are challenging given the status quo in hardware design, even before accounting for the immense scale of cloud systems.

This talk will focus on two themes of my work: (1) developing radically new agile hardware/software co-design tools that enable rapidly building and evaluating complete specialized hardware/software systems, challenging the status quo in hardware design, and (2) leveraging these tools to architect and implement state-of-the-art domain-specific hardware that addresses key efficiency challenges in hyperscale cloud systems.

I will first cover my work creating FireSim, the award-winning and widely used FPGA-accelerated hardware simulation platform, which enables unprecedented hardware/software co-design capabilities. FireSim automatically constructs high-performance, cycle-exact, scale-out simulations of novel hardware designs (e.g., a 1024-node Ethernet-interconnected hyperscale cluster with specialized server designs) directly from the tapeout-friendly RTL code that describes them. Because it simulates novel hardware fast enough to run the designs' actual software stacks, FireSim lets domain experts directly influence hardware designs in hours rather than years. I will then cover my work co-creating Chipyard, another widely used platform, for agile development of specialized RISC-V System-on-Chip (SoC) designs. Using a novel, RTL-generator-driven approach, Chipyard enables composing a large collection of highly parameterized and customizable hardware IP into massive specialized SoCs. These SoCs can then be pushed through various integrated flows for simulation (including FireSim), tapeout, and more.

Next, I will discuss my work in collaboration with Google on Hyperscale SoC, a cloud-optimized server chip built, evaluated, and taped out with FireSim and Chipyard. Hyperscale SoC includes my work on several novel domain-specific accelerators (DSAs) for expensive but foundational operations in hyperscale servers, including (de)serialization, (de)compression, and more. Hyperscale SoC demonstrates a new paradigm of data-driven, end-to-end hardware/software co-design, combining profiling of Google's worldwide datacenter fleet with the ability to rapidly build and evaluate novel hardware designs in FireSim and Chipyard. This instance of Hyperscale SoC is just the beginning. I will conclude by covering the wide-ranging opportunities that can now be explored for radically redesigning next-generation hyperscale cloud datacenters.

Sagar Karandikar is a Ph.D. Candidate at UC Berkeley and a Student Researcher at Google. His work broadly focuses on co-designing hardware and software to build next-generation hyperscale cloud systems. He is also interested in agile, open-source hardware development methodologies.

His first-author publications have received several honors, including being selected for the ISCA@50 25-year Retrospective, as an IEEE Micro Top Pick, as an IEEE Micro Top Pick Honorable Mention, and as the MICRO '21 Distinguished Artifact Award winner.

He created and leads the FireSim project, which has served as a foundational research platform in over 50 peer-reviewed publications by first authors at over 20 institutions. FireSim has also been used in the development of commercially available chips and as a standard host platform for DARPA and IARPA programs. He is a co-creator and co-lead of the widely used Chipyard RISC-V System-on-Chip (SoC) development platform. His work on Hyperscale SoC has been influential at Google and across other silicon vendors. He was selected as a 2022 DARPA Riser and received the UC Berkeley Outstanding Graduate Student Instructor (TA) Award. He received his M.S. and B.S. from UC Berkeley.