I am a staff software engineer at Google, developing distributed software systems for
data-center and WAN networks, as a member of
the NetInfra team.
Before that, I was a graduate student at Stanford
University, where I received a Ph.D. degree in Electrical Engineering, and a
Ph.D. minor degree in Computer Science, in 2017. My interests include cloud computing,
distributed systems, and networking systems. At Stanford, I worked with
Professor Philip Levis as a member of Stanford Information Networks Group
(SING). I received my M.S.
degree from Stanford University in 2013, and my B.S. degree from Sharif University
of Technology in 2011, both in Electrical Engineering.
Projects
Main Research Projects
Nimbus
Nimbus is a general-purpose cloud computing system specifically
designed for computations with short tasks. Nimbus is developed in
C++, and its API offers a data model similar to
Spark and
Naiad.
The key difference between Nimbus and its counterparts is the novel
control plane abstraction called Execution
Templates. For jobs with
short tasks, the runtime overhead becomes comparable to the tasks
themselves. For example, for an application with 4ms tasks, Spark's
runtime overhead is about 98%. Execution templates enable Nimbus to
schedule and drive jobs with tasks as short as 100us across a large
number of workers with negligible runtime overhead. For more
information, visit the Nimbus home page.
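As a rough illustration of what a task-based data model in this spirit looks like, here is a minimal C++ sketch; the class and method names are assumptions for illustration, not the actual Nimbus API.

```cpp
// A minimal, hypothetical sketch of a task-based job description in the
// spirit of Nimbus's data model; the names below are illustrative
// assumptions, not the actual Nimbus API.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Task {
  std::string name;
  std::vector<int> depends_on;   // indices of prerequisite tasks
  std::function<void()> body;    // a short computation, e.g. ~100us of work
};

class JobGraph {
 public:
  int AddTask(std::string name, std::vector<int> deps,
              std::function<void()> body) {
    tasks_.push_back({std::move(name), std::move(deps), std::move(body)});
    return static_cast<int>(tasks_.size()) - 1;
  }
  // A real framework hands this graph to a controller that schedules tasks
  // onto workers; here we simply run them locally in dependency order.
  void RunLocally() {
    for (const Task& t : tasks_) t.body();
  }
 private:
  std::vector<Task> tasks_;
};

int main() {
  JobGraph job;
  int a = job.AddTask("gradient", {}, [] { std::puts("compute gradient"); });
  job.AddTask("update", {a}, [] { std::puts("apply update"); });
  job.RunLocally();
}
```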
Janus
In this project, we investigate novel techniques for designing a MAC
protocol for full-duplex radios. This project follows SING's
earlier paper on single-channel full-duplex wireless networks (please
see here), which explores a feasible
physical-layer implementation for efficient simultaneous
transmission and reception. We consider the fundamental
characteristics of a full-duplex network and design a MAC
protocol that takes full advantage of the freedom
provided by the physical layer. Specifically, the goal is to improve
throughput by up to two times over ordinary half-duplex
systems while remaining fair to all users. You can find our
paper here.
Older Research Projects
Overloaded Codes
As my undergraduate research thesis, I worked on developing
overloaded codes with efficient decoders for CDMA systems. These
codes not only have a very simple decoder, but also achieve
bit-error-rate (BER) performance almost as good as that of the
maximum-likelihood (ML) decoder. The novel coding matrices are
called logical signature matrices (LSM), which benefit from
comparison-based decoders. In addition to the published
paper,
this work also received a US patent.
Selected Short Projects
x86 Runtime Prediction
Predicting x86 program runtime for Intel processors. As part of
the machine learning course at Stanford (CS229), we
developed a series of supervised learning methods for predicting
the runtime length of serialized x86 programs for Intel
processors. Our generative model predicts runtime for programs
with arbitrary numbers of instructions with 12% error.
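As a toy illustration of the general idea (predicting runtime from per-instruction-class counts and learned per-class costs), here is a small C++ sketch; the instruction classes and weights are made up for illustration, and the actual project used a more sophisticated generative model.

```cpp
// Toy sketch: runtime predicted as a weighted sum of per-instruction-class
// costs. Classes, counts, and costs below are hypothetical.
#include <cstdio>
#include <map>
#include <string>

int main() {
  // Hypothetical learned cost (in cycles) per instruction class.
  std::map<std::string, double> cycles_per_class = {
      {"alu", 1.0}, {"load", 4.0}, {"store", 3.0}, {"branch", 1.5}};
  // Instruction-class counts extracted from a serialized x86 program.
  std::map<std::string, int> counts = {
      {"alu", 1200}, {"load", 300}, {"store", 150}, {"branch", 200}};
  double predicted_cycles = 0.0;
  for (const auto& [cls, n] : counts)
    predicted_cycles += n * cycles_per_class[cls];
  std::printf("predicted runtime: %.0f cycles\n", predicted_cycles);
}
```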
Packet Classification
Novel and optimized methods for packet classification when both
rules and keys contain wildcard expressions. This work introduces
a software-based algorithm that is memory efficient, does not
require expanding rules and keys containing wildcards, is
scalable, and provides multi-match results.
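The sketch below shows one reasonable ternary-matching semantics when both rules and lookup keys carry wildcard bits (a rule matches a key if they agree on every bit both sides care about); it is only an illustration, not necessarily the exact definition used in this work.

```cpp
// Ternary matching with wildcards on both rules and keys (value/care-mask
// pairs). The matching semantics shown here is an illustrative assumption.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Ternary {
  uint32_t value;
  uint32_t care;  // 1 = bit must match, 0 = wildcard
};

bool Matches(const Ternary& rule, const Ternary& key) {
  uint32_t both_care = rule.care & key.care;
  return ((rule.value ^ key.value) & both_care) == 0;
}

int main() {
  std::vector<Ternary> rules = {
      {0x0A000000, 0xFF000000},   // 10.0.0.0/8
      {0x0A010000, 0xFFFF0000}};  // 10.1.0.0/16
  Ternary key = {0x0A0100FF, 0xFFFFFF00};  // 10.1.0.* (last octet wildcarded)
  for (size_t i = 0; i < rules.size(); ++i)
    if (Matches(rules[i], key)) std::printf("multi-match: rule %zu\n", i);
}
```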
DCell with OpenFlow
Simulating DCell topology and routing for data centers using
Mininet and an OpenFlow controller. Each year, the advanced networking
course at Stanford (CS244) publishes a blog of student projects
reproducing results reported in the literature. You can find our report here.
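For reference, the level-1 DCell wiring rule (as I recall it from the DCell paper) links server (i, j-1) to server (j, i) for every pair of sub-cells i &lt; j; the C++ sketch below only enumerates those inter-cell links, whereas the project itself built the topology in Mininet with an OpenFlow controller.

```cpp
// Sketch of the level-1 DCell wiring rule: n+1 DCell_0 sub-cells of n
// servers each, with server (i, j-1) linked to server (j, i) for i < j.
// This only enumerates the inter-cell links for illustration.
#include <cstdio>

int main() {
  const int n = 4;  // servers per DCell_0 (plus one switch per cell)
  for (int i = 0; i <= n; ++i)
    for (int j = i + 1; j <= n; ++j)
      std::printf("link: cell %d server %d  <->  cell %d server %d\n",
                  i, j - 1, j, i);
}
```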
Cognitive Radio Devices
This work is a literature survey on cognitive radio devices as a
part of the wireless networking course at Stanford (EE359).
Publications
Dissertation
Today, cloud computing frameworks adopt one of two strategies to
schedule their computations over hundreds to thousands of machines. In
the centralized strategy, a controller dispatches computation units,
called tasks, to worker nodes. Centralization allows a framework to
quickly reschedule, respond to faults, and mitigate stragglers.
However, a centralized controller can only schedule a few thousand
tasks per second and becomes a bottleneck. In the distributed
strategy, there is no central node. These systems install data flow
graphs on each node, which then independently execute and exchange
data. By distributing the control plane and turning it into data flow,
these frameworks can execute hundreds of thousands of tasks per
second, and do not have a control plane bottleneck. However, data flow
graphs describe a static schedule; even small changes, such as
migrating a task between two nodes, require stopping all nodes,
recompiling the flow graph, and reinstalling it on every node. This
leads to high latency or wasteful resource utilization for
rescheduling, fault recovery, or straggler mitigation.
This dissertation presents a new, third strategy, called execution
templates. Execution templates leverage a program’s repetitive
control flow to cache control plane decisions in templates. A template
is a parametrisable block of tasks: it caches some information (e.g.,
task dependencies) but instantiation requires some parameters (e.g.,
task identifiers). Executing the cached tasks in a template requires
sending a single message that loads the new parameters. Large-scale
scheduling changes install new templates, while small changes apply
edits to existing templates. Execution templates are not bound to a
static control flow and efficiently capture nested loops and data
dependent branches. Evaluation of execution templates in Nimbus, a
cloud computing framework, shows that they provide the fine-grained
scheduling flexibility of centralized control planes while matching
the performance of the distributed ones. Execution templates in Nimbus
support not only traditional data analytics, but also complex,
scientific applications such as hybrid graphical simulations.
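The sketch below illustrates the core of this idea in a few lines of C++: a template caches the task block's structure, instantiation supplies only fresh parameters such as task identifiers, and small scheduling changes edit the cached block. It is an illustrative data structure, not Nimbus's actual implementation.

```cpp
// Minimal sketch of an execution template: cached task structure plus a
// lightweight instantiation message. Names are illustrative assumptions.
#include <cstdio>
#include <string>
#include <vector>

struct CachedTask {
  std::string op;          // cached: what to run
  std::vector<int> deps;   // cached: indices of prerequisite tasks in block
};

struct Template {
  std::vector<CachedTask> tasks;

  // Instantiation message: only the per-iteration parameters (here, fresh
  // task identifiers), one per cached task.
  void Instantiate(const std::vector<long>& task_ids) const {
    for (size_t i = 0; i < tasks.size(); ++i)
      std::printf("spawn task %ld (%s), %zu deps\n",
                  task_ids[i], tasks[i].op.c_str(), tasks[i].deps.size());
  }

  // Small scheduling changes edit the cached block instead of rebuilding it.
  void EditAppend(CachedTask t) { tasks.push_back(std::move(t)); }
};

int main() {
  Template loop_body;
  loop_body.tasks = {{"gradient", {}}, {"reduce", {0}}, {"update", {1}}};
  loop_body.Instantiate({101, 102, 103});    // one message per loop iteration
  loop_body.EditAppend({"checkpoint", {2}}); // e.g., after a scheduling change
  loop_body.Instantiate({201, 202, 203, 204});
}
```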
Papers
- Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking
- Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang, Virginia Beauregard, Patrick Conner, Steve Gribble, Rishi Kapoor, Stephen Kratzer, Nanfang Li, Hong Liu, Karthik Nagaraj, Jason Ornstein, Samir Sawhney, Ryohei Urata, Lorenzo Vicisano, Kevin Yasumura, Shidong Zhang, Junlan Zhou, Amin Vahdat
- In Proceedings of ACM SIGCOMM 2022
- Abstract
We present a decade of evolution and production experience with
Jupiter datacenter network fabrics. In this period, Jupiter has
delivered 5x higher speed and capacity, a 30% reduction in capex, a 41%
reduction in power, and incremental deployment and technology refresh,
all while serving live production traffic. A key enabler for these
improvements is evolving Jupiter from a Clos to a direct-connect
topology among the machine aggregation blocks. Critical
architectural changes for this include: A datacenter
interconnection layer employing Micro-ElectroMechanical Systems
(MEMS) based Optical Circuit Switches (OCSes) to enable dynamic
topology reconfiguration, centralized Software-Defined Networking
(SDN) control for traffic engineering, and automated network
operations for incremental capacity delivery and topology
engineering. We show that the combination of traffic and topology
engineering on direct-connect fabrics achieves similar throughput
as Clos fabrics for our production traffic patterns. We also
optimize for path lengths: 60% of the traffic takes a direct path
from source to destination aggregation blocks, while the remainder
transits one additional block, achieving an average block-level path
length of 1.4 in our fleet today. OCS also achieves 3x faster
fabric reconfiguration compared to pre-evolution Clos fabrics that
used a patch-panel-based interconnect.
Distributing a simulation across many machines can drastically speed
up computations and increase detail. The computing cloud provides
tremendous computing resources, but weak service guarantees
force programs to manage significant system complexity: nodes,
networks, and storage occasionally perform poorly or fail.
We describe Nimbus, a system that automatically distributes
grid-based and hybrid simulations across cloud computing
nodes. The main simulation loop is sequential code
and launches
distributed computations across many cores. The simulation on each
core runs as if it is stand-alone: Nimbus automatically
stitches these simulations into a single, larger
one. To do this efficiently, Nimbus introduces a four-layer
data model that translates between the contiguous, geometric objects
used by simulation libraries and the replicated, fine-grained objects
managed by its underlying cloud computing runtime.
Using PhysBAM particle level set fluid simulations, we
demonstrate that Nimbus can run higher detail simulations faster,
distribute simulations on up to 512 cores, and run enormous
simulations (1024³ cells). Nimbus automatically manages
these distributed simulations, balancing load across nodes and recovering
from failures. Implementations of PhysBAM water and smoke simulations
as well as an open source heat-diffusion simulation show that Nimbus is
general and can support complex simulations.
Nimbus can be downloaded from https://nimbus.stanford.edu.
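As a rough illustration of the translation between contiguous geometric objects and fine-grained runtime objects, the sketch below splits a simulation region into uniform chunks that a runtime could place and migrate independently; the chunking scheme is an assumption for illustration, not the actual four-layer data model.

```cpp
// Sketch: split a contiguous 2D simulation region into fine-grained chunks
// that a cloud runtime could manage independently. Chunk size is arbitrary.
#include <cstdio>

struct Region { int x0, y0, x1, y1; };  // half-open 2D cell range

int main() {
  const int kChunk = 16;          // fine-grained chunk edge length (cells)
  Region sim = {0, 0, 64, 48};    // region owned by one simulation core
  for (int y = sim.y0; y < sim.y1; y += kChunk)
    for (int x = sim.x0; x < sim.x1; x += kChunk)
      std::printf("chunk (%d,%d)-(%d,%d)\n", x, y, x + kChunk, y + kChunk);
}
```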
Control planes of cloud frameworks trade off between scheduling
granularity and performance. Centralized systems schedule at task
granularity, but only schedule a few thousand tasks per second.
Distributed systems schedule hundreds of thousands of tasks per
second but changing the schedule is costly.
We present execution templates, a control plane abstraction that can
schedule hundreds of thousands of tasks per second while supporting
fine-grained, per-task scheduling decisions. Execution templates
leverage a program’s repetitive control flow to cache blocks of
frequently-executed tasks. Executing a task in a template requires
sending a single message. Large-scale scheduling changes install new
templates, while small changes apply edits to existing templates.
Evaluations of execution templates in Nimbus, a data analytics
framework, find that they provide the fine-grained scheduling
flexibility of centralized control planes while matching the strong
scaling of distributed ones. Execution templates support complex,
real-world applications, such as a fluid simulation with a triply
nested loop and data dependent branches.
Existing cloud computing control planes do not scale to more than
a few hundred cores, while frameworks without control planes
scale but take seconds to reschedule a job. We propose an asynchronous
control plane for cloud computing systems, in which
a central controller can dynamically reschedule jobs but worker
nodes never block on communication with the controller. By decoupling
control plane traffic from program control flow in this way,
an asynchronous control plane can scale to run millions of computations
per second while being able to reschedule computations
within milliseconds.
We show that an asynchronous control plane can match the
scalability and performance of TensorFlow and MPI-based programs
while rescheduling individual tasks in milliseconds. Scheduling
an individual task takes 1μs, such that a 1,152-core cluster can
schedule over 120 million tasks/second and this scales linearly with
the number of cores. The ability to schedule huge numbers of tasks
allows jobs to be divided into very large numbers of tiny tasks,
whose improved load balancing can speed up computations 2.1-
2.3×.
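The sketch below illustrates the decoupling in a few lines of C++: the worker only polls its control channel and never blocks waiting on the controller. The queue and message types are illustrative assumptions, not the system's actual interfaces.

```cpp
// Sketch of an asynchronous control plane at the worker: control messages
// are polled, never awaited, so task execution never blocks on the
// controller. All names here are hypothetical.
#include <cstdio>
#include <deque>
#include <mutex>
#include <optional>
#include <string>

class ControlChannel {
 public:
  void Push(std::string msg) {
    std::lock_guard<std::mutex> lock(mu_);
    inbox_.push_back(std::move(msg));
  }
  // Non-blocking: returns immediately whether or not a message is available.
  std::optional<std::string> TryPop() {
    std::lock_guard<std::mutex> lock(mu_);
    if (inbox_.empty()) return std::nullopt;
    std::string m = std::move(inbox_.front());
    inbox_.pop_front();
    return m;
  }
 private:
  std::mutex mu_;
  std::deque<std::string> inbox_;
};

int main() {
  ControlChannel control;
  control.Push("migrate partition 7 to worker 3");
  std::deque<std::string> local_tasks = {"task A", "task B", "task C"};
  while (!local_tasks.empty()) {
    // Apply any pending scheduling updates without ever waiting for them.
    while (auto msg = control.TryPop())
      std::printf("control: %s\n", msg->c_str());
    std::printf("run %s\n", local_tasks.front().c_str());
    local_tasks.pop_front();
  }
}
```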
This paper presents Janus, a novel MAC protocol for
full-duplex wireless networks. Unlike other full-duplex
MACs, Janus allows partially interfering nodes to cooperate
by finding the appropriate transmission rates based on
interference levels, making better use of the channel. Computing
the optimal schedule and transmission rates is NP-complete,
so Janus uses a cheaper heuristic approach. Janus
also ensures that channel access time is shared fairly between
all nodes. Janus has lower per-packet overhead compared
to CSMA/CA because it eliminates random back-off
and lets nodes transmit multiple packets with a single set of
control packets. We show that for a setup with one access
point and three nodes, Janus achieves 2.5X the throughput
of a half-duplex system based on CSMA/CA.
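Because the optimal schedule is NP-complete, a scheduler has to approximate. The greedy pairing below is a hypothetical illustration of that idea (pick the pair of concurrent transmissions with the best combined estimated rate), not the heuristic published in the paper; rates use a simple log2(1 + SINR) estimate.

```cpp
// Hypothetical greedy sketch: choose the full-duplex pair with the highest
// combined estimated rate under interference. Not the Janus heuristic.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // sinr[a][b]: estimated SINR at node b while node a transmits concurrently.
  std::vector<std::vector<double>> sinr = {
      {0, 12.0, 3.0}, {9.0, 0, 5.0}, {2.5, 6.0, 0}};
  int n = static_cast<int>(sinr.size());
  double best = 0; int bi = -1, bj = -1;
  for (int i = 0; i < n; ++i)
    for (int j = i + 1; j < n; ++j) {
      // Combined rate if i->j and j->i transmit at the same time.
      double rate = std::log2(1 + sinr[i][j]) + std::log2(1 + sinr[j][i]);
      if (rate > best) { best = rate; bi = i; bj = j; }
    }
  std::printf("schedule nodes %d and %d concurrently (rate %.2f bits/s/Hz)\n",
              bi, bj, best);
}
```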
In this paper, we introduce a new class of signature matrices for
overloaded synchronous CDMA systems that have a very low complexity
decoder. While overloaded systems are more efficient from the
bandwidth point of view, the Maximum Likelihood (ML) implementation
for decoding is impractical even for moderate dimensions. Simulation
results show that the performance of the proposed decoder is very
close to that of the ML decoder. Indeed, the proposed decoding
scheme needs neither multiplication nor addition and requires only a
few comparisons. Furthermore, the computational complexity and the
probability of error vs. Signal to Noise Ratios (SNR) are derived
analytically.
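The toy example below shows what an overloaded system looks like (4 users sharing 3 chips) and why exhaustive maximum-likelihood decoding scales poorly; the signature matrix is an arbitrary toy, not one of the LSM matrices from the paper, and the brute-force search shown is exactly what the comparison-based decoder avoids.

```cpp
// Toy overloaded synchronous CDMA: a 3x4 signature matrix (more users than
// chips) with brute-force ML decoding over all 2^users bit patterns.
#include <cstdio>
#include <vector>

int main() {
  const int kChips = 3, kUsers = 4;
  int S[kChips][kUsers] = {{1, 1, 1, 1}, {1, -1, 1, -1}, {1, 1, -1, -1}};
  int x[kUsers] = {1, 1, -1, 1};   // transmitted bits (+1/-1)
  int y[kChips] = {0, 0, 0};       // received chip sums (noiseless)
  for (int c = 0; c < kChips; ++c)
    for (int u = 0; u < kUsers; ++u) y[c] += S[c][u] * x[u];

  // Brute-force ML: try every bit pattern; the cost explodes with users.
  int best_err = 1 << 30;
  std::vector<int> best(kUsers);
  for (int mask = 0; mask < (1 << kUsers); ++mask) {
    std::vector<int> cand(kUsers);
    for (int u = 0; u < kUsers; ++u) cand[u] = (mask >> u & 1) ? 1 : -1;
    int err = 0;
    for (int c = 0; c < kChips; ++c) {
      int yc = 0;
      for (int u = 0; u < kUsers; ++u) yc += S[c][u] * cand[u];
      err += (yc - y[c]) * (yc - y[c]);
    }
    if (err < best_err) { best_err = err; best = cand; }
  }
  for (int u = 0; u < kUsers; ++u) std::printf("user %d: %+d\n", u, best[u]);
}
```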
Presentations
Technical Reports
Graphical simulations are a cornerstone of modern media and
films. But existing software packages are designed to run on HPC
nodes, and perform poorly in the computing cloud. These
simulations have complex data access patterns over complex data
structures, and mutate data arbitrarily, and so are a poor fit for
existing cloud computing systems. We describe a software
architecture for running graphical simulations in the cloud that
decouples control logic, computations and data exchanges. This
allows a central controller to balance load by redistributing
computations, and recover from failures. Evaluations show that the
architecture can run existing, state-of-the-art simulations in the
presence of stragglers and failures, thereby enabling this large
class of applications to use the computing cloud for the first
time.
Large-scale cloud data analytics applications are often CPU
bound. Most of these cycles are wasted: benchmarks written in C++ run
10-51 times faster than frameworks such as Naiad and Spark. However,
calling faster implementations from those frameworks only sees
moderate (4-5x) speedups because their control planes cannot schedule
work fast enough.
This paper presents execution templates, a control plane abstraction
for CPU-bound cloud applications, such as machine learning. Execution
templates leverage highly repetitive control flow to cache scheduling
decisions as templates. Rather than reschedule hundreds of
thousands of tasks on every loop execution, nodes instantiate these
templates. A controller's template specifies the execution across all
worker nodes, which it partitions into per-worker templates. To ensure
that templates execute correctly, controllers dynamically patch
templates to match program control flow. We have implemented
execution templates in Nimbus, a C++ cloud computing framework.
Running in Nimbus, analytics benchmarks can run 16-43 times faster
than in Naiad and Spark. Nimbus's control plane can scale out to run
these faster benchmarks on up to 100 nodes (800 cores).
We present Canary, a scheduling architecture that allows
high performance analytics workloads to scale out to run
on thousands of cores. Canary is motivated by the observation
that a central scheduler is a bottleneck for high
performance codes: a handful of multicore workers can
execute tasks faster than a controller can schedule them.
The key insight in Canary is to reverse the responsibilities
between controllers and workers. Rather than
dispatch tasks to workers, which then fetch data as necessary,
in Canary the controller assigns data partitions to
workers, which then spawn and schedule tasks locally.
We evaluate three benchmark applications in Canary
on up to 64 servers and 1,152 cores on Amazon EC2.
Canary achieves 9-90× speedups over Spark and
up to 4× speedup over GraphX, a highly optimized graph
analytics engine. While current centralized schedulers
can schedule 2,500 tasks/second, each Canary worker can
schedule 136,000 tasks/second per core and experiments
show this scales out linearly, with 64 workers scheduling
over 120 million tasks per second, allowing Canary to
support optimized jobs running on thousands of cores.
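The sketch below captures the reversal in a few lines of C++: the controller only assigns partitions, and each worker spawns and runs the tasks for its own partitions locally; the names and structure are illustrative, not Canary's actual interfaces.

```cpp
// Sketch of the reversed controller/worker split: the controller assigns
// partitions, and workers generate tasks locally with no per-task dispatch.
#include <cstdio>
#include <vector>

struct Worker {
  int id;
  std::vector<int> partitions;  // assigned by the controller, rarely changed

  // The worker spawns its own tasks for each stage, one per partition,
  // with no per-task round trip to the controller.
  void RunStage(const char* stage) const {
    for (int p : partitions)
      std::printf("worker %d: run %s on partition %d\n", id, stage, p);
  }
};

int main() {
  // Controller's only job in the common case: partition -> worker assignment.
  std::vector<Worker> workers = {{0, {0, 1}}, {1, {2, 3}}};
  for (const char* stage : {"map", "reduce"})
    for (const Worker& w : workers) w.RunStage(stage);
}
```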
Resume
Here are the links to my single-page and extended resumes.