Omid Mashayekhi

I am a staff software engineer at Google, developing distributed software systems for data-center and WAN networks, as a member of NetInfra team. Before that, I was a graduate student at Stanford University, where I received a Ph.D. degree in Electrical Engineering, and a Ph.D. minor degree in Computer Science, in 2017. My interests include cloud computing, distributed systems, and networking systems. At Stanford, I worked with Professor Philip Levis as a member of Stanford Information Networks Group (SING). I received my M.S. degree from Stanford University in 2013, and B.S. degree from Sharif University of Technology in 2011, both in Electrical Engineering.

Nimbus

Nimbus is a general purpose cloud computing system specially designed for computations with short tasks. Nimbus is developed in C++ and the API offers a data model similar to Spark and Naiad. The key difference between Nimbus and its counterparts is the novel control plane abstraction called Execution Templates. For jobs with short tasks the runtime overhead becomes comparable to the task itself. For example for an application with 4ms tasks, Sparks runtime overhead is about 98%. Execution templates enable Nimbus to schedule and drive jobs with tasks as short as 100us over large number of workers at scale with negligible runtime overhead. For more information visit Nimbus home page.

Janus

In this project, we investigate novel techniques to design MAC protocol for the Full Duplex radio. This project follows SING’s earlier paper on single channel full duplex wireless network (please see here), which explores feasible implementation of the physical layer for efficient simultaneous transmission and reception. We consider the very fundamental characteristics of a full duplex network and try to design a MAC protocol which perfectly takes the advantages of the freedom provided by the physical layer. Specially, the goal is to improve the throughput as much as two times of the ordinary half duplex systems while remaining fair to all the users. You can find our paper here.

Overloaded Codes

As my undergraduate research thesis, I worked on developing overloaded codes with efficient decoders for CDMA systems. These codes not only have really simple decoder but also show the bit-error-rate (BER) performance almost as good as the maximum-likelihood (ML) decoder. The novel coding matrices are called logical signature matrices (LSM), which benefit from comparison-based decoders. In addition to the published paper, this work also received a US patent.

x86 Runtime Prediction

Predicting x86 program runtime for Intel processor. As a part of the machine learning course at Stanford (CS229), we developed a series of supervised learning methods for predicting the runtime length of serialized x86 programs for Intel processors. Our generative model predicts runtime for programs with arbitrary numbers of instructions with 12% error.

Packet Classification

Novel and optimized methods for packet classification when both rules and keys contain wildcard expressions. This work introduces a software based algorithm which is memory efficient, does not need expansion for rules and keys containing wildcards, is scalable, and provides multi-match results.

DCell with OpenFlow

Simulating DCell topology and routing for data centers using Mininet and OpenFlow controller. Each year, advanced networking course at Stanford (CS244) publishes a blog on reproducing the results reported in the literature by students. Yo can find our report here .

Cognitive Radio Devices

This work is a literature survey on cognitive radio devices as a part of wireless networking course at Stanford (EE359).

For the full list of publications, refer to my Google Scholar page.

Fast, Distributed Computations in the Cloud: Omid Mashayekhi; Philip Levis (primary advisor), Mendel Rosenblum (advisor), Alex Aiken (advisor); Stanford University Doctoral Dissertation, 2017
Abstract

Today, cloud computing frameworks adopt one of two strategies to schedule their computations over hundreds to thousands of machines. In the centralized strategy, a controller dispatches computation units, called tasks, to worker nodes. Centralization allows a framework to quickly reschedule, respond to faults, and mitigate stragglers. However, a centralized controller can only schedule a few thousand tasks per second and becomes a bottleneck. In the distributed strategy, there is no central node. These systems install data flow graphs on each node, which then independently execute and exchange data. By distributing the control plane and turning it into data flow, these frameworks can execute hundreds of thousands of tasks per second, and do not have a control plane bottleneck. However, data flow graphs describe a static schedule; even small changes, such as migrating a task between two nodes, requires stopping all nodes, recompiling the flow graph and reinstalling it on every node. This leads to high latency or wasteful resource utilization for rescheduling, fault recovery, or straggler mitigation. This dissertation presents a new, third strategy, called execution templates. Execution templates leverage a program’s repetitive control flow to cache control plane decisions in templates. A template is a parametrisable block of tasks: it caches some information (e.g., task dependencies) but instantiation requires some parameters (e.g., task identifiers). Executing the cached tasks in a template requires sending a single message that loads the new parameters. Large-scale scheduling changes install new templates, while small changes apply edits to existing templates. Execution templates are not bound to a static control flow and efficiently capture nested loops and data dependent branches. Evaluation of execution templates in Nimbus, a cloud computing framework, shows that they provide the fine-grained scheduling flexibility of centralized control planes while matching the performance of the distributed ones. Execution templates in Nimbus support not only the traditional data analytics, but also complex, scientific applications such as hybrid graphical simulations.

Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking: Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang, Virginia Beauregard, Patrick Conner, Steve Gribble, Rishi Kapoor, Stephen Kratzer, Nanfang Li, Hong Liu, Karthik Nagaraj, Jason Ornstein, Samir Sawhney, Ryohei Urata, Lorenzo Vicisano, Kevin Yasumura, Shidong Zhang, Junlan Zhou, Amin Vahdat; In Proceedings of ACM SIGCOMM 2022
Abstract

We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, 30% reduction in capex, 41% reduction in power, incremental deployment and technology refresh all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes for this include: A datacenter interconnection layer employing Micro-ElectroMechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves similar throughput as Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes direct path from source to destination aggregation blocks, while the remaining transits one additional block, achieving an average blocklevel path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch panel based interconnect.

Automatically Distributing Eulerian and Hybrid Fluid Simulations in the Cloud: Omid Mashayekhi, Chinmayee Shah, Hang Qu, Andrew Lim, Philip Levis; In ACM Transactions on Graphics (TOG), 37, 2, Article 24, June 2018, Presented at SIGGRAPH 2018
Abstract

Distributing a simulation across many machines can drastically speed up computations and increase detail. The computing cloud provides tremendous computing resources, but weak service guarantees force programs to manage significant system complexity: nodes, networks, and storage occasionally perform poorly or fail. We describe Nimbus, a system that automatically distributes grid-based and hybrid simulations across cloud computing nodes. The main simulation loop is sequential code and launches distributed computations across many cores. The simulation on each core runs as if it is stand-alone: Nimbus automatically stitches these simulations into a single, larger one. To do this efficiently, Nimbus introduces a four-layer data model that translates between the contiguous, geometric objects used by simulation libraries and the replicated, fine-grained objects managed by its underlying cloud computing runtime. Using PhysBAM particle level set fluid simulations, we demonstrate that Nimbus can run higher detail simulations faster, distribute simulations on up to 512 cores, and run enormous simulations (1024³ cells). Nimbus automatically manages these distributed simulations, balancing load across nodes and recovering from failures. Implementations of PhysBAM water and smoke simulations as well as an open source heat-diffusion simulation show that Nimbus is general and can support complex simulations. Nimbus can be downloaded from https://nimbus.stanford.edu.

Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics: Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis; In proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)
Abstract

Control planes of cloud frameworks trade off between scheduling granularity and performance. Centralized systems schedule at task granularity, but only schedule a few thousand tasks per second. Distributed systems schedule hundreds of thousands of tasks per second but changing the schedule is costly. We present execution templates, a control plane abstraction that can schedule hundreds of thousands of tasks per second while supporting fine-grained, per-task scheduling decisions. Execution templates leverage a program’s repetitive control flow to cache blocks of frequently-executed tasks. Executing a task in a template requires sending a single message. Large-scale scheduling changes install new templates, while small changes apply edits to existing templates. Evaluations of execution templates in Nimbus, a data analytics framework, find that they provide the fine-grained scheduling flexibility of centralized control planes while matching the strong scaling of distributed ones. Execution templates support complex, real-world applications, such as a fluid simulation with a triply nested loop and data dependent branches.

Decoupling the Control Plane from Program Control Flow for Flexibility and Performance in Cloud Computing: Hang Qu, Omid Mashayekhi, Chinmayee Shah, Philip Levis; In Proceedings of the 13th European Conference on Computer Systems (EuroSys '18), 2018
Abstract

Existing cloud computing control planes do not scale to more than a few hundred cores, while frameworks without control planes scale but take seconds to reschedule a job. We propose an asynchronous control plane for cloud computing systems, in which a central controller can dynamically reschedule jobs but worker nodes never block on communication with the controller. By decoupling control plane traffic from program control flow in this way, an asynchronous control plane can scale to run millions of computations per second while being able to reschedule computations within milliseconds. We show that an asynchronous control plane can match the scalability and performance of TensorFlow and MPI-based programs while rescheduling individual tasks in milliseconds. Scheduling an individual task takes 1μs, such that a 1,152 core cluster can schedule over 120 million tasks/second and this scales linearly with the number of cores. The ability to schedule huge numbers of tasks allows jobs to be divided into very large numbers of tiny tasks, whose improved load balancing can speed up computations 2.1- 2.3×.

Janus: A Novel MAC Protocol for Full Duplex Radio: Jae Young Kim, Omid Mashayekhi, Hang Qu, Maria Kazandjieva, and Philip Levis; Stanford CSTR 2013-02, 2013
Abstract

This paper presents Janus, a novel MAC protocol for full–duplex wireless networks. Unlike other full-duplex MACs, Janus allows partially interfering nodes to cooperate by finding the appropriate transmission rates based on interference levels, making better use of the channel. Computing the optimal schedule and transmission rates is NPComplete, so Janus uses a cheaper heuristic approach. Janus also ensures that channel access time is shared fairly between all nodes. Janus has lower per-packet overhead compared to CSMA/CA because it eliminates random back-off and lets nodes transmit multiple packets with a single set of control packets. We show that for a setup with one access point and three nodes, Janus achieves 2.5X the throughput of half–duplex system based on CSMA/CA.

Uniquely Decodable Codes with Fast Decoder for Overloaded Synchronous CDMA Systems: Omid Mashayekhi, Farokh Marvasti; IEEE Transactions on Communication, vol. 60, no. 11, pp. 3145-3149, November 2012
Abstract

In this paper, we introduce a new class of signature matrices for overloaded synchronous CDMA systems that have a very low complexity decoder. While overloaded systems are more efficient from the bandwidth point of view, the Maximum Likelihood (ML) implementation for decoding is impractical even for moderate dimensions. Simulation results show that the performance of the proposed decoder is very close to that of the ML decoder. Indeed, the proposed decoding scheme needs neither multiplication nor addition and requires only a few comparisons . Furthermore, the computational complexity and the probability of error vs. Signal to Noise Ratios (SNR) are derived analytically.

Scalable, Fast Cloud Computing with Execution Templates: Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis; Poster presentation at 2016 USENIX Symposium on Operating Systems Design and Implementation (OSDI'16)

Distributed Graphical Simulation in the Cloud: Omid Mashayekhi, Chinmayee Shah, Hang Qu, Andrew Lim, Philip Levis; arXiv:1606.01966 [cs.DC], 2016
Abstract

Graphical simulations are a cornerstone of modern media and films. But existing software packages are designed to run on HPC nodes, and perform poorly in the computing cloud. These simulations have complex data access patterns over complex data structures, and mutate data arbitrarily, and so are a poor fit for existing cloud computing systems. We describe a software architecture for running graphical simulations in the cloud that decouples control logic, computations and data exchanges. This allows a central controller to balance load by redistributing computations, and recover from failures. Evaluations show that the architecture can run existing, state-of-the-art simulations in the presence of stragglers and failures, thereby enabling this large class of applications to use the computing cloud for the first time.

Scalable, Fast Cloud Computing with Execution Templates: Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis; arXiv:1606.01972 [cs.DC], 2016
Abstract

Large scale cloud data analytics applications are often CPU bound. Most of these cycles are wasted: benchmarks written in C++ run 10-51 times faster than frameworks such as Naiad and Spark. However, calling faster implementations from those frameworks only sees moderate (4-5x) speedups because their control planes cannot schedule work fast enough. This paper presents execution templates, a control plane abstraction for CPU-bound cloud applications, such as machine learning. Execution templates leverage highly repetitive control flow to cache scheduling decisions as templates. Rather than reschedule hundreds of thousands of tasks on every loop execution, nodes instantiate these templates. A controller's template specifies the execution across all worker nodes, which it partitions into per-worker templates. To ensure that templates execute correctly, controllers dynamically patch templates to match program control flow. We have implemented execution templates in Nimbus, a C++ cloud computing framework. Running in Nimbus, analytics benchmarks can run 16-43 times faster than in Naiad and Spark. Nimbus's control plane can scale out to run these faster benchmarks on up to 100 nodes (800 cores).

Canary: A Scheduling Architecture for High Performance Cloud Computing: Hang Qu, Omid Mashayekhi, David Terei, Philip Levis; Stanford CSTR 2016-01, 2016
Abstract

We present Canary, a scheduling architecture that allows high performance analytics workloads to scale out to run on thousands of cores. Canary is motivated by the observation that a central scheduler is a bottleneck for high performance codes: a handful of multicore workers can execute tasks faster than a controller can schedule them. The key insight in Canary is to reverse the responsibilities between controllers and workers. Rather than dispatch tasks to workers, which then fetch data as necessary, in Canary the controller assigns data partitions to workers, which then spawn and schedule tasks locally. We evaluate three benchmark applications in Canary on up to 64 servers and 1,152 cores on Amazon EC2. Canary achieves up to 9−90× speedup over Spark and up to 4× speedup over GraphX, a highly optimized graph analytics engine. While current centralized schedulers can schedule 2,500 tasks/second, each Canary worker can schedule 136,000 tasks/second per core and experiments show this scales out linearly, with 64 workers scheduling over 120 million tasks per second, allowing Canary to support optimized jobs running on thousands of cores.

Here are the links to my single-page and extended resumes.

Resume:

Extended Resume:

Omid Mashayekhi

Projects

Main Research Projects

Nimbus

Janus

Older Research Projects

Overloaded Codes

Selected Short Projects

x86 Runtime Prediction

Packet Classification

DCell with OpenFlow

Cognitive Radio Devices

Publications

Dissertation

Papers

Presentations

Technical Reports

Resume