Software-Defined Infrastructure at Uber
Esther Shein | 04 June 2018
The only way for Uber to deliver the required level of network performance and availability is through software and automation, said Justin Dustzadeh, the head of global network and software platform at Uber. The ride-sharing company relies heavily on software to automate its infrastructure, and it thoroughly tests not only its software but also the test environment itself, Dustzadeh said, speaking at the recent Open Networking Summit.
“Our approach … is to create a test environment that can not only provide the capabilities needed to do the traditional software test cycles (such as feature testing, regression testing, integration testing) but also enables us to deploy and use the tested software to provision, monitor and configure the test environment itself,” he told the audience.
To give an idea of just how vast a network Uber has, the company, which started in 2010, logged over five billion trips in 2016, and about 15 million rides occur every day in over 600 cities and in 78 countries, he said.
Software architecture
“The magic of the Uber app today is powered by a highly distributed software architecture that relies on a fault-tolerant and highly available infrastructure,” Dustzadeh said. To fully achieve the benefits of software-based automation, he said they always strive to use open standards-based technologies and avoid dependency on a single vendor across the entire infrastructure stack.
At Uber, a key enabler is building real-time or near-real-time visibility into the infrastructure state, then augmenting that information with additional insights from analytics and machine learning, Dustzadeh said. IT can then push the desired state of the infrastructure through programmatic interfaces.
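The observe-then-push-desired-state idea can be illustrated with a small reconciliation loop. This is only a sketch under stated assumptions, not Uber's implementation: the device names, the `mtu` and `admin_up` settings, and the `plan_changes` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One change needed to move a device toward its desired state."""
    device: str
    key: str
    current: object
    desired: object


def plan_changes(observed: dict, desired: dict) -> list[Action]:
    """Compare observed and desired state and return the required changes."""
    actions = []
    for device, want in sorted(desired.items()):
        have = observed.get(device, {})
        for key, value in sorted(want.items()):
            if have.get(key) != value:
                actions.append(Action(device, key, have.get(key), value))
    return actions


# Hypothetical snapshots: what monitoring reports vs. what operators want.
observed = {
    "sw1": {"mtu": 1500, "admin_up": True},
    "sw2": {"mtu": 9000, "admin_up": False},
}
desired = {
    "sw1": {"mtu": 9000, "admin_up": True},
    "sw2": {"mtu": 9000, "admin_up": True},
}

for action in plan_changes(observed, desired):
    print(f"{action.device}: set {action.key} {action.current!r} -> {action.desired!r}")
```

In a real system the resulting plan would be applied through a device API and the state observed again, so the loop converges instead of assuming each change succeeded.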
As real-life examples, he said they use software to automate many areas, from delivering forecasting models to capacity planning, provisioning infrastructure and managing all the changes that IT performs. Software also automates incident detection, as well as mitigation and remediation when things fail.
Automation
“For provisioning across our server and network environments we leverage a number of homegrown software platforms to automate and orchestrate the entire provisioning process,” in areas like auto discovery, Dustzadeh said. On the network side, for example, IT pushes intelligence to the devices to enable a distributed self-discovery model and enable zero-touch provisioning, he noted. This includes auto validation of the state of the hardware, for example, to prevent bad devices from going into production, he added.
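The validation step in zero-touch provisioning can be pictured as a gate that compares what a device reports against what is expected before it is admitted to production. The sketch below is illustrative only: the field names (`model`, `port_count`, `psu_ok`, `fans_ok`), the model string, and both helper functions are assumptions, not details from the talk.

```python
def validate_device(reported: dict, expected: dict) -> list[str]:
    """Return a list of validation failures; an empty list means the device passes."""
    failures = []
    for field, want in expected.items():
        got = reported.get(field)
        if got != want:
            failures.append(f"{field}: expected {want!r}, got {got!r}")
    return failures


def admit_to_production(reported: dict, expected: dict) -> bool:
    """Admit a device only if every expected attribute matches."""
    return not validate_device(reported, expected)


# Hypothetical expected profile and two self-reported inventories.
expected = {"model": "leaf-48x", "port_count": 48, "psu_ok": True, "fans_ok": True}
good = {"model": "leaf-48x", "port_count": 48, "psu_ok": True, "fans_ok": True}
bad = {"model": "leaf-48x", "port_count": 48, "psu_ok": False, "fans_ok": True}

print(admit_to_production(good, expected))
print(validate_device(bad, expected))
```

Returning the list of failures, instead of a bare boolean, gives a later workflow something concrete to attach to a troubleshooting ticket.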
Uber’s IT group uses a distributed and highly available platform for auto-detection, he said. On the network side, they do both active and passive monitoring, leveraging streaming telemetry. This gives the team near-real-time visibility into the state of the network, including network reachability, network latency, packet loss, and link utilization, he said.
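Two of the metrics mentioned, link utilization and packet loss, are typically derived from successive counter samples. The following sketch shows the arithmetic under simple assumptions: the `CounterSample` fields are hypothetical, counters are assumed never to wrap or reset, and loss is computed from active probes.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """A hypothetical telemetry sample with cumulative counters."""
    timestamp_s: float
    tx_bytes: int
    probes_sent: int
    probes_received: int


def link_utilization(prev: CounterSample, curr: CounterSample, link_speed_bps: float) -> float:
    """Fraction of link capacity used between two samples (0.0 to 1.0)."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    bits_sent = (curr.tx_bytes - prev.tx_bytes) * 8
    return bits_sent / (elapsed * link_speed_bps)


def packet_loss(prev: CounterSample, curr: CounterSample) -> float:
    """Fraction of active probes lost between two samples."""
    sent = curr.probes_sent - prev.probes_sent
    received = curr.probes_received - prev.probes_received
    if sent == 0:
        return 0.0  # no probes in this interval, so no loss can be measured
    return (sent - received) / sent


prev = CounterSample(timestamp_s=0.0, tx_bytes=0, probes_sent=0, probes_received=0)
curr = CounterSample(timestamp_s=10.0, tx_bytes=5_000_000_000, probes_sent=1000, probes_received=990)

print(f"utilization: {link_utilization(prev, curr, 10e9):.0%}")
print(f"packet loss: {packet_loss(prev, curr):.1%}")
```

Here 5 GB sent in 10 seconds on a 10 Gbps link works out to 40 percent utilization, and 10 lost probes out of 1,000 is 1 percent loss.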
Auto-mitigation and auto-remediation are other areas where Uber heavily leverages software to improve its operational efficiencies, he said. “So when hardware fails, not only do we have to ensure that the issue is mitigated quickly before it becomes a service impacting incident, we also automate the back-end workflows to automatically generate troubleshooting and/or RMA tickets.”
If necessary, he said, they can also run auto-diagnostic and auto-remediation tests and perform failure prediction, for example by monitoring specific metrics or running specific playbooks.
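One simple way to picture metric-driven failure prediction feeding a playbook is a rolling-average threshold. This is a minimal sketch, assuming a hypothetical per-device error-rate metric; the `ErrorRateMonitor` class, the window and threshold values, and the playbook steps are invented for illustration and are not described in the talk.

```python
from collections import deque


class ErrorRateMonitor:
    """Flags a device when the rolling mean of a metric exceeds a threshold."""

    def __init__(self, window: int, threshold: float):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Record a sample; return True once a full window averages above the threshold."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and sum(self.samples) / len(self.samples) > self.threshold


def handle_sample(monitor: ErrorRateMonitor, device: str, value: float) -> list[str]:
    """Return the hypothetical playbook steps to run (empty if the device looks healthy)."""
    if monitor.observe(value):
        return [
            f"drain traffic from {device}",
            f"run diagnostics on {device}",
            f"open RMA ticket for {device}",
        ]
    return []


monitor = ErrorRateMonitor(window=3, threshold=5.0)
for errors_per_min in [1, 2, 9, 9]:
    steps = handle_sample(monitor, "sw7", errors_per_min)
    print(errors_per_min, steps)
```

Averaging over a window, instead of reacting to a single sample, trades a little detection delay for fewer false alarms; production systems often use more sophisticated models.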
Resiliency
Uber views its network as a key enabler of its business, Dustzadeh said. “Such network resiliency with the focus on deterministic failure behavior is one of our top design principles. Operational efficiency is also a key objective, meaning that the network has to be simple to build and also be flexible and cost effective.”
On the backbone side and in the WAN space, Uber is moving away from static and long-term contract models toward a more flexible approach, preferably SDN-controlled, on-demand spectrum-as-a-service, he said. “We are also exploring ideas and future models where regional and long-haul bandwidth could be more on demand and usage based like cloud services where carriers would serve as spectrum brokers.”
On the data center side, in addition to the software-defined capabilities Dustzadeh outlined, the company is also looking into server OEMs and a modular rack design to support multiple server types, for example across compute, storage, and AI and machine learning with GPUs and FPGAs, he said. They are also looking at network disaggregation in the data center.
“There is a great opportunity, especially in the data center space, to look into the disaggregated model to separate network hardware and network software,” he said. This could enable a much faster pace of innovation and faster development of new features, he noted.