Robot Operations Manifesto
Why another manifesto?
A manifesto is a published declaration of the intentions, motives, or views of the issuer. In modern applications of robotics and autonomy, there is an increasing need for guiding principles for how to manage these technologies at scale. What works for one robot is unlikely to work for 1,000.
The companies, organizations and individuals that form the Robot Operations Group believe that developing best practices for robot operations at scale will accelerate the adoption of robotics and will help humanity address some of its biggest challenges.
There are already many manifestos in existence, from political parties to large corporations. We draw inspiration from one that has been influential in the software world.
The Agile Manifesto
We are uncovering better ways of developing software by doing it and
helping others do it. Through this work we have come to value:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.
While robots rely heavily on software, they are also unique. Robots interact with and change the world around them, whether it’s transporting materials in a warehouse, moving around a field or a retail store to capture video or collect items, or helping build skyscrapers. A single robot can complete a small number of specific tasks, whereas robotics deployment at scale can address several needs, from food production to shelter, to giving people access to the products they love in a fraction of the time it used to take.
We hope that the principles described in this manifesto will help a new generation of innovators drive change in traditional industries, making them more efficient, reducing error rates, protecting the health of human workers, lowering operational costs and ultimately delivering more customer value.
Who is this for?
This paper can benefit anyone involved in designing, manufacturing, deploying and operating robots. It provides a roadmap for the changing needs companies face as they go from one to thousands of robots.
However, our primary focus is on the operation at scale of large, autonomous robot fleets. The principles captured here apply equally to all types of robots that incorporate a degree of autonomy, whether the robots are wheeled, ground-based, flying or (under)water. Even a new robot that hasn’t been invented yet can benefit from these principles.
Robotics deployments go through various phases, and their needs change accordingly:
| Number of robots | Phase |
|---|---|
| 1-5 | Lab/prototype phase, focus on getting a robot to work |
| 5-50 | Initial field deployment, often a limited-time pilot |
| 50-5,000 | Scaling, most of what worked before stops working |
| 5,000+ | It’s just a matter of growth, more of the same |
Somewhere along the journey from 50 to 5,000, the needs for robot operations change dramatically. Manual tasks can be replaced with automation at scale, and ROI ramps up.
Typically, an end user of a robotics system is looking for a solution to a real-world problem, not a robot. They want it to just work, and they want robots from different vendors to work together.
If we follow the jobs-to-be-done (JTBD) framework, people “hire” robots to complete a task. Perhaps it’s getting product out the door faster, or making sure there’s never an empty shelf in a grocery store. For people who hire robots in this sense, this manifesto is also meant to provide guidance on how to adopt and integrate robotics for operations at scale.
How do I use it?
The next section provides a high-level overview of the challenges and the best approaches to address them. These are organized around four pillars, plus an overarching practice that connects them.
In the sections that follow, we provide specifics of each approach.
We believe that the key to operating robots at scale is to follow the lifecycle of failures.
Modern robots are complex systems. While they can achieve a high degree of autonomy, there are still what we call autonomy exceptions, where the robot cannot continue operating safely or efficiently.
Effective operations at scale require:
- accepting that autonomy exceptions will occur
- taking steps to resolve them immediately
- reducing the incidence of exceptions over time.
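One way to make the third point measurable is to track autonomy exceptions per robot-hour for each reporting period, so that periods with different fleet sizes can be compared. The following is a minimal sketch under stated assumptions: the `PeriodStats` record, the normalization per 1,000 robot-hours, and the sample figures are all hypothetical and do not come from any real fleet.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PeriodStats:
    """Hypothetical per-period fleet totals."""
    label: str
    robot_hours: float          # total operating hours across the fleet
    autonomy_exceptions: int    # exceptions that needed an intervention


def exceptions_per_1000_hours(stats: PeriodStats) -> float:
    """Normalize the exception count so periods with different fleet sizes compare."""
    if stats.robot_hours <= 0:
        raise ValueError("robot_hours must be positive")
    return 1000.0 * stats.autonomy_exceptions / stats.robot_hours


def is_improving(history: list[PeriodStats]) -> bool:
    """True if the normalized rate never rises from one period to the next."""
    rates = [exceptions_per_1000_hours(p) for p in history]
    return all(later <= earlier for earlier, later in zip(rates, rates[1:]))


# Made-up numbers: the fleet grows and raw exceptions grow, but the rate falls.
history = [
    PeriodStats("Q1", robot_hours=2_000, autonomy_exceptions=40),
    PeriodStats("Q2", robot_hours=10_000, autonomy_exceptions=120),
    PeriodStats("Q3", robot_hours=50_000, autonomy_exceptions=300),
]
rates = [exceptions_per_1000_hours(p) for p in history]
```

With these sample figures the raw exception count rises each period while the normalized rate falls from 20 to 12 to 6 per 1,000 robot-hours, which is the kind of trend the third point asks for.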
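Borrowing from existing models such as the Failure Reporting, Analysis, and Corrective Action System (FRACAS), we identified four pillars that work together to solve problems at scale: observability; configuration management; safety, security and auditing; and problem resolution.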
In addition, across the pillars, we acknowledge the role of interoperability and orchestration of multiple robots, possibly from different vendors, as well as integration with other line-of-business software.
Observability
In modern DevOps terminology, observability is the evolution of more traditional monitoring. Beyond up/down health signals, observability supports drilling down and understanding implicit failure modes. Logs, metrics, and traces are considered the core of observability [1]. More broadly, observability may include alerting, and may be integrated with incident management systems.
Adapting the concept of observability from the cloud, where it was developed, to robotics requires thinking of a fleet of robots as a distributed system, with compute, storage and networking on each robot and often in the cloud.
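To illustrate what treating each robot as a node in a distributed system can look like, here is a minimal sketch of a structured log event that carries a robot identifier and a correlation identifier, plus a simple error counter. It uses only the Python standard library. The field names (`robot_id`, `mission_id`), the `make_event` helper and the sample values are assumptions made for illustration, not part of any particular observability product.

```python
import json
import time
from collections import Counter

# Fleet-wide counters; in a real deployment these would live in a metrics backend.
metrics = Counter()


def make_event(robot_id: str, mission_id: str, level: str, message: str, **fields) -> str:
    """Build one structured log line (JSON) that can be shipped off the robot.

    The mission_id acts as a correlation id so that events from the same
    mission can be stitched together, in the spirit of a trace.
    """
    record = {
        "ts": time.time(),
        "robot_id": robot_id,
        "mission_id": mission_id,
        "level": level,
        "message": message,
        **fields,  # free-form context, e.g. site or sensor readings
    }
    if level == "error":
        # Metric keyed by (name, robot) so errors can be aggregated per robot.
        metrics[("errors_total", robot_id)] += 1
    return json.dumps(record, sort_keys=True)


# Hypothetical event from a robot that has lost localization confidence.
line = make_event(
    "robot-0042", "mission-7", "error",
    "localization confidence below threshold",
    confidence=0.31, site="warehouse-a",
)
decoded = json.loads(line)
```

Because every event carries the same identifiers, a fleet operator can filter by `robot_id` to inspect one machine or by `mission_id` to follow one task across robots and cloud services.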
Configuration Management
With deployments of a handful or even a dozen robots, each one may be somewhat unique in its hardware, software and operational settings. These are often updated individually and manually. However, as the fleet scales, automating configuration management (CM) becomes critical.
CM often requires an agent on the robot that can receive and implement changes in configuration, ensure that they are applied properly and support rollbacks in case of problems. These changes are subject to an audit trail that can be used as part of problem resolution.
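The apply, verify, roll back and audit cycle described above can be sketched as follows. This is a toy, in-memory illustration rather than a real agent: the `ConfigAgent` class, the `speed_is_safe` check, the 2.0 m/s limit and the configuration keys are all hypothetical.

```python
import copy
from datetime import datetime, timezone


class ConfigAgent:
    """Hypothetical on-robot agent: apply a change, verify it, roll back on failure."""

    def __init__(self, robot_id, initial_config):
        self.robot_id = robot_id
        self.config = copy.deepcopy(initial_config)
        self.audit_log = []  # append-only record of every attempted change

    def _audit(self, action, detail):
        self.audit_log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "robot_id": self.robot_id,
            "action": action,
            "detail": detail,
        })

    def apply(self, change, verify):
        """Apply `change` (a dict of new values); keep it only if `verify` passes."""
        previous = copy.deepcopy(self.config)
        self.config.update(change)
        self._audit("apply", change)
        if verify(self.config):
            self._audit("verified", change)
            return True
        self.config = previous  # restore the last known-good configuration
        self._audit("rollback", change)
        return False


def speed_is_safe(config):
    # Illustrative check only; the 2.0 m/s limit is an assumed value.
    return 0 < config["max_speed_mps"] <= 2.0


agent = ConfigAgent("robot-0042", {"max_speed_mps": 1.0, "map": "warehouse-a-v3"})
ok = agent.apply({"max_speed_mps": 1.5}, speed_is_safe)        # accepted
rejected = agent.apply({"max_speed_mps": 9.0}, speed_is_safe)  # rolled back
```

After both calls the robot keeps the 1.5 m/s setting, and the audit log holds four entries (apply, verified, apply, rollback) that can later support problem resolution.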
CM may include software that is deployed across all robots plus anything that makes an individual robot unique, such as location-specific configs, maps, scheduling, etc.
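One common way to combine fleet-wide settings with robot-specific ones is to layer them, with more specific layers overriding more general ones. The sketch below assumes three hypothetical layers (fleet, site, robot) and made-up keys; it is one possible arrangement, not a prescribed schema.

```python
import copy


def merge_config(*layers):
    """Merge config layers left to right; later layers override earlier ones.

    Nested dicts are merged key by key; any other value is replaced outright.
    Inputs are never mutated.
    """
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_config(merged[key], value)
            else:
                merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical layers, from most general to most specific.
fleet = {"software_version": "2.4.1", "nav": {"max_speed_mps": 1.5, "planner": "default"}}
site = {"map": "warehouse-a-v3", "nav": {"max_speed_mps": 1.0}}
robot = {"schedule": "night-shift"}

effective = merge_config(fleet, site, robot)
```

Here the site layer lowers the speed limit without having to restate the planner choice, and the robot layer adds its own schedule on top.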
Safety, Security and Auditing
In contrast with traditional manufacturing robots that operate in a cage, autonomous machines operating in unstructured environments have the potential for causing bodily harm to the people with whom they share space or collaborate. Therefore, safety is the most critical element in this manifesto. The maxim adopted by the medical professions, primum non nocere (first, do no harm), applies equally to robot operations.
While information security is a separate concept, and merits attention in its own right to prevent loss of data, privacy or assets, it is also closely related to safety. As companies expand rapidly from a handful of tightly controlled robots to a large fleet distributed across far-flung locations, instrumenting security across all attack vectors becomes critical.
It may take a catastrophic security or safety incident for the industry to get serious about the risks. The goal of this manifesto is to raise awareness of this issue in hopes of preventing a disaster. Some of the best practices applied in other areas carry over to robotics, such as establishing a root of trust, adopting modern encryption and key management techniques, conducting intrusion detection and active monitoring, and enabling role-based access control for different types of users. Organizations such as the Center for Internet Security provide guidance on cybersecurity best practices, tools and threats.
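As a small illustration of role-based access control for different types of users, the sketch below maps roles to sets of permissions and denies anything not explicitly granted. The role names and permissions are hypothetical examples; a real deployment would define its own and would also need authentication, which is out of scope here.

```python
from enum import Enum


class Permission(Enum):
    VIEW_TELEMETRY = "view_telemetry"
    TELEOPERATE = "teleoperate"
    PUSH_CONFIG = "push_config"
    MANAGE_USERS = "manage_users"


# Hypothetical role definitions; a real deployment would load these from policy.
ROLE_PERMISSIONS = {
    "viewer": {Permission.VIEW_TELEMETRY},
    "operator": {Permission.VIEW_TELEMETRY, Permission.TELEOPERATE},
    "engineer": {Permission.VIEW_TELEMETRY, Permission.TELEOPERATE, Permission.PUSH_CONFIG},
    "admin": set(Permission),
}


def is_allowed(role: str, permission: Permission) -> bool:
    """Deny by default: unknown roles get no permissions."""
    return permission in ROLE_PERMISSIONS.get(role, set())


def require(role: str, permission: Permission) -> None:
    """Raise PermissionError unless the role holds the permission."""
    if not is_allowed(role, permission):
        raise PermissionError(f"role {role!r} may not {permission.value}")


# An operator may drive a robot remotely but may not push configuration.
can_drive = is_allowed("operator", Permission.TELEOPERATE)
can_push = is_allowed("operator", Permission.PUSH_CONFIG)
```

Calling `require` at the entry point of each sensitive operation keeps the check in one place and makes every denial an explicit, loggable event.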
When implemented properly, auditing provides accountability as well as an opportunity for forensic analysis that may result in improved compliance and prevention of future incidents.
Problem Resolution
Despite companies’ best efforts, problems in complex systems are unavoidable. Large fleets of autonomous robots are some of the most complex systems in existence, and the industry is still developing many of the foundational technologies. Therefore, robotics operators must have a system in place for problem resolution.
Frameworks such as Closed-Loop Corrective Action (CLCA) are meant to identify, analyze, and correct a problem with a product or process. The goal is to provide corrective action in a timely manner for any problem, and to have a means of verifying that the corrective action did indeed fix the problem.
Runbooks can be used to provide more repeatability. Runbook automation and remote interventions initiated automatically, e.g., through the use of AI that learns from failure cases, can complement but usually not completely replace human interventions.
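The idea that automation complements rather than replaces human intervention can be sketched as a runbook that tries a fixed sequence of automated steps and escalates to a person when none of them works. Everything here is a stand-in: the step functions, the `mislocalized` exception type and the robot state keys are hypothetical, and real steps would issue remote commands instead of reading a dictionary.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Outcome(NamedTuple):
    resolved: bool
    steps_run: List[str]
    escalated_to_human: bool


# A step receives the robot state and returns True if the robot has recovered.
Step = Callable[[dict], bool]


def retry_localization(robot: dict) -> bool:
    # Stand-in for a remote command; succeeds only if a known marker is visible.
    return bool(robot.get("sees_marker", False))


def return_to_dock(robot: dict) -> bool:
    # Stand-in for a fallback behavior; succeeds only if the path is clear.
    return bool(robot.get("path_to_dock_clear", False))


# Hypothetical runbooks: ordered steps per autonomy exception type.
RUNBOOKS: Dict[str, List[Tuple[str, Step]]] = {
    "mislocalized": [
        ("retry_localization", retry_localization),
        ("return_to_dock", return_to_dock),
    ],
}


def handle_exception(kind: str, robot: dict) -> Outcome:
    """Run the runbook for `kind`; escalate if no step resolves the exception."""
    steps_run: List[str] = []
    for name, step in RUNBOOKS.get(kind, []):
        steps_run.append(name)
        if step(robot):
            return Outcome(True, steps_run, False)
    # No runbook, or every step failed: hand over to a human operator.
    return Outcome(False, steps_run, True)


recovered = handle_exception("mislocalized", {"sees_marker": True})
stuck = handle_exception("mislocalized", {})
```

Recording which steps ran before an escalation gives the human operator context and shows which runbooks need improvement over time.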
In some cases, interventions for robot fleets may be done remotely, while other situations require on-site interventions. Sometimes interventions are aimed at addressing a symptom, such as getting a mis-localized robot back to its mission. In other cases, the intervention may be pushing an over-the-air (OTA) hot fix. These practices build on and complement the other pillars described above.
Conclusion
As deployments of autonomous robots grow from novelty to widespread adoption across multiple industries, robotics companies and operators must develop best practices to enable robot operations at scale. We’ve presented a framework that applies across different robotics applications, enabling robot operators to be successful by following the lifecycle of failures.
No single tool or process is sufficient; instead, a combination of capabilities across four pillars is required to build a solid foundation. Just as important is a culture of continuous improvement based on accepting that autonomy exceptions will occur, taking steps to resolve them immediately and reducing the incidence of exceptions over time.
1. Cindy Sridharan, Distributed Systems Observability: A Guide to Building Robust Systems, O’Reilly Media, 2018.