The Robot Operations Manifesto

Who is this for?

This paper can benefit anyone involved in designing, manufacturing, deploying and operating robots. It provides a roadmap for the changing needs companies face as they go from one to thousands of robots.

However, our primary focus is on the operation at scale of large, autonomous robot fleets. The principles captured here apply equally to all types of robots that incorporate a degree of autonomy, whether the robots are wheeled, ground-based, flying or (under)water. Even a new robot that hasn’t been invented yet can benefit from these principles.

Robotics deployments go through various phases, and their needs change accordingly:

  • 1-5 robots: lab/prototype phase, focus on getting a robot to work

  • 5-50 robots: initial field deployment, often a limited-time pilot

  • 50-5,000 robots: scaling, most of what worked before stops working

  • 5,000+ robots: it’s just a matter of growth, more of the same

Somewhere along the journey from 50 to 5,000, the needs for robot operations change dramatically. Manual tasks can be replaced with automation at scale, and ROI ramps up.

Typically, an end user of a robotics system is looking for a solution to a real-world problem, not a robot. They want it to just work, and they want robots from different vendors to work together.

If we follow the jobs-to-be-done (JTBD) framework, people “hire” robots to complete a task. Perhaps it’s getting product out the door faster, or making sure there’s never an empty shelf in a grocery store. For people who hire robots to help complete a task, this manifesto is also meant to provide guidance on how to adopt and integrate robotics for operations at scale.

How do I use it?

The next section provides a high-level overview of the challenges and the best approaches to address them. These are organized around four pillars, plus an overarching practice that connects them.

In the sections that follow, we provide specifics of each approach.

Manifesto

We believe that the key to operating robots at scale is to follow the lifecycle of failures.

Modern robots are complex systems. While they can achieve a high degree of autonomy, there are still what we call autonomy exceptions, where the robot cannot continue operating safely or efficiently.

Effective operation at scale requires:

  • accepting that autonomy exceptions will occur

  • taking steps to resolve them immediately

  • reducing the incidence of exceptions over time.
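One way to check whether the incidence of exceptions is actually falling is to normalize the exception count by operating time, since a growing fleet will tend to log more exceptions in absolute terms. The sketch below is only an illustration: the `WeeklyFleetStats` fields, the per-1,000-robot-hours unit and the sample numbers are all assumptions, not something this manifesto prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WeeklyFleetStats:
    """Hypothetical weekly roll-up for a fleet (field names are assumptions)."""
    week: str
    autonomy_exceptions: int
    robot_hours: float


def exceptions_per_1000_hours(stats: WeeklyFleetStats) -> float:
    """Normalize the exception count by operating time so weeks are comparable."""
    if stats.robot_hours <= 0:
        raise ValueError("robot_hours must be positive")
    return 1000.0 * stats.autonomy_exceptions / stats.robot_hours


def is_improving(history: List[WeeklyFleetStats]) -> bool:
    """True if the rate in the latest week is lower than in the first week."""
    if len(history) < 2:
        raise ValueError("need at least two weeks of data")
    return exceptions_per_1000_hours(history[-1]) < exceptions_per_1000_hours(history[0])


# Made-up numbers: the fleet doubles its operating hours between the two weeks.
history = [
    WeeklyFleetStats("week-1", autonomy_exceptions=120, robot_hours=4000.0),
    WeeklyFleetStats("week-2", autonomy_exceptions=150, robot_hours=8000.0),
]
rates = [exceptions_per_1000_hours(w) for w in history]
improving = is_improving(history)
```

In this made-up data the raw count rises from 120 to 150 while the normalized rate falls, which is the kind of trend a scaling fleet would want to track.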

Borrowing from existing models such as the Failure Reporting, Analysis, and Corrective Action System (FRACAS), we identified four pillars that work together to solve problems at scale:

  • Monitoring/Observability

  • Configuration Management

  • Safety, Security, and Auditing

  • Interventions

The ROG Circle of Pillars: Monitoring, Config Management, Auditing, Interventions

In addition, across the pillars, we acknowledge the role of interoperability and orchestration of multiple robots, possibly from different vendors, as well as integration with other line-of-business software.

Monitoring/Observability

In modern DevOps terminology, observability is the evolution of more traditional monitoring. Beyond up/down health signals, observability supports drilling down and understanding implicit failure modes. Logs, metrics, and traces are considered the core of observability [1]. More broadly, observability may include alerting, and may be integrated with incident management systems.

Adapting the concept of observability from the cloud, where it was developed, to robotics requires thinking of a fleet of robots as a distributed system, with compute/storage/networking on each robot and often in the cloud.
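To make the three signal types concrete, here is a minimal sketch of a robot-side telemetry buffer that tags logs, metrics and trace spans with a shared trace ID so they can be correlated later. Everything in it is hypothetical: the `RobotTelemetry` class, its method names and the JSON-lines payload are illustrative assumptions, and buffering on the robot before upload is one possible design, not a requirement.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Union


@dataclass
class TelemetryEvent:
    """One observability signal from one robot (schema is an assumption)."""
    robot_id: str
    trace_id: str
    kind: str                      # "log", "metric", or "trace_span"
    name: str
    value: Union[float, str]
    timestamp: float = field(default_factory=time.time)


class RobotTelemetry:
    """Buffers events on the robot until they can be sent to the cloud."""

    def __init__(self, robot_id: str) -> None:
        self.robot_id = robot_id
        self.buffer: List[TelemetryEvent] = []

    def start_trace(self) -> str:
        # A shared ID lets logs, metrics and spans from one incident be joined.
        return uuid.uuid4().hex

    def _emit(self, trace_id: str, kind: str, name: str, value) -> None:
        self.buffer.append(TelemetryEvent(self.robot_id, trace_id, kind, name, value))

    def log(self, trace_id: str, message: str) -> None:
        self._emit(trace_id, "log", "message", message)

    def metric(self, trace_id: str, name: str, value: float) -> None:
        self._emit(trace_id, "metric", name, value)

    def span(self, trace_id: str, name: str, duration_s: float) -> None:
        self._emit(trace_id, "trace_span", name, duration_s)

    def flush(self) -> str:
        """Serialize buffered events as JSON lines and clear the buffer."""
        payload = "\n".join(json.dumps(asdict(e)) for e in self.buffer)
        self.buffer.clear()
        return payload


telemetry = RobotTelemetry("robot-7")
trace = telemetry.start_trace()
telemetry.log(trace, "localization confidence dropped")
telemetry.metric(trace, "battery_pct", 62.5)
telemetry.span(trace, "replan_route", 0.8)
payload = telemetry.flush()
```

Each line of `payload` is a standalone JSON object, so a cloud-side collector could ingest events from many robots and group them by `trace_id`.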

Configuration Management

With deployments of a handful or even a dozen robots, each one may be somewhat unique in its hardware, software and operational settings. These are often updated individually and manually. However, as the fleet scales, introducing automation of configuration management (CM) becomes critical.

CM often requires an agent on the robot that can receive and implement changes in configuration, ensure that they are applied properly, and support rollbacks in case of problems. These changes are subject to an audit trail that can be used as part of problem resolution.

CM may include software that is deployed across all robots plus anything that makes that robot unique, such as location-specific configs, maps, scheduling, etc.
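The following sketch shows one way an on-robot agent could combine these ideas: fleet-wide settings are merged with robot-specific overrides, a change is applied, and the agent rolls back to the last known-good configuration if a check fails, recording every step in an audit trail. The `ConfigAgent` class, the `validator` callback and the config keys are hypothetical names chosen for illustration.

```python
import copy
import time


def merge_config(fleet_wide: dict, robot_specific: dict) -> dict:
    """Robot-specific keys override fleet-wide defaults (shallow merge)."""
    merged = copy.deepcopy(fleet_wide)
    merged.update(copy.deepcopy(robot_specific))
    return merged


class ConfigAgent:
    """Hypothetical on-robot agent: apply, validate, roll back, audit."""

    def __init__(self, robot_id: str, initial_config: dict, validator) -> None:
        self.robot_id = robot_id
        self.active = copy.deepcopy(initial_config)
        self.validator = validator  # callable(config) -> bool; an assumed health check
        self.audit_trail = []

    def _record(self, action: str, version: str, ok: bool) -> None:
        self.audit_trail.append({
            "robot_id": self.robot_id,
            "action": action,
            "version": version,
            "ok": ok,
            "timestamp": time.time(),
        })

    def apply(self, new_config: dict, version: str) -> bool:
        previous = self.active
        self.active = copy.deepcopy(new_config)
        if self.validator(self.active):
            self._record("apply", version, True)
            return True
        # Validation failed: restore the last known-good configuration.
        self.active = previous
        self._record("apply", version, False)
        self._record("rollback", version, True)
        return False


fleet = {"software_version": "2.4.1", "max_speed_mps": 1.5}
site = {"map": "warehouse-east", "max_speed_mps": 1.0}
agent = ConfigAgent(
    "robot-7",
    merge_config(fleet, site),
    validator=lambda cfg: 0 < cfg["max_speed_mps"] <= 2.0,
)

# A bad change (speed far above the assumed safe limit) is rejected and rolled back.
bad = merge_config(fleet, {"map": "warehouse-east", "max_speed_mps": 9.0})
applied = agent.apply(bad, version="cfg-42")
```

After the rejected change, `agent.active` still holds the earlier configuration and `agent.audit_trail` contains both the failed apply and the rollback, which is the record an operator would consult during problem resolution.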

Safety, Security, and Auditing

In contrast with traditional manufacturing robots that operate in a cage, autonomous machines operating in unstructured environments have the potential to cause bodily harm to the people with whom they share space or collaborate. Therefore, safety is the most critical element in this manifesto. The maxim adopted by the medical professions, primum non nocere (first, do no harm), applies to robot operations as well.

While information security is a separate concept, and merits attention in its own right to prevent loss of data, privacy or assets, it is also closely related to safety. As companies expand rapidly from a handful of tightly controlled robots to a large fleet distributed across far-flung locations, instrumenting security across all attack vectors becomes critical.

It may take a catastrophic security/safety incident for an organization to get serious about the risks. The goal of this manifesto is to raise awareness around this issue in hopes of preventing a disaster. Some of the best practices applied in other areas carry over to robot fleets, such as establishing a root of trust, adopting modern encryption and key management techniques, conducting intrusion detection and active monitoring, and enabling role-based access control for different types of users. Organizations such as the Center for Internet Security provide guidance on cybersecurity best practices, tools and threats.

When implemented properly, auditing provides accountability as well as an opportunity for forensic analysis that may result in improved compliance and prevention of future incidents.
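As a rough illustration of how role-based access control and auditing can work together, the sketch below checks each requested action against a permission table and records every attempt, allowed or denied. The roles, action names and the in-memory `audit_log` are assumptions made for the example; a real deployment would use persistent, access-controlled storage.

```python
import time
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    OPERATOR = "operator"
    ADMIN = "admin"


# Hypothetical permission table; the actions are illustrative only.
PERMISSIONS = {
    Role.VIEWER: {"view_telemetry"},
    Role.OPERATOR: {"view_telemetry", "teleoperate", "pause_robot"},
    Role.ADMIN: {"view_telemetry", "teleoperate", "pause_robot", "push_config"},
}

audit_log = []  # in-memory stand-in for a durable audit store


def authorize(user: str, role: Role, action: str, robot_id: str) -> bool:
    """Check the action against the role and record the attempt either way."""
    allowed = action in PERMISSIONS[role]
    audit_log.append({
        "timestamp": time.time(),
        "user": user,
        "role": role.value,
        "action": action,
        "robot_id": robot_id,
        "allowed": allowed,
    })
    return allowed


denied = authorize("alice", Role.VIEWER, "push_config", "robot-7")
granted = authorize("bob", Role.ADMIN, "push_config", "robot-7")
```

Logging denied attempts as well as granted ones is a deliberate choice here: both are useful for forensic analysis after an incident.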

Interventions

Despite companies’ best efforts, problems in complex systems are unavoidable. Large fleets of autonomous robots are some of the most complex systems in existence, and the industry is still developing many of the foundational technologies. Therefore, robotics operators must have a system in place for problem resolution.

Frameworks such as Closed-Loop Corrective Action (CLCA) are meant to identify, analyze, and correct a problem with a product or process. The goal is to provide corrective action in a timely manner for any problem, and to have a means of verifying that the corrective action did indeed fix the problem.

Runbooks can be used to provide more repeatability. Runbook automation and remote interventions initiated automatically, e.g., through the use of AI that learns from failure cases, can complement but usually not completely replace human interventions.
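A runbook can be sketched as an ordered list of remediation steps, each followed by a check that the problem is actually fixed, with escalation to a human when the automated steps run out. The `RunbookStep` and `run_runbook` names, the step descriptions and the simulated robot state below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunbookStep:
    description: str
    action: Callable[[], None]  # performs one remediation attempt


def run_runbook(steps: List[RunbookStep], verify: Callable[[], bool]) -> dict:
    """Try each step in order; stop at the first one after which verify() passes."""
    attempted = []
    for step in steps:
        attempted.append(step.description)
        try:
            step.action()
        except Exception:
            continue  # a failed step is not fatal; move on to the next one
        if verify():
            return {"resolved": True, "escalate_to_human": False, "attempted": attempted}
    # Automation is exhausted: hand the case over to a human operator.
    return {"resolved": False, "escalate_to_human": True, "attempted": attempted}


# Simulated mis-localized robot (state and step behavior are made up).
state = {"localized": False}


def relocalize_in_place() -> None:
    pass  # simulated failure: the robot stays lost


def relocalize_at_dock() -> None:
    state["localized"] = True


result = run_runbook(
    [
        RunbookStep("relocalize in place", relocalize_in_place),
        RunbookStep("drive to dock and relocalize", relocalize_at_dock),
    ],
    verify=lambda: state["localized"],
)

# A runbook whose steps never fix the problem ends in escalation.
unresolved = run_runbook([RunbookStep("restart navigation", lambda: None)], verify=lambda: False)
```

Keeping the verification check separate from the steps means the same runbook can be reused whether a step is triggered automatically or by a remote operator.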

In some cases, interventions for robot fleets may be done remotely, while other situations require on-site interventions. Sometimes interventions are aimed at addressing a symptom, such as getting a mis-localized robot back to its mission. In other cases, the intervention may be pushing an over-the-air (OTA) hot fix. These practices build on and complement the other pillars described above.

Conclusion

As deployments of autonomous robots grow from novelty to widespread adoption across multiple industries, robotics companies and operators must develop best practices to enable robot operations at scale. We’ve presented a framework that applies across different robotics applications, enabling robot operators to be successful by following the lifecycle of failures.

No single tool or process is sufficient; instead, a combination of capabilities across four pillars is required to build a solid foundation. Just as important is a culture of continuous improvement based on accepting that autonomy exceptions will occur, taking steps to resolve them immediately and reducing the incidence of exceptions over time.

Notes

  1. Cindy Sridharan, Distributed Systems Observability: A Guide to Building Robust Systems, O’Reilly Media, 2018.