Software Safety Lessons from the World of Roller Derby

Roller derby is a fast-paced, full-contact sport played on roller skates. Players deliberately block their opponents’ progress and knock them off the track, and skate as fast as possible around the track to get back into a scoring position. Accidents can and do happen, resulting in broken bones and bruises (which we like to call ‘derby kisses’).

There are three ways in which roller derby applies the concepts of safety engineering in order to reduce the number and severity of accidents.

1. Prevention

The Women’s Flat Track Derby Association (WFTDA) has a 74-page rule book with many sections, sub-sections and sub-sub-sections. Whilst most of these are concerned with the mechanics of the game, such as how points are scored, a large number are aimed at making the sport safer. A good example is that you are not allowed to hit someone in the back, on the head or below the knees. In fact, if you fall over and someone else then trips over you, you have committed a ‘low block’ foul: not only do you suffer the physical pain of someone skating over you, you also have to serve a penalty. This means that when players fall they take care to ‘fall small’, curling up into a ball and pulling their hands and feet under their body to show the referees that they are not deliberately committing a foul.
These rules make the game safer by addressing the causes of accidents and requiring players to follow procedures which mitigate the problem. Prevention makes an accident less likely to occur. In the software world we have lots of ways in which we can reduce the likelihood of accidental problems, such as peer reviews, pair programming, test-driven development and continuous integration. These allow us to find our mistakes before they cause accidents in production.

2. Protection

Because roller derby is a full-contact sport, you have to wear protective equipment to help prevent injury, including pads for your knees, elbows and wrists as well as a helmet and mouth guard. Physical protection doesn’t reduce the likelihood of an accident, but it does reduce the likelihood of an injury resulting from an accident.
In the software world this is analogous to writing resilient code, which catches expected and unexpected error conditions and handles them gracefully. If an unexpected error occurs, your system should continue to function whenever possible.
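As a minimal sketch of what resilient code might look like, the Python below processes a batch of records and isolates each failure so that one bad record does not stop the rest. The names (ValidationError, parse_amount, process_all) are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
import logging

logger = logging.getLogger("batch")


class ValidationError(Exception):
    """An expected, recoverable problem with a single record."""


def parse_amount(raw):
    """Convert a raw string to a non-negative amount."""
    try:
        amount = float(raw)
    except ValueError as exc:
        raise ValidationError(f"not a number: {raw!r}") from exc
    if amount < 0:
        raise ValidationError(f"negative amount: {raw!r}")
    return amount


def process_all(raw_amounts):
    """Process every record, isolating failures from each other."""
    processed, failed = [], []
    for raw in raw_amounts:
        try:
            processed.append(parse_amount(raw))
        except ValidationError as exc:
            # Expected error: note it and carry on with the next record.
            logger.warning("skipping record: %s", exc)
            failed.append(raw)
        except Exception:
            # Unexpected error: keep the stack trace for diagnosis,
            # but still carry on with the next record.
            logger.exception("unexpected failure on %r", raw)
            failed.append(raw)
    return processed, failed
```

Returning the failed records alongside the successful ones, rather than silently dropping them, means the caller can still decide to retry or report them.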

3. Preparedness

Whenever a roller derby training session takes place there is always a trained first aider, just in case someone injures themselves. At a game there are a number of medical staff to deal with any injuries as quickly as possible. In the software world this is about designing code for live operations, not just for development. This could be as simple as ensuring that appropriate logging is in place and that automatic alerts are produced when particular error logs occur.
I investigated one system to find out why it wasn’t reliable in production. It used messaging extensively, and the developer had used the standard approach of sending messages to the dead message queue if they couldn’t be handled after 2 retries. This works well in development: if the system isn’t behaving as expected, you can look in the dead message queue and work out what has happened. In production, however, nobody is routinely looking at that queue, so failures can go unnoticed, which makes it hard to know that a problem has occurred and harder to diagnose what has gone wrong. The solution is simple: an automated alert on the dead message queue allows problems to be found early, potentially before they are reported by users.
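The idea of alerting on the dead message queue can be sketched as follows. This is an in-memory illustration, not the system described above: the queue is a plain deque, the function names are hypothetical, and the alert callback stands in for whatever paging or notification tool is in use. It also assumes that "2 retries" means one initial attempt followed by two further attempts.

```python
import logging
from collections import deque

logger = logging.getLogger("messaging")

MAX_RETRIES = 2  # assumption: one initial attempt plus two retries


def deliver(message, handler, dead_messages):
    """Try to handle a message; park it in the dead message queue
    if every attempt fails. Returns True on success."""
    for attempt in range(1 + MAX_RETRIES):
        try:
            handler(message)
            return True
        except Exception:
            logger.warning("attempt %d failed for %r", attempt + 1, message)
    dead_messages.append(message)
    return False


def check_dead_messages(dead_messages, alert):
    """Send an alert if anything is waiting in the dead message queue.
    Intended to be run on a schedule in production."""
    count = len(dead_messages)
    if count > 0:
        alert(f"{count} message(s) in the dead message queue")
    return count
```

In a sketch like this, check_dead_messages would be called periodically with an alert function that notifies the support team, so that a failed message is noticed without anyone having to look in the queue.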

Multiple Layers of Safety

So, in roller derby we reduce the likelihood of an accident by prevention. If, despite our prevention, an accident happens then we reduce the likelihood of injury through protection. If, despite our protection, injury occurs then we reduce the severity of the injury by being prepared to treat it as fast as possible. Safety engineering tells us that we need multiple levels of safety so that if an issue gets through one level it is likely to be stopped by another.

One day a developer came to me and told me that he had just accidentally processed a case on the live system, thinking that he was in the test environment. This was serious as the customer would be contacted about the choices ‘they’ had made online. Because this was raised quickly we were able to prevent the system from sending any customer communications and we were able to purge the accidental transaction.

The developer was upset and blamed himself for making this mistake. However, the real problem was that we only had one level of safety (knowing which system you were accessing) and this had failed. The live system looked the same as the test system and the development system, so we changed the development and test environments so that the background colour made it clear which system was being accessed (reduction in likelihood of an accident: prevention). In addition, the development and test systems used live data, meaning that real customers could be affected if a case was processed in the wrong environment. Using test data would have meant that no customer impact was possible (reduction in likelihood of injury: protection). Nowadays, whenever we are putting a system live, I ask myself whether we have multiple levels of safety, and so whether we are ready to face the world.
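The layers in this story could be sketched in code along the following lines. Everything here is hypothetical: the colours, the Case type, the explicit confirmation step (an extra preventive layer that was not part of the original fix) and the convention that test customers use the reserved example.com domain are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical colours; what matters is that each environment looks different.
BANNER_COLOURS = {"development": "green", "test": "amber", "live": "red"}


@dataclass
class Case:
    case_id: str
    customer_email: str


class LiveConfirmationRequired(Exception):
    """Raised when a live action is attempted without explicit confirmation."""


def banner_colour(environment):
    """Prevention: make it obvious which environment is in use."""
    return BANNER_COLOURS[environment]


def is_test_customer(case):
    """Assumption: test data uses the reserved example.com domain."""
    return case.customer_email.endswith("@example.com")


def process_case(case, environment, confirmed_live=False):
    # Prevention: refuse to act on live unless the user has confirmed it.
    if environment == "live" and not confirmed_live:
        raise LiveConfirmationRequired(
            f"confirm before processing {case.case_id} on live"
        )
    # Protection: even on live, never contact a synthetic test customer.
    notify = environment == "live" and not is_test_customer(case)
    return {
        "case_id": case.case_id,
        "environment": environment,
        "notify_customer": notify,
    }
```

No single check here is foolproof, which is the point: a mistake has to get past the colour, the confirmation and the test data before a real customer is contacted.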