The first week of March in 2017 will be remembered as the time that AWS (Amazon Web Services) failed. The actual failure was in the Amazon Simple Storage Service (S3), but to the world in general, if your stuff was running in the Amazon cloud, it was not working.

Amazon provided a very complete write-up of what happened, which basically boiled down to this: someone made a mistake that caused a cascading failure, and several systems had to be restarted to get S3 back up and running. Amazon is making some changes (read: sanity checks) to its systems to prevent this type of problem in the future.

Within 24 hours, I started receiving advertising emails from companies asking whether we had suffered from the Amazon outage and whether we would like to look at them to prevent this from ever happening again. In Yiddish, we would call this chutzpah (audacity).

Along with my security prime directive “There is no such thing as perfect security,” the corollary system rule is “There is no such thing as a perfect system.” I am a firm believer that given enough time, every system will eventually fail. What we need to work on is the ability to detect the failure and then recover quickly. We should always learn from the failure and build controls into the process to prevent the same failure in the future.

At Columbia, we use a system written in-house that looks for compromised systems and automatically takes them off the network. In addition to using behavior analysis, we get feeds from various places that help us find systems communicating with known bad actors.

Back in 2015, one of these feeds sent us a false positive and, as a result, our system mistakenly took hundreds of computers off the network. Now, in this case, although it was not a local mistake that caused the failure, to the owners of the computers it was a local failure. As a result of this, we added a circuit breaker into our system that will pop if hundreds of systems instantly show up with exactly the same problem. Sometimes, even the best automated system requires a human.
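
To make the circuit breaker idea concrete, here is a minimal sketch in Python. It is not the code we run; the function name `plan_quarantine`, the `(host, reason)` pair format, and the threshold of 100 are all assumptions made up for illustration. The idea is simply that if one reason suddenly accounts for too many hosts, those hosts are held for a human instead of being removed automatically.

```python
from collections import Counter


def plan_quarantine(detections, max_per_reason=100):
    """Split one batch of detections into hosts to quarantine and hosts to hold.

    detections: list of (host, reason) pairs (a hypothetical format).
    max_per_reason: assumed threshold; if a single reason flags more hosts
    than this in one batch, the breaker trips for that reason.
    """
    # Count how many hosts were flagged for each reason in this batch.
    counts = Counter(reason for _, reason in detections)

    # Any reason over the threshold is suspicious in itself: hundreds of
    # machines rarely develop exactly the same problem at the same instant.
    tripped = {reason for reason, n in counts.items() if n > max_per_reason}

    # Hosts flagged for a tripped reason wait for human review; the rest
    # are handled automatically as usual.
    quarantine = [host for host, reason in detections if reason not in tripped]
    held = [host for host, reason in detections if reason in tripped]
    return quarantine, held, tripped


if __name__ == "__main__":
    # 300 hosts flagged by one feed entry, plus one unrelated detection.
    batch = [(f"host{i}", "feed:bad-ip") for i in range(300)]
    batch.append(("lab-pc-7", "behavior:port-scan"))
    quarantine, held, tripped = plan_quarantine(batch)
    print(len(quarantine), len(held), sorted(tripped))
```

In this sketch the unrelated detection is still acted on automatically, while the 300 look-alike detections wait for a person to decide whether the feed or the computers are at fault.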

While I believe that it is essentially impossible to build an effective security system without at least some automation, my experience tells me that any system needs to have sanity checks built in and, at the very least, a way for humans to override the process. These fail-safes can take the form of whitelists, circuit breakers or a simple on/off switch.
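
The two simplest of these fail-safes, a whitelist and an on/off switch, can be sketched in a few lines of Python. The names `Overrides` and `should_quarantine` are hypothetical, and a real system would keep these settings somewhere an operator can change them quickly; this only shows where the human override sits in the decision.

```python
from dataclasses import dataclass, field


@dataclass
class Overrides:
    """Human-controlled settings that the automation must always respect."""
    enabled: bool = True                          # the on/off switch
    whitelist: set = field(default_factory=set)   # hosts never removed automatically


def should_quarantine(host, overrides):
    """Return True only if automation is allowed to act on this host."""
    # Switch is off: a human has paused all automatic action.
    if not overrides.enabled:
        return False
    # Whitelisted hosts (for example, critical servers) always go to a human.
    if host in overrides.whitelist:
        return False
    return True


if __name__ == "__main__":
    overrides = Overrides(whitelist={"dns1"})
    print(should_quarantine("pc1", overrides))
    print(should_quarantine("dns1", overrides))
    overrides.enabled = False
    print(should_quarantine("pc1", overrides))
```

Checking the switch first is a deliberate choice in this sketch: when someone turns the automation off, nothing else should get a vote.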

These simple cautions would ruin the exciting scenes of many an action movie (the scene opens on the hero crawling through the tunnel where the lasers are set to vaporize any rodent, ignoring the fact that rats are at least 100 times smaller than a person). In real life, you would want your anti-rodent security system to be able to tell the difference between people and mice: a sanity-check subroutine.

As far as I can tell, security threats are not going away, or even decreasing, any time soon. Our best security tools will require smarter and smarter automated systems to keep up with the sheer number of attacks, even though many of these are designed to distract you from the real attacks hiding in the noise. Make sure that any system deployed has the intelligence behind it to tell the difference between a mouse and a person.
