As the Operations lead, I find myself pondering what the difference is between good software and good operational software. The software development team here at NSIDC is a sharp group; they know good software when they see it. Further, many of them know software that is not good operationally when they see it. But, as a technical group, I don’t think we’ve nailed down a set of features we can use as a benchmark for “this makes the software operational.” As far as I can tell, these things vary depending on the project, and, like science fiction, “good operational software is best described by pointing at it.”
That being said, by way of getting some ideas out there, there are three things that I have noticed I consistently point at and say, “that makes this software operational.”
First is good logging. Logging is one of those things that has gotten easier over the years; there are many different logging libraries that make it easy for developers to have all the logs they could want. However, all of these logging libraries have the same problem: they write logs for developers, but after the product is in production, developers are the last people to look at the logs. I have noticed that in many cases the software itself is working as designed, but the run fails due to external issues: the data has changed, the user is doing something unexpected, or there is an infrastructure issue.
In these instances, instead of dumping a stack trace, what is needed is to describe the context and problem in (something resembling) English. I usually point to UNIX tools as primary examples of this in practice. The ‘cp’ command doesn’t give a stack trace when it can’t read the source file; it responds with the curt error, “No such file or directory”. While this is a simple example, there are many more complicated examples: the Apache HTTP Server, the Murmur server, and even the Linux kernel itself.
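To make the ‘cp’ comparison concrete, here is a minimal sketch in Python of what I mean. It is not code from any NSIDC project; the logger name "ingest" and the functions describe_os_error and read_input are hypothetical names made up for illustration. The idea is that the operator gets one plain sentence at ERROR level, and the stack trace is still available at DEBUG level for the developer who eventually needs it.

```python
import logging

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("ingest")


def describe_os_error(action, path, exc):
    """Turn an OSError into a one-line message an operator can act on."""
    # strerror holds the operating system's own wording when it is available.
    reason = exc.strerror or str(exc)
    return f"cannot {action} '{path}': {reason}"


def read_input(path):
    """Return the file contents, or None after logging a readable error."""
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except OSError as exc:
        # The sentence for whoever is on call.
        logger.error(describe_os_error("read input file", path, exc))
        # The stack trace, kept out of the way unless DEBUG is enabled.
        logger.debug("details for developers", exc_info=True)
        return None
```

With logging configured at the default level, a missing input file produces a single line naming the action, the path, and the reason, much like ‘cp’ does, rather than a traceback.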