Monitoring in a Post-Nagios World

Louis McCormack
Published in DevOps College
7 min read · Feb 23, 2018


Monitoring is changing. It has been a glacial, unheralded change, driven by attitudinal shifts and technological advances, but traditional monitoring systems such as Nagios are being usurped by metrics platforms and even machine learning. We are living in a post-Nagios world.

There will be those to whom that statement is achingly obvious, sun-tanned Bay Area pioneer-types. Then there’ll be those who’ve spent years building protectionist Nagios empires, glaring angrily at their screens. Most of us will recognise some truth in it, but will conclude that our monitoring is fine, thank-you-very-much. However, there is most definitely a vanguard taking a different approach.

Allow me to explain.

Traditional monitoring systems — Nagios, Xymon, Zabbix, Sensu et al — all share a common approach: they run checks on hosts and then report whether the results of those checks are OK or Not OK, good or bad, green or red (ignoring the murky yellow for now). By setting arbitrary thresholds on disk usage, CPU utilisation, memory pressure, etc., we can be woken up in the depths of night to ward off any disasters that might have been. At the cost of a few additional alerts, and maybe some sleep, we are generally able to maintain excellent uptime statistics.

And so, throughout the short life-span of our fledgling profession, it has always been. The Nagios family had firmly entrenched itself as the establishment of the old world.

But ours is an evolutionary profession, ideas and abstractions engender change and often the drivers behind change are the very things that enable it. In fact monitoring has changed little compared with the seismic shifts we’ve seen elsewhere: the tools may be different but the techniques are decades-old. Monitoring is the overgrown teenager about to blossom.

We can actually chart the demise of traditional monitoring using the signal markers that have characterised our industry over the last decade, beginning with the first, probably the single most impactful change of the millennium — The Cloud.

The Cloud presented a huge philosophical change.

Along with the tangible benefits cloud computing brought, a less obvious corollary was the inexorable attitude shift it engendered. Gradually it evoked (and continues to evoke) a mental shift away from the data centre. Previously hard-won knowledge of Spanning Trees, LUN alignments and ILOMs began to fade, like ageing war heroes, supplanted by millennial ideas such as immutable infrastructure and infinite scalability. Where once we cared for our servers, lovingly tucking their cables behind rack clips and meticulously replacing their creaking disks; now we shoot them in the head to save on the EC2 bill.

We had unwittingly embarked on the relentless pursuit of abstraction. Abstraction away from the spinning and buzzing, the earplugs, the cold-and-hotness and the draconian security measures of the data-centre. Like a careening juggernaut this trend continues, unsentimentally dismantling anything in its way.

The first inklings of an existential crisis were on the cards for Nagios. Its centralised file-based configuration model started to feel abstruse in this new world of transient hosts. Younger relations like Sensu, with an API for registering and de-registering fleeting servers, stepped in to bolster the old world order, and the teetering status quo managed to right itself.

Then along came containers.

The new prince of the realm is Kubernetes. Another abstraction layer — the nebulous Cluster — has been slapped atop the existing virtualisation layer. Furthermore, in almost all cases it makes very little sense to actually manage the Cluster oneself, no, just throw some containers at GKE or EKS and hope they stick. Naturally this means that we start to care even less about our servers. Our ‘hosts’, such as they are, swim around like fish in the proverbial barrel, to be shot at and re-animated by the omnipotent orchestration layer.

And so the very bedrock of the old monitoring establishment begins to creak and groan. All along they had been treating the host as the basic unit of monitoring, but the interminable quest for abstraction had driven a wedge between the host and what we actually care about — our services.

The abstraction gods are not yet appeased. Kubernetes may reign supreme, but it is a fragile rule, with one eye on the impending Serverless hordes, who care little even for containers. They’ll just take some code and run it, somewhere. With this the downfall of traditional monitoring tools will be complete. The only way they can be used in a Serverless world is in some sort of black-box capacity, poking and prodding from outside the circle.

Of course none of this means that Monitoring is going away. On the contrary, it has just evolved with the times. The demise of traditional monitoring tools can be charted diametrically against the rise of a different sort of tool — the time-series database. Again we can reach into the annals of recent history to trace this rise.

Etsy’s pithy edict

Aeons ago, in 2010, Etsy gifted the world statsd and Graphite. Amazingly, statsd (or at least the statsd protocol) is still widely in use today, as a means for aggregating metrics and sending them to a visualisation tool like Graphite. Before Graphite, the only graphs we had were RRDs (Round Robin Databases), graphs that looked as though they’d been hand-pixellated in the ’80s. Aside from any aesthetic complaints, they lacked something else — a query language. A way to aggregate and apply mathematical formulae to our metric data and, most importantly, key alerts off of it.
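Part of the statsd protocol’s longevity is surely its simplicity: metrics are fire-and-forget UDP datagrams of the form `<name>:<value>|<type>`. A minimal sketch, in which the host, port and metric names are illustrative assumptions:

```python
import socket

# statsd speaks plain UDP datagrams: "<metric>:<value>|<type>", where the
# type is "c" (counter), "g" (gauge) or "ms" (timing).
STATSD_HOST, STATSD_PORT = "127.0.0.1", 8125  # assumed local daemon

def emit(metric, value, mtype="c"):
    payload = f"{metric}:{value}|{mtype}".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (STATSD_HOST, STATSD_PORT))
    return payload

emit("checkout.completed", 1)          # increment a counter
emit("checkout.latency", 312, "ms")    # record a timing in milliseconds
```

Because it is UDP, emitting a metric can never block or crash the application being measured — a design choice that made instrumenting code essentially free.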

Again it was The Cloud that provided the catalyst for expansion, in its role as the great enabler: erstwhile hobbyists, open source heroes and budding enterprises now had access to freely available compute with (more recently at least) fast disks. Advances in time-series storage spawned OSS tools like InfluxDB and Prometheus, and third-party SaaS vendors like Datadog and Wavefront were able to build systems that could process a rate of metrics that would leave Graphite dribbling into its carbon relays. The introduction of tagged metrics and the enrichment of query languages made the data more useful.

Containerisation and Serverless brought a common bedfellow in the shape of Microservices, with all their promise and all their baggage. If we accept their benefits — smaller units of deployment, enablement of polyglot infrastructure, etc. — then we must concede their shortcomings — difficulty in discerning just what the bleedin’ hell is going on in there. We were effectively forced to re-evaluate our monitoring techniques. Metrics were our only real option, and we would need a lot of them.

Recently the term Observability has been coined, to differentiate the larger bulk of the information we collect from the small subset that we want to actively watch (i.e. Monitor). The triumvirate of Observability comprises logging, metrics and tracing. You may wish to alert on some of the measures that fall into the first two categories, but probably only a very small portion. The rest is passively gathered and exists as an investigatory aid in the event of one of those alerts being triggered (or for capacity planning, performance engineering, etc.).

This fork in the road is only achievable because of the evolving technologies described above, and it is only required because of those selfsame technologies. In her excellent and seminal article, Cindy Sridharan argues eloquently that ‘the ideal number of signals to be “monitored” is anywhere between 3–5, and definitely no more than 7–10’. To suggest that 10 years ago would’ve been laughable, and is testament to how far we’ve come.

But before you go ahead and rip out your Nagios or Sensu clusters and replace them with Prometheus or Wavefront, I must offer a disclaimer: most of the above is written with tongue firmly wedged in cheek. The arguments hold true for buzz-word architectures; serverless-containerised-microservices. Naturally there is an increasingly sizeable cohort running such models, but equally there is a much larger long tail. Some of us aspire to such things, and probably have a mixture of ‘new-wave’ and traditional systems.

Trying to monitor traditional systems (let’s say those where you do still care somewhat about your hosts) using solely metrics is not without its difficulties. How do you know, for example, whether an EC2 instance has stopped sending metrics because it has died, or because it has been legitimately removed by a scale-in event? There is also a certain amount of soul-searching involved in unwinding years of conditioning, for you and your team.

But the winds of change are blowing, new paradigms require new approaches, and the future looks ever more distinguishable from the past. It may be time to take a different approach.
