System Monitoring

 
Monitoring Servers


As noted above – to be sure that your mission-critical system is working, you need to monitor its health 24×7. And – no surprise here – there are tools which help in this regard. While I myself am guilty of designing my own monitoring tool for what has since become a billion-dollar business – I will be the first to say that these days, doing so would be quite silly.

Requirements

First, let’s take a look at “what do we want from system monitoring?”. Without going into unnecessary details, we can safely say that:

  • We will want to monitor at least a hundred parameters per server, at intervals of at most 5 seconds.
    • This means that for a hundred servers, we’ll need about 2000 New Values Per Second (NVPS) written (see the quick arithmetic right after this list).
    • On the other hand, we do NOT have a requirement of strict Durability (i.e. even if some monitoring data gets lost once in a long while – it certainly won’t be the end of the world).
  • We DO need to monitor app-level parameters. As a result, custom monitoring is a must.
  • We DO need an ability to raise hell when something goes wrong. This includes two components:
    • Detecting that something goes wrong (at least simple thresholds)
    • Ability to notify support/admins/…
  • We DO want to monitor log files (both those written by our apps and those written by OS/DB/3rd-party apps), to parse them – and to raise hell if something goes wrong.
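
Just to make the arithmetic behind the 2000 NVPS figure explicit, here it is as a trivial Python sketch (the numbers are the ones from the list above; plug in your own):

```python
# Back-of-the-envelope NVPS estimate from the requirements above.
params_per_server = 100    # parameters monitored per server box
poll_interval_sec = 5      # worst-case (most frequent) polling interval
num_servers = 100          # server boxes in the deployment

nvps_per_server = params_per_server / poll_interval_sec   # = 20
total_nvps = nvps_per_server * num_servers                 # = 2000
print(f"{total_nvps:.0f} New Values Per Second")           # -> 2000
```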

What to Monitor?

Now, we need to decide “what to monitor?”. This is not that difficult a question; actually –

We need to monitor EVERYTHING we can afford to monitor.

This includes both very traditional monitoring things such as:

  • Data from your network devices, such as the number of packets traveling via a switch Ethernet interface, the number of corrupted packets over the interface (if it is non-zero – most likely we’re speaking about a partially-broken Ethernet cable), and so on. This kind of data is usually collected via SNMP (“an Internet-standard protocol for collecting and organizing information about managed devices on IP networks” — Wikipedia).
  • Data from your Server Boxes. Includes CPU usage, amount of committed RAM, amount of free space on your drives, etc. etc. etc. May be collected either via SNMP, or directly via a plugin-provided-by-your-monitoring-system.
    • One very important thing to monitor is the health of your RAID drives (if a drive fails, or is indicated as “going to fail soon” – it needs to be replaced ASAP).
  • Data from your DBMS(s). As the database is an absolutely critical part of your setup – monitoring it is also absolutely critical. This kind of data is usually collected via some kind of plugin-provided-by-your-monitoring-system.
  • Data from your own apps(!). This is one part which is very often overlooked by admins. However – to make sure that our software does work as intended – we do need to monitor quite a few parameters of our own (which is BTW very consistent with operating under the DevOps model – with both developers and admins having joint responsibility for the system). To do it – you’ll need to write your own agents (fortunately – usually you can write a simple script or module for your-monitoring-system’s-agent; a minimal sketch of such a script follows a bit below, after this list). In particular, the following parameters are of extreme importance:
    • For (Re)Actor-based systems:
      • idle (waiting-for-event) time per thread and per-(Re)Actor. As each (Re)Actor can use only one thread – reaching 100% load means a sharp increase in latencies1 – but this is avoidable if you monitor your system closely enough.
      • Sizes of incoming queues for (Re)Actors.
    • For massively-multithreading systems:
      • Number of threads per type of requests.
      • Time spent under each of your mutexes. As an overloaded mutex has even worse effects2 than using 100% of the core for a (Re)Actor – we need to monitor all of them very closely too.
    • For all systems: whatever-performance-related-data-your-app-is-able-to-share. Some examples include:
      • Current number of players per game type (region, etc. etc.)
      • Per-request-type (per-message-type, etc.) processing times.
      • Per-app memory usage
      • Current app-level cache sizes.
      • etc. etc.
    • Business-level real-time data (sic!). This is one field which is even more frequently overlooked than app-level data. Still – it is really important to know that a 10-minute average of your drop-out rate from a certain form has risen to 100% (and if you, in the spirit of Continuous Delivery, have just deployed a new version of this form – it may indicate that there is an outright bug there, to be fixed ASAP and even sooner). Monitoring business-level data, just like app-level monitoring, requires writing your own agents (scripts, modules, etc.).

1 Though surprisingly, there is a substantial reserve for a (Re)Actor operating at 100% load – this happens due to avoiding some of the context switches in this mode, which in turn leads to better-than-linear(!) scalability around this point.
2 Unlike (Re)Actors, which tend to save on context switches under heavy load, mutex-based systems tend to cause many more context switches under the same scenarios – which often leads to highly non-linear dependencies.
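
To make the “write your own agents” point a bit more concrete, here is a minimal sketch of such an app-level monitoring script. Everything specific in it – the stats-file location, the JSON key names, and the overall mechanism (the app periodically dumping its internal stats into a local file, and the monitoring agent calling this script to pull one value at a time) – is an assumption for illustration; your app and your monitoring system of choice may well exchange this data differently (a local HTTP endpoint, shared memory, etc.).

```python
#!/usr/bin/env python3
# Minimal "user parameter"-style script: reads one app-level stat and
# prints it, so a monitoring agent (Zabbix UserParameter, Zenoss command
# datasource, Nagios plugin, ...) can pick it up.
# NB: the file path and key names below are hypothetical.
import json
import sys

STATS_FILE = "/var/run/mygame/app_stats.json"  # written by the app itself

def read_stat(key):
    with open(STATS_FILE) as f:
        stats = json.load(f)
    # e.g. key == "reactor.incoming_queue_size" or "reactor.idle_percent"
    return float(stats[key])

if __name__ == "__main__":
    print(read_stat(sys.argv[1]))
```

The same approach works for the business-level metrics mentioned above – it is only a question of which numbers the app chooses to export.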

 

On Storage: to SQL or not to SQL?

Now, we can discuss the Big Fat Ugly Question™ of the storage system for our monitoring system; after all – it should be able to handle those thousands of NVPS we estimated above.

Since time immemorial,3 I have been an opponent of using SQL databases to store monitoring data (more generally – time series data). There are several reasons for it:

  • As a Very Big Fat Rule of Thumb™, monitoring data doesn’t really require SQL access (and specialized stuff such as aggregates is much more important for monitoring analysis than ad-hoc SQL queries). For monitoring data, all the views are very simplistic, and show merely some-timeframe-from-the-dataset (nothing else).
    • As a result, SQL benefits pretty much disappear when it comes to monitoring data.
  • Monitoring data doesn’t really require ACID transactions. As noted above – we do NOT need Durability, and if some monitoring data is lost once in a blue moon – so be it. This is a part of a more general observation that we are not going to stop our production just because monitoring is down/not coping with the load/crashed/etc.
    • As a result, another benefit of a proper SQL DBMS doesn’t apply to monitoring data.
  • On the negative side – overheads of an SQL DBMS, when applied to time series data, are high.
    • In particular, the overheads of transactions (which are not necessary for time series – see the note on Durability above), of doing the same work twice (first writing into the DB log, and then into data files), and of updating indexes (which are necessary in an SQL DB, but aren’t really needed when using plain files, or in a specialized time series DB) hurt SQL DBs pretty badly.
    • One real-world example: in [Shaw], the author brags about achieving 10K NVPS (New Values Per Second),4 using two twelve-core 2.4GHz boxes with 48G RAM total (and that’s not accounting for proxies). Well, I have to say that some 15 years ago, I saw a plain-file-based monitoring system working off one single box with 2×1.5GHz cores, 256M RAM5 and a BBWC RAID card – and able to process 5K NVPS; while this is 2x less than the numbers in [Shaw], it ran on hardware which is about 20x-50x less powerful, so actually we’re speaking about a 10x-25x performance penalty for using SQL DBs to store time series (a toy sketch of such plain-file storage follows right after this list).
    • As a result – given the choice, I strongly prefer non-SQL-DB-based monitoring systems to SQL-DB-based ones.
    • OTOH – it is possible to scale at least some of the SQL-DB-based systems; in spite of huge overheads (we’re speaking about a 10x-25x(!) difference), modern computers can handle even such inefficiencies, while not being prohibitively expensive (and as we need just one such server for a hundred other boxes – it won’t reflect that badly on your bottom line).
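
To make the overhead argument a bit more tangible, below is a toy sketch (referenced from the [Shaw] bullet above) of the kind of storage a plain-file-based monitoring system can use: one fixed-size record appended per value, no indexes, no transaction log, and reads are just a sequential scan of a time range. The file naming and record layout are made up for the example, and this is emphatically not an encouragement to roll your own monitoring storage – merely an illustration of why plain files (and specialized time series DBs) are so cheap compared to a general-purpose SQL DB.

```python
# Toy append-only time-series store: one file per (host, parameter).
import struct
import time

RECORD = struct.Struct("<Id")   # 4-byte unix timestamp + 8-byte double

def append_value(path, value, ts=None):
    with open(path, "ab") as f:              # append-only, sequential I/O
        f.write(RECORD.pack(ts if ts is not None else int(time.time()),
                            float(value)))

def read_range(path, ts_from, ts_to):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            ts, value = RECORD.unpack(chunk)
            if ts_from <= ts <= ts_to:        # records are already time-ordered
                yield ts, value

# e.g. one value of "cpu_idle" for host42 (hypothetical file name):
append_value("cpu_idle.host42.tsd", 87.5)
```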

All that being said – modern monitoring systems are way too often SQL-DB-driven (with the only modern exception I know of being Zenoss), so you might have to use an SQL back-end. OTOH – I certainly want a monitoring system which has demonstrated scalability, at least along the lines of [Shaw]. Without such demonstrated scalability – I’d be extremely suspicious of SQL-DB-based monitoring tools (there are just soooo many ways to get it badly wrong with an SQL DB…).

Note that my rants about DBs for monitoring actually apply ONLY to traditional SQL-based DBs, and do NOT apply to specialized time series DBs such as RRDTool or OpenTSDB (the latter has been reported to scale to about 100K writes/second per server box).


3 At least for 10 years now
4 keep in mind that in the spirit of DevOps, you’ll want to monitor on the order of 100 values per server box (including lots of custom app parameters) – at intervals of at most 5 seconds, so 10K NVPS corresponds to a few hundred servers
5 which was a significant amount back then.

 

Which Monitoring System to Use?

Currently, there is more than one system which can be used for monitoring. In particular, the following systems are especially popular:

  • Cacti. A dinosaur (just like me <wink />) which was the most popular monitoring tool 10 years ago. Cacti still works, but IMO is a bit too data-centric (rather than task-centric); also it is losing popularity to newer monitoring systems. On the positive side, Cacti uses RRDTool (rather than an SQL database) as a back-end – which is IMNSHO a Good Thing™ (as discussed above).
    • My take: I’d rather not use Cacti for new deployments, due to better alternatives such as Zenoss.
  • Nagios Core and Nagios XI. As of 2017, Nagios is probably the most popular monitoring system out there. Much more task-centric than Cacti (in particular, alerts are much more integral to Nagios than to Cacti). While Nagios Core can work without a database – Nagios XI requires MySQL (which can easily hurt your scalability – so it is not a coincidence that scalability is among the top complaints about Nagios out there).
    • My take: I don’t recommend using Nagios, at least if you’re planning to go to several dozen servers (and for a hundreds-of-thousands-players game – you should plan for at least that many).
  • Zabbix. While Zabbix, just like Nagios XI, is SQL-database-driven, it seems that the Zabbix guys did their homework to optimize performance; while still lagging behind the SQL-less solutions (see below) – there are believable experiments (such as [Shaw]) which demonstrate that Zabbix does scale, at least to some extent. On the positive side – I’d mention that I like the Zabbix APIs for custom monitoring (a minimal sketch of pushing a custom value into Zabbix follows right after this list).
    • My take: if starting a new project, Zabbix would be my 2nd choice (after Zenoss, see below). Even simpler-than-Zenoss custom monitoring cannot outweigh my very serious dislike of using a generic SQL DB for storing monitoring time series data.
  • PandoraFMS. PandoraFMS is a very direct competitor to Zabbix, and seems to be very similar feature-wise too. However, I am rather concerned that the line between PandoraFMS Community (which is free), and PandoraFMS Enterprise (which costs $$) is blurred.
    • My take: Overall, I prefer Zenoss; and if looking for an SQL-DB-based monitoring tool – then, due to the blurred licensing of PandoraFMS and the demonstrated scalability of Zabbix, I’d rather use Zabbix.
  • Zenoss. Zenoss provides the option to use either RRDTool or OpenTSDB (both being specially optimized for time series, and not suffering from traditional SQL DB performance overheads). On the negative side – to add our own app-level monitoring, we’d need to write a ZenPack – which is substantially more involved than writing a simple script.
    • My take: in spite of the more complicated custom app-level monitoring, I still prefer Zenoss (with either RRDTool or OpenTSDB as a back-end) over any other monitoring system.
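
As promised in the Zabbix bullet above – if you do end up with Zabbix, one of the simplest ways to get app-level values into it is the stock zabbix_sender utility (the item has to be pre-created on the Zabbix server as a “Zabbix trapper” item). A hedged sketch follows; the server name, host name, and item key are purely hypothetical.

```python
# Push one app-level value into Zabbix via the stock zabbix_sender tool.
import subprocess

def push_to_zabbix(server, host, key, value):
    subprocess.run(
        ["zabbix_sender",
         "-z", server,       # Zabbix server (or proxy) to send to
         "-s", host,         # host name exactly as configured in Zabbix
         "-k", key,          # trapper item key
         "-o", str(value)],  # the value itself
        check=True,
    )

# e.g. current number of players for game type "poker" (hypothetical key):
push_to_zabbix("zabbix.internal", "game-server-01", "app.players[poker]", 1234)
```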

[[TODO: take a look at Prometheus/InfluxDB, TICK; also open-source Graphite, Grafana, Icinga, and libreNMS]]

[[TODO: push vs pull]]

[[TODO: HA for monitoring is a Big Fat Overkill(tm) most of the time]]

Application Logs

Application logs are yet another source of extremely valuable data; for example, if your app writes assertion failures into the log and keeps running afterwards6 – you want to know about it ASAP. As a result – logs do need to be integrated into your monitoring system.

When processing logs, we actually have two distinct tasks. First, we need to store the logs somewhere (centralized storage for logs is not strictly required, but usually we do want it, for several reasons). This can be done either by some kind of script – or by using syslogd.

Second – we need to parse those logs in real-time – and to report certain events from the logs to our monitoring system (raising an alert if applicable). This is one of those jobs a good monitoring system is supposed to do for us.
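
Just to illustrate the idea (as noted, a good monitoring system is supposed to do this for us, e.g. via its log-monitoring items or a syslog integration), here is a minimal sketch of tailing an app log and reporting “bad” lines; the log path, the pattern, and the report_event() stub are all placeholders.

```python
# Tail an application log and report lines matching an "alert" pattern.
import re
import time

LOG_PATH = "/var/log/mygame/app.log"               # hypothetical path
BAD_LINE = re.compile(r"ASSERTION|FATAL|stack trace")

def report_event(line):
    # stub: forward to your monitoring system (zabbix_sender, SNMP trap, ...)
    print("ALERT:", line.rstrip())

def tail_and_alert(path):
    with open(path, "r") as f:
        f.seek(0, 2)                   # start at the end of file ("tail -f")
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)        # nothing new yet
                continue
            if BAD_LINE.search(line):
                report_event(line)

if __name__ == "__main__":
    tail_and_alert(LOG_PATH)
```

Note that this toy version doesn’t handle log rotation – one more reason to let a mature monitoring system or a syslogd-based pipeline do the job.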


6 this is pretty common for (Re)Actors, especially in the VALIDATE and CALCULATE stages of the VALIDATE-CALCULATE-MODIFY-SIMULATE pattern, discussed in Vol. II’s chapter on (Re)Actors.

 

What to Do with All This Data?

After we’ve got all this data stored – we need to do something with it. The two most common uses for monitored data are graphs and alerts.

This is where the whole power of a 3rd-party monitoring system comes in. Implementing nice-looking graphs, as well as an alert-definition language and alert actions (such as “send SMS”), is quite time-consuming (and is not really necessary, as monitoring tool developers have already done it for us).

What you’ll really need process-wise is two things:

  • Somebody from your team looking at all those graphs on a regular basis, looking for unusual patterns and for potential bottlenecks – and setting rules and alerts for those parameters which may cause trouble.
    • Of course, whenever you run into a bottleneck which brings your system to a crawl – you will add an alert too, but it is always better to be proactive, and at least try to identify likely bottlenecks in advance.
  • Alerting your team that “something went horribly wrong” – i.e. raising hell. How to “raise hell” properly heavily depends on the way your support team normally operates – and on the nature of the problem. First of all, there are “issues which can wait until developers are back on the job”, and “the whole damn thing is down” ones. For the former – a simple internal e-mail will do, but for the latter – you need to be very aggressive in delivering your alerts; here, there are two distinct scenarios:
    • If you happen to run a serious 24×7 support team anyway – then the only thing you really need to do is push an e-mail with the highest-priority status into the support e-mail queue – and make sure that there is a procedure for support people to follow (essentially – in what order to call people). We’ll discuss e-mail queues (ticket systems, etc.) in Chapter [[TODO]].
    • Otherwise – it is better to send the same message over several available channels:
      • For root’s sake, don’t bother with a detect-that-the-notification-service-is-down-and-use-a-backup-channel routine – just send the same alert over all your channels in parallel – and to several people too (the whole system is down, and there is no such thing as an unnecessary notification about it).
      • For this kind of alert – I advocate the use of three different channels for each of your admins-to-be-notified – (a) e-mail, (b) SMS,7 and (c) app-based notification (usually – using a 3rd-party service such as Pushover); a sketch of such a multi-channel fan-out follows right after this list. Note though that while these channels are somewhat independent on the receiver’s side – all of them still rely on your datacenter having an Internet connection (which is itself a SPOF – see below).
      • Keep in mind that your monitoring server itself can be down at the very moment when it should have raised the alarm. OTOH, for most of the games out there – it is not too big a deal (for this to happen, we need two faults – first, the monitoring server has to crash, and then – the system has to experience severe problems, and all this within a few hours – which is quite unlikely8). Of course, you still need to check that your monitoring is working, at least once per working day.
      • Also, keep in mind that the Internet connection of all your servers in a datacenter can be down at once. Unlike the above scenario of the monitoring server failing at the same time as the system goes down – this can easily be a result of a single fault (Internet-is-down will cause your game to go down too). Still, if your ISP is good,9 they should detect such an outage themselves. On the other hand – you may want to set up an external service (in a different datacenter) pinging your servers and raising hell completely independently if something goes wrong.
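
To illustrate the “send the same alert over all your channels in parallel” point from the list above, here is a sketch of such a fan-out. The SMTP host, the SMS-gateway URL and its parameters, and the Pushover application token are all placeholders/assumptions – substitute whatever your actual providers expose (e.g. OVH’s SMS API for channel (b)).

```python
# Fan out one "the whole damn thing is down" alert over all channels in
# parallel, to all admins. Requires the third-party "requests" package.
import smtplib
import threading
from email.message import EmailMessage

import requests

def send_email(to_addr, text):
    msg = EmailMessage()
    msg["Subject"] = "[ALERT] the whole damn thing is down"
    msg["From"] = "monitoring@example.com"
    msg["To"] = to_addr
    msg.set_content(text)
    with smtplib.SMTP("smtp.example.com") as s:    # placeholder SMTP relay
        s.send_message(msg)

def send_sms(phone, text):
    # hypothetical HTTP SMS gateway (OVH and others provide such APIs)
    requests.post("https://sms-gateway.example.com/send",
                  data={"to": phone, "text": text}, timeout=10)

def send_pushover(user_key, text):
    # Pushover-style app notification; token/user are placeholders
    requests.post("https://api.pushover.net/1/messages.json",
                  data={"token": "<app-token>", "user": user_key,
                        "message": text}, timeout=10)

def raise_hell(admins, text):
    threads = []
    for admin in admins:
        for send, arg in ((send_email, admin["email"]),
                          (send_sms, admin["phone"]),
                          (send_pushover, admin["pushover"])):
            t = threading.Thread(target=send, args=(arg, text), daemon=True)
            t.start()
            threads.append(t)
    for t in threads:
        t.join(timeout=30)   # don't wait forever on a dead channel
```

Note that all the channels fire in parallel and nobody waits on a dead one for more than a few seconds – in line with “don’t try to be smart about which channel is down”.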

7 You’ll need to use a 3rd-party service to send SMS; one example of such a service is OVH
8 Of course, it is possible that a fault in the system has caused a flood of information, causing your monitoring server to crash. Still, as we’re not speaking about monitoring a nuclear station or something – we may have the luxury of ignoring this scenario as “quite unlikely”
9 and for the sake of your players, it had better be good

 

Bottom Line

TL;DR on your monitoring system:

  • While it is not rocket science – it will take quite a bit of time to set up.
  • You DO want to monitor everything, including:
    • Network devices
    • Servers
    • DBMS
    • Your own apps
    • Business-level metrics
    • Application logs (both from your own apps and from 3rd-party apps)
  • You DO need a centralized system to do monitoring.
  • Even I (a well-known fan of DIY) don’t think that we should implement a monitoring system ourselves.
  • When choosing an existing monitoring system, I prefer:
    • My 1st choice: Zenoss, or
    • My 2nd choice: Zabbix
  • After you’ve got your shiny new monitoring system with all the data, you should use it proactively. In particular:
    • You MUST set alerts (and update them every time your system manages to slow down)
    • You MUST spend time studying graphs on a regular basis, to identify likely bottlenecks.

[[To Be Continued…

This concludes beta Chapter 25(c) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”.

Stay tuned for beta Chapter 26(a), where we’ll start discussing helpdesk and back-end tools]]


Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Comments

  1. Ling Zhao says

    For the part of “Data from your own apps”, I think it is beneficial and also critical to monitor individual app’s memory usage. It certainly will help to detect memory leaks.

  2. Michal says

    If you like RRDTool then take a look at Graphite (also usually accompanied by collectd for data gathering and Grafana for visualizations and alerts since v4.2.x).

    Also worth a try are Icinga and libreNMS.

    • "No Bugs" Hare says

      Well, yes – but as the book isn’t really about monitoring – I need to stop somewhere 🙂 (and avoid overwhelming readers with too many options). I will take another look, but at the moment I have no idea which of the currently-not-listed systems will make it into the printed version.

      EDIT: On the other hand, indeed at least one free/completely-open-source solution certainly deserves to be listed.

  3. Vasilis says

    Prometheus or InfluxDB for time series, rules, and alerts, with Grafana for UI and alerts, is a good open-source solution that is used a lot.

  4. says

    A little note about SQL backends for storing monitoring time series – TimescaleDB uses PostgreSQL as a backend, and is able to achieve 100k’s of NVPS, as shown in this article:

    https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/

    Though, even a simple “do-it-yourself” time-series database may outperform SQL-based ones in simple tasks (and this is the case for monitoring) – an interesting comparison is here: http://luca.ntop.org/tsdb.pdf . I personally saw a similar system which demonstrated performance of about 2 million NVPS being run on a laptop, with disk performance being the bottleneck (that was designed for IoT, not server monitoring, but the case is quite similar). This is not an encouragement to roll your own monitoring system, but rather a kind of performance assessment – if a sophisticated DB runs significantly slower than a naive file-based system, there must be something wrong.

    The real problem is that the performance of “naive” SQL solutions degrades quickly when the tables start having too many rows, say tens of millions. Sophisticated SQL-based storage systems, like Timescale, overcome this by storing data in large chunks, which reduces the cardinality.

  5. says

    An interesting article with comparison of currently popular time-series monitoring tools:

    https://arxiv.org/abs/1901.08304

    They describe a specific data set and queries, with which you can comparatively measure the performance of a time-series DB. Also, TimescaleDB and InfluxDB should seriously be considered as having the best performance today.
