Monitoring system - request for input

I'm working on a new monitoring system because I can't find one that solves enough of my problems. It's going to be free and have an unrestricted open source license.

I could use your help.

At this stage, the best way you can help is to make sure I get lots of data about various infrastructure architectures and about monitoring, reporting, and alerting needs.

If you can, please share with me the following:

  • A description (or diagram) of your infrastructure including network, servers, services, storage, etc.
  • What you are using now for monitoring (can be any number of tools)
    • Why you like them
    • Why you don't like them
    • What you'd rather have, if anything
  • What tools are missing that you wish existed?
  • Would more documentation on monitoring, in general, help?
  • Do you carry a pager? If not, why not? If so, what does it support? (email, sms, html email, mobile web, normal web)
  • Would more documentation help?
    • Better documentation on how to monitor the things you need to monitor, such as best practices, better tool docs, etc?
    • Best practices for monitoring various scenarios?

If you have any thoughts, please email me at [email protected]. I'll be collecting this data into my design document, which you can view in unfinished form here: Current Design Doc

Munin doesn't scale by default.

I just started playing with Munin as a potentially better option than Cacti (hard to automate) for trending. I have about 30 hosts being watched by Munin. The Munin update job (which fetches data, regenerates graphs, etc.) runs every 5 minutes by default. It takes almost 4 minutes to run on a 2.5 GHz host. If we add many more things to monitor, it's likely that we'll soon overrun the 5 minute interval.
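
As a rough back-of-the-envelope sketch, and assuming the update runtime grows linearly with the number of hosts (an assumption, not something measured here), the headroom left in the 5 minute interval can be estimated like this:

```python
# Rough capacity estimate for the Munin update job.
# Assumption (not measured): runtime scales linearly with host count.
INTERVAL_SECONDS = 5 * 60   # default update interval
RUNTIME_SECONDS = 4 * 60    # "almost 4 minutes", rounded up
HOSTS = 30                  # hosts currently being watched


def max_hosts(interval, runtime, hosts):
    """Largest host count whose estimated runtime still fits in the interval."""
    seconds_per_host = runtime / hosts
    return int(interval // seconds_per_host)


print(max_hosts(INTERVAL_SECONDS, RUNTIME_SECONDS, HOSTS))
```

Under that assumption the job tops out in the high thirties of hosts, so only a handful more fit before runs start to overlap.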

Examining the process, it looks like most of the time is spent generating graphs. Every graph displayed on the munin webpages is regenerated every 5 minutes whether or not anyone looks at the graphs. This can't scale.

There is an option you can give in your munin.conf:

graph_strategy cgi
This, and a few other changes, will make munin skip the graph prerendering. If I set the graph_strategy to cgi, the runtime drops to 28 seconds, most of which is spent generating the static HTML for the munin web interface - even if no one looks at it.

Really, though, this is 2009: static html, really? Sarcasm aside, dynamic generate-on-the-fly webpages are basically the standard these days. Munin needs a better frontend that isn't static HTML.

Among other oddities, it doesn't seem like you can time travel. Your default graph options are today, this week, this month, this year. Sometimes yesterday, last week, last month, etc, are useful, not to mention the other odd views like 36 hour, 6 hour, etc.

Querying temperature in Windows

It occurred to me tonight that I didn't have a good way to query temperatures from a Windows box. I'd used GUI tools to do it before, but that doesn't really lend itself to automation and monitoring.

The default SNMP configuration in XP doesn't export temperature (at least not anywhere I saw). I knew SMART had temperature information, but that wasn't CPU temperature or anything else outside the hard drive.

SMART data is accessible through a number of tools. I've used smartmontools before, but didn't know they had a build available for Windows until just now. It's the same set of tools as the Linux/FreeBSD/whatever versions. The device naming is the same as on the non-Windows versions, and the smartctl manpage details the syntax. I wanted temperature information, and PowerShell helps make this pretty easy:

PS > .\smartctl.exe -a /dev/hda `
     | where {$_ -match "Temperature"} `
     | foreach { $_.split()[-1] }

After a bit of randomly permuting search queries, I found that some temperature information is available through WMI. The temperature values are in tenths of kelvin. We can query this from PowerShell:

PS > get-wmiobject MSAcpi_ThermalZoneTemperature -namespace "root/wmi" `
     | select CurrentTemperature,InstanceName

CurrentTemperature InstanceName
------------------ ------------
              3102 ACPI\ThermalZone\THRM_0

I found this particular WMI class by doing the following after getting some hints from search results:

PS > get-wmiobject -namespace "root/wmi" -list | findstr Temp
MSAcpi                                    MSAcpi_ThermalZoneTemperature
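
Since the WMI readings are in tenths of kelvin, they need converting before they mean much to a human or an alert threshold. A minimal sketch of that conversion (the function name is made up for illustration):

```python
def tenths_kelvin_to_celsius(raw):
    """Convert a reading in tenths of kelvin to degrees Celsius."""
    return raw / 10.0 - 273.15


# 3102 is the CurrentTemperature value from the query output above.
print(round(tenths_kelvin_to_celsius(3102), 2))
```

The 3102 reading above works out to roughly 37 degrees Celsius.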

Python, event tracking, and bdb.

In previous postings, I've shared thoughts on monitoring and data aggregation. Last night, I started prototyping a Python tool to record arbitrary data. Basically it aims to solve the problem of "I want to store data" in a simple way, rather than having to set up a MySQL database, user access, tables, a web server, etc., all in the name of viewing graphs and data reports.

The requirements are pretty simple and generic:

  • Storage will be mapping key+timestamp => value
  • Timestamps should be 64-bit values, so we can use microsecond resolution (the current Unix epoch in microseconds needs about 51 bits)
  • Access method is assumed to be random access by key, but reading multiple timestamp entries for a single key is expected to be sequential.
  • Key names must be arbitrary length
  • Storage must be space-efficient on key names
  • Values are arbitrary.
  • Minimal setup overhead (i.e., you don't have to set up MySQL)
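
To make the timestamp requirement concrete, here is a small sketch of storing microsecond timestamps as 64-bit values. The helper names are made up for illustration, and the big-endian packing is my assumption rather than something the prototype is known to do; it is chosen because big-endian bytes sort in the same order as the numbers, which suits sequential reads.

```python
import struct
import time


def now_micros():
    """Current Unix time in whole microseconds."""
    return int(time.time() * 1_000_000)


def pack_timestamp(micros):
    """Pack a microsecond timestamp as an unsigned 64-bit big-endian integer."""
    return struct.pack(">Q", micros)


def unpack_timestamp(raw):
    """Inverse of pack_timestamp."""
    return struct.unpack(">Q", raw)[0]


sample = 1195634167358035          # a microsecond timestamp from late 2007
print(sample.bit_length())         # bits actually needed for this value
print(len(pack_timestamp(sample))) # bytes used on disk
```

A 64-bit field leaves plenty of headroom over what a microsecond timestamp needs today.
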

The goal of this is to provide a simple way to store and retrieve timestamped key-value pairs. It's a small piece of the hope that there can be a monitoring tool set that is trivial to configure, easy to manipulate, easy to extend, and dynamic. I don't want to pre-declare data sources and data types (rrdtool, cacti), or make it difficult to add new collection sources, etc. I'm hoping that by relying on the Unix methodology (small tools that do one or two things well) this can be achieved. The next steps in this adventure of a monitoring system are:
  • a graphing system
  • aggregation functions
  • templating system (web, email, etc?)

Space efficiency on key names is achieved with a secondary storage containing a list of key to key ID mappings. Key IDs are 64-bit values. The first value is 1. We could use a hash function here, but I want a guarantee of zero collisions. However, this means that keys are stored in the primary database by key ID, which follows insertion order, not lexicographical order.
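
The key to key ID mapping can be sketched as follows. This is an in-memory stand-in using a plain dict, not the actual on-disk store, and the class, function, and key names are hypothetical:

```python
import struct


class KeyIndex:
    """In-memory stand-in for the secondary key => key ID database."""

    def __init__(self):
        self._ids = {}
        self._next_id = 1  # the first key ID is 1

    def key_id(self, key):
        """Return the ID for key, assigning the next sequential ID if new."""
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]


def primary_key(key_id, timestamp_micros):
    """Build a fixed 16-byte primary key: 64-bit key ID + 64-bit timestamp."""
    return struct.pack(">QQ", key_id, timestamp_micros)


index = KeyIndex()
first = index.key_id("networking.ip.packets_in")  # hypothetical key name
second = index.key_id("disk.sda.temperature")     # hypothetical key name
print(first, second)
print(len(primary_key(first, 1195634167358035)))
```

Sequential IDs give the zero-collision guarantee a hash cannot, and every record pays a fixed 8 bytes for its key no matter how long the name is.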

This may become a problem if we want to read keys sequentially. However, if we scan the secondary database (the one mapping key => 64-bit key ID), we get keys in lexicographical order for free. So iterating over all keys starting with the string 'networking' is still possible, but it will result in random-access reads on the primary database. This may be undesirable, so I'll have to think about whether or not this use case is necessary. There are some simple solutions to this, but I'm not sure which one best solves the general case.
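
A prefix scan over the secondary database might look like the sketch below. It uses a sorted Python list and `bisect` as a stand-in for an ordered on-disk index, and the key names and IDs are made up for illustration:

```python
import bisect


def keys_with_prefix(sorted_keys, prefix):
    """Yield keys starting with prefix from a lexicographically sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order means no later key can match
        yield key


# Hypothetical keys with IDs assigned in insertion order.
key_ids = {
    "networking.eth0.rx": 1,
    "disk.sda.temp": 2,
    "uptime": 3,
    "networking.eth0.tx": 4,
}
sorted_keys = sorted(key_ids)

for key in keys_with_prefix(sorted_keys, "networking"):
    print(key, key_ids[key])
```

The matching keys come out in name order, but their IDs (1 and 4 here) are not adjacent, which is exactly why the follow-up reads against the primary database end up being random access.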

Arbitrary key length is a requirement because I found the limitations of RRDTool annoying, where data source names cannot be more than 19 characters - lame! We end up being more space efficient (8 bytes per name) for any length of data source name, at the cost of a lookup to find the 64-bit key ID from the name.

I have some of the code written, and a demo that runs 'netstat -s' once a second for 5 seconds and records total IP packets inbound. The key is ''

[1195634167358035]: 1697541
[1195634168364598]: 1698058
[1195634169368830]: 1698566
[1195634170372823]: 1699079
[1195634171376553]: 1699590
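
As a taste of what an aggregation function over this data could look like, here is a sketch that turns the five samples above into per-second packet rates. The function name is made up, and this is not part of the prototype:

```python
# (timestamp in microseconds, total inbound IP packets) from the demo output.
samples = [
    (1195634167358035, 1697541),
    (1195634168364598, 1698058),
    (1195634169368830, 1698566),
    (1195634170372823, 1699079),
    (1195634171376553, 1699590),
]


def rates_per_second(samples):
    """Per-second rate of change between consecutive (micros, counter) samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed_seconds = (t1 - t0) / 1_000_000
        rates.append((v1 - v0) / elapsed_seconds)
    return rates


for rate in rates_per_second(samples):
    print(f"{rate:.1f} packets/s")
```

Dividing by the real elapsed time, rather than assuming exactly one second, keeps the rate honest when the sampling loop drifts by a few milliseconds.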