In previous posts, I've shared thoughts on monitoring and data aggregation.
Last night, I started prototyping a Python tool to record arbitrary data.
Basically, it aims to solve the problem of "I want to store data" in a simple
way, rather than having to set up a MySQL database, user access, tables, a web
server, etc., all in the name of viewing graphs and data reports.
The requirements are pretty simple and generic:
- Storage will be mapping key+timestamp => value
- Timestamps should be 64-bit values, so we can use microsecond resolution
  (about 52 bits are needed to represent the current Unix epoch in microseconds)
- Access method is assumed to be random access by key, but reading
multiple timestamp entries for a single key is expected to be sequential.
- Key names must support arbitrary lengths
- Storage must be space-efficient on key names
- Values are arbitrary.
- Minimal setup overhead (i.e., you don't have to set up MySQL)
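As a rough sketch of the interface these requirements imply, here's a minimal in-memory version (the class and method names are my own placeholders, not the actual prototype; a real implementation would persist to disk):

```python
import time


class TimeSeriesStore:
    """Minimal in-memory sketch of a (key, timestamp) => value store."""

    def __init__(self):
        self._data = {}  # (key, timestamp_us) => value

    def put(self, key, value, timestamp_us=None):
        """Record a value under key, defaulting to 'now' in microseconds."""
        if timestamp_us is None:
            # 64-bit microsecond timestamp, per the requirements above.
            timestamp_us = int(time.time() * 1_000_000)
        self._data[(key, timestamp_us)] = value
        return timestamp_us

    def get(self, key, start_us=0, end_us=2**63 - 1):
        """Read all entries for one key in time order (sequential access)."""
        entries = [(ts, v) for (k, ts), v in self._data.items()
                   if k == key and start_us <= ts <= end_us]
        return sorted(entries, key=lambda e: e[0])
```

Random access is by key; reading multiple timestamps for one key comes back as an ordered sequence, matching the access pattern described above.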
The goal of this is to provide a simple way to store and retrieve timestamped
key-value pairs. It's a small piece of the hope that there can be a monitoring
tool set that is trivial to configure, easy to manipulate, easy to extend, and
dynamic. I don't want to pre-declare data sources and data types (rrdtool,
cacti), or make it difficult to add new collection sources. I'm hoping that by
relying on the Unix methodology (small tools that do one or two things well),
this can be achieved.
The next steps in this adventure of a monitoring system are:
- a graphing system
- aggregation functions
- templating system (web, email, etc?)
Space efficiency on key names is achieved with a secondary store containing a
list of key-to-keyid mappings. Key IDs are 64-bit values; the first value is 1.
We could use a hash function here, but I want a guarantee of zero collisions.
However, this means that keys are stored as key IDs in insertion order, not
lexicographical order.
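The key-to-keyid mapping might look roughly like this (a hypothetical sketch with an in-memory dict standing in for the secondary store):

```python
import struct


class KeyRegistry:
    """Sketch of the secondary mapping: key name => sequential 64-bit key ID."""

    def __init__(self):
        self._ids = {}      # key name => key id
        self._next_id = 1   # the first key ID is 1

    def key_id(self, name):
        # Sequential assignment guarantees zero collisions, unlike hashing.
        if name not in self._ids:
            self._ids[name] = self._next_id
            self._next_id += 1
        return self._ids[name]

    def packed_id(self, name):
        # 8 bytes per key name in the primary store, regardless of name length.
        return struct.pack(">Q", self.key_id(name))
```

The tradeoff is visible here: IDs reflect insertion order, so the primary store's ordering has nothing to do with the key names themselves.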
This may become a problem if we want to read keys sequentially. However, if we
scan the secondary database (the one mapping key => 64bit_keyid), we can get
keys in lexicographical order for free. So iterating over all keys starting
with the string 'networking' is still possible, but it will result in
random-access reads on the primary database. This may be undesirable, so I'll
have to think about whether or not this use case is necessary. There are some
simple solutions to this, but I'm not sure which one best solves the general
case.
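A prefix scan over the secondary mapping might look like this (a sketch, with a plain dict standing in for the secondary database):

```python
def keys_with_prefix(key_to_id, prefix):
    """Yield (key, key_id) pairs in lexicographic order for a given prefix.

    key_to_id stands in for the secondary database; scanning it in sorted
    order gives keys lexicographically even though the IDs themselves were
    assigned in insertion order.
    """
    for name in sorted(key_to_id):
        if name.startswith(prefix):
            yield name, key_to_id[name]
```

Note that the IDs this yields (2, 3, ... in whatever order insertion happened) are exactly why the follow-up reads on the primary database become random access.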
Arbitrary key length is a requirement because I found the limitations of
RRDTool annoying, where data source names cannot be more than 19 characters -
lame! We end up being more space-efficient (8 bytes per name) for any length of
data source name, at the cost of a lookup to find the 64-bit key ID from the
key name.
I have some of the code written, and a demo that runs 'netstat -s' once a
second for 5 seconds and records total ip packets inbound. The key is
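The demo loop might look roughly like this (a sketch, not the actual demo code; it assumes Linux-style `netstat -s` output and any store object exposing a put(key, value) method, and the key name is a made-up placeholder):

```python
import re
import subprocess
import time


def parse_total_ip_packets(netstat_output):
    """Pull 'N total packets received' out of `netstat -s` output."""
    match = re.search(r"(\d+) total packets received", netstat_output)
    return int(match.group(1)) if match else None


def record_ip_packets(store, seconds=5):
    """Sample total inbound IP packets once a second for `seconds` seconds."""
    for _ in range(seconds):
        out = subprocess.run(["netstat", "-s"], capture_output=True,
                             text=True).stdout
        store.put("netstat.ip.in.packets", parse_total_ip_packets(out))
        time.sleep(1)
```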