Graphs in Ruby with RMagick

I'm always finding myself wanting to graph random data. Gnuplot is nice, but not enjoyably scriptable. Matplotlib in python is too matlab-ish, or was when I last looked at it (though it looks much improved now). Some ruby options exist (even ruby+gnuplot), but none were much to my taste.

I started fiddling around with RMagick and stumbled across what it calls "RVG" (ruby vector graphics). From the site:

RVG (Ruby Vector Graphics) is a facade for RMagick's Draw class that supplies a drawing API based on the Scalable Vector Graphics W3C recommendation.
The API is reasonable and hasn't hindered me yet; after a few hours of hacking with it, it feels good. Simple operations like point translation, scaling, rotating, and flipping are simple in code; the API is well documented; and images can be embedded inside other images, which makes it easy to write modular drawing code.

Anyway, the goal of this adventure was to come up with something that would produce non-crappy plots, with the main emphasis on a painless way to apply axis labels and ticks. The result is below (x-axis ticks are hour-aligned with 12-hour steps; y-axis ticks are aligned to single values):

Here's the code that generates the above graph (using rplot.rb). A lot of things (like axis label tick alignment and stepping) are hardcoded right now, but that will obviously change if I decide this project needs attention (and I don't find something that does the same thing but better).

# graph some random stuff, like log(x) and sin(x)
# use time for the 'x' to demo time formatting
# each point is an hour (i * 3600)
graph = RPlot.new(400, 200, "Happy Graph")

points = 60
axis = GraphAxis.new
(1..points).each do |i|
  axis.points << [Time.now.to_f + i*3600, Math.log(i)]
end

axis2 = GraphAxis.new
(1..points).each do |i|
  axis2.points << [Time.now.to_f + i*3600, Math.sin(i / 2.0) + 1]
end

graph.axes << axis
graph.axes << axis2

graph.render("/home/jls/public_html/test.gif")

fancydb performance

Various trials with basically the same input set: 2.5 million row entries, with at most 1 entry per second. The insertion rate drops by 60% if you add rule evaluations, which is an unfortunate performance loss. I'll work on making rules less invasive. Since python threading will never run on two processors at once, I can't gain significant performance by sharding rule processing into separate threads; most unfortunate. Maybe fork+ipc is necessary here, but I'm somewhat loath to do that.

The slowdown when rules are present is due to the record keeping that notifies a rule that it should be evaluated again (rule evaluations are queued). Basically, the 'is this row being watched by a rule' loop is the slowdown. I'll try attacking this first.
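One likely fix is to index rules by the row they watch, so the per-insert check becomes a dictionary lookup instead of a scan over every rule. Here's a rough sketch of that idea in python (none of these names come from the real fancydb code):

# Hypothetical sketch: replace the per-insert "is this row watched by a rule?"
# scan with a dict keyed on row name, so the common case is a single lookup.
from collections import defaultdict

class RuleIndex:
    def __init__(self):
        # row name -> rules watching that row
        self.watchers = defaultdict(list)

    def add_rule(self, source_row, rule):
        self.watchers[source_row].append(rule)

    def on_insert(self, row, timestamp, value, queue):
        # average O(1) lookup instead of scanning every rule on every insert
        for rule in self.watchers.get(row, ()):
            queue.append((rule, timestamp))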

With 2 rules (unoptimized rules):
    hits.minute => hits.mean.1hour @ 60*60
    hits.minute => hits.mean.1day @ 60*60*24
  insertion rate = 7600/sec

With 2 rules (optimized chaining):
    hits.minute => hits.mean.1hour @ 60*60
    hits.mean.1hour => hits.mean.1day @ 60*60*24
  insertion rate = 12280/sec

With 9 rules (optimized chaining):
  insertion rate: 10000/sec

With 0 rules:
  trial 1: 40000/sec
  trial 2: 26700/sec

Storage utils, eventdb, etc.

Spent lots of time over Thanksgiving playing with bdb (Berkeley DB) in python.

Again, I still don't have release-worthy code, but here's a snippet of rrdtool-like behavior from this system:

% ./evtool.py create /tmp/webhits.db
% ./evtool.py addrule /tmp/webhits.db http.hit agg.http.hit.daily total $((60*60*24)) time
% time cat webhits.data | ./evtool.py update /tmp/webhits.db -
11.10s user 0.80s system 94% cpu 12.627 total
% time ./evtool.py graph /tmp/webhits.db agg.http.hit.daily  
0.49s user 0.11s system 96% cpu 0.624 total
The result is exactly the same graph as in my previous post. Speed so far is pretty good: the input was 125,000 entries in 12.6 seconds, which works out to roughly 10,000 updates per second. That kind of QPS seems pretty reasonable.

The primary difference today is that the aggregates are computed as data enters the system. 'Addrule' tells the database to schedule an aggregation for specific timestamps.

The goal is to be able to chain rules, and have N:M relationships between rule input and output. Those will happen soon. Chaining would've happened tonight, but I'm having some locking problems due to it being quite late ;)

The database code itself is designed to be reusable elsewhere. There are two primary classes: SimpleDB and FancyDB. SimpleDB lets you store and retrieve data based on row+timestamp => value pairs. FancyDB wraps SimpleDB and gives you operation listeners such as the rule used in the above example.
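The real code isn't released yet, but the interface described above might look roughly like this sketch (the class names come from the post; the method names are my guesses, and the real store is backed by bdb rather than a dict):

# Rough sketch of the described interface. SimpleDB stores
# row+timestamp => value; FancyDB wraps it and notifies listeners
# (such as rules) on every write.
class SimpleDB:
    def __init__(self):
        self.data = {}                      # (row, timestamp) -> value

    def set(self, row, timestamp, value):
        self.data[(row, timestamp)] = value

    def get(self, row, timestamp):
        return self.data.get((row, timestamp))

class FancyDB(SimpleDB):
    def __init__(self):
        SimpleDB.__init__(self)
        self.listeners = []                 # callables fired on each write

    def add_listener(self, callback):
        self.listeners.append(callback)

    def set(self, row, timestamp, value):
        SimpleDB.set(self, row, timestamp, value)
        for callback in self.listeners:
            callback(row, timestamp, value)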

I've already used SimpleDB elsewhere; in the sms traffic tool I mentioned in my last post, I cache geocode data and traffic requests with this same database tool.

Less bullshit, more graph.

I've been working recently on dynamic, simple graphing. Systems like Cacti provide useful interfaces, but getting data into it is a pain in the ass.

You have 1500 machines you want in cacti. How do you do it?

My take is that you shouldn't ever need to preregister data types or data sources. Have a system that you simply throw data at, and it stores the data so you can graph it later. To graph new data, all I need to do is write a script that produces the data and sends it to the collector.

The collector is a python cgi script that frontends rrdtool. It takes all cgi parameters and stores the values, with a few exceptions:

  • machine=XX - Spoof machine to store data for. If not given, defaults to REMOTE_ADDR. Useful if you need to proxy data through another machine, or are reporting data about another machine you are probing.
  • timestamp=XX - Override default timestamp ("now").
Everything else gets stored like this: /dataroot/<machine>/<variable>.rrd

Example:

kenya(/mnt/rrds/129.21.60.26) % ls
C_bytes_per_page.rrd                            C_pages_inactive.rrd
C_cpu_context_switches.rrd                      C_rfork_calls.rrd
... etc ...
All of those rrds are created by simply throwing data at the python cgi script. The source of the data is a script that runs 'vmstat -s' and turns it into key-value pairs.

Why are the files prefixed with "C_"? The data I'm feeding in comes from counters, and should therefore be stored as the COUNTER datatype in rrdtool. The 'C_' prefix is a hint that if an rrd needs to be created for the variable, its DS type should be COUNTER. The default without this prefix is GAUGE.
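The real updater.py isn't shown here, but the behavior described above might look roughly like this sketch (DATAROOT, the DS name 'value', and the RRA layout are my assumptions):

#!/usr/bin/env python
# Hypothetical sketch of the collector cgi: every query parameter becomes
# /dataroot/<machine>/<variable>.rrd; "machine" and "timestamp" are the two
# special-cased parameters, and a "C_" prefix selects the COUNTER DS type.
import cgi, os, subprocess

DATAROOT = "/dataroot"

form = cgi.FieldStorage()
machine = form.getfirst("machine", os.environ.get("REMOTE_ADDR", "unknown"))
timestamp = form.getfirst("timestamp", "N")        # rrdtool "N" means now

for key in form.keys():
    if key in ("machine", "timestamp"):
        continue
    rrd = os.path.join(DATAROOT, machine, key + ".rrd")
    if not os.path.exists(rrd):
        os.makedirs(os.path.dirname(rrd), exist_ok=True)
        dstype = "COUNTER" if key.startswith("C_") else "GAUGE"
        subprocess.call(["rrdtool", "create", rrd, "--step", "60",
                         "DS:value:%s:120:0:U" % dstype,
                         "RRA:AVERAGE:0.5:1:525600"])
    subprocess.call(["rrdtool", "update", rrd,
                     "%s:%s" % (timestamp, form.getfirst(key))])

print("Content-type: text/plain\n\nok")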

Sample update http request:
http://somehost/updater.py?C_fork_calls=32522875&C_system_calls=235293874987

Feel free to view the vmstat -s poll script to get a better idea of what this does. I also have another script that will do some scraping on 'netstat -s' in freebsd (probably works in linux too).

vmstat -s looks like this:

456846233 cpu context switches
3220655757 device interrupts
 17964606 software interrupts
  ... etc ...
It's trivial to turn this into key-value pairs (a sketch follows below). If this were Cacti (or a similar system), I would have to go through every line of vmstat -s and create a new data type/source/thing for each one, then create one per host. Screw that. Keep in mind my experience with Cacti is pretty small - I saw that I had to register data sources and graphs manually, and I left it alone after that.
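For illustration, here's a rough python sketch of that conversion (the real vmstat poll script is linked above and may differ; the collector URL comes from the earlier example):

# Hypothetical sketch of a "vmstat -s" poller: turn each
# "<number> <description>" line into a key-value pair and send the whole
# batch to the collector in one HTTP GET.
import re, subprocess, urllib.parse, urllib.request

COLLECTOR = "http://somehost/updater.py"    # URL from the earlier example

params = {}
for line in subprocess.check_output(["vmstat", "-s"], text=True).splitlines():
    m = re.match(r"\s*(\d+)\s+(.+)", line)
    if not m:
        continue
    value, description = m.groups()
    # vmstat -s reports counters, so use the C_ prefix
    key = "C_" + re.sub(r"\W+", "_", description.strip())
    params[key] = value

urllib.request.urlopen(COLLECTOR + "?" + urllib.parse.urlencode(params))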

Anyway, back to the problem. Now how do I graph it? The interface isn't the best, but we use a cgi script again:

Show me all the machines with 'C_system_calls' graphed over the past 15 minutes:
graph.py?machines=129.21.60.1,<...>,129.21.60.26&keys=C_system_calls&start=-15min

This kind of system has the feature that you never need to explicitly define data input variables or data sources - all you need is to hack together a script that can pump out key-value pairs. No documentation to read. No time consumed registering 500 new servers in your graphing system.

RRDTool to graph log-originating data.

I need to relearn rrdtool, again, for this sysadmin time machine project. Today's efforts were spent testing for features I hoped were in RRDTool. So far, my feature needs are met :)

Take something simple, like webserver logs. Let's graph the hits.

Create the RRD:

rrdtool create webhits.rrd --start 1128626000 -s 60 \
   DS:hits:GAUGE:120:0:U RRA:AVERAGE:.5:5:600000 \
   RRA:AVERAGE:.5:30:602938 RRA:AVERAGE:.5:60:301469 \
   RRA:AVERAGE:.5:240:75367 RRA:AVERAGE:.5:1440:12561
My logs start *way* back in November of last year, so I create the rrd with a start date of sometime in November. The step is 60, so it expects data every minute. I then specify one data source, hits, which is a GAUGE (the value is stored as-is; here, hits per minute) and ranges from 0 to infinity (U). The rest of the command is the RRAs, which define how data is stored. The first one says: average 5 primary samples per row and keep at most 600,000 of those rows.
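As a sanity check on those RRA sizes (assuming the 60-second step above), the first RRA holds about 5.7 years of 5-minute averages and each of the coarser ones roughly 34 years:

# Back-of-the-envelope check of how much history each RRA above holds:
# rows * steps_per_row * step_seconds.
step = 60
rras = [(5, 600000), (30, 602938), (60, 301469), (240, 75367), (1440, 12561)]
for steps_per_row, rows in rras:
    seconds = rows * steps_per_row * step
    print("%4d-minute averages: %5.1f years" % (steps_per_row, seconds / 86400.0 / 365))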

Now that we have the database, we need a "hits-per-minute" data set. I wrote a short perl script, parsehttp, that reads from standard input, calculates hits-per-minute, and outputs rrdtool update statements. Capture this output and run it through sh:

./parsehttp < access.log | sh -x
Simple enough. This will calculate hits-per-minute for all times in the logs and store it in our RRD.
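I haven't included parsehttp itself here, but a minimal python equivalent (rather than perl) might look like this, assuming common/combined log format timestamps:

#!/usr/bin/env python
# Hypothetical stand-in for parsehttp: read an access log on stdin, count
# hits per minute, and emit "rrdtool update" commands for piping through sh.
import sys, time
from collections import Counter

hits = Counter()
for line in sys.stdin:
    # common/combined log format: the timestamp sits between '[' and ']'
    try:
        stamp = line.split("[", 1)[1].split("]", 1)[0]   # e.g. 10/Oct/2005:13:55:36 -0700
        when = time.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S")
    except (IndexError, ValueError):
        continue
    minute = int(time.mktime(when)) // 60 * 60           # truncate to the minute
    hits[minute] += 1

for minute in sorted(hits):
    print("rrdtool update webhits.rrd %d:%d" % (minute, hits[minute]))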

Now that we have the data, we can graph it. However, since I want to view trends and compare time periods, I'll need to do something fancier than simple graphs.

RRDTool lets you graph multiple data sets on the same graph. So, I want to graph this week's hits and last week's hits. However, since the data sets are on different time intervals, I need to shift last week's set forward by one week. Here's the rrdtool command that graphs it for us, with last week's and this week's data on the same graph, displayed at the same time period:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   LINE1:lastweek#00FF00:"last week" LINE1:hits#FF0000:"this week"
That'll look like line noise if you've never used RRDTool before. I define two data sets with DEF: hits and lastweek. Both read from the 'hits' data source in webhits.rrd. One starts at "-1 week" (one week ago, duh) and the other starts 2 weeks ago and ends one week ago. I then shift last week's data forward by 7 days (604800 seconds). Lastly, I draw two lines: last week's in green and this week's in red.

That graph looks like this:

That's not really useful, because there are so many data points that the graph becomes almost meaningless. This is due to my poor choice of RRAs. We can fix that by redoing the database, or by using the TREND feature. Change our graph statement to:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:lastweek#CCFFCC:"last week" LINE1:hits#FFCCCC:"this week" \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
I added only two CDEF statements. Each takes a data set and "trends" it over one day (86400 seconds), creating a sliding average across time. I store these in new data sets called t_hits and t_lastweek and graph those as well.

The new graph looks like this:

You'll notice the sliding averages are chopped off on the left; that's because there aren't enough data points at those times to compute the average. Also, including the raw data makes the graph scale as it did before, which makes it awkward to see the difference between the trends. So let's fix that by not graphing the raw data - just cut out the LINE1:lastweek and LINE1:hits options.

To fix the sliding-average cutoff, and add a title and a vertical label:

rrdtool graph webhits.png -s "-1 week" \
   -t "Web Server Hits - This week vs Last week" \
   -v "hits/minute" \
   DEF:hits=webhits.rrd:hits:AVERAGE:start="-8 days":end="start + 8 days"  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-15 days":end="start + 8 days" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
The graph still covers one week ago until now, but the data sets it uses extend beyond those boundaries, so the sliding averages can be computed throughout. The new, final graph looks like this:

Now I can compare this week's hits against last week's, quickly, with a nice visual. This is what I was looking for.

This would become truly useful if we had lots of time periods (days, weeks, whatever) to look at. Then we could calculate standard deviation, etc. A high outlier could be marked automatically with a label, giving an instant visual cue that something is potentially novel. It might be simple to create a sort of sliding "standard deviation" curve. I haven't tried that yet.
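I haven't tried it, but one rough way to prototype the idea outside of rrdtool would be to fetch the samples (say, with rrdtool fetch) and compute a sliding mean and standard deviation in python, flagging anything far outside the recent band:

# Hypothetical sketch: flag samples more than `threshold` standard deviations
# away from the mean of the previous `window` samples.
import math
from collections import deque

def outliers(samples, window=1440, threshold=2.0):
    """samples: iterable of (timestamp, value) pairs, oldest first."""
    recent = deque(maxlen=window)
    flagged = []
    for timestamp, value in samples:
        if len(recent) == window:
            mean = sum(recent) / window
            stddev = math.sqrt(sum((v - mean) ** 2 for v in recent) / window)
            if stddev > 0 and abs(value - mean) > threshold * stddev:
                flagged.append((timestamp, value))
        recent.append(value)
    return flagged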