
Munin doesn't scale by default.

Just started playing with Munin as a potentially better option for trending than Cacti (which is hard to automate). I have about 30 hosts being watched by munin. The munin update job (which fetches data and regenerates graphs, etc.) runs every 5 minutes by default, and takes almost 4 minutes to run on a 2.5GHz host. If we add many more things to monitor, we'll likely overrun the 5-minute interval.

Examining the process shows that most of the time is spent generating graphs. Every graph displayed on the munin web pages is regenerated every 5 minutes, whether or not anyone looks at it. This can't scale.

There is an option you can set in your munin.conf:

graph_strategy cgi

This, along with a few other changes, makes munin skip the graph prerendering. With graph_strategy set to cgi, the update runtime drops to 28 seconds, most of which is spent generating the static HTML for the munin web interface - even if no one ever looks at it.
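For reference, the relevant piece of munin.conf ends up looking roughly like this (the paths shown here are common defaults rather than anything from my install, and your webserver also has to serve munin's munin-cgi-graph script for graphs to render on demand):

# munin.conf (excerpt)
htmldir /var/www/munin
# render graphs on demand via munin-cgi-graph instead of on every update run
graph_strategy cgi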

Really, though, this is 2009: static HTML, really? Sarcasm aside, dynamically generated web pages are basically the standard these days. Munin needs a better frontend that isn't static HTML.

Among other oddities, there doesn't seem to be a way to time travel. The default graph views are today, this week, this month, and this year. Yesterday, last week, last month, etc., are sometimes useful too, not to mention other odd views like the last 36 hours or last 6 hours.

More rmagick/rvg playtime

While working on graphs tonight, I decided that the calculation and labelling of ticks should be handled by dedicated 'tick' classes. A tick provider is just an iterable class (foo.each, etc.) whose iterator takes a min and max value and yields the position and optional label of each tick in that range (a rough sketch follows this list). This design means you can:
  • Give a graph multiple tickers per axis, so 'major' ticks are labeled while 'minor' ticks are not, with even more than two layers of ticks possible on each axis.
  • Use the same tick classes to draw both the axis ticks and the grid.
  • Trivially write 'time format' tickers.
  • Build a 'smart time ticker' that looks at the min/max and picks the correct set of time ticks to display (display format, tick distance, tick alignment, etc.), reusing multiple plain 'time ticker' instances internally (code reuse!).
I'm sure this has all been thought of before, but it's a research experience for me :)
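Here's a rough sketch of the idea in Ruby; the class name and constructor arguments are illustrative, not a final API:

# A tick provider: an iterable that yields [position, label]
# for every tick between min and max.
class LabeledTicker
  def initialize(alignment = 0, step = 25)
    @alignment = alignment  # offset that tick positions are aligned to
    @step = step            # distance between ticks
  end

  # Yield [position, label] for each tick in [min, max].
  def each(min, max)
    # first tick at or after min that honors the alignment
    start = min + ((@alignment - min) % @step)
    start.step(max, @step) do |position|
      yield position, position.to_s  # a time ticker would format a timestamp here
    end
  end
end

ticker = LabeledTicker.new(0, 25)
ticker.each(10, 110) { |pos, label| puts "tick at #{pos}: #{label}" }
# prints ticks at 25, 50, 75, and 100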

At any rate, I'm finding myself wondering whether RMagick/rvg is really the right tool. It certainly makes graphics work trivial, but even a graph I consider simple takes a little over a second to render, which would hurt usability if multiple graphs needed rendering at once.

The bottleneck seems to be text rendering. If I disable text display in the graph (tick labels, etc.), rendering time drops by 0.5 seconds (from 1.1 to about 0.6). Switching the output format from 'gif' to 'png' shaved another 0.2 seconds on average, which is interesting.

Today's results, with real data:

# Plot the last 300 samples from a whitespace-delimited
# "timestamp latency" ping log.
graph = RPlot::Graph.new(400, 200, "Ping results for www.google.com")
pingsource = RPlot::ArrayDataSource.new
File.foreach("/b/pingdata") do |line|
  time, latency = line.split
  pingsource.points << [time.to_f, latency.to_f]
end
pingsource.points = pingsource.points.last(300)    # keep only the newest 300 points
graph.sources << pingsource
graph.xtickers << RPlot::SmartTimeTicker.new       # picks sensible time ticks itself
graph.ytickers << RPlot::LabeledTicker.new(0, 25)  # alignment 0, step 25
graph.render("test.png")

Storage utils, eventdb, etc.

Spent lots of time over Thanksgiving playing with bdb (Berkeley DB) in Python.

Again, I still don't have release-worthy code, but here's a snippet of rrdtool-like behavior from this system:

% ./evtool.py create /tmp/webhits.db
% ./evtool.py addrule /tmp/webhits.db http.hit agg.http.hit.daily total $((60*60*24)) time
% time cat webhits.data | ./evtool.py update /tmp/webhits.db -
11.10s user 0.80s system 94% cpu 12.627 total
% time ./evtool.py graph /tmp/webhits.db agg.http.hit.daily
0.49s user 0.11s system 96% cpu 0.624 total

The result is exactly the same graph as in my previous post. Speed so far is pretty good: the input was 125000 entries processed in 12.6 seconds, which works out to roughly 10000 updates per second. That kind of throughput seems pretty reasonable.

The primary difference today is that aggregates are computed as data enters the system: 'addrule' tells the database to schedule an aggregation for specific timestamps.
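I haven't shown the rule code itself, but a periodic 'total' rule presumably boils down to aligning each incoming timestamp to the start of its aggregation window. A minimal sketch of that idea in Python (the names here are mine, not evtool's):

# PERIOD matches the $((60*60*24)) argument given to addrule above.
PERIOD = 60 * 60 * 24

def bucket(timestamp, period=PERIOD):
    # Align a timestamp to the start of its aggregation window.
    return timestamp - (timestamp % period)

# Every event within the same day lands on the same aggregate timestamp
# (1259625600 is 2009-12-01 00:00:00 UTC):
assert bucket(1259625600 + 100) == bucket(1259625600 + 50000)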

The goal is to be able to chain rules and have N:M relationships between rule inputs and outputs. Those will happen soon. Chaining would've happened tonight, but I'm having some locking problems due to it being quite late ;)

The database code itself is designed to be reusable elsewhere. There are two primary classes: SimpleDB and FancyDB. SimpleDB lets you store and retrieve data as row + timestamp => value pairs. FancyDB wraps SimpleDB and adds operation listeners, such as the rule used in the example above.
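Since the code isn't released yet, this is only a guess at the shape of that layering; the method names are mine, and a plain dict stands in for bdb:

class SimpleDB(object):
    # Store and retrieve values keyed on row + timestamp.
    def __init__(self):
        self.data = {}

    def set(self, row, timestamp, value):
        self.data[(row, timestamp)] = value

    def get(self, row, timestamp, default=None):
        return self.data.get((row, timestamp), default)

class FancyDB(SimpleDB):
    # SimpleDB plus operation listeners; a rule like the daily 'total'
    # above is just a listener fired on writes to a given row.
    def __init__(self):
        SimpleDB.__init__(self)
        self.listeners = {}  # row => list of callbacks

    def add_listener(self, row, callback):
        self.listeners.setdefault(row, []).append(callback)

    def set(self, row, timestamp, value):
        SimpleDB.set(self, row, timestamp, value)
        for callback in self.listeners.get(row, []):
            callback(self, row, timestamp, value)

def total_rule(db, row, timestamp, value):
    # Roll each http.hit into a running daily total, using the same
    # timestamp bucketing sketched earlier.
    when = timestamp - (timestamp % (60 * 60 * 24))
    old = db.get("agg.http.hit.daily", when, 0)
    # write through SimpleDB.set so listeners don't re-fire on the aggregate row
    SimpleDB.set(db, "agg.http.hit.daily", when, old + value)

db = FancyDB()
db.add_listener("http.hit", total_rule)
db.set("http.hit", 1259625700, 1)
db.set("http.hit", 1259675600, 1)
print(db.get("agg.http.hit.daily", 1259625600))  # => 2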

I've already used SimpleDB elsewhere: in the SMS traffic tool I mentioned in my last post, I cache geocode data and traffic requests with this same database tool.