Bringing test tools to Nagios monitoring

With all the TDD (test-driven development) and BDD (behavior-driven development) going around these days, it'd be a shame not to use these tools on monitoring applications.

You might have a boatload of tests that test your application before you roll a new version, but do you use those tests while the application is in production? Can you? Yes!

Let's take a practical example: monitoring some complex interaction, like searching Google and checking the results. Simple with a mouse, but perhaps complex in code. Even if you wrote a one-off script to do it, using an existing testing framework gets you pass/fail reporting automatically.

For this example, I'll use the following Ruby tools: rspec and webrat. This is fairly easy, though it took me a bit to find all the right documentation bits to clue me in to the right way.

require 'rubygems'
require 'webrat'

# Make webrat's browser helpers (visit, fill_in, click_link, etc.)
# available inside the examples below.
Spec::Runner.configure do |config|
  include Webrat::Methods
end

describe "google search for my name" do
  it "should include semicomplete.com in results" do
    # Load the google front page and sanity-check the title.
    visit "http://www.google.com/"
    webrat.response.title.should =~ /Google/

    # Submit a search and confirm we landed on a results page.
    query = "jordan sissel"
    fill_in "q", :with => query
    field_named("btnG").click
    webrat.response.title.should == "#{query} - Google Search"

    # This fails the example if no link to semicomplete.com is found.
    click_link "semicomplete.com"
  end
end
Now, we run this with the 'spec' tool:
% spec rspec-webrat.rb 
.

Finished in 0.578546 seconds

1 example, 0 failures
Seems ok. Let's break the test and see what happens. Change the 'visit' line to something else:
    visit "http://www.yahoo.com/"
Now rerun the test; since it checks specifically for Google content in the page, it will fail on Yahoo's page:
 % spec rspec-webrat.rb
F

1)
'google search for my name should include semicomplete.com in results' FAILED
expected: /Google/,
     got: "Yahoo!" (using =~)
./rspec-webrat.rb:29:

Finished in 0.186847 seconds

1 example, 1 failure
This output kind of sucks. Additionally, rspec failures seem to exit with code 1, not the 2 that a nagios check wants for reporting CRITICAL. Let's fix both. First, the exit code can be hacked around directly in ruby if you want:
# Nagios checks expect exit code '2' to mean CRITICAL.
# Let's make any nonzero exit attempt always exit 2 (EXIT_CRITICAL).
EXIT_CRITICAL = 2
module Kernel
  alias :original_exit :exit
  # Keep a default so a bare 'exit' still means success.
  def exit(value = 0)
    value = EXIT_CRITICAL if value != 0
    original_exit(value)
  end
end
Fixing the output just means telling spec to use a different output format. I like the 'nested' output. Rerun that test now:
% spec -f nested rspec-webrat.rb
google search for my name
  should include semicomplete.com in results (FAILED - 1)

1)
'google search for my name should include semicomplete.com in results' FAILED
expected: /Google/,
     got: "Yahoo!" (using =~)
./rspec-webrat.rb:30:

Finished in 0.017534 seconds

1 example, 1 failure

% echo $?
2
All set.

Even better, you can include multiple checks in the same script if you want to. RSpec lets you select any test to run alone, so each of your nagios checks for a given web application could be as simple as:

define command {
  command_name check_google_for_semicomplete
  command_line /usr/bin/spec -f nested -e "google search for my name" mytests.rb
}
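To actually get alerted, that command still needs a service definition pointing at it. A minimal sketch (the host_name, service_description, and 'generic-service' template here are assumptions about your nagios setup, not anything dictated by the check):

define service {
  use                 generic-service
  host_name           monitoring-host
  service_description google search for semicomplete.com
  check_command       check_google_for_semicomplete
}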

jps output not correct

A nagios alert checking for some java processes started firing because it couldn't find them. The check uses 'jps' to look for those processes.
% sudo /usr/java/jdk1.6.0_04/bin/jps
15071 Jps
% ps -u root | grep -c java
15
I expected lots of output from jps, but there was only the jps process itself. Confusing. What does jps use to track java processes?

Your old strace (truss, whatever) friend will help you here:

# Always use 'strace -f' on java processes as they spawn new processes/threads
% sudo strace -f /usr/java/jdk1.6.0_04/bin/jps |& grep -F '("/' \
  | fex '"2/{1:2}' | sort | uniq -c | sort -n | tail -5
      5 proc/self 
      5 proc/stat 
     12 usr 
     17 tmp/hsperfdata_root 
    283 usr/java 
It referenced /tmp/hsperfdata_root multiple times. Weird, checking it out:
% ls /tmp/hsperfdata_root | wc -l
0
This directory is empty. Looking further through the strace output, and confirming by reading the classes jps uses (sun.jvmstat.perfdata.monitor.protocol.local.MonitoredHostProvider), shows that /tmp/hsperfdata_<user> is used by each jvm instance: every jvm owned by <user> stores a file there named after its pid.
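
That mapping gives you a quick cross-check. A rough sketch, assuming the jvms in question run as root and your shell supports process substitution:
# Compare running java pids against the pid-named perfdata files.
# On a healthy host, the two lists match and this prints nothing.
% comm -3 <(pgrep -u root java | sort) <(ls /tmp/hsperfdata_root | sort)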

So the question is, why is this directory empty?

On the hosts I know run java, only long-running instances of java seem to be disappearing from jps, which makes me think we have a cron job removing files from /tmp. I found this while looking through cron jobs:

% cat /etc/cron.daily/tmpwatch 
/usr/sbin/tmpwatch -x /tmp/.X11-unix -x /tmp/.XIM-unix -x /tmp/.font-unix \
        -x /tmp/.ICE-unix -x /tmp/.Test-unix 240 /tmp
/usr/sbin/tmpwatch 720 /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
        /usr/sbin/tmpwatch -f 720 "$d"
    fi
done
This file comes from the tmpwatch rpm, which appears to be part of the base install on CentOS. This means that every file in /tmp (except those excluded with '-x dir') gets deleted once it is older than 240 hours (10 days). As an FYI, the time tmpwatch inspects by default is the file's atime, so if you mount with noatime, the access time is not reliable and files that are only ever read will still look stale.
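
If you want to see which timestamps tmpwatch would be looking at, GNU stat (as shipped on CentOS) can print both the access and modification times; for example:
% stat -c 'atime: %x  mtime: %y  %n' /tmp/*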

Ultimately, we need to add a new set of '-x' flags to the cron job that exclude the /tmp/hsperfdata_* directories. This should keep me from being paged when a java process lives for more than 10 days ;)
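
Something like the following would do it. Since '-x' takes a path rather than a pattern (as far as I know), you'd list one entry per user that runs a jvm on the box; 'root' here is just the user from this example:
/usr/sbin/tmpwatch -x /tmp/.X11-unix -x /tmp/.XIM-unix -x /tmp/.font-unix \
        -x /tmp/.ICE-unix -x /tmp/.Test-unix \
        -x /tmp/hsperfdata_root 240 /tmp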

Additionally, it makes me think that people who use CentOS either don't use Java or don't monitor their java processes with jps.