photo
Jordan Sissel
geek

Wed, 24 May 2006

Grok plans and ideas

I'm almost done and graduated from RIT, which is why I haven't posted about anything recently. Finally done with this school. Wee!

Some noise woke me up from my pleasant slumber, and I can't seem to get back to sleep, so I might aswell do some thinking.

I think grok's config syntax is far too cluttered. I would much rather make it more simplified, somehow. Something like:

file /var/log/auth.log: syslog
{blockuser, t=3, i=60} Invalid user %USER% from %IP%

file /var/log/messages: syslog
{tracksudo, prog=su} BAD SU %USER:F% to %USER:T% on %TTY%

reaction blockuser "pfctl -t naughty -T add %IP%"
reaction tracksudo "echo '%USER:F% failed su. (to %USER:T%)' >> /var/log/su.log"
Seems like it's less writing than the existing version. Less writing is good. A more relaxed syntax would be nicer aswell -such as not requiring quotes around filenames.

This new syntax also splits reaction definitions from pattern matches, allowing you to call reactions from different files and other matches. I'll be keepin the perlblock and other features that reactions have, becuase they are quite useful.

I've also come up with a simple solution to reactions that execute shell commands. The current version of grok will run system("your command") every time a shell reaction is run. This is tragicaly suboptimal due to startup overhead. The simple solution is to run "/bin/sh -" so that there is already a shell accepting standard input and waiting for my commands. Simply write the shell command to that program.

I wrote a short program to test the speed of using system() vs printing to the input of a shell. You can view the testsh.pl script and the profiled output.

An abbreviated form of the profiled output follows:

%Time    Sec.     #calls   sec/call  F  name
92.24    1.5127     1000   0.001513     main::sys
 2.04    0.0335     1000   0.000033     main::sh
sys() calls system(), where sh() simply sends data to an existing shell process. The results speak for themselves, even though this example is only running "echo" - the startup time of the shell is obviously a huge factor in runtime. The difference is incredible. I am definitely implementing this in the next version of grok. I've already run into many cases where I am processing extremely large logs post-hoc and I end up using perl block reactions to speed up execution. This shell execution speed-up will help make grok even faster, and it can always use more speed.

Still no time to work on projects right now, perhaps soon! I'm moving to California in less than a week, so this will have to wait until after the move.

Comments: 0 (view comments)
Tags: , , ,
Permalink: /productivity/grok-plans
posted at: 03:04

Mon, 08 May 2006

Parallelization with /bin/sh

I have 89 log files. The average file size is 100ish megs. I want to parse all of the logs into something else useful. Processing 9.1 gigs of logs is not my idea of a good time, nor is it a good application for a single CPU to handle. Let's parallelize it.

I abuse /bin/sh's ability to background processes and wait for children to finish. I have a script that can take a pool of available computers and send tasks to it. These tasks are just "process this apache log" - but the speed increase of parallelization over single process is incredible and very simple to do in the shell.

The script to perform this parallization is here: parallelize.sh

I define a list of hosts to use in the script and pass a list of logs to process on the command line. The host list is multiplied until it is longer than the number of logs. I then pick a log and send it off to a server to process using ssh, which calls a script that outputs to stdout. Output is captured to a file delimited by the hostname and the pid.

I didn't run it single-process in full to compare running times, however, parallel execution gets *much* farther in 10 minutes than single proc does. Sweet :)

Some of the log files are *enormous* - taking up 1 gig alone. I'm experimenting with split(1) to split these files into 100,000 lines each. The problem becomes that all of the tasks are done except for the 4 processes handling the 1 gig log files (there are 4 of them). Splitting will make the individual jobs smaller, allowing us to process them faster becuase we have a more even work load across proceses.

So, a simple application benefiting from parallelization is solved by using simple, standard tools. Sexy.

Comments: 0 (view comments)
Tags: , , , , , ,
Permalink: /productivity/parallelization-with-the-shell
posted at: 17:02

Thu, 04 May 2006

RRDTool to graph log-originating data.

I need to relearn rrdtool, again, for this sysadmin time machine project. Today's efforts were spent testing for features I hoped were in RRDTool. So far, my feature needs are met :)

Take something simple, like webserver logs. Let's graph the hits.

Create the RRD:

rrdtool create webhits.rrd --start 1128626000 -s 60 \
   DS:hits:GAUGE:120:0:U RRA:AVERAGE:.5:5:600000 \
   RRA:AVERAGE:.5:30:602938 RRA:AVERAGE:.5:60:301469 \
   RRA:AVERAGE:.5:240:75367 RRA:AVERAGE:.5:1440:12561
My logs start *way* back in November of last year, so I create the rrd with a start date of sometime in Novemeber. The step is 60, so it expects data every minute. I then specify one data type, hits, which is a gaugue (rate), and ranges from 0 to infinity (U). The rest of the command is RRA's defining how data is stored. The first one says take 5 samples and average them, and store 600,000 of these samples, at a maximum.

Now that we have the database, we need a "hits-per-minute" data set. I wrote a short perl script, parsehttp that will read from standard input and calculate hits-per-minute and output rrdtool update statements. Capture this output and run it through sh:

./parsehttp < access.log | sh -x
Simple enough. This will calculate hits-per-minute for all times in the logs and store it in our RRD.

Now that we have the data, we can graph it. However, since I want to view trends and compare time periods, I'll need to do something fancier than simple graphs.

RRDTool lets you graph multiple data sets on the same graph. So, I want to graph this week's hits and last week's hits. However, since the data sets are on different time intervals, I need to shift last week's set forward by one week. Here's the rrdtool command that graphs it for us, with last week's and this week's data on the same graph, displayed at the same time period:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   LINE1:lastweek#00FF00:"last week" LINE1:hits#FF0000:"this week"
That'll look like line noise if you've never used RRDTool before. I define two data sets with DEF: hits and lastweek. They both read from the 'hits' data set in webhits.rrd. One starts at "-1 week" (one week ago, duh) and the other starts 2 weeks ago and ends last week. I then shift last week's data forward by 7 days (604800 seconds). Lastly, I draw two lines, one for last weeks (green), the other for this weeks (red).

That graph looks like this:

That's not really useful, becuase there's so many data points the graph almost becomes meaningless. This is due to my poor creation of RRAs. We can fix that by redoing the database, or using the TREND feature. Change our graph statement to be:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:lastweek#CCFFCC:"last week" LINE1:hits#FFCCCC:"this week" \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
I added only two CDEF statements. They take a data set and "trend" it by one day (86400 seconds). This creates a sliding average across time. I store these in new data sets called t_hits and t_lastweek and graph those aswell.

The new graph looks like this:

You'll notice the slide values are chopped off on the left, that's becuase it doesn't have enough data points at those time periods to make an average. However, including the raw data makes the graph scale as it did before, making viewing the trend difference awkward. So, let's fix it by not graphing the raw data. Just cut out the LINE1:lastweek and LINE1:hits options.

Fixing the sliding average cutoff, add a title, and a vertical label:

rrdtool graph webhits.png -s "-1 week" \
   -t "Web Server Hits - This week vs Last week" \
   -v "hits/minute" \
   DEF:hits=webhits.rrd:hits:AVERAGE:start="-8 days":end="start + 8 days"  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-15 days":end="start + 8 days" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week" \
The graph is still from one week ago until now, but our data sets used extend beyond those boundaries, so that sliding averages can be calculated throughout. The new, final graph, looks like this:

Now I can compare this week's hits against last weeks, quickly with a nice visual. This is what I'm looking for.

This would become truely useful if we had lots of time periods (days, weeks, whatever) to look at. Then we could calculate standard deviation, etc. A high outlier could be marked automatically with a label, giving an instant visual cue that something is potentially novel. It might be simple to create a sort-of sliding "standard deviation" curve. I haven't tried that yet.

Comments: 3 (view comments)
Tags: , , , , ,
Permalink: /geekery/rrdtool-for-the-win
posted at: 00:57

Wed, 03 May 2006

New event recording database prototype

I finally managed to find time today to work on my events database project. In the processes of doing so, I found a few bugs in grok that needed to get fixed. Some of my regular expressions were being a bit greedy, so certain pattern expansion was breaking.

To summarize the event recording system, it is a webserver listening for event publish requests. It accepts the "when" "where" and "what" of an event, and stores it in a database.

To have my logs pushed to the database, I'll leverage the awesome power of Grok. Before I do that, I gathered all of the auth.log files and archives and compiled them into their respective files.

The grok.conf for this particular maneuver:

exec "cat ./logs/nightfall.auth.log ./logs/sparks.auth.log ./logs/whitefox.auth.log" {
   type "all syslog" {
      match_syslog = 1;
      reaction = 'fetch -qo - "http://localhost:8080/?when=%SYSLOGDATE|parsedate%&where=%HOST%/%PROG|urlescape|shdq%&what=%DATA:GLOB|urlescape|shdq%"';
   };
};
This is farily simple. I added a new standard filter, 'urlescape' to grok becuase I needed it. it will url escape a data piece. Hurray!

Run grok, and it sends event notifications to the webserver for every syslog-matching line. Using FreeBSD's command-line web client, fetch.

sqlite> select count(*) from events;
8085
Now, let's look for something meaningful. I want to know what happened on all sshd services between 1am and 3am this morning (Today, May 3rd):
nightfall(~/projects/eventdb) % date -j 05030100 +%s
1146632400
nightfall(~/projects/eventdb) % date -j 05030400 +%s
1146643200
Now I know the Unix epoch times for May 3rd at 1am and 4am.
sqlite> select count(*) from events where time >= 1146632400 
   ...> and time <= 1146643200 and location like "%/sshd" 
   ...> and data like "Invalid user%";
2465
This query is instant. Much faster than doing 'grep -c' on N log files across M machines. I don't care how good your grep-fu is, you aren't going to be faster.This speed feature is only the beginning. Think broader terms. Nearly instantly zoom to any point in time to view "events" on a system or set of systems. Filter out particular events by keyword or pattern. Look for the last time a service was restarted. I could go on, but you probably get the idea. It's grep, but faster, and with more features.

As far as the protocol and implementation goes, I'm not sure how well this web-based concept is going to prevail. At this point, I am not interested in protocol or database efficiency. The prototype implementation is good enough. From what I've read about Splunk in the past months in the form of advertisements and such, it seems I already have the main feature Splunk has: searching logs easily. Perhaps I should incorporate and sell my own, better-than-Splunk, product? ;)

Bear in mind that I have no idea what Splunk actually does beyond what I've gleaned from advertisements for the product. I'm sure it's atleast somewhat useful, or no one would invest.

Certainly, a pipelined HTTP client could perform this much faster than doing 10000 individual http requests. A step further would be having the web server accept any number of events per page request. The big test is going to see how well HTTP scales, but that can be played with later.

At this point, we have come fairly close to the general idea of this project: Allowing you to zoom to particular locations in time and view system events.

The server code for doing this was very easy. I chose Python and started playing with CherryPy (a webserver framework). I had a working event reciever server in about 30 minutes. 29 minutes of that time was spent writing a threadsafe database class to front for pysqlite. The CherryPy bits only amount to about 10 lines of code, out of 90ish.

The code to do the server can be found here: /scripts/cherrycollector.py

Comments: 0 (view comments)
Tags: , , , , , ,
Permalink: /geekery/grok-and-eventdb
posted at: 03:22

Thu, 06 Apr 2006

Parsing nfsstat(1) for only version X information

nfsstat | sed -ne '/Version 3/,/^$/p'
sed++

When I was bored (at 4 am, no-less), I kept trying to parse this information out using some crazy tricks with 'x' (swap pattern/hold) and other stuff, but I forgot the fact that regexps are valid addresses. So, we can print anything between 'Version 3' and blank lines, anywhere in our output.

The next thing I want to try with this is to automagically parse nfsstat output into a format that is more machine readable, this will probably be using awk or perl, seeing as how doing it with sed may hurt my brain a bit. Furthermore, trying to read the sed that did said operations would be somewhat intense ;)

The output looks something like this, on Solaris 9:

Version 3: (535958 calls)
null        getattr     setattr     lookup      access      readlink    
0 0%        242223 45%  20606 3%    52504 9%    20025 3%    41 0%       
read        write       create      mkdir       symlink     mknod       
14138 2%    146618 27%  5525 1%     145 0%      337 0%      0 0%        
remove      rmdir       rename      link        readdir     readdirplus 
6279 1%     7 0%        1539 0%     1518 0%     1606 0%     6587 1%     
Parsing this would mean generating a tree-like dictionary. In perl, it may look like:
%foo = (
	'Version 3' => {
		null => 0,
		getattr => 242223,
		setattr => 20606,
		lookup => 52504,
		# .... etc ...
		}
	)
Should be simple enough, we'll see what happens next time I get bored.

Comments: 2 (view comments)
Tags: , , , , , , ,
Permalink: /oneliners/solaris-nfsstat-sed
posted at: 04:07

Search this site

Navigation

Metadata

Home About Resume My Code (SVN)

Articles

ARP Security Dynamic DNS with DHCP OpenLDAP+Kerberos+SASL PPP over SSH SSH Security: /bin/false Week of Unix Tools Work Efficiency

Projects

fex firefox tabsearch firefox urledit grok keynav liboverride newpsm (FreeBSD) nis2ldap pam_captcha poor man's backup Solaris audio utility xboxproxy xdotool xmlpresenter xpathtool misc scripts

Presentations

Yahoo! Hack Day '06 Unix Essentials Vi/Vim Essentials

Tag Cloud

Calendar

< May 2006 >
SuMoTuWeThFrSa
  1 2 3 4 5 6
7 8 910111213
14151617181920
21222324252627
28293031   

Friends

BarCamp Kent Brewster Tantek Çelik John Resig Wesley Shields Tyler Shields

Technorati