photo
Jordan Sissel
geek

Wed, 24 May 2006

Grok plans and ideas

I'm almost done and graduated from RIT, which is why I haven't posted about anything recently. Finally done with this school. Wee!

Some noise woke me up from my pleasant slumber, and I can't seem to get back to sleep, so I might aswell do some thinking.

I think grok's config syntax is far too cluttered. I would much rather make it more simplified, somehow. Something like:

file /var/log/auth.log: syslog
{blockuser, t=3, i=60} Invalid user %USER% from %IP%

file /var/log/messages: syslog
{tracksudo, prog=su} BAD SU %USER:F% to %USER:T% on %TTY%

reaction blockuser "pfctl -t naughty -T add %IP%"
reaction tracksudo "echo '%USER:F% failed su. (to %USER:T%)' >> /var/log/su.log"
Seems like it's less writing than the existing version. Less writing is good. A more relaxed syntax would be nicer aswell -such as not requiring quotes around filenames.

This new syntax also splits reaction definitions from pattern matches, allowing you to call reactions from different files and other matches. I'll be keepin the perlblock and other features that reactions have, becuase they are quite useful.

I've also come up with a simple solution to reactions that execute shell commands. The current version of grok will run system("your command") every time a shell reaction is run. This is tragicaly suboptimal due to startup overhead. The simple solution is to run "/bin/sh -" so that there is already a shell accepting standard input and waiting for my commands. Simply write the shell command to that program.

I wrote a short program to test the speed of using system() vs printing to the input of a shell. You can view the testsh.pl script and the profiled output.

An abbreviated form of the profiled output follows:

%Time    Sec.     #calls   sec/call  F  name
92.24    1.5127     1000   0.001513     main::sys
 2.04    0.0335     1000   0.000033     main::sh
sys() calls system(), where sh() simply sends data to an existing shell process. The results speak for themselves, even though this example is only running "echo" - the startup time of the shell is obviously a huge factor in runtime. The difference is incredible. I am definitely implementing this in the next version of grok. I've already run into many cases where I am processing extremely large logs post-hoc and I end up using perl block reactions to speed up execution. This shell execution speed-up will help make grok even faster, and it can always use more speed.

Still no time to work on projects right now, perhaps soon! I'm moving to California in less than a week, so this will have to wait until after the move.

Comments: 0 (view comments)
Tags: , , ,
Permalink: /productivity/grok-plans
posted at: 03:04

Mon, 08 May 2006

Parallelization with /bin/sh

I have 89 log files. The average file size is 100ish megs. I want to parse all of the logs into something else useful. Processing 9.1 gigs of logs is not my idea of a good time, nor is it a good application for a single CPU to handle. Let's parallelize it.

I abuse /bin/sh's ability to background processes and wait for children to finish. I have a script that can take a pool of available computers and send tasks to it. These tasks are just "process this apache log" - but the speed increase of parallelization over single process is incredible and very simple to do in the shell.

The script to perform this parallization is here: parallelize.sh

I define a list of hosts to use in the script and pass a list of logs to process on the command line. The host list is multiplied until it is longer than the number of logs. I then pick a log and send it off to a server to process using ssh, which calls a script that outputs to stdout. Output is captured to a file delimited by the hostname and the pid.

I didn't run it single-process in full to compare running times, however, parallel execution gets *much* farther in 10 minutes than single proc does. Sweet :)

Some of the log files are *enormous* - taking up 1 gig alone. I'm experimenting with split(1) to split these files into 100,000 lines each. The problem becomes that all of the tasks are done except for the 4 processes handling the 1 gig log files (there are 4 of them). Splitting will make the individual jobs smaller, allowing us to process them faster becuase we have a more even work load across proceses.

So, a simple application benefiting from parallelization is solved by using simple, standard tools. Sexy.

Comments: 0 (view comments)
Tags: , , , , , ,
Permalink: /productivity/parallelization-with-the-shell
posted at: 17:02

Sat, 06 May 2006

The CSH Bawls Programming Competition

Yesterday, I participated in a 12-hour coding-binge competition. It started at 7pm Friday night and ran until 7am Saturday morning. It was fueled by Computer Science House and Bawls, both sponsors of the event. Needless to say, I haven't gotten much sleep today.

The competition website is here. Go there if you want to view this year's objectives.

The Dream Team consisted of John Resig, Darrin Mann, Matt Bruce, and myself. Darrin, Resig, and I are all quite proficient at web development, so we decided this year we would represent ourselves as "Team JavaScript" - and do everything possible in javascript. Bruce is not a programmer, but I enlisted his graphical art skills because I figured with our team doing some web-based project, we definitely needed an artist.

After reviewing all the objectives, we came up with a significant modification upon the Sudoku objective. The sudoku objective was a problem that lacked much room for innovation, so we went further and instead of solving Sudoku, wrote a web-based version of an extremely popular game in Second Life. The contest organizer approved of our new objective, so we did just that.

Resig worked on game logic, I worked on chat features, Darrin worked on scoring and game generation, and Bruce worked on the interface graphics. Becuase our tasks were all mostly unrelated, we could develop them independently. Most of the game was completed in about 6 hours, and the remainder of the time was spent fixing bugs, refactoring, and some minor redesign.

The backends were minimal. The chat backend was only 70 lines of perl, and the score backend was 9 lines of /bin/sh. Everything else was handled in the browser. We leveraged Resig's jQuery to make development faster. Development went extremely smooth, a testament to the "Dream Team"-nature of our team, perhaps? ;)

The game worked by presenting everyone with the same game - so you can compete for the highest score. You could also chat during and between games, if you wanted to.

A screenshot can be found here. At the end of the competition, we only had one known bug left. That bug didn't affect gameplay, and we were all tired, so it didn't get fixed. There were a few other issues that remained unresolved that may or may not be related to our code. Firefox was having issues with various things we were doing, and we couldn't tell if it was our fault or not.

Despite the fact that I probably shouldn't have attended the competition due to scholastic time constraints, I was glad I went. We had a blast writing the game.

We may get some time in the near future to improve the codebase and put it up online so anyone can play. There are quite a few important features that need to be added before it'll be useful as a public game.

Comments: 1 (view comments)
Tags: , , , , , , , ,
Permalink: /geekery/bawls-competition-tringo
posted at: 19:11

Thu, 04 May 2006

RRDTool and Data Forecasting

After reading some more RRDTool documentation, I've found some behavior detection features. I have not attempted to use this yet, but am eager to try. For information on this behavior detection and forecasting feature, read the rrdcreate manpage and search for "Aberrant Behavior" or "Holt-Winters Forecasting"

I am not sure how the algorithm works, but reading about it gives me some confidence in it's ability. Testing is required

Comments: 0 (view comments)
Tags: ,
Permalink: /geekery/rrdtool-behavior-detection
posted at: 01:58

RRDTool to graph log-originating data.

I need to relearn rrdtool, again, for this sysadmin time machine project. Today's efforts were spent testing for features I hoped were in RRDTool. So far, my feature needs are met :)

Take something simple, like webserver logs. Let's graph the hits.

Create the RRD:

rrdtool create webhits.rrd --start 1128626000 -s 60 \
   DS:hits:GAUGE:120:0:U RRA:AVERAGE:.5:5:600000 \
   RRA:AVERAGE:.5:30:602938 RRA:AVERAGE:.5:60:301469 \
   RRA:AVERAGE:.5:240:75367 RRA:AVERAGE:.5:1440:12561
My logs start *way* back in November of last year, so I create the rrd with a start date of sometime in Novemeber. The step is 60, so it expects data every minute. I then specify one data type, hits, which is a gaugue (rate), and ranges from 0 to infinity (U). The rest of the command is RRA's defining how data is stored. The first one says take 5 samples and average them, and store 600,000 of these samples, at a maximum.

Now that we have the database, we need a "hits-per-minute" data set. I wrote a short perl script, parsehttp that will read from standard input and calculate hits-per-minute and output rrdtool update statements. Capture this output and run it through sh:

./parsehttp < access.log | sh -x
Simple enough. This will calculate hits-per-minute for all times in the logs and store it in our RRD.

Now that we have the data, we can graph it. However, since I want to view trends and compare time periods, I'll need to do something fancier than simple graphs.

RRDTool lets you graph multiple data sets on the same graph. So, I want to graph this week's hits and last week's hits. However, since the data sets are on different time intervals, I need to shift last week's set forward by one week. Here's the rrdtool command that graphs it for us, with last week's and this week's data on the same graph, displayed at the same time period:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   LINE1:lastweek#00FF00:"last week" LINE1:hits#FF0000:"this week"
That'll look like line noise if you've never used RRDTool before. I define two data sets with DEF: hits and lastweek. They both read from the 'hits' data set in webhits.rrd. One starts at "-1 week" (one week ago, duh) and the other starts 2 weeks ago and ends last week. I then shift last week's data forward by 7 days (604800 seconds). Lastly, I draw two lines, one for last weeks (green), the other for this weeks (red).

That graph looks like this:

That's not really useful, becuase there's so many data points the graph almost becomes meaningless. This is due to my poor creation of RRAs. We can fix that by redoing the database, or using the TREND feature. Change our graph statement to be:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:lastweek#CCFFCC:"last week" LINE1:hits#FFCCCC:"this week" \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
I added only two CDEF statements. They take a data set and "trend" it by one day (86400 seconds). This creates a sliding average across time. I store these in new data sets called t_hits and t_lastweek and graph those aswell.

The new graph looks like this:

You'll notice the slide values are chopped off on the left, that's becuase it doesn't have enough data points at those time periods to make an average. However, including the raw data makes the graph scale as it did before, making viewing the trend difference awkward. So, let's fix it by not graphing the raw data. Just cut out the LINE1:lastweek and LINE1:hits options.

Fixing the sliding average cutoff, add a title, and a vertical label:

rrdtool graph webhits.png -s "-1 week" \
   -t "Web Server Hits - This week vs Last week" \
   -v "hits/minute" \
   DEF:hits=webhits.rrd:hits:AVERAGE:start="-8 days":end="start + 8 days"  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-15 days":end="start + 8 days" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week" \
The graph is still from one week ago until now, but our data sets used extend beyond those boundaries, so that sliding averages can be calculated throughout. The new, final graph, looks like this:

Now I can compare this week's hits against last weeks, quickly with a nice visual. This is what I'm looking for.

This would become truely useful if we had lots of time periods (days, weeks, whatever) to look at. Then we could calculate standard deviation, etc. A high outlier could be marked automatically with a label, giving an instant visual cue that something is potentially novel. It might be simple to create a sort-of sliding "standard deviation" curve. I haven't tried that yet.

Comments: 3 (view comments)
Tags: , , , , ,
Permalink: /geekery/rrdtool-for-the-win
posted at: 00:57

Wed, 03 May 2006

New event recording database prototype

I finally managed to find time today to work on my events database project. In the processes of doing so, I found a few bugs in grok that needed to get fixed. Some of my regular expressions were being a bit greedy, so certain pattern expansion was breaking.

To summarize the event recording system, it is a webserver listening for event publish requests. It accepts the "when" "where" and "what" of an event, and stores it in a database.

To have my logs pushed to the database, I'll leverage the awesome power of Grok. Before I do that, I gathered all of the auth.log files and archives and compiled them into their respective files.

The grok.conf for this particular maneuver:

exec "cat ./logs/nightfall.auth.log ./logs/sparks.auth.log ./logs/whitefox.auth.log" {
   type "all syslog" {
      match_syslog = 1;
      reaction = 'fetch -qo - "http://localhost:8080/?when=%SYSLOGDATE|parsedate%&where=%HOST%/%PROG|urlescape|shdq%&what=%DATA:GLOB|urlescape|shdq%"';
   };
};
This is farily simple. I added a new standard filter, 'urlescape' to grok becuase I needed it. it will url escape a data piece. Hurray!

Run grok, and it sends event notifications to the webserver for every syslog-matching line. Using FreeBSD's command-line web client, fetch.

sqlite> select count(*) from events;
8085
Now, let's look for something meaningful. I want to know what happened on all sshd services between 1am and 3am this morning (Today, May 3rd):
nightfall(~/projects/eventdb) % date -j 05030100 +%s
1146632400
nightfall(~/projects/eventdb) % date -j 05030400 +%s
1146643200
Now I know the Unix epoch times for May 3rd at 1am and 4am.
sqlite> select count(*) from events where time >= 1146632400 
   ...> and time <= 1146643200 and location like "%/sshd" 
   ...> and data like "Invalid user%";
2465
This query is instant. Much faster than doing 'grep -c' on N log files across M machines. I don't care how good your grep-fu is, you aren't going to be faster.This speed feature is only the beginning. Think broader terms. Nearly instantly zoom to any point in time to view "events" on a system or set of systems. Filter out particular events by keyword or pattern. Look for the last time a service was restarted. I could go on, but you probably get the idea. It's grep, but faster, and with more features.

As far as the protocol and implementation goes, I'm not sure how well this web-based concept is going to prevail. At this point, I am not interested in protocol or database efficiency. The prototype implementation is good enough. From what I've read about Splunk in the past months in the form of advertisements and such, it seems I already have the main feature Splunk has: searching logs easily. Perhaps I should incorporate and sell my own, better-than-Splunk, product? ;)

Bear in mind that I have no idea what Splunk actually does beyond what I've gleaned from advertisements for the product. I'm sure it's atleast somewhat useful, or no one would invest.

Certainly, a pipelined HTTP client could perform this much faster than doing 10000 individual http requests. A step further would be having the web server accept any number of events per page request. The big test is going to see how well HTTP scales, but that can be played with later.

At this point, we have come fairly close to the general idea of this project: Allowing you to zoom to particular locations in time and view system events.

The server code for doing this was very easy. I chose Python and started playing with CherryPy (a webserver framework). I had a working event reciever server in about 30 minutes. 29 minutes of that time was spent writing a threadsafe database class to front for pysqlite. The CherryPy bits only amount to about 10 lines of code, out of 90ish.

The code to do the server can be found here: /scripts/cherrycollector.py

Comments: 0 (view comments)
Tags: , , , , , ,
Permalink: /geekery/grok-and-eventdb
posted at: 03:22

Python is getting on my nerves

Add lacking dynamic assignment ability to my "I wish Python had Foo" list.

Python does not appear to have dynamically assignable arrays. Where are we, C? Assembly? When I assign past the end of the array, I mean resize the god damned array. Thanks.

nightfall(~) % python -c "foo = []; foo[3] = 234"
Traceback (most recent call last):
  File "<string>", line 1, in ?
  IndexError: list assignment index out of range
This is completely unacceptable. Sure, I can use list comprehensions to make an N element array that's empty:
foo = [None for x in range(100)]
foo[44] = "Hi"
That only gets me an array with 100 empty elements. Uh.. Not what I want. If I did this on an array with data in it I didn't want to lose, I'd lose all the data.

Sigh...

Comments: 2 (view comments)
Tags: ,
Permalink: /rants/python-problems
posted at: 00:15

Tue, 02 May 2006

Event DB - temporal event storage

Travelling always gives me lots of time to think about new ideas. Today's 12 hours of flight gave me some some to spend brainstorming some ideas for my "sysadmin time machine" project.

I've come up with the following so far:

  • The concept of an event is something which has "when, where, and what"-ness. Other properties of events such as significance and who-reported-it are trivial. The key bits are when the event occurred, where it occurred, and what the event was.
  • Software logs happen to have these three key properties. Put simply: store it in a database that lets you search over a range of times and you have yourself a time machine.
  • Couple this with visualizations and statistical analysis. Trends are important. Automatic novelty detection is important.
  • Trends can be seen by viewing data over time - whether visual or formulaic (though the former is easier for Joe Average to see). An example trend would be showing a gradual increase in disk usage over a period of time.
  • Novelty detection can occur a number of ways. Something as simple as a homoskedasticity test could show if data were "normal" - though homoskedasticity only works well for linear models, iirc. I am not a statistician.
  • Trend calculation can provide useful models predicting resource exhaustion, MTBNF, and other important data.
  • Novelty detection aids in fire fighting and post-hoc "Oops it's broken" forensics.
I'm hoping to find time to get an event storage prototype working soon. The following step would be to leverage RRDtool as a numeric storage medium and perform novelty/trend detection/analysis from its data.

The overall goal of this is to somewhat automate problem detection and significantly aid in problem cause/effect searching.

The eventdb system will likely support many interfaces:

  • syslog - hosts can log directly to eventdb
  • command line - scriptably/manually push data to the eventdb
  • generic numeric data input - a lame frontend to rrdtool, perhaps
Thus, all data would be pushed through eventdb which would figure out which on-disk data medium to store it in. Queries could be done asking eventdb about things such as "show me yesterday's mysql activity around 3am" or "compare average syscall usage across this week and last week"

This sort of trend and novelty mapping would be extremely useful in a production software environment to compare configuration or software changes. That is, last month's syscall averages might be much lower than this months - and perhaps the only change was a configuration file change or new software being pushed to production. You would be able to track the problem back to when it first showed up - hopefully correllating to some change that was known about. After all, the first step in solving a problem is knowing of its existence.

My experience with data analysis techniques is not extensive. So I wouldn't expect the data analysis tools in the prototype to sport anything fancy.

I need more hours in a day! Oh well, back to doing homework. Hopefully I'll have some time to prototype something soon.

Comments: 0 (view comments)
Tags: , ,
Permalink: /geekery/eventdb-ideas
posted at: 04:11

Search this site

Navigation

Metadata

Home About Resume My Code

Articles

ARP Security Dynamic DNS with DHCP OpenLDAP+Kerberos+SASL PPP over SSH SSH Security: /bin/false Week of Unix Tools Work Efficiency

Projects

fex firefox tabsearch firefox urledit grok keynav liboverride newpsm (FreeBSD) nis2ldap pam_captcha poor man's backup Solaris audio utility xboxproxy xdotool xmlpresenter xpathtool misc scripts

Presentations

Yahoo! Hack Day '06 Unix Essentials Vi/Vim Essentials

Tag Cloud

Calendar

< May 2006 >
SuMoTuWeThFrSa
  1 2 3 4 5 6
7 8 910111213
14151617181920
21222324252627
28293031   

Friends

BarCamp Kent Brewster Tantek Çelik John Resig Wesley Shields Tyler Shields

Technorati