I'm almost done and graduated from RIT, which is why I haven't posted about
anything recently. Finally done with this school. Wee!
Some noise woke me up from my pleasant slumber, and I can't seem to get back to
sleep, so I might aswell do some thinking.
I think grok's config syntax is far too cluttered. I would much rather make it
more simplified, somehow. Something like:
file /var/log/auth.log: syslog
{blockuser, t=3, i=60} Invalid user %USER% from %IP%
file /var/log/messages: syslog
{tracksudo, prog=su} BAD SU %USER:F% to %USER:T% on %TTY%
reaction blockuser "pfctl -t naughty -T add %IP%"
reaction tracksudo "echo '%USER:F% failed su. (to %USER:T%)' >> /var/log/su.log"
Seems like it's less writing than the existing version. Less writing is good. A
more relaxed syntax would be nicer aswell -such as not requiring quotes around
filenames.
This new syntax also splits reaction definitions from pattern matches, allowing
you to call reactions from different files and other matches. I'll be keepin
the perlblock and other features that reactions have, becuase they are quite
useful.
I've also come up with a simple solution to reactions that execute shell
commands. The current version of grok will run system("your
command") every time a shell reaction is run. This is tragicaly
suboptimal due to startup overhead. The simple solution is to run
"/bin/sh -" so that there is already a shell accepting standard
input and waiting for my commands. Simply write the shell command to that
program.
I wrote a short program to test the speed of using system() vs
printing to the input of a shell. You can view the testsh.pl script and
the profiled
output.
An abbreviated form of the profiled output follows:
%Time Sec. #calls sec/call F name
92.24 1.5127 1000 0.001513 main::sys
2.04 0.0335 1000 0.000033 main::sh
sys() calls system(), where sh() simply sends data to an existing shell
process. The results speak for themselves, even though this example is only
running "echo" - the startup time of the shell is obviously a huge factor in
runtime. The difference is incredible. I am definitely implementing this in the
next version of grok. I've already run into many cases where I am processing
extremely large logs post-hoc and I end up using perl block reactions to speed
up execution. This shell execution speed-up will help make grok even faster,
and it can always use more speed.
Still no time to work on projects right now, perhaps soon! I'm moving to
California in less than a week, so this will have to wait until after the move.
Comments: 0 (view comments)
Tags: grok, ideas, sysadmin, logs
Permalink: /productivity/grok-plans
posted at: 03:04
I have 89 log files. The average file size is 100ish megs. I want to parse all of the logs into something else useful. Processing 9.1 gigs of logs is not my idea of a good time, nor is it a good application for a single CPU to handle. Let's parallelize it.
I abuse /bin/sh's ability to background processes and wait for children to finish. I have a script that can take a pool of available computers and send tasks to it. These tasks are just "process this apache log" - but the speed increase of parallelization over single process is incredible and very simple to do in the shell.
The script to perform this parallization is here: parallelize.sh
I define a list of hosts to use in the script and pass a list of logs to
process on the command line. The host list is multiplied until it is longer
than the number of logs. I then pick a log and send it off to a server to
process using ssh, which calls a script that outputs to stdout. Output is
captured to a file delimited by the hostname and the pid.
I didn't run it single-process in full to compare running times, however, parallel execution gets *much* farther in 10 minutes than single proc does. Sweet :)
Some of the log files are *enormous* - taking up 1 gig alone. I'm experimenting
with split(1) to split these files into 100,000 lines each. The problem becomes
that all of the tasks are done except for the 4 processes handling the 1 gig
log files (there are 4 of them). Splitting will make the individual jobs
smaller, allowing us to process them faster becuase we have a more even work
load across proceses.
So, a simple application benefiting from parallelization is solved by using simple, standard tools. Sexy.
Comments: 0 (view comments)
Tags: sysadmin, shell, parallelization, productivity, tricks, unix, logs
Permalink: /productivity/parallelization-with-the-shell
posted at: 17:02
Yesterday, I participated in a 12-hour coding-binge competition. It started at
7pm Friday night and ran until 7am Saturday morning. It was fueled by Computer
Science House and Bawls, both sponsors of the event. Needless to say, I haven't
gotten much sleep today.
The competition website is here. Go there if you
want to view this year's objectives.
The Dream Team consisted of John Resig, Darrin
Mann, Matt Bruce, and myself. Darrin, Resig, and I are all quite proficient at
web development, so we decided this year we would represent ourselves as "Team
JavaScript" - and do everything possible in javascript. Bruce is not a
programmer, but I enlisted his graphical art skills because I figured with our
team doing some web-based project, we definitely needed an artist.
After reviewing all the objectives, we came up with a significant modification
upon the Sudoku objective. The sudoku objective was a problem that lacked much
room for innovation, so we went further and instead of solving Sudoku, wrote a
web-based version of an extremely popular game in Second Life. The contest
organizer approved of our new objective, so we did just that.
Resig worked on game logic, I worked on chat features, Darrin worked on scoring
and game generation, and Bruce worked on the interface graphics. Becuase our
tasks were all mostly unrelated, we could develop them independently. Most of
the game was completed in about 6 hours, and the remainder of the time was
spent fixing bugs, refactoring, and some minor redesign.
The backends were minimal. The chat backend was only 70 lines of perl, and the
score backend was 9 lines of /bin/sh. Everything else was handled in the
browser. We leveraged Resig's jQuery to make development faster. Development
went extremely smooth, a testament to the "Dream Team"-nature of our team,
perhaps? ;)
The game worked by presenting everyone with the same game - so you can compete
for the highest score. You could also chat during and between games, if you
wanted to.
A screenshot can be found here. At the end of the competition, we only had one
known bug left. That bug didn't affect gameplay, and we were all tired, so it
didn't get fixed. There were a few other issues that remained unresolved that
may or may not be related to our code. Firefox was having issues with various
things we were doing, and we couldn't tell if it was our fault or not.
Despite the fact that I probably shouldn't have attended the competition due to
scholastic time constraints, I was glad I went. We had a blast writing the game.
We may get some time in the near future to improve the codebase and put it up
online so anyone can play. There are quite a few important features that need to
be added before it'll be useful as a public game.
Comments: 1 (view comments)
Tags: nosleep, perl, javascript, web2.0, jquery, shell, web, xml, codebinge
Permalink: /geekery/bawls-competition-tringo
posted at: 19:11
After reading some more RRDTool documentation, I've found some behavior
detection features. I have not attempted to use this yet, but am eager to try.
For information on this behavior detection and forecasting feature, read the
rrdcreate manpage and search for "Aberrant Behavior" or "Holt-Winters
Forecasting"
I am not sure how the algorithm works, but reading about it gives me some
confidence in it's ability. Testing is required
Comments: 0 (view comments)
Tags: rrdtool, statistics
Permalink: /geekery/rrdtool-behavior-detection
posted at: 01:58
I need to relearn rrdtool, again, for this sysadmin time machine project.
Today's efforts were spent testing for features I hoped were in RRDTool. So
far, my feature needs are met :)
Take something simple, like webserver logs. Let's graph the hits.
Create the RRD:
rrdtool create webhits.rrd --start 1128626000 -s 60 \
DS:hits:GAUGE:120:0:U RRA:AVERAGE:.5:5:600000 \
RRA:AVERAGE:.5:30:602938 RRA:AVERAGE:.5:60:301469 \
RRA:AVERAGE:.5:240:75367 RRA:AVERAGE:.5:1440:12561
My logs start *way* back in November of last year, so I create the rrd with a
start date of sometime in Novemeber. The step is 60, so it expects data every
minute. I then specify one data type, hits, which is a gaugue (rate), and
ranges from 0 to infinity (U). The rest of the command is RRA's defining how
data is stored. The first one says take 5 samples and average them, and store
600,000 of these samples, at a maximum.
Now that we have the database, we need a "hits-per-minute" data set. I wrote a
short perl script, parsehttp that will
read from standard input and calculate hits-per-minute and output rrdtool
update statements. Capture this output and run it through sh:
./parsehttp < access.log | sh -x
Simple enough. This will calculate hits-per-minute for all times in the logs
and store it in our RRD.
Now that we have the data, we can graph it. However, since I want to view
trends and compare time periods, I'll need to do something fancier than simple
graphs.
RRDTool lets you graph multiple data sets on the same graph. So, I want to
graph this week's hits and last week's hits. However, since the data sets are
on different time intervals, I need to shift last week's set forward by one
week. Here's the rrdtool command that graphs it for us, with last week's and
this week's data on the same graph, displayed at the same time period:
rrdtool graph webhits.png -s "-1 week" \
DEF:hits=webhits.rrd:hits:AVERAGE \
DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
SHIFT:lastweek:604800 \
LINE1:lastweek#00FF00:"last week" LINE1:hits#FF0000:"this week"
That'll look like line noise if you've never used RRDTool before. I define two
data sets with DEF: hits and lastweek. They both read from the 'hits' data set
in webhits.rrd. One starts at "-1 week" (one week ago, duh) and the other
starts 2 weeks ago and ends last week. I then shift last week's data forward by
7 days (604800 seconds). Lastly, I draw two lines, one for last weeks (green),
the other for this weeks (red).
That graph looks like this:
That's not really useful, becuase there's so many data points the graph almost
becomes meaningless. This is due to my poor creation of RRAs. We can fix that
by redoing the database, or using the TREND feature.
Change our graph statement to be:
rrdtool graph webhits.png -s "-1 week" \
DEF:hits=webhits.rrd:hits:AVERAGE \
DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
SHIFT:lastweek:604800 \
CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
LINE1:lastweek#CCFFCC:"last week" LINE1:hits#FFCCCC:"this week" \
LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
I added only two CDEF statements. They take a data set and "trend" it by one
day (86400 seconds). This creates a sliding average across time. I store these
in new data sets called t_hits and t_lastweek and graph those aswell.
The new graph looks like this:
You'll notice the slide values are chopped off on the left, that's becuase it
doesn't have enough data points at those time periods to make an average.
However, including the raw data makes the graph scale as it did before, making
viewing the trend difference awkward. So, let's fix it by not graphing the raw
data. Just cut out the LINE1:lastweek and LINE1:hits options.
Fixing the sliding average cutoff, add a title, and a vertical label:
rrdtool graph webhits.png -s "-1 week" \
-t "Web Server Hits - This week vs Last week" \
-v "hits/minute" \
DEF:hits=webhits.rrd:hits:AVERAGE:start="-8 days":end="start + 8 days" \
DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-15 days":end="start + 8 days" \
SHIFT:lastweek:604800 \
CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week" \
The graph is still from one week ago until now, but our data sets used extend
beyond those boundaries, so that sliding averages can be calculated throughout.
The new, final graph, looks like this:
Now I can compare this week's hits against last weeks, quickly with a nice
visual. This is what I'm looking for.
This would become truely useful if we had lots of time periods (days, weeks,
whatever) to look at. Then we could calculate standard deviation, etc. A high
outlier could be marked automatically with a label, giving an instant visual
cue that something is potentially novel. It might be simple to create a sort-of
sliding "standard deviation" curve. I haven't tried that yet.
Comments: 3 (view comments)
Tags: rrdtool, statistics, perl, graphs, logs, sysadmin
Permalink: /geekery/rrdtool-for-the-win
posted at: 00:57
I finally managed to find time today to work on my events database project. In
the processes of doing so, I found a few bugs in grok that needed to get fixed.
Some of my regular expressions were being a bit greedy, so certain pattern
expansion was breaking.
To summarize the event recording system, it is a webserver listening for event
publish requests. It accepts the "when" "where" and "what" of an event, and
stores it in a database.
To have my logs pushed to the database, I'll leverage the awesome power of
Grok. Before I do that, I gathered all of the auth.log files and archives and
compiled them into their respective files.
The grok.conf for this particular maneuver:
exec "cat ./logs/nightfall.auth.log ./logs/sparks.auth.log ./logs/whitefox.auth.log" {
type "all syslog" {
match_syslog = 1;
reaction = 'fetch -qo - "http://localhost:8080/?when=%SYSLOGDATE|parsedate%&where=%HOST%/%PROG|urlescape|shdq%&what=%DATA:GLOB|urlescape|shdq%"';
};
};
This is farily simple. I added a new standard filter, 'urlescape' to grok
becuase I needed it. it will url escape a data piece. Hurray!
Run grok, and it sends event notifications to the webserver for every
syslog-matching line. Using FreeBSD's command-line web client, fetch.
sqlite> select count(*) from events;
8085
Now, let's look for something meaningful. I want to know what happened on all sshd services between 1am and 3am this morning (Today, May 3rd):
nightfall(~/projects/eventdb) % date -j 05030100 +%s
1146632400
nightfall(~/projects/eventdb) % date -j 05030400 +%s
1146643200
Now I know the Unix epoch times for May 3rd at 1am and 4am.
sqlite> select count(*) from events where time >= 1146632400
...> and time <= 1146643200 and location like "%/sshd"
...> and data like "Invalid user%";
2465
This query is instant. Much faster than doing 'grep -c' on N log files across M
machines. I don't care how good your grep-fu is, you aren't going to be
faster.This speed feature is only the beginning. Think broader terms. Nearly
instantly zoom to any point in time to view "events" on a system or set of
systems. Filter out particular events by keyword or pattern. Look for the last
time a service was restarted. I could go on, but you probably get the idea.
It's grep, but faster, and with more features.
As far as the protocol and implementation goes, I'm not sure how well this
web-based concept is going to prevail. At this point, I am not interested in
protocol or database efficiency. The prototype implementation is good enough.
From what I've read about Splunk in the past months in the form of
advertisements and such, it seems I already have the main feature Splunk has:
searching logs easily. Perhaps I should incorporate and sell my own,
better-than-Splunk, product? ;)
Bear in mind that I have no idea what Splunk actually does beyond what I've
gleaned from advertisements for the product. I'm sure it's atleast somewhat
useful, or no one would invest.
Certainly, a pipelined HTTP client could perform this much faster than doing
10000 individual http requests. A step further would be having the web server
accept any number of events per page request. The big test is going to see how
well HTTP scales, but that can be played with later.
At this point, we have come fairly close to the general idea of this project:
Allowing you to zoom to particular locations in time and view system events.
The server code for doing this was very easy. I chose Python and started
playing with CherryPy (a webserver framework). I had a working event reciever
server in about 30 minutes. 29 minutes of that time was spent writing a
threadsafe database class to front for pysqlite. The CherryPy bits only amount
to about 10 lines of code, out of 90ish.
The code to do the server can be found here:
/scripts/cherrycollector.py
Comments: 0 (view comments)
Tags: grok, eventdb, sysadmin, perl, python, projects, oh-my-god-it's-4am
Permalink: /geekery/grok-and-eventdb
posted at: 03:22
Add lacking dynamic assignment ability to my "I wish Python had Foo" list.
Python does not appear to have dynamically assignable arrays. Where are we, C? Assembly? When I assign past the end of the array, I mean resize the god damned array. Thanks.
nightfall(~) % python -c "foo = []; foo[3] = 234"
Traceback (most recent call last):
File "<string>", line 1, in ?
IndexError: list assignment index out of range
This is completely unacceptable.
Sure, I can use list comprehensions to make an N element array that's empty:
foo = [None for x in range(100)]
foo[44] = "Hi"
That only gets me an array with 100 empty elements. Uh.. Not what I want. If I
did this on an array with data in it I didn't want to lose, I'd lose all the
data.
Sigh...
Comments: 2 (view comments)
Tags: rants, python
Permalink: /rants/python-problems
posted at: 00:15
Travelling always gives me lots of time to think about new ideas.
Today's 12 hours of flight gave me some some to spend brainstorming
some ideas for my "sysadmin time machine" project.
I've come up with the following so far:
-
The concept of an event is something which has "when, where, and
what"-ness. Other properties of events such as significance and
who-reported-it are trivial. The key bits are when the event
occurred, where it occurred, and what the event was.
-
Software logs happen to have these three key properties. Put
simply: store it in a database that lets you search over a range
of times and you have yourself a time machine.
-
Couple this with visualizations and statistical analysis.
Trends are important. Automatic novelty detection is important.
-
Trends can be seen by viewing data over time - whether visual or
formulaic (though the former is easier for Joe Average to see). An
example trend would be showing a gradual increase in disk usage
over a period of time.
-
Novelty detection can occur a number of ways. Something as simple
as a homoskedasticity test could show if data were "normal" -
though homoskedasticity only works well for linear models, iirc. I
am not a statistician.
-
Trend calculation can provide useful models predicting resource
exhaustion, MTBNF, and other important data.
-
Novelty detection aids in fire fighting and post-hoc "Oops it's
broken" forensics.
I'm hoping to find time to get an event storage prototype working soon.
The following step would be to leverage RRDtool as a numeric storage
medium and perform novelty/trend detection/analysis from its data.
The overall goal of this is to somewhat automate problem detection and
significantly aid in problem cause/effect searching.
The eventdb system will likely support many interfaces:
- syslog - hosts can log directly to eventdb
- command line - scriptably/manually push data to the eventdb
- generic numeric data input - a lame frontend to rrdtool, perhaps
Thus, all data would be pushed through eventdb which would figure out
which on-disk data medium to store it in. Queries could be done asking
eventdb about things such as "show me yesterday's mysql activity around
3am" or "compare average syscall usage across this week and last week"
This sort of trend and novelty mapping would be extremely useful in a
production software environment to compare configuration or software
changes. That is, last month's syscall averages might be much lower than
this months - and perhaps the only change was a configuration file
change or new software being pushed to production. You would be able to
track the problem back to when it first showed up - hopefully
correllating to some change that was known about. After all, the first
step in solving a problem is knowing of its existence.
My experience with data analysis techniques is not extensive. So I
wouldn't expect the data analysis tools in the prototype to sport
anything fancy.
I need more hours in a day! Oh well, back to doing homework. Hopefully
I'll have some time to prototype something soon.
Comments: 0 (view comments)
Tags: eventdb, oh-my-god-it's-4am, ideas
Permalink: /geekery/eventdb-ideas
posted at: 04:11
|
Search this site
Navigation
Metadata
Home
About
Resume
My Code
ARP Security
Dynamic DNS with DHCP
OpenLDAP+Kerberos+SASL
PPP over SSH
SSH Security: /bin/false
Week of Unix Tools
Work Efficiency
fex
firefox tabsearch
firefox urledit
grok
keynav
liboverride
newpsm (FreeBSD)
nis2ldap
pam_captcha
poor man's backup
Solaris audio utility
xboxproxy
xdotool
xmlpresenter
xpathtool
misc scripts
Presentations
Yahoo! Hack Day '06
Unix Essentials
Vi/Vim Essentials
Tag Cloud
Calendar
| < |
May 2006 |
> |
| | 1 | 2 | 3 | 4 | 5 | 6 |
| 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| 28 | 29 | 30 | 31 | | | |
Friends
BarCamp
Kent Brewster
Tantek Çelik
John Resig
Wesley Shields
Tyler Shields
Technorati
|