I finished up work on the pattern discovery feature for the C++ port of grok. As
soon as it was finished, I wanted to see the dpeed differences between the perl
and C++ versions.
- Perl grok: 6 lines analyzed per second
- C++ grok: 130 lines analyzed per second
The feature tested here was the one detailed in this post.
130 lines per second isn't fantastic, but it's 21.66 times faster than the perl
version, and that's huge.
I still have to implement a few other features to make the C++ version
equivalent to the perl version:
- config file (same format, ideally, as the perl version)
- filters, like %SYSLOGDATE|parsedate%
Comments: 1 (view comments)
Tags: grok, c++grok, C++, perl, performance
Permalink: /geekery/cgrok-vs-grok-pattern-discovery
posted at: 19:07
I just finished implementing predicates in c++grok (tentative name) and wanted
to compare the performance against perl grok.
An input of 50000 lines of apache logfile amounting to 9.7megs of data.
I initially attempted this using the regex predicate %IP~/^129% but I realized
that perl grok compiles the predicate regex every time it is executed, and
wasn't a fair test. So I switched to %IP>=129% instead, which converts the
match to an integer first (so 129.21.60.9 turns into 129, for example), which seems like more equal ground based on the implementations in both perl and C++.
# C++ Grok
% /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps
# Perl Grok
% /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17721minor)pagefaults 0swaps
What still remains consistent is the trend that the more complexity I add in C++ equates to a greater margin of speed from the perl version.
- Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
- Using predicates shows the C++ version running 10 times faster.
I still need to write test cases for the C++ version in addition to porting the
pattern discovery
portion from perl.
Exciting :)
Comments: 0 (view comments)
Tags: grok, perl, C++, benchmark, performance
Permalink: /geekery/grok-predicates-perl-vs-cplusplus
posted at: 02:31
The benchmark for testing speed improvements was to process 3000 lines for
'%SYSLOGBASE .*? %IPORHOST%'. Profiler was Devel::Profile.
Before:
time running program: 11.9738 (85.07%)
number of calls: 409362
%Time Sec. #calls sec/call F name
64.52 7.6336 2870 0.002660 main::meta2string
10.00 1.1838 5740 0.000206 main::filter
8.88 1.0507 3000 0.000350 main::handle
3.14 0.3710 54828 0.000007 main::handle_capture
0.77 0.0908 2870 0.000032 main::react
After
time running program: 2.5216 (82.73%)
number of calls: 105152
%Time Sec. #calls sec/call F name
40.56 1.0228 3000 0.000341 main::handle
15.22 0.3838 54828 0.000007 main::handle_capture
4.47 0.1128 2870 0.000039 main::meta2string
3.31 0.0834 2870 0.000029 main::react
2.61 0.0658 14747 0.000004 main::debug
1.81 0.0456 237 0.000192 main::readlog
Primary changes were mostly to pregenerate a few regular expressions.
Previously, I was generating the same regex every time filter() or
meta2string() was being called. These small changes gave grok a serious boost
in speed: what was taking 12 seconds now takes 2.5 seconds.
One example of a simple optimization is this:
before: my $re = join("|", map { qr/(?:$_)/ } keys(%$filters));
after: our $re ||= join("|", map { qr/(?:$_)/ } keys(%$filters));
This may decrease readability, but it only sets $re once no matter how many
times that line of code is executed. This line of code change was in the
filter() function. Just doing this one simple change reduced the runtime of
each filter() call by 97%; which reduces its runtime to something trivially
small by comparison.
At this point, this is where I was very happy to have written tests for grok.
To verify that grok still behaved properly after making these speed
improvements, I just ran the tests suite. It passed, giving me some confidence
in the stability of the changes.
Comments: 0 (view comments)
Tags: grok, profiling, perl
Permalink: /geekery/grok-speed-improvements
posted at: 03:27
Hop on over to the grok project page and
download the new version.
The changelist from the previous announced release is as follows:
Changes for 20070915:
* Added 'grok_patfind.pl' which adds 'grok -F' functionality. Read about it
here:
http://www.semicomplete.com/blog/geekery/grok-pattern-autodiscovery.html
* Proper shutdown is called to kill the "hack" subprocess
* Add 'shell' option to 'type' sections; not currently used.
* Warn if we're trying to override an existing filter.
* Added more perlthoughts to t/theory/
* Numerous pattern regex changes:
- NUMBER no longer uses Regexp::Common's real number regex since that one
matches '.' and my thoughts are that '.' is not a number
- Added POSITIVENUM pattern
- Fix HOSTNAME to match hostnames starting with a number (Again,
Regexp::Common fails me)
- Add path patterns: UNIXPATH, WINPATH, URI
- MONTH no longer matches 0 through 12
- DAY no longer matches 0 through 6
- SYSLOGPROG is more specific now, since valid prog names can have dashes.
Comments: 0 (view comments)
Tags: grok, perl, patterns
Permalink: /geekery/grok-20070915
posted at: 04:13
My todo list has had "grok - magical 'pattern finder' thing" since May of this
year. I added it after I wrote up some thoughts on pattern matching, string
sequencing, and automation.
I spent many hours on that problem tonight.
Initially, I wrote a python script which would compare each line in a file
against every other line in the file. Using difflib and string split, I could
figure out what words were changed between two lines. If the lines were
similar, I could compute the difference and replace the differences with a
token placeholder, such as "WORD".
Download this python script
Here's a sample output of the tool where it compares one line against another, computes the difference and makes a new string that will match both lines:
0.91: error: PAM: authentication error for illegal user karinebeers
: error: PAM: authentication error for illegal user leshana
==> error: PAM: authentication error for illegal user %WORD%
This script is fairly primitive in execution. It only compares whole tokens,
which are delimited by space. This was good, but not enough. It doesn't know
about compound patterns such as quoted strings, or complex patterns such as
those matching an IP address or a file path.
How do we consider complex patterns like hostnames and quoted strings? Turns
out, most of the pattern matching wizardry is in grok. Grok knows about named
patterns, and we can abuse this to use grok in a different way - instead of
parsing text, we're going to use it to turn input text into a grok pattern.
Example:
% echo '/foo/bar/baz "foo bar baz" hello ther 1234 www.google.com' \
| perl grok -F
%UNIXPATH% %QS% hello ther 1234 %IPORHOST%
What did it do?
input: /foo/bar/baz "foo bar baz" hello ther 1234 www.google.com
output: %UNIXPATH% %QS% hello ther 1234 %IPORHOST%
Using a new hack on top of grok, we can now turn an unknown plaintext input
into a pattern that is reusable and human-meaningful. This is totally awesome.
This hack only considers complex tokens for simplicity's sake; that is, tokens
only containing letters and numbers are ignored. Why? Is 'foo1234' a word or a
hostname? Is 1997 a number or a year? Grok allows you to make these
distinctions, but I skip simple tokens so I don't have to programatically
weight random patterns. Note the above example, where '1234' was not replaced
with '%NUMBER%' or something similar.
So, I run this new grok hack on my /var/log/messages file. Here's a sampling of
output from grok:
# grok -F < /var/log/messages
%SYSLOGBASE% logfile turned over due to size>100K
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% error: PAM: authentication error for testuser from %IPORHOST%
%SYSLOGBASE% Limiting closed port RST response from 380 to 200 packets/sec
%SYSLOGBASE% error: PAM: authentication error for admin from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for pants from %IPORHOST%
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% kernel time sync enabled 2001
This is pretty close to a reusable pattern that captures all the changing data.
You'll notice there are commonalities among certain lines such as the
'authentication error' lines where only the username is different. Let's run
the 'authentication error' lines only through the python script and see what
happens:
# messages.patterns contains the output of 'grok -F'
% grep 'authentication error' messages.patterns > testinput
% python distance_sort.py testinput | sort | uniq
%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for illegal user %WORD% from %IPORHOST%
Wow, that's exactly what I wanted out of this process. Now we have some
useful patterns, generated with almost zero effort. Let's apply this to our log
file and see what we can get. How about counting, by user, failed login
attempts?
pattern="%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%"
./grok -m "$pattern" -r "%WORD%" < messages \
| sort | uniq -c | sort -n | tail -5
5 backup
9 angel
10 pgsql
12 mail
17 ident
So what have we really accomplished?
You can take any log file and run it through this process. The output of this
process is a set of patterns you can use to query data in that log file (and
others like it). The real power here is not that patterns are generated, it's
that named, queryable patterns are generated. That's awesome.
Comments: 0 (view comments)
Tags: grok, perl, regexp ninja
Permalink: /geekery/grok-pattern-autodiscovery
posted at: 02:23
In my last post, I
discussed a way to perform additional assertions on a given captured group
while still inside of the regex state machine.
I spent some time today and implemented it in grok and it works like a charm!
This kind of functionality gives you extreme power in the kind of matches you
can specify.
Here's an example: How can we find the first number on a line that's greater
than 20?
% jot 50 | xargs -n15
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
46 47 48 49 50
Using the above as our sample input:
% jot 50 | xargs -n15 | perl grok -m "%NUMBER>20%" -r "%NUMBER%"
21
31
46
The assertion '>20' is applied during the regular expression match. This is
sweet!
Another example would be to 'grep' the output of 'ls -l' for any lines
containing a number greater than 5000:
% ls -l | perl grok -m "%INT>5000%"
-rw-r--r-- 1 jls jls 21590 Apr 2 20:17 foo
-rw-r--r-- 1 jls jls 22451 Apr 2 23:52 grok
-rw-r--r-- 1 jls jls 129536 Apr 2 23:47 grok-20070402.tar.gz
-rw-r--r-- 1 jls jls 13539 Mar 22 14:22 grok.1
The equivalent awk statement would look like this:
% ls -l | awk '$5 > 5000'
But if you look at the awk statement, you have to count columns. With grok's
solution, all you need to know is you want any line with an integer greater
than 5000. You are able to specify the specifics of your match without having
to know the precise layout of the data you are matching.
What do you get if you can chain predicates? I haven't added that functionality
yet, but it would be trivial to add, so expect it soon.
If you're interested, please try the latest version.
Comments: 0 (view comments)
Tags: grok, perl, hacks
Permalink: /geekery/grok-gets-more-ridiculous
posted at: 01:05
Yesterday, I participated in a 12-hour coding-binge competition. It started at
7pm Friday night and ran until 7am Saturday morning. It was fueled by Computer
Science House and Bawls, both sponsors of the event. Needless to say, I haven't
gotten much sleep today.
The competition website is here. Go there if you
want to view this year's objectives.
The Dream Team consisted of John Resig, Darrin
Mann, Matt Bruce, and myself. Darrin, Resig, and I are all quite proficient at
web development, so we decided this year we would represent ourselves as "Team
JavaScript" - and do everything possible in javascript. Bruce is not a
programmer, but I enlisted his graphical art skills because I figured with our
team doing some web-based project, we definitely needed an artist.
After reviewing all the objectives, we came up with a significant modification
upon the Sudoku objective. The sudoku objective was a problem that lacked much
room for innovation, so we went further and instead of solving Sudoku, wrote a
web-based version of an extremely popular game in Second Life. The contest
organizer approved of our new objective, so we did just that.
Resig worked on game logic, I worked on chat features, Darrin worked on scoring
and game generation, and Bruce worked on the interface graphics. Becuase our
tasks were all mostly unrelated, we could develop them independently. Most of
the game was completed in about 6 hours, and the remainder of the time was
spent fixing bugs, refactoring, and some minor redesign.
The backends were minimal. The chat backend was only 70 lines of perl, and the
score backend was 9 lines of /bin/sh. Everything else was handled in the
browser. We leveraged Resig's jQuery to make development faster. Development
went extremely smooth, a testament to the "Dream Team"-nature of our team,
perhaps? ;)
The game worked by presenting everyone with the same game - so you can compete
for the highest score. You could also chat during and between games, if you
wanted to.
A screenshot can be found here. At the end of the competition, we only had one
known bug left. That bug didn't affect gameplay, and we were all tired, so it
didn't get fixed. There were a few other issues that remained unresolved that
may or may not be related to our code. Firefox was having issues with various
things we were doing, and we couldn't tell if it was our fault or not.
Despite the fact that I probably shouldn't have attended the competition due to
scholastic time constraints, I was glad I went. We had a blast writing the game.
We may get some time in the near future to improve the codebase and put it up
online so anyone can play. There are quite a few important features that need to
be added before it'll be useful as a public game.
Comments: 1 (view comments)
Tags: nosleep, perl, javascript, web2.0, jquery, shell, web, xml, codebinge
Permalink: /geekery/bawls-competition-tringo
posted at: 19:11
I need to relearn rrdtool, again, for this sysadmin time machine project.
Today's efforts were spent testing for features I hoped were in RRDTool. So
far, my feature needs are met :)
Take something simple, like webserver logs. Let's graph the hits.
Create the RRD:
rrdtool create webhits.rrd --start 1128626000 -s 60 \
DS:hits:GAUGE:120:0:U RRA:AVERAGE:.5:5:600000 \
RRA:AVERAGE:.5:30:602938 RRA:AVERAGE:.5:60:301469 \
RRA:AVERAGE:.5:240:75367 RRA:AVERAGE:.5:1440:12561
My logs start *way* back in November of last year, so I create the rrd with a
start date of sometime in Novemeber. The step is 60, so it expects data every
minute. I then specify one data type, hits, which is a gaugue (rate), and
ranges from 0 to infinity (U). The rest of the command is RRA's defining how
data is stored. The first one says take 5 samples and average them, and store
600,000 of these samples, at a maximum.
Now that we have the database, we need a "hits-per-minute" data set. I wrote a
short perl script, parsehttp that will
read from standard input and calculate hits-per-minute and output rrdtool
update statements. Capture this output and run it through sh:
./parsehttp < access.log | sh -x
Simple enough. This will calculate hits-per-minute for all times in the logs
and store it in our RRD.
Now that we have the data, we can graph it. However, since I want to view
trends and compare time periods, I'll need to do something fancier than simple
graphs.
RRDTool lets you graph multiple data sets on the same graph. So, I want to
graph this week's hits and last week's hits. However, since the data sets are
on different time intervals, I need to shift last week's set forward by one
week. Here's the rrdtool command that graphs it for us, with last week's and
this week's data on the same graph, displayed at the same time period:
rrdtool graph webhits.png -s "-1 week" \
DEF:hits=webhits.rrd:hits:AVERAGE \
DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
SHIFT:lastweek:604800 \
LINE1:lastweek#00FF00:"last week" LINE1:hits#FF0000:"this week"
That'll look like line noise if you've never used RRDTool before. I define two
data sets with DEF: hits and lastweek. They both read from the 'hits' data set
in webhits.rrd. One starts at "-1 week" (one week ago, duh) and the other
starts 2 weeks ago and ends last week. I then shift last week's data forward by
7 days (604800 seconds). Lastly, I draw two lines, one for last weeks (green),
the other for this weeks (red).
That graph looks like this:
That's not really useful, becuase there's so many data points the graph almost
becomes meaningless. This is due to my poor creation of RRAs. We can fix that
by redoing the database, or using the TREND feature.
Change our graph statement to be:
rrdtool graph webhits.png -s "-1 week" \
DEF:hits=webhits.rrd:hits:AVERAGE \
DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
SHIFT:lastweek:604800 \
CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
LINE1:lastweek#CCFFCC:"last week" LINE1:hits#FFCCCC:"this week" \
LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
I added only two CDEF statements. They take a data set and "trend" it by one
day (86400 seconds). This creates a sliding average across time. I store these
in new data sets called t_hits and t_lastweek and graph those aswell.
The new graph looks like this:
You'll notice the slide values are chopped off on the left, that's becuase it
doesn't have enough data points at those time periods to make an average.
However, including the raw data makes the graph scale as it did before, making
viewing the trend difference awkward. So, let's fix it by not graphing the raw
data. Just cut out the LINE1:lastweek and LINE1:hits options.
Fixing the sliding average cutoff, add a title, and a vertical label:
rrdtool graph webhits.png -s "-1 week" \
-t "Web Server Hits - This week vs Last week" \
-v "hits/minute" \
DEF:hits=webhits.rrd:hits:AVERAGE:start="-8 days":end="start + 8 days" \
DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-15 days":end="start + 8 days" \
SHIFT:lastweek:604800 \
CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week" \
The graph is still from one week ago until now, but our data sets used extend
beyond those boundaries, so that sliding averages can be calculated throughout.
The new, final graph, looks like this:
Now I can compare this week's hits against last weeks, quickly with a nice
visual. This is what I'm looking for.
This would become truely useful if we had lots of time periods (days, weeks,
whatever) to look at. Then we could calculate standard deviation, etc. A high
outlier could be marked automatically with a label, giving an instant visual
cue that something is potentially novel. It might be simple to create a sort-of
sliding "standard deviation" curve. I haven't tried that yet.
Comments: 3 (view comments)
Tags: rrdtool, statistics, perl, graphs, logs, sysadmin
Permalink: /geekery/rrdtool-for-the-win
posted at: 00:57
I finally managed to find time today to work on my events database project. In
the processes of doing so, I found a few bugs in grok that needed to get fixed.
Some of my regular expressions were being a bit greedy, so certain pattern
expansion was breaking.
To summarize the event recording system, it is a webserver listening for event
publish requests. It accepts the "when" "where" and "what" of an event, and
stores it in a database.
To have my logs pushed to the database, I'll leverage the awesome power of
Grok. Before I do that, I gathered all of the auth.log files and archives and
compiled them into their respective files.
The grok.conf for this particular maneuver:
exec "cat ./logs/nightfall.auth.log ./logs/sparks.auth.log ./logs/whitefox.auth.log" {
type "all syslog" {
match_syslog = 1;
reaction = 'fetch -qo - "http://localhost:8080/?when=%SYSLOGDATE|parsedate%&where=%HOST%/%PROG|urlescape|shdq%&what=%DATA:GLOB|urlescape|shdq%"';
};
};
This is farily simple. I added a new standard filter, 'urlescape' to grok
becuase I needed it. it will url escape a data piece. Hurray!
Run grok, and it sends event notifications to the webserver for every
syslog-matching line. Using FreeBSD's command-line web client, fetch.
sqlite> select count(*) from events;
8085
Now, let's look for something meaningful. I want to know what happened on all sshd services between 1am and 3am this morning (Today, May 3rd):
nightfall(~/projects/eventdb) % date -j 05030100 +%s
1146632400
nightfall(~/projects/eventdb) % date -j 05030400 +%s
1146643200
Now I know the Unix epoch times for May 3rd at 1am and 4am.
sqlite> select count(*) from events where time >= 1146632400
...> and time <= 1146643200 and location like "%/sshd"
...> and data like "Invalid user%";
2465
This query is instant. Much faster than doing 'grep -c' on N log files across M
machines. I don't care how good your grep-fu is, you aren't going to be
faster.This speed feature is only the beginning. Think broader terms. Nearly
instantly zoom to any point in time to view "events" on a system or set of
systems. Filter out particular events by keyword or pattern. Look for the last
time a service was restarted. I could go on, but you probably get the idea.
It's grep, but faster, and with more features.
As far as the protocol and implementation goes, I'm not sure how well this
web-based concept is going to prevail. At this point, I am not interested in
protocol or database efficiency. The prototype implementation is good enough.
From what I've read about Splunk in the past months in the form of
advertisements and such, it seems I already have the main feature Splunk has:
searching logs easily. Perhaps I should incorporate and sell my own,
better-than-Splunk, product? ;)
Bear in mind that I have no idea what Splunk actually does beyond what I've
gleaned from advertisements for the product. I'm sure it's atleast somewhat
useful, or no one would invest.
Certainly, a pipelined HTTP client could perform this much faster than doing
10000 individual http requests. A step further would be having the web server
accept any number of events per page request. The big test is going to see how
well HTTP scales, but that can be played with later.
At this point, we have come fairly close to the general idea of this project:
Allowing you to zoom to particular locations in time and view system events.
The server code for doing this was very easy. I chose Python and started
playing with CherryPy (a webserver framework). I had a working event reciever
server in about 30 minutes. 29 minutes of that time was spent writing a
threadsafe database class to front for pysqlite. The CherryPy bits only amount
to about 10 lines of code, out of 90ish.
The code to do the server can be found here:
/scripts/cherrycollector.py
Comments: 0 (view comments)
Tags: grok, eventdb, sysadmin, perl, python, projects, oh-my-god-it's-4am
Permalink: /geekery/grok-and-eventdb
posted at: 03:22
|
Search this site
Navigation
Metadata
Home
About
Resume
My Code (SVN Web)
ARP Security
Dynamic DNS with DHCP
OpenLDAP+Kerberos+SASL
PPP over SSH
SSH Security: /bin/false
Week of Unix Tools
Work Efficiency
fex
firefox tabsearch
firefox urledit
grok
keynav
liboverride
newpsm (FreeBSD)
nis2ldap
pam_captcha
poor man's backup
Solaris audio utility
xboxproxy
xdotool
xmlpresenter
xpathtool
misc scripts
Presentations
Yahoo! Hack Day '06
Unix Essentials
Vi/Vim Essentials
Tag Cloud
Calendar
Friends
BarCamp
Kent Brewster
Tantek Çelik
John Resig
Wesley Shields
Tyler Shields
Technorati
|