Search this site

Page 1 of 4  [next]





Flashback 2004: Too much about perl and regexp

Continuing the flashback series. Here's 2004.

Still in school. Hacking on countless projects in perl.

While writing a command-line interpreter, I wanted you to be able to type 'st' or 'sta' instead of 'status' and have the code find a best match for your command. This is often called 'ambiguous commands' or somesuch. I came up with a clever way to do this using regular expressions (May 2004).

I also posted about using regular expression to match quoted and nonquoted strings (June 2004).

Finally, I experiment with using regular expressions to find word boundaries near the cursor for my command line interpreter (Oct 2004).

Are you drowning in perl and regular expressions, yet? ;)

Terminals, titles, and prompts.

Drew Stephens spent some time on Christmas to share some of his shell configuration, including different ways he uses prompts and colors.

I'll start with prompts.

I use zsh. My prompt looks like this:

# Plain
snack(~) % 

# Long directory is truncated at the left
snack(...jects/grok/ruby/test/general) % 

# I get exit status only if it is nonzero:
snack(~) % true
snack(~) % false
snack(~) !1! % 

# if I am root, and using zsh, the '%' becomes '#'
snack(~) # 
This is all achieved with the following PS1 in zsh:
PS1='%m(%35<...<%~) %(?..!%?! )%# '
We have configurable prompts to give us all kinds of information, why? It's a place to gather context from. I include host, directory, exit status, and an "am i root" flag.

PS1 isn't the only place you can store useful state. I like to have similar information in my terminal's titlebar, too. I use screen and xterm, and both can be fed some delicious data.

I use this in my .screenrc, which tells screen to have some default status format and tells screen how to change xterm's title. I have it include the screen window number (%n), hostname (%h), and terminal title (%t):

hardstatus string "[%n] %h - %t"
termcapinfo xterm 'hs:ts=\E]2;:fs=\007:ds=\E]2;screen (not title yet)\007'
windowlist title "Num Name%=Location Flags"
windowlist string "%03n %t%=%h %f"
I also use this bit of .vimrc, which tells vim what kind of title I want, and if the $TERM is screen, how to tell screen about it.
" Set title string and push it to xterm/screen window title
set titlestring=vim\ %<%F%(\ %)%m%h%w%=%l/%L-%P
set titlelen=70
if &term == "screen"
  set t_ts=^[k
  set t_fs=^[\
if &term == "screen" || &term == "xterm" 
  set title
And then use this bit of my zshrc.

All of these combined together make for some pretty good terminal and screen titles. The functions preexec, precmd, and title, mentioned below, come from the above zshrc link.

The preexec function in my zshrc runs before each command execution and allows me to change the terminal title to reflect the command I am running. It also supports resumed execution of a process: if you run 'cat', then hit ^Z, then type 'fg', the title will correctly be set to 'cat' again.

The precmd function runs before each prompt. Rather than cluttering up $PS1 with byte strings to set the title, I just make precmd set the title to 'zsh - $PWD'.

The title function takes care of any necessary escaping and also does nice things like string truncation if it is too long (similar to how my $PS1 is configured).

I only use vim's titlestring options because it gives me some better context on what I am doing in vim at the time, mainly because vim allows you to edit multiple files at once.

Here's an example of a few screen windows in a single screen session when viewed in the windowlist:

The first 3 columns are most meaningful: number, name, and location. Note that each location correctly identifies the host that shell is using. My zshrc 'title' function manages setting the name and the location.

The same data listed above is combined into the actual terminal's title. Window 2 above would have this title in xterm:

[2] jls - zsh - /home/jsissel

I mentioned above that I use screen and xterm together. I do this for everything using This script will run screen in an xterm with a randomly chosen, dark color background. I find the dark-random color selection quite a nice deviation from the solid-black my desktop used to bear. Here's what it looks like if I run a 20+ xterms on a blank desktop:

c++ grok vs perl grok on pattern discovery

I finished up work on the pattern discovery feature for the C++ port of grok. As soon as it was finished, I wanted to see the dpeed differences between the perl and C++ versions.

  • Perl grok: 6 lines analyzed per second
  • C++ grok: 130 lines analyzed per second
The feature tested here was the one detailed in this post.

130 lines per second isn't fantastic, but it's 21.66 times faster than the perl version, and that's huge.

I still have to implement a few other features to make the C++ version equivalent to the perl version:

  • config file (same format, ideally, as the perl version)
  • filters, like %SYSLOGDATE|parsedate%

Grok predicates - Perl vs C++

I just finished implementing predicates in c++grok (tentative name) and wanted to compare the performance against perl grok.

An input of 50000 lines of apache logfile amounting to 9.7megs of data.

I initially attempted this using the regex predicate %IP~/^129% but I realized that perl grok compiles the predicate regex every time it is executed, and wasn't a fair test. So I switched to %IP>=129% instead, which converts the match to an integer first (so turns into 129, for example), which seems like more equal ground based on the implementations in both perl and C++.

# C++ Grok
% /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps

# Perl Grok
% /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17721minor)pagefaults 0swaps
What still remains consistent is the trend that the more complexity I add in C++ equates to a greater margin of speed from the perl version.
  • Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
  • Using predicates shows the C++ version running 10 times faster.
I still need to write test cases for the C++ version in addition to porting the pattern discovery portion from perl.

Exciting :)

Grok speed improvements

The benchmark for testing speed improvements was to process 3000 lines for '%SYSLOGBASE .*? %IPORHOST%'. Profiler was Devel::Profile.


time running program:  11.9738  (85.07%)
number of calls:       409362

%Time    Sec.     #calls   sec/call  F  name
64.52    7.6336     2870   0.002660     main::meta2string
10.00    1.1838     5740   0.000206     main::filter
 8.88    1.0507     3000   0.000350     main::handle
 3.14    0.3710    54828   0.000007     main::handle_capture
 0.77    0.0908     2870   0.000032     main::react
time running program:  2.5216  (82.73%)
number of calls:       105152

%Time    Sec.     #calls   sec/call  F  name
40.56    1.0228     3000   0.000341     main::handle
15.22    0.3838    54828   0.000007     main::handle_capture
 4.47    0.1128     2870   0.000039     main::meta2string
 3.31    0.0834     2870   0.000029     main::react
 2.61    0.0658    14747   0.000004     main::debug
 1.81    0.0456      237   0.000192     main::readlog
Primary changes were mostly to pregenerate a few regular expressions. Previously, I was generating the same regex every time filter() or meta2string() was being called. These small changes gave grok a serious boost in speed: what was taking 12 seconds now takes 2.5 seconds.

One example of a simple optimization is this:

before: my $re = join("|", map { qr/(?:$_)/ } keys(%$filters));
after:  our $re ||= join("|", map { qr/(?:$_)/ } keys(%$filters));
This may decrease readability, but it only sets $re once no matter how many times that line of code is executed. This line of code change was in the filter() function. Just doing this one simple change reduced the runtime of each filter() call by 97%; which reduces its runtime to something trivially small by comparison.

At this point, this is where I was very happy to have written tests for grok. To verify that grok still behaved properly after making these speed improvements, I just ran the tests suite. It passed, giving me some confidence in the stability of the changes.

new grok version available (20070915)

Hop on over to the grok project page and download the new version.

The changelist from the previous announced release is as follows:

Changes for 20070915:

  * Added '' which adds 'grok -F' functionality. Read about it

  * Proper shutdown is called to kill the "hack" subprocess
  * Add 'shell' option to 'type' sections; not currently used.
  * Warn if we're trying to override an existing filter.
  * Added more perlthoughts to t/theory/

  * Numerous pattern regex changes:
    - NUMBER no longer uses Regexp::Common's real number regex since that one
      matches '.' and my thoughts are that '.' is not a number
    - Added POSITIVENUM pattern
    - Fix HOSTNAME to match hostnames starting with a number (Again,
      Regexp::Common fails me)
    - Add path patterns: UNIXPATH, WINPATH, URI
    - MONTH no longer matches 0 through 12
    - DAY no longer matches 0 through 6
    - SYSLOGPROG is more specific now, since valid prog names can have dashes.

Grok and automatic log pattern discovery

My todo list has had "grok - magical 'pattern finder' thing" since May of this year. I added it after I wrote up some thoughts on pattern matching, string sequencing, and automation.

I spent many hours on that problem tonight.

Initially, I wrote a python script which would compare each line in a file against every other line in the file. Using difflib and string split, I could figure out what words were changed between two lines. If the lines were similar, I could compute the difference and replace the differences with a token placeholder, such as "WORD".

Download this python script

Here's a sample output of the tool where it compares one line against another, computes the difference and makes a new string that will match both lines:

0.91: error: PAM: authentication error for illegal user karinebeers
    : error: PAM: authentication error for illegal user leshana
  ==> error: PAM: authentication error for illegal user %WORD%
This script is fairly primitive in execution. It only compares whole tokens, which are delimited by space. This was good, but not enough. It doesn't know about compound patterns such as quoted strings, or complex patterns such as those matching an IP address or a file path.

How do we consider complex patterns like hostnames and quoted strings? Turns out, most of the pattern matching wizardry is in grok. Grok knows about named patterns, and we can abuse this to use grok in a different way - instead of parsing text, we're going to use it to turn input text into a grok pattern.


% echo '/foo/bar/baz "foo bar baz" hello ther 1234' \
| perl grok -F 
%UNIXPATH% %QS% hello ther 1234 %IPORHOST%
What did it do?
input:  /foo/bar/baz "foo bar baz" hello ther 1234
output: %UNIXPATH%   %QS%          hello ther 1234 %IPORHOST%
Using a new hack on top of grok, we can now turn an unknown plaintext input into a pattern that is reusable and human-meaningful. This is totally awesome.

This hack only considers complex tokens for simplicity's sake; that is, tokens only containing letters and numbers are ignored. Why? Is 'foo1234' a word or a hostname? Is 1997 a number or a year? Grok allows you to make these distinctions, but I skip simple tokens so I don't have to programatically weight random patterns. Note the above example, where '1234' was not replaced with '%NUMBER%' or something similar.

So, I run this new grok hack on my /var/log/messages file. Here's a sampling of output from grok:

# grok -F < /var/log/messages
%SYSLOGBASE% logfile turned over due to size>100K
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% error: PAM: authentication error for testuser from %IPORHOST%
%SYSLOGBASE% Limiting closed port RST response from 380 to 200 packets/sec
%SYSLOGBASE% error: PAM: authentication error for admin from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for pants from %IPORHOST%
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% kernel time sync enabled 2001
This is pretty close to a reusable pattern that captures all the changing data. You'll notice there are commonalities among certain lines such as the 'authentication error' lines where only the username is different. Let's run the 'authentication error' lines only through the python script and see what happens:
# messages.patterns contains the output of 'grok -F'
% grep 'authentication error' messages.patterns > testinput
% python  testinput | sort | uniq
%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for illegal user %WORD% from %IPORHOST%
Wow, that's exactly what I wanted out of this process. Now we have some useful patterns, generated with almost zero effort. Let's apply this to our log file and see what we can get. How about counting, by user, failed login attempts?
pattern="%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%"
./grok -m "$pattern" -r "%WORD%" < messages \
| sort | uniq -c | sort -n | tail -5
   5 backup
   9 angel
  10 pgsql
  12 mail
  17 ident

So what have we really accomplished?

You can take any log file and run it through this process. The output of this process is a set of patterns you can use to query data in that log file (and others like it). The real power here is not that patterns are generated, it's that named, queryable patterns are generated. That's awesome.

A letter to Ruby and Python.

Dear Ruby and Python,

Please implement Perl's (?{}) and (??{}) in your regular expression engines so I can do outrageous state machine and pattern matching in your languages. Thank you.


Grok gets more ridiculous

In my last post, I discussed a way to perform additional assertions on a given captured group while still inside of the regex state machine.

I spent some time today and implemented it in grok and it works like a charm! This kind of functionality gives you extreme power in the kind of matches you can specify.

Here's an example: How can we find the first number on a line that's greater than 20?

% jot 50 | xargs -n15
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
46 47 48 49 50
Using the above as our sample input:
% jot 50 | xargs -n15 | perl grok -m "%NUMBER>20%" -r "%NUMBER%"
The assertion '>20' is applied during the regular expression match. This is sweet!

Another example would be to 'grep' the output of 'ls -l' for any lines containing a number greater than 5000:

% ls -l | perl grok -m "%INT>5000%"
-rw-r--r--  1 jls  jls   21590 Apr  2 20:17 foo
-rw-r--r--  1 jls  jls   22451 Apr  2 23:52 grok
-rw-r--r--  1 jls  jls  129536 Apr  2 23:47 grok-20070402.tar.gz
-rw-r--r--  1 jls  jls   13539 Mar 22 14:22 grok.1

The equivalent awk statement would look like this:

% ls -l | awk '$5 > 5000'
But if you look at the awk statement, you have to count columns. With grok's solution, all you need to know is you want any line with an integer greater than 5000. You are able to specify the specifics of your match without having to know the precise layout of the data you are matching.

What do you get if you can chain predicates? I haven't added that functionality yet, but it would be trivial to add, so expect it soon.

If you're interested, please try the latest version.

The CSH Bawls Programming Competition

Yesterday, I participated in a 12-hour coding-binge competition. It started at 7pm Friday night and ran until 7am Saturday morning. It was fueled by Computer Science House and Bawls, both sponsors of the event. Needless to say, I haven't gotten much sleep today.

The competition website is here. Go there if you want to view this year's objectives.

The Dream Team consisted of John Resig, Darrin Mann, Matt Bruce, and myself. Darrin, Resig, and I are all quite proficient at web development, so we decided this year we would represent ourselves as "Team JavaScript" - and do everything possible in javascript. Bruce is not a programmer, but I enlisted his graphical art skills because I figured with our team doing some web-based project, we definitely needed an artist.

After reviewing all the objectives, we came up with a significant modification upon the Sudoku objective. The sudoku objective was a problem that lacked much room for innovation, so we went further and instead of solving Sudoku, wrote a web-based version of an extremely popular game in Second Life. The contest organizer approved of our new objective, so we did just that.

Resig worked on game logic, I worked on chat features, Darrin worked on scoring and game generation, and Bruce worked on the interface graphics. Becuase our tasks were all mostly unrelated, we could develop them independently. Most of the game was completed in about 6 hours, and the remainder of the time was spent fixing bugs, refactoring, and some minor redesign.

The backends were minimal. The chat backend was only 70 lines of perl, and the score backend was 9 lines of /bin/sh. Everything else was handled in the browser. We leveraged Resig's jQuery to make development faster. Development went extremely smooth, a testament to the "Dream Team"-nature of our team, perhaps? ;)

The game worked by presenting everyone with the same game - so you can compete for the highest score. You could also chat during and between games, if you wanted to.

A screenshot can be found here. At the end of the competition, we only had one known bug left. That bug didn't affect gameplay, and we were all tired, so it didn't get fixed. There were a few other issues that remained unresolved that may or may not be related to our code. Firefox was having issues with various things we were doing, and we couldn't tell if it was our fault or not.

Despite the fact that I probably shouldn't have attended the competition due to scholastic time constraints, I was glad I went. We had a blast writing the game.

We may get some time in the near future to improve the codebase and put it up online so anyone can play. There are quite a few important features that need to be added before it'll be useful as a public game.