Splitting large logs

I've started using grok more. I don't have any regular log processing to do, so any usage is random and not related to previous runs. As such, I'd really love it if grok could process faster. Rather than concentrating on the potential dead end of making grok itself faster (complex regex patterns are CPU intensive), I think it'll be easier to make it parallelizable. The first step towards parallelizing grok is being able to split logs into small pieces, quickly.

Both FreeBSD's and Linux's (GNU, really) split(1) tools are fairly slow when you want to split by line count. Splitting by byte count is (or should be) faster, but that will break logs, because a byte boundary will almost never align with a log-line boundary.

I'd prefer both: the speed of byte-size splitting and the sanity of line-based splitting. To that end, I wrote a small tool to do just that. Yes, there's probably a tool already available that does exactly what I want, but this was only 100 lines of C, so it was quick to write. GNU split's --line-bytes option is mostly what I want, but it's still crap slow.

download fastsplit
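
The idea is simple enough that a sketch shows it better than prose does. This is not the actual fastsplit source (the argument handling and output naming below are invented for the example); it's just one way to implement the trick: copy whole blocks until the current chunk reaches the byte target, then keep writing until the next newline so no line is ever cut in half.

/* Not the real fastsplit source: a minimal sketch of the idea, with
 * made-up argument handling and output naming. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK (1 << 20)              /* read 1MB at a time */

int main(int argc, char **argv) {
  if (argc != 4) {
    fprintf(stderr, "usage: %s chunk_bytes prefix input\n", argv[0]);
    return 1;
  }
  long target = atol(argv[1]);       /* approximate bytes per chunk */
  const char *prefix = argv[2];
  FILE *in = fopen(argv[3], "rb");
  if (in == NULL) { perror("fopen"); return 1; }

  static char buf[BLOCK];
  char name[4096];
  FILE *out = NULL;
  long written = 0;                  /* bytes in the current chunk */
  int part = 0;
  size_t n;

  while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
    size_t off = 0;
    while (off < n) {
      if (out == NULL) {             /* start a new chunk */
        snprintf(name, sizeof(name), "%s%05d", prefix, part++);
        out = fopen(name, "wb");
        if (out == NULL) { perror("fopen"); return 1; }
        written = 0;
      }
      if (written < target) {
        /* Below the byte target: copy as much of this block as fits. */
        size_t want = n - off;
        if (want > (size_t)(target - written))
          want = (size_t)(target - written);
        fwrite(buf + off, 1, want, out);
        written += (long)want;
        off += want;
      } else {
        /* Hit the target: finish the current line before cutting. */
        char *nl = memchr(buf + off, '\n', n - off);
        if (nl == NULL) {            /* line continues into the next block */
          fwrite(buf + off, 1, n - off, out);
          written += (long)(n - off);
          off = n;
        } else {
          size_t tail = (size_t)(nl - (buf + off)) + 1;
          fwrite(buf + off, 1, tail, out);
          fclose(out);               /* chunk ends exactly on a newline */
          out = NULL;
          off += tail;
        }
      }
    }
  }
  if (out != NULL) fclose(out);
  fclose(in);
  return 0;
}

In this sketch each chunk comes out a hair over the byte target because it finishes whatever line it's on; the real tool may land slightly under instead, but either way you get near N-sized, line-aligned pieces.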

Here's a comparison between my tool and GNU split, run on the fastest workstation I have access to. My tool runs roughly 4 times faster than GNU split for this task.

# Source file is 382 megs, which is tiny.
% du -hs access
382M    access

# (Fast) Split into approximately 20meg chunks while preserving lines.
% time ./fastsplit -b 20971520 -p ./split.fast. access 

real    0m1.260s
user    0m0.018s
sys     0m1.242s

# (GNU) Split into 87000-line chunks, no guarantee on size.
% time split -l 87000 access split.normal.

real    0m4.943s
user    0m0.395s
sys     0m2.440s

# (GNU) Split into 20mb (max) chunks, preserve lines.
% time split --line-bytes 20971520 access split.normal_bytes.

real    0m4.391s
user    0m0.001s
sys     0m1.779s
You can see that the actual 'system' time is fairly close (mine wins by about half a second against --line-bytes, more against -l), but 'real' time is much longer for Linux's split(1). My solution is really good if you want to quickly split logs for parallel processing and you don't care exactly how many lines each piece has, so much as that you get near N-sized chunks.

What's the output look like?

In order: fastsplit, GNU split -l, GNU split --line-bytes.
% wc -l split.fast.*
 86140 split.fast.00000
 81143 split.fast.00001
 92725 split.fast.00002
...
 91067 split.fast.00016
 86308 split.fast.00017
 84533 split.fast.00018
  1654604 total
% wc -l split.normal.*
 87000 split.normal.aa
 87000 split.normal.ab
 87000 split.normal.ac
...
 87000 split.normal.ar
 87000 split.normal.as
  1604 split.normal.at
  1654604 total
% wc -l split.normal_bytes.*
 85973 split.normal_bytes.aa
 80791 split.normal_bytes.ab
 92363 split.normal_bytes.ac
...
 86141 split.normal_bytes.ar
 85665 split.normal_bytes.as
  3999 split.normal_bytes.at
  1654604 total
% du -hs split.fast.*
21M     split.fast.00000
21M     split.fast.00001
21M     split.fast.00002
...
21M     split.fast.00016
21M     split.fast.00017
20M     split.fast.00018
% du -hs split.normal.*
21M     split.normal.aa
22M     split.normal.ab
19M     split.normal.ac
...
21M     split.normal.ar
21M     split.normal.as
352K    split.normal.at
% du -hs split.normal_bytes.*
21M     split.normal_bytes.aa
21M     split.normal_bytes.ab
21M     split.normal_bytes.ac
...
21M     split.normal_bytes.ar
21M     split.normal_bytes.as
896K    split.normal_bytes.at

Parallelization with /bin/sh

I have 89 log files. The average file size is 100ish megs. I want to parse all of the logs into something else useful. Processing 9.1 gigs of logs is not my idea of a good time, nor is it a good job for a single CPU. Let's parallelize it.

I abuse /bin/sh's ability to background processes and wait for children to finish. I have a script that takes a pool of available computers and sends tasks to them. These tasks are just "process this apache log", but the speed increase of parallel over single-process execution is incredible, and it's very simple to do in the shell.

The script to perform this parallelization is here: parallelize.sh

I define a list of hosts to use in the script and pass the list of logs to process on the command line. The host list is repeated until it is longer than the list of logs. I then pick a log and send it off to a server over ssh, which calls a script that writes its results to stdout. That output is captured locally in a file named with the hostname and the pid.
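
In spirit, the pattern is something like the sketch below. It is not the actual parallelize.sh: the host names and the remote processlog command are placeholders, the log is fed to the remote side on stdin, and the output files are tagged with a counter instead of the pid.

#!/bin/sh
# A simplified sketch of the pattern, not the actual parallelize.sh.
# The host names and the remote 'processlog' command are placeholders,
# and output files are tagged with a counter instead of the worker pid.

hosts="host1 host2 host3 host4"

i=0
for log in "$@"; do
  # Pick the next host round-robin; this has the same effect as
  # repeating the host list until it is longer than the list of logs.
  host=$(echo $hosts | awk -v i=$i '{ print $((i % NF) + 1) }')

  # Ship the log over ssh. The remote script writes its results to
  # stdout, which we capture locally, one output file per job.
  ssh "$host" processlog < "$log" > "output.$host.$i" &

  i=$((i + 1))
done

# Every job runs in the background; wait blocks until all of the
# children have finished.
wait

All of the parallelism comes from '&' and wait; ssh and the filesystem do the rest.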

I didn't run it single-process to completion to compare total running times, but the parallel run gets *much* farther in 10 minutes than a single process does. Sweet :)

Some of the log files are *enormous*, taking up a gig each. I'm experimenting with split(1) to break these into 100,000-line pieces. The problem right now is that every task finishes except the 4 processes stuck with the four 1-gig log files. Splitting makes the individual jobs smaller, and a more even work load across processes means we finish sooner.
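
Something along these lines (file names invented for the example) chops the big logs up before they're handed out:

# Break each ~1 gig log into 100,000-line pieces so that no single
# worker gets stuck with an entire giant file.
for log in huge1.log huge2.log huge3.log huge4.log; do
  split -l 100000 "$log" "$log.split."
done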

So, a simple application benefiting from parallelization is solved by using simple, standard tools. Sexy.