This was all done as a fun experiment to see if automating the efnet captcha
was doable.
A few (all? many?) efnet servers use a figlet captcha on irc clients connecting
from hosts that aren't running identd. While this blends happily with the same
kind of captcha I put into pam_captcha, it's too easy to break.
Specifically, it uses 6 characters, A-Z. Generating a lookup table is as easy
as a few lines of code.
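A minimal sketch of what such a generator could look like, assuming figlet is on the PATH and the captcha uses the "big" font. The names `render_figlet` and `generate_table` are made up for illustration; this is not the script referred to in this post.

```ruby
require "digest"

ALPHABET = ("A".."Z").to_a

# Render one token with figlet's "big" font (assumes figlet is installed).
def render_figlet(token)
  IO.popen(["figlet", "-f", "big", token], &:read)
end

# Write one "TOKEN md5hex" line per possible token to `out`.
# The block receives a token and must return the rendered captcha text.
def generate_table(out, length: 6, alphabet: ALPHABET)
  alphabet.repeated_permutation(length) do |chars|
    token = chars.join
    out.puts "#{token} #{Digest::MD5.hexdigest(yield(token))}"
  end
end
```

Something like `generate_table($stdout) { |t| render_figlet(t) }` would then print the whole table. Taking the renderer as a block keeps it easy to swap in a different way of invoking figlet.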
Generating the lookup table for all 26^6 (about 309 million) combinations using
the previous script would be about 11.5 gigs. It stores MD5 values of figlet
output instead of the figlet output to save space and make for simpler lookups
(40 bytes per entry, uncompressed: a 6-character token, a space, a 32-character
hex digest, and a newline).
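The size estimate is easy to check with a few lines of arithmetic. This sketch assumes each entry is laid out as token, space, hex digest, newline, which is what makes up the 40 bytes.

```ruby
entries = 26**6                    # 6 positions, 26 letters each
bytes_per_entry = 6 + 1 + 32 + 1   # token, space, MD5 hex digest, newline
total_bytes = entries * bytes_per_entry

puts "entries: #{entries}"
puts "bytes:   #{total_bytes}"
puts format("size:    %.1f GiB", total_bytes / 2.0**30)
```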
However, if you don't answer correctly within a short period, you get
disconnected. Timing it shows you have 30ish seconds. It's probably not
feasible to grep through 11 gigs of data in 30 seconds, is it? That's reading
through almost 400 mbytes per second. Then again, that's if you store it as a
flat, unsorted structure.
If you sort the data by MD5, you get the benefits of a binary search, which
finds you a result in at most 29 iterations (log2 of 26^6 is about 28.2). Doing
binary search in ruby (like most languages) is very simple. Here's bsearch.rb
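A minimal sketch of that kind of lookup (not the bsearch.rb mentioned above). It assumes the table has fixed-width 40-byte records of the form "TOKEN md5hex", sorted by digest in plain byte order (for example with LC_ALL=C). `lookup`, `RECORD_SIZE`, and `MAX_PROBES` are names made up for this example.

```ruby
RECORD_SIZE = 40                 # 6-char token, space, 32-char digest, newline

# Worst-case probes for n sorted records is floor(log2(n)) + 1,
# which for a positive integer n is the same as n.bit_length.
MAX_PROBES = (26**6).bit_length

# Binary search an IO of fixed-width records sorted by digest.
# Returns the matching token, or nil if the digest is not in the table.
def lookup(io, md5)
  lo = 0
  hi = io.size / RECORD_SIZE - 1
  while lo <= hi
    mid = (lo + hi) / 2
    io.seek(mid * RECORD_SIZE)           # jump straight to record number mid
    token, digest = io.read(RECORD_SIZE).split
    case digest <=> md5
    when 0  then return token
    when -1 then lo = mid + 1            # target sorts after this record
    else         hi = mid - 1            # target sorts before this record
    end
  end
  nil
end
```

Because every record is the same width, each probe is one seek plus one 40-byte read, so a lookup touches only a handful of disk blocks instead of scanning the whole file.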
The output is 'token md5', one entry per line, and it adds up to about 11 gigs
of data. GNU sort is smart enough to use disk for merge sorting on large files.
However, I did this first instead, since I assumed sort would be dumb and try
to sort everything in memory:
choplog -x -p /b/split -b $((50 << 20)) /c/captchas \
| xargs -n1 -tP2 sh -c 'sort -k2 "$1" > /b/sort.$(basename "$1")' -
# The merge has to use the same key the chunks were sorted with.
sort --merge -k2 /b/sort.* > /b/sortedcaptchas
choplog is a project I did last year when I needed a fast way to split large
logfiles (GNU split is slower and less-featured for this task). I split the
output into 50 meg chunks, sort each chunk, then use sort's merge feature to
merge all the data back together quickly.
As it turns out, I don't need to do any of the above splitting and sorting
because GNU sort is smart enough to properly merge sort on-disk for really
large files. You can tweak the memory buffer size with the -S flag and the temp
directory with -T. The manpage says you can specify buffer sizes with unit
notations (M, G, etc) and they go all the way up through E, Z, and Y... just in
case you have a yottabyte of memory? ;)
While I was waiting for the table to generate, I started fetching a few
captchas for testing. It seems like the server I'm connecting to is using a
different version of figlet or a different version of the fonts or that figlet
is being invoked differently. The spacing between only some letters is off.
I can reliably get results if I figlet each letter and paste them together like:
# This matches efnet's captcha output
paste -d "" <(figlet -fbig W) <(figlet -fbig S)
instead of
# This doesn't match efnet's captcha output
figlet -fbig WS
Playing with the kerning options (using -k or -W) doesn't produce the right
output either, only pasting together does.
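The same letter-by-letter pasting can be sketched in ruby, so a table generator could build each captcha the way the server seems to. `paste_glyphs` and `render_token` are hypothetical names; the block passed to `render_token` is assumed to return the rendered text for a single letter (for instance by running figlet -fbig on it).

```ruby
# Join per-letter renderings row by row, like `paste -d ""` does with files.
# Each glyph is a multi-line string; shorter glyphs contribute empty rows.
def paste_glyphs(glyphs)
  rows = glyphs.map { |g| g.lines.map(&:chomp) }
  height = rows.map(&:size).max || 0
  (0...height).map { |i| rows.map { |r| r[i].to_s }.join }.join("\n") + "\n"
end

# Render a whole token by rendering each letter separately and pasting
# the results together. The block renders one letter.
def render_token(token)
  paste_glyphs(token.chars.map { |c| yield(c) })
end
```

Only the newline is stripped from each row, so any trailing spaces inside a glyph are kept. Those spaces are what set the gap between letters.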
I got pretty close to automatically passing the captcha, but I stopped caring
about it. I've run out of energy working on this. I did learn a few edge-case
bits about GNU sort, though, and had a reasonable excuse to dork around with
ruby that didn't involve $work. It also reminded me how much muscle memory I
still have for using xargs.