Jordan Sissel

Wed, 25 Jul 2007

Field extraction tool

I spent tonight implementing and extending one of my favorite features of xapply: subfield extraction, which uses this syntax: %[1,2:1]

The gist is that you specify a sequence of field number, separator, field number, separator, and so on, to quickly tokenize the input and pull out the specific data you want. Basically, it gives you *extremely* concise syntax for a subset of the features provided by cut(1).

My tool expands on this a bit further. It's best shown by example:

% ./fex '0:-2/1' < /etc/passwd | sort  | uniq -c
      3 bin 
      1 dev 
      4 home 
      2 nonexistent 
      1 root 
      2 usr 
     14 var 
The string '0:-2/1' means:
  • 0 - the full string (aka "root:x:0:0:root:/root:/bin/bash").
    "0" here uses awk semantics where $0 in awk is the full record and $1 is the first field.
  • : - split by colons
  • -2 - take the 2nd to last token (by colon) (aka "/root")
    Negative offsets aren't available in xapply, but are valid here.
  • / - split that by "/"
  • 1 - take the 1st token (aka "root"; the empty token before the leading "/" isn't counted, as the output shows)
The output is essentially the top-level directory of each user's home directory. Doing this in awk, cut, perl, or any other tool would take much more typing.
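
For comparison, here is a rough awk counterpart of '0:-2/1'. It is only a sketch of the same extraction, not how fex works internally. The function name top_level_home is made up for illustration, and skipping empty tokens is an assumption based on the output above.

```shell
# Rough awk counterpart of: fex '0:-2/1'
# top_level_home is a hypothetical helper name, not part of fex.
top_level_home() {
  awk -F: 'NF >= 2 {
    # $(NF-1) is the 2nd to last colon-separated field, e.g. "/root"
    n = split($(NF-1), parts, "/")
    # print the first non-empty piece (assumed to match what fex does)
    for (i = 1; i <= n; i++)
      if (parts[i] != "") { print parts[i]; break }
  }'
}

printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' | top_level_home
# prints: root
```

Piping /etc/passwd through top_level_home and then sort | uniq -c should give the same kind of tally as the fex command above, at the cost of a lot more typing.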

You can also specify multiple field extractions on a single invocation:

# Take the first and 2nd to last tokens, split by colon
# (each extraction's output is separated by a space)
% ./fex '0:1' '0:-2' < /etc/passwd  
root /root 
daemon /usr/sbin 
bin /bin 

# Alternatively, {x,y,z,...} syntax selects multiple tokens
# note that the output is joined by colons.
# Again, this is a feature unavailable in xapply's subfield extraction
% ./fex '0:{1,-2}' < /etc/passwd
root:/root
daemon:/usr/sbin
bin:/bin
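
For readers more used to awk, the {1,-2} selection maps roughly onto printing two fields with the output separator set to a colon. This is a sketch for comparison only; first_and_home is a hypothetical name, and the sample line is just a typical passwd entry.

```shell
# Rough awk counterpart of: fex '0:{1,-2}'
# first_and_home is a hypothetical helper name, not part of fex.
first_and_home() {
  # -F: splits input on colons; OFS=: rejoins the selected fields with a colon
  awk -F: -v OFS=: 'NF >= 2 { print $1, $(NF-1) }'
}

printf '%s\n' 'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' | first_and_home
# prints: daemon:/usr/sbin
```

Leaving out -v OFS=: makes awk join the two fields with a space instead, which resembles the output of the two-argument form ('0:1' '0:-2').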

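# Parse URLs out of Apache logs: split by '"' and take the 2nd token
# (the request line), then split by space and take the 2nd token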
% ./fex '0"2 2' < access | head -4
/
/icons/blank.gif
/icons/folder.gif
/favicon.ico
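
The same two-step split can be spelled out in awk, which shows what '0"2 2' is doing. This is a sketch under assumptions: request_path is a hypothetical name, and the log line is a made-up sample in the usual Apache access log layout, not taken from a real log.

```shell
# Rough awk counterpart of: fex '0"2 2'
# request_path is a hypothetical helper name, not part of fex.
request_path() {
  awk -F'"' 'NF >= 2 {
    # $2 is the text between the first pair of double quotes,
    # i.e. the request line such as: GET /favicon.ico HTTP/1.1
    split($2, req, " ")
    # req[2] is the requested path
    print req[2]
  }'
}

# Made-up sample log line for illustration
printf '%s\n' '127.0.0.1 - - [25/Jul/2007:04:04:00 -0700] "GET /favicon.ico HTTP/1.1" 200 1406' | request_path
# prints: /favicon.ico
```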

I still have tests to write and bugs to fix, so you won't find a release yet.

Permalink: /geekery/field-extraction-tool
posted at: 04:04
