Field extraction tool

Tonight was spent implementing and extending one of my favorite features of xapply: its subfield extraction feature, aka this syntax: %[1,2:1]

The gist of this is that you specify a sequence of field number, separator, field number, separator, etc., to get some very quick tokenization that pulls out exactly the data you want. Basically, it gives you *extremely* concise syntax for a subset of the features provided by cut(1).

My tool expands on this a bit further. It's best shown by example:

% ./fex '0:-2/1' < /etc/passwd | sort  | uniq -c
      3 bin 
      1 dev 
      4 home 
      2 nonexistent 
      1 root 
      2 usr 
     14 var 
The string '0:-2/1' means:
  • 0 - the full string (aka "root:x:0:0:root:/root:/bin/bash").
    "0" here uses awk semantics, where $0 is the full record and $1 is the first field.
  • : - split by colons
  • -2 - take the 2nd to last token (by colon) (aka "/root")
    Negative offsets aren't available in xapply, but are valid here.
  • / - split that by "/"
  • 1 - take the 1st token (aka "root")
The output is essentially the top-level directory of each user's home directory, counted up with uniq. Doing this in awk, cut, perl, or any other tool would take much more typing.
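
For comparison, here is roughly the same extraction hand-written in awk (a sketch for comparison, not part of fex); note the a[2], since the leading "/" in the home directory makes the first split element empty:

% awk -F: '{split($(NF-1), a, "/"); print a[2]}' < /etc/passwd | sort | uniq -c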

You can also specify multiple field extractions on a single invocation:

# Take the first and 2nd-to-last colon-separated tokens
% ./fex '0:1' '0:-2' < /etc/passwd  
root /root 
daemon /usr/sbin 
bin /bin 

# Alternatively, {x,y,z,...} syntax selects multiple tokens.
# Note that the output is joined by colons.
# Again, this is a feature unavailable in xapply's subfield extraction.
% ./fex '0:{1,-2}' < /etc/passwd
root:/root
daemon:/usr/sbin
bin:/bin
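
Without the braces you would have to re-join the fields yourself; a rough awk equivalent (again a sketch, not fex) gives the same result on the same first three passwd entries:

% awk -F: '{print $1 ":" $(NF-1)}' < /etc/passwd | head -3
root:/root
daemon:/usr/sbin
bin:/bin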

# Parse request URLs out of Apache logs:
% ./fex '0"2 2' < access | head -4
/
/icons/blank.gif
/icons/folder.gif
/favicon.ico
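
Reading the selector the same way as before: '0"2 2' takes the whole line, splits it on double quotes, takes the 2nd token (the request, e.g. "GET /favicon.ico HTTP/1.0"), then splits that on spaces and takes the 2nd token (the path). Assuming the usual Apache common/combined log format, a rough awk equivalent (my sketch) is:

% awk -F'"' '{split($2, a, " "); print a[2]}' < access | head -4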

I still have tests to write and bugs to fix, so you won't find a release yet.