Jordan Sissel
geek

Tue, 27 Nov 2007

fex - flexible token-field extraction

What is fex?

Fex is a field extraction tool. It provides a very concise language for tokenizing strings and extracting fields.

The basic usage model is that you provide a series of delimiter and field selection pairs. Delimiters can be any character, while field selections have a specific syntax.

Field selection

There are a few ways to specify field selections.
  • Just a number, picks the Nth field.
  • Comma-separated list inside curly braces: {1,2,3}
  • Colon-delimited range inside curly braces: {N:M}. Examples: {1:3}, {1:}, or {:3}. If M is omitted, as in {N:}, the range runs from N to the end. If N is omitted, as in {:M}, N is assumed to be 1 (start of the string). If both are omitted, {:}, the entire string is selected.
Notes:
  • Negative numbers are treated as offsets from the end of the string, so -1 is the last field
  • The number '0' is special and means the entire string, as does {:}
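
For comparison, here is a rough sketch of the same kinds of selections using standard awk and cut. This is not fex itself: the helper names (pick_nth, pick_last, pick_fields) are made up for illustration, and the sketch assumes a one-character delimiter and input with no empty fields.

```shell
# Hypothetical helpers, NOT part of fex: approximate fex field selections
# with awk and cut. Assumes a one-character delimiter and no empty fields.

pick_nth()    { awk -F"$1" -v n="$2" '{ print $n }'; }   # roughly fex ':2'
pick_last()   { awk -F"$1" '{ print $NF }'; }            # roughly fex ':-1'
pick_fields() { cut -d"$1" -f"$2"; }                     # cut-style list or range

echo 'a:b:c:d' | pick_nth : 2        # b
echo 'a:b:c:d' | pick_last :         # d
echo 'a:b:c:d' | pick_fields : 2-3   # b:c    (like {2:3})
echo 'a:b:c:d' | pick_fields : 2-    # b:c:d  (like {2:})
echo 'a:b:c:d' | pick_fields : 1,3   # a:c    (like {1,3})
```

Note that cut has no notion of negative offsets, which is why the "last field" case falls back to awk's $NF.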

Tokenizing behavior

Empty fields are ignored. In other words, consecutive delimiters are treated as a single delimiter. For example:
% echo "foo.....bar..baz.fizz" | fex .2
bar
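
To get the same result without fex, one approach is to squeeze repeated delimiters first and then cut. This is only a sketch: squeeze_pick is a hypothetical helper name, and it assumes a one-character delimiter that tr does not treat specially and input that does not start with the delimiter (cut would count a leading delimiter as an empty first field).

```shell
# Hypothetical helper, NOT part of fex: collapse runs of the delimiter
# with tr -s, then select a field with cut.
squeeze_pick() { tr -s "$1" | cut -d"$1" -f"$2"; }

echo "foo.....bar..baz.fizz" | squeeze_pick . 2   # bar
```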

Command line arguments

You can specify multiple, independent field selectors on the command line. Each argument is treated as a standalone field selector. The results of the selectors are separated by spaces on output (though I am open to changing this).

For example, to output the IP and URL from an Apache request log:
echo '208.36.144.8 - - [22/Aug/2007:23:39:05 -0400] "GET /svnweb/logwatch/tags/?pathrev=420 HTTP/1.0" 200 3595' \
| fex 1 '"2 2'
208.36.144.8 /svnweb/logwatch/tags/?pathrev=420 
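
As a point of comparison, the same two extractions can be sketched in plain awk. The function name ip_and_url is hypothetical, and the sketch assumes the log line has the same shape as the one above.

```shell
# Sample line copied from the example above.
log='208.36.144.8 - - [22/Aug/2007:23:39:05 -0400] "GET /svnweb/logwatch/tags/?pathrev=420 HTTP/1.0" 200 3595'

# Hypothetical helper, NOT part of fex.
ip_and_url() {
  awk '{
    ip = $1                      # first whitespace-separated field
    split($0, quoted, "\"")      # quoted[2] is the text inside the quotes
    split(quoted[2], req, " ")   # req[2] is the requested URL
    print ip, req[2]
  }'
}

printf '%s\n' "$log" | ip_and_url
# 208.36.144.8 /svnweb/logwatch/tags/?pathrev=420
```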

Usage Examples

Simple splitting
Input: "/usr/local/bin/firefox"
fex /1 == "usr"
fex /{2:3} == "local/bin"
fex /{1,-1} == "usr/firefox"
fex /-1 == "firefox"
fex /{:} == "/usr/local/bin/firefox"
fex /0 == "/usr/local/bin/firefox"

Greedy vs nongreedy splitting
Input: "a:b::c:::d"
fex :{1:3} == "a:b:c"
fex :{?1:3} == "a:b:"
fex :{3} == "c"
fex :{?3} == "" (empty result)
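
The difference between the two modes can be imitated with awk's field separator: a one-character separator keeps empty fields (like the ? form in the examples above), while a "one or more" regex collapses runs of delimiters (like the default). This is a sketch with hypothetical helper names, and it assumes the delimiter is not a regex metacharacter and that the input does not start with the delimiter.

```shell
# Hypothetical helpers, NOT part of fex.
greedy()    { awk -F"$1+" -v n="$2" '{ print $n }'; }   # runs of delimiters count once
nongreedy() { awk -F"$1"  -v n="$2" '{ print $n }'; }   # every delimiter counts

echo 'a:b::c:::d' | greedy : 3      # c
echo 'a:b::c:::d' | nongreedy : 3   # (empty line)
echo 'a:b::c:::d' | cut -d: -f1-3   # a:b:  (cut never collapses delimiters)
```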

Real world uses

Here's a simple example, to find which root directories contain home directories:
% ./fex '0:-2/1' < /etc/passwd | sort | uniq -c
      3 bin
      1 dev
      4 home
      2 nonexistent
      1 root
      2 usr
     14 var
The string '0:-2/1' means:
  • 0 - the full string, "root:x:0:0:root:/root:/bin/bash"
    "0" here uses awk semantics where $0 in awk is the full record and $1 is the first field.
  • : - split by colons
  • -2 - take the 2nd to last token, "/root"
  • / - split that by "/"
  • 1 - take the 1st token, "root"
The output is the top-level directory of each user's home directory, with a count of how many users share it. Doing this in awk, cut, perl, or any other tool would be much more typing.
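
To give a sense of the extra typing, here is one way the same extraction might look in awk. The function name home_root is hypothetical, the two passwd-style lines are sample input rather than real output, and the sketch assumes each home directory is an absolute path.

```shell
# Hypothetical helper, NOT part of fex: take the 2nd-to-last colon field,
# then the first path component after the leading "/".
home_root() {
  awk -F: '{
    split($(NF-1), parts, "/")   # "/root" gives parts[1]="" and parts[2]="root"
    print parts[2]
  }'
}

# Sample passwd-style input, for illustration only.
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' | home_root
# root
# usr
```

Piping the result through sort | uniq -c gives the same kind of count as the fex example.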

You can also specify multiple field extractions on a single invocation:
# Take the first and 2nd to last token split by colon
% ./fex '0:1' '0:-2' < /etc/passwd
root /root
daemon /usr/sbin
bin /bin

# Alternatively, the {x,y,z,...} syntax selects multiple tokens.
# Note that the output is joined by colons.
# This feature is unavailable in xapply's subfield extraction.
% ./fex '0:{1,-2}' < /etc/passwd
root:/root
daemon:/usr/sbin
bin:/bin

# Parse URLs out of Apache logs:
% ./fex '0"2 2' < access | head -4
/
/icons/blank.gif
/icons/folder.gif
/favicon.ico

Download

fex-20071119.tar.gz

Looking for an older version? Try the fex release archive


Permalink: /projects/fex/main
posted at: 15:40


4 responses to 'fex - flexible token-field extraction'

Adam posted at Tue Oct 30 06:01:38 2007...
I'm curious about this:

Usage Examples

Simple splitting
  Input: "/usr/local/bin/firefox"
  fex /1 == "usr"

Why does this not return an empty string? Since 0 is a special character to represent the entire string, 1, then, must be the first element before the delimiter. In this case, the delimiter is '/' and the first string would be empty.

[1]/[2]usr/[3]local/[4]bin/[5]firefox

If I have a CSV where some records do not have the first element defined and some do, is there any way to reliably grab the first element for only the ones where it is defined using fex?

Jordan Sissel posted at Tue Oct 30 12:32:20 2007...
Adam,

I was mostly going for awk semantics here with how awk treats spaces, but fex treats all delimiters the same.

I agree this might be bad for some use cases. I'll add a flag that gives you the semantics you want.

Adam posted at Wed Oct 31 04:50:13 2007...
Jordan, that sounds pretty awesome. I love fex, it's so handy for manipulating data. I was getting tired of using awk and cut and everything else. Also, are you sure about the awk semantics? I just tried the following:

# cat test.csv
,second,third,forth,,sixth
first,second,third,,,
first,,third,forth,fifth,sixth
,,third,,fifth,
# cat test.csv | awk -F',' '{print $1}'

first
first

#

Jordan Sissel posted at Thu Nov 1 16:39:00 2007...
Yeah, I meant how awk treats spaces. For any other delimiter awk behaves the way you describe.

Both behaviors are great to have, I'll look into adding a modifier/flag that lets you change behaviors.

