fex - flexible token-field extraction

What is fex?

Fex is a powerful field extraction tool. It provides a very concise language for tokenizing strings and extracting fields.

The basic usage model is that you provide a series of delimiter and field selection pairs. Delimiters can be any character, while field selections have a specific syntax.

Field selection

There are a few ways to specify field selections.
  • A single number N picks the Nth field.
  • Comma-separated list inside curly braces: {1,2,3}
  • Colon-delimited range, inside curly braces: {N:M}. Examples: {1:3}, {1:}, or {:3}. If no M is specified, as in {N:}, the range runs from N to the end. If no N is specified, as in {:M}, N is assumed to be 1 (start of the string). If neither N nor M is specified, as in {:}, the entire string is selected.
Notes:
  • Negative numbers are treated as offsets from the end of the string, so -1 is the last field.
  • The number '0' is special and selects the entire string, as does {:}.
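
If you already think in awk terms, the negative-offset and 0 rules can be imitated roughly as follows. This is only a sketch of the semantics described above, not how fex is implemented. The name field_at is a hypothetical helper, and because it splits on every delimiter (awk's behavior with an explicit -F) it does not reproduce fex's default handling of empty fields.

```shell
# field_at DELIM N: print field N of each input line (hypothetical helper).
#   N > 0  counts from the start, N < 0 counts back from the end,
#   N = 0  prints the whole line, mirroring awk's $0.
field_at() {
  awk -F"$1" -v n="$2" '{ i = (n < 0) ? NF + 1 + n : n; print $i }'
}

line='root:x:0:0:root:/root:/bin/bash'
printf '%s\n' "$line" | field_at : 1    # root
printf '%s\n' "$line" | field_at : -2   # /root
printf '%s\n' "$line" | field_at : 0    # the whole line
```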

Tokenizing behavior

The default (greedy) behavior is to ignore empty fields. That is, the string "foo...bar" has only two fields when split by "." rather than four. If you want fex to keep empty fields (nongreedy), prefix your field selection with "?" inside curly braces, as in {?6}.
# Greedy (default)
% echo "foo.....bar..baz.fizz" | fex .2
bar

# Nongreedy
% echo "foo.....bar..baz.fizz" | fex '.{?6}'
bar
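
The same contrast can be imitated with standard tools, which may make the two modes easier to picture. This is a sketch under assumptions, not fex's implementation: greedy_field and nongreedy_field are hypothetical helper names, and the greedy version does not handle input that begins with the delimiter.

```shell
# greedy_field DELIM N: squeeze runs of DELIM first, so empty fields vanish.
greedy_field() {
  tr -s "$1" | awk -F"$1" -v n="$2" '{ print $n }'
}

# nongreedy_field DELIM N: split on every DELIM, so empty fields are counted.
nongreedy_field() {
  awk -F"$1" -v n="$2" '{ print $n }'
}

echo "foo.....bar..baz.fizz" | greedy_field . 2      # bar
echo "foo.....bar..baz.fizz" | nongreedy_field . 6   # bar
```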

Command line arguments

You can specify multiple, independent field selectors on the command line. Each argument is treated as a standalone field selector. On output, the results of the selectors are separated by spaces (though I am open to changing this).

For example, output the IP and URL from an Apache request log:
echo '208.36.144.8 - - [22/Aug/2007:23:39:05 -0400] "GET /svnweb/logwatch/tags/?pathrev=420 HTTP/1.0" 200 3595' \
| fex 1 '"2 2'
208.36.144.8 /svnweb/logwatch/tags/?pathrev=420 

Usage Examples

Simple splitting
Input: "/usr/local/bin/firefox"
fex /1 == "usr"
fex /{2:3} == "local/bin"
fex /{1,-1} == "usr/firefox"
fex /-1 == "firefox"
fex /{:} == "/usr/local/bin/firefox"
fex /0 == "/usr/local/bin/firefox"

Greedy vs nongreedy splitting
Input: "a:b::c:::d"
fex :{1:3} == "a:b:c"
fex :{?1:3} == "a:b:"
fex :{3} == "c"
fex :{?3} == "" (empty result)

Real world uses

Here's a simple example, to find which top-level directories contain home directories:
% ./fex '0:-2/1' < /etc/passwd | sort | uniq -c
      3 bin
      1 dev
      4 home
      2 nonexistent
      1 root
      2 usr
     14 var
The string '0:-2/1' means:
  • 0 - the full string, "root:x:0:0:root:/root:/bin/bash"
    "0" here uses awk semantics where $0 in awk is the full record and $1 is the first field.
  • : - split by colons
  • -2 - take the 2nd to last token, "/root"
  • / - split that by "/"
  • 1 - take the 1st token, "root"
The output is essentially the top-level directory of each user's home directory, counted by uniq. Doing this in awk, cut, perl, or most other tools would take much more typing.
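
For comparison, here is roughly the same extraction written with awk. It is a sketch, not a statement about how fex works internally: top_dir is a hypothetical helper name, and it assumes the home directory field starts with "/", so the first non-empty path component is the second element returned by split.

```shell
# top_dir: for each passwd-style line, print the first path component
# of the second-to-last colon-separated field (hypothetical helper).
top_dir() {
  awk -F: '{ split($(NF-1), parts, "/"); print parts[2] }'
}

printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' | top_dir   # root
```

To get the same counts as the fex example, run: top_dir < /etc/passwd | sort | uniq -c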

You can also specify multiple field extractions on a single invocation:
# Take the first and 2nd to last token split by colon
% ./fex '0:1' '0:-2' < /etc/passwd
root /root
daemon /usr/sbin
bin /bin

# Alternatively, {x,y,z,...} syntax selects multiple tokens
# Note that the output is joined by colons.
# This feature is unavailable in xapply's subfield extraction
% ./fex '0:{1,-2}' < /etc/passwd
root:/root
daemon:/usr/sbin
bin:/bin

# Parse URLs out of Apache logs:
% ./fex '0"2 2' < access | head -4
/
/icons/blank.gif
/icons/folder.gif
/favicon.ico

Download

fex-20071119.tar.gz

Looking for an older version? Try the fex release archive

4 responses to 'fex - flexible token-field extraction'

Adam wrote at Tue Oct 30 06:01:38 2007...
I'm curious about this:

Usage Examples

Simple splitting
  Input: "/usr/local/bin/firefox"
  fex /1 == "usr"

Why does this not return an empty string? Since 0 is a special character to represent the entire string, 1, then, must be the first element before the delimiter. In this case, the delimiter is '/' and the first string would be empty.

[1]/[2]usr/[3]local/[4]bin/[5]firefox

If I have a CSV where some records do not have the first element defined and some do, is there any way to reliably grab the first element for only the ones where it is defined using fex?

Jordan Sissel wrote at Tue Oct 30 12:32:20 2007...
Adam,

I was mostly going for awk semantics here, following how awk treats spaces, but fex treats all delimiters the same.

I agree this might be bad for some use cases. I'll add a flag that gives you the semantics you want.

Adam wrote at Wed Oct 31 04:50:13 2007...
Jordan, that sounds pretty awesome. I love fex, it's so handy for manipulating data. I was getting tired of using awk and cut and everything else. Also, are you sure about the awk semantics? I just tried the following:

# cat test.csv
,second,third,forth,,sixth
first,second,third,,,
first,,third,forth,fifth,sixth
,,third,,fifth,
# cat test.csv | awk -F',' '{print $1}'

first
first

#

Jordan Sissel wrote at Thu Nov 1 16:39:00 2007...
Yeah, I meant how awk treats spaces. For any other delimiter awk behaves the way you describe.

Both behaviors are great to have, I'll look into adding a modifier/flag that lets you change behaviors.

