Jordan Sissel

Wed, 09 Apr 2008

awk - Week of Unix Tools; Day 3

Intro

This week-of-unix-tools is intended to be a high concentration of information with little fluff. For sanity's sake, I'll be covering only the GNU versions of the tools.

What is awk?

Hands down, one of *the* most useful filter tools you'll find. Awk is a scripting language, but I find it is best used from the shell in one-liners.

Basic awk(1) usage

awk [-F<field_sep>] 'awk_script' [file ...]

Records and Fields

Awk has two data concepts that come from file input: Records and Fields.

A record is generally a whole line. The default input record separator (RS) is a newline. You can change this at any time.

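A field is generally a word, separated from its neighbors by any amount of whitespace (tabs or spaces). The default input field separator (FS) is a single space. FS can be a single character or a regular expression. A single-space FS is treated magically: fields are split on runs of spaces and tabs, and leading and trailing whitespace is ignored. That is close to specifying [ \t]+, except that a literal [ \t]+ gives you an empty first field when the line starts with whitespace.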
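
Here's a small sketch of those separator rules in action. The sample input is made up for illustration; the second command shows where a literal [ \t]+ differs from the default.

```shell
# Default FS: leading blanks are ignored, runs of spaces/tabs separate fields.
printf '  a \t b\n' | awk '{ print NF ": " $1 }'    # prints "2: a"

# A regex FS is applied literally, so leading blanks yield an empty first field.
printf '  a b\n' | awk -F'[ \t]+' '{ print NF }'    # prints "3"

# -F sets FS from the command line; here a single colon.
printf 'root:x:0\n' | awk -F: '{ print $3 }'        # prints "0"
```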

Field selection

Fields are accessed using the $ "operator". The following are valid:
$1, $2, $3 ...
(first, second and third fields)
$NF
The last field. Nothing special: NF is a variable holding the total number of fields in the current record, so $NF is the last field.
x=1; $(x + 3)
The 4th field. $(x + 3) == $(1 + 3) == $4
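
To try these out, here's a quick sketch against a made-up five-word line (the variable name "line" is just for this example):

```shell
line='one two three four five'

echo "$line" | awk '{ print $1, $3 }'            # prints "one three"
echo "$line" | awk '{ print $NF }'               # prints "five"
echo "$line" | awk '{ print $(NF - 1) }'         # prints "four"
echo "$line" | awk '{ x = 1; print $(x + 3) }'   # prints "four"
```

Anything inside $( ... ) is an ordinary expression, so arithmetic on NF is a handy way to count fields from the end.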

Patterns and functions

An awk program is made up of two kinds of things: function definitions and patterns (a condition plus an action). I've never bothered writing functions.

Here's what a pattern looks like: [condition_expressions] { [action_expressions] }

Basically this equates to the following pseudocode: if (condition_expressions) { action_expressions }

If no action_expression is defined, the default is 'print' which means 'print $0' which means print the current record. If no condition is given, the default is to execute the action for all records.
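
A minimal sketch of both defaults, using a made-up three-line input:

```shell
# Condition only: the default action prints each matching record.
printf 'apple 3\nbanana 0\ncherry 7\n' | awk '$2 > 0'
# apple 3
# cherry 7

# Action only: the action runs for every record.
printf 'apple 3\nbanana 0\ncherry 7\n' | awk '{ print $1 }'
# apple
# banana
# cherry
```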

Magic patterns: BEGIN and END

BEGIN and END are magic "conditions". BEGIN runs its action before the first record has been read, and END runs its action after the last record has been processed. These patterns cannot be combined with other conditions.
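
For instance, here's a rough sketch that sums a column, with a header printed up front and the total printed at the end (the input numbers and the variable name "total" are made up):

```shell
printf '3\n4\n5\n' | awk '
  BEGIN { print "summing..." }     # runs once, before any input is read
  { total += $1 }                  # runs for every record
  END { print "total:", total }    # runs once, after the last record
'
# summing...
# total: 12
```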

Sample pattern expressions

length($0) > 72 (From FreeBSD's awk manpage)
Print lines longer than 72 characters
$1 ~ /foo/ { print $2 }
Print the 2nd field of all records where the first field matches /foo/
$5 > 0
Print all records where the 5th field is greater than 0. (Complete with magical number conversion, when possible.)
int($5) > 0
Similar to the above, but force $5 to an int (truncating any fraction) before comparing
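
The two comparisons differ once fractions show up, since int() truncates. A quick sketch with made-up input:

```shell
# The fifth field is 0.5: greater than 0, but int(0.5) is 0.
echo 'a b c d 0.5' | awk '$5 > 0'         # prints the line
echo 'a b c d 0.5' | awk 'int($5) > 0'    # prints nothing
echo 'a b c d 2.5' | awk 'int($5) > 0'    # prints the line, since int(2.5) is 2
```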

Variables

Variables use the same syntax as in C. You do not declare variables; an unset variable acts as an empty string or 0, depending on how you use it.

Examples:
$2 == "test" { x++ }; END { print x }
Total records where $2 == "test"
{ $1 = ""; print }
Blank out the first field of every record, print the new record. (The field is set to the empty string, not removed, so its separator stays.)
{ $3 = "Hello"; print }
Replace the third field with "Hello" and print the new record. This one is *super* useful; modifying fields inline is awesome
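
Here's what those look like when run, as a sketch on made-up input. Note that assigning to a field rebuilds the record using the output field separator (OFS), and that blanking $1 leaves its separator behind:

```shell
# Count matching records; "x + 0" prints 0 instead of an empty line when nothing matched.
printf 'a test\nb other\nc test\n' \
  | awk '$2 == "test" { x++ } END { print x + 0 }'         # prints "2"

# Replace a field in place.
echo 'a b c' | awk '{ $3 = "Hello"; print }'               # prints "a b Hello"

# Blanking $1 keeps the separator that followed it.
echo 'a,b,c' | awk -F, -v OFS=, '{ $1 = ""; print }'       # prints ",b,c"
```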

Arrays

Arrays are magical. You simply start using a variable as an array, and it becomes an array. Arrays are more like dictionaries/hash tables/associative arrays than "real" arrays. Quite useful.

Example: awk '{ a[$1]++ } END { for (i in a) { print i, a[i] } }'

String concatenation

String appending is simple: put two strings next to each other and they are joined. There is no concatenation operator.
x = "foo"; x = x"test";    # x == "footest"

print $1","$2" = "$3;      # if input was "hello there world"
                           # output will be: "hello,there = world"

Example: Open files by user

This example is basically "add things up by a given key, then print them at the end". I use it so often I'm probably just going to write an alias for it in my shell.
% fstat | sed -e 1d \
  | awk '{a[$1]++} END { for (i in a) { print i, a[i] } }' \
  | sort -nk2
smmsp 8
_dhcp 11
www 45
root 328
jls 482
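
If you'd rather have this as a shell function than an alias, here's one possible sketch. The name count_by_field and its optional argument are made up for this example, and the sample input stands in for real fstat output:

```shell
# Hypothetical helper: count how often each value of field N appears on stdin.
# Usage: count_by_field [N]    (N defaults to 1)
count_by_field() {
  awk -v n="${1:-1}" '{ a[$n]++ } END { for (i in a) print i, a[i] }' \
    | sort -k2,2n
}

printf 'jls sh\nroot init\njls vim\njls awk\nroot cron\n' | count_by_field 1
# root 2
# jls 3
```

The order of "for (i in a)" is unspecified, which is why the output goes through sort.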

Example: Datestamp input

This particular example is *extremely* useful for long-running programs that output logs or other data without any kind of timestamp. This requires GNU awk.
% (echo hello; sleep 5; echo world) \
  | awk '{ print strftime("%Y/%m/%d %H:%M:%S", systime()), $0 }'
2007/05/22 01:09:47 hello
2007/05/22 01:09:52 world

Example: show non-empty files

% ls -l | awk '$5 > 0'

Example: Date-scan your logs

Let's assume all log entries are syslog format:
May 22 01:12:02 nightfall pptp[860]: anon log ...
Show only log entries between May 10th and May 20th (inclusive)
% cat *.log | awk '$1 == "May" && ($2 >= 10 && $2 <= 20)'

Example: Scrape host(1) output

% host www.google.com | awk '/has address/ { print $4 }'

Example: Find an environment variable

I often log in to my workstation remotely and want to use its ssh-agent. So, I need to find the most common value for SSH_AUTH_SOCK on all processes.
% ps aexww \
  | awk '{ for (i = 1; i <= NF; i++) { if ($i ~ /^SSH_AUTH_SOCK=/) { print $i } } }' \
  | sort | uniq -c
  24 SSH_AUTH_SOCK=/tmp/ssh-sc4iKR7ZIf/agent.721

Teeth that will bite you

Awk shares a pitfall with C: you can assign inside a condition. Here's how you screw up:
% cat *.log | awk '$1 = "May"'
This will replace the first field with "May" for every record. The assignment evaluates to "May", a non-empty string, which counts as true, so awk prints every record with $1 now set to "May". Ouch. You wanted == here.
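
Side by side, on a made-up two-line log, the difference looks like this:

```shell
# Assignment by mistake: every record is rewritten and printed.
printf 'May 22 foo\nJun 01 bar\n' | awk '$1 = "May"'
# May 22 foo
# May 01 bar

# Comparison, as intended: only matching records are printed, unmodified.
printf 'May 22 foo\nJun 01 bar\n' | awk '$1 == "May"'
# May 22 foo
```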


Permalink: /articles/week-of-unix-tools/day-3-awk
posted at: 13:23


6 responses to 'awk - Week of Unix Tools; Day 3'

Eduardo I. posted at Wed Aug 15 11:26:43 2007...
Awesome series of articles, nice examples. Thanks for producing this.

Eduardo I. posted at Wed Aug 15 11:27:29 2007...
Awesome series of articles, nice examples. Thanks for writing this.

Eduardo I. posted at Wed Aug 15 11:29:54 2007...
Your blog engine doesn't quite cut it though : )

jon leslie posted at Wed Apr 9 12:35:57 2008...
great tutorial, however:

{ $1 = ""; print }
  Delete the first record of every record, print the new record

don't you mean

  Delete the first FIELD of every record, print
the new record

Jordan Sissel posted at Wed Apr 9 12:57:57 2008...
Yep, that's what I meant. I'll fix it, thanks! :)

Xavier posted at Wed May 14 09:44:03 2008...
{ $1=""; print } does not delete the first field. It just sets it to the empty string.
echo "a,b,c" | awk 'BEGIN {FS=","; OFS=","} {$1=""; print;}' returns ",b,c".
If there were only two fields in $0, we should see "b,c" instead

