awk - Week of Unix Tools; Day 3

Intro

This week-of-unix-tools series is intended to be a high concentration of information with little fluff. For sanity's sake I'll cover only one version of each tool: the GNU one.

What is awk?

Hands-down, one of *the* most useful filter tools you'll find. Awk is a scripting language, but I find it is best used from the shell in one-liners.

Basic awk(1) usage

awk [-F<field_sep>] 'awk_script' [file ...]

Records and Fields

Awk has two data concepts that come from file input: Records and Fields.

A record is generally a whole line. The default input record separator (RS) is a newline. You can change this at any time.

A field is generally a word, with words separated by any amount of whitespace (tabs or spaces). The default input field separator (FS) is a single space. FS can be a single character or a regular expression. If FS is a single space, it is treated magically: fields are split roughly as if you had specified [ \t]+, and leading and trailing whitespace on the record is ignored.
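
A minimal sketch of the difference between the default whitespace splitting and an explicit separator. The input strings are made up for illustration.

```shell
# Default FS: runs of spaces and tabs separate fields, leading blanks are ignored.
default_split=$(printf '  alpha \t beta   gamma\n' | awk '{ print NF ": " $2 }')
echo "$default_split"    # 3: beta

# Explicit single-character FS via -F: every colon counts, so empty fields survive.
colon_split=$(printf 'root:x::0\n' | awk -F: '{ print NF ": [" $3 "]" }')
echo "$colon_split"      # 4: []
```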

Field selection

Fields are accessed using the $ "operator". The following are valid:
$1, $2, $3 ...
(first, second and third fields)
$NF
The last field. Nothing special: NF is a variable holding the total number of fields in the current record, so $NF is the last field.
x=1; $(x + 3)
The 4th field. $(x + 3) == $(1 + 3) == $4
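
Here is a quick way to try these selectors out. The sample line is made up, and $(NF - 1) is an extra illustration of the same idea (any expression after $ picks a field).

```shell
line='one two three four five'

first=$(echo "$line" | awk '{ print $1 }')                  # first field
last=$(echo "$line" | awk '{ print $NF }')                  # NF is 5 here, so this is $5
computed=$(echo "$line" | awk '{ x = 1; print $(x + 3) }')  # $(1 + 3) is $4
next_to_last=$(echo "$line" | awk '{ print $(NF - 1) }')    # $(5 - 1) is also $4

echo "$first $last $computed $next_to_last"    # one five four four
```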

Patterns and functions

Awk programs are built from two kinds of things: functions and patterns. I've never bothered writing functions.

Here's what a pattern looks like: [condition_expressions] { [action_expressions] }

Basically this equates to the following pseudocode: if (condition_expressions) { action_expressions }

If no action_expressions are given, the default is 'print', which means 'print $0', which means print the current record. If no condition is given, the default is to execute the action for all records.
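
The two defaults are easy to check side by side. This is a small sketch on invented input (a name and a count per line):

```shell
input=$(printf 'apple 3\nbanana 0\ncherry 7\n')

# Condition only: the default action prints the whole record.
cond_only=$(printf '%s\n' "$input" | awk '$2 > 0')

# Action only: it runs for every record.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Both: the action runs only for records where the condition is true.
both=$(printf '%s\n' "$input" | awk '$2 > 0 { print $1 }')

printf '%s\n---\n%s\n---\n%s\n' "$cond_only" "$action_only" "$both"
```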

Magic patterns: BEGIN and END

BEGIN and END are magic "conditions". BEGIN is used to execute things before the first record has been read, and END to do things after the last record. These patterns cannot be combined with others.
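
A short sketch of how the three kinds of rule fit together, summing a column of made-up numbers:

```shell
summary=$(printf '10\n20\n30\n' | awk '
  BEGIN { print "start" }        # runs once, before any input is read
        { sum += $1 }            # runs for every record
  END   { print "total", sum }   # runs once, after the last record
')
printf '%s\n' "$summary"
# start
# total 60
```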

Sample pattern expressions

length($0) > 72 (From FreeBSD's awk manpage)
Print lines longer than 72 characters
$1 ~ /foo/ { print $2 }
Print the 2nd field of all records where the first field matches /foo/
$5 > 0
Print all records where the 5th field is greater than 0. (Complete with magical number conversion, when possible.)
int($5) > 0
Like the above, but truncate $5 to an integer before comparing (so a value such as 0.5 no longer matches)
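
The plain comparison and the int() version differ for fractional values, because int() truncates toward zero. A tiny check, using a one-field made-up record:

```shell
# 0.5 > 0 is true, so the record matches.
numeric=$(echo '0.5' | awk '$1 > 0 { print "yes" }')

# int(0.5) is 0, and 0 > 0 is false, so nothing is printed.
truncated=$(echo '0.5' | awk 'int($1) > 0 { print "yes" }')

echo "numeric=[$numeric] truncated=[$truncated]"    # numeric=[yes] truncated=[]
```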

Variables

Variables use the same syntax as in C, except that you do not declare them. An unset variable acts as the empty string or 0, depending on how you use it.

Examples:
$2 == "test" { x++ }; END { print x }
Total records where $2 == "test"
{ $1 = ""; print }
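Blank out the first field of every record, print the new record. The field is set to the empty string, not removed, so the separator after it stays in the output.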
{ $3 = "Hello"; print }
Replace the third field with "Hello" and print the record. This one is *super* useful; modifying fields inline is awesome
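
The three examples can be run on small made-up inputs to see exactly what comes out. Note the leading space left behind when $1 is emptied. The `x + 0` in the counter is my own addition so that zero matches print 0 rather than an empty line.

```shell
# Emptying $1 rebuilds the record with a leading separator still in place.
blanked=$(echo 'a b c' | awk '{ $1 = ""; print }')
echo "[$blanked]"     # [ b c]

# Assigning to a field rewrites just that field.
replaced=$(echo 'a b c' | awk '{ $3 = "Hello"; print }')
echo "[$replaced]"    # [a b Hello]

# Count the records whose second field is exactly "test".
count=$(printf 'x test\ny other\nz test\n' \
  | awk '$2 == "test" { x++ } END { print x + 0 }')
echo "$count"         # 2
```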

Arrays

Arrays are magical. You simply start using a variable as an array, and it becomes an array. Arrays are more like dictionaries/hash tables/associative arrays than "real" arrays. Quite useful.

Example: awk '{ a[$1]++ } END { for (i in a) { print i, a[i] } }'
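
A runnable version of that one-liner on a made-up list of user names. The order in which `for (i in a)` visits keys is unspecified, so the sketch pipes the result through sort to get stable output.

```shell
counts=$(printf 'root\njls\nroot\nwww\nroot\n' \
  | awk '{ a[$1]++ } END { for (i in a) { print i, a[i] } }' \
  | LC_ALL=C sort)
printf '%s\n' "$counts"
# jls 1
# root 3
# www 1
```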

String concatenation

String appending is simple.
x = "foo"; x = x"test";    # x == "footest"

print $1","$2" = "$3;      # if input was "hello there world"
                           # output will be: "hello,there = world"
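
One thing worth checking by hand is the difference between putting values next to each other (concatenation) and separating them with a comma in print (which joins them with the output field separator, a space by default). A small sketch on made-up input:

```shell
# Adjacent values are concatenated into one string.
joined=$(echo 'hello there world' | awk '{ print $1 "," $2 " = " $3 }')
echo "$joined"        # hello,there = world

# A comma in print inserts the output field separator instead.
with_comma=$(echo 'hello there' | awk '{ print $1, $2 }')
no_comma=$(echo 'hello there' | awk '{ print $1 $2 }')
echo "$with_comma / $no_comma"    # hello there / hellothere
```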

Example: Open files by user

This example is basically "add things up by a given key, then print them at the end". I use it so often I'm probably just going to write an alias for it in my shell.
% fstat | sed -e 1d \
  | awk '{a[$1]++} END { for (i in a) { print i, a[i] } }' \
  | sort -nk2
smmsp 8
_dhcp 11
www 45
root 328
jls 482

Example: Datestamp input

This particular example is *extremely* useful for long-running programs that output logs or other data without any kind of timestamp. This requires GNU awk.
% (echo hello; sleep 5; echo world) \
  | awk '{ print strftime("%Y/%m/%d %H:%M:%S", systime()), $0 }'
2007/05/22 01:09:47 hello
2007/05/22 01:09:52 world

Example: show non-empty files

% ls -l | awk '$5 > 0'

Example: Date-scan your logs

Let's assume all log entries are syslog format:
May 22 01:12:02 nightfall pptp[860]: anon log ...
Show only log entries between May 10th and May 20th (inclusive)
% cat *.log | awk '$1 == "May" && ($2 >= 10 && $2 <= 20)'
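
To try the filter without real logs, here is a sketch that feeds it a few invented syslog-style lines (the hosts, programs and messages are placeholders) and prints only the day of each match:

```shell
matches=$(printf '%s\n' \
  'May  9 23:59:59 nightfall sshd[1]: too early' \
  'May 10 00:00:01 nightfall pptp[860]: in range' \
  'May 20 12:00:00 nightfall pptp[860]: also in range' \
  'May 21 00:00:00 nightfall sshd[2]: too late' \
  'Jun 15 00:00:00 nightfall sshd[3]: wrong month' \
  | awk '$1 == "May" && ($2 >= 10 && $2 <= 20) { print $2 }')
printf '%s\n' "$matches"
# 10
# 20
```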

Example: Scrape host(1) output

% host www.google.com | awk '/has address/ { print $4 }'

Example: Find an environment variable

I often log in to my workstation remotely and want to use its ssh-agent. So, I need to find the most common value for SSH_AUTH_SOCK on all processes.
% ps aexww \
  | awk '{ for (i = 1; i <= NF; i++) { if ($i ~ /^SSH_AUTH_SOCK=/) { print $i } } }' \
  | sort | uniq -c
  24 SSH_AUTH_SOCK=/tmp/ssh-sc4iKR7ZIf/agent.721
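
When looping over fields, remember that they are numbered 1 through NF and that $0 is the whole record. A minimal sketch on invented ps-style lines (the socket path is a placeholder), where the variable happens to be the last field:

```shell
found=$(printf '%s\n' \
  '721 ?? Ss 0:00.01 HOME=/root SSH_AUTH_SOCK=/tmp/ssh-example/agent.1' \
  '722 ?? S 0:00.02 HOME=/root' \
  | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^SSH_AUTH_SOCK=/) print $i }')
echo "$found"    # SSH_AUTH_SOCK=/tmp/ssh-example/agent.1
```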

Teeth that will bite you

Awk falls into the same trap C does: you can assign in conditions. Here's how you screw up:
% cat *.log | awk '$1 = "May"'
This will replace the first field with "May" for every record. Since the assignment evaluates to "May", and a non-empty string counts as true, every record matches and awk prints your modified $0 with $1 now set to "May". Ouch.
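
The difference is easy to see on two made-up log lines:

```shell
sample=$(printf 'Apr 30 first\nMay 01 second\n')

# Assignment: every record matches and has its first field overwritten.
wrong=$(printf '%s\n' "$sample" | awk '$1 = "May"')

# Comparison: only the record that really starts with May is printed.
right=$(printf '%s\n' "$sample" | awk '$1 == "May"')

printf '%s\n---\n%s\n' "$wrong" "$right"
# May 30 first
# May 01 second
# ---
# May 01 second
```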

7 responses to 'awk - Week of Unix Tools; Day 3'

Eduardo I. wrote at Wed Aug 15 08:26:43 2007...
Awesome series of articles, nice examples. Thanks for producing this.

Eduardo I. wrote at Wed Aug 15 08:27:29 2007...
Awesome series of articles, nice examples. Thanks for writing this.

Eduardo I. wrote at Wed Aug 15 08:29:54 2007...
Your blog engine doesn't quite cut it though : )

jon leslie wrote at Wed Apr 9 09:35:57 2008...
great tutorial, however:

{ $1 = ""; print }
  Delete the first record of every record, print the new record

don't you mean

  Delete the first FIELD of every record, print
the new record

Jordan Sissel wrote at Wed Apr 9 09:57:57 2008...
Yep, that's what I meant. I'll fix it, thanks! :)

Xavier wrote at Wed May 14 06:44:03 2008...
{ $1=""; print } does not delete the first field. It just sets it to the empty string.
echo "a,b,c" | awk 'BEGIN {FS=","; OFS=","} {$1=""; print;}' returns ",b,c".
If there were only two fields in $0, we should see "b,c" instead

mb wrote at Thu Jul 17 20:27:04 2008...
To make it look neater you need to get rid of the extra spaces created, using gsub function, or pipe it to awk '{$1=$1}1'

echo "a b c d e f 1 2 3 "|awk '{$1=$4=$5=$6="";gsub(/[ ]+/," ")}1'

b c 1 2 3

With no gsub you'll see extra gaps in the output:

echo "a b c d e f 1 2 3 "|awk '{$1=$4=$5=$6=""}1'

b c  1 2 3

