sed - Week of Unix Tools; Day 1

Intro

I think it's fair to say that not enough people know sed, probably because it looks scary. This week-of-unix-tools series is intended to be a high concentration of information with little fluff. For sanity's sake, I'll cover only the GNU versions of the tools.

What is sed?

Sed is short for 'stream editor': it applies editing commands to a stream of text, line by line, as it flows through.

Basic usage and Invocation

sed [-lrn] [-e 'sedscript'] [file1 file2 ...]
-l means line buffered (i.e., flush output every line; note this is the BSD spelling, and GNU sed uses -u for unbuffered output), -r means use extended regexes, -n silences the default output, and -e supplies the sed script. There are other flags (such as -f, which reads the script from a file) but I never use them. Seek the man page for more information.
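As a quick sketch of why buffering matters: in a live pipeline, line buffering makes matches appear immediately instead of in bursts. The log path here is just a placeholder (and with GNU sed, use -u instead of -l):
% tail -f /var/log/messages | sed -l -e 's/error/ERROR/g'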

If you've ever seen the perlism s/foo/bar/, that came from sed. Sed is basically a string processing language. The language consists of a very small grammar, but is still very powerful. Here are some examples:

Simple text replacement.
% echo "Hello there foo" |  sed -e 's/foo/bar/'
Hello there bar
Grep-like behavior.
% sed -ne '/FreeBSD/p' /etc/motd
FreeBSD 6.2-PRERELEASE (FOO) #0: Sat Nov 11 00:12:52 EST 2006
Welcome to FreeBSD!
Grep '-v' like behavior
% echo "foo\nbar\nbaz\nfoobar" | sed -ne '/foo/!p'r
bar
baz

Backreferences

Backreferences let you reuse a captured group's matched value later in your expression. You group regexp patterns with parentheses, but in non-extended mode (i.e., without -r) you must escape your parentheses. Example:
% echo "hello world" | sed -e 's/\([a-z]*\) world/\1 sed/'
hello sed

# Now with -r (or -E on FreeBSD and OS X):
% echo "hello world" | sed -r -e 's/([a-z]*) world/\1 sed/'
hello sed
There is also a special "reference" available in substitutions (s///): the ampersand (&), which expands to the entire matched pattern:
% echo "hello world" | sed -e 's/.*/I say, "&"/'
I say, "hello world"

Syntax and Functions

Sed syntax is pretty straightforward. A general expression will look like this:

address[,address]function

That's it. Expressions are separated by newlines or semicolons.
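For example, two substitutions separated by a semicolon are applied in order to each line:
% echo "foo baz" | sed -e 's/foo/bar/; s/baz/qux/'
bar qux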

What is an address?

An address is a way to indicate a location in your data stream. An address can be any of:
  1. A line number (e.g., 1). The first line is '1'.
  2. A regexp match expression, such as /foo/.
  3. The literal '$', which means 'last line of file'
  4. Nothing at all, which means "every line in the file"
If you specify two addresses, it means "inclusive" of the first and last address, and includes all lines in between. After the last address is hit, the first address is searched for again further down the file. More on this later.
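For example, here are a line-number range and a regexp range that restarts:
# Print lines 3 through 5
% seq 10 | sed -ne '3,5p'
3
4
5

# The range restarts after each /END/ match
% printf 'BEGIN\na\nEND\nb\nBEGIN\nc\nEND\n' | sed -ne '/BEGIN/,/END/p'
BEGIN
a
END
BEGIN
c
END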

What are functions?

Functions are always one letter in sed. The useful ones (to me) are:
  • p (print)
  • s (substitute)
  • d (delete)
  • x (swap pattern and hold buffer)
  • h and H (copy and append to hold buffer)
  • ! (apply the next function against lines not matched)
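As a tiny taste of the hold buffer before the bigger examples below: print each line one cycle late (the last line stays stuck in the hold buffer):
% seq 3 | sed -ne 'x; 1!p'
1
2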

What can I do with sed?

Print the first line of input (same as head -n 1)
 sed -ne 1p 
Print everything *except* the first line
sed -ne '1!p' # print everything not on the first line
or
sed -e '1d'   # delete the first line
              # default action is to print, so everything else is printed
Print the first non-whitespace, non-comment line in httpd.conf
sed -ne '/^[^# ]/{p;q;}' httpd.conf
or
sed -ne '/^#/! { /^ *$/! { p;q; }; }' httpd.conf
Show only 'Received:' headers in a mail
% cat mymail \
  | sed -ne '/^[A-Za-z0-9]/ { x; /^Received: /{p;}; }; /^[A-Za-z0-9]/!H' 
Received: from localhost (localhost [127.0.0.1])
        by whitefox.csh.rit.edu (Postfix) with ESMTP id 731F81145C
        for <email-snipped>; Sat, 19 May 2007 01:19:30 -0400 (EDT)
Received: from whitefox.csh.rit.edu ([127.0.0.1])
        by localhost (whitefox.csh.rit.edu [127.0.0.1]) (amavisd-new, port 10024)
        with ESMTP id EURHKUeHSrao for <email-snipped>;
        Sat, 19 May 2007 01:19:16 -0400 (EDT)
... etc ...
  
Noisy code, eh? It gets the job done, though. There are two checks here. The first pattern checks whether the line starts with a letter or number; if so, it swaps to the "hold" buffer, checks whether that starts with 'Received:', and prints it if so. The side effect is that the current input line is now in the hold buffer and the old header "line" is in the pattern space, which we discard. After that, we check whether the line does *not* start with a letter or number, in which case we append the input (aka pattern space) to the hold space.

Basically, we build the current header (which can be multiple lines) in the hold buffer until the next header happens.
Output a file, but color matched patterns.
# The '^[' below are raw escape characters, entered at the shell 
# with CTRL+V and hitting escape.
% dmesg | sed -e 's/ath0/^[[33m&^[[0m/g'
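If you'd rather not type raw escape characters, GNU sed also accepts \xHH escapes (a GNU extension), which should be equivalent:
% dmesg | sed -e 's/ath0/\x1b[33m&\x1b[0m/g'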

Use sed to make a 'section grep' tool

You can use sed to "grep" paragraphs of data using similar techniques to the above mail header example. This script will let you 'grep' whole paragraphs (empty-line-delimited).
#!/bin/sh

if [ $# -eq 0 -o "${1:-}" = "-h" ] ; then
  echo "usage: $0 [-v] pattern [files]"
  exit 1
fi

func='!d'
if [ "$1" = "-v" ]; then
  # support '-v' like 'grep -v' 
  func='d'
  shift
fi

pattern="$1"
shift

sed -ure '/./{H;$!d;}; '"x;/${pattern}/$func;" "$@"
Call this 'sgrep.sh', put it somewhere, and make it executable. Let's use it to find anything with 'Delete' and 'cycle' in FreeBSD's sed manpage:
% man sed | ./sgrep.sh 'Delete .* cycle' 

     [2addr]d
             Delete the pattern space and start the next cycle.

     [2addr]D
             Delete the initial segment of the pattern space through the first
             newline character and start the next cycle.

Bonus notes

  • The 's' function has a 'p' flag, which prints only if a substitution was made.
    # this:
    sed -ne '/foo/ { s/foo/bar/; p }'
    
    # is the same as
    sed -ne 's/foo/bar/p'
    
  • You can insert data into the hold space (or the pattern space) if you really want:
    # Print 'Hello there' before the second line
    % echo "one\ntwo\nthree" | sed -e '2 { x; s/.*/Hello there/; p; x; }'
    one
    Hello there
    two
    three
    

Ok, now what?

Given your choice of filter tools, sed is an extremely useful one that often allows you to describe what you want to do with your text in a shorter, simpler form than awk or perl can offer you. If you wish to venture down the path of unix ninja, then sed should be on your list of commands to understand.

Want to really make your eyes hurt? Check out this calculator written entirely in sed.

xargs - Week of Unix Tools; Day 5

Intro

This week-of-unix-tools series is intended to be a high concentration of information with little fluff. For sanity's sake, I'll cover only the GNU versions of the tools.

xargs?

xargs reads input, builds a list of arguments from it, and runs a program with those arguments. GNU xargs is extremely useful. You can:
  • Use xargs as a 'threadpool' for parallel processing of input.
  • Take a list of things on input and run a command once per thing, with that thing as the argument.
  • Reformat word lists.

And then... ?

No "and then".

Usage

xargs [flags] [command [args]]
'command' is optional. If unspecified, it defaults to '/bin/echo' or something very similar.

Flags that matter

-t
trace. Show the commands being run before they are executed
-P{N}
Specify the number of processes ({N}) to fork for parallelism; e.g., -P5 for 5 processes.
-n{N}
Call the command with at most {N} arguments per invocation
-I{REPLSTR}
Specify that '{REPLSTR}' will be replaced with the argument
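A quick demonstration of -n and -I using the default echo command:
% seq 6 | xargs -n2
1 2
3 4
5 6

# With -I, each whole input line replaces the marker:
% printf 'one\ntwo\n' | xargs -I{} echo "got: {}"
got: one
got: two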

First example

Find some files and delete them.
% mkdir i; touch i/a.png i/b.png i/c.png i/otherfile
% find ./i/ -name '*.png' | xargs
./i/a.png ./i/b.png ./i/c.png

# Default action above was to echo, so it output as if you had typed:
# echo ./i/a.png ./i/b.png ./i/c.png

# Actually delete the png files:
% find ./i/ -name '*.png' | xargs rm
% ls ./i
otherfile

Advanced xargs

Sometimes the "thing" you want to do with this argument list is not just one command. You know how express what you need in shell, but you don't know how to get xargs to play nice? Not to fear. Simply use "sh" as your command, like this example:

Rename *.foo -> *.bar
% touch a.foo b.foo c.foo
% ls *.foo | xargs -n1 -I@ sh -c 'x="@"; mv $x ${x%.foo}.bar'
# ${x%.foo} means remove '.foo' from the end of $x. 

% ls *.bar
a.bar   b.bar   c.bar
This example shows using replacement with the character '@'. This means any instance of '@' in the command string will be replaced with the current value (a.foo, for example).

A better example is this, using sh's natural argument handling:
# A quick refresher for you:
% sh -c 'echo $0,$1,$2,$3' hello one two three
hello,one,two,three

# Now let's use it:
% ls *.foo | xargs -n1 sh -c 'mv $1 ${1%.foo}.bar' -
# the '-' is to pass '-' as $0 to sh, so we can use $* and $@ like normal if we wanted.
% ls
a.bar   b.bar   c.bar
This lets you use $@, $*, $1, etc., and is more natural to shell scripting.
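If you drop -n1, a single sh can handle the whole batch via "$@". Here's a sketch of the same rename (with the usual caveat that filenames containing whitespace will misbehave in an ls pipeline):
% ls *.foo | xargs sh -c 'for f in "$@"; do mv "$f" "${f%.foo}.bar"; done' -
% ls *.bar
a.bar   b.bar   c.bar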

Parallel ssh'ing

This example expects 'somehosts' to be a line-delimited list of hosts to connect to. Please note that I am using ssh keys (with ssh-agent), so ssh will not prompt for a password when logging in.
% echo kenya scorn | tr ' ' '\n' > /tmp/somehosts
% cat /tmp/somehosts \
  | xargs -P10 -I"HOST" -n1 ssh HOST uptime
 1:28AM  up 11 days,  3:44, 4 users, load averages: 0.13, 0.08, 0.02
 4:28AM  up 2 days, 18:54, 10 users, load averages: 0.10, 0.05, 0.05
There's a slight problem with the above invocation: you don't know which host is outputting which data! This becomes a much more obvious problem when you aren't running 'uptime' but something that outputs many lines.

An easy solution is to use sed (or awk, et al) to prefix everything with the hostname.
% cat /tmp/somehosts \
  | xargs -P10 -I"HOST" -n1 sh -c 'ssh HOST uptime | sed -e "s/^/HOST: /"'
kenya:  1:36AM  up 11 days,  3:52, 4 users, load averages: 0.07, 0.07, 0.03
scorn:  4:36AM  up 2 days, 19:03, 10 users, load averages: 0.01, 0.02, 0.02

awk - Week of Unix Tools; Day 3

Intro

This week-of-unix-tools series is intended to be a high concentration of information with little fluff. For sanity's sake, I'll cover only the GNU versions of the tools.

What is awk?

Hands-down, one of *the* most useful filter tools you'll find. Awk is a scripting language, but I find it is best used from the shell in oneliners.

Basic awk(1) usage

awk [-F<field_sep>] [awk_script]

Records and Fields

Awk has two data concepts that come from file input: Records and Fields.

A record is generally a whole line. The default input record separator (RS) is a newline. You can change this at any time.

A field is generally a word, split on any amount of whitespace (tabs or spaces). The default input field separator (FS) is a single space. FS can be a single character or a regular expression. If FS is a single space, it is treated magically as if you had specified [ \t]+.
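For example, changing the field separator from the command line with -F:
% echo "root:x:0:0:root:/root:/bin/sh" | awk -F: '{ print $1, $7 }'
root /bin/sh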

Field selection

Fields are accessed using the $ "operator". The following are valid:
$1, $2, $3 ...
(first, second and third fields)
$NF
The last field. Nothing special. NF is a variable holding the total number of fields in the current record, therefore $NF would be the last field
x=1; $(x + 3)
The 4th field. $(x + 3) == $(1 + 3) == $4
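For example:
% echo "alpha beta gamma delta" | awk '{ print $1, $NF, $(NF - 1) }'
alpha delta gamma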

Patterns and functions

Awk expressions come in two forms, a function or a pattern. I've never bothered writing functions.

Here's what a pattern looks like: [condition_expressions] { [action_expressions] }

Basically this equates to the following pseudocode: if (condition_expressions) { action_expressions }

If no action_expression is given, the default is 'print', which means 'print $0', which prints the current record. If no condition is given, the default is to execute the action for all records.
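Two degenerate examples make the defaults obvious:
# No condition: the action runs on every record (a clone of cat)
% seq 3 | awk '{ print }'
1
2
3

# No action: a true condition defaults to 'print'
% seq 3 | awk 'NR > 1'
2
3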

Magic patterns: BEGIN and END

BEGIN and END are magic "conditions". BEGIN is used to execute things before the first record has been parsed, and END is obviously to do things after the last record. These patterns cannot be combined with others.
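For example, summing a column:
% seq 5 | awk 'BEGIN { sum = 0 } { sum += $1 } END { print "total:", sum }'
total: 15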

Sample pattern expressions

length($0) > 72 (From FreeBSD's awk manpage)
Print lines longer than 72 characters
$1 ~ /foo/ { print $2 }
Print the 2nd field of all records where the first field matches /foo/
$5 > 0
Print all records where the 5th field is greater than 0 (complete with magical number conversion, when possible).
int($5) > 0
Same as above, but force $5 to int before comparing

Variables

Variables use the same syntax as in C. You do not declare them.

Examples:
$2 == "test" { x++ }; END { print x }
Total records where $2 == "test"
{ $1 = ""; print }
Delete the first field of every record, print the new record
{ $3 = "Hello"; print }
Should be obvious. This one is *super* useful; modifying fields inline is awesome

Arrays

Arrays are magical. You simply start using a variable as an array, and it becomes an array. Arrays are more like dictionaries/hash tables/associative arrays than "real" arrays. Quite useful.

Example: awk '{ a[$1]++ } END { for (i in a) { print i, a[i] } }'

String concatenation

String appending is simple.
x = "foo"; x = x"test";    # x == "footest"

print $1","$2" = "$3;      # if input was "hello there world"
                           # output will be: "hello,there = world"

Example: Open files by user

This example is basically "add things up by a given key, then print them at the end". I use it so often I'm probably just going to write an alias for it in my shell.
% fstat | sed -e 1d \
  | awk '{a[$1]++} END { for (i in a) { print i, a[i] } }' \
  | sort -nk2
smmsp 8
_dhcp 11
www 45
root 328
jls 482

Example: Datestamp input

This particular example is *extremely* useful for long-running programs that output logs or other data without any kind of timestamp. This requires GNU awk.
% (echo hello; sleep 5; echo world) \
  | awk '{ print strftime("%Y/%m/%d %H:%M:%S", systime()), $0 }'
2007/05/22 01:09:47 hello
2007/05/22 01:09:52 world

Example: show non-empty files

% ls -l | awk '$5 > 0'

Example: Date-scan your logs

Let's assume all log entries are syslog format:
May 22 01:12:02 nightfall pptp[860]: anon log ...
Show only log entries between May 10th and May 20th (inclusive)
% cat *.log | awk '$1 == "May" && ($2 >= 10 && $2 <= 20)'

Example: Scrape host(1) output

% host www.google.com | awk '/has address/ { print $4 }'

Example: Find an environment variable

I often login to my workstation remotely and want to use its ssh-agent. So, I need to find the most common value for SSH_AUTH_SOCK on all processes.
% ps aexww \
  | awk '{ for (i = 1; i <= NF; i++) { if ($i ~ /^SSH_AUTH_SOCK=/) { print $i } } }' \
  | sort | uniq -c
  24 SSH_AUTH_SOCK=/tmp/ssh-sc4iKR7ZIf/agent.721

Teeth that will bite you

Awk falls prey to the same problem C does: you can assign in conditions. Here's how you screw up:
% cat *.log | awk '$1 = "May"'
This will replace the first field with "May" for every record, and since the assignment evaluates to "May", a true (non-empty) value, it prints your modified $0 with $1 now set to "May". Ouch.
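The fix, of course, is the comparison operator:
% cat *.log | awk '$1 == "May"'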

Week of Unix Tools

Intro

This week-of-unix-tools series is intended to be a high concentration of information with little fluff. For sanity's sake, I'll cover only the GNU versions of the tools.

The goal of these articles is to give you many real examples of each tool being used.

Intended Audience

These are not "newbie guides" by any means. They're really geared towards people who have probably seen these tools before, can pick up new tools quickly by learning-through-example, and probably won't be reading the manpage any time soon.

Days of Unix Tools

  1. sed
  2. cut and paste
  3. awk
  4. data sources
  5. xargs

data sources - Week of Unix Tools; Day 4

Intro

This week-of-unix-tools series is intended to be a high concentration of information with little fluff. For sanity's sake, I'll cover only the GNU versions of the tools.

Data, where are you?

Data comes from lots of places. Loosely categorized, there are three sources:
  1. Files and devices
  2. Output of other tools
  3. The network (via other tools)

cat

Cat is short for 'concatenate'. It is mostly useful for doing a few things:
  • Cat lots of files together, e.g. 'cat *.c', for processing by another tool, or generally gluing data sets (from files) together.
  • Make a shell script more readable by making the input more obvious
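For example, gluing rotated logs into one data set for a downstream tool (the file names here are just placeholders):
% cat access.log.0 access.log | wc -l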

nc

Netcat. It gives you the ability to talk TCP and UDP from the shell. You send data via standard input and receive data on standard output. Simple.
tcp client (connect to google.com port 80)
nc google.com 80
tcp server (listen on port 8080)
nc -l 8080
udp client (connect to ns1.slashdot.org port 53)
nc -u ns1.slashdot.org 53
udp server (listen on port 5353)
nc -l -u 5353
Examples:
Basic HTTP request
% echo "GET / HTTP/1.0\n" | nc google.com 80 | head -1
HTTP/1.0 200 OK

openssl

openssl is a command that almost any unix-like system will have installed. The command itself can do many, many things, but for this article I'll only cover the s_client subcommand.

'openssl s_client' is essentially 'netcat + ssl'. This tool is extremely useful for debugging text-based protocols behind SSL such as ssl'd nntp, imaps, and https.

Example:
Open an https connection to addons.mozilla.org
% echo "GET / HTTP/1.0\r\n\r\n" \
| openssl s_client -quiet -connect addons.mozilla.org:443 \
| col \
| sed -e '/^$/q'
depth=3 /C=BE/O=GlobalSign nv-sa/OU=Root CA/CN=GlobalSign Root CA
verify error:num=19:self signed certificate in certificate chain
verify return:0
HTTP/1.1 302 Found
Date: Fri, 25 May 2007 10:07:25 GMT
Server: Apache/2.0.52 (Red Hat)
Location: http://www.mozilla.com/
Content-Length: 293
Keep-Alive: timeout=300, max=1000
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
* The 'col' command will strip the \r (carriage return) characters from the http response, allowing sed's /^$/ to match an empty line (end of headers).

GET/curl/wget/fetch

You can query webservers (http) with any number of tools and you'll get the raw source or data for any page you query. This is really useful.
  • GET, POST, lwp-request, et al. Comes with libwww-perl
  • curl
  • wget
  • fetch (FreeBSD)
Most of the time I need to fetch pages to stdout, I use GET, because it's less typing. Here are some examples of the above commands:
Fetch / from www.w3schools.com and output page to stdout
  • GET http://www.w3schools.com/
  • wget -O - -q http://www.w3schools.com/
  • fetch -q -o - http://www.w3schools.com/
  • curl http://www.w3schools.com/

w3m/lynx

But what if you don't want the raw html from a webpage? You can have w3m and lynx do some basic rendering for you, also to stdout. I recommend w3m instead of lynx, but use whatever.
  • w3m -dump http://www.google.com/
  • lynx -dump http://www.google.com/

ssh

ssh can be a data source too. Run a command on 1000 machines and process the output locally, for fun and profit.

Login to N systems and get uptime. Prefix output with the hostname
% echo "fury\ntempest" \
| xargs -n1 -I@ sh -c 'ssh @ "uptime" | sed -e "s/^/@/"'
fury  6:18am  up  2:25,  1 user,  load average: 0.06, 0.04, 0.04
tempest 06:18:00 up  9:01,  2 users,  load average: 0.12, 0.09, 0.09
 
Combining xargs and ssh gives you a powerful ability to execute commands on multiple machines easily, even in parallel.

cut and paste - Week of Unix Tools; Day 2

Intro

This week-of-unix-tools series is intended to be a high concentration of information with little fluff. For sanity's sake, I'll cover only the GNU versions of the tools.

What is cut?

cut is a tool that lets you 'cut' out pieces of data. You can cut by field, column (character), and byte number.

Basic cut(1) usage

cut [-d delim -f range] [-c range] [-b range]
There are three different ways to cut: by bytes, by characters, and by fields.

A range is one or more sequences, separated by commas. A sequence can be any of the following:
  • N = Select only the Nth piece (on each line)
  • N- = Select the Nth piece through end-of-line
  • N-M = Select the Nth through Mth pieces
  • -M = Select the first through Mth pieces
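Sequences can be combined in one range, for example:
% echo "123456789" | cut -c 2,4-6
2456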
field cutting - get me the 1st and 3rd fields delimited by comma
% echo "one,two,three,four" | cut -d"," -f 1,3
one,three

# output space-delimited from comma-separated input (GNU only)
% echo "one,two,three,four" | cut -d"," -f 1,3 --output-delimiter=" "
one three
 
character cutting - output everything except the first character of every line
% seq 15 19 | cut -c 2-
5
6
7
8
9
 

When to use cut

Cut provides features easily done in sed and awk. Why would you use it instead of sed or awk?

Simplicity of statement.

Example: Let's print the 1st and 7th fields from /etc/passwd:
% grep '^root' /etc/passwd | cut -d: -f1,7
root:/bin/sh
In the above invocation, it is very clear to the reader that you want the 1st and 7th fields. Yes, this would also be simple in awk, but if you don't know awk syntax, awk might be harder for you to write and/or read.

Use the tool that best lets you describe what you want concisely and without obfuscation. Cut often lets you do that. However, there are a few cut-like things you can't do in cut because of the way it determines fields.

When not to use cut

If your input has multiple instances of the delimiter in a row, cut won't behave the way you might expect: delimiters in cut are single characters, and multiple delimiters in a row are not collapsed the way awk collapses whitespace. Consider this simple example:
% echo "one    two     three" | cut -d' ' -f 2

% echo "one    two     three" | awk '{print $2}'
two
According to cut, field 2 in the above example is the empty string that occurs between the first and second space (delimiter). Not what we wanted. Keep this behavior in mind.
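If you still want cut for such input, a common workaround is to squeeze repeated delimiters first with tr -s:
% echo "one    two     three" | tr -s ' ' | cut -d' ' -f 2
two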

What is paste?

Think of it as a horizontal version of cat(1). It joins lines from different file inputs, separated by delimiters.

Basic paste(1) usage

 paste [-s] [-d delimiter_list] [input1 input2 input3 ...] 
Paste reads a line from each input, in order, and prints them without newlines. After one line has been read from each input, a newline is printed. Optional '-d' will let you specify a list of delimiters that are to be used to separate each input file. The default is separation with tab characters.

'-s' is a neat little flag that suppresses newline output entirely. The effect is that all lines of all inputs are concatenated onto one single line of output. Very similar to "tr '\n' '*delimiter*'", except there's no trailing delimiter. Useful!

The same input file can be specified multiple times, which gives you some neat effects.

Why is it useful?

I rarely use paste, but what it does is quite useful.
Join input lines in triplets
% seq 9 | paste - - -
1       2       3
4       5       6
7       8       9
Print line numbers
% FILE="/etc/hosts"
% seq `wc -l < $FILE` | paste - $FILE | head -3
1       # /etc/hosts
2       #
3       # This file describes a number of hostname-to-address
List of users on a system
% cut -d: -f1 /etc/passwd | paste -d, -s -
root,bin,daemon,adm,lp,sync,shutdown,halt,mail,news,uucp,operator

Conclusion

Cut and paste are somewhat niche tools, but keep them in your toolbelt because of the functionality they provide. Sometimes it's much simpler to use cut or paste instead of another tool.