Distributed xargs

I like xargs. However, xargs becomes less useful when you want to run many CPU-intensive tasks in parallel with more parallelism than you have CPU cores local to your machine.

Enter dxargs. For now, dxargs is a simple Python script that distributes tasks the way xargs does, except that it runs them on remote hosts over ssh. Basically, it's a thread pool of ssh sessions: an idle worker asks for something to do, which gets you the maximum throughput possible, since your faster servers complete tasks sooner and are therefore handed more of them than your slower servers.
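
The core of that idea fits in a short Python sketch. To be clear, this is not dxargs itself: the host list and the remote command (hostname, as in the example below) are placeholders, and error handling is omitted.

import subprocess
import sys
import threading
from queue import Queue

tasks = Queue()
print_lock = threading.Lock()  # keep task output from interleaving midline

def worker(host):
    # Each worker pulls the next input off the shared queue; faster hosts
    # come back sooner and therefore end up doing more of the work.
    while True:
        item = tasks.get()
        if item is None:       # sentinel: no work left
            return
        out = subprocess.run(["ssh", host, "hostname"],  # placeholder command
                             capture_output=True, text=True).stdout
        with print_lock:
            sys.stdout.write(out)

hosts = ["snack", "scorn"]     # placeholder host list
threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
for t in threads:
    t.start()
for line in sys.stdin:         # one task per input line, as with xargs -n1
    tasks.put(line.strip())
for _ in hosts:
    tasks.put(None)            # one sentinel per worker
for t in threads:
    t.join()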

As an example, let's run 'hostname' in parallel across a few machines for 100 total calls.

% seq 100 | ./dxargs.py -P0 -n1 --hosts "snack scorn" hostname | sort | uniq -c
    14 scorn.csh.rit.edu
    86 snack.home

# Now use per-input-set output collating:
% seq 100 | ./dxargs.py -P0 -n1 --hosts "snack scorn" --output_dir=/tmp/t 'uname -a'
% ls /tmp/t | tail -5
535.95.0.snack.1191918835
535.96.0.snack.1191918835
535.97.0.snack.1191918835
535.98.0.snack.1191918835
535.99.0.snack.1191918835
% cat /tmp/t/535.99.0.snack.1191918835
Linux snack.home 2.6.20-15-generic #2 SMP Sun Apr 15 06:17:24 UTC 2007 x86_64 GNU/Linux

Design requirements:
  • Argument input must work the same way it does for xargs (-n<num>, etc.) and come from stdin
  • Don't violate POLA where unnecessary - same flags as xargs.
Basically, I want dxargs to be a drop-in replacement for xargs with respect to compatibility. I may intentionally break compatibility later where it makes sense, however, and this first version shouldn't be considered POLA-compliant yet.

Neat features so far:

  • Uses OpenSSH Protocol 2's "Control" sockets (-M and -S flags) to keep the session handshaking down to once per host (see the sketch after this list)
  • Each worker competes for work with the goal of having zero idle workers.
  • Collatable output to a specified directory by input set, pid, number, host, and time
  • '0' (aka -P0) for parallelism means parallelize to the same size as the host list
  • Ability to specify multiplicity by machine with notation like 'snack*4' to indicate snack can run 4 tasks in parallel
  • 'stdout' writing is wrapped with a mutex, so tasks can't interfere with output midline (I see this often with xargs)
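
The control-socket trick from the first bullet is what keeps per-task ssh cheap. Here's a minimal sketch of it, assuming a reasonably recent OpenSSH; the socket path and host are placeholders, and this isn't necessarily how dxargs wires it up internally.

import subprocess

host = "snack"                      # placeholder host
ctl = "/tmp/dxargs-%s.sock" % host  # placeholder control socket path

# Start one master connection (-M) bound to a control socket (-S).
# The expensive handshake happens exactly once per host.
subprocess.run(["ssh", "-M", "-S", ctl, "-N", "-f", host], check=True)

# Every task then multiplexes over the existing socket: no new handshake.
for _ in range(3):
    subprocess.run(["ssh", "-S", ctl, host, "uname -a"], check=True)

# Shut the master down once this host's workers are finished.
subprocess.run(["ssh", "-S", ctl, "-O", "exit", host], check=True)
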
Desired features (not yet implemented):
  • Retrying of input sets when workers malfunction
  • Good handling of ssh problems (worker connect timeouts, etc.)
  • More xargs and xapply behaviors

Download dxargs.py

xargs tip

Under normal circumstances, I use this kind of xargs invocation:
xargs -n1 -I@ sh -c 'wget http://@/ | sed -e "s/^/@ /"'
Each '@' is replaced by the one argument passed to that invocation. This sucks if the argument contains awkward characters such as quotes. Instead, use sh's own argument processing.

# Failed invocation due to quotes:
easel(~) % echo "one\n'\"two'\nthree" | xargs -n1 -I@ sh -c 'echo "@"'
one
sh: -c: line 0: unexpected EOF while looking for matching `"'
sh: -c: line 1: syntax error: unexpected end of file
three

# Successful invocation:
% echo "one\n'\"two'\nthree" | xargs -n1 sh -c 'echo "$1"' - 
one
"two
three
The trailing - is required because sh sets $0, $1, etc., from those arguments. For example:
% sh -c 'echo "$0, $1"' foo bar
foo, bar
In an effort to use the shell "properly," I use $1 and pass - as the $0 argument. This lets you do neater things than the -I flag allows, such as handling multiple arguments in a given invocation.

% echo "one\ntwo\nthree\nfour" | \
  xargs -n2 sh -c 'echo $1 and $2' -
one and two
three and four
Super useful.

Week of unix tools; day 5: xargs

Day 5 is online. It's about how to rock out with your friend, xargs(1).

day 5; xargs