Non-compiler caching

I set up ccache again (trivial) to help me with building FreeBSD repeatedly. I noticed that much of the time spent in the kernel build process was in building dependency lists using awk.

Why couldn't we apply the ccache idea to everything else? If the same input always produces the same output, then we can cache the output of any command that is expensive to run.


Above is a hack that runs like ccache, but tracks all files created by the process (and its subprocesses). Here's a sample run: counting the number of lines in a file with awk and writing the result (from within awk) to another file.

% /usr/bin/time ./ awk '{x++} END { print "Total records: " x > "/tmp/hello"}' bigdata
Running original...
        1.60 real         0.05 user         0.74 sys
% cat /tmp/hello
Total records: 1000000

# Remove the old output file
% rm /tmp/hello

# Rerun it again, unmodified, and it will use the cached output.
% /usr/bin/time ./ awk '{x++} END { print "Total records: " x > "/tmp/hello"}' bigdata
Using cache...
        0.06 real         0.00 user         0.06 sys
% cat /tmp/hello
Total records: 1000000
It doesn't work with everything just yet, but the problems seem to be with truss's behavior rather than the script's: sometimes truss hangs, or it doesn't follow a fork like it should. Beyond the truss problems, the script doesn't track file renames, and it doesn't yet know how to figure out what the input files for each command are. Ideally it would checksum those inputs and use that as the cache key; currently it only checksums the command-line arguments, not the external files being read (such as 'bigdata').
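A hypothetical sketch of that missing piece: derive the cache key from both the command line and the contents of any arguments that happen to be existing files, so that editing 'bigdata' invalidates the cache. (Files opened via redirection inside the command, like /tmp/hello above, still wouldn't be caught this way; those need the truss-style tracking.)

```shell
#!/bin/sh
# Build a cache key covering the command line plus the contents of any
# argument that names an existing file.
cache_key() {
    {
        # The command line itself...
        printf '%s\n' "$@"
        # ...plus a checksum of every argument that is a regular file,
        # so changed input data produces a different key.
        for arg in "$@"; do
            if [ -f "$arg" ]; then
                cksum < "$arg"
            fi
        done
    } | cksum | awk '{print $1}'
}
```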

I initially started without using truss, but awk doesn't call open(2) via libc when it opens files, for some reason, and I can't figure out a clean way to capture specific function calls from a process (even a child process).

DTrace would be sexy here, but it is unavailable in the main FreeBSD trunk.

The speedup is pretty obvious for CPU-intensive things, but the real test will be to see how it performs once it's working properly and hooked into the FreeBSD kernel build.

Google webmaster tools tip

Google knows a lot about the web. Webmaster Tools lets me find out how much Google knows about my site, in addition to some other cool features.

One of these pieces of data is "what sites are linking to me." It's offered in CSV format for offline consumption. I downloaded it and wanted to see who was linking to me, sorted by source URL:

sed -re 's@^([^,]+),([^,]+),(.*$)@\3,\2,\1@' \
| awk '
  $2 ~ /^[0-9],$/ { $2 = "0"$2 }
  {
    split($0, a, ",");
    split($3, b, ",");
    $3 = b[1]; ref=a[3]; url=a[4];
    printf("%s %-130s %s\n", $1" "$2" "$3, ref, url)
  }' \
| sort | sort -k4 | less
Yes, the above code could probably be better, but I'm not interested in elegance: I want data. This lets me get a good overview of who is linking to me and to what specific url they are linking.
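For reference, the sed step just reverses the first three comma-separated fields. A quick illustration with a made-up line (the actual column layout of the CSV isn't shown here):

```shell
#!/bin/sh
# Swap the first three comma-separated fields: a,b,c -> c,b,a
echo 'example.com/page,42,2007-01-01' \
  | sed -re 's@^([^,]+),([^,]+),(.*$)@\3,\2,\1@'
# prints: 2007-01-01,42,example.com/page
```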

Week of unix tools; day 3: awk!

Day 3 is ready for viewing. It's about awk.

This article has lots of usage examples for the many ways you can use awk to do hard work for you. Check out the article here:

day 3: awk