One of these pieces of data is "what sites are linking to me," which Google Webmaster Tools gives you. It offers this data as a CSV download for offline use. I downloaded it and wanted to see who was linking to me, sorted by source URL:
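# Move the trailing date to the front: a,b,date -> date,b,a
sed -re 's@([^,]+),([^,]+),(.*$)@\3,\2,\1@' \
| awk '
# Zero-pad single-digit days ("5," -> "05,") so the dates line up
$2 ~ /^[0-9],$/ { $2 = "0"$2 }
{
# a[3] and a[4] are the two URL columns; b[1] is the year with the URLs cut off
split($0, a, ",");
split($3, b, ",");
$3 = b[1]; ref=a[3]; url=a[4];
# Date first, then the source URL padded to 130 columns, then the linked URL
printf("%s %-130s %s\n", $1" "$2" "$3, ref, url)
}' \
| sort -k4 | less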
Yes, the above code could probably be better, but I'm not interested in
elegance: I want data. This gives me a good overview of who is linking to me
and which specific URL they are linking to.
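
To make the transformation concrete, here is a rough sketch of the same pipeline wrapped in a function and fed two made-up rows. Everything specific here is an assumption for illustration: the function name links_by_source is hypothetical, the example.com/.org/.net URLs are invented, and I'm assuming each row is "linked URL, source URL, date" with an unquoted date like "Oct 5, 2008". It also uses sed -E rather than -r (both GNU and BSD sed accept it) and pads the source column to 40 characters instead of 130 so the output fits on a screen.

```shell
# Hypothetical wrapper around the pipeline above; reads CSV rows on stdin.
links_by_source() {
  sed -E 's@([^,]+),([^,]+),(.*$)@\3,\2,\1@' \
  | awk '
    # Zero-pad single-digit days so "5," becomes "05,"
    $2 ~ /^[0-9],$/ { $2 = "0"$2 }
    {
      split($0, a, ",");
      split($3, b, ",");
      $3 = b[1]; ref = a[3]; url = a[4];
      # Narrower padding than the original, purely for readability
      printf("%s %-40s %s\n", $1" "$2" "$3, ref, url)
    }' \
  | sort -k4
}

# Two invented rows: linked URL, source URL, date (assumed layout).
printf '%s\n' \
  'http://example.com/b,http://example.org/z,Oct 5, 2008' \
  'http://example.com/a,http://example.net/y,Oct 12, 2008' \
  | links_by_source
```

With these two rows, the example.net line should come out first because the sort key starts at the fourth whitespace-separated field, which is the source URL, and the "Oct 5," date is printed as "Oct 05,".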