Migration to Google Code hosting

I've been maintaining my own repositories for years, and I've finally outgrown doing it.

My first major repository move was to merge all my CVS and Subversion repositories into a single Subversion repository. This made me happy for a while, but from time to time the machine hosting the repository would go down, and I'd be without Subversion access until it came back. Additionally, that machine grants me only a small quota (500 MB), and my Subversion repository was occupying 10% of the space. Lastly, I couldn't be bothered to set up webdav+svn, so I couldn't grant arbitrary users (like you) proper read (and perhaps write) access.

To solve all of these problems, in part or in full, I created a new project on Google Code called 'semicomplete' for my repository. All of my projects will now live there.

I used svnsync to upload my local repository, which preserved the full change history. The upload took 5 hours but was otherwise painless.
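
An svnsync mirror like this generally comes down to two commands: initialize the empty destination, then replay every revision into it. The sketch below is an illustration under assumptions, not the exact commands used here: the local path in SRC is hypothetical, the /svn suffix on DEST is assumed, and the run and mirror helpers are made up for this example. It also assumes the destination is still at revision 0 and permits revision property changes, which svnsync requires.

```shell
#!/bin/sh
# Hypothetical locations; substitute your own.
SRC="file:///home/me/svnroot"                    # assumed local repository path
DEST="https://semicomplete.googlecode.com/svn"   # assumed destination URL

# Print each command; with RUN=1 in the environment, execute it as well.
run() {
  echo "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

mirror() {
  # Mark the (empty) destination as a mirror of the source.
  run svnsync init "$DEST" "$SRC"
  # Copy every revision across, preserving history.
  run svnsync sync "$DEST"
}
```

Calling mirror on its own only prints the two commands; setting RUN=1 first makes it execute them. The dry-run default is a deliberate choice, since a sync that takes hours is not something to start by accident.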

New repository: http://semicomplete.googlecode.com/

As a side bonus, Google Code Hosting allows you to publish "downloads", which means all of my releases can be put here, saving me 24 megs of used quota on the old machine. Further bonuses include an issue tracking system (so you and I can file bugs that won't get lost) and a project wiki. I don't know if I'll use the wiki yet.

Google webmaster tools tip

Google knows a lot about the web. Webmaster Tools lets me find out how much Google knows about my site, in addition to some other cool features.

One piece of data Google Webmaster Tools gives you is "what sites are linking to me". It offers this data in CSV format for offline consumption. I downloaded it and wanted to see who was linking to me, sorted by source URL:

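# Move the third CSV column (everything after the second comma) to the front.
sed -re 's@([^,]+),([^,]+),(.*$)@\3,\2,\1@' \
| awk '
  # Zero-pad a single-digit second field such as "5," to two digits.
  $2 ~ /^[0-9],$/ { $2 = "0"$2 }
  {
    split($0, a, ",");              # comma-separated pieces of the whole line
    split($3, b, ",");
    $3 = b[1]; ref=a[3]; url=a[4];  # keep only the part of $3 before its first comma
    printf("%s %-130s %s\n", $1" "$2" "$3, ref, url)
  }' \
| sort -k4 | less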
Yes, the above code could probably be better, but I'm not interested in elegance: I want data. This gives me a good overview of who is linking to me and which specific URL they are linking to.

One anti-spam effort too easily defeated.

I often see people write their email addresses as "foo at bar dot org" in the hope of keeping spammers from scraping them. Heck, mail archive systems often have (and are deployed with) options to obfuscate email addresses systematically, using the same pattern: foo at bar dot com.

All it does is hurt usability.

Googling for "* at * dot *" clearly shows lots of matches. It also matches all of the following variants, because Google searches ignore brackets and similar punctuation in words:

  • foo at bar dot com
  • foo [at] bar [dot] com
  • foo (at) bar (dot) com
  • ... etc ...
Query, scrape, replace 'at' and 'dot' as desired. I now have 54 million email addresses. What now?
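
To show how little work the "replace" step is, here is a minimal sketch. The function name deobfuscate is made up for this example, and the patterns are deliberately naive: they only handle the three variants listed above, and they will also mangle ordinary phrases such as "meet at noon".

```shell
#!/bin/sh
# Rewrite "user at host dot tld" style text back into user@host.tld.
# The optional [ ] or ( ) around "at" and "dot" covers the bracketed variants.
# Hypothetical helper: reads text on stdin, writes rewritten text to stdout.
deobfuscate() {
  sed -E \
    -e 's/ +[[(]?at[])]? +/@/g' \
    -e 's/ +[[(]?dot[])]? +/./g'
}
```

For example, piping the line "foo [at] bar [dot] com" through deobfuscate yields foo@bar.com. Two substitutions are all it takes, which is the point: the obfuscation costs a scraper almost nothing.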

It seems this effort only lets people fool themselves while impeding usability. It certainly won't protect you from spam. Why is this method used?