jquerycmd+xpathtool == direction scraping on google

Show the first 3 steps that google maps tells you to take:
./getpath.sh "atlanta to nyc" | head -3
Head southeast on Trinity Ave SW toward Washington St SW        0.2mi
Slight left at Memorial Dr SW   0.3mi
Turn left at Martin St SE       361ft
Pipe that to lpr and you've got printed directions in under 5 seconds.
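
Concretely, that looks something like this (a minimal sketch, assuming lpr is configured with a default printer):

./getpath.sh "atlanta to nyc" | lpr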

Why not just do this with plain page scraping? Because google maps builds the directions with lots of javascript. Firefox (Gecko, really) already parses and runs that javascript, so why bother reinventing the wheel? Let's use the wheel that already works.

Download jquery-20070623.1828.tar.gz. The jquerycmd download comes with the xul app, 'jquerycmd.sh', and 'getpath.sh'.
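
Getting it running should look roughly like this (a sketch; the directory name the tarball extracts to is an assumption, so check the archive):

tar -zxf jquery-20070623.1828.tar.gz
cd jquerycmd    # hypothetical directory name; check where the tarball extracts
./getpath.sh "atlanta to nyc" | head -3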

For the lazy who just want to see the scripts: look at 'jquerycmd.sh' and 'getpath.sh' in the download.

Command line xpath tool

I've recently found myself needing to write one-off scrapers to pull information from various services. However, 'w3m -dump' doesn't always suffice when I need to parse html and maybe throw it at awk.

I know XSLT and XPath, but I don't know of a good xpath tool for the command line. XML::XPath in perl comes with one, but it doesn't meet my demands.
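
For reference, the XML::XPath distribution ships an 'xpath' script; the invocation is roughly the following (exact flags vary by version), and it expects well-formed xml, which real-world html rarely is:

xpath -q -e '//a/@href' page.html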

So, like most problems I come across, I solved it myself. A simple query like "show all external links" is very cumbersome to do in a one-off manner unless you've got the right tools. Here's how you'd basically express it in xpath:

//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href
It's a bit complicated, but whatever, I can express what I want, right? Toss this at my new xpathtool project, and I can find out where the urls are going:
% GET www.semicomplete.com \
| ./xpathtool.sh --ihtml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]'

http://www.viewvc.org/
http://www.oreillynet.com/sysadmin/
... etc ...
You can specify that the input is html with '--ihtml'. The output format can be text, html, or xml. If you pick html or xml output, it will use <xsl:copy-of> instead of <xsl:value-of> for each node matched.
# output in xml, every anchor tag not obviously pointing locally
GET www.semicomplete.com \
| ./xpathtool.sh --ihtml --oxml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]' \
| head -4

<?xml version="1.0"?>
<toplevel>
  <a href="http://www.viewvc.org/">ViewVC</a>
  <a href="http://www.oreillynet.com/sysadmin/">http://www.oreillynet.com/sysadmin/</a>

  ... etc ...
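
Since the whole point is piping this at standard unix tools, here's the kind of one-off it enables: a sketch that tallies which hosts my external links point at. This assumes text-mode output of an '@href' selection prints one attribute value per line, as the first example suggests:

GET www.semicomplete.com \
| ./xpathtool.sh --ihtml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href' \
| awk -F/ '{ count[$3]++ } END { for (host in count) print count[host], host }' \
| sort -rn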
Interested? Try xpathtool.