Command line xpath tool

I've recently found myself needing to write one-off scrapers to pull information from various services. However, 'w3m -dump' doesn't always suffice when I need to parse html and maybe throw it at awk.

I know XSLT and XPath, but I don't know of a good xpath tool for the command line. XML::XPath in perl comes with one, but it's not up to my demands.

So, like most problems I come across, I solved it myself. A simple query like "show all external links" is very cumbersome to do in a one-off manner unless you've got the right tools. Here's how you'd basically do it with xpath:

//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href
It's a bit complicated, but whatever, I can express what I want, right? Toss this at my new xpathtool project, and I can find out where the urls are going:
% GET www.semicomplete.com \
| ./xpathtool.sh --ihtml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href'

http://www.viewvc.org/
http://www.oreillynet.com/sysadmin/
... etc ...
You can specify that the input is html with '--ihtml'. The output can be chosen as text, html, or xml. If you specify html or xml output, it will use <xsl:copy-of> instead of <xsl:value-of> for each node matched.
# output in xml, every anchor tag not obviously pointing locally
GET www.semicomplete.com \
| ./xpathtool.sh --ihtml --oxml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]' \
| head -4

<?xml version="1.0"?>
<toplevel>
  <a href="http://www.viewvc.org/">ViewVC</a>
  <a href="http://www.oreillynet.com/sysadmin/">http://www.oreillynet.com/sysadmin/</a>

  ... etc ...
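
For the curious, the stylesheet xpathtool generates behind the scenes presumably looks something like the sketch below. The exact structure and the <toplevel> wrapper handling are my guesses, not the tool's actual internals; the point is that switching from <xsl:value-of> to <xsl:copy-of> is all it takes to go from text output to xml/html output.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- method="text" for the default output; switch to method="xml" for xml output -->
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//a[not(contains(@href,'semicomplete.com')) and starts-with(@href, 'http://')]">
      <!-- text output: the string value of each matched node, one per line -->
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
      <!-- xml/html output would use <xsl:copy-of select="."/> here instead,
           with the results wrapped in a <toplevel> element -->
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Presumably '--ihtml' just tells the tool to parse the input as html (rather than strict xml) before applying a stylesheet like this one.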
Interested? Try xpathtool.