Command line xpath tool

I've recently found myself writing one-off scrapers to pull information from various services. However, 'w3m -dump' doesn't always suffice when I need to parse HTML and maybe throw the result at awk.

I know XSLT and XPath, but I don't know of a good XPath tool for the command line. XML::XPath in Perl comes with one, but it's not up to my demands.

So, like most problems I come across, I solved it myself. A simple query like "show all external links" is very cumbersome to do in a one-off manner unless you've got the right tools. Here's basically how you do it with XPath:

//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href
It's a bit complicated, but whatever, I can express what I want, right? Toss this at my new xpathtool project, and I can find out where the links are going:
% GET www.semicomplete.com \
| ./xpathtool.sh --ihtml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href'

http://www.viewvc.org/
http://www.oreillynet.com/sysadmin/
... etc ...
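If the predicate in the query above is hard to read, here is a rough Python sketch of the same filter using only the standard library. It is not how xpathtool works (that is XSLT); it just spells out what `starts-with(...)` and `not(contains(...))` select. The names `ExternalLinkCollector` and `external_links`, and the sample page, are made up for illustration.

```python
from html.parser import HTMLParser


class ExternalLinkCollector(HTMLParser):
    """Collect href values matching the XPath predicate from the post."""

    def __init__(self, local_domain):
        super().__init__()
        self.local_domain = local_domain
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            # No usable href: the XPath query selects nothing here either.
            return
        # starts-with(@href, "http://") and not(contains(@href, local_domain))
        if href.startswith("http://") and self.local_domain not in href:
            self.links.append(href)


def external_links(html, local_domain="semicomplete.com"):
    """Return external http:// links in document order."""
    parser = ExternalLinkCollector(local_domain)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, not taken from the real site.
SAMPLE = """
<html><body>
  <a href="/projects/">Projects</a>
  <a href="http://www.semicomplete.com/blog/">Blog</a>
  <a href="http://www.viewvc.org/">ViewVC</a>
</body></html>
"""

if __name__ == "__main__":
    for url in external_links(SAMPLE):
        print(url)
```

Like the XPath version, this treats anything that doesn't start with "http://" as local, so https links and relative links are both skipped.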
You can specify that the input is HTML with '--ihtml'. The output can be text, HTML, or XML. If you ask for HTML or XML output, the tool uses <xsl:copy-of> instead of <xsl:value-of> for each matched node.
# output in xml, every anchor tag not obviously pointing locally
GET www.semicomplete.com \
| ./xpathtool.sh --ihtml --oxml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]' \
| head -4

<?xml version="1.0"?>
<toplevel>
  <a href="http://www.viewvc.org/">ViewVC</a>
  <a href="http://www.oreillynet.com/sysadmin/">http://www.oreillynet.com/sysadmin/</a>

  ... etc ...
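The difference between text output and XML output can be approximated with Python's xml.etree.ElementTree: the string value of a node (roughly what <xsl:value-of> emits) versus a serialized copy of the node (roughly what <xsl:copy-of> emits). This is only an analogy and assumes well-formed XML input; the helper names `value_of` and `copy_of` are hypothetical, and the sample reuses the two anchors shown above.

```python
import copy
import xml.etree.ElementTree as ET

# Sample document built from the two anchors in the output above.
SAMPLE = """<toplevel>
  <a href="http://www.viewvc.org/">ViewVC</a>
  <a href="http://www.oreillynet.com/sysadmin/">http://www.oreillynet.com/sysadmin/</a>
</toplevel>"""


def value_of(node):
    """Roughly <xsl:value-of>: the text of the node and its descendants."""
    return "".join(node.itertext())


def copy_of(node):
    """Roughly <xsl:copy-of>: the node itself, markup included."""
    clone = copy.copy(node)
    # Drop whitespace that follows the element; it is not part of the node.
    clone.tail = None
    return ET.tostring(clone, encoding="unicode")


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for anchor in root.findall(".//a"):
        print("text:", value_of(anchor))
        print("xml: ", copy_of(anchor))
```

For a plain anchor the text form loses the href entirely, which is why the text-mode query selects /@href while the XML-mode query selects the whole element.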
Interested? Try xpathtool.

1 response to 'Command line xpath tool'


Stephen Crim wrote at Mon May 14 13:53:10 2007...
that's fantastic - i'd actually taken to loading up firebug w/ jquery and wading through the DOM with it.

