I've recently found myself needing to write one-off scrapers to pull information
from various services. However, using 'w3m -dump' doesn't always suffice when I
need to parse HTML and maybe throw it at awk.
I know XSLT and XPath, but I don't know of a good xpath tool for the
command line. XML::XPath in Perl comes with one, but it's not up to my demands.
So, like most problems I come across, I solved it myself. A simple query like
"Show all external links" is very cumbersome to do in a one-off manner unless
you've got the right tools. Here's basically how you do it with xpath:
//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href
It's a bit complicated, but whatever, I can express what I want, right? Toss
this along with my new xpathtool project, and I can find out where the urls are
going:
% GET www.semicomplete.com \
| ./xpathtool.sh --ihtml \
'//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href'
http://www.viewvc.org/
http://www.oreillynet.com/sysadmin/
... etc ...
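If the two predicates in that expression look opaque, the same filtering logic
can be mimicked on a plain list of href values with grep. This is only a sketch:
the URLs below are made-up samples, and it skips the HTML parsing that xpathtool
does for you.

```shell
# Hypothetical href values, one per line, standing in for what //a/@href yields.
hrefs='http://www.semicomplete.com/blog/
/projects/xpathtool
http://www.viewvc.org/
mailto:someone@example.com
http://www.oreillynet.com/sysadmin/'

# starts-with(@href, "http://")            -> keep lines beginning with http://
# not(contains(@href, "semicomplete.com")) -> drop lines containing that string
external=$(printf '%s\n' "$hrefs" | grep '^http://' | grep -v -F 'semicomplete.com')

printf '%s\n' "$external"
```

Only the two off-site http:// links survive both filters. The grep version works
here because the hrefs are already one per line; pulling them out of real HTML is
the part that needs xpath.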
You can specify that the input is html with '--ihtml'. The output can be chosen
as text, html, or xml (for example, '--oxml'). With html or xml output, it uses
<xsl:copy-of> instead of <xsl:value-of> for each node matched, so you get the
whole node back rather than just its text value.
# output in xml, every anchor tag not obviously pointing locally
GET www.semicomplete.com \
| ./xpathtool.sh --ihtml --oxml \
'//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]' \
| head -4
<?xml version="1.0"?>
<toplevel>
<a href="http://www.viewvc.org/">ViewVC</a>
<a href="http://www.oreillynet.com/sysadmin/">http://www.oreillynet.com/sysadmin/</a>
... etc ...
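To see the difference between <xsl:value-of> and <xsl:copy-of> on its own, here
is a small sketch that runs both against one anchor tag. It assumes xsltproc
(from libxslt) is installed. The sample document and the make_xsl helper are
made up for illustration; this is not necessarily how xpathtool builds its
stylesheet.

```shell
tmp=$(mktemp -d)

# Hypothetical sample input, well-formed XML rather than real-world HTML.
cat > "$tmp/sample.xml" <<'EOF'
<page><a href="http://www.viewvc.org/">ViewVC</a></page>
EOF

# make_xsl INSTRUCTION METHOD: print a stylesheet that applies
# <xsl:INSTRUCTION> to every <a> and serializes with the given output method.
make_xsl() {
  cat <<EOF
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="$2" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <xsl:for-each select="//a">
      <xsl:$1 select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF
}

make_xsl value-of text > "$tmp/value.xsl"
make_xsl copy-of xml > "$tmp/copy.xsl"

# value-of yields the node's string value; copy-of yields the node itself.
as_text=$(xsltproc "$tmp/value.xsl" "$tmp/sample.xml")
as_xml=$(xsltproc "$tmp/copy.xsl" "$tmp/sample.xml")

printf 'value-of: %s\n' "$as_text"
printf 'copy-of:  %s\n' "$as_xml"

rm -rf "$tmp"
```

With value-of you get only the link text, while copy-of keeps the tag and its
href attribute, which is why the xml output above shows complete anchor elements.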
Interested?
Try xpathtool.