Jordan Sissel
geek. sysadmin. blogger.

Mon, 14 May 2007

Command line xpath tool

I've recently found myself needing to write one-off scrapers to pull information from various services. However, using 'w3m -dump' doesn't always suffice when I need to parse html and maybe throw it at awk.

I know XSLT and XPath, but I don't know of a good xpath tool for the command line. XML::XPath in perl comes with one, but it's not up to my demands.

So, like most problems I come across, I solved it myself. A simple query of "Show all external links" is very cumbersome to do in a one-off manner unless you've got the right tools. Here's basically how you do it with xpath:

//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href
It's a bit complicated, but whatever, I can express what I want, right? Toss this along with my new xpathtool project, and I can find out where the urls are going:
% GET www.semicomplete.com \
| ./xpathtool.sh --ihtml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href'

http://www.viewvc.org/
http://www.oreillynet.com/sysadmin/
... etc ...
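If the predicate in that expression looks dense, it may help to see that it is just two string tests on each href: a prefix match and a negated substring match. Here is a rough sketch of the same filter over a plain list of href values, using only grep. The function name external_links is made up for illustration and is not part of xpathtool, and two of the sample hrefs are invented.

```shell
# Hypothetical helper, not part of xpathtool: reads one href per line on
# stdin and keeps only the lines the XPath predicate would keep.
external_links() {
  site="$1"
  # starts-with(@href, "http://")  ->  anchored match on the prefix
  # not(contains(@href, site))     ->  drop lines containing the site name
  grep '^http://' | grep -v -F -- "$site"
}

# Sample input: the second and third hrefs are invented for this example.
printf '%s\n' \
  'http://www.viewvc.org/' \
  '/geekery/command-line-xpath' \
  'http://www.semicomplete.com/projects/xpathtool' \
  'http://www.oreillynet.com/sysadmin/' \
| external_links 'semicomplete.com'
```

This only works once the hrefs are already one per line, which is exactly the part that needs a real HTML parser and an xpath query.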
You can specify that the input is html with '--ihtml'. The output can be chosen as text, html, or xml. If you specify html or xml output, it will use <xsl:copy-of> instead of <xsl:value-of> for each node matched.
# output in xml, every anchor tag not obviously pointing locally
GET www.semicomplete.com \
| ./xpathtool.sh --ihtml --oxml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]' \
| head -4

<?xml version="1.0"?>
<toplevel>
  <a href="http://www.viewvc.org/">ViewVC</a>
  <a href="http://www.oreillynet.com/sysadmin/">http://www.oreillynet.com/sysadmin/</a>

  ... etc ...
Interested? Try xpathtool.

Permalink: /geekery/command-line-xpath
posted at: 04:51

When are typos deadly? Nethack.

'e' and 'w' are tragically close. They'll kill you.

A typo caused the untimely demise of my latest nethack venture. I ate a cockatrice corpse, instead of wielding it, while on the astral plane, poised to completely decimate the current top score.


                       ----------
                      /          \
                     /    REST    \
                    /      IN      \
                   /     PEACE      \
                  /                  \
                  |     psionic      |
                  |       0 Au       |
                  |   petrified by   |
                  |     tasting      |
                  | cockatrice meat  |
                  |                  |
                  |       2007       |
                 *|     *  *  *      | *
        _________)/\\_//(\/(/\)/\//\/|_)_______


Farvel psionic the Valkyrie...

You turned to stone in The Astral Plane with 33241022 points,
and 0 pieces of gold, after 173216 moves.
You were level 30 with a maximum of 1371 hit points when you turned to stone.
This game was way fun. I had built a character that was unstoppable. See what I had in this screen scrape. I also had another bag of goodies that contained everything points-wise valuable from this large box I had been storing goodies in.

I'm a little sad that the game ended this way, but whatever. Now I can go about my normal project work, since the hold nethack has had on my after-work activities has been broken.

Permalink: /geekery/when-a-typo-kills-you
posted at: 03:24
