Search this site

Page 1 of 2  [next]





Command line xpath tool

I've recently found myself needing to do one-off scrapers to pull information from various services. However, using 'w3m -dump' doesn't always suffice when I need to parse html and maybe throw it at awk.

I know XSLT and XPath, but I don't know of a good xpath tool for the commandline. XML::XPath in perl comes with one, but it's not up to my demands.

So, like most problems I come across, I solved it myself. A simple query of "Show all external links" is very cumbersome to do in a one-off manner unless you've got the right tools. Here's how you do basically do it with xpath:

//a[not(contains(@href,"")) and starts-with(@href, "http://")]/@href
It's a bit complicated, but whatever, I can express what I want, right? Toss this along with my new xpathtool project, and I can find out where the urls are going:
% GET \
| ./ --ihtml \
  '//a[not(contains(@href,"")) and starts-with(@href, "http://")]'
... etc ...
You can specify that the input is html with '--ihtml'. The output can be chosen as text, html, or xml. If you specify html or xml output, it will use <xsl:copy-of> instead of <xsl:value-of> for each node matched.
# output in xml, every anchor tag not obviously pointing locally
| ./ --ihtml --oxml \
  '//a[not(contains(@href,"")) and starts-with(@href, "http://")]' \
| head -4

<?xml version="1.0"?>
  <a href="">ViewVC</a>
  <a href=""></a>

  ... etc ...
Interested? Try xpathtool.

Python's sad xml business and modules vs packages.

So, I've been reading docs on python's xml stuff, hoping there's something simple or comes-default-with-python that'll let me do xpath. Everyone overcomplicates xml processing. I have no idea why. Python seems to have enough alternatives to make dealing with xml less painful.

Standard python docs will lead you astray:

kenya(...ojects/pimp/pimp/controllers) % pydoc xml.dom | wc -l
Clearly, the pydoc for "xml.dom" has some nice things, right? I mean, documentation is clearly an indication that THE THING THAT IS DOCUMENTED BEING AVAILABLE. Right?

Sounds great. Let's try to use this 'xml.dom' module!

kenya(...ojects/pimp/pimp/controllers) % python -c 'import xml; xml.dom'
Traceback (most recent call last):
  File "", line 1, in ?
AttributeError: 'module' object has no attribute 'dom'

Googling around, it turns out that 'xml' is a fake module that only actually works if you have it the 4Suite modules installed? Maybe?

Why include fake modules that provide complete documentation to modules that do not exist in the standard distribution?

Who's running this ship? I want off. I'll swim if necessary.

As it turns out, I made too-strong of an assumption about python's affinity towards java-isms. I roughly equated 'import foo' in python as 'import foo.*' in java. That was incorrect. Importing foo doesn't get you access to things in it's directory, they have to be imported explicity.

In summary, 'import xml' gets you nothing. 'import xml.dom' gets you nothing. If you really want minidom's parser, you'll need 'import xml.dom.minidom' or a 'from import' variant.

On another note, the following surprised me. I had a module, foo/ I figured 'from foo import *' would grab it. This means 'from xml.dom import *' doesn't get you minidom and friends.

Perhaps I was hoping for too much, but maybe it's better to import explicitly. If that's the case ,then why push exceptions that allow '*' to be imported only from modules, not packages?

Strip XML comments with sed

sed -ne '/<!--/ { :c; /-->/! { N; b c; }; /-->/s/<!--.*-->//g }; /^  *$/!p;'
You might consider stripping blanklines and/or filtering through xmllint --format to make the xml pretty printed.

The CSH Bawls Programming Competition

Yesterday, I participated in a 12-hour coding-binge competition. It started at 7pm Friday night and ran until 7am Saturday morning. It was fueled by Computer Science House and Bawls, both sponsors of the event. Needless to say, I haven't gotten much sleep today.

The competition website is here. Go there if you want to view this year's objectives.

The Dream Team consisted of John Resig, Darrin Mann, Matt Bruce, and myself. Darrin, Resig, and I are all quite proficient at web development, so we decided this year we would represent ourselves as "Team JavaScript" - and do everything possible in javascript. Bruce is not a programmer, but I enlisted his graphical art skills because I figured with our team doing some web-based project, we definitely needed an artist.

After reviewing all the objectives, we came up with a significant modification upon the Sudoku objective. The sudoku objective was a problem that lacked much room for innovation, so we went further and instead of solving Sudoku, wrote a web-based version of an extremely popular game in Second Life. The contest organizer approved of our new objective, so we did just that.

Resig worked on game logic, I worked on chat features, Darrin worked on scoring and game generation, and Bruce worked on the interface graphics. Becuase our tasks were all mostly unrelated, we could develop them independently. Most of the game was completed in about 6 hours, and the remainder of the time was spent fixing bugs, refactoring, and some minor redesign.

The backends were minimal. The chat backend was only 70 lines of perl, and the score backend was 9 lines of /bin/sh. Everything else was handled in the browser. We leveraged Resig's jQuery to make development faster. Development went extremely smooth, a testament to the "Dream Team"-nature of our team, perhaps? ;)

The game worked by presenting everyone with the same game - so you can compete for the highest score. You could also chat during and between games, if you wanted to.

A screenshot can be found here. At the end of the competition, we only had one known bug left. That bug didn't affect gameplay, and we were all tired, so it didn't get fixed. There were a few other issues that remained unresolved that may or may not be related to our code. Firefox was having issues with various things we were doing, and we couldn't tell if it was our fault or not.

Despite the fact that I probably shouldn't have attended the competition due to scholastic time constraints, I was glad I went. We had a blast writing the game.

We may get some time in the near future to improve the codebase and put it up online so anyone can play. There are quite a few important features that need to be added before it'll be useful as a public game.

jQuery on XML Documents

Ever since BarCampNYC, I've been geeking out working with jQuery, a project by my good friend John Resig. It's a JavaScript library that takes ideas from Prototype and Behavior and some good smarts to make writing fancy JavaScript pieces so easy I ask myself "Why wasn't this available before?" I won't bother going into the details of how the library works, but it's based around querying documents. It supports CSS1, CSS2 and CSS3 selectors (and some simple XPath) to query documents for fun and profit.

In the car ride back from BarCampNYC, I asked Resig if he knew whether or not jQuery would work for querying on xml document objects. "Well, I'm not sure" was the response. I took the time today to test that theory. Becuase jQuery does not rely on document.getElementById() to look for elements the way Prototype does. Bypassing that limitation, you can successfully query XML documents and even subdocuments of HTML or XML. This is fantastic.

Today's magic was a demo I wrote to pull my rss feed via XMLHttpRequest (AJAX) and very simply pull the data I wanted to use out of the XML document object returned.

The gist of the magic of jQuery revolves around the $() function. This function is generations ahead of what the Prototype $() function provides.

The magic is here, in the XMLHttpRequest onreadystatechange function

// For each 'item' element in the RSS document, alert() out the title.
var entries = $("item",xml.responseXML).each(
   function() {
      var title = $(this).find("title").text();
      alert("Title: " + title);
The actual demo is quite impressive, I think. I can query through a complex XML document in only a few lines of code. Select the data you want, use it, go about your life. So simple!

View the RSS-to-HTML jQuery Demo

templating with xhtml, xpath, and python

I've been gradually researching interesting ways to go about templating pages for Pimp 4.0 (rewrite in python). I've come to the conclusion that regexp replacement is hackish. Using a big templating toolkit is too much effort for now. However, I've come up with a solution I've yet to test thorougly, but the gist of it is:

Use an XML DOM parser to get a DOM-ified version of the webpage. Use XPath to find elements I want to modify and do so as necessary. Poof, templating. A sample template is layout.html

The following python will parse it and insert "Testing" into the content div.


import sys
from xml.dom import minidom
from xml import xpath

if __name__ == '__main__':
        foo = minidom.parse("layout.html")

        # Append a text node to the element with 'id="content"'
        div = xpath.Evaluate("//*[@id='content']", foo.documentElement)
It seems pretty simple. I'm probably going to come up with a simple-ish xml/xpath way of doing templating. We'll see how well it actually works later on, but for now it seems like a pretty simple way of doing templating. Move the complicated parts (complex xpath notions) to a templating class with an "insert text" or somesuch method and poof, simple templating. Even for complex situations where I may need to produce a table it is easy to provide a default node-tree for replicating. The particular DOM implementation I am using provides me a wonderful cloneNode() method with which to do this.

Ofcourse, if you know of any other simpler ways of doing templating in python (or in general) definitely let me know :)

xmlrpc javascript library and pimp rewrite

Yet Another Rewrite of Pimp, my music jukebox software, has commenced. This time, I'm writing it in Python. This was the best excuse I could find to learn python. I've tinkered with it before but never written an application in it.

Anyway, the interface has moved from telnet-based to web-based and uses XMLHTTPRequest (AJAX) to perform XMLRPC calls on the purely-python webserver. Python provides a wonderful standard module called 'xmlrpclib' to marshall/unmarshall XMLRPC requests and responses to/from python and XML. JavaScript, howver, lacks these marshalling features.

Some quick googling found jsolait and vcXMLRPC. Both of these are huge frameworks and are well beyond my particular needs. BOTH of them have "the suck" and fail to cleanly load into Firefox without warnings. Bah! Back at square-one. I'm left without a way to marshall xmlrpc requests and responses between javascript and xml

I spent some time learning about XMLRPC. Turns out it's a very very simple xml-based protocol for calling methods and getting results. JavaScript has DOM already so parsing XMLRPC messages is very easy.

Take a look at the 'rpcparam2hash' and 'hash2rpcparam' functions in pimp.js and see how I convert between JavaScript hashes (dictionaries) and XMLRPC messages. If I get bored I may create my own xmlrpc library specifically for making xmlrpc calls with javascript. If you want this to get done, please let me know and give me encouragement ;)

xmlfo, round 1.

Start with: sitebook.xml
End with: sitebook.pdf

This uses an xsl template to generate an xmlfo document. Then it uses fop (an apache project) to turn the xmlfo into a pdf.


I finally got unlazy and found the energy to start on the new revision of this website. The layout is going to stay the same, but the way the site works is changing drastically. The changes should allow me to add new features more quickly aswell as adding cooler features (comments on every page, for example).

I'm still brainstorming how it should all come together, but for the most part I've got a decent xml- and make-based website framework. Webpages are written in pure XML/XHTML and HTML is created using XSLT. The whole website is managed with simple makefiles so when I change one thing, I can simply type 'make' and it republishes itself. Ideally, this would be done by a cronjob so updates simply publish themselves.

I've posted more information about it on the new site, check it out: (this url no longer works)

Hurray XML!

xml, xml, xml.

My love for XML as a document format has only been growing over the past months. I write almost all of my formatted documents using XML these days. Articles and Project pages are written in XML, as are a number of my projects. Most notably, my xmlpresenter project is one of the cooler examples. I can fully publish articles by typing 'make' now, which executes this makefile. No magic cgi scripts involved. Plain HTML is served: Simple, clean, efficient.

I've been wanting to completely rewrite my website using xml and makefiles becuase they're just so simple and xslt makes document formatting the easiest thing in the world. I'm hoping to soon have gathered enough effort points to want to spend on redesigning the internals of this site. We'll see. I'll post more probably in a month when I finally get off my lazy bum.

Files of possible interest:
Article Makefile
ssh security article xml