I've recently found myself needing to write one-off scrapers to pull information
from various services. However, using 'w3m -dump' doesn't always suffice when I
need to parse html and maybe throw it at awk.
I know XSLT and XPath, but I don't know of a good xpath tool for the
commandline. XML::XPath in perl comes with one, but it's not up to my demands.
So, like most problems I come across, I solved it myself. A simple query of
"Show all external links" is very cumbersome to do in a one-off manner unless
you've got the right tools. Here's basically how you do it with xpath:
//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href
It's a bit complicated, but whatever, I can express what I want, right? Toss
this along with my new xpathtool project, and
I can find out where the urls are going:
% GET www.semicomplete.com \
| ./xpathtool.sh --ihtml \
'//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href'
http://www.viewvc.org/
http://www.oreillynet.com/sysadmin/
... etc ...
You can specify that the input is html with '--ihtml'. The output can be chosen
as text, html, or xml. If you specify html or xml output, it will use
<xsl:copy-of> instead of <xsl:value-of> for each node matched.
# output in xml, every anchor tag not obviously pointing locally
GET www.semicomplete.com \
| ./xpathtool.sh --ihtml --oxml \
'//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]' \
| head -4
<?xml version="1.0"?>
<toplevel>
<a href="http://www.viewvc.org/">ViewVC</a>
<a href="http://www.oreillynet.com/sysadmin/">http://www.oreillynet.com/sysadmin/</a>
... etc ...
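If you'd rather stay inside a script, the same "external links" filter can be sketched in Python. This is not how xpathtool works internally; it's a minimal illustration that assumes well-formed input and uses a made-up sample page (PAGE) and a hypothetical helper (external_links). The standard library's ElementTree doesn't support contains() or starts-with() in its XPath subset, so the predicate is written in plain Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample input standing in for a fetched page.
PAGE = """<html><body>
<a href="http://www.semicomplete.com/about">About</a>
<a href="/projects/xpathtool">xpathtool</a>
<a href="http://www.viewvc.org/">ViewVC</a>
<a href="http://www.oreillynet.com/sysadmin/">O'Reilly sysadmin</a>
</body></html>"""


def external_links(root, local_domain="semicomplete.com"):
    """Return <a> elements whose href is absolute http:// and not local."""
    matches = []
    for anchor in root.iter("a"):
        href = anchor.get("href", "")
        if href.startswith("http://") and local_domain not in href:
            matches.append(anchor)
    return matches


root = ET.fromstring(PAGE)
anchors = external_links(root)

# "Text" style output: just the attribute value, like selecting /@href.
hrefs = [a.get("href") for a in anchors]

# "XML" style output: the whole matched node, similar in spirit to copy-of.
nodes = [ET.tostring(a, encoding="unicode").strip() for a in anchors]

print("\n".join(hrefs))
print("\n".join(nodes))
```

The two lists mirror the text-versus-xml output choice described above: one keeps only the value, the other keeps the matched node itself.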
Interested? Try xpathtool.
Comments: 1 (view comments)
Tags: xml, xpath, productivity, scraping, xpathtool
Permalink: /geekery/command-line-xpath
posted at: 04:51
So, I've been reading docs on python's xml stuff, hoping there's something
simple or comes-default-with-python that'll let me do xpath. Everyone
overcomplicates xml processing. I have no idea why. Python seems to have enough
alternatives to make dealing with xml less painful.
Standard python docs will lead you astray:
kenya(...ojects/pimp/pimp/controllers) % pydoc xml.dom | wc -l
643
Clearly, the pydoc for "xml.dom" has some nice things, right? I mean, documentation is clearly an indication of THE THING THAT IS DOCUMENTED BEING AVAILABLE. Right?
Sounds great. Let's try to use this 'xml.dom' module!
kenya(...ojects/pimp/pimp/controllers) % python -c 'import xml; xml.dom'
Traceback (most recent call last):
  File "<string>", line 1, in ?
AttributeError: 'module' object has no attribute 'dom'
WHAT. THE. HELL.
Googling around, it turns out that 'xml' is a fake module that only actually works if you have the 4Suite modules installed? Maybe?
Why include fake modules that provide complete documentation to modules that do not exist in the standard distribution?
Who's running this ship? I want off. I'll swim if necessary.
As it turns out, I made too strong an assumption about python's affinity
towards java-isms. I roughly equated 'import foo' in python with 'import foo.*'
in java. That was incorrect. Importing foo doesn't get you access to things in
its directory; they have to be imported explicitly.
In summary, 'import xml' gets you nothing. 'import xml.dom' gets you nothing.
If you really want minidom's parser, you'll need 'import xml.dom.minidom' or a
'from import' variant.
On another note, the following surprised me. I had a module, foo/bar.py. I
figured 'from foo import *' would grab it. It doesn't, which means
'from xml.dom import *' doesn't get you minidom and friends.
Perhaps I was hoping for too much, but maybe it's better to import explicitly.
If that's the case, then why push exceptions that allow '*' to be imported only
from modules, not packages?
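Here's a small sketch that reproduces the behavior without depending on what happens to be installed. It's written for a current Python 3 rather than the interpreter I was using, and the package name "demo_pkg" is hypothetical; it gets created in a temp directory purely for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk: demo_pkg/__init__.py and demo_pkg/bar.py.
# The name "demo_pkg" is hypothetical and exists only for this illustration.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")  # empty __init__: imports no submodules, defines no __all__
with open(os.path.join(pkg_dir, "bar.py"), "w") as f:
    f.write("VALUE = 42\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_pkg

# Importing the package alone does not load its submodules.
before = hasattr(demo_pkg, "bar")

# 'from demo_pkg import *' only copies names already present in the package
# namespace (or listed in __all__), so it does not pull in bar either.
namespace = {}
exec("from demo_pkg import *", namespace)
star_has_bar = "bar" in namespace

# An explicit submodule import loads it and binds it on the package.
import demo_pkg.bar

after = hasattr(demo_pkg, "bar")

print(before, star_has_bar, after)  # False False True
```

If a package wants the star import to pull in its submodules, it can list them in __all__ inside its __init__.py; the empty __init__.py above deliberately doesn't.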
Comments: 2 (view comments)
Tags: rants, python, xml
Permalink: /geekery/python-and-xml
posted at: 21:23
Yesterday, I participated in a 12-hour coding-binge competition. It started at
7pm Friday night and ran until 7am Saturday morning. It was fueled by Computer
Science House and Bawls, both sponsors of the event. Needless to say, I haven't
gotten much sleep today.
The competition website is here. Go there if you
want to view this year's objectives.
The Dream Team consisted of John Resig, Darrin
Mann, Matt Bruce, and myself. Darrin, Resig, and I are all quite proficient at
web development, so we decided this year we would represent ourselves as "Team
JavaScript" - and do everything possible in javascript. Bruce is not a
programmer, but I enlisted his graphical art skills because I figured with our
team doing some web-based project, we definitely needed an artist.
After reviewing all the objectives, we came up with a significant modification
upon the Sudoku objective. The sudoku objective was a problem that lacked much
room for innovation, so we went further and instead of solving Sudoku, wrote a
web-based version of an extremely popular game in Second Life. The contest
organizer approved of our new objective, so we did just that.
Resig worked on game logic, I worked on chat features, Darrin worked on scoring
and game generation, and Bruce worked on the interface graphics. Because our
tasks were all mostly unrelated, we could develop them independently. Most of
the game was completed in about 6 hours, and the remainder of the time was
spent fixing bugs, refactoring, and some minor redesign.
The backends were minimal. The chat backend was only 70 lines of perl, and the
score backend was 9 lines of /bin/sh. Everything else was handled in the
browser. We leveraged Resig's jQuery to make development faster. Development
went extremely smoothly, a testament to the "Dream Team"-nature of our team,
perhaps? ;)
The game worked by presenting everyone with the same game - so you can compete
for the highest score. You could also chat during and between games, if you
wanted to.
A screenshot can be found here. At the end of the competition, we only had one
known bug left. That bug didn't affect gameplay, and we were all tired, so it
didn't get fixed. There were a few other issues that remained unresolved that
may or may not be related to our code. Firefox was having issues with various
things we were doing, and we couldn't tell if it was our fault or not.
Despite the fact that I probably shouldn't have attended the competition due to
scholastic time constraints, I was glad I went. We had a blast writing the game.
We may get some time in the near future to improve the codebase and put it up
online so anyone can play. There are quite a few important features that need to
be added before it'll be useful as a public game.
Comments: 1 (view comments)
Tags: nosleep, perl, javascript, web2.0, jquery, shell, web, xml, codebinge
Permalink: /geekery/bawls-competition-tringo
posted at: 19:11
Ever since BarCampNYC, I've been geeking out working with jQuery, a project by
my good friend John Resig. It's a JavaScript
library that takes ideas from Prototype and Behavior and some good smarts to
make writing fancy JavaScript pieces so easy I ask myself "Why wasn't this
available before?" I won't bother going into the details of how the library
works, but it's based around querying documents. It supports CSS1, CSS2 and
CSS3 selectors (and some simple XPath) to query documents for fun and profit.
In the car ride back from BarCampNYC, I asked Resig if he knew whether or not
jQuery would work for querying on xml document objects. "Well, I'm not sure"
was the response. I took the time today to test that theory, and it works,
because jQuery does not rely on document.getElementById() to look for elements
the way Prototype does. Bypassing that limitation, you can successfully query
XML documents and even subdocuments of HTML or XML. This is fantastic.
Today's magic was a demo I wrote to pull my rss feed via XMLHttpRequest (AJAX)
and very simply pull the data I wanted to use out of the XML document object
returned.
The gist of the magic of jQuery revolves around the $() function.
This function is generations ahead of what the Prototype $()
function provides.
The magic is here, in the XMLHttpRequest onreadystatechange function:
// For each 'item' element in the RSS document, alert() out the title.
// 'xml' is the XMLHttpRequest object; responseXML is its parsed document.
var entries = $("item", xml.responseXML).each(
  function() {
    var title = $(this).find("title").text();
    alert("Title: " + title);
  }
);
The actual demo is quite impressive, I think. I can query through a complex XML
document in only a few lines of code. Select the data you want, use it, go
about your life. So simple!
View the RSS-to-HTML jQuery Demo
Comments: 0 (view comments)
Tags: open source adventures, jquery, xml
Permalink: /geekery/jquery-on-xml-documents
posted at: 23:56
I've been gradually researching interesting ways to go about templating pages for Pimp 4.0 (rewrite in python). I've come to the conclusion that regexp replacement is hackish, and using a big templating toolkit is too much effort for now. I've come up with a solution that I've yet to test thoroughly; the gist of it is:
Use an XML DOM parser to get a DOM-ified version of the webpage. Use XPath to find elements I want to modify and do so as necessary. Poof, templating.
A sample template is layout.html
The following python will parse it and insert "Testing" into the content div.
#!/usr/local/bin/python
import sys
from xml.dom import minidom
# xml.xpath is not in the standard library; it comes from the PyXML add-on.
from xml import xpath

if __name__ == '__main__':
    foo = minidom.parse("layout.html")
    # Append a text node to the element with 'id="content"'
    div = xpath.Evaluate("//*[@id='content']", foo.documentElement)
    div[0].appendChild(foo.createTextNode("Testing"))
    # Serialize the modified document to stdout
    foo.writexml(sys.stdout)
It seems pretty simple. I'm probably going to come up with a simple-ish xml/xpath way of doing templating. We'll see how well it actually works later on. The plan: move the complicated parts (complex xpath notions) into a templating class with an "insert text" or somesuch method and poof, simple templating. Even for complex situations where I may need to produce a table, it is easy to provide a default node-tree for replicating. The particular DOM implementation I am using provides me a wonderful cloneNode() method with which to do this.
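A rough sketch of what that templating helper might look like, using only the standard library's minidom (no xpath module). Everything here is an assumption for illustration: the LAYOUT string stands in for layout.html, and find_by_id, insert_text, and fill_rows are hypothetical names, not anything that exists in Pimp.

```python
from xml.dom import minidom

# Hypothetical stand-in for layout.html; the real template is not shown here.
LAYOUT = """<html><body>
<div id="header">Pimp</div>
<div id="content"></div>
<table id="songs"><tr class="row"><td>placeholder</td></tr></table>
</body></html>"""


def find_by_id(node, wanted):
    """Depth-first search for the first element whose id attribute matches."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if child.getAttribute("id") == wanted:
                return child
            found = find_by_id(child, wanted)
            if found is not None:
                return found
    return None


def insert_text(doc, element_id, text):
    """Append a text node to the element with the given id."""
    target = find_by_id(doc, element_id)
    target.appendChild(doc.createTextNode(text))


def fill_rows(doc, table_id, values):
    """Clone the table's first row once per value, then drop the prototype."""
    table = find_by_id(doc, table_id)
    prototype = table.getElementsByTagName("tr")[0]
    for value in values:
        row = prototype.cloneNode(True)  # deep copy of the row subtree
        cell = row.getElementsByTagName("td")[0]
        cell.firstChild.data = value
        table.appendChild(row)
    table.removeChild(prototype)


doc = minidom.parseString(LAYOUT)
insert_text(doc, "content", "Testing")
fill_rows(doc, "songs", ["Song A", "Song B"])
result = doc.documentElement.toxml()
print(result)
```

The row in the template acts as the default node-tree: it is copied with cloneNode(True) for each value and then removed, so the placeholder never reaches the output.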
Of course, if you know of any other simpler ways of doing templating in python (or in general) definitely let me know :)
Comments: 0 (view comments)
Tags: xml, web, python
Permalink: /geekery/198
posted at: 03:35
Yet Another Rewrite of Pimp, my music jukebox software, has commenced. This
time, I'm writing it in Python. This was the best excuse I could find to learn
python. I've tinkered with it before but never written an application in it.
Anyway, the interface has moved from telnet-based to web-based and uses
XMLHttpRequest (AJAX) to perform XMLRPC calls on the purely-python webserver.
Python provides a wonderful standard module called 'xmlrpclib' to
marshal/unmarshal XMLRPC requests and responses to/from python and XML.
JavaScript, however, lacks these marshalling features.
Some quick googling found jsolait and vcXMLRPC. Both of these are huge
frameworks and are well beyond my particular needs. BOTH of them have "the suck" and
fail to cleanly load into Firefox without warnings. Bah! Back at square one.
I'm left without a way to marshal xmlrpc requests and responses between
javascript and xml.
I spent some time learning about XMLRPC. Turns out it's a very very simple
xml-based protocol for calling methods and getting results. JavaScript has DOM already so parsing XMLRPC messages is very easy.
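To see how small the protocol is, here is a sketch of the python side doing a marshal/unmarshal round trip. It assumes Python 3, where the 'xmlrpclib' module is named xmlrpc.client; the method name "pimp.play" and its arguments are made up for the example and are not part of Pimp's actual API.

```python
import xmlrpc.client  # Python 3 name for the 'xmlrpclib' module

# Marshal a call to a hypothetical method "pimp.play" with one struct argument.
args = ({"artist": "Example Artist", "volume": 7},)
request_xml = xmlrpc.client.dumps(args, methodname="pimp.play")

# Unmarshal: loads() returns the parameter tuple and the method name.
params, method = xmlrpc.client.loads(request_xml)

print(request_xml)
print(method, params)
```

The printed request is a methodCall element holding a methodName and a list of typed param values, which is about all a JavaScript DOM walker has to understand.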
Take a look at the 'rpcparam2hash' and 'hash2rpcparam' functions in pimp.js and see how I
convert between JavaScript hashes (dictionaries) and XMLRPC messages. If I get
bored I may create my own xmlrpc library specifically for making xmlrpc calls
with javascript. If you want this to get done, please let me know and give me
encouragement ;)
Comments: 0 (view comments)
Tags: javascript, xmlrpc, xml, web, python
Permalink: /geekery/193
posted at: 02:51
I finally got unlazy and found the energy to start on the new revision of this
website. The layout is going to stay the same, but the way the site works is
changing drastically. The changes should allow me to add new features more
quickly as well as add cooler features (comments on every page, for example).
I'm still brainstorming how it should all come together, but for the most part
I've got a decent xml- and make-based website framework. Webpages are written
in pure XML/XHTML and HTML is created using XSLT. The whole website is managed
with simple makefiles so when I change one thing, I can simply type 'make' and
it republishes itself. Ideally, this would be done by a cronjob so updates
simply publish themselves.
I've posted more information about it on the new site, check it out:
http://www.csh.rit.edu/~psionic/new/ (this url no longer works)
Hurray XML!
Comments: 0 (view comments)
Tags: xml
Permalink: /geekery/site-move-to-xml
posted at: 04:09
My love for XML as a document format has only been growing over the past
months. I write almost all of my formatted documents using XML these days.
Articles and Project pages are written in XML, as are a number of my projects.
Most notably, my xmlpresenter
project is one of the cooler examples. I can fully publish articles by typing
'make' now, which executes this makefile. No
magic cgi scripts involved. Plain HTML is served: Simple, clean, efficient.
I've been wanting to completely rewrite my website using xml and makefiles
because they're just so simple and xslt makes document formatting the easiest
thing in the world. I'm hoping to soon have gathered enough effort points to
want to spend on redesigning the internals of this site. We'll see. I'll post
more probably in a month when I finally get off my lazy bum.
Files of possible interest:
article.xsl
Article Makefile
ssh security article xml
Comments: 0 (view comments)
Tags: xml
Permalink: /geekery/187
posted at: 05:40