Jordan Sissel
geek

Sat, 23 Jun 2007

jquerycmd+xpathtool == direction scraping on google

Show the first 3 steps that Google Maps tells you to take:
./getpath.sh "atlanta to nyc" | head -3
Head southeast on Trinity Ave SW toward Washington St SW        0.2mi
Slight left at Memorial Dr SW   0.3mi
Turn left at Martin St SE       361ft
Pipe that to lpr and you've got printed directions in under 5 seconds.

Why not just do this with plain page scraping? Because Google Maps relies on lots of JavaScript to present the directions to the user. Firefox (Gecko, really) already handles all of that, so why bother reinventing the wheel? Let's use the wheel that already works.

Download jquery-20070623.1828.tar.gz. The jquerycmd download includes the XUL app, 'jquerycmd.sh', and 'getpath.sh'.

For the lazy who just want to see the scripts:

Permalink: /geekery/superhappydevhouse18-part2
posted at: 21:16

Mon, 14 May 2007

Command line xpath tool

I've recently found myself needing to write one-off scrapers to pull information from various services. However, 'w3m -dump' doesn't always suffice when I need to parse HTML and maybe throw the result at awk.

I know XSLT and XPath, but I don't know of a good XPath tool for the command line. XML::XPath in Perl comes with one, but it's not up to my demands.

So, like most problems I come across, I solved it myself. A simple query like "show all external links" is very cumbersome to do in a one-off manner unless you've got the right tools. Here's basically how you do it with XPath:

//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href
It's a bit complicated, but whatever, I can express what I want, right? Feed this to my new xpathtool project, and I can find out where the URLs are going:
% GET www.semicomplete.com \
| ./xpathtool.sh --ihtml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]/@href'

http://www.viewvc.org/
http://www.oreillynet.com/sysadmin/
... etc ...
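
If the XPath predicate is hard to read, here is a rough Python sketch of the same "external links" test, using only the standard library. This is not how xpathtool works; the names ExternalLinkParser and external_links are made up for illustration, and the sample markup is invented.

```python
from html.parser import HTMLParser


class ExternalLinkParser(HTMLParser):
    """Collect href values of <a> tags that point off-site (hypothetical helper)."""

    def __init__(self, local_domain):
        super().__init__()
        self.local_domain = local_domain
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        # Same two conditions as the XPath predicate:
        #   starts-with(@href, "http://") and not(contains(@href, local_domain))
        if href.startswith("http://") and self.local_domain not in href:
            self.links.append(href)


def external_links(html, local_domain="semicomplete.com"):
    """Return the off-site http:// links found in an HTML string, in document order."""
    parser = ExternalLinkParser(local_domain)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Invented sample page: one external link, one local link, one relative link.
    sample = """
    <p>
      <a href="http://www.viewvc.org/">ViewVC</a>
      <a href="http://www.semicomplete.com/projects/">projects</a>
      <a href="/about">about</a>
    </p>
    """
    for url in external_links(sample):
        print(url)
```

Like the XPath version, this only does a substring check on the domain, so it is a heuristic for "not obviously local" rather than real URL parsing.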
You can specify that the input is HTML with '--ihtml'. The output can be text, HTML, or XML. If you ask for HTML or XML output, xpathtool uses <xsl:copy-of> instead of <xsl:value-of> for each matched node, so you get the whole element (tags and attributes included) rather than just its text.
# output in xml, every anchor tag not obviously pointing locally
GET www.semicomplete.com \
| ./xpathtool.sh --ihtml --oxml \
  '//a[not(contains(@href,"semicomplete.com")) and starts-with(@href, "http://")]' \
| head -4

<?xml version="1.0"?>
<toplevel>
  <a href="http://www.viewvc.org/">ViewVC</a>
  <a href="http://www.oreillynet.com/sysadmin/">http://www.oreillynet.com/sysadmin/</a>

  ... etc ...
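
The difference between the two output styles can be mimicked in Python with the standard library. This is only an analogy for what <xsl:value-of> and <xsl:copy-of> do to a matched node, not xpathtool's implementation; string_value and node_copy are hypothetical names, and the second anchor in the sample is invented to show nested markup.

```python
import xml.etree.ElementTree as ET

# Small sample document; the second link is made up for illustration.
doc = ET.fromstring(
    "<toplevel>"
    '<a href="http://www.viewvc.org/">ViewVC</a>'
    '<a href="http://example.com/">sysadmin <b>news</b></a>'
    "</toplevel>"
)


def string_value(node):
    """Roughly what <xsl:value-of> emits: all descendant text joined, markup dropped."""
    return "".join(node.itertext())


def node_copy(node):
    """Roughly what <xsl:copy-of> emits: the element itself, tags and attributes intact."""
    return ET.tostring(node, encoding="unicode")


if __name__ == "__main__":
    for a in doc.iter("a"):
        print("text:", string_value(a))
        print("xml: ", node_copy(a))
```

Text output is handy for piping into awk or head; the copied-node form is what you want when the attributes or nested tags matter.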
Interested? Try xpathtool.

Permalink: /geekery/command-line-xpath
posted at: 04:51
