
Python, event tracking, and bdb.

In previous posts, I've shared some thoughts on monitoring and data aggregation. Last night, I started prototyping a python tool to record arbitrary data. Basically, it aims to solve the "I want to store data" problem in a simple way, rather than having to set up a mysql database, user access, tables, a web server, and so on, all in the name of viewing graphs and data reports.

The requirements are pretty simple and generic:

  • Storage will map key+timestamp => value
  • Timestamps should be 64-bit values so we can store microsecond precision (the current unix epoch in microseconds takes about 52 bits)
  • Access is assumed to be random by key, but reading multiple timestamp entries for a single key is expected to be sequential
  • Key names can be of arbitrary length
  • Values are arbitrary
  • Storage must be space-efficient on key names
  • Minimal setup overhead (aka, you don't have to set up mysql)
The goal of this is to provide a simple way to store and retrieve timestamped key-value pairs. It's a small piece of my hope for a monitoring tool set that is trivial to configure, easy to manipulate, easy to extend, and dynamic. I don't want to pre-declare data sources and data types (rrdtool, cacti) or make it difficult to add new collection sources. I'm hoping that by relying on the unix methodology (small tools that do one or two things well), this can be achieved. The next steps in this adventure of a monitoring system are:
  • a graphing system
  • aggregation functions
  • templating system (web, email, etc?)
Space efficiency on key names is achieved with a secondary database containing key => key-id mappings. Key IDs are 64-bit values, assigned sequentially starting at 1. We could use a hash function here, but I want a guarantee of zero collisions. The tradeoff is that data in the primary database is ordered by key id, which is insertion order, not lexicographical order.

This may become a problem if we want to read keys sequentially. However, if we scan the secondary database (the one mapping key => 64bit_keyid), we can get keys in lexicographical order for free. So iterating over all keys starting with the string 'networking' is still possible, but it will result in random-access reads on the primary database. This may be undesirable, so I'll have to think about whether or not this use case is necessary. There are some simple solutions to this, but I'm not sure which one best solves the general case.
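
A rough sketch of the layout I have in mind, using python's bsddb (Berkeley DB) module. The file and function names here are illustrative, not the actual prototype code:

import bsddb, struct, time

keys = bsddb.btopen('keys.db')   # key name => 64-bit key id
data = bsddb.btopen('data.db')   # (key id, timestamp) => value

def keyid(name):
    # Look up the key id for a name, assigning the next id if it's new.
    if keys.has_key(name):
        return struct.unpack('>Q', keys[name])[0]
    next_id = len(keys) + 1      # first id is 1; fine while keys are never deleted
    keys[name] = struct.pack('>Q', next_id)
    return next_id

def record(name, value, timestamp=None):
    if timestamp is None:
        timestamp = long(time.time() * 1000000)    # microseconds
    # Big-endian packing keeps all entries for one key id adjacent and
    # sorted by timestamp, so reads for a single key are sequential.
    data[struct.pack('>QQ', keyid(name), timestamp)] = str(value)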

Arbitrary key length is a requirement because I found the limitations of RRDTool annoying, where data source names cannot be more than 19 characters - lame! We end up being more space-efficient (8 bytes per name) for any length of data source name, at the cost of a lookup to find the 64-bit key id from the name.

I have some of the code written, and a demo that runs 'netstat -s' once a second for 5 seconds and records total ip packets inbound. The key is 'ip.in.packets':

ip.in.packets[1195634167358035]: 1697541
ip.in.packets[1195634168364598]: 1698058
ip.in.packets[1195634169368830]: 1698566
ip.in.packets[1195634170372823]: 1699079
ip.in.packets[1195634171376553]: 1699590
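
The demo loop itself is roughly this shape (the exact 'netstat -s' line it grabs varies by OS, and record() stands in for the storage call from the sketch above):

import re, subprocess, time

for _ in range(5):
    out = subprocess.Popen(['netstat', '-s'], stdout=subprocess.PIPE).communicate()[0]
    m = re.search(r'(\d+) total packets received', out)
    if m:
        record('ip.in.packets', int(m.group(1)))
    time.sleep(1)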

Distributed xargs

I like xargs. However, xargs becomes less useful when you want to run many cpu-intensive tasks in parallel with more parallelism than you have cpu cores local to your machine.

Enter dxargs. For now, dxargs is a simple python script that will distribute tasks in a similar way to xargs, but will distribute them to remote hosts over ssh. Basically, it's a threadpool of ssh sessions. An idle worker asks for something to do, which gets you the maximum throughput possible: your faster servers will be given more tasks to execute than slower ones, simply because they finish them sooner.
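
This isn't the actual dxargs code, but a stripped-down sketch of the worker-pool idea (ignoring the ssh control sockets, output collation, and flag parsing) looks something like this:

import Queue, subprocess, sys, threading

def worker(host, command, tasks):
    # Each idle worker pulls the next input set itself, so faster
    # hosts naturally end up running more tasks than slower ones.
    while True:
        try:
            args = tasks.get_nowait()
        except Queue.Empty:
            return
        subprocess.call(['ssh', host, command] + args)

tasks = Queue.Queue()
for line in sys.stdin:            # one input set per line, like xargs -n1
    tasks.put(line.split())

hosts = ['snack', 'scorn']        # stand-in for --hosts
workers = [threading.Thread(target=worker, args=(h, 'hostname', tasks))
           for h in hosts]
for w in workers:
    w.start()
for w in workers:
    w.join()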

As an example, let's run 'hostname' in parallel across a few machines for 100 total calls.

% seq 100 | ./dxargs.py -P0 -n1 --hosts "snack scorn" hostname | sort | uniq -c
    14 scorn.csh.rit.edu
    86 snack.home

# Now use per-input-set output collating:
% seq 100 | ./dxargs.py -P0 -n1 --hosts "snack scorn" --output_dir=/tmp/t 'uname -a'
% ls /tmp/t | tail -5
535.95.0.snack.1191918835
535.96.0.snack.1191918835
535.97.0.snack.1191918835
535.98.0.snack.1191918835
535.99.0.snack.1191918835
% cat /tmp/t/535.99.0.snack.1191918835
Linux snack.home 2.6.20-15-generic #2 SMP Sun Apr 15 06:17:24 UTC 2007 x86_64 GNU/Linux
Design requirements:
  • Argument input must work the same way as xargs (-n<num>, etc) and come from stdin
  • Don't violate POLA (the principle of least astonishment) where unnecessary - same flags as xargs.
Basically, I want dxargs to be a drop-in replacement for xargs with respect to compatibility. I may intentionally break compatibility later where it makes sense, however. Also, don't consider this first version POLA-compliant.

Neat features so far:

  • Uses OpenSSH Protocol 2's "Control" sockets (-M and -S flags) to keep the session handshaking down to once per host.
  • Each worker competes for work with the goal of having zero idle workers.
  • Collatable output to a specified directory by input set, pid, number, host, and time
  • '0' (aka -P0) for parallelism means parallelize to the same size as the host list
  • Ability to specify multiplicity by machine with notation like 'snack*4' to indicate snack can run 4 tasks in parallel
  • 'stdout' writing is wrapped with a mutex, so tasks can't interfere with output midline (I see this often with xargs)
Desired features (not yet implemented):
  • Retrying of input sets when workers malfunction
  • Good handling of ssh problems (worker connect timeouts, etc.)
  • More xargs and xapply behaviors

Download dxargs.py

Music sorting

My music collection was fairly randomly sorted. Until today. In the past, I had ripped some of my CDs to random locations, which inevitably moved around over the years for whatever reason. This made finding a particular song annoying without something that had already parsed the ID3 tags.

I tried to use iTunes to 'consolidate' my library into /path/[artist]/[album]/[title].mp3, but for whatever reason it would get some percentage complete and then die. I assume it had something to do with strange non-alphanumeric upper-ascii characters in the filenames, or the fact that iTunes was accessing the files over a samba share.

Whatever the reason, I had to fix this myself. I have two scripts now. One parses the ID3 tags and stores them in a file, and one uses that data to organize songs into the aforementioned directory structure.
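
The actual scripts are linked below; the core idea is just reading the tags and building a target path, something like this sketch (the mutagen library here is purely for illustration):

import os, shutil
from mutagen.easyid3 import EasyID3    # any ID3 library would do

def target_path(root, filename):
    tags = EasyID3(filename)
    artist = tags.get('artist', ['Unknown'])[0]
    album = tags.get('album', ['Unknown'])[0]
    title = tags.get('title', [os.path.basename(filename)])[0]
    return os.path.join(root, artist, album, title + '.mp3')

def migrate(root, filename):
    dest = target_path(root, filename)
    if not os.path.isdir(os.path.dirname(dest)):
        os.makedirs(os.path.dirname(dest))
    shutil.move(filename, dest)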

Scripts: findsongs.py and migratemusic.py

A letter to Ruby and Python.

Dear Ruby and Python,

Please implement Perl's (?{}) and (??{}) in your regular expression engines so I can do outrageous state machine and pattern matching in your languages. Thank you.

Love,
Jordan

Pyblosxom sorted tag cloud patch

wxs has this neat tag cloud thing on his site. I like it so much that I wanted it here on this site, so I installed the plugin last night and put it in place of the 'categories' section of the sidebar.

However, the default tag cloud isn't sorted, so finding things in it is, well, hard.

Here is a patch that will sort your tag cloud alphabetically.
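
The change itself is tiny. The gist (illustrative, not the literal patch) is just sorting the tag names case-insensitively before the cloud gets rendered:

# tag_counts is a dict of tag name => count; sort the names, not the counts.
tag_names = sorted(tag_counts.keys(), key=str.lower)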

Pulling album covers from Amazon

Amazon provides lots of web services. One of these is its E-Commerce API, which allows you to search its vast product database (among other things).

In Pimp, the page for any given listening station shows you the current song being played. Along with that, I wanted to provide the album cover for the current track.

You can leverage Amazon's API to search for a given artist and album, eventually leading you to a picture of the album cover. To that end, I wrote albumcover.py, a little python module that turns an artist and album name combination into a url to the album cover image. It works for the 20 or so tests I've put through it.
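
Roughly, the module builds an ItemSearch request against the music index and pulls an image URL out of the XML response. The sketch below is from memory of the ECS REST interface (parameter and element names may not be exact), not a copy of albumcover.py:

import urllib
from xml.dom import minidom

ECS_URL = 'http://webservices.amazon.com/onca/xml'

def album_cover_url(artist, album, access_key):
    params = urllib.urlencode({
        'Service': 'AWSECommerceService',
        'AWSAccessKeyId': access_key,
        'Operation': 'ItemSearch',
        'SearchIndex': 'Music',
        'Artist': artist,
        'Title': album,
        'ResponseGroup': 'Images',
    })
    doc = minidom.parse(urllib.urlopen(ECS_URL + '?' + params))
    images = doc.getElementsByTagName('LargeImage')
    if not images:
        return None
    return images[0].getElementsByTagName('URL')[0].firstChild.data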

Python's sad xml business and modules vs packages.

So, I've been reading docs on python's xml stuff, hoping there's something simple, or something that comes default with python, that'll let me do xpath. Everyone overcomplicates xml processing, and I have no idea why. Python seems to have enough alternatives to make dealing with xml less painful.

Standard python docs will lead you astray:

kenya(...ojects/pimp/pimp/controllers) % pydoc xml.dom | wc -l
643
Clearly, the pydoc for "xml.dom" has some nice things, right? I mean, documentation is clearly an indication that THE THING THAT IS DOCUMENTED IS AVAILABLE. Right?

Sounds great. Let's try to use this 'xml.dom' module!

kenya(...ojects/pimp/pimp/controllers) % python -c 'import xml; xml.dom'
Traceback (most recent call last):
  File "", line 1, in ?
AttributeError: 'module' object has no attribute 'dom'
WHAT. THE. HELL.

Googling around, it turns out that 'xml' is a fake module that only actually works if you have the 4Suite modules installed? Maybe?

Why include fake modules that provide complete documentation to modules that do not exist in the standard distribution?

Who's running this ship? I want off. I'll swim if necessary.

As it turns out, I made too strong an assumption about python's affinity towards java-isms. I roughly equated 'import foo' in python with 'import foo.*' in java. That was incorrect. Importing foo doesn't get you access to things in its directory; they have to be imported explicitly.

In summary, 'import xml' gets you nothing useful, and even 'import xml.dom' doesn't get you minidom. If you really want minidom's parser, you'll need 'import xml.dom.minidom' or a 'from ... import' variant.
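
For example, this works fine, while the bare 'import xml' version above does not:

% python -c 'import xml.dom.minidom; print xml.dom.minidom.parseString("<a/>").documentElement.tagName'
a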

On another note, the following surprised me. I had a module, foo/bar.py, and I figured 'from foo import *' would grab it. It doesn't, unless the package's __init__.py lists it in __all__. This also means 'from xml.dom import *' doesn't get you minidom and friends.

Perhaps I was hoping for too much, and maybe it's better to import explicitly anyway. But if that's the case, then why allow '*' imports from modules but not from packages?

SQLObject is love.

I'm in the process of feeling out my options for the pimp rewrite. I've started with Pylons. Pylons gives me an actual framework and lets me choose implementations.

The database backend is going to be SQLObject. I've been playing with it for 10 minutes, I haven't written a single line of SQL, and I've already got objects mapping to an sqlite database. Insertion is so cool: I simply instantiate an object and it gets inserted into the database. The potential to write database-backed systems without having to understand SQL is quite cool.
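
A sketch of what that looks like (the class and columns are made up for this example, and the connection URI is from memory):

from sqlobject import SQLObject, StringCol, connectionForURI, sqlhub

# Point SQLObject at an in-memory sqlite database.
sqlhub.processConnection = connectionForURI('sqlite:/:memory:')

class Song(SQLObject):
    artist = StringCol()
    title = StringCol()

Song.createTable()

# Instantiating the object is the insert -- no SQL anywhere.
Song(artist='They Might Be Giants', title='Birdhouse in Your Soul')
print list(Song.selectBy(artist='They Might Be Giants'))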

Neat. More on sqlobject as I learn it?

Error running scapy as non-root

If you get this error:
Traceback (most recent call last):
  File "/usr/local/bin/scapy", line 10647, in ?
    class Conf(ConfClass):
  File "/usr/local/bin/scapy", line 10670, in Conf
    iface = get_working_if()
  File "/usr/local/bin/scapy", line 2067, in get_working_if
    except pcap.pcapc.EXCEPTION:
AttributeError: 'module' object has no attribute 'pcapc'
This is because you aren't running scapy as root. Run it as root.

If you still get this error message, it's likely due to pcap failing to find usable network interfaces. This means you have no interfaces in the UP state. It doesn't count lo0 as a real interface, I guess?

Antispam pyblosxom plugin, followup!

REJECT: Comment attempt by 210.113.83.6 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
REJECT: Comment attempt by 210.120.79.179 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
REJECT: Comment attempt by 200.156.25.4 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
REJECT: Comment attempt by 220.125.164.243 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
REJECT: Comment attempt by 69.57.136.39 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
...
The list goes on. Well over 50 invalid tokens were found. 'pleaseDontSpam' was the original secret token I used. Just goes to show that, for the moment, most spam bots don't review the page before submitting.

Admittedly, 2 spams got through; I have not investigated why yet.