
Getting your python as rpms

I was working on a new python 2.6 rpm to push at work and started wondering how to turn python eggs into rpms. Ruby has a tool called gem2rpm that helps generate rpms from ruby gems, but there's no real egg2rpm equivalent for python.

We're in luck, though. Python's distutils (and therefore setuptools) supports generating rpms out of the box. Those 'python setup.py' invocations you may be accustomed to can trivially generate rpms.

The secret sauce is the 'bdist_rpm' command given to setup.py:

% wget -q http://boto.googlecode.com/files/boto-1.8d.tar.gz
% tar -zxf boto-1.8d.tar.gz
% cd boto-1.8d
% python setup.py bdist_rpm
% find ./ -name '*.rpm'
./dist/boto-1.8d-1.noarch.rpm
./dist/boto-1.8d-1.src.rpm
Piece of cake. I've tried this on a handful of python packages (boto, simplejson, etc), and they all seem to produce happy rpms.

However, if you have multiple versions of python available, you'll want to explicitly hardcode the path to python:

% python setup.py bdist_rpm --python /usr/bin/python2.6
% rpm2cpio dist/boto-1.8d-1.noarch.rpm | cpio -it | grep lib | head -3
2745 blocks
./usr/lib/python2.6/site-packages/boto-1.8d-py2.6.egg-info
./usr/lib/python2.6/site-packages/boto/__init__.py
./usr/lib/python2.6/site-packages/boto/__init__.pyc
The default python on this system is python 2.4. Doing the above forces a build against python2.6 - excellent, but maybe we're not quite there yet. What if you need this package for both python 2.4 and 2.6? For this, you'll need separate package names. However, the bdist_rpm command doesn't have a way of setting the rpm package name. One way is to hack setup.py with the new name:
% grep name setup.py
setup(name = "boto",
% sed -re 's/name *= *"([^"]+)"/name = "python24-\1"/'  setup.py > setup24.py
% grep name setup24.py
setup(name = "python24-boto",

# Now build the new rpm with the new package name, python24-boto
% python setup24.py bdist_rpm --python /usr/bin/python2.4
For our boto package, this creates an rpm with a new name: python24-boto. Hacking the setup.py script like this is nice because the command to build the rpm stays basically the same. The alternative would be to use 'python setup.py bdist_rpm --spec-only', edit the spec file, and then craft whatever rpmbuild command was necessary. The method above is less effort and trivially automatable with no knowledge of rpmbuild or specfiles. :)
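Since the rename is purely mechanical, the whole dance is easy to script. Here's a rough sketch of doing it for both pythons; the interpreter paths and the setup_<prefix>.py file names are my own choices, not anything bdist_rpm requires:

# Sketch: build a per-python rpm by rewriting the package name in setup.py.
import re
import subprocess

versions = {
    "python24": "/usr/bin/python2.4",
    "python26": "/usr/bin/python2.6",
}

original = open("setup.py").read()

for prefix, interpreter in versions.items():
    # Prefix the package name, e.g. name = "boto" -> name = "python24-boto"
    patched = re.sub(r'name\s*=\s*"([^"]+)"',
                     r'name = "%s-\1"' % prefix, original, count=1)
    patched_file = "setup_%s.py" % prefix
    open(patched_file, "w").write(patched)

    # Same bdist_rpm invocation as before, just with the patched setup script.
    subprocess.call(["python", patched_file, "bdist_rpm",
                     "--python", interpreter])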

Repeat this process for python26, and now we have two boto rpms for both pythons.

% rpm -Uvh python2?-boto-*noarch.rpm
Preparing...                ########################################### [100%]
   1:python26-boto          ########################################### [ 50%]
   2:python24-boto          ########################################### [100%]

% python2.4 -c 'import boto; print True'
True
% python2.6 -c 'import boto; print True'
True
Excellent.

C++Grok bindings working in Python

% python example.py "%SYSLOGDATE%" < /var/log/messages | head -1
{'MONTH': 'Mar', '=LINE': 'Mar 23 06:47:03 snack syslogd 1.4.1#21ubuntu3: restart.', '=MATCH': 'Mar 23 06:47:03', 'TIME': '06:47:03', 'SYSLOGDATE': 'Mar 23 06:47:03', 'MONTHDAY': '23'}
That's right. I can now use C++Grok from python.

After I saw it work, I immediately ran a time check against the perl version:

% seq 20000 > /tmp/x
% time python example.py "%NUMBER>5000%" < /tmp/x > /tmp/x.python
0.59s user 0.00s system 99% cpu 0.595 total
% time perl grok -m "%NUMBER>5000%" -r "%NUMBER%" < /tmp/x  > /tmp/x.perl
4.86s user 0.94s system 18% cpu 31.647 total
The same basic operation is 50x faster in python with the c++grok bindings than in the pure perl version. Excellent. Sample python code:
g = pygrok.GrokRegex()
g.add_patterns( <dictionary of patterns> )
g.set_regex("%NUMBER>5000%")
match = g.search("hello there 123 456 7890 pants")
if match:
  print match["NUMBER"]
# prints '7890'
I knew I wasn't doing reference counting properly, so to test that I ran the python code against an input set of 1000000 lines and watched the memory usage, which clearly showed leaking. I quickly read up on ref counting in Python and which functions return new or borrowed references. A few keystrokes later my memory leaks were gone. After that I put python in the test suite and am ready to push a new version of c++grok.

Download: cgrok-20080327.tar.gz

Python Build instructions:

% cd pygrok
% python setup.py install

# make sure it's working properly
% python -c 'import pygrok'
There is an example and some docs in the pygrok directory.

Let me know what you think :)

Python C++ Grok bindings

I've gotten quite a bit further tonight on making c++grok's functionality available in python.

Most of tonight's effort went into learning the python C api and how to add new objects and methods. I'm planning to have this ready for BarCampRochester3 in two weeks.

So far I can make new GrokRegex objects and call set_regex() and search() on them. Next time I'll be implementing GrokMatch objects (like in the C++ version) and a few other small things. Fun fun :)

Adventures in SWIG and Boost::Python

I spent much of tonight trying to do the least amount of work and get some kind of python bindings available from the C++ version of Grok.

Fail.

I ran into problem after problem with SWIG, all likely because I chose to write c++grok with templates. After failing on that repeatedly, I decided to try out Boost::Python. Also failure. I wasn't able to find docs explaining how to use boost::python without Boost's retarded bjam build system! Fine, so I tried to use bjam. After more repeated failure at simply getting a hello world example working with bjam, I think I'm giving up for tonight.

Here's a request: Don't make me use your retarded build system.

I fully admit I haven't spent half a lifetime poring over the Boost::Python documentation, but should I really have to learn an entirely new make(1)-like system just to compile things? With SWIG, at least the errors were readable and I was able to get things to compile without issue - I just couldn't figure out quickly how to expose the few templated classes c++grok has.

I'm closer to a working python module with SWIG, but the Boost::Python syntax is quite nice and is in pure C++ from what I can tell.

Ugh! Maybe I'll have better luck next time.

I found SWIG's template instantiation support:

%template(SGrokRegex) GrokRegex<sregex>;
%template(SGrokMatch) GrokMatch<sregex>;
...
But compiling this breaks because mark_tag in xpressive seems to lack a default constructor, and the swig-generated code wants to call one:
grok_wrap.cpp:3987: error: no matching function for call to 'boost::xpressive::detail::mark_tag::mark_tag()'
/usr/include/boost/xpressive/regex_primitives.hpp:41: note: candidates are: boost::xpressive::detail::mark_tag::mark_tag(int)
/usr/include/boost/xpressive/regex_primitives.hpp:40: note:                 boost::xpressive::detail::mark_tag::mark_tag(const boost::xpressive::detail::mark_tag&)
Several of the above errors are emitted when compiling... I'll try more tomorrow.

Do we need another window manager?

I've been doing various Xlib projects off and on for a few years, but none of them have been window manager projects because I was using a WM that pleased me: Ion. Many years later, after following ion through versions 1, 2, and now 3, the author decided to apply some user-unfriendly licensing terms to newer releases of ion-3. This license change, combined with the author's efforts to make distributions comply with it, has resulted in most distributions dropping the ion-3 package because nobody wants to deal with assholes.

I'm not going to get into a discussion about my opinions about the license. Just know that it inconveniences me, and if you know me, you know that I tend to solve problems of inconvenience with new software tools. That means I need a new window manager.

I've tested other window managers, but none fit me as well as ion did.

A few weeks ago I started on a window manager project tentatively called tsawm which implements the features I like in ion but without the angry-author problems ion has. I started working on it initially in C, since that's where I use xlib, but C has some drawbacks. A nontrivial percentage of what I perceive to be window manager behavior is basically managing some hierarchy of data (frames, client windows, titles, some state). I started looking at Perl's X11::Protocol and Python's xlib module. Python's xlib module is pretty neat, in that it's a pure-python implementation of the X11 protocol.

Somewhat arbitrarily, I started prototyping to see if writing a window manager in python was possible. Yes, it is. So that's where I'm at today.
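For the curious, the skeleton of such a prototype is surprisingly small. This isn't tsawm, just a minimal sketch of the python-xlib pieces involved - claiming SubstructureRedirect on the root window and honoring map/configure requests:

# Minimal sketch of a python-xlib window manager event loop (not tsawm).
from Xlib import X, display

d = display.Display()
root = d.screen().root

# Only one client may select SubstructureRedirect on the root window;
# this is what makes us "the" window manager. If another WM is running,
# the X server answers with a BadAccess error.
root.change_attributes(event_mask=X.SubstructureRedirectMask
                                  | X.SubstructureNotifyMask)
d.sync()

while True:
    event = d.next_event()
    if event.type == X.MapRequest:
        # A naive WM: just let the window appear as requested.
        event.window.map()
    elif event.type == X.ConfigureRequest:
        # Pass geometry requests through unchanged.
        event.window.configure(x=event.x, y=event.y,
                               width=event.width, height=event.height)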

I've mostly been hacking things together while learning more about window managing in X, but what I have so far is promising: screenshot.

It's not pretty, but finishing this will help me get past the drama and problems that ion and its author bring. Sorry tuomov, I still love ion, but any licenses that keep me (directly or indirectly) from getting shit done aren't acceptable.

C vs Python with Berkeley DB

I've got a stable, threaded version of this fancydb tool I've been working on. However, the performance of insertions is less than optimal.

Then again, how much should insert performance matter on a monitoring tool? For data that trickles in gradually, speed doesn't matter much. For bulk inserts, speed matters if you want to get your work done quickly. I haven't decided whether bulk insertion is a necessary use case for this tool. Despite that, I'm still interested in what the limits are.

I have experimented with many different implementations of parallelism, buffering, caching, etc., in the name of making insertion into a fancydb with 10 rules fast. The fastest I've gotten was 10000/sec, but that was with an implementation that used threads yet wasn't threadsafe.

My most-recent implementation (which should be threadsafe) can do reads and writes at 30000/sec. With evaluation rules the write rate drops to about 10000/sec.

The next task was to figure out what I was doing wrong. For comparison, I wrote two vanilla bdb accessing programs. One in C and one in Python. The output of these two follows:

# The args for each program is: insertions page_size cache_size
% sh runtest.sh
Running: ./test 2000000 8192 10485760
  => 2000000 inserts + 1 fullread: 209205.020921/sec
Running: ./py-bsddb.py 2000000 8192 10485760
  => 2000000 inserts + 1 fullread: 123304.562269/sec
As expected, C clearly outperforms Python here, but the margin is pretty small (C is 69% faster for this test). Given the 120000/sec rate from Python, the poor insert rate of my tool seems to be my own fault. Is my additional code really the reason I can only write at 30000 per second? I may need to revisit how I'm implementing things in python. I'm not clear right now where I'm losing so much throughput.
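For reference, the python side of the comparison is roughly this shape - a sketch, not the exact py-bsddb.py; the database path and key/value format here are made up:

# Rough sketch of a python bdb insert speed test (not the exact py-bsddb.py).
import sys
import time
from bsddb import db   # pybsddb, shipped with python 2.x

count, page_size, cache_size = map(int, sys.argv[1:4])

d = db.DB()
d.set_pagesize(page_size)
d.set_cachesize(0, cache_size)    # (gbytes, bytes)
d.open("/tmp/speedtest.db", None, db.DB_BTREE, db.DB_CREATE)

start = time.time()
for i in xrange(count):
    d.put("key%d" % i, "value%d" % i)
d.close()

elapsed = time.time() - start
print "%d inserts: %f/sec" % (count, count / elapsed)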

So I use hotshot (python standard profiler) and I find that most of the time is spent in my iterator method. This method is a generator method which uses yield and loops over a cursor.

It's important to note that my python bdb 'speed test' above did not use generators; it used a plain while loop over the cursor. So I wrote another test that uses generators. First, let's try just inserts, no reading of data:

Running: ./test 1000000 8192 10485760
  => 1000000 inserts: 261096.605744/sec
Running: ./py-bsddb.py 1000000 8192 10485760
  => 1000000 inserts: 166389.351082/sec
Now let's try with 3 different python reading methods: while loop across a cursor, generator function (using yield), and an iterator class (implementing __iter__):
Running: ./py-bsddb.py 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 8.660000
Running: ./py-bsddb_generator.py 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 9.124000
Running: ./py-bsddb_iterable_class.py 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 13.130000
I'm not sure why implementing an iterator class is so much slower (in general) than a yield-generator is. Seems strange; perhaps my testing code is busted. Either way, I'm not really closer to finding the slowness.
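For reference, the three read styles look roughly like this (a sketch assuming an already-open bsddb db.DB handle named d; this is not the exact test code):

# Three ways to walk a bdb cursor; d is an open bsddb db.DB handle.

# 1. Plain while loop over a cursor (the fastest of the three above).
def read_while(d):
    cursor = d.cursor()
    record = cursor.first()
    while record:
        key, value = record
        record = cursor.next()

# 2. Generator function using yield.
def read_generator(d):
    cursor = d.cursor()
    record = cursor.first()
    while record:
        yield record
        record = cursor.next()

# 3. Iterator class implementing __iter__/next (the slowest of the three).
class ReadIterator(object):
    def __init__(self, d):
        self.cursor = d.cursor()
        self.record = self.cursor.first()

    def __iter__(self):
        return self

    def next(self):
        if not self.record:
            raise StopIteration
        record, self.record = self.record, self.cursor.next()
        return record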

get this code here

fancydb performance

Various trials with basically the same input set: 2.5 million row entries, at most 1 entry per second. The insertion rate drops by 60% if you add rule evaluations, which is an unfortunate performance loss. I'll work on making rules less invasive. Since python threading will never run on two processors at once, I can't gain significant performance by sharding rule processing into separate threads; most unfortunate. Maybe fork+ipc is necessary here, but I'm somewhat loath to do that.

The slowdown when rules are present is due to the record keeping done to note that a rule should be evaluated again (rule evaluations are queued). Basically, the 'is this row being watched by a rule' loop is the slowdown. I'll try attacking this first.
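The likely fix is to index rules by the row they watch so the per-insert check becomes a dictionary lookup instead of a loop over every rule. Something like this sketch - the names here are made up, not fancydb's actual internals:

# Sketch: index rules by watched row so each insert is a dict lookup
# instead of a loop over every rule. All names here are made up.
class RuleIndex(object):
    def __init__(self):
        self.rules_by_row = {}    # row name => list of rules watching it

    def add_rule(self, watched_row, rule):
        self.rules_by_row.setdefault(watched_row, []).append(rule)

    def rules_for(self, row):
        # Constant-time-ish lookup instead of scanning all rules per insert.
        return self.rules_by_row.get(row, [])

# On each insert, something like:
#   for rule in index.rules_for(row):
#       queue_rule_evaluation(rule, row, timestamp)   # hypothetical helper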

With 2 rules (unoptimized rules):
    hits.minute => hits.mean.1hour @ 60*60
    hits.minute => hits.mean.1day @ 60*60*24
  insertion rate = 7600/sec

With 2 rules (optimized chaining)
    hits.minute => hits.mean.1hour @ 60*60
    hits.mean.1hour => hits.mean.1day @ 60*60*24
  insertion rate = 12280/sec

With 9 rules (optimized chaining):
  insertion rate: 10000/sec

With 0 rules:
  trial 1: 40000/sec
  trial 2: 26700/sec

Storage utils, eventdb, etc.

Spent lots of time over thanksgiving playing with bdb in python.

Again, I still don't have releaseworthy code, but here's a snippet of rrdtool-like behavior from this system:

% ./evtool.py create /tmp/webhits.db
% ./evtool.py addrule /tmp/webhits.db http.hit agg.http.hit.daily total $((60*60*24)) time
% time cat webhits.data | ./evtool.py update /tmp/webhits.db -
11.10s user 0.80s system 94% cpu 12.627 total
% time ./evtool.py graph /tmp/webhits.db agg.http.hit.daily  
0.49s user 0.11s system 96% cpu 0.624 total
The result is exactly the same graph as mentioned in my previous post. Speed so far is pretty good: the input was 125000 entries in 12.6 seconds, which equates roughly to 10000 updates per second. That kind of QPS seems pretty reasonable.

The primary difference today is that the aggregates are computed as data enters the system. 'Addrule' tells the database to schedule an aggregation for specific timestamps.

The goal is to be able to chain rules, and have N:M relationships between rule input and output. Those will happen soon. Chaining would've happened tonight, but I'm having some locking problems due to it being quite late ;)

The database code itself is designed to be reusable elsewhere. There are two primary classes: SimpleDB and FancyDB. SimpleDB lets you store and retrieve data based on row+timestamp => value pairs. FancyDB wraps SimpleDB and gives you operation listeners such as the rule used in the above example.
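To illustrate the split between the two layers, here's a toy sketch of that architecture - the class names are real, but everything else (method names, the dict-backed storage) is made up for illustration; the real code sits on bdb:

# Toy sketch of the SimpleDB/FancyDB split described above (not the real code).
class SimpleDB(object):
    def __init__(self):
        self.data = {}    # the real thing uses bdb, not a dict

    def set(self, row, timestamp, value):
        self.data[(row, timestamp)] = value

    def get(self, row, timestamp):
        return self.data.get((row, timestamp))

class FancyDB(object):
    def __init__(self, simpledb):
        self.db = simpledb
        self.listeners = {}    # row name => list of callbacks

    def add_listener(self, row, callback):
        self.listeners.setdefault(row, []).append(callback)

    def set(self, row, timestamp, value):
        self.db.set(row, timestamp, value)
        for callback in self.listeners.get(row, []):
            callback(row, timestamp, value)

# Roughly what the addrule example does, minus the actual aggregation work:
store = FancyDB(SimpleDB())
store.add_listener("http.hit",
                   lambda row, ts, value: None)   # schedule aggregation here
store.set("http.hit", 1196136000, 1)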

I've already used SimpleDB elsewhere; in the sms traffic tool I mentioned in my last post, I cache geocode data and traffic requests with this same database tool.

Google Maps Traffic to my phone.

Combining xulrunner, the Google Maps API, procmail, and imagemagick, I now have a way to request traffic data from google maps, all from my phone using only email (sms/mms).

The project itself isn't very polished, so I won't publish its location. However, I forwarded one traffic message from my phone to flickr. View it here. The picture is rotated because my phone's screen is taller than it is wide.

The entire process takes about 20 seconds (grab the map, screencapture, and email back to the phone).

Code for this lives here: https://semicomplete.googlecode.com/svn/traffic

Playing with graphing; matplotlib

webhits.data contains updates of this format:
http.hit@<timestamp>:1
http.hit@<timestamp>:1
http.hit@<timestamp>:1
http.hit@<timestamp>:5
The values are hits seen in a single second to this website. This particular data set includes only the past month's worth of data.

Let's graph "total hits per hour" over time.

% ./evtool.py update /tmp/webhits.db - < webhits.data
% ./evtool.py fetchsum /tmp/webhits.db $((60 * 60)) http.hit
60*60 is 3600 seconds, aka 1 hour: hits, 1 hour. I also reran it with 60*60*24 for 24-hour totals: hits, 1 day.
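The plots come from matplotlib (hence the title). A minimal sketch of that plotting step, assuming the aggregated data ends up as a list of (unix timestamp, count) pairs (that format is my assumption, and the data below is placeholder):

# Minimal sketch of plotting hourly totals with matplotlib.
from datetime import datetime
import matplotlib.pyplot as plt

totals = [(1193900400, 480), (1193904000, 512), (1193907600, 495)]

times = [datetime.fromtimestamp(ts) for ts, count in totals]
counts = [count for ts, count in totals]

plt.plot(times, counts)
plt.ylabel("hits per hour")
plt.savefig("hits-hourly.png")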

The data aggregation may be incorrect; not sure if I really got 12K hits on each of the first few days this month. However, using fex+awk+sort on the logfiles themselves shows basically the same data:

 % cat access.* | ~/projects/fex/fex '[2 1:1' | countby 0  | sort -k2 | head -3
 11534 01/Nov/2007
 11488 02/Nov/2007
 11571 03/Nov/2007
Actually looking at the logs shows 5K hits from a single IP on 01/Nov/2007, and it's the googlebot.