Jordan Sissel
geek

Tue, 27 Nov 2007

fancydb performance

Various trials with basically the same input set: 2.5 million row entries, at most 1 entry per second. The insertion rate drops by 60% if you add rule evaluations, which is an unfortunate performance loss. I'll work on making rules less invasive. Unfortunately, python threads will never run on two processors at once, so I can't gain significant performance by sharding rule processing across separate threads; most unfortunate. Maybe fork+ipc is necessary here, but I am somewhat loath to do that.

The slowdown when rules are present is due to the record keeping that is done to note that a rule should be evaluated again (rule evaluations are queued). Basically, the loop 'is this row being watched by a rule' is the slowdown. I'll try attacking this first.
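
One common way to avoid a per-insert loop over every rule is to index the rules by the row they watch. The sketch below is only an illustration of that idea, not the actual fancydb code; `RuleIndex`, `watch`, and `on_insert` are made-up names.

```python
from collections import defaultdict, deque


class RuleIndex:
    """Hypothetical sketch: map each watched row name to the rules watching it."""

    def __init__(self):
        self.watchers = defaultdict(list)  # row name -> list of rule names
        self.pending = deque()             # queued (rule, row, timestamp) evaluations

    def watch(self, row, rule):
        self.watchers[row].append(rule)

    def on_insert(self, row, timestamp):
        # One dict lookup per insert instead of a scan over every rule.
        # .get() avoids creating empty entries for rows nobody watches.
        for rule in self.watchers.get(row, ()):
            self.pending.append((rule, row, timestamp))
```

With this layout, an insert into a row that no rule watches costs a single failed dictionary lookup.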

With 2 rules (unoptimized rules):
    hits.minute => hits.mean.1hour @ 60*60
    hits.minute => hits.mean.1day @ 60*60*24
  insertion rate = 7600/sec

With 2 rules (optimized chaining):
    hits.minute => hits.mean.1hour @ 60*60
    hits.mean.1hour => hits.mean.1day @ 60*60*24
  insertion rate = 12280/sec

With 9 rules (optimized chaining):
  insertion rate = 10000/sec

With 0 rules:
  trial 1: 40000/sec
  trial 2: 26700/sec
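
To put the numbers above side by side, here is a small calculation of the slowdown relative to the two 0-rule trials. It uses only the rates listed in this post; the helper name `percent_drop` is made up for the example.

```python
baseline = [40000, 26700]  # 0 rules, trials 1 and 2 (inserts/sec)
with_rules = {
    "2 rules, unoptimized": 7600,
    "2 rules, chained": 12280,
    "9 rules, chained": 10000,
}


def percent_drop(before, after):
    """Percentage decrease going from `before` to `after`."""
    return 100.0 * (before - after) / before


for label, rate in with_rules.items():
    drops = [percent_drop(b, rate) for b in baseline]
    print(f"{label}: {min(drops):.0f}% to {max(drops):.0f}% slower than 0 rules")
```

For the chained 2-rule case this works out to roughly 54% to 69% depending on which baseline trial you compare against, which brackets the 60% figure mentioned at the top.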

Permalink: /geekery/fancydb-performance-20071126
posted at: 03:45

Mon, 26 Nov 2007

Storage utils, eventdb, etc.

Spent lots of time over Thanksgiving playing with bdb in python.

Again, I still don't have releaseworthy code, but here's a snippet of rrdtool-like behavior from this system:

% ./evtool.py create /tmp/webhits.db
% ./evtool.py addrule /tmp/webhits.db http.hit agg.http.hit.daily total $((60*60*24)) time
% time cat webhits.data | ./evtool.py update /tmp/webhits.db -
11.10s user 0.80s system 94% cpu 12.627 total
% time ./evtool.py graph /tmp/webhits.db agg.http.hit.daily  
0.49s user 0.11s system 96% cpu 0.624 total

The result is exactly the same graph as mentioned in my previous post. Speed so far is pretty good. The input was 125000 entries in 12.6 seconds, which equates to roughly 10000 updates per second. That kind of QPS seems pretty reasonable.

The primary difference today is that the aggregates are computed as data enters the system. 'Addrule' tells the database to schedule an aggregation for specific timestamps.

The goal is to be able to chain rules, and have N:M relationships between rule input and output. Those will happen soon. Chaining would've happened tonight, but I'm having some locking problems due to it being quite late ;)

The database code itself is designed to be reusable elsewhere. There are two primary classes: SimpleDB and FancyDB. SimpleDB lets you store and retrieve data based on row+timestamp => value pairs. FancyDB wraps SimpleDB and gives you operation listeners such as the rule used in the above example.
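
To make the SimpleDB/FancyDB split concrete, here is a minimal in-memory sketch. It is not the real code (which sits on bdb and isn't released); the method names `set`, `get`, `items`, `add_listener`, and the `total_rule` helper are all assumptions made for illustration, and timestamps are plain seconds here.

```python
from collections import defaultdict


class SimpleDB:
    """In-memory stand-in: stores (row, timestamp) => value."""

    def __init__(self):
        self.data = defaultdict(dict)

    def set(self, row, timestamp, value):
        self.data[row][timestamp] = value

    def get(self, row, timestamp, default=None):
        return self.data[row].get(timestamp, default)

    def items(self, row):
        return sorted(self.data[row].items())


class FancyDB:
    """Wraps a SimpleDB and calls listeners whenever a watched row is written."""

    def __init__(self, db):
        self.db = db
        self.listeners = defaultdict(list)  # row -> list of callbacks

    def add_listener(self, row, callback):
        self.listeners[row].append(callback)

    def set(self, row, timestamp, value):
        self.db.set(row, timestamp, value)
        for callback in self.listeners.get(row, ()):
            callback(self, row, timestamp, value)


def total_rule(output_row, interval):
    """Build a listener that keeps a running total per interval-sized bucket."""
    def listener(fancy, row, timestamp, value):
        bucket = timestamp - (timestamp % interval)
        current = fancy.db.get(output_row, bucket, 0)
        # Writing through FancyDB.set means listeners on output_row fire too,
        # which is what would make rule chaining possible.
        fancy.set(output_row, bucket, current + value)
    return listener
```

Usage would look something like `db = FancyDB(SimpleDB())` followed by `db.add_listener("http.hit", total_rule("agg.http.hit.daily", 60 * 60 * 24))`, after which every `db.set("http.hit", ts, n)` updates the daily total as the data enters.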

I've already used SimpleDB elsewhere; in the sms traffic tool I mentioned in my last post, I cache geocode data and traffic requests with this same database tool.

Permalink: /geekery/storageutils-db-graphs-etc
posted at: 06:55

Sat, 24 Nov 2007

Google Maps Traffic to my phone.

Combining xulrunner, the Google Maps API, procmail, and imagemagick, I now have a way to request traffic data from Google Maps, all from my phone using only email (sms/mms).

The project itself isn't very polished, so I won't publish its location. However, I forwarded one traffic message from my phone to flickr. View it here. The picture is rotated because my phone's screen is taller than it is wide.

The entire process takes about 20 seconds (grab the map, screencapture, and email back to the phone).

Permalink: /geekery/sms-traffic-reports
posted at: 06:25

Thu, 22 Nov 2007

Playing with graphing; matplotlib

webhits.data contains updates of this format:
http.hit@1193875199000000:1
http.hit@1193875200000000:1
http.hit@1193875213000000:1
http.hit@1193875214000000:5

Each value is the number of hits this website saw in a single second. This particular data set includes only the past month's worth of data.
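
For reference, a line in this format can be parsed with a few lines of python. This is a sketch, not the parser in evtool.py; `parse_update` is a made-up name, and it assumes integer values (as in this data set) and that the last `@` separates the key from the timestamp. The sample timestamps have 16 digits, so they look like microseconds since the unix epoch.

```python
def parse_update(line):
    """Parse 'key@timestamp:value' into a (key, timestamp, value) tuple."""
    # rsplit so that a key containing '@' would still parse (an assumption).
    key, rest = line.strip().rsplit("@", 1)
    timestamp, value = rest.split(":", 1)
    return key, int(timestamp), int(value)


sample = "http.hit@1193875214000000:5"
print(parse_update(sample))
```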

Let's graph "total hits per hour" over time.

% ./evtool.py update /tmp/webhits.db - < webhits.data
% ./evtool.py fetchsum /tmp/webhits.db $((60 * 60)) http.hit

60*60 is 3600, aka 1 hour (graph: hits, 1 hour). I also reran it with 60*60*24, aka 24-hour totals (graph: hits, 1 day).
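
The "total hits per interval" idea boils down to rounding each timestamp down to the start of its bucket and summing. Here's a rough sketch of that, assuming microsecond timestamps like the sample data; `fetch_sum` is a made-up function, not the actual fetchsum implementation.

```python
from collections import defaultdict

USEC_PER_SEC = 1_000_000


def fetch_sum(entries, interval_seconds):
    """Sum values into fixed-width time buckets.

    `entries` is an iterable of (timestamp_usec, value) pairs.
    Returns a sorted list of (bucket_start_usec, total) pairs.
    """
    width = interval_seconds * USEC_PER_SEC
    totals = defaultdict(int)
    for timestamp, value in entries:
        totals[timestamp - timestamp % width] += value
    return sorted(totals.items())


entries = [
    (1193875199000000, 1),
    (1193875200000000, 1),
    (1193875213000000, 1),
    (1193875214000000, 5),
]
for bucket, total in fetch_sum(entries, 60 * 60):
    print(bucket, total)
```

Swapping `60 * 60` for `60 * 60 * 24` gives daily totals from the same entries.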

The data aggregation may be incorrect; not sure if I really got 12K hits on each of the first few days this month. However, using fex+awk+sort on the logfiles themselves shows basically the same data:

 % cat access.* | ~/projects/fex/fex '[2 1:1' | countby 0  | sort -k2 | head -3
 11534 01/Nov/2007
 11488 02/Nov/2007
 11571 03/Nov/2007
Actually looking at the logs shows 5K hits from a single IP on 01/Nov/2007, and it's the googlebot.

Permalink: /geekery/eventdb-graphing
posted at: 04:18

Wed, 21 Nov 2007

Python, event tracking, and bdb.

In previous postings, I've shared thoughts on monitoring and data aggregation. Last night, I started prototyping a python tool to record arbitrary data. Basically it aims to solve the problem of "I want to store data" in a simple way, rather than having to set up a mysql database, user access, tables, a web server, etc, all in the name of viewing graphs and data reports.

The requirements are pretty simple and generic:

  • Storage will be mapping key+timestamp => value
  • Timestamps should be 64bit values, so we can use microsecond values (about 51 bits to represent the current unix epoch in microseconds)
  • Access method is assumed to be random access by key, but reading multiple timestamp entries for a single key is expected to be sequential.
  • Key names must be arbitrary length
  • Storage must be space-efficient on key names
  • Values are arbitrary.
  • Minimal setup overhead (aka, you don't have to set up mysql)

The goal of this is to provide a simple way to store and retrieve timestamped key-value pairs. It's a small piece of the hope that there can be a monitoring tool set that is trivial to configure, easy to manipulate, easy to extend, and dynamic. I don't want to pre-declare data sources and data types (rrdtool, cacti), or make it difficult to add new collection sources, etc. I'm hoping that by relying on the unix methodology (small tools that do one or two things well), this can be achieved. The next steps in this adventure of a monitoring system are:
  • a graphing system
  • aggregation functions
  • templating system (web, email, etc?)

Space efficiency on key names is achieved with a secondary storage containing a list of key to keyid mappings. Key IDs are 64bit values. The first value is 1. We could use a hash function here, but I want a guarantee of zero collisions. However, this means that keys are stored as key ids in insertion order, not lexicographical order.

This may become a problem if we want to read keys sequentially. However, if we scan the secondary database (one mapping key => 64bit_keyid) we can get keys in lexicographical order for free. So iterating over all keys starting with the string 'networking' is still possible, but it will result in random-access reads on the primary database. This may be undesirable, so I'll have to think about whether or not this use case is necessary. There are some simple solutions to this, but I'm not sure which one best solves the general case.

Arbitrary key length is a requirement because I found the limitations of RRDTool annoying, where data source names cannot be more than 19 characters - lame! We end up being more space efficient (8 bytes per name) for any length of data source name, at the cost of a lookup to find the 64bit key id from the name.
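
Here's a rough in-memory sketch of the key => keyid idea, to show the shape of it. None of this is the real code: `KeyIndex`, `with_prefix`, and `primary_key` are hypothetical names, a dict stands in for the secondary bdb database, and packing the primary key as big-endian bytes is my assumption (it makes byte-wise ordering match numeric ordering).

```python
import struct


class KeyIndex:
    """Hypothetical stand-in for the secondary key => 64bit keyid mapping."""

    def __init__(self):
        self.ids = {}     # key name -> key id
        self.next_id = 1  # the first id handed out is 1

    def keyid(self, name):
        # Ids are assigned in insertion order, so there are no collisions.
        if name not in self.ids:
            self.ids[name] = self.next_id
            self.next_id += 1
        return self.ids[name]

    def with_prefix(self, prefix):
        # Scanning the mapping in sorted order gives lexicographical key order,
        # even though the ids themselves are in insertion order.
        return [(k, self.ids[k]) for k in sorted(self.ids) if k.startswith(prefix)]


def primary_key(keyid, timestamp):
    # 8 bytes of key id followed by 8 bytes of timestamp, both unsigned
    # big-endian, so entries for one key sort together by timestamp.
    return struct.pack(">QQ", keyid, timestamp)
```

Every entry then costs a fixed 16 bytes of key regardless of how long the name is, and the entries for a single key id sit next to each other in timestamp order.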

I have some of the code written, and a demo that runs 'netstat -s' once a second for 5 seconds and records total ip packets inbound. The key is 'ip.in.packets'

ip.in.packets[1195634167358035]: 1697541
ip.in.packets[1195634168364598]: 1698058
ip.in.packets[1195634169368830]: 1698566
ip.in.packets[1195634170372823]: 1699079
ip.in.packets[1195634171376553]: 1699590

Permalink: /geekery/python-data-bdb-and-friends
posted at: 03:41

Tue, 20 Nov 2007

New fex version available (20071119)

Hop on over to the fex project page and download the new version.

Changelist:

20071119 -
  - Add nongreedy tokenizer. Same semantics of strtok_r(), but doesn't skip
    empty tokens.
  - Renamed tokenizer to split, since really that's what it was doing.
  - You can invoke the nongreedy tokenizer by using '?' as the first character
    of a {} set:
     args: :{?4,6}
     input: one:::four::six
     output: four:six
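
To illustrate the difference between the two behaviors, here's a python sketch of the semantics. fex itself is not written in python, and these helper names are made up; dropping out-of-range field numbers is an assumption for the sake of the example, not a statement about what fex does.

```python
def split_nongreedy(text, sep):
    # Every separator ends a field, so empty fields are kept.
    return text.split(sep)


def split_greedy(text, sep):
    # strtok_r()-like: runs of separators collapse and empty tokens are skipped.
    return [tok for tok in text.split(sep) if tok]


def pick(fields, indexes, sep):
    # 1-based field selection; out-of-range indexes are dropped (an assumption).
    return sep.join(fields[i - 1] for i in indexes if 0 < i <= len(fields))


line = "one:::four::six"
print(split_nongreedy(line, ":"))
print(split_greedy(line, ":"))
print(pick(split_nongreedy(line, ":"), [4, 6], ":"))
```

With the nongreedy split, 'four' really is field 4 and 'six' is field 6, which matches the example in the changelist. With the greedy split there are only three fields, so asking for 4 and 6 finds nothing.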

Permalink: /geekery/fex-20071119
posted at: 00:15

Sun, 11 Nov 2007

Dublin and MashupCamp 2007 Europe

I spent slightly over a week in Ireland. The weekdays were spent with fellow Googlers at the office, and the weekend was spent at Mashup Camp.

The week was pretty great. I went on the viking splash tour of Dublin. The tour was anything but informative, but despite that it was a really fun time. The guide mixed facts about historical Dublin with jokes about the shops, area, and Bono (of U2). The difference between the viking splash and other tours was that we wore viking hats, screamed at people on the street, and ended the tour with a ride through one of the canals. The canal ride was made possible because of the buses used in the tour, which were amphibious vehicles from WWII. The Google folks I've met here in Dublin are excellent.

The most recent weekend was Mashup Camp Europe, held in Dublin at the Guinness Storehouse. The format was a conference/unconference hybrid.

The first day, Saturday, was filled with many presentations about mashup-enabling tools. There was only one track due to the small size of the event.

I must admit I felt drowned in the IBM talks. There were 3 talks on IBM's fancy new mashup-enabling tool, all of which basically restated the same things in nearly the same way. Three hours of the same tool demo doesn't really make for much educational value. I absolutely appreciate IBM helping to sponsor the event, but seriously, there needs to be more content!

Someone from Microsoft Ireland gave a talk and demo about Popfly, which was pretty cool. Both the presentation and the product felt very un-Microsoft: the interface was very interactive, animated, and helpful; the presentation and presenter were somewhat modern and informative. I was expecting something with the burdens and weight of an Office product, but I was pleasantly surprised. The only thing I was left questioning was the target audience of Popfly, which seems to be nontechnical end users. I'm not wise to the marketing and demographic data, so I may be wrong in thinking that targeting end users is a bad move. Let's hope I am: if end users start mashing up content in new and wonderful ways, that'd be great!

I met up with Chad Dickerson from Yahoo!, who I'd met at Yahoo! Hack Day last year, in addition to meeting a dozen or so new folks. I'm a little surprised he remembered me, but I'm always happy to leave an impression upon people. One of the benefits of being at a technical event thousands of miles from home is that you tend to mingle with a set of people who are far outside the set of people who attend Bay Area technical events. Meeting new people is great :)

Half-way through Saturday, I found myself picking up parts of the Irish accent, which was a bit strange and I had to struggle not to lean towards the local accent and language. Lost cause, really.

After boozing with lots of fellow mashup campers at a few bars, I followed Chad and Tom (both of Yahoo!) around the Temple Bar district as they filmed locals asking questions such as "What is a mashup?" The drunk answers to these questions were fantastic.

I walked myself home after acquiring a map of the area.

I arrived on the second day of Mashup Camp around 11AM (local Dublin time). Basically, this was just in time for lunch. I caught the end of a presentation by Serena, which, unlike drowning in IBM's presentations, did not make me nauseous. There was an 8-minute video keynote recorded by Tim Berners-Lee about his recent projects. I'd never seen Tim before, and he reminded me a lot of Kevin Spacey. Then there was lunch, where I met a few more folks. Lunch concluded with a keynote by Chad (mentioned previously) about Yahoo! developer tools and a few other topics.

After Chad's talk was the start of the Mashup Camp open space sessions. I was the first to sign up for a session, which I intended on being a "look at this neat thing" session. I merged my slot with another camper who wanted to talk about scaling.

My talk basically covered Halo 3, Bungie's online player map, and graphing two-dimensional data over time. I played this video. The video was generated using perl, make, Image-Magick, and mencoder. The map images were downloaded with cron, every 15 minutes. I pointed out some interesting data discovered by watching the movie: Someone is playing in Sydney, Japan, New York, London, and a few other places with general coverage 100% of the time. I'll put up the scripts that generated the video soon.

Sunday night started at the Bankers' bar one street south of the Temple Bar district. Someone had volunteered to pay for the food and drinks; a native Irishman put it best, "This is like an Irishman's wet dream!" Free drinks are pretty sweet. I met more people there, too. After the open-bar closed, we wandered towards Temple Bar in search of somewhere with food. After finding many places weren't serving food anymore, we finally settled at some random pub with the kitchen still open. I ordered some chicken thing, but for an appetizer David (the organizer of Mashup Camp) and I split 'black and white pudding' which sounded pretty scary, even when a native described it. Turns out it was just sausages, and they were pretty good.

I leave for the airport in an hour, and I'm quite sad to leave. Thus far, Dublin has been far beyond my expectations. Then again, I've got a fiancee and a dog to come home to, so perhaps leaving isn't so bad after all ;)

Permalink: /geekery/mashupcamp-ireland-2007
posted at: 22:14
