Jordan Sissel
geek

Wed, 31 Jan 2007

Session affinity and load distribution with Tomcat and Apache

You can scale tomcat webapps somewhat well using session affinity and load distribution. But how? Apache to the rescue.

For each tomcat server, modify the server.xml and change the value for 'jvmRoute' to the ip address of the tomcat server. Example:

  <Engine name="Standalone" defaultHost="localhost" jvmRoute="192.168.0.10">
This affects the last token in your jsessionid cookie. When I visit my tomcat, my cookie gets set to the following:
C40ABF646B07162A621856F459977E9B.192.168.0.10

Use apache's mod_rewrite to make apache a frontend to your tomcat servers. That is, use apache as a reverse proxy (the [P] flag used here needs mod_proxy loaded as well). In your httpd.conf:

# mod_rewrite ignores every rule below unless the engine is switched on
RewriteEngine On
RewriteMap SERVERS rnd:/etc/httpd/conf/frontends.conf
RewriteCond "%{HTTP_COOKIE}"          "(^|;\s*)JSESSIONID=\w*\.([0-9.]+)($|;)"
RewriteRule "^(.*)"                   "http://%2:8080%{REQUEST_URI}"  [P,L]
RewriteRule "^.*;jsessionid=\w*\.([0-9.]+)($|;)"  "http://$1:8080%{REQUEST_URI}"  [P,L]
RewriteRule "^(.*)"                    "http://${SERVERS:ALL}:8080%{REQUEST_URI}" [P,L]
This technique is quite similar to the one on tomcat.apache.org in the docs, but I think it's better. Why? Fewer files to modify when you add or remove tomcat servers means less complexity, fewer errors and less effort.

  1. RewriteCond "%{HTTP_COOKIE}" "(^|;\s*)JSESSIONID=\w*\.([0-9.]+)($|;)"
    If a jsessionid cookie is found, go to #2 and store match groups (backreferences) as %1, %2, etc.
  2. RewriteRule "^(.*)" "http://%2:8080%{REQUEST_URI}" [P,L]
    Session Affinity: Redirect everything using an internal proxy request to the 2nd group matched in the previous RewriteCond. Since we use the IP as the jvmRoute, that's what is matched, and your request is proxied to the server that gave you your cookie.
  3. RewriteRule "^.*;jsessionid=\w*\.([0-9.]+)($|;)" "http://$1:8080%{REQUEST_URI}" [P,L]
    Session affinity: Tomcat likes to add (who knows why?) ";jsessionid=blah" to the end of the url when it first sets you up the cookie. In case no cookie is found, this will proxy your request to the proper server just like the previous rule.
  4. RewriteRule "^(.*)" "http://${SERVERS:ALL}:8080%{REQUEST_URI}" [P,L]
    Load distribution: Catch-all for anything that didn't have a cookie or jsessionid thing in the url. "ALL" is just a key from the RewriteMap listed below. A random one is chosen and inserted.

Since the server ip is stored in the cookie, apache (using regular expressions) can pull it out and will internally proxy your request through to the proper tomcat server.
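To make the decision order concrete, here is a rough Python sketch of what those rewrite rules do. It is not how Apache implements anything; pick_backend and its arguments are names I made up for illustration, and the server list stands in for the RewriteMap.

```python
import random
import re

# Same patterns as the RewriteCond / RewriteRule lines, with one capture group
# for the jvmRoute (an IP address, since that's what we put in server.xml).
COOKIE_ROUTE = re.compile(r"(?:^|;\s*)JSESSIONID=\w*\.([0-9.]+)(?:$|;)")
URL_ROUTE = re.compile(r";jsessionid=\w*\.([0-9.]+)(?:$|;)")

def pick_backend(cookie_header, request_uri, servers, rng=random):
    """Return the URL a request would be proxied to (hypothetical helper)."""
    # Rules 1 and 2: a JSESSIONID cookie wins.
    match = COOKIE_ROUTE.search(cookie_header or "")
    # Rule 3: otherwise look for ;jsessionid=... in the URL.
    if match is None:
        match = URL_ROUTE.search(request_uri)
    # Rule 4: no session at all, so pick any server at random.
    host = match.group(1) if match else rng.choice(servers)
    return f"http://{host}:8080{request_uri}"
```

Calling pick_backend("JSESSIONID=C40ABF646B07162A621856F459977E9B.192.168.0.10", "/app/", servers) sends the request back to 192.168.0.10. Note that, like the rewrite rules, this trusts whatever address the client puts in the cookie; checking it against the known server list would be a sensible extra step.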

That works great for sessions that already exist, but what about for sessions that don't exist? That's what ${SERVERS:ALL} is for. You need something like this in your frontends.conf file:

ALL 192.168.0.10|192.168.0.11
This would be even better if you only used DNS for this. Then, you wouldn't need to update any config files when you added or removed tomcat servers.

If you replaced the fallback rule (the first line below) with one that proxies to a hostname (the second line):

RewriteRule "^(.*)"       "http://${SERVERS:ALL}:8080%{REQUEST_URI}" [P,L]
RewriteRule "^(.*)"       "http://mytomcats.foo.com:8080%{REQUEST_URI}" [P,L]
Apache should proxy internally to "mytomcats.foo.com", which should result in a dns lookup of that hostname. If that hostname has multiple address records, you get round-robin balancing across all tomcats for new sessions. When you add or remove tomcat servers, you don't have to update any config files.

No config files to change when you add new servers? That makes for healthy, dynamic scaling.

The best way to solve this would be to have tomcat share its session data, but that uses multicast, and the network where tomcat lives doesn't have multicast routing enabled, so that doesn't seem like an option.

Permalink: /geekery/session-balancing-across-tomcats-with-apache
posted at: 05:08

Tue, 30 Jan 2007

Fedora's package manager

-bash-3.1# yum install django
No Match for argument: django
Nothing to do

-bash-3.1# yum install Django
Downloading Packages:
(1/1): Django-0.95.1-1.fc 100% |=========================| 1.5 MB    00:02
Ahh. Clearly.

Permalink: /rants/fedora-yum
posted at: 01:48

Mon, 29 Jan 2007

Comment spam that got through

I get emails from this site when someone comments.

This morning, this showed up:

Name: Virtual Pharmacy
Email: [snipped]
URL: [snipped]
Hostname: 114.199.36.72.reverse.layeredtech.com (72.36.199.114)
Entry URL: http://www.semicomplete.com/blog/geekery/barcamp-sanfrancisco-2
Comment location: [snipped]

Everyone repeat, what alcohol should be consumed moderately, but what it means? Why to women
 recommend to drink more moderately than to men? What is the female alcoholism? WBR LeoP
A quick google search for the strange tail token, "WBR LeoP" reveals a clear indication that this is comment spam (as if the content didn't give it away).

The url the spammer used points at pharmacynewsblog.com, which looks like a normal blog.

It's not.

The content is entirely viagra-and-friends related, which is fine. However, look at this simple snippet of visible text from the frontpage:

Drug treatment may beat psychotherapy at ...
Google for this phrase and you'll find that it's been plagiarized. But deliciously so:

View source, you'll see:

<p>Drug <b class=ne>joint pain are </b>treatment <BLINK class=ne>of
purchase </BLINK>may <sup class=ne>wellbutrin at </sup>beat <small
class=ne>and paxil vs </small>psychotherapy
The css class 'ne' sets 'display: none' among other properties that make it stay out of the way of the browser.

This is quite clever, and appears automated.

pharmacynewsblog.com seems to be a somewhat autogenerated spam blog that takes news postings about viagra and the like and injects random html into them, with the intention of defeating antispam solutions. Anti-spam engines probably aren't smart enough to know that they should ignore the text pieces that are invisible. Who knows.
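For what it's worth, ignoring the invisible pieces isn't hard if you already know which css class hides them. Here's a minimal Python sketch using the standard html.parser module. The VisibleText class and visible_text function are my own illustration, and the big assumption is that you know 'ne' is the hidden class; a real filter would have to work that out from the stylesheet.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text, skipping elements whose class is known to be hidden."""

    def __init__(self, hidden_classes=("ne",)):
        super().__init__()
        self.hidden_classes = set(hidden_classes)
        self.hidden_tag = None   # tag name of the hidden element we are inside
        self.depth = 0           # nesting depth of that same tag name
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.hidden_tag is None:
            classes = (dict(attrs).get("class") or "").split()
            if self.hidden_classes.intersection(classes):
                self.hidden_tag, self.depth = tag, 1
        elif tag == self.hidden_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.hidden_tag:
            self.depth -= 1
            if self.depth == 0:
                self.hidden_tag = None

    def handle_data(self, data):
        if self.hidden_tag is None:
            self.chunks.append(data)

def visible_text(html, hidden_classes=("ne",)):
    parser = VisibleText(hidden_classes)
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    # Collapse runs of whitespace left behind by the removed elements.
    return " ".join("".join(parser.chunks).split())
```

Fed the snippet from the page source above, this gives back "Drug treatment may beat psychotherapy", which is the phrase a human reader sees.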

But, back to the spam comment. I use javascript to poke parts of the comment form indicating that a javascript-capable browser was used to submit the comment. If javascript is not detected, the comment is denied.

This comment got through, which means that javascript was enabled, which means that it was probably a web browser that did it.

Here's the apache log snippet:

72.36.199.114 - - [29/Jan/2007:13:01:17 -0500] "GET /blog/geekery/barcamp-sanfrancisco-2.html HTTP/1.1" 200 15903 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
72.36.199.114 - - [29/Jan/2007:13:01:18 -0500] "GET /style.css HTTP/1.1" 200 2584 "http://www.semicomplete.com/blog/geekery/barcamp-sanfrancisco-2.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
72.36.199.114 - - [29/Jan/2007:13:01:20 -0500] "POST /blog/geekery/barcamp-sanfrancisco-2 HTTP/1.1" 200 16392 "http://www.semicomplete.com/blog/geekery/barcamp-sanfrancisco-2.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
72.36.199.114 - - [29/Jan/2007:13:01:21 -0500] "GET /style.css HTTP/1.1" 200 2584 "http://www.semicomplete.com/blog/geekery/barcamp-sanfrancisco-2" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
It didn't fetch any images, but it did pull style sheets, which is strange behavior if it's a simple spam bot that doesn't care about how a page looks. It also pulled the blog posting page first, then submitted a comment. Further indication that this bot is either really clever, or a person is behind the wheel.
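The same eyeballing can be done mechanically. Below is a small Python sketch that groups combined-format access log lines by client and reports the traits I looked at by hand: did it read the page before posting, did it pull stylesheets, did it pull images. The function names and the list of image extensions are my own choices for illustration, not part of any existing tool.

```python
import re
from collections import defaultdict

# Apache "combined" log format: ip, ident, user, [time], "request", status, size, "referer", "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

IMAGE_SUFFIXES = (".png", ".gif", ".jpg", ".jpeg")

def requests_by_client(lines):
    """Map each client ip to its (method, path) requests, in log order."""
    clients = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            clients[match.group("ip")].append((match.group("method"), match.group("path")))
    return clients

def profile(requests):
    """Summarise one client's behaviour (hypothetical heuristics)."""
    methods = [method for method, _ in requests]
    paths = [path for _, path in requests]
    first_post = methods.index("POST") if "POST" in methods else None
    return {
        "posted": first_post is not None,
        "read_page_first": first_post is not None and "GET" in methods[:first_post],
        "fetched_css": any(p.endswith(".css") for p in paths),
        "fetched_images": any(p.lower().endswith(IMAGE_SUFFIXES) for p in paths),
    }
```

For the four log lines above, profile() reports a client that read the page first, posted, fetched css, and fetched no images. None of that proves bot or human; it only turns the observation into something you can run over a whole log.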

If you search for the ip, 72.36.199.114, the first hit on google is an automagically updated list of known comment spam hosts.

Permalink: /geekery/comment-spam-got-through
posted at: 13:41

Sat, 27 Jan 2007

Mysql prepare'd queries aren't cached, ever.

There once was a database named MySQL.

It had a query cache, because caching helps performance.

It also had queries you could "prepare" on the server-side, with the hope that your database server can make some smart decisions what to do with a query you're going to execute N times during this session.

I told mysql to enable its caching and use a magic value of 1gb for memory storage. Much to my surprise, I saw the following statistics after testing an application:

mysql> show status like 'Qcache_%';
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| Qcache_free_blocks      | 1          | 
| Qcache_free_memory      | 1073732648 | 
| Qcache_hits             | 0          | 
| Qcache_inserts          | 0          | 
| Qcache_lowmem_prunes    | 0          | 
| Qcache_not_cached       | 814702     | 
| Qcache_queries_in_cache | 0          | 
| Qcache_total_blocks     | 1          | 
+-------------------------+------------+
8 rows in set (0.00 sec)
Why are so many (all!?) of the queries not cached? Surely I must be doing something wrong. Reading the doc on caching explained what I can only understand as a complete lapse of judgement on the part of MySQL developers:
from http://dev.mysql.com/doc/refman/5.0/en/query-cache.html
Note: The query cache is not used for server-side prepared statements. If you're using server-side prepared statements consider that these statement won't be satisfied by the query cache. See Section 22.2.4, C API Prepared Statements.
Any database performance guide anywhere will tell you to use prepared statements. They're useful from both a security and performance perspective.

Security, because you feed the prepared query data and it knows what data types to expect, erroring when you pass something invalid. It also will handle strings properly, so you worry less about sql injection. You also get convenience, in that you don't have to escape your data.

Performance, because telling the database what you are about to do lets it optimize the query.

This performance is defeated, however, if you want to use caching. So, I've got a dilemma! There are two mutually-exclusive (because MySQL sucks) performance-enhancing options available to me: using prepared statements or using caching.

Prepared statements give you two performance benefits (maybe more?). The first, is the server will parse the query string when you prepare it, and execute the "parsed" version whenever you invoke it. This saves parsing time; parsing text is expensive. The second, is that if your database is nice, it will try to optimize your queries before execution. Using prepared statements will permit the server to optimize query execution once, and then remember it. Good, right?

Prepared statements improve CPU utilization, in that the cpu can work less because you're teaching the database about what's coming next. Cached query responses improve disk utilization, and depending on implementation should vastly outperform most (all?) of the gains from prepared statements. I'm assuming here that disk is slow and cpu is fast.

Cached queries will (should?) cache results of complex queries. This means that a select query with multiple, complex joins should be cached mapping the query string to the result. No amount of statement preparation will improve complex queries because they still have to hit disk. Large joins require lots of disk access, and therefore are slow. Remembering "This complex query" returned "this happy result" is fast regardless of whether or not it's stored on disk or in memory. Caching also saves cpu utilization.
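The "query string maps to result" idea fits in a few lines. This is a toy Python sketch, not MySQL's implementation: it uses sqlite3 as a stand-in database so it runs anywhere, the CachedQueries class is invented for illustration, and it throws away the whole cache on any write where a real cache would be more selective. The hits and inserts counters are just named after the Qcache_hits and Qcache_inserts statistics shown earlier.

```python
import sqlite3

class CachedQueries:
    """Toy query cache: exact SQL text -> result rows."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.hits = 0      # answered from the cache
        self.inserts = 0   # results stored in the cache

    def query(self, sql):
        if sql in self.cache:
            self.hits += 1
            return self.cache[sql]
        rows = self.conn.execute(sql).fetchall()   # the slow part
        self.cache[sql] = rows
        self.inserts += 1
        return rows

    def write(self, sql, params=()):
        self.conn.execute(sql, params)
        self.cache.clear()   # crude: any write invalidates everything
```

The second time the same SELECT text comes in, nothing touches the database at all. That's the win I was hoping to get alongside prepared statements.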

I can't believe preparing a query will prevent it from being pulled from the query cache, but this is clearly the case. Thanks, MySQL, for making a stupid design decision.

Maybe there's some useful JDBC (oh yeah, the app I'm testing is written in Java) function that'll give you all the convenience/security benefits of prepare, but without the server-side bits, and thus let you use the query cache.

Permalink: /geekery/mysql-prepare-queries-not-cached
posted at: 21:26

Fri, 26 Jan 2007

One anti-spam effort too easily defeated.

I often see people write their email addresses as "foo at bar dot org" in a hopeful effort to keep spammers from scraping them. Heck, mail archive systems often have (and are deployed with) options to obfuscate email addresses systematically, using the same pattern: foo at bar dot com.

All it does is hurt usability.

Googling for "* at * dot *" clearly shows lots of matches. It also matches all of the following variants, because google searches ignore brackets and such in words:

  • foo at bar dot com
  • foo [at] bar [dot] com
  • foo (at) bar (dot) com
  • ... etc ...
Query, scrape, replace 'at' and 'dot' as desired. I now have 54 million email addresses. What now?
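The "replace" step is about as hard as it sounds. Here's a small Python sketch that undoes the variants listed above; the deobfuscate function and its regexes are my own illustration, and they will happily produce false positives on ordinary sentences like "look at example dot com", which a scraper wouldn't care about anyway.

```python
import re

# "at" / "dot" as whole words, optionally wrapped in [] or (), with optional spaces around
AT = r"\s*[\[(]?\bat\b[\])]?\s*"
DOT = r"\s*[\[(]?\bdot\b[\])]?\s*"

# local part, then "at", then two or more domain labels joined by "dot"
PATTERN = re.compile(rf"([\w.+-]+){AT}([\w-]+(?:{DOT}[\w-]+)+)", re.IGNORECASE)

def deobfuscate(text):
    """Rewrite 'foo at bar dot com' style addresses as foo@bar.com."""
    def fix(match):
        domain = re.sub(DOT, ".", match.group(2), flags=re.IGNORECASE)
        return f"{match.group(1)}@{domain}"
    return PATTERN.sub(fix, text)
```

deobfuscate("foo [at] bar [dot] com") returns "foo@bar.com". A dozen lines, no cleverness required.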

Seems like this effort only serves to have people fool themselves as well as to impede usability. It certainly won't protect you from spam. Why is this method used?

Permalink: /geekery/anti-spam-obfuscation-easily-defeated
posted at: 22:55

Mon, 22 Jan 2007

Pulling album covers from Amazon

Amazon provides lots of web services. One of these is its E-Commerce API which allows you to search its vast product database (among other things).

In Pimp, the page for any given listening station shows you the current song being played. Along with that, I wanted to provide the album cover for the current track.

You can leverage Amazon's API to search for a given artist and album eventually leading you to the picture of the album cover. To this end, I wrote a little python module that lets you search for an artist and album name combination and will give you a link to the album cover.

That module is albumcover.py, a prototype that turns an artist and album into a url to the album cover image. It works for the 20 or so tests I've put through it.

Permalink: /geekery/pull-album-covers-from-amazon
posted at: 00:52

Sun, 21 Jan 2007

Python's sad xml business and modules vs packages.

So, I've been reading docs on python's xml stuff, hoping there's something simple or comes-default-with-python that'll let me do xpath. Everyone overcomplicates xml processing. I have no idea why. Python seems to have enough alternatives to make dealing with xml less painful.

Standard python docs will lead you astray:

kenya(...ojects/pimp/pimp/controllers) % pydoc xml.dom | wc -l
643
Clearly, the pydoc for "xml.dom" has some nice things, right? I mean, documentation is clearly an indication THAT THE THING THAT IS DOCUMENTED IS AVAILABLE. Right?

Sounds great. Let's try to use this 'xml.dom' module!

kenya(...ojects/pimp/pimp/controllers) % python -c 'import xml; xml.dom'
Traceback (most recent call last):
  File "<string>", line 1, in ?
AttributeError: 'module' object has no attribute 'dom'
WHAT. THE. HELL.

Googling around, it turns out that 'xml' is a fake module that only actually works if you have the 4Suite modules installed? Maybe?

Why include fake modules that provide complete documentation to modules that do not exist in the standard distribution?

Who's running this ship? I want off. I'll swim if necessary.

As it turns out, I made too strong an assumption about python's affinity towards java-isms. I roughly equated 'import foo' in python with 'import foo.*' in java. That was incorrect. Importing foo doesn't get you access to things in its directory; they have to be imported explicitly.

In summary, 'import xml' gets you nothing useful. 'import xml.dom' doesn't get you a parser either. If you really want minidom's parser, you'll need 'import xml.dom.minidom' or a 'from import' variant.

On another note, the following surprised me. I had a module, foo/bar.py. I figured 'from foo import *' would grab it. It doesn't, unless foo/__init__.py names bar in __all__. This means 'from xml.dom import *' doesn't get you minidom and friends.
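You can watch this happen without touching xml at all. The Python sketch below builds two throwaway packages in a temp directory (demo_plain and demo_all are made-up names), each containing a bar.py, and checks what each style of import actually exposes. The only difference between the two packages is that demo_all lists bar in __all__.

```python
import importlib
import os
import sys
import tempfile

def make_package(root, name, init_source=""):
    """Create root/name/__init__.py and root/name/bar.py on disk."""
    pkg_dir = os.path.join(root, name)
    os.mkdir(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write(init_source)
    with open(os.path.join(pkg_dir, "bar.py"), "w") as f:
        f.write("VALUE = 42\n")

def star_import(name):
    """Run 'from <name> import *' in a fresh namespace and return it."""
    namespace = {}
    exec(f"from {name} import *", namespace)
    return namespace

root = tempfile.mkdtemp()
make_package(root, "demo_plain")                         # empty __init__.py
make_package(root, "demo_all", '__all__ = ["bar"]\n')    # __init__.py names bar
sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_plain
plain_import_has_bar = hasattr(demo_plain, "bar")        # submodule not loaded yet

star_plain_has_bar = "bar" in star_import("demo_plain")  # still not loaded
star_all_has_bar = "bar" in star_import("demo_all")      # __all__ triggers the import

import demo_plain.bar
explicit_import_has_bar = hasattr(demo_plain, "bar")     # now it's there
```

Only the last two come out true: the star import of the package with __all__, and the explicit 'import demo_plain.bar'. The plain import and the star import without __all__ leave bar untouched.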

Perhaps I was hoping for too much, but maybe it's better to import explicitly. If that's the case, then why push exceptions that allow '*' to be imported only from modules, not packages?

Permalink: /geekery/python-and-xml
posted at: 21:23

Sun, 14 Jan 2007

Strip XML comments with sed

sed -ne '/<!--/ { :c; /-->/! { N; b c; }; /-->/s/<!--.*-->//g }; /^  *$/!p;'
You might consider stripping blanklines and/or filtering through xmllint --format to make the xml pretty printed.
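If sed isn't handy, roughly the same thing in Python looks like the sketch below. It is not a port of the sed script: strip_comments is a name I picked, it uses a non-greedy match so two comments on one line don't swallow the text between them, and like the sed version it is regex-based, so it knows nothing about CDATA sections or anything else a real XML parser would.

```python
import re

# DOTALL lets '.' cross newlines, so multi-line comments match too.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(xml_text):
    """Remove XML comments, then drop lines that are empty or only whitespace."""
    without_comments = COMMENT.sub("", xml_text)
    return "\n".join(line for line in without_comments.splitlines() if line.strip())
```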

Permalink: /geekery/strip-comments-from-xml-with-sed
posted at: 19:48

Fri, 12 Jan 2007

ShmooCon '07

I'm going to be attending ShmooCon '07. If you're going, and I don't know you're going, let me know.

I'm helping out with HoH. Beware.

Permalink: /geekery/going-to-shmoocon-2007
posted at: 00:43

Tue, 02 Jan 2007

Goodbye, 2006!

It's very hard to believe that 2006 is gone. What a year!

Basic life summary: Graduated from RIT and started working for Google.

This year has been great fun for me. I've had a chance to work on a very wide range of projects. Some of them were silly, some of them were serious, and some were useful.

Taking the silly category by storm was my Yahoo! Hack Day '06 demo of TastyDrive. That same day involved my presentation of keynav. If you missed Yahoo!'s event, then you certainly missed the amazing concert Beck put on! My presentation at this event resulted in a kick-ass article about me in the Wall Street Journal.

Runner-up for silly is definitely pam_captcha, a PAM module implementing a text-based captcha system. My favorite captcha was obviously Dance Dance Authentication which received wide angst at the 2006 SPARSA security competition.

However, pam_captcha ended up being useful in that it caused me to study the behavior of brute-force ssh zombies. Good times ;)

Looking over my last year of posts, I am reminded of many project ideas that I never worked on. Grok, my expert-like pattern matching tool, has fallen victim to forgetfulness. Furthermore, many grok-related projects have fallen by the wayside: sysadmin secret sauce and the obvious children temporal (ooh, fancy word!?) data storage, grok and eventdb marriage, and some neat rrdtool tricks.

Hacks were a-plenty this year. Not all of them received written notes, but some of the neater ones are my wakeup script, a hack using squid and selenium to allow you to unit-test webpages by injecting functionality (xss using squid), and a long-forgotten touch-screen keyboard in javascript.

And how can I forget BarCamp? I attended three BarCamp events this year: New York, San Francisco, and Stanford. Many friends made. These kinds of conferences are absolutely my kind of events. Signal-to-noise at BarCamp is bliss by comparison to standard computer conferences.

This year also brought me a new hat, as a FreeBSD src committer so I can further my work on the mouse system changes.

I miss the free time and opportunities granted as a student. I haven't made up my mind about the "real world" quite yet, but I'm glad there's no homework.

With that, I say goodbye to 2006. It was a good year. I'm looking forward to 2007!

Permalink: /geekery/year-in-review-2006
posted at: 10:42
