Search this site





SSL handshake latency and HTTPS optimizations.

At work today, I started investigating the latency differences for similar requests between HTTP and HTTPS. Historically, I was running with the assumption that higher latency on HTTPS (SSL) traffic was to be expected since SSL handshakes are more CPU intensive. I didn't really think about the network consequences of SSL until today.

It's all in the handshake.

TCP handshake is a 3-packet event. The client sends 2 packets, the server sends 1. Best case, you're looking at one round-trip for establishing your connection. We can show this empirically by comparing ping and tcp connect times:

% fping -q -c 5 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 112/115/123
Average is 115ms for ping round-trip. How about TCP? Let's ask curl how long tcp connect takes:
% seq 5 | xargs [email protected] -n1 curl -so /dev/null -w "%{time_connect}\n"
There's your best case. This is because when you (the client) receive the 2nd packet in the handshake (SYN+ACK), you reply with ACK and consider the connection open. Exactly 1 round-trip is required before you can send your http request.

What about when using SSL? Let's ask curl again:

% curl -kso /dev/null -w "tcp:%{time_connect}, ssldone:%{time_appconnect}\n"
tcp:0.117, ssldone:0.408

# How about to google?
% curl -kso /dev/null -w "tcp:%{time_connect}, ssldone:%{time_appconnect}\n"
tcp:0.021, ssldone:0.068

3.5x jump in latency just for adding SSL to the mix, and this is before we sent the http request.

The reason for this is easily shown with tcpdump. For this test, I'll use tcpdump to sniff https traffic and then use openssl s_client to simply connect to the http server over ssl and do nothing else. Start tcpdump first, then run openssl s_client.

terminal1 % sudo tcpdump -ttttt -i any 'port 443 and host'

terminal2 % openssl s_client -connect

Tcpdump output trimmed for content:

# Start TCP Handshake
00:00:00.000000 IP snack.home.40855 > Flags [S] ...
00:00:00.114298 IP > snack.home.40855: Flags [S.] ...
00:00:00.114341 IP snack.home.40855 > Flags [.] ...
# TCP Handshake complete.

# Start SSL Handshake.
00:00:00.114769 IP snack.home.40855 > Flags [P.] ...
00:00:00.226456 IP > snack.home.40855: Flags [.] ...
00:00:00.261945 IP > snack.home.40855: Flags [.] ...
00:00:00.261960 IP > snack.home.40855: Flags [P.] ...
00:00:00.261985 IP snack.home.40855 > Flags [.] ...
00:00:00.261998 IP snack.home.40855 > Flags [.] ...
00:00:00.273284 IP snack.home.40855 > Flags [P.] ...
00:00:00.398473 IP > snack.home.40855: Flags [P.] ...
00:00:00.436372 IP snack.home.40855 > Flags [.] ...

# SSL handshake complete, ready to send HTTP request. 
# At this point, openssl s_client is sitting waiting for you to type something
# into stdin.

Summarizing the above tcpdump data for this ssl handshake:
  • 12 packets for SSL, vs 3 for TCP alone
  • TCP handshake took 114ms
  • Total SSL handshake time was 436ms
  • Number of network round-trips was 3.
  • SSL portion took 322ms (network and crypto)
The server tested above has a 2048 bit ssl cert. Running 'openssl speed rsa' on the webserver shows it can do a signature in 22ms:
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.022382s 0.000542s     44.7   1845.4
Anyway. The point is, no matter how fast your SSL accelerators (hardware loadbalancer, etc), if your SSL end points aren't near the user, then your first connect will be slow. As shown above, 22ms for the crypto piece of SSL handshake, which means 300ms of the SSL portion above was likely network latency and some other overhead.

Once SSL is established, though, it switches to a block cipher (3DES, etc) which is much faster and the resource (network, cpu) overhead is pretty tiny by comparison.

Summarizing from above: Using SSL incurs a 3.5x latency overhead for each handshake, but afterwards it's generally fast like plain TCP. If you accept this conclusion, let's examine how this can affect website performance.

Got firebug? Open any website. Seriously. Watch the network activity. How many HTTP requests are made? Can you tell how many of those that go to the same domain use http pipelining (or keepalive)? How many initiate new requests each time? You can track this with tcpdump by looking for 'syn' packets if you want (tcpdump 'tcp[tcpflags] == tcp-syn').

What about the street wisdom for high-performance web servers? HAProxy's site says:

"If a site needs keep-alive, there is a real problem. Highly loaded sites often disable keep-alive to support the maximum number of simultaneous clients. The real downside of not having keep-alive is a slightly increased latency to fetch objects. Browsers double the number of concurrent connections on non-keepalive sites to compensate for this."
Disabling keep-alive on SSL connections means every single http request is going to take 3 round-trips before even asking for data. If your server is 100ms away, and you have 10 resources to serve on a single page, that's 3 seconds of network latency before you include SSL crypto or resource transfer time. With keep alive, you could eat that handshake cost only once instead of 10 times.

Many browsers will open multiple simultaneous connections to any given webserver if it needs to fetch multiple resources. Idea is that parallelism gets you more tasty http resources in a shorter time. If the browser opens two connections in parallel, you'll still incur many sequential SSL handshakes that slow your resource fetching down. More SSL handshakes in parallel means higher CPU burden, too, and ultimately memory (per open connection) scales more cheaply than does CPU time - think: above, one active connection cost 22ms of time (most of which is spent in CPU) and costs much more than that connection holds in memory resources and scales better (easier to grow memory than cpu).

For some data, Google and Facebook both permit keep-alive:

% URL=
% curl  -w "tcp: %{time_connect} ssl:%{time_appconnect}\n" -sk -o /dev/null $URL -o /dev/null $URL
tcp: 0.038 ssl:0.088
tcp: 0.000 ssl:0.000

% URL=
% curl  -w "tcp: %{time_connect} ssl:%{time_appconnect}\n" -sk -o /dev/null $URL -o /dev/null $URL
tcp: 0.054 ssl:0.132
tcp: 0.000 ssl:0.000
The 2nd line of output reports zero time spent in tcp and ssl handshaking. Further, if you tell curl to output response headers (curl -D -) you'll see "Connection: keep-alive". This is data showing that at least some of big folks with massive qps are using keep alive.

Remember that new handshakes are high cpu usage, but existing SSL connections generally aren't as they are using a cheaper block cipher after the handshake. Disabling keep alive ensures that every request will incur an SSL handshake which can quickly overload a moderately-used server without SSL acceleration hardware if you have a large ssl key (2048 or 4096bit key).

Even if you have SSL offloading to special hardware, you're still incuring the higher network latency that can't be compensated by faster hardware. Frankly, in most cases it's more cost effective to buy a weaker SSL certificate (1024 bit) than it is to buy SSL hardware - See Google's Velocity 2010 talk on SSL.

By the way, on modern hardware you can do a decent number of SSL handshakes per second with 1024bit keys, but 2048bit and 4096bit keys are much harder:

# 'openssl speed rsa' done on an Intel X5550 (2.66gHz)
rsa 1024 bits 0.000496s 0.000027s   2016.3  36713.2
rsa 2048 bits 0.003095s 0.000093s    323.1  10799.2
rsa 4096 bits 0.021688s 0.000345s     46.1   2901.5

Fixing SSL latency is not totally trivial. The CPU intensive part can be handled by special hardware if you can afford it, but the only way sure way to solve network round-trip latency is to be closer to your user and/or to work on minimizing the total number of round-trips. You can be further from your users if you don't force things like keep-alive to be off, which can save you money in the long run by letting you have better choices of datacenter locations.

c++ grok vs perl grok on pattern discovery

I finished up work on the pattern discovery feature for the C++ port of grok. As soon as it was finished, I wanted to see the dpeed differences between the perl and C++ versions.

  • Perl grok: 6 lines analyzed per second
  • C++ grok: 130 lines analyzed per second
The feature tested here was the one detailed in this post.

130 lines per second isn't fantastic, but it's 21.66 times faster than the perl version, and that's huge.

I still have to implement a few other features to make the C++ version equivalent to the perl version:

  • config file (same format, ideally, as the perl version)
  • filters, like %SYSLOGDATE|parsedate%

Grok predicates - Perl vs C++

I just finished implementing predicates in c++grok (tentative name) and wanted to compare the performance against perl grok.

An input of 50000 lines of apache logfile amounting to 9.7megs of data.

I initially attempted this using the regex predicate %IP~/^129% but I realized that perl grok compiles the predicate regex every time it is executed, and wasn't a fair test. So I switched to %IP>=129% instead, which converts the match to an integer first (so turns into 129, for example), which seems like more equal ground based on the implementations in both perl and C++.

# C++ Grok
% /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps

# Perl Grok
% /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17721minor)pagefaults 0swaps
What still remains consistent is the trend that the more complexity I add in C++ equates to a greater margin of speed from the perl version.
  • Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
  • Using predicates shows the C++ version running 10 times faster.
I still need to write test cases for the C++ version in addition to porting the pattern discovery portion from perl.

Exciting :)

C vs Python with Berkeley DB

I've got a stable, threaded version of this fancydb tool I've been working on. However, the performance of insertions is less than optimal.

Then again, how much should insert performance matter on a monitoring tool? For data that comes into it gradually, speed doesn't matter much. For bulk inserts, speed matters if you want to get your work done quickly. I haven't decided if bulk insertions are necessary use case for this tool. Despite that, I'm still interested in what the limits are.

I have experimented with many different implementations of parallelism, buffering, caching, etc in the name of making insertion to a fancydb with 10 rules fast. The fastest I've gotten it was 10000/sec, but that was on an implementation that wasn't threadsafe (and used threads).

My most-recent implementation (which should be threadsafe) can do reads and writes at 30000/sec. With evaluation rules the write rate drops to about 10000/sec.

The next task was to figure out what I was doing wrong. For comparison, I wrote two vanilla bdb accessing programs. One in C and one in Python. The output of these two follows:

# The args for each program is: insertions page_size cache_size
% sh
Running: ./test 2000000 8192 10485760
  => 2000000 inserts + 1 fullread: 209205.020921/sec
Running: ./ 2000000 8192 10485760
  => 2000000 inserts + 1 fullread: 123304.562269/sec
As expected, C clearly outperforms Python here, but the margin is pretty small (C is 69% faster for this test). Given the 120000/sec rate from Python, the poor input rate of my tool seems to be blamed on me. Is my additional code here really the reason that I can only write at 30000 per second? I may need to revisit how I'm implementing things in python. I'm not clear right now where I'm losing so much throughput.

So I use hotshot (python standard profiler) and I find that most of the time is spent in my iterator method. This method is a generator method which uses yield and loops over a cursor.

It's important to note that my python bdb 'speed test' above did not use generators, it used a plain while loop over the cursor. So, I wrote another test that uses generators. First, let's try just inserts, no reading of data:

Running: ./test 1000000 8192 10485760
  => 1000000 inserts: 261096.605744/sec
Running: ./ 1000000 8192 10485760
  => 1000000 inserts: 166389.351082/sec
Now let's try with 3 different python reading methods: while loop across a cursor, generator function (using yield), and an iterator class (implementing __iter__):
Running: ./ 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 8.660000
Running: ./ 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 9.124000
Running: ./ 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 13.130000
I'm not sure why implementing an iterator is so much slower (in general) than a yield-generator is. Seems strange, perhaps my testing code is busted. Either way, I'm not really closer to finding the slowness.

get this code here

Boredom, vmware cpu performance, and /dev/random

These are strictly cpu-bound tests using 'openssl speed'. I didn't compile any of the openssl binaries here, so it's possible that differences in compilationcaused the differences in the numbers.

I've never noticed a performance decrease of the host vs guest systems in vmware, and here's data confirming my suspecions.

guest/solaris10    OpenSSL 0.9.8e 23 Feb 2007
guest/freebsd6.2   OpenSSL 0.9.7e-p1 25 Oct 2004
host/linux         OpenSSL 0.9.8c 05 Sep 2006

'openssl speed blowfish'
                   type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
host/linux         blowfish cbc     72062.94k    77117.35k    78280.70k    78680.96k    79309.48k
guest/freebsd6.2   blowfish cbc     68236.69k    73335.83k    74060.50k    74423.40k    74703.29k
guest/solaris10    blowfish cbc     64182.15k    73944.47k    75952.21k    76199.94k    76931.07k

'openssl speed rsa'
                                      sign    verify    sign/s verify/s
host/linux         rsa  512 bits 0.000308s 0.000020s   3244.3  49418.3
guest/freebsd6.2   rsa  512 bits   0.0003s   0.0000s   3343.5  41600.1
guest/solaris10    rsa  512 bits 0.001289s 0.000116s    775.6   8630.8

host/linux         rsa 1024 bits 0.000965s 0.000049s   1036.7  20409.8
guest/freebsd6.2   rsa 1024 bits   0.0009s   0.0001s   1160.0  18894.2
guest/solaris10    rsa 1024 bits 0.007152s 0.000369s    139.8   2708.1

host/linux         rsa 2048 bits 0.004819s 0.000135s    207.5   7414.4
guest/freebsd6.2   rsa 2048 bits   0.0045s   0.0001s    222.8   6951.1
guest/solaris10    rsa 2048 bits 0.045780s 0.001334s     21.8    749.8

host/linux         rsa 4096 bits 0.028600s 0.000422s     35.0   2371.3
guest/freebsd6.2   rsa 4096 bits   0.0279s   0.0004s     35.8   2271.4
guest/solaris10    rsa 4096 bits 0.317812s 0.004828s      3.1    207.1
It's interesting that the performance on blowfish were pretty close, but rsa was wildly different. The freebsd guest outperformed the linux host in signing by 10%, but fell behind in verification. Solaris peformed abysmally. The freebsd-guest vs linux-host data tells me that the cpu speed differences between guest and host environments is probably zero, which is good.

Again, the compilation options for each openssl binary probably played large parts in the performance here. I'm not familiar with SunFreeware's compile options with openssl (the binary I used came from there).

Either way, the point here was not to compare speeds against different platforms, but to in some small way compare cpu performance between host and guest systems. There are too many uncontrolled variables in this experiment to consider it valid, but it is interesting data and put me on another path to learn about why they were different.

My crypto is rusty, but I recall that rsa may need a fair bit of entropy to pick a big prime. Maybe solaris' entropy system is slower than freebsd's or linux's system? This lead me to poke at /dev/random on each system. I wrote a small perl script to read from /dev/random as fast as possible.

host/linux        82 bytes in 5.01 seconds: 16.383394 bytes/sec
guest/solaris10   57200 bytes in 5.01 seconds: 11410.838461 bytes/sec
guest/freebsd6.2  210333696 bytes in 5.01 seconds: 41947398.850271 bytes/sec
I then ran the same test on the host/linux machine while feeding /dev/random on the host from entropy from the freebsd machine:
% ssh [email protected] 'cat /dev/random' > /dev/random &
% perl                                  
448 bytes in 5.00 seconds: 89.563136 bytes/sec

# Kill that /dev/random feeder, and now look:
% perl
61 bytes in 5.01 seconds: 12.185872 bytes/sec
When speed is a often a trade-off for security, are FreeBSD's and Solaris's /dev/random features more insecure than Linux's? Or, is Linux just being dumb?

Googling finds data indicating that /dev/random on linux will block until entropy is available, so let's retry with /dev/urandom instead.

host/linux        29405184 bytes in 5.01 seconds: 5874687.437817 bytes/sec
guest/solaris10   70579600 bytes in 5.00 seconds: 14121588.405586 bytes/sec
guest/freebsd6.2  208445440 bytes in 5.02 seconds: 41502600.216189 bytes/sec
FreeBSD's /dev/urandom is a symlink to /dev/random, so the same throughput appearing here is expected. FreeBSD's still wins by a landslide. Why? Then again, maybe that's not a useful question. How often do you 40mb/sec of random data?

Back at the rsa question - If solaris' random generator is faster than linux in all cases, then why is 'openssl speed rsa' slower on solaris than linux? Compile time differences? Perhaps it's some other system bottleneck I haven't explored yet.

Mysql prepare'd queries aren't cached, ever.

There once was a database named MySQL.

It had a query cache, becuase caching helps performance.

It also had queries you could "prepare" on the server-side, with the hope that your database server can make some smart decisions what to do with a query you're going to execute N times during this session.

I told mysql to enable it's caching and use a magic value of 1gb for memory storage. Much to my surprise, I see the following statistic after testing an application:

mysql> show status like 'Qcache_%';
| Variable_name           | Value      |
| Qcache_free_blocks      | 1          | 
| Qcache_free_memory      | 1073732648 | 
| Qcache_hits             | 0          | 
| Qcache_inserts          | 0          | 
| Qcache_lowmem_prunes    | 0          | 
| Qcache_not_cached       | 814702     | 
| Qcache_queries_in_cache | 0          | 
| Qcache_total_blocks     | 1          | 
8 rows in set (0.00 sec)
Why are so many (all!?) of the queries not cached? Surely I must be doing something wrong. Reading the doc on caching explained what I can only understand as a complete lapse of judgement on the part of MySQL developers:
Note: The query cache is not used for server-side prepared statements. If you're using server-side prepared statements consider that these statement won't be satisfied by the query cache. See Section 22.2.4, C API Prepared Statements.
Any database performance guide anywhere will tell you to use prepared statements. They're useful from both a security and performance perspective.

Security, becuase you feed the prepared query data and it knows what data types to expect, erroring when you pass something invalid. It also will handle strings properly, so you worry less about sql injection. You also get convenience, in that you don't have to escape your data.

Performance, becuase telling the database what you are about to do lets it optimize the query.

This performance is defeated, however, if you want to use caching. So, I've got a dillema! There are two mutually-exclusive (because MySQL sucks) performance-enhancing options available to me: using prepared statements or using caching.

Prepared statements give you two performance benefits (maybe more?). The first, is the server will parse the query string when you prepare it, and execute the "parsed" version whenever you invoke it. This saves parsing time; parsing text is expensive. The second, is that if your database is nice, it will try to optimize your queries before execution. Using prepared statements will permit the server to optimize query execution once, and then remember it. Good, right?

Prepared statements improve CPU utilization, in that the cpu can work less becuase you're teaching the database about what's coming next. Cached query responses improve disk utilization, and depending on implementation should vastly outperform most (all?) of the gains from prepared statements. This assumption I am making is based on the assumption that disk is slow and cpu is fast.

Cached queries will (should?) cache results of complex queries. This means that a select query with multiple, complex joins should be cached mapping the query string to the result. No amount of statement preparation will improve complex queries becuase they still have to hit disk. Large joins require lots of disk access, and therefore are slow. Remembering "This complex query" returned "this happy result" is fast regardless of whether or not it's stored on disk or in memory. Caching also saves cpu utilization.

I can't believe preparing a query will prevent it from being pulled from the query cache, but this is clearly the case. Thanks, MySQL, for making a stupid design decision.

Maybe there's some useful JDBC (oh yeah, the app I'm testing is written in Java) function that'll give you all the convenience/security benefits of prepare, but without the server-side bits, and thus let you use the query cache.