Search this site


Metadata

Articles

Projects

Presentations

c++ grok vs perl grok on pattern discovery

I finished up work on the pattern discovery feature for the C++ port of grok. As soon as it was finished, I wanted to see the dpeed differences between the perl and C++ versions.

  • Perl grok: 6 lines analyzed per second
  • C++ grok: 130 lines analyzed per second
The feature tested here was the one detailed in this post.

130 lines per second isn't fantastic, but it's 21.66 times faster than the perl version, and that's huge.

I still have to implement a few other features to make the C++ version equivalent to the perl version:

  • config file (same format, ideally, as the perl version)
  • filters, like %SYSLOGDATE|parsedate%

Grok predicates - Perl vs C++

I just finished implementing predicates in c++grok (tentative name) and wanted to compare the performance against perl grok.

An input of 50000 lines of apache logfile amounting to 9.7megs of data.

I initially attempted this using the regex predicate %IP~/^129% but I realized that perl grok compiles the predicate regex every time it is executed, and wasn't a fair test. So I switched to %IP>=129% instead, which converts the match to an integer first (so 129.21.60.9 turns into 129, for example), which seems like more equal ground based on the implementations in both perl and C++.

# C++ Grok
% /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps

# Perl Grok
% /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17721minor)pagefaults 0swaps
What still remains consistent is the trend that the more complexity I add in C++ equates to a greater margin of speed from the perl version.
  • Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
  • Using predicates shows the C++ version running 10 times faster.
I still need to write test cases for the C++ version in addition to porting the pattern discovery portion from perl.

Exciting :)

C vs Python with Berkeley DB

I've got a stable, threaded version of this fancydb tool I've been working on. However, the performance of insertions is less than optimal.

Then again, how much should insert performance matter on a monitoring tool? For data that comes into it gradually, speed doesn't matter much. For bulk inserts, speed matters if you want to get your work done quickly. I haven't decided if bulk insertions are necessary use case for this tool. Despite that, I'm still interested in what the limits are.

I have experimented with many different implementations of parallelism, buffering, caching, etc in the name of making insertion to a fancydb with 10 rules fast. The fastest I've gotten it was 10000/sec, but that was on an implementation that wasn't threadsafe (and used threads).

My most-recent implementation (which should be threadsafe) can do reads and writes at 30000/sec. With evaluation rules the write rate drops to about 10000/sec.

The next task was to figure out what I was doing wrong. For comparison, I wrote two vanilla bdb accessing programs. One in C and one in Python. The output of these two follows:

# The args for each program is: insertions page_size cache_size
% sh runtest.sh
Running: ./test 2000000 8192 10485760
  => 2000000 inserts + 1 fullread: 209205.020921/sec
Running: ./py-bsddb.py 2000000 8192 10485760
  => 2000000 inserts + 1 fullread: 123304.562269/sec
As expected, C clearly outperforms Python here, but the margin is pretty small (C is 69% faster for this test). Given the 120000/sec rate from Python, the poor input rate of my tool seems to be blamed on me. Is my additional code here really the reason that I can only write at 30000 per second? I may need to revisit how I'm implementing things in python. I'm not clear right now where I'm losing so much throughput.

So I use hotshot (python standard profiler) and I find that most of the time is spent in my iterator method. This method is a generator method which uses yield and loops over a cursor.

It's important to note that my python bdb 'speed test' above did not use generators, it used a plain while loop over the cursor. So, I wrote another test that uses generators. First, let's try just inserts, no reading of data:

Running: ./test 1000000 8192 10485760
  => 1000000 inserts: 261096.605744/sec
Running: ./py-bsddb.py 1000000 8192 10485760
  => 1000000 inserts: 166389.351082/sec
Now let's try with 3 different python reading methods: while loop across a cursor, generator function (using yield), and an iterator class (implementing __iter__):
Running: ./py-bsddb.py 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 8.660000
Running: ./py-bsddb_generator.py 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 9.124000
Running: ./py-bsddb_iterable_class.py 4000000 8192 10485760
  => 1 fullread of 4000000 entries: 13.130000
I'm not sure why implementing an iterator is so much slower (in general) than a yield-generator is. Seems strange, perhaps my testing code is busted. Either way, I'm not really closer to finding the slowness.

get this code here

Boredom, vmware cpu performance, and /dev/random

These are strictly cpu-bound tests using 'openssl speed'. I didn't compile any of the openssl binaries here, so it's possible that differences in compilationcaused the differences in the numbers.

I've never noticed a performance decrease of the host vs guest systems in vmware, and here's data confirming my suspecions.

Versions:
guest/solaris10    OpenSSL 0.9.8e 23 Feb 2007
guest/freebsd6.2   OpenSSL 0.9.7e-p1 25 Oct 2004
host/linux         OpenSSL 0.9.8c 05 Sep 2006

'openssl speed blowfish'
                   type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
host/linux         blowfish cbc     72062.94k    77117.35k    78280.70k    78680.96k    79309.48k
guest/freebsd6.2   blowfish cbc     68236.69k    73335.83k    74060.50k    74423.40k    74703.29k
guest/solaris10    blowfish cbc     64182.15k    73944.47k    75952.21k    76199.94k    76931.07k

'openssl speed rsa'
                                      sign    verify    sign/s verify/s
host/linux         rsa  512 bits 0.000308s 0.000020s   3244.3  49418.3
guest/freebsd6.2   rsa  512 bits   0.0003s   0.0000s   3343.5  41600.1
guest/solaris10    rsa  512 bits 0.001289s 0.000116s    775.6   8630.8

host/linux         rsa 1024 bits 0.000965s 0.000049s   1036.7  20409.8
guest/freebsd6.2   rsa 1024 bits   0.0009s   0.0001s   1160.0  18894.2
guest/solaris10    rsa 1024 bits 0.007152s 0.000369s    139.8   2708.1

host/linux         rsa 2048 bits 0.004819s 0.000135s    207.5   7414.4
guest/freebsd6.2   rsa 2048 bits   0.0045s   0.0001s    222.8   6951.1
guest/solaris10    rsa 2048 bits 0.045780s 0.001334s     21.8    749.8

host/linux         rsa 4096 bits 0.028600s 0.000422s     35.0   2371.3
guest/freebsd6.2   rsa 4096 bits   0.0279s   0.0004s     35.8   2271.4
guest/solaris10    rsa 4096 bits 0.317812s 0.004828s      3.1    207.1
It's interesting that the performance on blowfish were pretty close, but rsa was wildly different. The freebsd guest outperformed the linux host in signing by 10%, but fell behind in verification. Solaris peformed abysmally. The freebsd-guest vs linux-host data tells me that the cpu speed differences between guest and host environments is probably zero, which is good.

Again, the compilation options for each openssl binary probably played large parts in the performance here. I'm not familiar with SunFreeware's compile options with openssl (the binary I used came from there).

Either way, the point here was not to compare speeds against different platforms, but to in some small way compare cpu performance between host and guest systems. There are too many uncontrolled variables in this experiment to consider it valid, but it is interesting data and put me on another path to learn about why they were different.

My crypto is rusty, but I recall that rsa may need a fair bit of entropy to pick a big prime. Maybe solaris' entropy system is slower than freebsd's or linux's system? This lead me to poke at /dev/random on each system. I wrote a small perl script to read from /dev/random as fast as possible.

host/linux        82 bytes in 5.01 seconds: 16.383394 bytes/sec
guest/solaris10   57200 bytes in 5.01 seconds: 11410.838461 bytes/sec
guest/freebsd6.2  210333696 bytes in 5.01 seconds: 41947398.850271 bytes/sec
I then ran the same test on the host/linux machine while feeding /dev/random on the host from entropy from the freebsd machine:
% ssh jls@teabag 'cat /dev/random' > /dev/random &
% perl devrandom.pl                                  
448 bytes in 5.00 seconds: 89.563136 bytes/sec

# Kill that /dev/random feeder, and now look:
% perl devrandom.pl
61 bytes in 5.01 seconds: 12.185872 bytes/sec
When speed is a often a trade-off for security, are FreeBSD's and Solaris's /dev/random features more insecure than Linux's? Or, is Linux just being dumb?

Googling finds data indicating that /dev/random on linux will block until entropy is available, so let's retry with /dev/urandom instead.

host/linux        29405184 bytes in 5.01 seconds: 5874687.437817 bytes/sec
guest/solaris10   70579600 bytes in 5.00 seconds: 14121588.405586 bytes/sec
guest/freebsd6.2  208445440 bytes in 5.02 seconds: 41502600.216189 bytes/sec
FreeBSD's /dev/urandom is a symlink to /dev/random, so the same throughput appearing here is expected. FreeBSD's still wins by a landslide. Why? Then again, maybe that's not a useful question. How often do you 40mb/sec of random data?

Back at the rsa question - If solaris' random generator is faster than linux in all cases, then why is 'openssl speed rsa' slower on solaris than linux? Compile time differences? Perhaps it's some other system bottleneck I haven't explored yet.

Mysql prepare'd queries aren't cached, ever.

There once was a database named MySQL.

It had a query cache, becuase caching helps performance.

It also had queries you could "prepare" on the server-side, with the hope that your database server can make some smart decisions what to do with a query you're going to execute N times during this session.

I told mysql to enable it's caching and use a magic value of 1gb for memory storage. Much to my surprise, I see the following statistic after testing an application:

mysql> show status like 'Qcache_%';
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| Qcache_free_blocks      | 1          | 
| Qcache_free_memory      | 1073732648 | 
| Qcache_hits             | 0          | 
| Qcache_inserts          | 0          | 
| Qcache_lowmem_prunes    | 0          | 
| Qcache_not_cached       | 814702     | 
| Qcache_queries_in_cache | 0          | 
| Qcache_total_blocks     | 1          | 
+-------------------------+------------+
8 rows in set (0.00 sec)
Why are so many (all!?) of the queries not cached? Surely I must be doing something wrong. Reading the doc on caching explained what I can only understand as a complete lapse of judgement on the part of MySQL developers:
from http://dev.mysql.com/doc/refman/5.0/en/query-cache.html
Note: The query cache is not used for server-side prepared statements. If you're using server-side prepared statements consider that these statement won't be satisfied by the query cache. See Section 22.2.4, C API Prepared Statements.
Any database performance guide anywhere will tell you to use prepared statements. They're useful from both a security and performance perspective.

Security, becuase you feed the prepared query data and it knows what data types to expect, erroring when you pass something invalid. It also will handle strings properly, so you worry less about sql injection. You also get convenience, in that you don't have to escape your data.

Performance, becuase telling the database what you are about to do lets it optimize the query.

This performance is defeated, however, if you want to use caching. So, I've got a dillema! There are two mutually-exclusive (because MySQL sucks) performance-enhancing options available to me: using prepared statements or using caching.

Prepared statements give you two performance benefits (maybe more?). The first, is the server will parse the query string when you prepare it, and execute the "parsed" version whenever you invoke it. This saves parsing time; parsing text is expensive. The second, is that if your database is nice, it will try to optimize your queries before execution. Using prepared statements will permit the server to optimize query execution once, and then remember it. Good, right?

Prepared statements improve CPU utilization, in that the cpu can work less becuase you're teaching the database about what's coming next. Cached query responses improve disk utilization, and depending on implementation should vastly outperform most (all?) of the gains from prepared statements. This assumption I am making is based on the assumption that disk is slow and cpu is fast.

Cached queries will (should?) cache results of complex queries. This means that a select query with multiple, complex joins should be cached mapping the query string to the result. No amount of statement preparation will improve complex queries becuase they still have to hit disk. Large joins require lots of disk access, and therefore are slow. Remembering "This complex query" returned "this happy result" is fast regardless of whether or not it's stored on disk or in memory. Caching also saves cpu utilization.

I can't believe preparing a query will prevent it from being pulled from the query cache, but this is clearly the case. Thanks, MySQL, for making a stupid design decision.

Maybe there's some useful JDBC (oh yeah, the app I'm testing is written in Java) function that'll give you all the convenience/security benefits of prepare, but without the server-side bits, and thus let you use the query cache.