Insist on better asserts

I never really liked C's assert() feature. If an assertion is violated, it'll tell you what assertion failed but completely lacks any context:
example: example.c:9: main: Assertion `i == 3' failed.
This is better:
Assertion failed insist.c:7 in main(), insist(i == 3): Something went wrong, wanted i == 3, got 4
The main difference here is that there's context about what failed - a message for humans looking to debug this. This is especially important on Linux these days, because every distro I've used recently hates sysadmins and hates debugging: all libraries are stripped of debug symbols and coredumps are disabled by default.

What's the usage look like?

#include "insist.h"

int main() {
  int i = 4;
  //assert(i == 3);
  insist(i == 3, "Something went wrong, wanted i == 3, got %d", i);
  return 0;
}
I also added a special 'return' version of this, 'insist_return', that lets you do error checking and early aborting like this:
insist_return(fd >= 0, START_FAILURE,
              "socket() returned %d, an error: %s", fd, strerror(errno));
Works just like insist(), except it returns START_FAILURE if 'fd >= 0' is false, and additionally logs the error formatted above.

Code here: insist.h

Introducing: Ruby Minstrel - a method instrumenter

The following tools are awesome: strace, truss, ltrace, dtrace, and systemtap.

Sometimes I'm trying to debug a ruby library or application, and I end up monkeypatching things just to see what arguments are being passed, as a way of sanity checking configuration or correctness. Other times I want to profile the time spent only in a certain class or method. At a basic level, I'm looking for a simple ltrace equivalent in ruby.

Enter minstrel. There may be projects out there already that do this, but I don't know of one, so it got written tonight.

You can 'gem install minstrel' to get it (or here (rubygems) and here (github))

My standard path of debugging (without other options) is to sanity check my code and then dive into the code for whatever app/library I am using. It often requires root access to modify ruby libs on the system, which sucks for one-off debugging. Writing up monkeypatches guessing at methods that I should inspect is error prone and sucky. Monkeypatching for debugging is common for me and is about as efficient/productive as using LD_PRELOAD to hack in my own method calls. (see liboverride). It sucks.

Minstrel is better than the usual bag of hope+monkeypatch+time. For example, let's use minstrel to debug why mcollective's 'mc-ping' is failing:

snack(~) % /usr/sbin/mc-ping
connect failed: Connection refused - connect(2) will retry in 5
Ok, connection refused. To what? Yes, I could use strace or tcpdump to debug this particular issue. But that's not the point, here. After looking at the mcollective code for a few minutes I came up with a few classes I want to instrument.
snack(~) % RUBY_INSTRUMENT="MCollective::Connector::Stomp,Stomp::Connection,TCPSocket" minstrel /usr/sbin/mc-ping
Wrap of TCPSocket successful
Wrap of Stomp::Connection successful
Wrap of MCollective::Connector::Stomp successful
enter MCollective::Connector::Stomp#connect([])
enter Stomp::Connection#socket([])
class_enter TCPSocket#open(["localhost", 6163])
class_exit_exception TCPSocket#open(["localhost", 6163])
connect failed: Connection refused - connect(2) will retry in 5

Debugging java threads with top(1) and jstack.

At work, we're testing some new real-time bidding on ADX and are working through some performance issues.

Our server is jetty + our code, and we're seeing performance problems at relatively low QPS.

The first profiling attempt used YJP, but when it was attached to our server, the system load went up quickly and looked like this:

load average: 2671.04, 1653.95, 771.93
Not good; the load average while running with the profiler attached jumps to a number roughly equal to the number of application threads (3000 jetty threads). Possible PEBCAK here, but needless to say we couldn't use YJP for any meaningful profiling.

Stepping back and using some simpler tools proved to be much more useful. The standard system tool "top" can, in most cases, show per-thread stats instead of just per-process stats. Unfortunately, top(1) doesn't seem to have a "parent pid" option, but luckily for us, our java process was the only one running as the user 'rfi'. Further, some top implementations can take a single process (top -p PID), which would be useful here.
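As a concrete example with procps on Linux (flags vary between top and ps implementations), ps can produce a similar per-thread view non-interactively. Here it inspects the current shell rather than an assumed java pid, so the command runs anywhere:

```shell
# -L lists every thread (LWP) of the process with its own cumulative
# cpu time; substitute your java pid for $$.
ps -L -o lwp,pcpu,time,comm -p $$
```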

If we make top output threads and sort by cpu time, we get an idea of the threads that spend the most time being busy. Here's a sample:

top - 22:29:34 up 87 days, 21:45,  9 users,  load average: 10.31, 12.58, 8.45
Tasks: 3237 total,   5 running, 3231 sleeping,   0 stopped,   1 zombie
Cpu(s): 53.9%us,  2.6%sy,  0.0%ni, 42.3%id,  0.0%wa,  0.0%hi,  1.1%si,  0.0%st
Mem:  16432032k total, 10958512k used,  5473520k free,   149100k buffers
Swap: 18474740k total,    77556k used, 18397184k free,  5321140k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM      TIME+   TIME COMMAND
32149 rfi       16   0 13.7g 6.8g  13m S 50.7 43.3 134:38.08 134:38 java
28993 rfi       15   0 13.7g 6.8g  13m S  5.5 43.3  13:26.85  13:26 java
28995 rfi       16   0 13.7g 6.8g  13m S  3.6 43.3   8:17.84   8:17 java
32151 rfi       15   0 13.7g 6.8g  13m S  1.9 43.3   5:47.59   5:47 java
29143 rfi       16   0 13.7g 6.8g  13m S  0.0 43.3   4:18.51   4:18 java
28990 rfi       15   0 13.7g 6.8g  13m S  1.0 43.3   3:58.86   3:58 java
28989 rfi       15   0 13.7g 6.8g  13m S  0.6 43.3   3:58.83   3:58 java
28992 rfi       15   0 13.7g 6.8g  13m S  1.0 43.3   3:58.72   3:58 java
28986 rfi       16   0 13.7g 6.8g  13m S  1.0 43.3   3:58.00   3:58 java
28988 rfi       16   0 13.7g 6.8g  13m S  0.6 43.3   3:57.94   3:57 java
28987 rfi       15   0 13.7g 6.8g  13m S  0.6 43.3   3:56.89   3:56 java
28991 rfi       15   0 13.7g 6.8g  13m S  1.0 43.3   3:56.71   3:56 java
28985 rfi       15   0 13.7g 6.8g  13m S  1.3 43.3   3:55.79   3:55 java
28994 rfi       15   0 13.7g 6.8g  13m S  1.0 43.3   2:05.35   2:05 java
29141 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   1:09.66   1:09 java
32082 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:56.29   0:56 java
28998 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:52.27   0:52 java
29359 rfi       15   0 13.7g 6.8g  13m S  3.2 43.3   0:45.11   0:45 java
31920 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:43.46   0:43 java
31863 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:41.67   0:41 java
31797 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:41.56   0:41 java
31777 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:41.31   0:41 java
30807 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:41.22   0:41 java
31222 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:40.70   0:40 java
30622 rfi       15   0 13.7g 6.8g  13m S  3.9 43.3   0:40.56   0:40 java
31880 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:39.77   0:39 java
29819 rfi       15   0 13.7g 6.8g  13m S  4.8 43.3   0:39.15   0:39 java
29438 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:39.11   0:39 java
30363 rfi       15   0 13.7g 6.8g  13m R  0.0 43.3   0:39.03   0:39 java
31554 rfi       15   0 13.7g 6.8g  13m S  0.0 43.3   0:38.82   0:38 java

I included lots of threads in the output above so you can get an idea, visually, of the time buckets you could place each thread in.

The important column here is "TIME+". Threads with similar cumulative times probably have similar duties (all of the near-4-hour threads are likely doing the same kind of work, etc).

If you grab a stack trace from your java process, you can map each process id above to a specific thread. You can get a stack by using jstack(1) or sending SIGQUIT to the java process and reading the output.

Java stack traces, for each thread, have similar headers:

"thread name" prio=PRIORITY tid=THREADID nid=LWPID STATE ...
The field we care about here is the nid: the LWPID (light-weight process id), which is the same per-thread process id that top shows.

Find my java process (must be run as the user running the process):

% sudo -u rfi jps
28982 ServerMain
3041 Jps
If I take the 2nd highest pid from the top output above, 28993, what thread does this belong to? If you look at the stack trace, you'll see that java outputs the 'nid' field in hex, so you'll have to convert your pid to hex first.
% printf "0x%x\n" 28993
0x7141

% grep nid=0x7141 stacktrace
"VM Thread" prio=10 tid=0x000000005e6d9000 nid=0x7141 runnable 

# What about the first pid?
% printf "0x%x\n" 32149
0x7d95

% grep -A3 nid=0x7d95 stacktrace
2010-06-22 03:15:26.489485500 "[email protected] Thread Pool-2999 - Acceptor0 [email protected]:2080" prio=10 tid=0x00002aad3dd43000 nid=0x7d95 runnable [0x00002aadc62d9000]
2010-06-22 03:15:26.489507500    java.lang.Thread.State: RUNNABLE
2010-06-22 03:15:26.489512500   at Method)
2010-06-22 03:15:26.489518500   at
From the above output, I can conclude that the most cpu-intensive threads in my java process are the Jetty socket acceptor thread and the "VM Thread". A red flag should pop up in your brain if a 4-hour-old process has spent 2.25 hours spinning in accept().

FYI, googling for "VM Thread" and various java keywords doesn't seem to find much documentation, though most folks claim it has GC and other maintenance-related duties. I am not totally sure what it does.

In fact, if you look at the top 10 pids, none of them are actually anything that runs the application code - they are all management threads (thread pool, socket listener, garbage collection) outside of my application.

32149 [email protected] Thread Pool-2999 - Acceptor0 [email protected]:2080
28993 VM Thread
28995 Finalizer
32151 pool-7-thread-1
29143 pool-3-thread-1
28990 GC task thread#5 (ParallelGC)
28989 GC task thread#4 (ParallelGC)
28992 GC task thread#7 (ParallelGC)
28986 GC task thread#1 (ParallelGC)
28988 GC task thread#3 (ParallelGC)
To be fair to the rest of this app, there are 3000 worker threads that each have around 40 minutes (or less) of cpu time, so the application is doing useful work. Still, it's important to observe where you are wasting time - in this case, the jetty accept thread. The GC-related data may be an indication that garbage collection could be tuned or avoided (through thread-local storage, etc).

Lastly, since each of these ids is an LWP id, you can use other standard tools (strace/truss, systemtap/dtrace, etc) on them.

Knowing the accept thread is consuming lots of time should make you run tcpdump if you haven't already, and in this case I see a few hundred new connections per second, which isn't very high and leads us further down the road of debugging.

View http headers, or, how lwp-request sucks

I normally use HEAD(1) to view HTTP response headers. However, when developing web applications outside of Apache or any other standard web server, it's likely that your app won't immediately support HEAD requests. This is a problem if all you want to do is see the http headers.

Perl's libwww comes with a tool called lwp-request, which is installed with symlinks GET, HEAD, etc. You can use GET and have it show only the HTTP headers. Or can you?

The response message I am looking for is this:

HTTP/1.0 200 OK
Server: PasteWSGIServer/0.5 Python/2.4.3
Date: Tue, 19 Dec 2006 22:42:20 GMT
content-type: text/html; charset=UTF-8
Connection: close
The response I get using GET -sed is:
200 OK
Connection: close
Date: Tue, 19 Dec 2006 22:43:16 GMT
Server: PasteWSGIServer/0.5 Python/2.4.3
Content-Type: text/html; charset=UTF-8
Client-Date: Tue, 19 Dec 2006 22:43:16 GMT
Client-Response-Num: 1
Link: </css/style.css>; media="all"; rel="stylesheet"; type="text/css"
Link: </css/tabs.css>; media="all"; rel="stylesheet"; type="text/css"
Link: </css/tabs-ie.css>; media="all"; rel="stylesheet"; type="text/css"
Title: pimp: music for you.
What? LWP appears to be doing some html parsing. :(

So, a solution using netcat, which is trash, because lwp should just not be stupid by default. All I want are http headers. Here's my netcat solution:

% echo "GET / HTTP/1.0\r\n\r\n" | nc localhost 5000  | col | sed -e '/^$/q'
HTTP/1.0 200 OK
Server: PasteWSGIServer/0.5 Python/2.4.3
Date: Tue, 19 Dec 2006 23:00:12 GMT
content-type: text/html; charset=UTF-8
Connection: close
Oh look! The headers I wanted. I hate you, LWP.


Update: curl seems to be the best solution, here.

% curl -sS -D - -o /dev/null
HTTP/1.1 200 OK
Date: Fri, 16 Jul 2010 21:34:01 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=d8703167b9ea01a8:TM=1279316041:LM=1279316041:S=j5mVVrOv81lHeqpa; expires=Sun, 15-Jul-2012 21:34:01 GMT; path=/;
Set-Cookie: NID=36=plIkUp9cvYQ5Fc6VqIQxBDjjqS_VP1F1EohF0FBv2wW-phBuNTxqm0vduncInbp-MgLd2aikmgRu8JZlZmNnv6LcymNEfsxh33xyUaG8avqwaI8MkAYT6XiiWbxLXnAW; expires=Sat, 15-Jan-2011 21:34:01 GMT; path=/;; HttpOnly
Server: gws
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked