MITRE's CEE is a failure for profit.

I wrote this post a few months ago, but never got around to publishing it.

Anyway, someone mentioned 'project lumberjack' and I found it was based on CEE: Common Event Expression. CEE is a sort of comedic tragedy of design.

The effort is owned by a "non-profit" (MITRE), but the complexity and obfuscation in CEE can only drive towards one thing: consultant profits. I had a go at explaining what I describe in this post on the 'project lumberjack' mailing list, but I did it quite poorly and got a few foot-stomps in response, so I gave up.

CEE is a failure because, while claiming to be a standards effort, it maximizes incompatibility between implementations by doing the following:

  • It poorly describes multiple serialization formats and requires none of them. This ensures maximum incompatibility in event serialization.
  • It defines requirements for log transport protocols but does not describe an actual protocol. This ensures maximum protocol incompatibility.
  • It uses inconsistent naming styles. This ensures confusion.

In general, the goal of CEE sounds like, but is actually not, creating a standard for common event expression. Instead, CEE is aimed at ensuring consulting dollars through obfuscation, complexity, and inconsistency.

Inconsistency.

No consistency in naming. Some names are abbreviations like ‘crit’, some are prefixed abbreviations (a “p_” prefix), and some are full English words like ‘action’.

If the goal was to be inconsistent, CEE has succeeded.

  • Mysterious ‘p_’ prefix. Base fields are abbreviated names like “crit”, “id”, “pri”, yet others are called “p_app”, “p_proc”, and “p_proc_id”.
  • Some fields are abbreviated, like “crit” and “pri”, but others are full English words, like “action” and “domain”.
  • There’s “id”, which identifies the event type, and “uuid”, which identifies the event instance. This is confusing; I’m still not sure what the purpose of ‘uuid’ is.

ETOOMANYPROTOCOLS

  • Serializations: JSON, RFC 3164, RFC 5424, and XML.
  • 4 conformance levels.

Pick one serialization and one set of transport (“conformance”) requirements. Describe those two. Drop all the others.

If I pick JSON, and you pick XML, we can both be CEE-conforming and have zero interoperability between our tools. Nice!

Serialization underspecified

JSON for event serialization is fine, but no message framing is defined. Newline terminated? Length prefixed? You don’t know :(
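For illustration, here's a minimal sketch of one obvious answer: newline-delimited JSON, one event per line, with the fields at the top level. This is my guess at sane framing, not anything CEE actually specifies, and the log sink host and port are made up.

  import json
  import socket

  event = {
      "time": "2012-01-18T05:55:12.4321-05:00",
      "action": "login",
      "status": "ongoing",
      "p_proc": "proc1",
      "p_sys": "host.vendor.com",
  }

  # One JSON object per line; the newline is the frame delimiter.
  sock = socket.create_connection(("logsink.example.com", 5140))
  sock.sendall(json.dumps(event).encode("utf-8") + b"\n")
  sock.close()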

JSON

  • “Reserved Characters” - I don’t think you have read the JSON spec. Most (all?) of the ‘escaping’ detailed in CEE JSON is already specified in JSON: http://www.json.org/string.gif

Specific comments on the ‘json’ format follow inline as comments; the example itself was copied verbatim from CEE:

{
    # Forget this "Event" crap, just move everything up a level.
    "Event": {
        "p_proc": "proc1",
        "p_sys": "host.vendor.com",
        "time": "2012-01-18T05:55:12.4321-05:00",
        "Type": {
            # Action is like, edge-trigger information.
            # Status is like, line-trigger information.
            # You don't usually have both edge and line triggers in the
            # same event. Confusing.
            "action": "login",
            "status": "ongoing",

            # Custom tag values also MUST start with a colon? It's silly to make
            # the common case (custom tags) the longer form.
            # Also, tags is a plural word, but the value is a string. What?
            "tags": ":hipaa"
        },
        "Profile": {
            # This is a *huge* bloat, seriously. Stop making JSON into XML, guys.
            "CustomProfile": {
                "schema": "http://vendor.com/events/cee-profile.xsd",
                "new_field": "a string value",
                "new_val": 1234,
                "product_host": "source.example.com"
            }
        }
    }
}

If you include the schema (CEE calls it a Profile) in every message, you’re just funding companies whose business model relies on you paying them per byte of log transit.

Prior art here for sharing schema can be observed in the Apache Avro project, for example.

CLT Protocol

Just define the protocol, all these requirements and conformance levels make things more confusing and harder to write tooling for.

If you don’t define the protocol, there’s no incentive for vendors to use prior art, and they are just as likely to invent their own protocols.

Combine this incentive to invent with the rest of CEE, which underspecifies just about every feature, and incompatible implementations are guaranteed.

Matplotlib makes me hate.

Let me caveat this rant with the fact that I've only been playing with matplotlib for approximately a week.

All the demos made matplotlib (a python module) look like a great tool that I should want to use to graph things; then I started trying to actually write code and it all went downhill.

Almost all of the functions operate on some mystical global scope, meaning they are by design not threadsafe. Probably not a big deal, I guess, but it certainly feels like an alien world, especially given all the object-oriented code in use in python.

If this culture shock wasn't bad enough, it went ahead and decided to use inches and ratios as the standard units of measure. You make a figure of a set width and height (in inches) and you can put stuff in that figure given ratio offsets. An offset of '.5' would put your left-bound in the middle. Weird and unexpected. Perhaps not bad. Still, I'm used to pixels, not inches.

Some of the arguments are just loony:

  fig.add_subplot(111)
The docs say: "subplot(211) # 2 rows, 1 column, first (upper) plot". A base-10 flag system? What. the. F. I'm at a loss as to why this was ever a good idea. Let's make it hard to add plots? It looks like you can use 'subplot(rows, cols, plotnum)', which is the sensible solution, but all the demos use the integer syntax, and it makes me sad.
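For reference, a sketch of the two equivalent spellings, using the pylab-style functions the demos favor:

  from pylab import figure, subplot

  fig = figure()
  ax_top = subplot(2, 1, 1)     # 2 rows, 1 column, first (upper) plot
  ax_bottom = subplot(2, 1, 2)  # same grid, second plot; equivalent to subplot(212)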

You can't easily put the legend outside the graph.
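There is a workaround I've seen, assuming your matplotlib is new enough to support bbox_to_anchor: anchor the legend outside the axes and shrink the axes so it doesn't get clipped. A sketch:

  from pylab import figure

  fig = figure()
  ax = fig.add_subplot(1, 1, 1)
  ax.plot([1, 2, 3], [4, 5, 6], label="foo")
  # Place the legend just outside the right edge of the axes.
  ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1.0))
  # Make room on the right so the legend survives savefig.
  fig.subplots_adjust(right=0.75)
  fig.savefig("legend-outside.png")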

Setting the default font size means you have to set at least 6 things. Make sure you note the excessive use of different tokens for the same freaking setting: labelsize, titlesize, size, and fontsize.

rc("axes", labelsize=10, titlesize=10)
rc("xtick", labelsize=10)
rc("ytick", labelsize=10)
rc("font", size=10)
rc("legend", fontsize=10)

I have code that looks like this:

  fig = figure()
  p = subplot(111)
  line = p.plot_date(dates, values)
  line[0].set_label("foo")
  legend()
  fig.savefig('foo.png', format='png')
Notice my entertaining leaps between OOP and WTF. Another cute nuance is that the docs/examples are littered with:
  ax = subplot(111)
You might think that the name 'ax' means 'axis' and that subplot returns an axis. No. You might ask python with type() and it would say "<type 'instance'>". Helpful. If you just print ax you'll see it is matplotlib.axes.Subplot. I'm trying hard not to get hung up on semantics, but 'axis' to me is very different from a plot. A plot seems like a visual representation, and an axis is a single dimension of a graph (aka a plot).

After several days of playing with this tool, I am frustrated and disheartened. It has powerful features like tick rules: you can trivially specify "put one major tick every 3rd week". However, the API is half OO, half globally-scoped procedural. Maybe this is my fault. The docs constantly mix 'matplotlib' and 'pylab' methods. Perhaps you can use just the matplotlib functions by themselves and you don't need pylab? Pylab, by the way, is what provides these awkward global functions and in theory only exists as a pure wrapper on top of matplotlib.
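For what it's worth, it does seem possible to skip the pylab globals entirely and stay object-oriented. A rough sketch, with made-up sample data standing in for my dates and values:

  import datetime

  from matplotlib.figure import Figure
  from matplotlib.backends.backend_agg import FigureCanvasAgg

  # Made-up sample data.
  dates = [datetime.date(2008, 1, d) for d in (1, 2, 3)]
  values = [1, 4, 2]

  fig = Figure(figsize=(8, 6))   # size in inches, like it or not
  canvas = FigureCanvasAgg(fig)  # bind a non-interactive backend to the figure
  ax = fig.add_subplot(1, 1, 1)
  line, = ax.plot_date(dates, values, "-")
  line.set_label("foo")
  ax.legend()
  fig.savefig("foo.png", format="png")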

Adventures in mozilla red tape.

So a short while ago I published the tabsearch firefox extension. I thought to myself, "Why not put it up on addons.mozilla.org?"

To publish, you need to submit it to the addons review system. Submitting it puts it in the "sandbox". To leave the sandbox and go public it must be nominated. To pass nomination it must meet a large set of criteria, all of which make some amount of sense with respect to quality assurance, etc.

I've submitted it 3 times. Every time it's been denied for different reasons. The first time was half reasonable, because one of the reasons was "Remove those debugging statements". Other reasons have been:

"Document your preferences"
tabsearch doesn't have any options, preferences, or tweakables
"Your extension must have atleast one review from one of your users"
Do I have a QA team who can review this for me? I thought the reason I was publishing it on mozilla addons was to get users. Seems like an awkward bootstrapping problem I'm not going to bother solving.
"Make the key binding configurable"
That's what keyconfig is for :(
While I entirely agree that quality assurance through a review process is a great and useful idea, I think the firefox addons policies and reviewership group have taken it a bit far. There are only so many revisions I'm willing to do for the sake of publishing somewhere else. So, until I can find more time to throw at getting published at mozilla addons, you can expect to only find tabsearch here.

Benjamin Franklin wrote a blurb about perfection, '"Yes," said the man, "but I think I like a speckled axe best."'. Most of the time, perfection isn't worth the effort when something is already good enough.

I don't mean to discourage people from submitting to mozilla addons, but after 3 attempts it's really not worth it. Basically, the fine, nearly-unwritten print in the policy is that you need real people to have submitted very detailed reviews of your extension before it'll be approved.

Fedora's package manager

-bash-3.1# yum install django
No Match for argument: django
Nothing to do

-bash-3.1# yum install Django
Downloading Packages:
(1/1): Django-0.95.1-1.fc 100% |=========================| 1.5 MB    00:02
Ahh. Clearly.

Mysql prepare'd queries aren't cached, ever.

There once was a database named MySQL.

It had a query cache, because caching helps performance.

It also had queries you could "prepare" on the server side, with the hope that your database server can make some smart decisions about what to do with a query you're going to execute N times during this session.

I told MySQL to enable its caching and use a magic value of 1 GB for memory storage. Much to my surprise, I saw the following statistics after testing an application:

mysql> show status like 'Qcache_%';
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| Qcache_free_blocks      | 1          | 
| Qcache_free_memory      | 1073732648 | 
| Qcache_hits             | 0          | 
| Qcache_inserts          | 0          | 
| Qcache_lowmem_prunes    | 0          | 
| Qcache_not_cached       | 814702     | 
| Qcache_queries_in_cache | 0          | 
| Qcache_total_blocks     | 1          | 
+-------------------------+------------+
8 rows in set (0.00 sec)
Why are so many (all!?) of the queries not cached? Surely I must be doing something wrong. Reading the doc on caching explained what I can only understand as a complete lapse of judgement on the part of MySQL developers:
from http://dev.mysql.com/doc/refman/5.0/en/query-cache.html
Note: The query cache is not used for server-side prepared statements. If you're using server-side prepared statements consider that these statements won't be satisfied by the query cache. See Section 22.2.4, C API Prepared Statements.
Any database performance guide anywhere will tell you to use prepared statements. They're useful from both a security and performance perspective.

Security, because you feed the prepared query data and it knows what data types to expect, erroring when you pass something invalid. It also will handle strings properly, so you worry less about SQL injection. You also get convenience, in that you don't have to escape your data.

Performance, because telling the database what you are about to do lets it optimize the query.

This performance is defeated, however, if you want to use caching. So, I've got a dilemma! There are two mutually exclusive (because MySQL sucks) performance-enhancing options available to me: using prepared statements or using caching.

Prepared statements give you two performance benefits (maybe more?). The first is that the server will parse the query string when you prepare it, and execute the "parsed" version whenever you invoke it. This saves parsing time; parsing text is expensive. The second is that if your database is nice, it will try to optimize your queries before execution. Using prepared statements will permit the server to optimize query execution once, and then remember it. Good, right?

Prepared statements improve CPU utilization, in that the CPU can work less because you're teaching the database about what's coming next. Cached query responses improve disk utilization, and depending on implementation should vastly outperform most (all?) of the gains from prepared statements. This is based on the assumption that disk is slow and CPU is fast.

Cached queries will (should?) cache results of complex queries. This means that a select query with multiple, complex joins should be cached, mapping the query string to the result. No amount of statement preparation will improve complex queries because they still have to hit disk. Large joins require lots of disk access, and therefore are slow. Remembering that "this complex query" returned "this happy result" is fast regardless of whether it's stored on disk or in memory. Caching also saves CPU utilization.

I can't believe preparing a query will prevent it from being pulled from the query cache, but this is clearly the case. Thanks, MySQL, for making a stupid design decision.

Maybe there's some useful JDBC (oh yeah, the app I'm testing is written in Java) function that'll give you all the convenience/security benefits of prepare, but without the server-side bits, and thus let you use the query cache.
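On the Python side, at least, drivers like MySQLdb do (as far as I can tell) the parameter substitution client-side, so the server never sees a PREPARE and you keep the convenience while, presumably, staying eligible for the query cache. A sketch with made-up connection details:

  import MySQLdb

  conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="appdb")
  cursor = conn.cursor()
  # The %s placeholder is substituted (and escaped) by the driver before the
  # query text is sent, so the server sees an ordinary SELECT.
  cursor.execute("SELECT name FROM users WHERE id = %s", (42,))
  rows = cursor.fetchall()
  conn.close()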

One anti-spam effort too easily defeated.

I often see people write their email addresses as "foo at bar dot org" in a hopeful effort to keep spammers from scraping them. Heck, mail archive systems often have (and are deployed with) options to obfuscate email addresses systematically, using the same pattern: foo at bar dot com.

All it does is hurt usability.

Googling for "* at * dot *" clearly shows lots of matches. It also matches all of the following variants, because google searches ignore brackets and such in words:

  • foo at bar dot com
  • foo [at] bar [dot] com
  • foo (at) bar (dot) com
  • ... etc ...
Query, scrape, replace 'at' and 'dot' as desired. I now have 54 million email addresses. What now?
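To underline how little this buys you, the "defeat" is a couple of lines of Python. A sketch, with a deliberately naive regex:

  import re

  # Matches "foo at bar dot com", "foo [at] bar [dot] com", "foo (at) bar (dot) com", etc.
  pattern = re.compile(
      r"(\w+)\s*[\[\(]?\s*at\s*[\]\)]?\s*([\w.-]+)\s*[\[\(]?\s*dot\s*[\]\)]?\s*(\w+)",
      re.IGNORECASE)

  def deobfuscate(text):
      return ["%s@%s.%s" % match.groups() for match in pattern.finditer(text)]

  print(deobfuscate("contact me: foo [at] bar [dot] com"))  # ['foo@bar.com']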

Seems like this effort only serves to have people fool themselves as well as to impede usability. It certainly won't protect you from spam. Why is this method used?

Python's sad xml business and modules vs packages.

So, I've been reading docs on python's xml stuff, hoping there's something simple or comes-default-with-python that'll let me do xpath. Everyone overcomplicates xml processing. I have no idea why. Python seems to have enough alternatives to make dealing with xml less painful.

Standard python docs will lead you astray:

kenya(...ojects/pimp/pimp/controllers) % pydoc xml.dom | wc -l
643
Clearly, the pydoc for "xml.dom" has some nice things, right? I mean, documentation is clearly an indication that THE THING THAT IS DOCUMENTED IS AVAILABLE. Right?

Sounds great. Let's try to use this 'xml.dom' module!

kenya(...ojects/pimp/pimp/controllers) % python -c 'import xml; xml.dom'
Traceback (most recent call last):
  File "", line 1, in ?
AttributeError: 'module' object has no attribute 'dom'
WHAT. THE. HELL.

Googling around, it turns out that 'xml' is a fake module that only actually works if you have the 4Suite modules installed? Maybe?

Why include fake modules that provide complete documentation to modules that do not exist in the standard distribution?

Who's running this ship? I want off. I'll swim if necessary.

As it turns out, I made too strong an assumption about python's affinity towards java-isms. I roughly equated 'import foo' in python with 'import foo.*' in java. That was incorrect. Importing foo doesn't get you access to things in its directory; they have to be imported explicitly.

In summary, 'import xml' gets you nothing. 'import xml.dom' gets you nothing. If you really want minidom's parser, you'll need 'import xml.dom.minidom' or a 'from import' variant.
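For the record, once you know the real module path, minidom itself is easy enough to use. A quick sketch:

  import xml.dom.minidom

  doc = xml.dom.minidom.parseString("<config><name>example</name></config>")
  for node in doc.getElementsByTagName("name"):
      # firstChild is the text node inside <name>
      print(node.firstChild.data)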

On another note, the following surprised me. I had a module, foo/bar.py. I figured 'from foo import *' would grab it. It doesn't, unless the package's __init__ lists it in __all__. This means 'from xml.dom import *' doesn't get you minidom and friends.

Perhaps I was hoping for too much, but maybe it's better to import explicitly. If that's the case, then why make exceptions that allow '*' to be imported only from modules, not packages?

DNS providers suck.

Happy Halloween, folks. It's been 20 days since my last post. I've been incredibly busy with work and haven't had a chance to write. As a gift, I give you a rant.

I've been through no less than 3 DNS service providers in the past week, and all of them suck. They suck hard.

The first one I looked at was no-ip. No-IP claims they support 'dynamic dns' - they don't. The first thing you must realize about almost all dns providers is that while they claim they support "dynamic dns" and/or "round robin," what they really mean is their support of 'dynamic dns' is based solely around one single use case. One.

What is that use case? The following picture comes from dynu.com:

What is this? It's the use case of one computer updating its own hostname with whatever IP it happens to have at that moment. Businesses can't possibly find this useful. It doesn't scale. If you have more than one server you want to put on a single hostname, this use case fails you miserably.

I've looked at no-ip, dyndns, dnspark, and several others. Trash.

Keep in mind, this rant is because both free AND pay-for dns providers suck. Both kinds. Free services actually have an excuse - you get what you pay for.

As a precursor, let me explain what I need from a dns provider:

  1. The ability to add and remove dns entries of any record type, at any time.
  2. The ability to add multiple entries for the same record
Many claim these features. Those I tried fail miserably.

If you are in the market for a real dns provider, as I am, you'll find many dns providers claiming what I listed above. "Sure! We support round robin!" they advertise, "We support dynamic dns!"

What they don't tell you in the same paragraph is that you have to use their own HTTP-based means of pushing dns changes. They absolutely don't tell you that their pathetic attempt at providing this "dynamic" service via a cgi-like interface is absolutely crippled.

Several providers allowed you to mutate records dynamically. However, none of them I tried let me add multiple entries for a single record using the dynamic interface.

An important realization is that my definition of dynamic is not the same as these dns providers' notion of dynamic. This so-called dynamic dns ability hinges on customers who want to be able to host crap out of their dynamic-ip-giving ISP. As such, most of the interface is just "Hey DNS provider! Please update www.foo.com with whatever IP this packet is coming from! Thanks!" This is intolerable!

What is my definition of "dynamic dns," exactly? Let's call it RFC 2136. Heck, I don't care if it's not RFC 2136, just that I'm able to do most things that update specification provides.
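For reference, an RFC 2136 update is only a handful of lines from the client's point of view. A sketch using the dnspython library; the zone, server address, and TSIG key are all made up:

  import dns.query
  import dns.tsigkeyring
  import dns.update

  keyring = dns.tsigkeyring.from_text({"mykey": "c3VwZXJzZWNyZXRrZXk="})

  update = dns.update.Update("example.com", keyring=keyring, keyname="mykey")
  # Replace any existing A records for www.example.com with 192.0.2.10, TTL 300.
  update.replace("www", 300, "A", "192.0.2.10")

  response = dns.query.tcp(update, "198.51.100.53")
  print(response.rcode())  # 0 (NOERROR) on success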

To quote ZoneEdit customer support regarding my issues with their service and in particular how to properly use their crippled dynamic update interface:

"You can atleast update hourly .

Updating too often with the same IP address gets your account locked up."

WHAT?! Once hourly? Shit. DNS is hard. Let's go shopping instead.

Doing this right is not hard. For example, I recently posted an article on how to set up dynamic dns and make your dhcp server talk sweetly to dns. I use this same configuration in my apartment. MY APARTMENT. My apartment is considerably smaller than, say, a multidatacenter dns provider. Why doesn't anyone at any of these dns providers have a freaking clue about running a dns server? Let me put it plainly:

I will give you money and you will give me a dnssec key and a server on which to use it. That shall be the extent of our relationship.
That's all I want. The worst part is that it doesn't matter who you go with. There are plenty of free dns providers who provide you the same crappy service as give-us-your-money providers.

Really. Come on kids.

Look at it this way - to enable dynamic dns updates, you don't need to write any code, just a few tiny named.conf changes. Providing a pathetic http interface you label as "dynamic dns" requires lots of lines of code, lots of testing, and $$$ invested in this kind of product.

To further show how stupid this is: Microsoft supports this properly. Microsoft. You know, that company everyone hates on for proprietary protocols and ignorance of standards? Microsoft DNS will send updates using BIND's update protocol. How do I know this? I've had a primary dns server running BIND and Microsoft DNS running as a secondary. I told Active Directory that its primary dns was the BIND server. Guess what happened? Active Directory happily submitted updates to my BIND server. Correctly.

You might be thinking to yourself, "Why don't you just host dns yourself?" Because I don't have any servers on a static IP address. And no, this isn't running out of my apartment.

Am I the only one who can't find a dns provider that doesn't suck?

Forbes.com sucks. Here's one reason why.

I followed a webclip link out of gmail today and it dropped me off at a news story on Forbes.com. I wanted to read this story. However, I was presented with something horrific. I was presented with the results of a tragic effort that I can only presume is a scheme to show as many "punch the monkey" advertisements as possible.

What is this scheme? Well. I landed on the page. This page had two average-length paragraphs. No sooner had I finished reading the first paragraph than the page reloaded and showed me another, new piece of text.

Six seconds later. A new page.

Repeat.

Turns out Forbes.com has some sort of slideshow they try to use to display stories. To make matters worse, there are advertisements everywhere. By the time I figured out what part of the page I was supposed to be looking at, it went to the next page. Sure, you can stop the slideshow, but I only found that out afterwards.

Thanks Forbes. I almost read one of your stories.

Clicky for an example article

Thumbnail screenshot of the page follows. Enjoy the massive amount of whitespace and adspace.

Python is getting on my nerves

Add lacking dynamic assignment ability to my "I wish Python had Foo" list.

Python does not appear to have dynamically assignable arrays. Where are we, C? Assembly? When I assign past the end of the array, I mean resize the god damned array. Thanks.

nightfall(~) % python -c "foo = []; foo[3] = 234"
Traceback (most recent call last):
  File "<string>", line 1, in ?
IndexError: list assignment index out of range
This is completely unacceptable. Sure, I can use list comprehensions to make an N element array that's empty:
foo = [None for x in range(100)]
foo[44] = "Hi"
That only gets me an array with 100 empty elements. Uh... not what I want. If I did this on an array with data in it that I didn't want to lose, I'd lose all the data.
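A small helper gets the behavior I actually want - grow the list on demand without clobbering existing data. A sketch:

  def set_at(lst, index, value, fill=None):
      # Assign lst[index] = value, padding with `fill` if the list is too short.
      if index >= len(lst):
          lst.extend([fill] * (index + 1 - len(lst)))
      lst[index] = value

  foo = ["keep", "this"]
  set_at(foo, 4, "Hi")
  # foo is now ['keep', 'this', None, None, 'Hi'] - nothing lost.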

Sigh...