When all you have is a hammer, make your own tools?

Clarifying my position from this post:
The "ops folks need coding skills" groupthink is lame. Software requires extra coding because it is shitty, not because people are unskilled

I will lead with this: I want more people who use technology to grow and learn better skills for bending that technology to their needs. An ops guy with programming skills is, to me, more valuable than one who can't program - programming in any language or platform lets you extend an otherwise static system.

Anyway, back at the post in question, I'm not trying to say people (ops or otherwise) shouldn't want stronger programming skills. I'm saying the equipment we use is pretty shitty.

I am part of the generation raised near devices ever blinking "12:00". Devices which have no business caring what time it is, nor any sane reason to make the state of "I don't know what time it is" a high priority alert worth blinking forever.

It's 2012, and this problem persists - my microwave refuses to cook food unless it has the time *and* date from user input. Now I have to program it every time it has a power disruption (which it has frequently, due to some hardware bug that makes it power off randomly with certain dishes at home).

Now I have to learn to program or configure these devices before they'll stop irritating me. And, damn it, I hate that. If, instead, this were enterprise software, I could report these irritations to the vendor, who would kindly offer me training and consulting for extortionate piles of money.

I love coding. It's fun, and many times lets me solve problems I couldn't otherwise. Allowing me to abuse an analogy, "When all you have is a hammer, you can sit down and build whatever tool you need to repair the delusion that everything is a nail."

But despite being able to solve my own problems in software, I don't think this is a great pattern of work. I write code, most of the time, because the solutions available are terrible or don't meet my requirements. With new software popping up every day, I see a strong correlation between software availability and people asking for more programmers.

So, the more software we have, the more programmers we need to work around limitations in the available body of software. I think that's pretty lame :(

And regarding my microwave problems, I want some confidence that the problems being solved are meaningful ones - not programming learned just to work around bugs and misfeatures in software we're suffering with.

Introducing FPM - Effing Package Management

Having become fed up with dealing with rpmbuild, spec files, debian control files, dh_make, debuild, and the whole lot, I automated my way back to sanity.

The result is a tool I call "fpm" which aims to help you make and mangle packages however you choose, all (ideally) without having to care about the internals of your particular native package format.

The goal of this project is not to undermine upstream packaging but to grant everyone the ability to trivially build and edit packages. Why? Not all software is packaged. Not all software of the version you want is packaged. And further, not all users are willing or able to take the time to learn all the ins and outs of their package build tools.

For example, you can package up your /etc/init.d directory as an RPM by simply doing this:

% fpm -s dir -t rpm -n myinitfiles -v 1.0 /etc/init.d
...
Created /home/jls/rpm/myinitfiles-1.0.x86_64.rpm
fpm will create a simple package for you and put it in your current directory. The result:
% rpm -qp myinitfiles-1.0.x86_64.rpm -l
/etc/init.d
/etc/init.d/.legacy-bootordering
/etc/init.d/NetworkManager.dpkg-backup
...

% rpm -qp myinitfiles-1.0.x86_64.rpm --provides
myinitfiles = 1.0-1
% rpm -qp myinitfiles-1.0.x86_64.rpm --requires
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(CompressedFileNames) <= 3.0.4-1
You can package up any directory. But there's more.

Above, I didn't specify a package summary, so how about fixing the rpm to include one? You can use RPMs as the source (-s flag) in fpm. There's also a helpful '-e' (--edit) flag that'll let you edit the rpm spec (or debian control) file before building.

% rpm -qp myinitfiles-1.0.x86_64.rpm --info | grep Summary
Summary     : no summary given

% fpm -s rpm -t rpm -e myinitfiles-1.0.x86_64.rpm
... this opens up $EDITOR so you can edit the spec file it generated ...
... make some changes to the spec, including adding a proper 'Summary' ...
Created /home/jls/rpm/myinitfiles-1.0-1.x86_64.rpm

% rpm -qp myinitfiles-1.0-1.x86_64.rpm --info | grep Summary
Summary     : my /etc/init.d directory
The '-s dir' flag says the source of the package is a directory. There's also support for other package sources like rubygems, other rpms, debs, and more on the way.
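For example, turning a downloaded rubygem into an rpm is (roughly) a one-liner. This is just a sketch - the gem file name below is made up:

% fpm -s gem -t rpm json-1.4.6.gem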

With FPM, you can specify dependencies, architecture, maintainer, etc. All from a simple command line, and never forcing you to learn the pain and suffering that can come with rpm spec files or debian package building.
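For instance, a dir-to-rpm build that also sets the architecture, a maintainer, and a couple of dependencies might look roughly like this (a sketch - the package name, path, and maintainer below are made up; see 'fpm --help' for the full flag list):

% fpm -s dir -t rpm -n myapp -v 1.0 \
    -a x86_64 -m "you <you@example.com>" \
    -d "ruby >= 1.8.7" -d "rubygems" \
    /opt/myapp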

You can install fpm with: gem install fpm

The project page is here: https://github.com/jordansissel/fpm

The wiki is here (has more examples): https://github.com/jordansissel/fpm/wiki

SysAdvent 2010 now online!

Today starts the 3rd year of the SysAdvent calendar! How time flies!

What is SysAdvent? It's a 24-day event where I publish one excellent sysadmin article each day, starting December 1st. The articles are written by fellow sysadmins around the world.

Planning for this year has been quite a success. Many folks have contributed finished articles or drafts already and more have committed to writing about a topic. This is a huge step for the SysAdvent project.

SysAdvent is for sysadmins, so please share sysadvent with your coworkers, reddit, digg, twitter, and any other places with sysadmin communities.

The first article for this year is about Linux Containers (LXC). Go, read! Add sysadvent to your rss reader, too :)

Also, there are still 50ish articles from the past two years, all quite good. Have a look at the 2008 and 2009 calendars, too!

Puppet Trick - Exported Resource Expiration

I've finally taken the plunge with puppet's exported resources.

"Exported resources" is a feature of puppet that allows your nodes to export resources to other nodes. You can read more about this feature on puppet's exported resources documentation. Covering how to setup exported resources or storeconfigs is out of scope, but if you need help read the docs and come to #puppet on freenode IRC.

Exported resources are pretty cool, but they lack one important feature - expiration/purging. The storeconfigs database has no idea about nodes that you have decommissioned or repurposed, so it's very possible to leave exported resources orphaned in your database.

I worked around this by making my resources such that I can expire them. This is done with a custom define that has a 'timestamp' parameter defaulting to the current time, so the timestamp is refreshed each time the node exports the resource. If a node has not checked in (and updated its resources) recently, I consider its resources expired and purge them.

I made a demo of this and put the code on github: jordansissel/puppet-examples/exported-expiration. More details (and example output of multiple runs with expiration) are available in the README.

The demo is runnable by itself (standalone, no puppet master), so you can test it without needing to mess with your own puppet installations.

Puppet Camp San Francisco 2010

Another puppet camp has come and gone, and I'm certainly glad I went. Puppet, the surrounding ecosystem, and its community have grown quickly since last year.

The conference was the same format as last year. The morning was single-track presentations from various puppet users, and the afternoon was openspace/barcamp-style break out sessions. It was good to see some old faces and also to finally put faces to IRC and twitter names.

One of the bigger announcements was that mcollective would join the Puppet project. Other announcements included new employees and other good news. Beyond that, I picked up a few tricks and learned more about the puppet roadmap.

In no particular order - some thoughts and notes.

Facter 2.0 will be good. Take the lessons learned from Facter 1.x and improve things - make the DSL for writing facts simpler, add structured data, add caching, etc.

Puppet supports a "config_version" option that specifies a script that will override how the version of a given catalog is determined. Useful for tagging based on revision control or deployment versions.

Scoped defaults such as 'File { owner => root }' apply downwards in all cases, something I hadn't considered before. That is, if class 'foo' defines a default and also includes class 'bar', the default in foo applies to bar as well. This was new to me, and I will be cleaning up some of my manifests as a result (I use defaults in some classes but not others). Best practice here is to either use no class-specific defaults or use class-specific defaults in every class.

Twitter operations (John Adams) gave a talk covering their automation/puppet stuff. John talked about problems with sysadmins trying to hack around puppet by using chattr +i to prevent puppet from modifying certain files - a practice they heavily discouraged. He also mentioned problems with poor cron scheduling and presented the usual sleep $(($RANDOM % 600))-style solution. I didn't get around to sharing my cron practices (sysadvent) with John before the end of the con. He also mentioned having problems with home directory syncing using puppet, which was another problem I'd covered and better solved previously on sysadvent.

During some downtime at the conference, I started working on an ssh key authorization module for mcollective. The ruby ssh key code is here and the mcollective fork with the sshkey security plugin is here. It works pretty well:

snack(~) % sudo grep security /etc/mcollective/{server,client}.cfg
/etc/mcollective/server.cfg:securityprovider = sshkey
/etc/mcollective/client.cfg:securityprovider = sshkey
snack(~) % mc-ping                                         
snack.home                               time=97.81 ms
The gist of the key signing pieces: your ssh agent signs requests, authenticating you as a user, and the server signs responses with its own ssh host key (like /etc/ssh/ssh_host_rsa_key). Validation of you as a user is done through your authorized_keys file, and validation of the reply uses your known_hosts file to verify the host signature.

It was a good conference, though I would've enjoyed a more hackathon-style atmosphere. We tried to do a facter hackathon, but there wasn't enough time, so instead we code reviewed some of the sillier parts of facter and talked about the future.

Random thoughts: Log analytics with open source

Over the past few years, I've tinkered on and off with various projects to help me do log analysis, data aggregation, graphing, etc. Recently, I had a discussion with a coworker about alternatives to Splunk (specifically, free ones). Turns out there aren't any projects, as far as I can tell, that provide most of what Splunk does.

With all the awesome open source projects available to date that focus on tight features and perform well, how much work would it be to tie them together and produce a tool that's able to compete with Splunk?

I hooked grok and Lucene together last night to parse and index logs, and the results were pretty slick. I could query for any keyword I wanted, etc. If I wanted logs involving specific fields like IP address, apache response code, etc, I could do it. Grok does the hard part of eating a log line and outputting key:value pairs while Lucene does the hard part of indexing field values and such.

Indexing logs in Lucene required using it in a somewhat strange way: We treat every log entry as a unique document. This way, each log line can have several key:value pairs (fields) associated with it, and searching becomes easy.

  • Log parsing: grok and other tools have this done.
  • Log indexing: lucene
  • On-demand graph tools: python matplotlib, javascript flot, etc
  • Alerting: nagios
  • Fancy web interface: Ruby on Rails, or whatever
Indexing non-log data, such as SNMP queries, only requires feeding Lucene the right data.

The hard part, from an implementation perspective, is only as hard as taking output (logs, data, whatever) and feeding your indexer with the fields you want to store.

Parsing all kinds of log formats isn't a trivial task, since different log formats will require new pattern matching. However, grok's automatic pattern discovery could be used to help fill in gaps where you may not yet have defined patterns.

Given time and energy, I might pursue this project.

Sysadmin Advent Calendar is alive

I've been procrastinating like a champion while avoiding writing advent articles. We've pulled a decent set of ideas together on the mailing list. As if to prove to myself that procrastination is awesome, I've written the first day of the sysadmin advent calendar at the last minute - I hope it doesn't disappoint:

http://sysadvent.blogspot.com/

As you can see, I also slacked on making a proper advent calendar interface. Volunteer writers (or designers, if you're willing to hack up something on short notice) are totally welcome. Shoot me an email ([email protected])

Grok plans and ideas

I'm almost done and about to graduate from RIT, which is why I haven't posted about anything recently. Finally done with this school. Wee!

Some noise woke me up from my pleasant slumber, and I can't seem to get back to sleep, so I might as well do some thinking.

I think grok's config syntax is far too cluttered. I would much rather simplify it, somehow. Something like:

file /var/log/auth.log: syslog
{blockuser, t=3, i=60} Invalid user %USER% from %IP%

file /var/log/messages: syslog
{tracksudo, prog=su} BAD SU %USER:F% to %USER:T% on %TTY%

reaction blockuser "pfctl -t naughty -T add %IP%"
reaction tracksudo "echo '%USER:F% failed su. (to %USER:T%)' >> /var/log/su.log"
Seems like it's less writing than the existing version. Less writing is good. A more relaxed syntax would be nicer as well - such as not requiring quotes around filenames.

This new syntax also splits reaction definitions from pattern matches, allowing you to call reactions from different files and other matches. I'll be keeping the perlblock and other features that reactions have, because they are quite useful.

I've also come up with a simple solution for reactions that execute shell commands. The current version of grok will run system("your command") every time a shell reaction fires. This is tragically suboptimal due to startup overhead. The simple solution is to run "/bin/sh -" so that there is already a shell accepting standard input and waiting for my commands, then simply write each shell command to that program.
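The shape of the trick, sketched in plain shell (this isn't grok's code, just the idea of one long-lived shell fed commands over a pipe):

# start one persistent shell instead of forking a new one per reaction
mkfifo /tmp/reactions
/bin/sh < /tmp/reactions &      # the single, long-lived shell
exec 3> /tmp/reactions          # hold the write end open

echo "echo 'reaction one fired'" >&3
echo "echo 'reaction two fired'" >&3

exec 3>&-                       # closing the pipe sends EOF; the shell exits
wait
rm /tmp/reactions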

I wrote a short program to test the speed of using system() vs printing to the input of a shell. You can view the testsh.pl script and the profiled output.

An abbreviated form of the profiled output follows:

%Time    Sec.     #calls   sec/call  F  name
92.24    1.5127     1000   0.001513     main::sys
 2.04    0.0335     1000   0.000033     main::sh
sys() calls system(), whereas sh() simply sends data to an existing shell process. The results speak for themselves, even though this example only runs "echo" - the startup time of the shell is obviously a huge factor in runtime. The difference is incredible. I am definitely implementing this in the next version of grok. I've already run into many cases where I am processing extremely large logs post-hoc and end up using perl block reactions to speed up execution. This shell execution speed-up will help make grok even faster, and it can always use more speed.

Still no time to work on projects right now, perhaps soon! I'm moving to California in less than a week, so this will have to wait until after the move.

Parallelization with /bin/sh

I have 89 log files. The average file size is 100ish megs. I want to parse all of the logs into something else useful. Processing 9.1 gigs of logs is not my idea of a good time, nor is it a good application for a single CPU to handle. Let's parallelize it.

I abuse /bin/sh's ability to background processes and wait for children to finish. I have a script that can take a pool of available computers and send tasks to them. These tasks are just "process this apache log" - but the speed increase of parallelization over a single process is incredible and very simple to achieve in the shell.

The script to perform this parallelization is here: parallelize.sh

I define a list of hosts to use in the script and pass a list of logs to process on the command line. The host list is repeated until it is longer than the list of logs. I then pick a log and send it off to a server to process using ssh, which calls a script that outputs to stdout. Output is captured to a file named after the hostname and the pid.
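The core of the approach looks roughly like this (a sketch, not the actual parallelize.sh; the host pool and the remote 'process-log.sh' script are placeholders):

#!/bin/sh
HOSTS="web1 web2 web3"              # placeholder pool of worker hosts
NHOSTS=$(echo $HOSTS | wc -w)

i=0
for log in "$@"; do
  # round-robin host selection
  host=$(echo $HOSTS | awk -v n=$(( i % NHOSTS + 1 )) '{ print $n }')
  # ship the log over ssh, run the (placeholder) parser remotely, and
  # capture its stdout locally, one output file per job
  ssh "$host" ./process-log.sh < "$log" > "out.$host.$i" &
  i=$(( i + 1 ))
done

wait    # block until every backgrounded ssh job finishes
echo "processed $i logs"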

I didn't run it single-process in full to compare running times; however, parallel execution gets *much* farther in 10 minutes than a single process does. Sweet :)

Some of the log files are *enormous* - taking up 1 gig alone. I'm experimenting with split(1) to break these files into 100,000-line pieces. The problem is that all of the tasks finish except for the 4 processes handling the 1-gig log files. Splitting will make the individual jobs smaller, letting us process them faster because the workload is spread more evenly across processes.
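Something like this (the file name is made up):

% split -l 100000 huge-access.log huge-access.log.part.

That produces huge-access.log.part.aa, .ab, .ac, and so on, each small enough to spread evenly across the worker pool.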

So, a simple application benefiting from parallelization is solved by using simple, standard tools. Sexy.

RRDTool to graph log-originating data.

I need to relearn rrdtool, again, for this sysadmin time machine project. Today's efforts were spent testing for features I hoped were in RRDTool. So far, my feature needs are met :)

Take something simple, like webserver logs. Let's graph the hits.

Create the RRD:

rrdtool create webhits.rrd --start 1128626000 -s 60 \
   DS:hits:GAUGE:120:0:U RRA:AVERAGE:.5:5:600000 \
   RRA:AVERAGE:.5:30:602938 RRA:AVERAGE:.5:60:301469 \
   RRA:AVERAGE:.5:240:75367 RRA:AVERAGE:.5:1440:12561
My logs start *way* back in November of last year, so I create the rrd with a start date of sometime in November. The step is 60, so it expects data every minute. I then specify one data source, hits, which is a gauge (rate) ranging from 0 to infinity (U). The rest of the command is the RRAs, defining how data is stored. The first one says: take 5 samples, average them, and store at most 600,000 of these averages.

Now that we have the database, we need a "hits-per-minute" data set. I wrote a short perl script, parsehttp, that reads from standard input, calculates hits per minute, and outputs rrdtool update statements. Capture this output and run it through sh:

./parsehttp < access.log | sh -x
Simple enough. This will calculate hits-per-minute for all times in the logs and store it in our RRD.
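If you're curious what the parsehttp side might look like, here's a rough sketch of the same idea (this is not the original script; it assumes the common apache log format and gawk's mktime()):

# bucket apache hits into minutes and emit rrdtool update commands
gawk '
BEGIN {
  split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
  for (i in m) mon[m[i]] = i
}
{
  # $4 looks like [05/Oct/2005:10:15:30
  split(substr($4, 2), d, "[/:]")
  # truncate the timestamp to the minute and count hits in that minute
  ts = mktime(d[3] " " mon[d[2]] " " d[1] " " d[4] " " d[5] " 00")
  hits[ts]++
}
END {
  for (t in hits) print t ":" hits[t]
}
' access.log | sort -n | sed 's/^/rrdtool update webhits.rrd /' | sh -x

The sort -n matters because rrdtool update refuses timestamps that go backwards.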

Now that we have the data, we can graph it. However, since I want to view trends and compare time periods, I'll need to do something fancier than simple graphs.

RRDTool lets you graph multiple data sets on the same graph. So, I want to graph this week's hits and last week's hits. However, since the data sets are on different time intervals, I need to shift last week's set forward by one week. Here's the rrdtool command that graphs it for us, with last week's and this week's data on the same graph, displayed at the same time period:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   LINE1:lastweek#00FF00:"last week" LINE1:hits#FF0000:"this week"
That'll look like line noise if you've never used RRDTool before. I define two data sets with DEF: hits and lastweek. They both read from the 'hits' data set in webhits.rrd. One starts at "-1 week" (one week ago, duh) and the other starts 2 weeks ago and ends last week. I then shift last week's data forward by 7 days (604800 seconds). Lastly, I draw two lines, one for last week's data (green), the other for this week's (red).

That graph looks like this:

That's not really useful, because there are so many data points that the graph becomes almost meaningless. This is due to my poor creation of RRAs. We can fix that by redoing the database, or by using the TREND feature. Change our graph statement to be:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:lastweek#CCFFCC:"last week" LINE1:hits#FFCCCC:"this week" \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
I added only two CDEF statements. They take a data set and "trend" it by one day (86400 seconds), creating a sliding average across time. I store these in new data sets called t_hits and t_lastweek and graph those as well.

The new graph looks like this:

You'll notice the sliding averages are chopped off on the left; that's because there aren't enough data points at those time periods to compute an average. However, including the raw data makes the graph scale as it did before, making it awkward to see the difference in trends. So, let's fix it by not graphing the raw data - just cut out the LINE1:lastweek and LINE1:hits options.

To fix the sliding average cutoff, and to add a title and a vertical label:

rrdtool graph webhits.png -s "-1 week" \
   -t "Web Server Hits - This week vs Last week" \
   -v "hits/minute" \
   DEF:hits=webhits.rrd:hits:AVERAGE:start="-8 days":end="start + 8 days"  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-15 days":end="start + 8 days" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
The graph is still from one week ago until now, but the data sets we use extend beyond those boundaries so that sliding averages can be calculated throughout. The new, final graph looks like this:

Now I can compare this week's hits against last weeks, quickly with a nice visual. This is what I'm looking for.

This would become truly useful if we had lots of time periods (days, weeks, whatever) to look at. Then we could calculate standard deviation, etc. A high outlier could be marked automatically with a label, giving an instant visual cue that something is potentially novel. It might be simple to create a sort-of sliding "standard deviation" curve. I haven't tried that yet.