Pushing the check states from Xymon to Graphite

Motivation

Chances are you've never heard of Xymon (formerly Hobbit), so let me give you some idea. It's actually a pretty decent monitoring system - if you still live in the 90s. :D But hey, let's give credit where it's due: compared to other systems of that time, it was reasonably fast, had an easy-to-understand configuration, and came with many standard checks and a web interface out of the box. That's probably why some people still use it to this day.

In our company it's one of those legacy systems that we need to replace, so as a first step, let's see if we can get some of the data out while we're still using it.

Let's get some data out of it

The idea of this short exercise was to get the state of all checks and feed them to Graphite, where we could do some analysis. Xymon comes with quite a powerful protocol that you can access via the xymon binary. In fact, Xymon itself uses that protocol to receive status reports from all the clients.

What we're interested in, however, is this command (or "message"), which should give you back a summary of all tests (checks) known to the Xymon daemon (which is your central point of metrics collection):

xymondboard

On top of that, we can ask for specific fields only. In our case we're interested in just three values, so let's ask only for those:

xymondboard fields=hostname,testname,color

The first two are pretty self-explanatory, but let's see what this color is. Generally speaking, Xymon expresses the state of a test as a color:

  • green: Okay
  • yellow: Okay so far, but some thresholds were triggered (like CPU load getting high)
  • red: Fail (service not running, CPU 100% loaded for too long, you get the idea)
  • purple: Stale - no data received recently, could indicate an issue
  • blue: Disabled - this test has been disabled (most likely by administrator to silence it)
  • clear: I have no idea what this one means (perhaps no data received so far for a newly defined check?)

Let's add the actual xymon binary and the host to fetch the data from, and we get the final version of the whole command:

xymon <server_hostname> 'xymondboard fields=hostname,testname,color'

If you run the above line manually, you'll get back the hostname, testname and color on standard output, separated by the vertical bar character ("|", or as we know it, the pipe). One test per line, which is definitely handy.
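As a rough sketch, the same step can be done from Python rather than the shell: build the xymon command line, then parse the pipe-separated output. The function names (board_command, fetch_board, parse_board), the assumption that the xymon binary is on the PATH, and the sample data are all made up for illustration. Lines that don't have exactly three non-empty fields are simply skipped.

```python
import subprocess

FIELDS = ("hostname", "testname", "color")


def board_command(server, fields=FIELDS, binary="xymon"):
    """Build the argv for: xymon <server> 'xymondboard fields=...'."""
    return [binary, server, "xymondboard fields=" + ",".join(fields)]


def fetch_board(server):
    """Run the xymon client and return its standard output as text."""
    output = subprocess.check_output(board_command(server))
    return output.decode("utf-8", "replace")


def parse_board(text):
    """Yield (hostname, testname, color) tuples, one per well-formed line."""
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            yield tuple(parts)


if __name__ == "__main__":
    # Made-up sample in the shape described above; no Xymon server needed.
    sample = "host1.example.com|cpu|green\nhost2.example.com|disk|red\n"
    for row in parse_board(sample):
        print(row)
```

Against a real server you would feed parse_board(fetch_board("xymon.example.com")) into whatever comes next.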

Sprinkle it with Python magic

So now we know how to get the data out of Xymon, but how do we get it into Graphite? Well, we'll add a couple of lines of Python:

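 #!/usr/bin/python

 import fileinput
 import re
 import socket
 import time

 # Map Xymon colors to numbers Graphite can plot; -1 means "unknown".
 values = {
         "blue": -1,
         "clear": -1,
         "green": 0,
         "purple": -1,
         "red": 2,
         "yellow": 1,
 }

 # Graphite's plaintext listener on localhost.
 sock = socket.socket()
 sock.connect(('127.0.0.1', 2003))

 # One timestamp for the whole batch.
 ts = int(time.time())

 for line in fileinput.input():
         (hostname, metric, color) = line.split("|")
         # Sanitise the hostname and split it into domain components.
         graph_domain = re.sub('[^a-z0-9.]','_',hostname.lower()).split(".")
         if len(graph_domain) < 2:
                 graph_domain.append("_undefined_")
         # Reverse the components: host1.example.com -> com.example.host1
         graph_path = "{host_path}.{metric}".format(
                 host_path = ".".join(graph_domain[::-1]),
                 metric = re.sub('[^a-z0-9.]','_',metric.lower()))

         # One "<path> <value> <timestamp>" line per check; encode() so that
         # sendall() gets bytes on Python 3 as well as on Python 2.
         sock.sendall(
                 "{path} {value} {timestamp}\n".format(
                                 path = graph_path,
                                 value = values.get(color.strip(), -1),
                                 timestamp = ts
                 ).encode("ascii"))
 sock.close()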

I'm sure you've seen better code: there's no error handling and a major cleanup is due, but for a quick 5-minute hack it should work. Let's have a closer look. First we define a mapping from colors to numeric values usable in Graphite. I went with -1 for unknown statuses, 0 for green, 1 for warning (yellow) and 2 for error (red). Then we open a connection to Graphite.

Now that we're ready to send data, we read one line at a time from stdin and split it to get the values. We also do some processing of the hostname here: we want the hostname "host1.example.com" to appear as "com.example.host1" in Graphite, so that we can group metrics by domain. (Obviously, a different mapping might work better in your case.)
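Pulled out into a function of its own, the hostname handling might look like the sketch below. The name graphite_host_path is made up for this example; the logic mirrors what the script does inline, including the "_undefined_" placeholder for hostnames without a domain.

```python
import re


def graphite_host_path(hostname):
    """Turn 'host1.example.com' into 'com.example.host1'."""
    # Lowercase, then replace anything outside a-z, 0-9 and '.' with '_'.
    parts = re.sub(r"[^a-z0-9.]", "_", hostname.lower()).split(".")
    # Bare hostnames get a placeholder "domain" so every path has two levels.
    if len(parts) < 2:
        parts.append("_undefined_")
    # Most significant component first, so Graphite groups by domain.
    return ".".join(reversed(parts))


print(graphite_host_path("host1.example.com"))  # com.example.host1
print(graphite_host_path("Web-01"))             # _undefined_.web_01
```

Keeping this in one place makes it easy to swap in a different mapping if grouping by domain doesn't suit your setup.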

We also sanitise the hostnames and test names so that the resulting path is acceptable to Graphite. Finally, we send all that to Graphite with a proper timestamp and the state represented as a number.

Now we just need to run it every minute via cron and we're done:
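What goes over the wire is one line of text per check: path, value and timestamp separated by spaces. Here is a minimal sketch of building such a line; graphite_line is a hypothetical helper, and the path and timestamp in the example call are made up.

```python
import time

# Same idea as the mapping in the script: any color not listed becomes -1.
VALUES = {"green": 0, "yellow": 1, "red": 2}


def graphite_line(path, color, timestamp=None):
    """Return one '<path> <value> <timestamp>' line as bytes."""
    if timestamp is None:
        timestamp = int(time.time())
    value = VALUES.get(color.strip(), -1)
    # Encoded to bytes, which is what socket.sendall() expects on Python 3.
    return "{0} {1} {2}\n".format(path, value, timestamp).encode("ascii")


print(graphite_line("com.example.host1.cpu", "yellow", 1700000000))
```

Building the line separately from sending it also makes it easy to dump the lines to stdout first and eyeball them before pointing the script at a real Graphite.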

* * * * * /bin/bash -c "/bin/xymon xymon.example.com 'xymondboard fields=hostname,testname,color' | /bin/grafeed.py >/dev/null"

Perhaps even once every 5 minutes would be OK, considering most of the checks won't have finer granularity, but let's leave some breathing space, shall we? With proper aggregation set up in Graphite, the required storage will be quite small anyway.
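To put a rough number on "quite small", here is a back-of-the-envelope sketch. It assumes the metrics are stored in Whisper files, one file per check. The retention policy (1-minute points for 7 days, 5-minute points for 30 days, hourly points for a year) is a made-up example rather than our actual setup, and the byte sizes are the Whisper layout as I understand it: a 16-byte file header, 12 bytes per archive descriptor and 12 bytes per data point.

```python
# Assumed Whisper layout sizes, in bytes.
METADATA_BYTES = 16
ARCHIVE_INFO_BYTES = 12
POINT_BYTES = 12

# Hypothetical retention policy: (seconds per point, seconds kept).
RETENTION = [
    (60, 7 * 86400),
    (300, 30 * 86400),
    (3600, 365 * 86400),
]


def whisper_file_bytes(retention):
    """Estimate the size of one Whisper file for the given retention."""
    points = sum(kept // step for step, kept in retention)
    headers = METADATA_BYTES + len(retention) * ARCHIVE_INFO_BYTES
    return headers + points * POINT_BYTES


per_check = whisper_file_bytes(RETENTION)
print("{0} bytes per check, about {1:.0f} MiB for 1000 checks".format(
    per_check, per_check * 1000 / (1024.0 * 1024.0)))
```

With these assumptions each check costs a bit over 300 KiB no matter how long it runs, since the files are fixed-size.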

Final words

This is why I love Python. The batteries-included philosophy makes it dead simple to write a quick integration script in minutes. On top of that, there are no external dependencies - if you work on legacy systems, you might sometimes find yourself unable to install any, for many reasons (an outdated OS, perhaps with limited or no network connection, you get the idea). That's where even an older version of Python might come in extremely handy, with all its modules included.

As a proof of concept, we now collect check statuses in a convenient format that's easy to browse, and it only took us a couple of minutes.

But still, don't use Xymon. Seriously.