Useful Observium alert checker examples

This page is outdated as of Feb 26 2015, as state sensors have been rewritten. I keep an updated document on:
https://github.com/mgmoerman/docs/blob/master/observium-alert-checkers.md

Observium straight out of the SVN repository (if you bought the subscription) doesn’t come with alert checkers, which is unfortunate, as you need to figure out how the alerting system works by trial and error. The goal of this blog post is to give some examples of generic alert checkers, and to provide some more explanation of Metrics & Attributes and some of the values that go with them. This document is of course not complete, and can always be improved. Please give me feedback to improve it.

Observium has a very powerful way of using entity types & check conditions to do alerting. But you do need to know how this is implemented.

There is some documentation on the Observium site itself, which is useful to read:

Creating an alert checker

Let’s go through the steps that are involved to actually create/add an alert checker in Observium

Entity type

First of all, when you create an alert, you’ll need to pick the entity type you are building the alert for. An entity type is nothing more than a “thing” for which you would like to see alerts.

These are the ones that are available as of 12/12/2014:

  • Device
  • Memory
  • Storage
  • Processor
  • BGP Peer
  • Netscaler vServer
  • Netscaler Service
  • Toner
  • Port
  • Sensor

They pretty much speak for themselves: if you want alerts on things that happen on ports, pick Port; if you want something that has to do with a sensor, pick Sensor. Device is a very generic one, and only gives you status information: whether the device is up or down, its uptime, and the response time for ping/SNMP. The entity type Device has nothing to do with the ports or sensors on the device itself; for alerting on those, pick Port or Sensor.

Alert Checker details

Once you have picked the entity type, there are a couple more things that need to be filled in, but these are simple: pick a name for the alert, and pick a message you want to be included when an alert is sent out.

Use Alert Delay to set the number of poller runs that the condition of your alert checker should persist before it actually starts alerting. This is useful when, for example, you’re creating a check for processor usage but don’t want to be alerted on every CPU spike. If you set a delay of, say, 2, it’ll take 2 poller runs before it actually alerts (provided the condition you are checking for hasn’t changed, of course).
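To make the delay behaviour concrete, here is a minimal Python sketch of the idea. It is not Observium's code: the `DelayedAlert` class and its field names are made up for illustration, and it assumes that a delay of 2 means the alert goes out on the second consecutive poller run in which the condition matched. The exact off-by-one behaviour in Observium is not verified here.

```python
from dataclasses import dataclass


@dataclass
class DelayedAlert:
    """Hypothetical model of Alert Delay; not Observium's actual code."""
    delay: int             # poller runs the condition must persist
    failing_runs: int = 0  # consecutive runs in which the condition matched

    def poll(self, condition_matched: bool) -> bool:
        """Record one poller run; return True if an alert should be sent."""
        if condition_matched:
            self.failing_runs += 1
        else:
            self.failing_runs = 0  # condition changed, start counting again
        return condition_matched and self.failing_runs >= self.delay


cpu_alert = DelayedAlert(delay=2)
# One CPU spike, a normal run, then two busy runs in a row.
results = [cpu_alert.poll(matched) for matched in (True, False, True, True)]
print(results)
```

The single spike in the first run does not alert; only the second of the two consecutive busy runs does.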

The Send Recovery button is self-explanatory, and the Severity is currently not in use.

Checker Conditions

Then we come to the Checker Conditions. This is where you actually implement the check for a specific entity.

It’s important to know what Metrics & Attributes are, see the overview below for a complete list of Metrics & Attributes

When filling in the fields for Checker Conditions, you use the Metrics mentioned in this page.

These need to be single-line entries. You can put in as many as you want, but you usually have one, to check for a single condition, or two, for example to check an upper and a lower limit. Use the boolean to choose whether ANY or ALL of these conditions need to match.

A single line consists of three values:

  • the actual metric
  • a “test” (le, ge, lt, gt, ne, match and notmatch)
  • a value
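As an illustration of how such lines can be read, here is a small Python sketch that splits a line into metric, test and value and combines the results with ANY or ALL. This is not how Observium implements it internally. The function names are hypothetical, the longer test names (equals, notequals, greater, less) are the ones used in the example table further down, and treating match/notmatch as case-insensitive wildcard matching is an assumption.

```python
import fnmatch
import operator

# Numeric tests named in the text, plus the longer spellings from the examples.
NUMERIC_TESTS = {
    "lt": operator.lt, "less": operator.lt,
    "le": operator.le,
    "gt": operator.gt, "greater": operator.gt,
    "ge": operator.ge,
}


def check_line(line: str, metrics: dict) -> bool:
    """Evaluate one 'metric test value' line against polled metrics."""
    metric, test, value = line.split(None, 2)
    actual = metrics[metric]
    if test in NUMERIC_TESTS:
        return NUMERIC_TESTS[test](float(actual), float(value))
    if test == "equals":
        return str(actual) == value
    if test in ("ne", "notequals"):
        return str(actual) != value
    # Assumption: match/notmatch use case-insensitive wildcards (* and ?).
    if test == "match":
        return fnmatch.fnmatchcase(str(actual).lower(), value.lower())
    if test == "notmatch":
        return not fnmatch.fnmatchcase(str(actual).lower(), value.lower())
    raise ValueError(f"unknown test: {test}")


def check_conditions(lines: list, metrics: dict, boolean: str = "ANY") -> bool:
    """Combine all condition lines with the ANY/ALL boolean."""
    results = [check_line(line, metrics) for line in lines]
    return any(results) if boolean == "ANY" else all(results)


# An upper and a lower limit on the same metric.
limits = ["sensor_value gt 50", "sensor_value lt 10"]
print(check_conditions(limits, {"sensor_value": 55}, "ANY"))
print(check_conditions(limits, {"sensor_value": 55}, "ALL"))
```

With ANY, one matching line is enough to trigger; with ALL, every line has to match, which can never happen for an upper and a lower limit on the same metric.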

Associations

In these input fields you create the first association rule, in other words, which subset of the entity type you selected needs alerting based on the conditions specified in the previous pane. When initially creating an alert checker, you can add only 1 association rule. Once the checker is added, you can add more association rules to it later on.

These association rules are made from a “device association” and an “entity association”. In the first input field you do your device matching, based on the attributes for devices. In the second input field you do your entity matching, using the attributes for the entity type you want to associate it with (these can of course be different from the metrics you’re checking in the conditions).

This works in sort of the same way as the Checker Conditions. It uses the same line method (metric,test,value), however with some exceptions:

  • instead of using metrics, you’ll be using attributes
  • you can’t use a device attribute twice in the same association rule, so for example multiple “hostname match bla” statements within the same association rule won’t work
  • for a single device association line, you can have multiple entity association lines

That last exception allows for more specific filtering. For example, you may want to match all sensors whose class (sensor_class) is “state”, but when that nets you too many results, you can add a match on the description (sensor_descr). Or you may want to match all ports of type (ifType) ethernetCsmacd, but only the ones with a specific description (ifAlias).
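The port example above can be sketched in Python as follows. This is only a model of the idea, not Observium's implementation: the `rule` layout and function names are hypothetical, the hostname and interface descriptions are invented, and it assumes that all lines within one rule have to match and that `match` is a case-insensitive wildcard match.

```python
import fnmatch


def attr_line_matches(line: str, attributes: dict) -> bool:
    """Check one 'attribute test value' line; a bare '*' matches anything."""
    if line.strip() == "*":
        return True
    attribute, test, value = line.split(None, 2)
    actual = str(attributes.get(attribute, ""))
    if test in ("equals", "notequals"):
        hit = actual == value
    elif test in ("match", "notmatch"):
        # Assumption: wildcard (* and ?) matching, ignoring case.
        hit = fnmatch.fnmatchcase(actual.lower(), value.lower())
    else:
        raise ValueError(f"unsupported test: {test}")
    return hit != test.startswith("not")  # invert for the "not..." tests


def rule_matches(rule: dict, device: dict, entity: dict) -> bool:
    """A rule applies when its device lines and its entity lines all match."""
    return (all(attr_line_matches(l, device) for l in rule["device"])
            and all(attr_line_matches(l, entity) for l in rule["entity"]))


# One device association line, two entity association lines.
rule = {
    "device": ["*"],
    "entity": ["ifType equals ethernetCsmacd", "ifAlias match *uplink*"],
}
device = {"hostname": "sw1.example.net"}
ports = [
    {"ifType": "ethernetCsmacd", "ifAlias": "Uplink to core"},
    {"ifType": "ethernetCsmacd", "ifAlias": "Office desk 12"},
    {"ifType": "softwareLoopback", "ifAlias": "uplink test"},
]
selected = [p["ifAlias"] for p in ports if rule_matches(rule, device, p)]
print(selected)
```

Only the first port is selected: the second fails the ifAlias line and the third fails the ifType line.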

Example alerts

If you scrolled down here to just copy/paste some alert checkers, perfectly fine, but don’t complain if they don’t work. PLEASE read how these work above.

The following is a set of very useful alert checkers:

| Alert | Entity type | Check Conditions | Check Conditions boolean | Device match | Entity match |
|---|---|---|---|---|---|
| Device down | Device | device_status equals 0 | ANY | * | * |
| Processor usage is above 80% | Processor | processor_usage greater 80 | ALL | * | processor_descr match processor |
| Memory usage is above 70% | Memory | mempool_perc greater 70 | ALL | * | * |
| State sensor is in ALERT state! | Sensor | sensor_event equals alert | ANY | * | sensor_class equals state |
| Fanspeed is above or under threshold | Sensor | sensor_value greater @sensor_limit; sensor_value less @sensor_limit_low | ANY | * | sensor_class equals fanspeed |
| Temperature is higher than 50 degrees | Sensor | sensor_value gt 50 | ANY | * | sensor_class equals temperature |
| Traffic exceeds 85% | Port | ifInOctets_perc ge 85; ifOutOctets_perc ge 85 | ANY | * | ifType equals ethernetCsmacd |
| BGP Session down | BGP Peer | bgpPeerState notequals established | ANY | * | bgpPeerRemoteAs equals 41552 |
| Storage exceeds 85% of disk capacity | Storage | storage_perc ge 85 | ANY | * | storage_type equals hrStorageFixedDisk |
| Port has encountered errors or discards | Port | ifInErrors_rate gt 1; ifOutErrors_rate gt 1 | ANY | * | ifType equals ethernetCsmacd |
| Port is enabled, but operationally down | Port | ifAdminStatus equals up; ifOperStatus notequals up | ALL | * | ifType equals ethernetCsmacd |

Where a row lists two check conditions (separated by a semicolon here), each one goes on its own line in the Checker Conditions field.
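The fanspeed checker compares sensor_value against @sensor_limit and @sensor_limit_low instead of fixed numbers. The Python sketch below shows one way to read that, assuming the “@” prefix means “use the value of that field on the same sensor”. That reading is inferred from the example and is not taken from Observium's code; the helper names and the sample fan data are made up.

```python
def resolve(value: str, entity: dict) -> float:
    """Turn a condition value into a number, following '@field' references."""
    if value.startswith("@"):
        return float(entity[value[1:]])  # look the limit up on the entity itself
    return float(value)


def fan_out_of_range(entity: dict) -> bool:
    """Mirror the two fanspeed conditions, combined with ANY."""
    value = float(entity["sensor_value"])
    too_high = value > resolve("@sensor_limit", entity)
    too_low = value < resolve("@sensor_limit_low", entity)
    return too_high or too_low


# Hypothetical sensors, each carrying its own limits.
fans = [
    {"sensor_descr": "Fan 1", "sensor_value": 4200,
     "sensor_limit": 6000, "sensor_limit_low": 1000},
    {"sensor_descr": "Fan 2", "sensor_value": 600,
     "sensor_limit": 6000, "sensor_limit_low": 1000},
]
alerting = [fan["sensor_descr"] for fan in fans if fan_out_of_range(fan)]
print(alerting)
```

The advantage of referencing the limits this way is that one checker covers every fan, even when each sensor has different thresholds.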

Per-entity overview of Attributes, Metrics and their values (if any)

Device

| Metrics | Values |
|---|---|
| device_status | 0 = down, 1 = up |
| device_status_type | reason for down, 'snmp' / 'ping' |
| device_ping | response in ms |
| device_snmp | response in ms |
| device_uptime | in seconds |
| device_duration_poll | in seconds |

| Attributes | Values |
|---|---|
| hostname | Self explanatory, this is the hostname for the device |
| os | cisco, asa, junos, linux, printer, generic, etc. For an up-to-date list see /opt/observium/includes/definitions/os.inc.php |
| type | network, server, workstation, storage, voip, firewall |
| sysName | Derived through SNMP |
| sysDescr | Derived through SNMP |
| sysContact | Derived through SNMP |
| hardware | Derived through SNMP |
| serial | Derived through SNMP |

Port

| Metrics | Values |
|---|---|
| ifInOctets_rate & ifOutOctets_rate | number |
| ifInOctets_perc & ifOutOctets_perc | 0-100 percentage |
| ifInUcastPkts_rate & ifOutUcastPkts_rate | number |
| ifInErrors_rate & ifOutErrors_rate | number |
| rx_ave_pktsize & tx_ave_pktsize | |
| ifOperStatus | up/down |
| ifAdminStatus | up/down |
| ifSpeed | interface speed derived through SNMP in mbit |
| ifMtu | number |
| ifDuplex | full/half |

| Attributes | Values |
|---|---|
| ifSpeed | interface speed in a mbit number |
| ifAlias | the interface description |
| ifDescr | Location of the interface (blade, slot, etc.) |
| ifName | |
| ifType | name of interface as described by IANA, see https://www.iana.org/assignments/ianaiftype-mib/ianaiftype-mib |
| ifPhyAddress | MAC address of the interface |
| port_descr_type | |
| port_descr_descr | |
| port_descr_speed | |
| port_descr_circuit | |
| port_descr_notes | |

Memory

| Metrics | Values |
|---|---|
| mempool_free | |
| mempool_perc | 0-100 percentage |
| mempool_used | |

| Attributes | Values |
|---|---|
| mempool_descr | |
| mempool_mib | |
| mempool_index | |

Processor

| Metrics | Values |
|---|---|
| processor_usage | 0-100 percentage |

| Attributes | Values |
|---|---|
| processor_descr | |
| processor_type | |
| processor_oid | |

Storage

| Metrics | Values |
|---|---|
| storage_free | |
| storage_perc | 0-100 percentage |
| storage_used | |

| Attributes | Values |
|---|---|
| storage_descr | |
| storage_type | |
| storage_mib | |
| storage_index | |

BGP Peer

| Metrics | Values |
|---|---|
| bgpPeerState | established |
| bgpPeerAdminStatus | |
| bgpPeerFsmEstablishedTime | |

| Attributes | Values |
|---|---|
| as_text | |
| bgpPeerRemoteAs | |
| bgpPeerRemoteAddr | |
| bgpPeerLocalAddr | |
| bgpPeerIdentifier | |

Sensor

| Metrics | Values |
|---|---|
| sensor_value | number |
| sensor_event | up, warning, alert, down |

| Attributes | Values |
|---|---|
| sensor_descr | |
| sensor_class | voltage, current, power, frequency, humidity, fanspeed, temperature, dbm, state |
| sensor_type | |
| sensor_index | |
| poller_type | possible types: snmp, agent, ipmi |

Toner

| Metrics | Values |
|---|---|
| toner_current | |

| Attributes | Values |
|---|---|
| toner_descr | |

Netscaler vServers

| Metrics | Values |
|---|---|
| vsvr_state | |
| vsvr_bps_in | |
| vsvr_bps_out | |

| Attributes | Values |
|---|---|
| vsvr_name | this matches vsvr_fullname, except when that is longer than 32 characters; then it becomes a random string |
| vsvr_fullname | |
| vsvr_label | |
| vsvr_ip | |
| vsvr_ipv6 | |
| vsvr_port | |
| vsvr_type | |
| vsvr_entitytype | |

Netscaler Services

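| Metrics | Values |
|---|---|
| svc_state | |
| svc_bps_in | |
| svc_bps_out | |

| Attributes | Values |
|---|---|
| svc_name | this matches svc_fullname, except when that is longer than 32 characters; then it becomes a random string |
| svc_fullname | |
| svc_label | |
| svc_ip | |
| svc_port | |
| svc_type | |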