Category Archives: Monitoring

Graphing performance metrics

I’ve spent a couple of days researching options for graphing performance metrics. We’re trying to get all our metrics from all our services into a single interface. Here are the pieces I found that seemed to fit best.

  • dygraphs – Javascript timeseries graphing library, looks excellent
  • InfluxDB + Grafana – Appears to be the best in class for metric storage + display
  • Graphite (+ Carbon) – Looks to be more complex to set up and a little less powerful than InfluxDB
  • StatsD – Pre-processes high-volume metrics before feeding them to InfluxDB / Carbon
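
To make the StatsD item concrete, here is a minimal sketch of how an application might emit metrics to it. It assumes a StatsD daemon listening on the conventional UDP port 8125 on localhost; the metric names and the helper functions statsd_packet and send_metric are made up for illustration and are not part of any library.

```python
import socket


def statsd_packet(name: str, value: float, metric_type: str) -> bytes:
    """Format one metric in the plain-text StatsD line protocol: name:value|type."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(packet: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send a packet over UDP without waiting for a reply (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


# Hypothetical usage: one timer sample in milliseconds and one counter increment.
#   send_metric(statsd_packet("web.page_load_ms", 320, "ms"))
#   send_metric(statsd_packet("web.requests", 1, "c"))
```

Because the transport is UDP, a missing or overloaded StatsD daemon should not slow the application down; the trade-off is that dropped packets go unnoticed.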

The next step is to play with InfluxDB + Grafana, using either their hosted service or our own install, and see what we can push into it. More to follow…

First useful boomerang graph

Today we’ve produced our first useful graphs from the 770k boomerang data points we have collected. This is one of the graphs we produced, and I’ll post it here only because it’s the first one I personally produced. At last, after 4 months we’re actually seeing data.

What does it tell us? That page load time is not very uniform. The next step is a linear regression comparing page load time against each user’s available bandwidth.
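
As a rough sketch of what that regression step could look like, the function below fits a straight line by ordinary least squares using only the standard library. The sample numbers are invented purely for illustration (they are not from the boomerang data), and the variable names and units are assumptions.

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented samples: bandwidth in kbit/s against page load time in ms.
bandwidth_kbps = [500, 1000, 2000, 4000, 8000]
load_time_ms = [5200, 4100, 3300, 2900, 2600]
slope, intercept = linear_fit(bandwidth_kbps, load_time_ms)
```

A negative slope would suggest that load time falls as bandwidth rises. With real data the relationship may well be non-linear, so a straight-line fit is only a first look.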

Monit

We had our first outage since moving to Rackspace on 27 December. I came online to emails saying the site was down. I freaked out. An outage within the first few days on my watch. Crikey.

Looking into the issue, memory usage started spiking around 4:40am. By 12:20pm the server had become unresponsive. All available memory and swap space had been filled. It took almost 8 hours for the server to crash. I should have been warned of the problem in that window and fixed it before it ever happened. I set out to sort that, and monit appears to be the best tool for the job.

Issues

I hit a couple of issues. The first one had me stumped for quite a while. On Ubuntu, mysql does not create a pid file by default. This led monit to think it wasn’t running, try to start it, fail, and then freak out. The solution turned out to be simple: add “pid-file = /var/run/mysqld/mysqld.pid” to the [mysqld] section of my.cnf, then restart mysql.
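
To see why a missing pid file looks like a dead service, here is a rough approximation of a pid-file liveness check. It is not monit’s actual implementation, just a sketch of the idea for POSIX systems, and the function name pidfile_status is made up.

```python
import os


def pidfile_status(path: str) -> str:
    """Return 'missing', 'invalid', 'stale' or 'running' for a pid file."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return "missing"  # what a pid-file check sees when mysql never writes one
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence, nothing is delivered
    except ProcessLookupError:
        return "stale"  # the file exists but the process is gone
    except PermissionError:
        return "running"  # the process exists but belongs to another user
    return "running"
```

Running something like pidfile_status("/var/run/mysqld/mysqld.pid") before and after the my.cnf change is a quick way to confirm the file now appears where the monit config expects it.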

Second, I copied the request for /index.html from one of the existing configs onto a WordPress domain that has no /index.html file. The request returned a 404, so monit thought apache was down. Make sure your apache monit config references a URL that exists!
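
One way to avoid this trap is to test the URL yourself before putting it in the monit config. The sketch below uses Python’s standard urllib. The function name url_is_healthy is made up, and it treats any 4xx or 5xx response as a failure, which is an assumption about what the check should reject rather than a description of monit’s behaviour.

```python
import urllib.error
import urllib.request


def url_is_healthy(url: str, timeout: float = 10.0) -> bool:
    """True if the URL answers with a non-error status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except urllib.error.HTTPError:
        return False  # 4xx / 5xx, for example the 404 for a missing /index.html
    except (urllib.error.URLError, OSError):
        return False  # connection refused, DNS failure or timeout
```

If the call returns False for the path you planned to use, pick a different page before monit starts restarting apache over it.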

Otherwise, monit was a breeze to set up: sudo apt-get install monit, edit /etc/default/monit, create the config files, and then sudo service monit start. I’d recommend keeping an eye on the web interface for the first few minutes. I went away and came back to find monit had killed and restarted apache and mysql a few times because of issues in my config.

Config files

I collated a few resources to build our monit config. I had a real issue figuring out multiple mailservers, but I got there in the end. Here’s a summary of our monit config files. I used the format /etc/monit/conf.d/service.mon as was the default on Ubuntu.

###############################################################################
## Monit control file
###############################################################################
##
## Comments begin with a '#' and extend through the end of the line. Keywords
## are case insensitive. All paths MUST BE FULLY QUALIFIED, starting with '/'.
##
## Below you will find examples of some frequently used statements. For
## information about the control file, a complete list of statements and
## options please have a look in the monit manual.
##
##
###############################################################################
## Global section
###############################################################################
##
## Start monit in the background (run as a daemon):
#
set daemon 60
# set daemon 120 # check services at 2-minute intervals
# with start delay 240 # optional: delay the first check by 4-minutes
# # (by default check immediately after monit start)
#
#
## Set syslog logging with the 'daemon' facility. If the FACILITY option is
## omitted, monit will use 'user' facility by default. If you want to log to
## a stand alone log file instead, specify the path to a log file
#
set logfile /var/log/monit.log
# set logfile syslog facility log_daemon
#
#
### Set the location of monit id file which saves the unique id specific for
### given monit. The id is generated and stored on first monit start.
### By default the file is placed in $HOME/.monit.id.
#
# set idfile /var/.monit.id
#
### Set the location of monit state file which saves the monitoring state
### on each cycle. By default the file is placed in $HOME/.monit.state. If
### state file is stored on persistent filesystem, monit will recover the
### monitoring state across reboots. If it is on temporary filesystem, the
### state will be lost on reboot.
#
# set statefile /var/.monit.state
#
## Set the list of mail servers for alert delivery. Multiple servers may be
## specified using comma separator. By default monit uses port 25 - this
## is possible to override with the PORT option.
set mailserver
smtp.sendgrid.net
port 587
username "%%%USERNAME%%%"
password "%%%PASSWORD%%%"
using tlsv1
,
smtp.gmail.com
port 587
username "%%%USERNAME%%%"
password "%%%PASSWORD%%%"
using tlsv1
# The timeout and hostname are after all mailserver definitions
# with timeout 30 seconds
hostname "%%%SERVER.FQDN.COM%%%"
#
# set mailserver mail.bar.baz, # primary mailserver
# backup.bar.baz port 10025, # backup mailserver on port 10025
# localhost # fallback relay
#
#
## By default monit will drop alert events if no mail servers are available.
## If you want to keep the alerts for a later delivery retry, you can use the
## EVENTQUEUE statement. The base directory where undelivered alerts will be
## stored is specified by the BASEDIR option. You can limit the maximal queue
## size using the SLOTS option (if omitted, the queue is limited by space
## available in the back end filesystem).
#
set eventqueue
basedir /var/monit # set the base directory where events will be stored
slots 100 # optionally limit the queue size
#
#
## Send status and events to M/Monit (Monit central management: for more
## information about M/Monit see http://www.tildeslash.com/mmonit).
#
# set mmonit http://monit:monit@192.168.1.10:8080/collector
#
#
## Monit by default uses the following alert mail format:
##
## --8<--
## From: monit@$HOST # sender
## Subject: monit alert -- $EVENT $SERVICE # subject
##
## $EVENT Service $SERVICE #
## #
## Date: $DATE #
## Action: $ACTION #
## Host: $HOST # body
## Description: $DESCRIPTION #
## #
## Your faithful employee, #
## monit #
## --8<--
##
## You can override this message format or parts of it, such as subject
## or sender using the MAIL-FORMAT statement. Macros such as $DATE, etc.
## are expanded at runtime. For example, to override the sender:
#
set mail-format { from: %%%MONIT@FQDN.COM%%% }
# set mail-format { from: monit@foo.bar }
#
#
## You can set alert recipients here who will receive alerts if/when a
## service defined in this file has errors. Alerts may be restricted on
## events by using a filter as in the second example below.
#
set alert %%%USER-EMAIL@FQDN.com%%%
# set alert sysadm@foo.bar # receive all alerts
# set alert manager@foo.bar only on { timeout } # receive just service-
# # timeout alert
#
#
## Monit has an embedded web server which can be used to view status of
## services monitored, the current configuration, actual services parameters
## and manage services from a web interface.
#
set httpd port 2812 and
use address localhost
allow %%%USER%%%:%%%PASSWORD%%%
# set httpd port 2812 and
# use address localhost # only accept connection from localhost
# allow localhost # allow localhost to connect to the server and
# allow admin:monit # require user 'admin' with password 'monit'
# allow @monit # allow users of group 'monit' to connect (rw)
# allow @users readonly # allow users of group 'users' to connect readonly
#
#
###############################################################################
## Includes
###############################################################################
##
## It is possible to include additional configuration parts from other files or
## directories.
#
include /etc/monit/conf.d/*.mon

/etc/monit/conf.d/apache2.mon:

# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
# RHEL httpd, Ubuntu apache2
check process apache2 with pidfile /var/run/apache2.pid

# New style
start program = "/usr/sbin/service apache2 start" with timeout 90 seconds
stop program  = "/usr/sbin/service apache2 stop"

# Old style
#    start program = "/etc/init.d/apache2 start" with timeout 90 seconds
#    stop program  = "/etc/init.d/apache2 stop"

# If Apache is using > 80% of the cpu for 5 checks, restart it
if cpu > 80% for 5 cycles then restart

# Could be used to control apache's spawning of threads
#    if children > 50 then alert
#    if children > 60 then restart

# Check if apache is responding on port 80
if failed host %%%PUBLIC_IP%%% port 80 protocol http
and request "/" # Some smallish page that should be available when server is up
# This page has to exist or the check will fail. Avoid index.html on WordPress for example.
with timeout 10 seconds

# Sometimes Apache doesn't respond right away, so give it two chances
# before forcing a restart.
for 2 cycles
then restart

# Apache requires mysql to be running
# Disable this on web-only nodes.
depends on mysql

# If apache is restarting all the time, timeout.
# A timeout stops monitoring the service and sends an alert.
if 3 restarts within 8 cycles then timeout

/etc/monit/conf.d/crond.mon:

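# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
# RHEL names the service crond; on Debian / Ubuntu it is cron, so adjust the
# start / stop commands below to match your distribution.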
check process crond with pidfile "/var/run/crond.pid"

# New style
start program = "/usr/sbin/service crond start"
stop program = "/usr/sbin/service crond stop"
# Old style
#    start program = "/etc/init.d/crond start"
#    stop program  = "/etc/init.d/crond stop"

# If crond is found to be stopped, it will be started automatically
if 5 restarts within 5 cycles then timeout

/etc/monit/conf.d/disk-usage.mon:

# CHECK FILESYSTEM <unique name> PATH <path>
# Use the block device, not the mounted path. If the filesystem is not mounted,
# the mountpoint still exists, so checks may pass.

check filesystem rootfs with path /dev/xvda1

if space usage > 80% for 5 cycles then alert
if inode usage > 80% for 5 cycles then alert

# It might make sense to stop other services upon certain conditions here.
# For example:
#    if space usage > 95% then exec "/etc/init.d/apache2 stop ; /etc/init.d/mysql stop"
#    if inode usage > 95% then exec "/etc/init.d/apache2 stop ; /etc/init.d/mysql stop"

/etc/monit/conf.d/mysql.mon:

# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
# MySQL does not create a pid file by default on Debian / Ubuntu; it requires
# the addition of "pid-file = /var/run/mysqld/mysqld.pid" to my.cnf
check process mysql with pidfile /var/run/mysqld/mysqld.pid

# New style
start program = "/usr/sbin/service mysql start"
stop program  = "/usr/sbin/service mysql stop"

# Old style
#    start program = "/etc/init.d/mysql start"
#    stop program  = "/etc/init.d/mysql stop"

# If mysql is using too much cpu, restart it
if cpu > 80% for 5 cycles then restart

# If mysql is consuming too much memory, restart it
# This needs to be adjusted for the server.
if totalmem > 64.0 MB for 5 cycles then restart

# Check mysql responds
if failed unixsocket /var/run/mysqld/mysqld.sock protocol mysql
# If you use the network instead of a UNIX socket, adjust settings
with timeout 15 seconds
then restart

# If we're constantly restarting, timeout
# A timeout stops monitoring the service and sends an alert.
if 3 restarts within 5 cycles then timeout

/etc/monit/conf.d/ssh.mon:

# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
check process sshd with pidfile /var/run/sshd.pid

# New style
start program = "/usr/sbin/service ssh start"
stop program = "/usr/sbin/service ssh stop"
# Old style
#    start program = "/etc/init.d/ssh start"
#    stop program  = "/etc/init.d/ssh stop"

# If ssh is using 80% of cpu, something has gone wrong, restart it
if cpu > 80% for 5 cycles then restart

# If ssh is using > 200MB of memory, we have serious problems
if totalmem > 200.0 MB for 5 cycles then restart

# Check that ssh is responsive on port 22
if failed host %%%PUBLIC_IP%%% port 22 protocol ssh 2 times within 2 cycles
then restart

# If we're continually restarting, timeout
# A timeout stops monitoring the service and sends an alert.
if 3 restarts within 8 cycles then timeout

/etc/monit/conf.d/system.mon:

# As I understand it, according to Rackspace, CPU is allocated per server
# according to the size whereby a 1G server = loadavg 1, 0.5G = 0.5 load, etc.
# If the cpu is available, it can be utilised over that, but it will cause
# issues in the long term.
check system localhost

# This is a 512MB slice so sustained load above 0.5 will be problematic
if loadavg (1min) > 6 then alert
if loadavg (5min) > 4 then alert
if loadavg (15min) > 0.5 then alert

# Alert if memory usage hits 80% or higher
if memory usage > 80% then alert

# Don't fully understand these numbers, but they seem sensible
if cpu usage (user) > 70% for 2 cycles then alert
if cpu usage (system) > 50% for 2 cycles then alert
if cpu usage (wait) > 50% for 2 cycles then alert

# If the machine is under enormous load, reboot
if loadavg (1min) > 20 for 3 cycles then exec "/sbin/shutdown -r now"
if loadavg (5min) > 15 for 5 cycles then exec "/sbin/shutdown -r now"

# If memory usage is sustained above 97%, something is wrong, reboot
if memory usage > 97% for 3 cycles then exec "/sbin/shutdown -r now"

I’ve removed any sensitive values and replaced them with %%% markers.
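
Many of the rules above only act after a condition holds “for 5 cycles”. The sketch below shows one way to reason about that, assuming monit requires the test to fail on that many consecutive polls and that a cycle is the 60 second interval from “set daemon 60”. The function name and the sample readings are made up for illustration.

```python
CYCLE_SECONDS = 60  # mirrors "set daemon 60" in the control file


def first_trigger_cycle(samples, threshold, cycles):
    """Index of the first poll at which the action would fire, or None."""
    streak = 0
    for index, value in enumerate(samples):
        # A reading at or below the threshold resets the run of failures.
        streak = streak + 1 if value > threshold else 0
        if streak >= cycles:
            return index
    return None


# Invented CPU readings (percent), one per cycle, with a dip at the third poll.
cpu_percent = [90, 85, 70, 95, 91, 88, 99, 97]
fired_at = first_trigger_cycle(cpu_percent, threshold=80, cycles=5)
seconds_until_action = None if fired_at is None else (fired_at + 1) * CYCLE_SECONDS
```

Under these assumptions a short spike never triggers a restart, and a sustained problem is acted on roughly five minutes after it starts.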

References / resources / thanks

http://wiki.mediatemple.net/w/%28dv%29_4.0_-_Making_It_Better_::_Installing_Monit
http://forum.linode.com/viewtopic.php?t=6942
http://mmonit.com/monit/documentation/monit.html
http://rushthinking.com/using-monit/
http://serverfault.com/questions/337742/
http://1000umbrellas.com/?p=1109
http://mightyvites.com/blog/?p=1051
http://www.pythian.com/news/3794/mysql-on-debian-and-ubuntu/