More detailed CloudFlare analysis

Following my last post about CloudFlare, I ran some further benchmarks in response to the feedback from their team. Here’s the summary:

  • CloudFlare only combines our JavaScript, not our CSS files, despite what it said on the tin (the site has since been updated; when we signed up it said JS & CSS)
  • This only happens for some user agents on some operating systems, and CloudFlare will not give me a list of which user agents (see the probe sketch below).
  • On browsers where this is enabled, we see a marked improvement; where it's not, we see no gain or a small loss.
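Since CloudFlare won't publish the list, a crude way to probe which user agents get RocketLoader is to fetch a page with different User-Agent strings and look for rewritten script tags. A minimal sketch; www.example.com is a placeholder, and the "rocketscript" marker is an assumption based on how RocketLoader rewrote script types at the time:

for ua in "Mozilla/5.0 (Windows NT 6.1) Firefox/8.0" "Mozilla/4.0 (compatible; MSIE 6.0)"; do
    echo "== $ua =="
    # count occurrences of the rewritten script type in the returned HTML
    curl -s -A "$ua" http://www.example.com/ | grep -c rocketscript
done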

Graphs

I ran two sets of tests (using only browsers where RocketLoader is enabled).

[Graph showing marked improvement with CloudFlare on]

[Second graph showing marked improvement with CloudFlare on]

Conclusion

We probably won’t implement CloudFlare across all our sites. I might still experiment on one of our higher-traffic sites now that we’re running boomerang and gathering real user data to compare. However, the black-box nature of CloudFlare fundamentally leaves me feeling uneasy.

The product appears to be in beta, which wasn’t clear when we signed up. I thought it was a polished product ready for production use. But much of the support chat is about RocketLoader being in beta, there being no list of user agents at this time, and so on.

Bottom line, CloudFlare hasn’t done what I expected. I’ll test mod_pagespeed and we’ll probably go with that, pending any major roadblocks.

Benchmarking Rackspace dedicated vs cloud

People keep telling me that Magento performs better on dedicated hardware. I haven’t been able to find any numbers to support this, but I’ve heard it so often it’s either a very popular myth, or it’s true.

Now that our Rackspace dedicated box is online, I’m running some benchmarks to put some numbers against the comparison. I wanted to test different operating systems and server sizes, so I booted 10 servers: one each of Ubuntu 10.04 LTS and RHEL 5.5 at 0.5, 1, 2, 4, and 8 GB.

In order to automate the setup, I ran the following command on the Ubuntu boxes:

mkdir .ssh && chmod 700 .ssh && \
echo ssh-rsa <<snip>> > .ssh/authorized_keys && chmod 600 .ssh/authorized_keys && \
locale-gen en_GB.UTF-8 && update-locale LANG=en_GB.UTF-8 && \
apt-get update && apt-get --yes dist-upgrade && \
apt-get --yes install sysbench screen && \
reboot && exit

Then on the RHEL boxes, something vaguely similar:

mkdir .ssh && chmod 700 .ssh && \
echo ssh-rsa <<snip>> > .ssh/authorized_keys && chmod 600 .ssh/authorized_keys && \
yum -y update && \
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm && \
yum -y install screen sysbench && \
reboot && exit

To automate the actual testing, I created a script bench.sh and uploaded it to each of the servers. It’s a simple nested for loop to run each test 3 times.

#!/bin/bash
# Run the sysbench cpu, memory, and random read/write fileio tests at a range
# of thread counts, 3 runs each, logging every run to a timestamped file.

for threads in 1 4 8 16 32 64
do
    for r in 1 2 3
    do
        sysbench --num-threads=$threads --test=cpu run > sysbench_cpu_${threads}_threads_$(date +%Y-%m-%d_%H-%M-%S).log
        sleep 30
    done

    for r in 1 2 3
    do
        sysbench --num-threads=$threads --test=memory run > sysbench_memory_${threads}_threads_$(date +%Y-%m-%d_%H-%M-%S).log
        sleep 30
    done

    # Create the test files for the fileio runs at this thread count
    sysbench --num-threads=$threads --test=fileio --file-test-mode=rndrw prepare

    for r in 1 2 3
    do
        sysbench --num-threads=$threads --test=fileio --file-test-mode=rndrw run > sysbench_fileiorndrw_${threads}_threads_$(date +%Y-%m-%d_%H-%M-%S).log
        sleep 30
    done

    # Remove the test files so they don't linger between iterations
    sysbench --num-threads=$threads --test=fileio --file-test-mode=rndrw cleanup
done

Then I connected to each server, uploaded the script (actually copied and pasted it into vim, which seemed quicker), ran chmod +x on it, and started it. The scripts are running now on 10 machines…
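Once the runs finish, the headline numbers can be pulled out of all the logs in one go; sysbench prints a "total time:" line in each run summary, so something like this collates them:

grep -H "total time:" sysbench_*_threads_*.log | sort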

Results

I started writing this post about 2 months ago and haven’t yet published it. The bottom line was that memory comparisons were roughly even between virtual and physical environments. However, disk IO was hugely variable on the virtual hardware; at the top end it was comparable to the dedicated hardware, at the bottom end about 10% of that.

I didn’t notice any difference between operating systems, but I didn’t look for it very hard either. The results for the cloud servers were all over the place, while the dedicated box was very consistent.

My takeaway was that disk is unpredictable in the cloud. If you’d like to see the actual results to make a more detailed analysis, let me know in the comments and I’ll dig out the numbers. For now I’m going to finally publish this! 🙂

CloudFlare slowed down our site

I deployed CloudFlare onto one of our sites today. I wanted to see hands-on exactly how it works, so I ran a simple benchmark: 3 tests, 4 times each. The three tests were all from the same location, at 3 different network speeds. The 4 runs covered each configuration: CloudFlare enabled or disabled, with static assets coming from 3 domains or 1 domain. Here’s the results:

In every case, the site was slower with CloudFlare than without. I ran the numbers on every single comparison. Some comparisons were very close, with only 0.02 or 0.03 seconds in it, but CloudFlare did not come out ahead one single time.

It could be a network issue. Maybe the test server (WebPageTest.org / Gloucester) is very close to our server and far from the CloudFlare servers. But even so, I’d have expected some performance gain from all the “magic” CloudFlare is supposed to do.
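A quick way to sanity-check the proximity theory is to time the raw HTML fetch against the origin directly and via CloudFlare. A minimal sketch; both hostnames are placeholders (e.g. a hosts-file entry pointing straight at the origin versus the CloudFlare-proxied domain):

for target in http://origin.example.com/ http://www.example.com/; do
    # report DNS, connect, and total times for fetching the page
    curl -s -o /dev/null -w "$target dns %{time_namelookup}s connect %{time_connect}s total %{time_total}s\n" "$target"
done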

I’ll update with further tests. I’m also going to email CloudFlare and ask for their comments. I’ll post anything salient here.

[ Update: I’ve published some further test results here. ]

Postfix relayhost and aliases

I spent a few hours today figuring out why the /etc/aliases file was being ignored by Postfix. Mail to root was being delivered to root@example.tld instead of being rewritten per the /etc/aliases file. The solution is to use virtual_alias_maps instead.

In /etc/postfix/main.cf, comment out alias_maps and alias_database and add virtual_alias_maps, like so:

#alias_maps = hash:/etc/aliases
#alias_database = hash:/etc/aliases
virtual_alias_maps = hash:/etc/postfix/virtual

Then add something like this to /etc/postfix/virtual:

root    realuser@example.tld
user    realuser@example.tld

To make that active run:

sudo postmap /etc/postfix/virtual
sudo service postfix reload
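You can verify the mapping by querying the compiled table directly:

postmap -q root hash:/etc/postfix/virtual

which should print realuser@example.tld.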

I’m sure I figured all this out years ago, but if I did, I’d forgotten it all today!

Monit

We had our first outage since moving to Rackspace, on 27 December. I came online to emails saying the site was down. I freaked out. An outage within the first few days on my watch. Crikey.

Looking into the issue, memory usage started spiking around 4:40am. By 12:20pm the server had become unresponsive; all available memory and swap space had been filled. It took almost 8 hours for the server to crash, so I should have been warned of the problem in that window and fixed it before the crash ever happened. I set out to sort that, and monit appears to be the best tool for the job.

Issues

I hit a couple of issues. The first one had me stumped for quite a while. On Ubuntu, mysql does not create a pid file by default. This led monit to think it wasn’t running, try to start it, fail, and then freak out. The solution turned out to be simple: add “pid-file = /var/run/mysqld/mysqld.pid” to the mysqld section of my.cnf, then restart mysql.
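For reference, the relevant fragment of my.cnf ends up looking like this:

[mysqld]
pid-file = /var/run/mysqld/mysqld.pid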

Second, I reused the request for /index.html from one of the existing configs on a WordPress domain which does not have an /index.html file, so it returned a 404 and monit thought Apache was down. Make sure your Apache monit config references a URL that exists!
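One simple fix (a sketch, assuming the stock Ubuntu docroot of /var/www; adjust to suit) is to give monit a tiny static page of its own:

echo OK | sudo tee /var/www/monit-check.html

and then point the check at it with request "/monit-check.html".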

Otherwise, monit was a breeze to set up: sudo apt-get install monit, edit /etc/default/monit, create the config files, and then sudo service monit start. I’d recommend keeping an eye on the web interface for the first few minutes; I went away and came back to find monit had killed and restarted apache and mysql a few times because of issues in my config.

Config files

I collated a few resources to build our monit config. I had a real issue figuring out multiple mailservers, but I got there in the end. Here’s a summary of our monit config files. I used the format /etc/monit/conf.d/service.mon, which was the default on Ubuntu.

/etc/monit/monitrc:

###############################################################################
## Monit control file
###############################################################################
##
## Comments begin with a '#' and extend through the end of the line. Keywords
## are case insensitive. All path's MUST BE FULLY QUALIFIED, starting with '/'.
##
## Below you will find examples of some frequently used statements. For
## information about the control file, a complete list of statements and
## options please have a look in the monit manual.
##
##
###############################################################################
## Global section
###############################################################################
##
## Start monit in the background (run as a daemon):
#
set daemon 60
# set daemon 120 # check services at 2-minute intervals
# with start delay 240 # optional: delay the first check by 4-minutes
# # (by default check immediately after monit start)
#
#
## Set syslog logging with the 'daemon' facility. If the FACILITY option is
## omitted, monit will use 'user' facility by default. If you want to log to
## a stand alone log file instead, specify the path to a log file
#
set logfile /var/log/monit.log
# set logfile syslog facility log_daemon
#
#
### Set the location of monit id file which saves the unique id specific for
### given monit. The id is generated and stored on first monit start.
### By default the file is placed in $HOME/.monit.id.
#
# set idfile /var/.monit.id
#
### Set the location of monit state file which saves the monitoring state
### on each cycle. By default the file is placed in $HOME/.monit.state. If
### state file is stored on persistent filesystem, monit will recover the
### monitoring state across reboots. If it is on temporary filesystem, the
### state will be lost on reboot.
#
# set statefile /var/.monit.state
#
## Set the list of mail servers for alert delivery. Multiple servers may be
## specified using comma separator. By default monit uses port 25 - this
## is possible to override with the PORT option.
set mailserver
    smtp.sendgrid.net port 587
        username "%%%USERNAME%%%" password "%%%PASSWORD%%%"
        using tlsv1,
    smtp.gmail.com port 587
        username "%%%USERNAME%%%" password "%%%PASSWORD%%%"
        using tlsv1
    # The timeout and hostname come after all the mailserver definitions
    # with timeout 30 seconds
    hostname "%%%SERVER.FQDN.COM%%%"
#
# set mailserver mail.bar.baz, # primary mailserver
# backup.bar.baz port 10025, # backup mailserver on port 10025
# localhost # fallback relay
#
#
## By default monit will drop alert events if no mail servers are available.
## If you want to keep the alerts for a later delivery retry, you can use the
## EVENTQUEUE statement. The base directory where undelivered alerts will be
## stored is specified by the BASEDIR option. You can limit the maximal queue
## size using the SLOTS option (if omitted, the queue is limited by space
## available in the back end filesystem).
#
set eventqueue
    basedir /var/monit  # set the base directory where events will be stored
    slots 100           # optionally limit the queue size
#
#
## Send status and events to M/Monit (Monit central management: for more
## informations about M/Monit see http://www.tildeslash.com/mmonit).
#
# set mmonit http://monit:monit@192.168.1.10:8080/collector
#
#
## Monit by default uses the following alert mail format:
##
## --8<--
## From: monit@$HOST # sender
## Subject: monit alert -- $EVENT $SERVICE # subject
##
## $EVENT Service $SERVICE #
## #
## Date: $DATE #
## Action: $ACTION #
## Host: $HOST # body
## Description: $DESCRIPTION #
## #
## Your faithful employee, #
## monit #
## --8<--
## You can override this message format or parts of it, such as subject
## or sender using the MAIL-FORMAT statement. Macros such as $DATE, etc.
## are expanded at runtime. For example, to override the sender:
#
set mail-format { from: %%%MONIT@FQDN.COM%%% }
# set mail-format { from: monit@foo.bar }
#
#
## You can set alert recipients here who will receive alerts if/when a
## service defined in this file has errors. Alerts may be restricted on
## events by using a filter as in the second example below.
#
set alert %%%USER-EMAIL@FQDN.com%%%
# set alert sysadm@foo.bar # receive all alerts
# set alert manager@foo.bar only on { timeout } # receive just service-
# # timeout alert
#
#
## Monit has an embedded web server which can be used to view status of
## services monitored, the current configuration, actual services parameters
## and manage services from a web interface.
#
set httpd port 2812 and
use address localhost
allow %%%USER%%%:%%%PASSWORD%%%
# set httpd port 2812 and
# use address localhost # only accept connection from localhost
# allow localhost # allow localhost to connect to the server and
# allow admin:monit # require user 'admin' with password 'monit'
# allow @monit # allow users of group 'monit' to connect (rw)
# allow @users readonly # allow users of group 'users' to connect readonly
#
#
###############################################################################
## Includes
###############################################################################
##
## It is possible to include additional configuration parts from other files or
## directories.
#
include /etc/monit/conf.d/*.mon

/etc/monit/conf.d/apache2.mon:

# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
# RHEL httpd, Ubuntu apache2
check process apache2 with pidfile /var/run/apache2.pid

# New style
start program = "/usr/sbin/service apache2 start" with timeout 90 seconds
stop program  = "/usr/sbin/service apache2 stop"

# Old style
#    start program = "/etc/init.d/apache2 start" with timeout 90 seconds
#    stop program  = "/etc/init.d/apache2 stop"

# If Apache is using > 80% of the cpu for 5 checks, restart it
if cpu > 80% for 5 cycles then restart

# Could be used to control apache's spawning of threads
#    if children > 50 then alert
#    if children > 60 then restart

# Check if apache is responding on port 80
if failed host %%%PUBLIC_IP%%% port 80 protocol http
and request "/" # Some smallish page that should be available when server is up
# This page has to exist or the check will fail. Avoid index.html on WordPress for example.
with timeout 10 seconds

# Sometimes Apache doesn't respond right away, so give it two chances
# before forcing a restart.
for 2 cycles
then restart

# Apache requires mysql to be running
# Disable this on web-only nodes.
depends on mysql

# If apache is restarting all the time, timeout.
# A timeout stops monitoring the service and sends an alert.
if 3 restarts within 8 cycles then timeout

/etc/monit/conf.d/crond.mon:

# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
# Note: the service is "crond" on RHEL; on Ubuntu the service name is "cron"
# (adjust the service commands below to match your distro).
check process crond with pidfile "/var/run/crond.pid"

# New style
start program = "/usr/sbin/service crond start"
stop program = "/usr/sbin/service crond stop"
# Old style
#    start = "/etc/init.d/crond start"
#    stop = "/etc/init.d/crond stop"

# If crond is found to be stopped, monit will start it automatically.
# If it keeps needing restarts, stop monitoring and alert.
if 5 restarts within 5 cycles then timeout

/etc/monit/conf.d/disk-usage.mon:

# CHECK FILESYSTEM <unique name> PATH <path>
# Use the block device, not the mounted path, if filesystem is not mounted the
# mountpoint still exists, so checks may pass.

check filesystem rootfs with path /dev/xvda1

if space usage > 80% for 5 cycles then alert
if inode usage > 80% for 5 cycles then alert

# It might make sense to stop other services upon certain conditions here,
# for example (note monit's exec doesn't run a shell, hence the bash -c):
#    if space usage > 95% then exec "/bin/bash -c '/etc/init.d/apache2 stop; /etc/init.d/mysql stop'"
#    if inode usage > 95% then exec "/bin/bash -c '/etc/init.d/apache2 stop; /etc/init.d/mysql stop'"

/etc/monit/conf.d/mysql.mon:

# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
# MySQL does not create a pid file by default on debian / ubuntu it requires
# the addition of "pid-file = /var/run/mysqld/mysqld.pid" to my.cnf
check process mysql with pidfile /var/run/mysqld/mysqld.pid

# New style
start program = "/usr/sbin/service mysql start"
stop program  = "/usr/sbin/service mysql stop"

# Old style
#    start program = "/etc/init.d/mysql start"
#    stop program  = "/etc/init.d/mysql stop"

# If mysql is using too much cpu, restart it
if cpu > 80% for 5 cycles then restart

# If mysql is consuming too much memory, restart it
# This needs to be adjusted for the server.
if totalmem > 64.0 MB for 5 cycles then restart

# Check mysql responds
if failed unixsocket /var/run/mysqld/mysqld.sock protocol mysql
# If you use the network instead of a UNIX socket, adjust settings
with timeout 15 seconds
then restart

# If we're constantly restarting, timeout
# A timeout stops monitoring the service and sends an alert.
if 3 restarts within 5 cycles then timeout

/etc/monit/conf.d/ssh.mon:

# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
check process sshd with pidfile /var/run/sshd.pid

# New style
start program = "/usr/sbin/service ssh start"
stop program = "/usr/sbin/service ssh stop"
# Old style
#    start program = "/etc/init.d/ssh start"
#    stop program  = "/etc/init.d/ssh stop"

# If ssh is using 80% of cpu, something has gone wrong, restart it
if cpu > 80% for 5 cycles then restart

# If ssh is using > 200MB of memory, we have serious problems
if totalmem > 200.0 MB for 5 cycles then restart

# Check that ssh is responsive on port 22
if failed host %%%PUBLIC_IP%%% port 22 protocol ssh 2 times within 2 cycles
then restart

# If we're continually restarting, timeout
# A timeout stops monitoring the service and sends an alert.
if 3 restarts within 8 cycles then timeout

/etc/monit/conf.d/system.mon:

# As I understand it, according to Rackspace, CPU is allocated per server
# according to the size whereby a 1G server = loadavg 1, 0.5G = 0.5 load, etc.
# If the cpu is available, it can be utilised over that, but it will cause
# issues in the long term.
check system localhost

# This is a 512MB slice so sustained load above 0.5 will be problematic
if loadavg (1min) > 6 then alert
if loadavg (5min) > 4 then alert
if loadavg (15min) > 0.5 then alert

# Alert if memory usage hits 80% or higher
if memory usage > 80% then alert

# Don't fully understand these numbers, but they seem sensible
if cpu usage (user) > 70% for 2 cycles then alert
if cpu usage (system) > 50% for 2 cycles then alert
if cpu usage (wait) > 50% for 2 cycles then alert

# If the machine is under enormous load, reboot
if loadavg (1min) > 20 for 3 cycles then exec "/sbin/shutdown -r now"
if loadavg (5min) > 15 for 5 cycles then exec "/sbin/shutdown -r now"

# If memory usage is sustained above 97%, something is wrong, reboot
if memory usage > 97% for 3 cycles then exec "/sbin/shutdown -r now"

I’ve removed any sensitive values and replaced them with %%% markers.

References / resources / thanks

http://wiki.mediatemple.net/w/%28dv%29_4.0_-_Making_It_Better_::_Installing_Monit
http://forum.linode.com/viewtopic.php?t=6942
http://mmonit.com/monit/documentation/monit.html
http://rushthinking.com/using-monit/
http://serverfault.com/questions/337742/
http://1000umbrellas.com/?p=1109
http://mightyvites.com/blog/?p=1051
http://www.pythian.com/news/3794/mysql-on-debian-and-ubuntu/

Traffic profile by page

I’ve been looking at the profile of our traffic. On our biggest site, the top 50 pages account for more than 50% of our traffic. Across all four sites, our top 100 pages account for 50% of total pageviews.

Getting this data out of Google Analytics was easy. To start with I went to the Content > Site Content > Pages report, switched the view to percentage, and added up the percentages for our top 10 and then top 50 pages with a calculator.

To get data across all our sites, I used this trick to export the data. I had to use the old interface to do this, then Content > Top Content. I combined all the data into a single sheet in LibreOffice Calc, and added a row for the site domain and one for the total pageviews on that site. That allowed me to graph the percentage of total traffic for our most popular pages.
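For anyone repeating this, the cumulative percentages can also be computed outside the spreadsheet. A minimal sketch, assuming a headerless two-column CSV of page,pageviews already sorted by pageviews descending (adjust the separator and column to match the actual export):

awk -F, '{ total += $2; views[NR] = $2 }
    END {
        cum = 0
        for (i = 1; i <= NR; i++) {
            cum += views[i]
            printf "top %d pages: %.1f%% of pageviews\n", i, 100 * cum / total
        }
    }' pages.csv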

I graphed our traffic with a log scale, and it looks a lot like the graphs from the Long Tail.

Our pages are around 10KiB each. So a quick calculation suggests that (allowing 100% extra for Varnish overhead, i.e. 20KiB per page) we could serve 50% of our total traffic from 2MiB of cached data, or 80% of our traffic from 12.5MiB (640 pages at 20KiB per page). That puts an interesting spin on the idea of using a Varnish cache.
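Spelling the arithmetic out:

# 10KiB per page, plus 100% Varnish overhead, = 20KiB per cached page
echo "$(( 100 * 20 )) KiB"   # top 100 pages: 2000 KiB, about 2 MiB
echo "$(( 640 * 20 )) KiB"   # top 640 pages: 12800 KiB = 12.5 MiB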

I wonder how our traffic profile compares to that of other sites. I’d imagine some sites are more niche, with less traffic in the “head” and more in the “tail”. In those cases, caching might provide less of a performance boost.

This also got me thinking about benchmarking cache performance. I’ll think about it and write more in a later post.

Update: Another couple of perspectives. We have fewer than 500 pages, out of a total of 12k, which get more than 100 pageviews per month (3 per day). If we can serve 80% of traffic from 12.5MiB of cached data, in economic terms that’s £7/month!

Xeround eu-west-1a from Rackspace UK

I love the idea of Xeround’s offering. A fully managed, scalable, pay-for-what-you-use, MySQL-compatible, cloud-based database. It’s a great pitch: zero configuration, massive concurrency, and totally compatible with Magento.

Xeround launched on Rackspace in the US with plans, but no date, to launch in the UK. Their only current EU presence is Amazon’s eu-west-1a region. I ran some benchmarks to compare performance between a Rackspace cloud server running MySQL and Xeround in Ireland. The ping times tell the story.

ping -c4 ec2-46-137-176-72.eu-west-1.compute.amazonaws.com
PING ec2-46-137-176-72.eu-west-1.compute.amazonaws.com (46.137.176.72) 56(84) bytes of data.
64 bytes from ec2-46-137-176-72.eu-west-1.compute.amazonaws.com (46.137.176.72): icmp_req=1 ttl=50 time=19.1 ms
64 bytes from ec2-46-137-176-72.eu-west-1.compute.amazonaws.com (46.137.176.72): icmp_req=2 ttl=50 time=18.7 ms
64 bytes from ec2-46-137-176-72.eu-west-1.compute.amazonaws.com (46.137.176.72): icmp_req=3 ttl=50 time=18.7 ms
64 bytes from ec2-46-137-176-72.eu-west-1.compute.amazonaws.com (46.137.176.72): icmp_req=4 ttl=50 time=18.6 ms
--- ec2-46-137-176-72.eu-west-1.compute.amazonaws.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 18.657/18.835/19.153/0.233 ms

ping -c4 ec2-46-137-74-79.eu-west-1.compute.amazonaws.com
PING ec2-46-137-74-79.eu-west-1.compute.amazonaws.com (46.137.74.79) 56(84) bytes of data.
64 bytes from ec2-46-137-74-79.eu-west-1.compute.amazonaws.com (46.137.74.79): icmp_req=1 ttl=53 time=16.1 ms
64 bytes from ec2-46-137-74-79.eu-west-1.compute.amazonaws.com (46.137.74.79): icmp_req=2 ttl=53 time=16.2 ms
64 bytes from ec2-46-137-74-79.eu-west-1.compute.amazonaws.com (46.137.74.79): icmp_req=3 ttl=53 time=16.2 ms
64 bytes from ec2-46-137-74-79.eu-west-1.compute.amazonaws.com (46.137.74.79): icmp_req=4 ttl=53 time=16.1 ms
--- ec2-46-137-74-79.eu-west-1.compute.amazonaws.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 16.161/16.200/16.238/0.130 ms

ping -c4 ec2-46-137-175-244.eu-west-1.compute.amazonaws.com
PING ec2-46-137-175-244.eu-west-1.compute.amazonaws.com (46.137.175.244) 56(84) bytes of data.
64 bytes from ec2-46-137-175-244.eu-west-1.compute.amazonaws.com (46.137.175.244): icmp_req=1 ttl=53 time=12.5 ms
64 bytes from ec2-46-137-175-244.eu-west-1.compute.amazonaws.com (46.137.175.244): icmp_req=2 ttl=53 time=12.3 ms
64 bytes from ec2-46-137-175-244.eu-west-1.compute.amazonaws.com (46.137.175.244): icmp_req=3 ttl=53 time=12.4 ms
64 bytes from ec2-46-137-175-244.eu-west-1.compute.amazonaws.com (46.137.175.244): icmp_req=4 ttl=53 time=12.3 ms
--- ec2-46-137-175-244.eu-west-1.compute.amazonaws.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 12.365/12.432/12.586/0.143 ms

ping -c4 31-222-141-238.static.cloud-ips.co.uk
PING 31-222-141-238.static.cloud-ips.co.uk (31.222.141.238) 56(84) bytes of data.
64 bytes from 31-222-141-238.static.cloud-ips.co.uk (31.222.141.238): icmp_req=1 ttl=64 time=0.317 ms
64 bytes from 31-222-141-238.static.cloud-ips.co.uk (31.222.141.238): icmp_req=2 ttl=64 time=0.307 ms
64 bytes from 31-222-141-238.static.cloud-ips.co.uk (31.222.141.238): icmp_req=3 ttl=64 time=0.300 ms
64 bytes from 31-222-141-238.static.cloud-ips.co.uk (31.222.141.238): icmp_req=4 ttl=64 time=0.320 ms
--- 31-222-141-238.static.cloud-ips.co.uk ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.300/0.311/0.320/0.008 ms

ping -c4 10.177.6.181
PING 10.177.6.181 (10.177.6.181) 56(84) bytes of data.
64 bytes from 10.177.6.181: icmp_req=1 ttl=64 time=0.340 ms
64 bytes from 10.177.6.181: icmp_req=2 ttl=64 time=0.308 ms
64 bytes from 10.177.6.181: icmp_req=3 ttl=64 time=0.333 ms
64 bytes from 10.177.6.181: icmp_req=4 ttl=64 time=0.330 ms
--- 10.177.6.181 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2997ms
rtt min/avg/max/mdev = 0.308/0.327/0.340/0.025 ms

Pretty pictures

The ping times actually tell the whole story, but before I tried that I ran the example dbench config, and it produced pretty pictures, which I’ll share now. The important thing to look at in these graphs is the scale along the left-hand side. I ran tests on the internal and public IPs of the Rackspace cloud server and the three IPs made available for Xeround.

ec2-46-137-74-79.eu-west-1.compute.amazonaws.com

ec2-46-137-175-244.eu-west-1.compute.amazonaws.com

ec2-46-137-176-72.eu-west-1.compute.amazonaws.com

31-222-141-238.static.cloud-ips.co.uk

10.177.6.181

Conclusion

Maybe it was a crazy idea from the beginning, but I think it was worth the couple of hours I put into testing the theory. When Xeround launch on Rackspace UK, I’ll be interested to compare it against our own MySQL server. Unfortunately, for the time being, it’s a non-starter.

I did find it interesting that there was a noticeable difference between the different IPs used to connect to Xeround. I’m not sure what the cause of that is; it might simply be that the connection is unpredictable because it’s going across networks. Another reason to keep our database and web servers in the same place.

Benchmark details

I have no benchmarking experience, so my approach was quite primitive. I commissioned two Rackspace cloud servers: one with 512MB memory to run the test (benchie), and one with 1024MB to host MySQL (benchdb). I downloaded dbench, installed php5-cli and php5-gd (for the pretty pics!), then set up the example/config.php to connect to Xeround and let it rip.

While that was running, I installed mysql-server on benchdb. I changed bind-address to 0.0.0.0 in my.cnf so MySQL would listen on all IPs. Other than that, I made no config changes from the default.
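That change is a one-liner in the stock Ubuntu config (/etc/mysql/my.cnf):

[mysqld]
bind-address = 0.0.0.0  # listen on all interfaces, not just 127.0.0.1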

benchie was the Rackspace Ubuntu 11.04 image, benchdb the Ubuntu 10.04 LTS. I ran apt-get update / upgrade, created a new user, and otherwise left the machines stock.

Magento hosting providers

I’ve spent the last couple of days looking at various quotes for Magento hosting. Prices for managed, dedicated servers vary considerably. At the lower end of the price spectrum are ForLinux, who seem to have great Magento experience. Peer1 were mid-range, and Rackspace were the top end.

There were a few other Magento-specific offers in the mix, but they fell out because of professionalism issues: quotes taking forever, sloppy or unclear responses, and unresponsive (or in some cases rude) sales people. Bottom line, we’re buying peace of mind, not hardware, and that starts from the first interaction. If a company is too busy or insufficiently motivated to reply or send us quotes timeously, they take themselves out of the running.

ForLinux look like a great-value, very responsive option. They have a cloud offering, but there’s no online pricing, and it’s not clear to me whether they provide the cloud products themselves or provide a management service on top of AWS.

Peer1 have recently launched Zunicore, a highly configurable cloud service. It’s possible to increase memory or CPU without adding hard drive space, or vice versa. That’s appealing, as it allows more careful tailoring of the services required: for example, more memory and less disk for database servers, more disk and less memory for static hosting, and so on. The Zunicore service only launched last month, so it’s not fully battle-tested at this point.

Rackspace have the most advanced cloud offering of all three. They offer servers (managed or unmanaged), load balancers, and cloud files linked to a CDN. It’s this breadth of offering that has me leaning towards Rackspace at the moment.

The idea of starting with a dedicated server backed by a management service, with the flexibility to add cloud servers for couchbase (memcache), reverse proxies, and so on, is very appealing. Plus, in theory, the ability to scale up resources for periods of higher load, and scale back after, is also appealing.

I’m not sure how much difference this will make in practice, but having the option available is a significant factor in our decision.

Measuring web site speed

My first challenge in the quest for ultra high performance Magento is to measure current performance. How fast does our site load? How long does it take from the moment a user clicks our link to the time they can see our web page?

There is no simple answer to this relatively simple question. There are some great tools that help to answer it, but none of them is quite perfect. I’d like a regular snapshot of our page load times across a range of pages, taken at regular intervals. So far I’ve found a range of tools that will give me instant data on page load time, but only one that will check it on a schedule, and its least frequent interval is every 20 minutes, at $5 per check location per page per month. My research continues…

I found a great question on serverfault (via this) that linked to most of the monitoring services I could find. There’s a trove of information in there. The providers I looked at in some detail are:

We are already using the free tier on Pingdom to monitor uptime. One url gets hit every minute, free of charge, and notifications get sent if the site is down. For $10 a month, we can increase that to 5 urls, plus 50c/month for each url thereafter. I think that’s likely to be our most sensible option.

Simplicity

If there were a service that was 37signals-simple, that checked our site was online regularly, sent alerts, and monitored the site speed a few times a day, for $10 or $20 a month, I’d sign up immediately. Pingdom or Monitis looks like the closest thing.

Two categories of speed

Speed testing services seem to fall into two categories. Services like Pingdom’s monitoring track the time it takes to load the HTML page. Browser tools, on the other hand, give an indication of how long it takes to “render” the page in a browser. The first measures straight server response time; the second tries to measure the user’s actual load time.
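The first category is easy to approximate from the command line; the second really needs a browser. As a sketch (www.example.com is a placeholder), curl can report time to first byte versus total fetch time for the raw HTML, though not render time:

curl -s -o /dev/null -w "first byte %{time_starttransfer}s, total %{time_total}s\n" http://www.example.com/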

Personally, I’d like both to be measured and recorded regularly; I want all the data I can get, all the time! I haven’t yet found such a service. It seems like all the page load measurement tools are aimed at giving advice and benchmarking, rather than ongoing monitoring.

There are lots of benchmarking tools around.

Monitis

I’d forgotten about Monitis while I was writing most of this. They offer the services I’m looking for, but the real sticking point is the minimum frequency of every 20 minutes. I’d much rather have 10 pages on our site measured for speed every 4 hours than 1 page measured every 20 minutes. Under their pricing model the cost per page speed test is $5/month, every 20 minutes, per location.

Update: I just called Monitis and spoke to a very friendly chap who said they would put together a custom plan for my requirements. Once I’m clear on how many pages I want to check and how often, I’ll send them a request for a quote.

Update: I forgot to include this link. I’m guessing this is the article that put site speed on many people’s radar.

Hello world!

Welcome to Pergento, a blog chronicling my adventures in pursuit of ultra high performance Magento.

I’m working with my brother with the aim of radically improving the reliability and dramatically increasing the speed of his Magento-based ecommerce sites. I’ll share stories of my adventures as the journey unfolds. I hope to be able to share tales of what works, what doesn’t work, and most significantly, how the changes I make affect conversion. The goal of this whole exercise is to help drive higher revenue through Magento-powered sites.

Amazon and Google both seem to agree that page performance affects sales. I’ve seen the figure, attributed to Amazon, that a 100ms increase in page load time reduces sales by 1%; I’ll try to find a source for that. Likewise, I’ve read that increasing the speed of a site can dramatically boost search result placement. Again, I’ll try to find the references.

If you’d like to stay informed of the adventures, please subscribe to the feed, or sign up to receive email updates by clicking the “Follow” box in the bottom right hand corner.
