Category Archives: General

ModPagespeedLoadFromFile all CDN domains

While investigating a serious performance problem over the last 24 hours, I discovered that some of our CSS files were being combined but not rewritten by mod_pagespeed. After much hair pulling, and far too many hours spent in front of curl, vim, less and friends, I finally tracked down the solution: I added a ModPagespeedLoadFromFile directive for every CDN domain, and now all our resources are being rewritten properly.

I think, but I’m not sure, that mod_pagespeed tries to fetch resources from the same domain the request arrives on. So if you’re rewriting resources from example.tld onto cdn.xmpl.tld, then when a request arrives at mod_pagespeed with a Host header of cdn.xmpl.tld, it tries to look up every resource on the cdn.xmpl.tld hostname instead of doing a reverse lookup through the ModPagespeedShardDomain directive.
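
For reference, the fix looks roughly like this (domain names and document-root paths are placeholders; the point is one ModPagespeedLoadFromFile mapping per CDN hostname, all pointing back at the same files on disk):

# Shard static resources onto the CDN hostnames
ModPagespeedShardDomain example.tld cdn1.xmpl.tld,cdn2.xmpl.tld

# Load resources for the main domain AND every CDN domain straight from disk,
# so mod_pagespeed never fetches them over HTTP, whatever Host header the
# request arrives with
ModPagespeedLoadFromFile "http://example.tld/" "/var/www/example/htdocs/"
ModPagespeedLoadFromFile "http://cdn1.xmpl.tld/" "/var/www/example/htdocs/"
ModPagespeedLoadFromFile "http://cdn2.xmpl.tld/" "/var/www/example/htdocs/"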

Interestingly, this only seemed to affect our primary CSS file. Our print.css and javascript appeared to be unaffected. Strange. Very glad I can put this one to bed.

Graphing performance metrics

I’ve spent a couple of days researching options for graphing performance metrics. We’re trying to get all our metrics, from all our services, into a single interface. Here are the pieces that seemed the best fit.

  • dygraphs – JavaScript time-series graphing library, looks excellent
  • InfluxDB + Grafana – Appears to be best in class for metric storage + display
  • Graphite (+ Carbon) – Looks more complex to set up and a little less powerful than InfluxDB
  • StatsD – Pre-processes high-volume metrics before feeding them to InfluxDB / Carbon

Next step is to play with Influx + Grafana, either their hosted service, or our own, and see what we can push into it. More to follow…
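
To make that first test concrete, here’s the kind of thing I mean by “push into it”, as a sketch: a counter fired at a stock StatsD on UDP 8125, and a point written straight to InfluxDB assuming the 0.9-style line protocol on port 8086 (metric names, hostnames and the database name are made up):

# Fire a counter at StatsD (plain UDP, fire-and-forget)
echo "shop.checkout.completed:1|c" | nc -u -w1 localhost 8125

# Or write a point directly to InfluxDB over its HTTP API
curl -XPOST 'http://localhost:8086/write?db=metrics' \
  --data-binary 'page_load_ms,host=web01 value=243'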

mod_rpaf and intermittent port errors

We’re using mod_rpaf for SSL offloading so we can cache all HTTPS requests with Varnish. This worked well in testing, but in production we’re seeing intermittent port errors: making 1’000 requests to the same URL over HTTP, several hundred of them show up in the Apache logs as port 443.
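
For context, the setup is roughly the following sketch, using the 0.8-style directive names (older builds spell them RPAFenable, RPAFproxy_ips, and so on); the proxy IP is a placeholder:

# Trust the forwarded headers coming from the Varnish / SSL-offload box
RPAF_Enable       On
RPAF_ProxyIPs     127.0.0.1
RPAF_Header       X-Forwarded-For

# Let mod_rpaf rewrite the hostname, scheme and port Apache sees,
# based on the forwarded headers
RPAF_SetHostName  On
RPAF_SetHTTPS     On
RPAF_SetPort      On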

The wrong port causes all sorts of unexpected side effects, particularly with mod_pagespeed, which serves a 404 when the port has been set incorrectly. Nightmare.

Bottom line, we’ve taken this out of our architecture until we can find a solution. The hardest part is that we can’t replicate the issue on staging. I’ve opened an issue.

SPDY, Googlebot, Alternate-Protocol and SEO

We deployed SPDY on our production sites on Monday. It’s hard to tell precisely, but it looks like our pages are being removed from Google search results.

Some background: we send the Alternate-Protocol: 443:npn-spdy/2 header. Our HTTP pages are INDEX,FOLLOW and our HTTPS / SPDY pages are NOINDEX,NOFOLLOW.
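
In config terms that’s a single response header on the plain-HTTP vhost; something like this with mod_headers:

# Advertise SPDY on port 443 to clients fetching the HTTP version
Header set Alternate-Protocol "443:npn-spdy/2"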

Could it be that Googlebot follows the Alternate-Protocol header, loads the HTTPS version of the page, and then doesn’t index it because of the noindex tags?

Can’t find anything in Google about this issue. Anyone else have experience? I’ll try to post back here if we find anything more definitive than pure speculation…

All Magento pages from Varnish

I just realised I haven’t updated the site since our last big development. We’re now serving almost all of our pages from Varnish; crude research suggests around 90% of our pageviews are now coming from Varnish. In simple terms, we did the following:

  • Render the cart / account links from a cookie with javascript
  • Ajax all pages, so everything can be cached (with a few exceptions we manually exclude)
  • Cache page variants for currencies and tax rate

We’re also warming / refreshing the cache with a bash script that parses the sitemap and hits every URL with a PURGE then a GET.
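
The script is nothing fancy; roughly this shape (the sitemap URL is a placeholder, the sitemap is assumed to be plain uncompressed XML, and the VCL has to allow PURGE requests from the host running it):

#!/bin/bash
# Warm / refresh the Varnish cache: purge then re-fetch every URL in the sitemap
SITEMAP="http://example.tld/sitemap.xml"

curl -s "$SITEMAP" \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's/<loc>//' -e 's/<\/loc>//' \
  | while read -r url; do
      curl -s -o /dev/null -X PURGE "$url"   # evict any stale copy
      curl -s -o /dev/null "$url"            # fetch again so the fresh copy is cached
    done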

The hardest part of the whole performance effort has been measuring the impact of our changes. But our TTFB was previously in the 300-500ms range for most pages, and it’s now in the 20-30ms range for pages served from Varnish. I’m very confident it’s impacting our bottom line.
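
For anyone wanting to do a similar before/after comparison, the quickest way I know to sample TTFB is curl’s timing variables (the URL is a placeholder):

# Time to first byte for a single request, in seconds
curl -s -o /dev/null \
  -w "connect: %{time_connect}  ttfb: %{time_starttransfer}  total: %{time_total}\n" \
  http://example.tld/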

All category pages from Varnish

It’s a glorious day in the pursuit of ultra high performance on Magento. Today, we serve all our category pages from varnish. Plus, we artificially warm the cache to ensure that all category pages across all sites are already in the cache when a real user hits the sites.

Varnish typically takes our time to first byte from around 300ms – 400ms down to 20ms – 30ms. We were previously serving 80% of landing pages from Varnish, but this change should improve overall performance by a noticeable margin. Happy days. 🙂

The implementation is fairly custom. Essentially, we add a header to every page which tells Varnish whether the page can be cached or not: on category pages the header says yes, on product pages it says no. We also did some custom coding to dynamically render the header links (My Cart, Login, Logout, etc.) from a cookie, which we set on login, add to cart, and so on.
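
The Varnish side of that is small. Here’s a sketch of the idea in Varnish 3 syntax; the X-Cache-Allowed header name and the 1h TTL are made up for illustration, and the real VCL is more involved:

sub vcl_fetch {
    # Honour the cacheability flag the application adds to every response
    if (beresp.http.X-Cache-Allowed == "yes") {
        unset beresp.http.Set-Cookie;   # cookies would otherwise make the object uncacheable
        set beresp.ttl = 1h;
        return (deliver);
    }
    # Anything not explicitly flagged stays uncached
    return (hit_for_pass);
}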

varnishd: Child start failed: could not open sockets

I was banging my head against a wall trying to figure out this error:

varnishd: Child start failed: could not open sockets

I checked netstat -tlnp but nothing was listening on the target port or IP. Turns out the IP was simply missing: ifconfig didn’t show that IP being up at all. DOH! Simple solution once I found the actual problem. Posting here because I couldn’t find much about this one online.
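
If you hit the same error, these are the two checks worth doing straight away (the port and IP here are placeholders):

# Is something else already listening on the address varnishd wants to bind?
netstat -tlnp | grep ':6081'

# Is the IP it should bind to actually configured on an interface?
ip addr | grep '192.0.2.10'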

SSD partition realignment

If you want a great value host in the UK, OVH is pretty good. Their SSD based machines are hard to beat on price. Sure, the service sucks, but you get what you pay for.

There’s a bug in their auto installer: it partitions the whole disk, even if you ask it to leave a chunk free to reduce write amplification. By default we’re set up with software RAID, which allows on-the-fly repartitioning of the disks to align them with the 4k pages on the SSD. The OVH auto installer leaves them out of alignment, which is bad.

Here’s a step-by-step. Please be sure you understand each command before running it, or you could easily destroy all your data.

Assume you have a few logical volumes (say 80Gs worth) on a 120G disk with a 10G root partition, leaving in theory 30G free. The procedure is:

  • Reduce the physical volume to 83G (just a little over 80G, to be on the safe side).
  • Reduce the RAID partition to 85G (again a little over, better safe than sorry).
  • Take one drive out of the RAID array, delete the partition, recreate it properly aligned, and add it back to the array.
  • Let the array resync, then repeat for the second drive.
  • Finally, resize the RAID partition and the physical volume back up.

This assumes /dev/sda|b3 is an extended partition containing one logical partition, /dev/sda|b5, which is part of the RAID array /dev/md5.

Make a copy of your partition tables, raid layout, etc, before you start. Backup.

# Shrink the LVM physical volume to 83G and the RAID array to ~85G (mdadm --size is in KiB)
pvresize --setphysicalvolumesize 83G /dev/md5
mdadm --grow /dev/md5 --size 89216000

# Take this partition out of the RAID array
mdadm --manage /dev/md5 --fail /dev/sda5
mdadm --manage /dev/md5 --remove /dev/sda5
mdadm --zero-superblock /dev/sda5

# Remove the logical partition, then the extended partition that contains it
parted /dev/sda rm 5
parted /dev/sda rm 3

# Recreate the partitions, moved down the drive so they start on properly aligned (512k) boundaries
parted /dev/sda mkpart extended 23068672s 209715200s
parted /dev/sda mkpart logical 23072768s 209715200s
parted /dev/sda set 5 raid on

# Add the partition back into the RAID array
mdadm -a /dev/md5 /dev/sda5

# Let the resync finish, then repeat for the other drive

# Resize the raid and lvm back to the full size of the new partition
mdadm --grow /dev/md5 --size=max
pvresize /dev/md5

WordPress and memcached

Taking Web Performance Optimisation into my personal life, and partly egged on by my bro, I’ve been looking at my site’s performance over the last few weeks.

As with any performance optimisation, the starting point is the traffic profile. The site sees between 300 and 1’000 pageviews a day, the top 8 pages account for 50% of traffic, the rest is one or two pageviews per page per day.

Given this spread of traffic, full page caching will take a long time to warm up and will only speed up the second hit to any given page, if the cached page still exists. I’d like to boost performance across the board, so I looked at using memcache. Several plugins exist which leverage WordPress’s built-in caching and store the data in memcache so it persists between requests.
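
The mechanics are much the same for all of these plugins: they ship an object-cache.php drop-in which WordPress picks up from wp-content on every request. Installation is typically just a copy (the plugin path here is illustrative):

# Activate the drop-in by copying it into wp-content
cp wp-content/plugins/memcached/object-cache.php wp-content/object-cache.php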

Couchbase

The folks behind Membase and CouchDB merged to form Couchbase. They produce Couchbase Server, which is memcache-compatible out of the box (via a proxy known as moxi, I learned), with the added benefit of persistence to disk. One of my long-term goals is to keep Magento sessions for a long time in a mostly persistent cache, so I was keen to experiment with Couchbase.

At first, I installed Couchbase, fired up the memcached-redux plugin, and my load times went from ~200ms to >5s. It turns out Couchbase doesn’t work out of the box; it needs to be configured via the web interface on localhost:8091. Done. Now load times were in the ~400ms range. Slower. I learned that the moxi proxy is slower at getMulti() requests, so I installed the memcached plugin, which implements its own getMulti() in PHP. Load times improved slightly, to ~350ms.

Memcached

I then uninstalled Couchbase, installed memcached, and tried again. The memcached-redux plugin showed load times of ~350ms, the memcached plugin ~300ms, with a couple of 4s responses thrown in for good measure.

Site was slower

Bottom line, using memcache was slower, whichever backend or plugin I tried.

On this server, we have plenty of spare memory / CPU and mysql has been given a generous amount of memory to play with. My guess is that for reasonably simple queries, when serving from memory, mysql performs about the same as memcache. Some old reading suggests mysql might even perform slightly better under the right circumstances.

Here, mysql is connecting via a unix socket while memcache is over TCP/IP. That alone might account for the performance difference.

Memcache has its place

Memcache has a whole lot of properties that make it useful in a wide range of circumstances. WordPress.com, for example, serves its cached pages from memcache via Batcache. In an environment without a shared filesystem, memcache provides a distributed cache, which is the key to its success at WordPress.com. In fact, the Batcache documentation specifically says that file-based caching is faster on a single node.

Conclusion

On a single server with plenty of capacity, memcache is the wrong tool. I’m seriously considering Varnish for sidebar and/or full-page caching; it could really help with the busiest pages, and I have some experience with it. But I think the next step will be to test APC. It’s a single-machine, in-memory cache, so it could work well in this situation. Plus, the bytecode caching might have a positive impact.