Pergento no more

I’ve moved on from the world of Magento performance optimisation and am now working with Meteor, building mobile and web apps ultra fast at our new agency, superlumen. Check out the website if you’re a startup looking for an interim CTO, or if you have a business idea you want to validate with either a landing page campaign or an MVP.


Terrible reliability with Linode

Yet another incident with Linode today. We’re all about getting the most bang for buck in hosting, but apparently choosing Linode was a bad decision. We’ve had more outages with Linode in the last year than across all our other machines ever. Time to find another supplier for US nodes. 😦

It doesn’t help that at some point they halved their pricing and didn’t bother letting us know, so we spent several months with half as much RAM as we were paying for!

ModPagespeedLoadFromFile all CDN domains

While investigating a serious performance issue over the last 24 hours, I discovered that some of our CSS files were being combined but not rewritten by mod_pagespeed. After much hair pulling, and far too many hours spent in front of curl, vim, less and friends, I finally tracked down the solution. I added a ModPagespeedLoadFromFile directive for every CDN domain and now all our resources are being properly rewritten.

I think, but I’m not sure, that mod_pagespeed tries to retrieve from the same domain as the request arrives on. So if you’re rewriting resources from example.tld onto cdn.xmpl.tld, when the request arrives at mod_pagespeed with a Host header of cdn.xmpl.tld, mod_pagespeed tries to look up every resource on the cdn.xmpl.tld hostname, instead of doing a reverse lookup through the ModPagespeedShardDomain directive.

Interestingly, this only seemed to affect our primary CSS file. Our print.css and JavaScript appeared to be unaffected. Strange. Very glad I can put this one to bed.
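
As a rough illustration of "a directive for every CDN domain", here is a small Python sketch that prints one ModPagespeedLoadFromFile line per domain. The second domain, the document root and the helper name are all hypothetical, and it assumes every CDN hostname serves the same files from a single directory over plain HTTP. It is not our actual configuration.

```python
def load_from_file_directives(cdn_domains, docroot, scheme="http"):
    """Return one ModPagespeedLoadFromFile line per CDN domain.

    Assumes (hypothetically) that every domain maps to the same docroot.
    """
    # Normalise the path so it always ends with exactly one slash.
    root = docroot.rstrip("/") + "/"
    return [
        f'ModPagespeedLoadFromFile "{scheme}://{domain}/" "{root}"'
        for domain in cdn_domains
    ]


if __name__ == "__main__":
    # Hypothetical CDN hostnames and document root, for illustration only.
    for line in load_from_file_directives(
        ["cdn.xmpl.tld", "cdn2.xmpl.tld"], "/var/www/html"
    ):
        print(line)
```

The printed lines can be pasted into the Apache configuration next to the other mod_pagespeed directives. If the CDN domains are also served over HTTPS, the same function can be called again with scheme="https".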

Graphing performance metrics

I’ve spent a couple of days researching options to graph performance metrics. We’re trying to get all our metrics from all our services into a single interface. Here are the pieces I found which seemed to fit best.

  • dygraphs – JavaScript time series graphing library, looks excellent
  • InfluxDB + Grafana – Appears to be the best in class for metric storage + display
  • Graphite (+ Carbon) – Looks to be more complex to set up and a little less powerful than InfluxDB
  • StatsD – Pre-process high volume metrics before feeding to InfluxDB / Carbon

Next step is to play with Influx + Grafana, either their hosted service, or our own, and see what we can push into it. More to follow…
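
To give a feel for what pushing a metric into StatsD involves, here is a minimal Python sketch using only the standard library. StatsD accepts plain-text UDP datagrams of the form name:value|type (c for counters, ms for timers, g for gauges), by default on port 8125. The metric names and the localhost address are assumptions for illustration; we haven’t settled on any naming scheme yet.

```python
import socket


def statsd_line(name, value, metric_type="c"):
    """Format one metric in the StatsD plain-text line protocol."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric line to a StatsD daemon over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names; assumes a StatsD daemon listening locally.
    send_metric(statsd_line("web.requests", 1))            # counter
    send_metric(statsd_line("web.response_ms", 320, "ms"))  # timer
```

Because it is UDP, nothing is reported back if no daemon is listening, which keeps the cost of instrumenting a hot code path low.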

Avoid StartSSL

We switched to using StartSSL some years ago for our SSL certificates. Their pricing model is attractive: pay $60 for personal identity validation and issue as many certificates as you like, in whatever configuration you like. Pay another $140 for EV validation. It all went great in the beginning. However, in recent interactions they’ve become increasingly antagonistic.

At first they refused to issue a replacement for an expired certificate, saying that we needed organisation validation (an extra $60) first. We were able to issue the same certificate 2 years ago without this extra $60. Then they became downright obnoxious by email, deciding they were now unhappy to discuss our account with me.

A little research into SSL certificates threw up two much cheaper alternatives: one from $5 for 1 domain, and namecheap at $30 for 3 domains.

mod_rpaf and intermittent port errors

We’re using mod_rpaf and trying to use it for SSL offloading so we can cache all HTTPS requests with Varnish. This has worked well in testing, but on production we’re seeing intermittent port errors. When we make 1,000 requests to the same URL over HTTP, several hundred of them show up in the Apache logs as port 443.

This causes all sorts of unexpected side effects, particularly with mod_pagespeed, which serves a 404 when the port has been set incorrectly. Nightmare.
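
To put a number on how often the port is wrong, something like the following Python sketch can tally ports from a log. It assumes the port is the first whitespace-separated field on each line, for example via a custom LogFormat that starts with %p. That log layout and the sample lines are hypothetical, not our production format.

```python
from collections import Counter


def count_ports(log_lines):
    """Tally the first whitespace-separated field (assumed to be the port)."""
    ports = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            ports[fields[0]] += 1
    return ports


if __name__ == "__main__":
    # Hypothetical log lines, as if written by: LogFormat "%p \"%r\" %>s"
    sample = [
        '80 "GET /skin/styles.css HTTP/1.1" 200',
        '443 "GET /skin/styles.css HTTP/1.1" 404',
        '80 "GET /skin/styles.css HTTP/1.1" 200',
    ]
    for port, hits in count_ports(sample).most_common():
        print(port, hits)
```

Run against the log from a batch of plain HTTP requests, any count under 443 is a request that was tagged with the wrong port.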

Bottom line, we’ve taken this out of our architecture until we can find a solution. The hardest part is that we can’t replicate the issue on staging. I’ve opened an issue.