Pergento no more

I’ve moved on from the world of Magento performance optimisation and am now working with Meteor on ultra-fast building of mobile and web apps with our new agency, superlumen. Check out the website if you’re a startup looking for an interim CTO, or if you have a business idea you want to validate, either with a landing page campaign or an MVP.

Terrible reliability with Linode

Yet another incident with Linode today. We’re all about getting the most bang for our buck in hosting, but it’s becoming clear that choosing Linode was a bad decision. We’ve had more outages with Linode in the last year than across all our other machines ever. Time to find another supplier for US nodes. 😦

It doesn’t help that at some point they halved their pricing and didn’t bother letting us know, so we spent several months with half as much RAM as we were paying for!

ModPagespeedLoadFromFile all CDN domains

While investigating a serious performance issue over the last 24 hours, I discovered that some of our CSS files were being combined but not rewritten by mod_pagespeed. After much hair-pulling and far too many hours spent in front of curl, vim, less and friends, I finally tracked down a solution: I added a ModPagespeedLoadFromFile directive for every CDN domain, and now all our resources are being rewritten properly.

I think, but I’m not sure, that mod_pagespeed tries to retrieve resources from the same domain the request arrives on. So if you’re rewriting resources from example.tld onto cdn.xmpl.tld, when a request arrives at mod_pagespeed with a Host header of cdn.xmpl.tld, mod_pagespeed tries to look up every resource on the cdn.xmpl.tld hostname, instead of doing a reverse lookup through the ModPagespeedShardDomain directive.
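The shape of the fix, roughly (hostnames and paths below are placeholders, not our real config):

    # Shard static resources from the main domain onto the CDN hostnames.
    ModPagespeedShardDomain example.tld cdn1.xmpl.tld,cdn2.xmpl.tld

    # Map every CDN hostname back to the same files on disk, so
    # mod_pagespeed reads resources locally instead of trying to fetch
    # them over HTTP from whatever hostname it sees in the Host header.
    ModPagespeedLoadFromFile "http://example.tld/skin/"   "/var/www/example/skin/"
    ModPagespeedLoadFromFile "http://cdn1.xmpl.tld/skin/" "/var/www/example/skin/"
    ModPagespeedLoadFromFile "http://cdn2.xmpl.tld/skin/" "/var/www/example/skin/"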

Interestingly, this only seemed to affect our primary CSS file; our print.css and JavaScript appeared to be unaffected. Strange. Very glad I can put this one to bed.

Graphing performance metrics

I’ve spent a couple of days researching options for graphing performance metrics. We’re trying to get the metrics from all our services into a single interface. Here are the pieces I found that seemed to fit best.

  • dygraphs – JavaScript time-series graphing library, looks excellent
  • InfluxDB + Grafana – Appears to be best in class for metric storage and display
  • Graphite (+ Carbon) – Looks more complex to set up and a little less powerful than InfluxDB
  • StatsD – Pre-processes high-volume metrics before feeding them to InfluxDB / Carbon

Next step is to play with Influx + Grafana, either their hosted service or our own install, and see what we can push into it. More to follow…
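In the meantime, here’s a minimal sketch of what feeding StatsD looks like, since its plain-text UDP protocol is trivial (the metric name, host and port here are purely illustrative, not anything we run):

    import socket

    def send_timer(name, ms, host="127.0.0.1", port=8125):
        """Send one timing metric to StatsD over UDP.

        The StatsD wire format is "<name>:<value>|<type>", where "ms"
        marks a timer. StatsD aggregates these and forwards them to a
        backend such as Carbon or InfluxDB.
        """
        payload = "{}:{}|ms".format(name, int(ms))
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(payload.encode("ascii"), (host, port))
        sock.close()

    # e.g. record a 1234ms page load for a (hypothetical) checkout page
    send_timer("site.checkout.page_load_time", 1234)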

Avoid StartSSL

We switched to StartSSL some years ago for our SSL certificates. Their pricing model is attractive: pay $60 for personal identity validation and issue as many certificates as you like, in whatever configuration you like; pay another $140 for EV validation. It all went great in the beginning. However, in recent interactions they’ve become increasingly antagonistic.

At first they refused to issue a replacement for an expired certificate, saying that we needed organisation validation (an extra $60) first, even though we were able to issue the same certificate two years ago without it. Then they became downright obnoxious by email, deciding they were no longer willing to discuss our account with me.

A little research into SSL certificates threw up two much cheaper alternatives: SSLs.com (from $5 for a single domain) and Namecheap ($30 for three domains).

mod_rpaf and intermittent port errors

We’re using mod_rpaf for SSL offloading so we can cache all HTTPS requests with Varnish. This has worked well in testing, but in production we’re seeing intermittent port errors: making 1,000 requests to the same URL over plain HTTP, several hundred of them show up in the Apache logs as port 443.

This causes all sorts of unexpected side effects, particularly with mod_pagespeed, which serves a 404 when the port has been set incorrectly. Nightmare.

Bottom line, we’ve taken this out of our architecture until we can find a solution. The hardest part is that we can’t replicate the issue on staging. I’ve opened an issue.
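For context, the mod_rpaf side of a setup like ours looks roughly like this (directive names from the classic mod_rpaf 0.6 module, proxy IP is a placeholder; our exact config differs):

    # Varnish sits in front of Apache and terminates SSL, so mod_rpaf
    # restores the real client IP from the X-Forwarded-For header that
    # Varnish sets on the backend request.
    RPAFenable On
    RPAFsethostname On
    RPAFproxy_ips 127.0.0.1
    RPAFheader X-Forwarded-For

Nothing in that block touches ports explicitly, which is part of what makes the intermittent 443s so puzzling.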

SPDY, Googlebot, Alternate-Protocol and SEO

We deployed SPDY on our production sites on Monday. It’s hard to tell precisely, but it looks like our pages are being removed from Google search results.

Some background: we send the Alternate-Protocol: 443:npn-spdy/2 header. Our HTTP pages are INDEX,FOLLOW and our HTTPS / SPDY pages are NOINDEX,NOFOLLOW.
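Advertising SPDY this way is just a response header on the plain-HTTP vhost, along these lines via mod_headers (the HTTPS / SPDY vhost then serves the NOINDEX,NOFOLLOW robots tag mentioned above):

    # On the port-80 vhost: tell SPDY-capable browsers that the same
    # content is available over SPDY on port 443.
    Header set Alternate-Protocol "443:npn-spdy/2"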

Could it be that Googlebot follows the Alternate-Protocol header, loads the HTTPS version of the page, and then doesn’t index it because of the noindex tags?

Can’t find anything in Google about this issue. Anyone else have experience? I’ll try to post back here if we find anything more definitive than pure speculation…

Firefox SPDY alternate-protocol

Firefox does not support the Alternate-Protocol header part of SPDY. I’m not 100% confident of this, but from scanning this and reading this, that’s my understanding.

I couldn’t find a definitive answer to this question, so I’m posting this in the hope of saving others the search time. If you have information to the contrary, or if the situation changes, please let me know in the comments and I’ll update this post.

This raises the question: how do we deploy SPDY for Firefox users? Do we redirect all traffic to SSL anyway? Redirect only Firefox browsers that we think support SPDY? Only use SPDY for Chrome users? I’ll post more once we make a decision…
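One of those options, sketched with mod_rewrite purely for illustration (the user-agent match here is naive; a real rule would gate on Firefox versions known to support SPDY):

    # Firefox ignores Alternate-Protocol, so push Firefox users to HTTPS
    # where SPDY can be negotiated directly via NPN.
    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteCond %{HTTP_USER_AGENT} Firefox [NC]
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=302,L]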

Track performance by user?

Thanks to the awesome folks at SOASTA we’re now using their mPulse system instead of our own boomerang install. This gives us 2 major wins. First, we’re including the tracking code in the non-blocking asynchronous iframe method, which gives the best possible performance at this point. Second, we can actually see the data. Previously, we just weren’t getting visibility into our boomerang data. We had the data, but weren’t using it, which was a total waste.

mPulse tracks the median page load time. Looking at the data today, I was wondering what it looks like per user. For example, I wonder if users with faster connections typically hit more pages. If they do, the median over pageviews under-represents our slower users, which means the median of per-user average load times is actually higher than our median page load time.

Take two users, Alice and Bob. Alice is on her desktop in London with a 100Mbps line. She visits 8 pages, and the average load time for her 8 pageviews is 1.2s. Bob is on his iPad over 3G in Alabama. (We’re in the UK, so London is closer!) Bob visits 4 pages, and the average load time for those 4 pages is 2.3s. Now the arithmetic mean over all 12 pageviews is somewhere between the two, but the median pageview, in this case, is one of Alice’s.

What would be really interesting is to group pageviews by user: count up all the Alices and Bobs, then calculate the median (and 95th and 99th percentiles) of their per-user averages. That would actually tell me something like: 50% of our users saw an average page load under 1.4s, and 95% under 8s.
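As a toy illustration of the gap, using numbers like the Alice / Bob example above (real data would come out of mPulse or our boomerang archive, not a hard-coded dict):

    from statistics import mean, median

    # Illustrative pageview load times in seconds: Alice racks up 8 fast
    # views, Bob only manages 4 slow ones.
    pageviews = {
        "alice": [1.1, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.3],
        "bob":   [2.2, 2.3, 2.3, 2.4],
    }

    all_views = [t for times in pageviews.values() for t in times]
    per_user_avg = [mean(times) for times in pageviews.values()]

    print("median per pageview:", median(all_views))     # ~1.2, dominated by Alice
    print("median per user:    ", median(per_user_avg))  # ~1.75, Alice 1.2 and Bob 2.3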

Having said all that, the data might actually look very similar to what we’re currently seeing. I’ll try to dig out some of our archived boomerang data, do some analysis, and post an update once I have more info.

Ultra High Performance Magento