The Census website is finally back online. After being offline for around 44 hours, the site was switched on again this afternoon. However, due to DNS propagation issues, many users still experienced problems and received a ‘this site cannot be reached’ error.
This morning the Prime Minister was angry. He started pointing fingers and firmly placed the blame with the ABS and the company it employed to run the service, IBM. Despite the international coverage of the #CensusFail meltdown, there had been no response from IBM, one of the largest technology companies in the world.
David Swan, technology reporter at The Australian, posted an official statement from IBM in the last hour.
So what the hell happened?
The events of August 9th are technical facts and should be easy to explain. Judging by the growing list of press conferences, interviews and social media posts, politicians and ABS staff are busy pointing fingers, speaking about technical details they don’t understand and doing a serious job of ass covering. On some level, they are actually trying to prevent sensitive infrastructure details from being revealed, to avoid a repeat situation.
The details are unclear and contradictory, and they are unlikely to be revealed without the inquiry many have already called for.
With something as important and contentious as the Census, it was always going to draw the attention of those seeking notoriety, so a Denial of Service (DOS) attack was indeed expected. The problem was that the right mitigation strategy wasn’t implemented.
Not all traffic to the site is created equal. One of the hardest things to do is differentiate between legitimate website requests and ones that are artificially generated. One of the first tiers of prevention is to block traffic from locations that you can be sure aren’t valid. As the Census was after contributions from people in Australia, an easy way to block international threats is to employ the services of a geo-blocker. This is where the problem started. This IBM service was not up to the task: during the peak time on Tuesday night it crashed, and adequate redundancy wasn’t in place.
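To make the geo-blocking idea concrete, here is a minimal sketch in Python. It is not how the IBM service worked; it only illustrates the principle of rejecting requests whose source address falls outside a list of permitted networks. The function name `is_allowed` and the `ALLOWED_NETWORKS` list are hypothetical, and the address ranges are reserved documentation ranges standing in for real Australian allocations.

```python
import ipaddress

# Hypothetical allow-list. These are reserved documentation ranges used as
# placeholders; a real geo-blocker would load per-country allocations from
# a regularly updated database.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(source_ip: str) -> bool:
    """Return True only if the address sits inside an allowed network."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        # Malformed addresses are rejected outright.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("203.0.113.7", "192.0.2.1", "not-an-ip"):
        print(candidate, is_allowed(candidate))
```

The check itself is cheap. The hard part, and the part that reportedly failed, is running it at scale with a standby ready to take over.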
The next level you need to guard against is DOS attacks from within Australia. This is harder, but if you can determine that the traffic is originating from a specific ISP, you can ask them to shut down the source. Remember that you want to stop traffic upstream, before it hits the site, wherever possible, so that the traffic reaching the site is minimised. To achieve this, you need to have an agreement with ISPs, so that when you make the call they’re ready and willing to help. This also didn’t happen in the way it should have.
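Working out which source to ask an ISP to shut down starts with counting where requests come from. The following is a rough sketch under simple assumptions: the input is a list of IPv4 source addresses pulled from access logs, and addresses are grouped by /24 network. The function name `top_source_networks` and its parameters are hypothetical, and mapping a network back to the ISP that owns it is left out.

```python
import ipaddress
from collections import Counter


def top_source_networks(source_ips, prefix_length=24, limit=3):
    """Count requests per source network and return the busiest ones.

    Assumes IPv4 addresses; prefix_length controls how coarsely the
    addresses are grouped.
    """
    counts = Counter()
    for ip in source_ips:
        # strict=False lets us derive the enclosing network from a host address.
        network = ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False)
        counts[network] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = ["203.0.113.5"] * 3 + ["203.0.113.9", "198.51.100.1"]
    for network, hits in top_source_networks(sample):
        print(network, hits)
```

A heavy network is only a lead, not proof of an attack, since a busy ISP can also produce a large volume of legitimate traffic.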
Running the hardware infrastructure required for a large website like the ABS/Census website is not an easy task, hence employing the services of IBM and paying it millions of dollars. They use SoftLayer, an IBM company that delivers the many components, including firewalls, load balancers, threat mitigation etc. On the night, a piece of hardware infrastructure failed that again didn’t feature automatic failover and lacked the redundancy it should have had.
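Automatic failover, at its simplest, means trying a standby when the primary is unhealthy instead of falling over with it. The sketch below is an illustration only, not a description of the SoftLayer setup. The function `first_healthy`, the backend names and the health check are all hypothetical.

```python
from typing import Callable, Sequence


def first_healthy(backends: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first backend, in priority order, that passes its health check."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    # With no redundancy left, the failure is at least surfaced explicitly.
    raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    # Simulate the primary being down: traffic moves to the standby.
    down = {"primary"}
    print(first_healthy(["primary", "standby"], lambda name: name not in down))
```

With only one entry in the list, there is nothing to fail over to, which is the situation described above.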
The final piece of the puzzle was erroneous traffic inside the network that appeared to ABS staff as though a hack was occurring. This is when the panic button was pressed and the site was taken offline. That was absolutely the right thing to do given the circumstances. However, the data was interrogated offline in the minutes and hours afterwards and it was determined to be a false positive from monitoring software designed to detect a breach. Again, this is the responsibility of IBM, and with so many weak points susceptible to failure, the perfect storm became an unmitigated disaster.
There has been a lot of conjecture about whether this was a hack or not. At this stage, it doesn’t look like it was. What many have reported is that a DDOS attack is designed only to disrupt. While disruption can be the goal, a DDOS can also be used to overwhelm a server with traffic so that, as it crashes (which wasn’t allowed to occur in this instance), there’s a window where the server may fail poorly, which opens it up to attack before it falls over completely or the cord is pulled.
What we don’t know is why it took two days to hear from IBM, while the world learnt about a critical outage for a client as high-profile as the Australian Bureau of Statistics. If IBM was instructed not to communicate publicly, this should also be investigated.
The Government, Census and IBM have still not explained why they believe the site is now secure and adequately ready for what is now an even bigger prize for a would-be DDOS attacker. If the past 48 hours were well spent in crisis meetings, the infrastructure would now have 13 levels of redundancy, mitigation deals signed and capacity overbuilt to ensure a flawless collection of the rest of Australia’s important Census data.
Now that you can, you should complete your Census online. Those of you who think you’re more secure completing it on paper should realise your data ends up in the same database; the input is just less efficient.