16 Feb

Reason and Explanation For Recent Outages

Dear customers,
Please find below some information regarding the issues in our Sydney 1 network between the 6th and 9th of February. I have taken this seriously and am dedicating a lot of time and resources to ensuring that these issues are taken care of and not repeated. I have listed the improvements that need to be made at the end of this document. I am also available for a chat on 02 8115 8801 or [email protected] if anyone would like to discuss anything in this piece.

Friday 6th February
15:48 - 17:00 (5 x 2-3 minute outages)

On Friday the 6th of February we were alerted to a large, highly targeted Distributed Denial of Service (DDoS) attack on the BGP sessions that connect Servers Australia to our upstream providers. These IPs are not generally known and are not a common target for a DDoS attack, and traffic to them cannot be diverted through a scrubbing system such as Black Lotus because of the role they play in delivering BGP routing information to our core network.

The risk of these IPs being attacked should be very low, as only someone with very extensive knowledge of BGP and of how a network is designed would attack them. We also had to be the deliberate target of such an attack for this to become an issue.

The attacks hit us harder and harder and eventually took us offline, so we had no choice but to come up with a plan to resolve the situation. The solution we used is not a common one, as these attacks are very rare, and we had to spend some time implementing the changes, working with our providers to make the target IPs 'disappear' from the internet and be hidden. This was a mammoth task, but it needed to be done.
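
For the more technically minded, the sketch below illustrates the general idea only. It is a hypothetical, simplified example using documentation IP ranges, not our real addresses or our actual router configuration: because a BGP session runs between two directly connected routers, only the neighbour itself ever needs to reach the link address, so traffic to those addresses from the wider internet can be dropped at the edge without breaking the session.

```python
# Conceptual sketch only -- not Servers Australia's production configuration.
# Illustrates why BGP peering addresses can be "hidden": the eBGP session
# runs between directly connected routers, so only the neighbour itself
# ever needs to reach the link address, and everything else can be dropped
# at the network edge.

import ipaddress

# Hypothetical point-to-point link prefixes used for upstream BGP sessions
# (documentation ranges, not real addresses).
PEERING_LINKS = [
    ipaddress.ip_network("192.0.2.0/30"),
    ipaddress.ip_network("198.51.100.4/30"),
]

# Hypothetical addresses of the directly connected upstream routers.
TRUSTED_NEIGHBOURS = {
    ipaddress.ip_address("192.0.2.1"),
    ipaddress.ip_address("198.51.100.5"),
}

def allow_at_edge(src: str, dst: str) -> bool:
    """Edge filter: drop any packet aimed at a peering link address
    unless it comes from the directly connected neighbour itself."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if any(dst_ip in net for net in PEERING_LINKS):
        return src_ip in TRUSTED_NEIGHBOURS
    return True  # all other traffic follows the normal (scrubbable) paths

# A flood aimed at the peering address from a random internet host is dropped...
print(allow_at_edge("203.0.113.99", "192.0.2.2"))   # False
# ...while the BGP neighbour itself can still reach it.
print(allow_at_edge("192.0.2.1", "192.0.2.2"))      # True
```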

We started to deploy these plans to our old core network, as the new core network was still on its way. Unfortunately this caused us a lot of pain and further issues, as the old core was not able to handle the type of DDoS traffic destined for our routers. A temporary solution was put into place and all traffic stabilised at around 17:00.

Saturday 7th February
15:26 - 15:50

Issues from Friday's temporary change started to affect the flow of traffic. Essentially, our core network came under intense load during a peak period and started to crash; the equipment could not handle the extra route processing we had put onto it on Friday.

We have known for some time that there is a route limitation on the old equipment, and we have been working towards a new core network as fast as possible; however, these things take time and planning. We had planned to make the change later the following week, but had to make the decision to cut over to the new core that day to stop these issues.

The network was again temporarily restored and a plan was under way.

The afternoon of the 7th was spent planning and migrating around 10% of the traffic in our Sydney network from the old core to the new one as a small trial. Everything went as planned and the trial was successful.

Sunday 8th February
All Day (24 hours)

During the early hours of Sunday morning, and all through the day, we migrated customers from the old SY1 core network to the upgraded core network. There was very little disruption to customers, and people saw a vast improvement in reliability and stability on the new network.

Due to the size of the network we were not able to migrate everyone in one go, so some customers were still on the old network well into Sunday night.

Late on Sunday evening we started to see capacity issues again. This was a concern, and we had to direct our attention back to the old core network and stop migrating to the new one. This meant that most gaming customers, and customers with BGP sessions to us, were still on the old core network. That network was stable and working, and a plan to migrate the gaming and downstream customers was underway.

Monday 9th February
All Day

Plans to physically move our provider connections from our old core to our new core were progressing rapidly, and the work was being done in a way that would not disturb customer traffic.

At around 19:00 on the 9th there was a major issue with an upstream provider. They were the target of a DDoS attack, which in turn affected us. This time the attack was not aimed at us; we were simply unfortunate to suffer loss on our network because they were being attacked. We quickly shifted traffic and everything returned to normal.

At around 20:00 we detected a DDoS attack against our gaming BGP IPs. This network had not yet been moved to the new core and the provider BGP sessions were still vulnerable, which caused a huge amount of loss on the network and eventually took out the entire gaming network.

We decided that we needed to act fast and migrate the gaming network to the new core. Our engineers contacted our providers and we started physically moving cables and transit links to the new core to make this happen. Once everything was migrated we instantly saw the attacks stop.

These attacks were aimed at Servers Australia and were deliberately directed against our core infrastructure.

The fix?
Unfortunately it was a very long weekend for all of us here at Servers Australia, and there were some big lessons for us. We are a rapidly growing company and we had overlooked some things that we need to do during an outage; these are listed below:

  • Move our phone system off our network and have a better messaging service when there are major service issues.
  • Send faster outage updates from the www.status.mysau.com.au system.
  • Send SMS alerts to customers who opt in, with updates on outages.

As for a long-term solution, we are going to continue upgrading the network, and we believe that we have closed the final hole in our network that was allowing the DDoS attacks to enter. Having a perfect network is hard, but knowing what to do during an outage will be our key focus, including keeping our customers informed and up to date.

We have also increased network capacity tenfold to improve network performance, and we are in the process of linking all data centres with larger interconnections so that data will flow in and out of every data centre across all states and New Zealand. This will greatly increase the capacity and reliability of the network.

I appreciate all the support and patience whilst we iron out these issues in the Sydney network. This has been one of the hardest growth issues that I and our company have had to deal with, and it is the understanding of our customers that allows us to improve and grow this network.

Regards,

Jared Hirst
Chief Executive Officer
