On Friday afternoon, Clearbit was down for approximately 24 minutes, from 2:14pm to 2:38pm PDT. This is the longest outage we've had in the past two years, and also the worst (given that every API was down). For that, we're extremely sorry. We understand that Clearbit plays a big part in many sales & marketing processes, and we deeply apologize for any inconvenience caused.
We’d like to tell you what we understand about the outage and the steps we are taking to avoid a similar incident in the future.
We're in the process of adding an additional AWS region, us-west-2, to our existing cluster in us-west-1. We were working with a third-party vendor to help us link the two VPCs.
At 2:14pm, while attempting to add a route to the VPC routing table in us-west-1, the third-party vendor instead accidentally overwrote the existing 0.0.0.0/0 route pointing to the Internet gateway. This effectively cut off all access from the VPC to the Internet.
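We don't know the exact tooling the vendor used, but the failure mode is easy to reproduce with the standard AWS CLI. The sketch below (route table, gateway, and peering IDs are placeholders) shows how an "add" can silently become an "overwrite":

```shell
# create-route only ADDS a route. If a route for the destination CIDR
# already exists, it fails with RouteAlreadyExists instead of clobbering it.
aws ec2 create-route \
  --route-table-id rtb-xxxxxxxx \
  --destination-cidr-block 10.1.0.0/16 \
  --vpc-peering-connection-id pcx-xxxxxxxx

# replace-route, by contrast, silently OVERWRITES whatever route already
# exists for the given destination. Run against 0.0.0.0/0, it replaces the
# default route to the Internet gateway -- cutting the VPC off from the
# Internet, which is what happened here.
aws ec2 replace-route \
  --route-table-id rtb-xxxxxxxx \
  --destination-cidr-block 0.0.0.0/0 \
  --vpc-peering-connection-id pcx-xxxxxxxx
```

Because `replace-route` gives no warning about what it displaced, route-table edits like this are a good candidate for review before execution.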
Clearbit's team was paged after the outage began, but it wasn't immediately clear to us what the cause of the problem was. Compounding matters, we were also locked out of our VPN. It was only at 2:29pm that we pinged the third-party vendor, who started looking into the issue.
At 2:31pm, the third-party vendor removed the route that they had added to the table, but that didn't fix the problem.
At 2:38pm, the third-party vendor realized that there was no route to the Internet gateway and added it back. This resolved the issue.
While edits to the VPC routes clearly need to be made more carefully, the real failure was the breakdown in communication and uptime visibility between us and the third-party vendor. This resulted in a much longer downtime than necessary, and for that we take full responsibility. In the future, we'll make sure any such collaboration is much closer, and that everyone involved has visibility into our uptime status.