Monday's incident report
On Monday we experienced outages across all our services. The Clearbit APIs went from very slow, down, up, down, very slow, and then back up again for a good part of the day.
For that, we’re extremely sorry. We understand that Clearbit plays a large role in many sales, marketing, and product processes and we deeply apologize for any inconvenience caused.
Requests to the Clearbit APIs started responding with HTTP timeout errors. We were alerted to this quickly through our Runscope integration, and after some investigation found that the root cause was our NGINX routing layer: NGINX was segfaulting, which caused a huge backlog of HTTP requests and timeouts.
To add fuel to the fire, once our routing layer stabilized, our internal Batch Enrichment Service started retrying the failed batches, flooding our Company API with duplicate lookups.
Timeline (all times UTC)
- 5:30 PM NGINX issues identified
- 8:30 PM Træfik routing layer rolled out
- 9:00 PM System processing HTTP requests normally
- 11:00 PM Company lookup queues flooded with retries from Batch
- 2:00 AM Sporadic timeouts on Company API
- 4:05 AM Redis failover to follower DB
- 4:40 AM Batch disabled
- 7:00 AM Batch enabled in stages
- 8:00 AM All systems fully operational
Root cause
NGINX failures, aggressive retry logic in our Batch Enrichment Service, and a Redis failover caused by the surge in retry traffic.
Resolution and recovery
We'd been running Træfik in our dev and staging clusters for a few months (and had been planning to switch from NGINX to Træfik), so when upgrading NGINX to the most recent stable release didn't resolve the segfaults, we decided to make the switch to Træfik in production.
While this wasn't the upgrade path we'd imagined for ourselves, the transition was smooth and our production routing layers have been behaving well thus far.
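For context, a Træfik v1 routing layer is driven by a declarative config mapping frontends (routing rules) to backends (upstream servers). The sketch below is purely illustrative; the hostnames, addresses, and backend names are assumptions, not our actual production config:

```toml
# Minimal Træfik v1 file-provider sketch (illustrative values only)
[entryPoints]
  [entryPoints.http]
  address = ":80"

[file]

[backends]
  [backends.api]
    [backends.api.servers.server1]
    url = "http://10.0.0.1:8080"

[frontends]
  [frontends.api]
  backend = "api"
    [frontends.api.routes.main]
    rule = "Host:api.example.com"
```

Because the routing rules live in config rather than hand-written server blocks, swapping a backend in or out doesn't require the kind of reload dance we'd been doing with NGINX.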
Corrective and Preventative Measures
We also pushed a fix to the Batch Service that implements an exponential backoff algorithm to protect us against aggressive retries in the future.
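The idea behind the backoff fix can be sketched roughly as follows; the function names and parameters here are illustrative, not the actual Batch Service code. Doubling the delay after each failure (with random jitter) keeps a fleet of failed batches from all retrying at the same instant:

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call fn, retrying on failure with exponential backoff plus jitter.

    The delay doubles after each failed attempt, capped at max_delay,
    and is multiplied by a random jitter factor so that many failed
    batches don't all hammer the API again simultaneously.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts; surface the error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

A naive retry loop with no delay is exactly what turned one outage into two; spacing retries out exponentially gives the downstream service room to recover.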