Monday's incident report

September 14, 2017

On Monday we experienced outages across all our services. The Clearbit APIs went from very slow to down, up, down, very slow, and then back up again for a good part of the day.

For that, we’re extremely sorry. We understand that Clearbit plays a large role in many sales, marketing, and product processes and we deeply apologize for any inconvenience caused.

Issue Summary

Requests to the Clearbit APIs started responding with HTTP timeout errors. We were alerted to this quickly through our Runscope integration and, after some investigation, found that the root cause was our NGINX routing layer: NGINX was segfaulting, which caused a huge backlog of HTTP requests and timeouts.

[Figure: total_response_time]

To add fuel to the fire, once our routing layer stabilized, requests from our internal Batch Enrichment Service started retrying the failed batches and flooded our Company API with duplicate lookups.

Timeline (all times UTC)

  • 5:30 PM NGINX issues identified
  • 8:30 PM Træfik routing layer rolled out
  • 9:00 PM System processing HTTP requests normally
  • 11:00 PM Company lookup queues flooded with retries from Batch
  • 2:00 AM Sporadic timeouts on Company API
  • 4:05 AM Redis failover to follower DB
  • 4:40 AM Batch disabled
  • 7:00 AM Batch enabled in stages
  • 8:00 AM All systems fully operational

Root Cause

NGINX failures, aggressive Batch retry logic, and a Redis failover due to a surge in retry traffic.

Resolution and recovery

We've been running Træfik in our dev and staging clusters for a few months now and had been planning to switch from NGINX to Træfik. When upgrading NGINX to the most recent stable release didn't fix the segfaults, we decided to make the switch.

While this wasn't the upgrade path we'd imagined for ourselves, the transition was smooth and our production routing layers have been behaving well thus far.

Corrective and Preventative Measures

We pushed a fix to the Batch service that implements an exponential backoff algorithm to protect us against aggressive retries in the future.
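As an illustration of the general technique, here is a minimal Python sketch of exponential backoff with jitter. It is not the actual Batch service code: the function names (`backoff_delays`, `retry_with_backoff`), the parameters, and the default values are all assumptions chosen for the example. The idea is that each failed attempt waits roughly twice as long as the previous one, up to a cap, so a burst of failures does not turn into a burst of immediate retries.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0, max_retries=5, jitter=True):
    """Yield the wait in seconds before each retry (hypothetical defaults)."""
    for attempt in range(max_retries):
        # Grow the delay exponentially, but never beyond the cap.
        delay = min(cap, base * factor ** attempt)
        # Random jitter keeps many clients from retrying in lockstep.
        yield random.uniform(0, delay) if jitter else delay


def retry_with_backoff(operation, max_retries=5, sleep=time.sleep, **backoff_options):
    """Call operation(); on failure, wait with exponential backoff and retry.

    Re-raises the last exception once max_retries retries have been used up.
    """
    delays = backoff_delays(max_retries=max_retries, **backoff_options)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted, surface the original error
            sleep(delay)
```

A caller would wrap the flaky call, for example `retry_with_backoff(lambda: lookup(batch), max_retries=5)`, where `lookup` stands in for whatever request is being retried. In practice it also helps to catch only errors that are worth retrying (timeouts, 5xx responses) rather than every exception.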

