Between 12:13 and 18:55 PT on October 19th, 2022, the Invoca platform experienced disruptions to the following services:
While reporting data was delayed, no data was lost. There were no impacts to call routing or other Invoca services not listed. Availability and functionality varied per service and are detailed further below.
Sequence of events
Automated monitoring alerted us to the problem at 12:25, and our engineering team began coordination and mitigation within minutes.
The timeline below provides a high-level overview of key events, service availability, and mitigation efforts:
12:25: Automated monitoring notifies us of a problem in the data processing pipeline.
12:27: Engineering teams start our incident response process.
12:31: Problem isolated to a specific component in the data processing pipeline.
13:32: First mitigation change deployed. Service for the Invoca portal and Transaction Events API was restored. Reports were still failing and new report data was delayed.
15:03: Second change deployed for mitigation. This change enabled reports to be displayed on the portal, but new report data was still delayed.
16:01: Final mitigation change deployed, fully mitigating the incident. This change allowed the system to recover, and report data began to catch up from this point.
18:55: All reporting data was caught up, and the incident was fully resolved.
Why it happened
The main failure occurred in the backend that moves data for post-call processing and reporting. A recent change to a software library used in that backend introduced a defect in the logic that re-queues failed data processing jobs. The retry logic worked for a typical number of expected failures, but it could not slow down or stop retrying once failed jobs accumulated.
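For readers unfamiliar with this failure mode, the following is a minimal, illustrative sketch of retry logic that can both slow down and stop, using capped exponential backoff with jitter and an attempt limit. It is not the code involved in this incident or the change we deployed. The function name `retry_delay` and the constants `BASE_DELAY_S`, `MAX_DELAY_S`, and `MAX_ATTEMPTS` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical values, for illustration only.
BASE_DELAY_S = 1.0   # ceiling for the delay before the first retry
MAX_DELAY_S = 60.0   # upper bound on any single retry delay
MAX_ATTEMPTS = 8     # after this many retries, stop re-queuing the job


def retry_delay(attempt, rng=random.random):
    """Return the seconds to wait before retry number `attempt` (1-based),
    or None when the job should no longer be retried."""
    if attempt > MAX_ATTEMPTS:
        # Stop retrying; a caller might set the job aside for later inspection.
        return None
    # Exponential backoff: the ceiling doubles with each attempt, up to MAX_DELAY_S.
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))
    # Jitter: pick a random delay below the ceiling so that accumulated
    # failed jobs do not all retry at the same moment.
    return rng() * ceiling
```

With these assumed values, the delay ceiling grows from 1 second on the first retry to the 60 second cap by the seventh, and a ninth retry is refused, so a backlog of failing jobs puts progressively less load on the pipeline instead of more.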
What we did and what we’re doing
During the incident, we changed how the retry logic operates. This change was key to resolving the incident and helps guard against similar incidents in the future.
Post-incident, we’re planning additional changes: