Invoca Platform Reporting Issues

Incident Report for Invoca

Postmortem

Summary

Between 12:13 and 18:55 PT on October 19, 2022, the Invoca platform experienced disruptions to the following services:

Reporting availability
Reporting data freshness
Signal processing
Post-call webhooks
Invoca web portal login
Transaction Events API requests and ordering

While reporting data was delayed, no data was lost. There were no impacts to call routing or other Invoca services not listed. Availability and functionality varied per service and are detailed further below.

Sequence of events

Our metrics quickly alerted us to a problem and our engineering team was able to begin coordination and mitigation right away.

The timeline below provides a high-level overview of key events, service availability, and mitigation efforts:

12:25: Automated monitoring notifies us of a problem in the data processing pipeline.

12:27: Engineering teams start our incident response process.

12:31: The problem is isolated to a specific component in the data processing pipeline.

13:32: First mitigation change deployed. Service restored for the Invoca web portal and the Transaction Events API. Reports were still failing and new report data was still delayed.

15:03: Second change deployed for mitigation. This change enabled reports to be displayed on the portal, but new report data was still delayed.

16:01: Final mitigation change deployed, fully mitigating the incident. This change allowed the system to recover, and report data started to catch up from this point.

18:55: All reporting data was caught up, and the incident was fully resolved.

Why it happened

The main failure occurred in the backend used to move data for post-call processing and reporting. A recent change to a software library used in the backend introduced a defect in the logic used to re-queue failed data processing jobs. The retry logic worked for the number of failures we typically expect, but it could not slow down or stop retrying failed jobs once they began to accumulate.
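
To illustrate the general idea of retry logic that can slow down and eventually stop, here is a minimal Python sketch of capped exponential backoff with jitter and an attempt limit. It is not Invoca's implementation: the names (Job, retry_delay, on_failure), the "dead_letter" outcome, and every constant are assumptions made for this example.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

# All values below are hypothetical and chosen only for illustration.
BASE_DELAY_S = 1.0    # ceiling for the delay before the first retry
MAX_DELAY_S = 300.0   # ceiling for any single retry delay
MAX_ATTEMPTS = 8      # after this many failures, stop retrying


@dataclass
class Job:
    job_id: str
    attempts: int = 0  # number of failed attempts so far


def retry_delay(attempts: int, rng: Optional[random.Random] = None) -> float:
    """Return a jittered delay whose ceiling doubles per failure, up to a cap."""
    source = rng or random
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** (attempts - 1)))
    # "Full jitter": pick uniformly in [0, ceiling] so that jobs which
    # failed together do not all retry at the same instant.
    return source.uniform(0, ceiling)


def on_failure(job: Job) -> Tuple[str, float]:
    """Decide what to do with a failed job: requeue with a delay, or give up."""
    job.attempts += 1
    if job.attempts >= MAX_ATTEMPTS:
        # Stop retrying; park the job for later inspection instead.
        return ("dead_letter", 0.0)
    return ("requeue", retry_delay(job.attempts))
```

With these assumed settings, each failure pushes the next retry further out on average, and a job that keeps failing is eventually set aside rather than re-queued indefinitely.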

What we did and what we’re doing

During the incident, we made a change to how the retry logic operates. This change was key to resolving this incident and to preventing similar incidents in the future.

Post-incident, we’re planning additional changes:

Improvements in our ability to detect and troubleshoot issues in the data processing pipeline, allowing us to find problems sooner and resolve incidents faster
Additional changes to the retry logic for data processing, removing the potential for a “snowball effect” to overwhelm the system
Speeding up our ability to deploy changes to the Invoca platform during incidents while maintaining our commitments to software lifecycle best practices
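
One common way to guard against a retry “snowball effect” is a circuit breaker that pauses all re-queuing once too many failures occur in a short window. The Python sketch below shows that general technique only. It is not a description of the changes Invoca made or plans to make, and the class name RetryBreaker, its parameters, and its default thresholds are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class RetryBreaker:
    """Pause retries when failures pile up inside a sliding time window."""

    def __init__(
        self,
        max_failures: int = 100,     # hypothetical threshold
        window_s: float = 60.0,      # hypothetical window length
        cooldown_s: float = 120.0,   # hypothetical pause duration
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._open_until = 0.0                  # retries are paused until this time

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            # Too many recent failures: pause retries for the cooldown period.
            self._open_until = now + self.cooldown_s
            self._failures.clear()

    def allow_retry(self) -> bool:
        """Return True if retries are currently permitted."""
        return self._clock() >= self._open_until
```

A worker would call record_failure() whenever a job fails and check allow_retry() before re-queuing. While the breaker is open, failed jobs would be held rather than re-queued, giving the system room to recover.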

Posted Oct 27, 2022 - 06:26 PDT

Resolved

This incident has been resolved.

Posted Oct 19, 2022 - 16:33 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 19, 2022 - 16:12 PDT

Update

The change has been deployed. New data is still delayed and we're continuing to work on the issue.

Posted Oct 19, 2022 - 15:39 PDT

Investigating

The change has been deployed. New data is still delayed and we're continuing to work on the issue.

Posted Oct 19, 2022 - 14:06 PDT

Update

The change has been deployed and we are currently assessing the results. Thank you for your patience.

Posted Oct 19, 2022 - 13:44 PDT

Identified

We have identified the issue and are preparing to deploy a fix. We are targeting deployment in the next 5-10 minutes.

Posted Oct 19, 2022 - 13:17 PDT

Investigating

We are currently investigating an issue with Invoca platform reporting not loading as expected. Call records and associated data are being captured, but are taking longer than normal to become available in Invoca reporting after a call is completed. Updates will be provided as they become available.

Posted Oct 19, 2022 - 13:11 PDT

This incident affected: Platform Accessibility (Platform Accessibility) and Reporting (Reporting).