Issue routing phone calls
Incident Report for Invoca
Postmortem

Summary

On July 23rd, an error caused the telecom portion of the Invoca platform to stop connecting most calls from 10:35am to 11:51am PDT.

As part of a routine software deploy, a portion of our telecom servers was taken out of service to be updated. Per our standard automated deploy process, call traffic shifted from the out-of-service servers to the remaining in-service servers, bringing them closer to their maximum call capacity. At that time, a software bug unrelated to the deploy created an error with our telecom carriers that further reduced our servers’ call capacity and prevented them from connecting most incoming calls.

We apologize for the impact to you, our customers. We are working diligently to ensure this doesn’t happen again. Below is more information about exactly what happened and what we are doing to prevent future incidents.

Why did it happen?

A combination of events resulted in an excessive load on our servers that impacted our ability to connect calls.

  • A number of recently added features had reduced the call capacity of our telecom servers since we last measured it. As a result, deploying our software, which takes a portion of our servers out of service, left us closer to our maximum call capacity than we had intended.
  • Without our knowledge, our telecom carriers had changed configurations during equipment migrations, which caused the carriers to retry unanswered calls many times more than expected.
  • The error handling behavior of an Invoca call routing service was flawed. The above call data errors caused it to send messages that further increased the load on the telecom servers.
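
To illustrate how these factors compound, here is a minimal sketch of the arithmetic. Every number and function name in it is hypothetical and chosen only for illustration; none of them are Invoca's actual figures or code. It assumes a simple model in which each in-service server handles a fixed number of calls and carrier retries multiply the number of call attempts.

```python
# Illustrative model only: all values below are hypothetical, not Invoca's real numbers.

def in_service_capacity(total_servers: int, servers_in_deploy: int,
                        calls_per_server: int) -> int:
    """Calls the fleet can handle while some servers are out for a deploy."""
    return (total_servers - servers_in_deploy) * calls_per_server


def offered_load(new_calls: int, retry_multiplier: int) -> int:
    """Call attempts reaching the servers once carrier retries are counted."""
    return new_calls * retry_multiplier


def is_overloaded(load: int, capacity: int) -> bool:
    return load > capacity


# Capacity as last measured versus after new features reduced it (assumed values).
assumed = in_service_capacity(total_servers=10, servers_in_deploy=3, calls_per_server=100)
actual = in_service_capacity(total_servers=10, servers_in_deploy=3, calls_per_server=80)

normal = offered_load(new_calls=500, retry_multiplier=1)
with_retries = offered_load(new_calls=500, retry_multiplier=3)

print(assumed, actual)                      # believed headroom vs. real headroom
print(is_overloaded(normal, actual))        # deploy alone still fits
print(is_overloaded(with_retries, actual))  # retries push load past capacity
```

In this toy model the deploy alone leaves the fleet within capacity, but with less headroom than believed; the extra carrier retries are what tip it into overload.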

What are we doing to ensure this does not happen again?

During the past several days we have load tested our telecom servers to determine their current maximum call capacity and added telecom servers to ensure we can handle peak loads during deploys and failure scenarios.
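
As a rough sketch of the sizing reasoning behind this step, the calculation below provisions enough servers to carry peak traffic even while a deploy batch is out of service and some servers have failed. The function name and all inputs are hypothetical assumptions for illustration, not our actual capacity figures or tooling.

```python
import math

# Hypothetical sizing helper: names and numbers are illustrative assumptions.

def servers_needed(peak_calls: int, calls_per_server: int,
                   deploy_batch: int, tolerated_failures: int) -> int:
    """Servers required so peak load still fits during a deploy plus failures."""
    serving = math.ceil(peak_calls / calls_per_server)  # servers that must stay in service
    return serving + deploy_batch + tolerated_failures


print(servers_needed(peak_calls=1000, calls_per_server=80,
                     deploy_batch=3, tolerated_failures=1))
```

The per-server capacity is the input that load testing keeps current; if it drifts downward unnoticed, the same fleet silently loses headroom.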

We are also in the process of:

  • Implementing a global “bypass” in our dynamic number insertion system to prevent phone numbers on our customers’ websites from being replaced if we experience widespread telecom problems. This will allow our customers’ website visitors to see and call the original number on the page, bypassing Invoca entirely.
  • Increasing the frequency and scope of load testing to proactively detect changes to our telecom servers’ capacity
  • Adding rate-limiting features to prevent our telecom servers from becoming overloaded by excessive traffic from our carriers
  • Developing procedures to test carrier retry logic, and running those tests on a regular basis
  • Changing the error handling behavior of the call routing service
  • Reviewing and streamlining our internal escalation and communication procedures
  • Reviewing our telecom servers and their interaction with other services for potential weaknesses
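
For readers curious about what rate limiting of carrier traffic can look like, below is a minimal token-bucket sketch. It is one common technique, shown under stated assumptions; it is not a description of the specific mechanism Invoca is building. The class name, parameters, and values are hypothetical.

```python
# Token-bucket rate limiter sketch. Hypothetical names and values throughout;
# this is a generic technique, not Invoca's implementation.

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float, start: float = 0.0):
        self.rate_per_sec = rate_per_sec  # tokens added back per second
        self.burst = burst                # maximum tokens the bucket can hold
        self.tokens = burst               # start full so an initial burst is allowed
        self.last_seen = start            # timestamp of the previous decision

    def allow(self, now: float) -> bool:
        """Return True if a call attempt arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: reject or defer this attempt


bucket = TokenBucket(rate_per_sec=2.0, burst=2.0)
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 0.5, 0.5)]
print(decisions)
```

Passing the timestamp in explicitly keeps the sketch deterministic; a production limiter would more likely read a monotonic clock and keep one bucket per carrier or per trunk.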

We know that customer calls are a critical part of your business, and our commitment to service reliability is one of the reasons you choose to work with Invoca. We’re proud of the last 36 months of telecom availability (100% before this event), and you can be assured that we will do everything in our power to prevent incidents like this in the future.

Posted Aug 02, 2019 - 09:02 PDT

Resolved
Update July 26, 4:22pm:
All systems continue to operate normally. We’ve identified underlying issues, implemented corrective measures to prevent a recurrence, and are working through the remainder of the root cause analysis process. We will post a postmortem to this page by August 2.

Update July 23:
The operations team is conducting an ongoing investigation to identify the root cause and take the appropriate corrective actions. We will be sharing more information here, on our status page, as it becomes available. In the meantime, you can contact our Customer Success team if you have questions.
Posted Jul 23, 2019 - 23:01 PDT
Monitoring
The Operations team has addressed the issue and calls are routing as expected. The team will continue to monitor and updates will be provided as they become available.
Posted Jul 23, 2019 - 11:59 PDT
Update
The Operations team is still actively addressing the issue and updates will be provided as they become available.
Posted Jul 23, 2019 - 11:33 PDT
Identified
The Invoca Operations team identified an issue where calls to some promo numbers are not routing as expected. The Operations team is actively addressing the issue and updates will be provided as they become available.
Posted Jul 23, 2019 - 11:06 PDT
This incident affected: Calls.