Webhooks provides two retry mechanisms: one based on the number of retries and one based on the time to live of a failed event. With the first mechanism, one can specify how many times a failed event will be resent. The second allows one to define how long a failed event will be kept in the retry mechanism. Both retry mechanisms guarantee the order of events within a conversation and are designed to avoid event loss and duplicate delivery, subject to the limitations listed further below.
Retry based on numbers
Retry based on numbers configuration:
- Retry policy based on attempts: maximum number of retry attempts [0..5]
- Backoff policy: fixed amount of time between retries [1 second]
For each event type in the app install, one can configure up to 5 retry attempts. Between attempts, Webhooks waits for 1 second, so the retries add at most 5 seconds of waiting time (5 attempts with 1 second each). For example, consider the following configuration:
{
  "client_name": "Example retry configuration",
  "capabilities": {
    "webhooks": {
      "ms.MessagingEventNotification.ContentEvent": {
        "endpoint": "https://www.application.endpoint.com/",
        "max_retries": 3
      },
      "ms.MessagingEventNotification.RichContentEvent": {
        "endpoint": "https://www.application.endpoint.com/"
      }
    }
  }
}
If an event of type ms.MessagingEventNotification.ContentEvent fails, it will be resent up to 3 times until it is either dropped or properly delivered. On the other hand, an event of type ms.MessagingEventNotification.RichContentEvent will be dropped immediately if the first attempt was unsuccessful because no retry is configured.
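The behaviour described above can be sketched as a small loop: one initial delivery attempt, followed by up to max_retries further attempts with a fixed 1 second pause. This is only an illustration of the policy as documented here, not the Webhooks implementation; the function name, the send callback and the injectable sleep parameter are assumptions made for the example.

```python
import time
from typing import Callable


def deliver_with_fixed_retries(
    send: Callable[[], bool],
    max_retries: int,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try once, then retry up to max_retries times with a fixed delay.

    `send` is a hypothetical callback that returns True on successful delivery.
    Returns True if the event was delivered, False if it was dropped.
    """
    if not 0 <= max_retries <= 5:
        raise ValueError("max_retries must be in the range 0..5")
    if send():  # first delivery attempt
        return True
    for _ in range(max_retries):
        sleep(delay_seconds)  # fixed backoff between attempts
        if send():
            return True
    return False  # retry policy exhausted: the event is dropped
```

With max_retries set to 3, an endpoint that keeps failing sees four requests in total (the original attempt plus three retries) before the event is dropped. With max_retries set to 0, it sees exactly one.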
Retry based on time to live
Retry based on time to live configuration:
- Retry policy based on timeout: maximum time, in seconds, an event stays in the retry mechanism [2..259200]
- Backoff policy: exponential backoff retry with:
  - initial backoff delay: delay before the first retry [1 second]
  - retry factor: factor by which the delay between two attempts increases exponentially [2]
  - max backoff delay: the maximum delay between two consecutive retry attempts [30 seconds]
  - initial max jitter: initial jitter, calculated as a random number between 0 and the given value as upper bound [0]
Retry based on time to live was introduced as an alternative to retry based on numbers. The main driver for this mechanism is significantly longer recoverability. Having a recovery time of up to 3 days enables one to fix the root cause of a failing endpoint without fearing data loss. However, it is important to choose a configuration that fits the traffic flow; otherwise, a broken endpoint can cause events to become outdated while they are held inside Webhooks. The configuration in the app install looks as follows:
{
  "client_name": "Example retry configuration",
  "capabilities": {
    "webhooks": {
      "retry": {
        "retention_time": 86400
      },
      "ms.MessagingEventNotification.ContentEvent": {
        "endpoint": "https://www.application.endpoint.com/"
      },
      "ms.MessagingEventNotification.RichContentEvent": {
        "endpoint": "https://www.application.endpoint.com/"
      }
    }
  }
}
If any event of type ms.MessagingEventNotification.ContentEvent or ms.MessagingEventNotification.RichContentEvent cannot be sent, all events belonging to the same conversation will be resent multiple times within the next 24 hours (86400 seconds) until they are either delivered or dropped.
The time between retries increases exponentially. The first retry attempt is made after 1 second, the second after 2 seconds, the third after 4 seconds, the fourth after 8 seconds, and so on, until the delay reaches the cap of 30 seconds. The maximum gap between two consecutive retries is therefore 30 seconds (max backoff delay). The retention time is configured in seconds, with a minimum of 2 seconds and a maximum of 3 days (259200 seconds). When an endpoint recovers from a failure, the events held in the retry mechanism are sent in order. If an endpoint recovers mid-conversation, all events of the conversation are sent in the right order, that is, in the order they were received from Conversational Cloud.
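The schedule above can be reproduced with a few lines of Python. This is a minimal sketch using the documented defaults (initial delay 1 second, factor 2, cap 30 seconds, jitter 0). The function names are hypothetical, and the estimate ignores the time each request itself takes and assumes a retry scheduled exactly at the retention limit still counts, neither of which the text above specifies.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(
    initial: float = 1.0, factor: float = 2.0, max_delay: float = 30.0
) -> Iterator[float]:
    """Yield the delay in seconds before each successive retry, capped at max_delay."""
    delay = initial
    while True:
        yield min(delay, max_delay)
        delay = min(delay * factor, max_delay)  # grow exponentially up to the cap


def retries_within(retention_time: float) -> int:
    """Estimate how many retries are scheduled within retention_time seconds."""
    elapsed = 0.0
    count = 0
    for delay in backoff_delays():
        elapsed += delay
        if elapsed > retention_time:
            return count
        count += 1


# First seven delays: 1, 2, 4, 8, 16, then capped at 30 seconds.
first_delays = list(islice(backoff_delays(), 7))
```

Under these assumptions, a retention_time of 86400 seconds leaves room for 2883 retries: five during the first 31 seconds, then one every 30 seconds.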
Common considerations
Limitations
- Applications should consider that data (event) loss is possible. For example, once the retry policy for a failed event is exhausted (e.g. all the retry attempts fail), the event will be dropped.
- Applications should consider that, as long as the retry policy is not exhausted, the same event can be received multiple times on the application side. For example, if an event was received at the endpoint but the response takes more than 5 seconds, Webhooks will consider that event as failed and will apply the retry policy, resulting in the same event being sent more than once.
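Because the same event can arrive more than once, an endpoint may want to process events idempotently. The sketch below shows one common way to do that on the application side: remember recently seen event identifiers and skip repeats. The field name event_id is hypothetical (this page does not describe the event payload), as are the class name and the in-memory, bounded store; a production endpoint would more likely use a shared store.

```python
from collections import OrderedDict
from typing import Callable


class DeduplicatingReceiver:
    """Skip events whose identifier has already been processed."""

    def __init__(self, process: Callable[[dict], None], max_remembered: int = 10_000):
        self._process = process
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_remembered = max_remembered

    def handle(self, event: dict) -> bool:
        """Process the event unless it is a repeat; return True if it was processed."""
        event_id = event["event_id"]  # hypothetical unique identifier field
        if event_id in self._seen:
            return False  # duplicate delivery caused by a retry: ignore it
        self._process(event)
        self._seen[event_id] = None  # remember only after successful processing
        if len(self._seen) > self._max_remembered:
            self._seen.popitem(last=False)  # forget the oldest identifier
        return True
```

Responding quickly, before any slow processing starts, also lowers the chance of hitting the 5 second limit mentioned above in the first place.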
Benefits
- Applications can rely on event order being guaranteed, and Webhooks strives to provide "exactly once" delivery semantics: messages are processed by the LP Webhooks service once and only once, ensuring no data loss and no duplicates, even during failures.
- Internally, events are spread across a configured number of queues per account and app install. If the endpoint returns an error for a delivered event, only the queue holding that event triggers the retry, while the other queues continue to deliver events to the endpoint seamlessly, which keeps throughput and efficiency high.
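To illustrate the queueing idea, the sketch below assigns each conversation to one of a fixed number of queues. How Webhooks actually maps events to queues is not described here; hashing the conversation ID is an assumption made for this example, chosen because it keeps all events of one conversation in the same queue and therefore in order. The function name and parameters are hypothetical.

```python
import zlib


def queue_index(conversation_id: str, queue_count: int) -> int:
    """Map a conversation to one of queue_count queues, stably across runs."""
    if queue_count < 1:
        raise ValueError("queue_count must be at least 1")
    # crc32 is deterministic, unlike Python's built-in hash() for strings.
    return zlib.crc32(conversation_id.encode("utf-8")) % queue_count


# Group some example events by queue; events of one conversation stay together.
events = [("conv-1", "msg A"), ("conv-2", "msg B"), ("conv-1", "msg C")]
queues: dict = {}
for conversation_id, payload in events:
    queues.setdefault(queue_index(conversation_id, 4), []).append(payload)
```

In such a layout, a failing delivery only holds back the conversations that share its queue, while the remaining queues keep delivering.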
Numbers vs. time to live
Retry based on time to live takes precedence over retry based on numbers. When the retention_time is set, the retry based on time to live will be triggered regardless of the configuration for the retry based on numbers. For example, consider the following configuration:
{
  "client_name": "Example retry configuration",
  "capabilities": {
    "webhooks": {
      "retry": {
        "retention_time": 86400
      },
      "ms.MessagingEventNotification.ContentEvent": {
        "endpoint": "https://www.application.endpoint.com/",
        "max_retries": 1
      },
      "ms.MessagingEventNotification.RichContentEvent": {
        "endpoint": "https://www.application.endpoint.com/"
      }
    }
  }
}
According to the retry mechanism based on numbers, events of type ms.MessagingEventNotification.ContentEvent should be retried once. However, as the retention_time is set, the retry mechanism based on time to live will be applied.
Prevent failing endpoints from triggering the time to live mechanism
While using the retry mechanism based on time to live, one can prevent the retry mechanism from being triggered when an endpoint fails. Consider the following configuration:
{
  "client_name": "Example retry configuration",
  "capabilities": {
    "webhooks": {
      "retry": {
        "retention_time": 86400
      },
      "ms.MessagingEventNotification.ContentEvent": {
        "endpoint": "https://www.application.endpoint.com/",
        "max_retries": 0
      },
      "ms.MessagingEventNotification.RichContentEvent": {
        "endpoint": "https://www.application.endpoint.com/"
      }
    }
  }
}
Having a retention_time results in the application of the retry mechanism based on time to live. However, as the max_retries for events of type ms.MessagingEventNotification.ContentEvent is set to 0, failing events of this type will not cause the retry mechanism to be triggered and will be dropped instead. If the corresponding app install is already failing, however, these events will be put in the retry mechanism and will be resent when the endpoint recovers.
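The precedence rules from the last two sections can be summarised in a small helper that inspects the webhooks part of an app install. This is a sketch of the rules as documented on this page, not an official API: the function name and the returned labels are hypothetical, and the helper only classifies the configuration without performing any retries.

```python
from typing import Any, Mapping


def resolve_retry_mode(webhooks: Mapping[str, Any], event_type: str) -> str:
    """Classify which retry behaviour applies to event_type.

    Returns one of these hypothetical labels:
      "time_to_live"            - retention_time is set; failures trigger the TTL retry
      "time_to_live_no_trigger" - retention_time is set but max_retries is 0, so a
                                  failure of this type does not trigger the retry
      "attempts"                - no retention_time; retried up to max_retries times
      "none"                    - no retry configured; a failed event is dropped
    """
    retention_time = webhooks.get("retry", {}).get("retention_time")
    max_retries = webhooks[event_type].get("max_retries")
    if retention_time is not None:
        # Time to live takes precedence over the number-based policy.
        if max_retries == 0:
            return "time_to_live_no_trigger"
        return "time_to_live"
    if max_retries:
        return "attempts"
    return "none"
```

For the configuration above, ms.MessagingEventNotification.ContentEvent resolves to "time_to_live_no_trigger" and ms.MessagingEventNotification.RichContentEvent resolves to "time_to_live".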