Serverless error handling

I have been working with serverless for a while now: there is no permanent server or VM to manage, and all resiliency is handled by the cloud provider. Sounds nice? Mostly, yes. We typically use serverless in event sourcing, where the pipelines are chained asynchronously. Even when following best practices, though, it has sometimes not been the best solution, because you end up with a lot of components in your architecture.

Let’s dive in:

In our first example there are three components: a load balancer, a .NET Core API and a SQL database.

1- The requests are balanced

2- Forwarded to your API

3- Registered in the database

Everything OK? Sure. In the best case, the business case is fully served and we are done.

If any unexpected failure happens in the process, you will lose your entity. How much that costs depends on your business case, but losing an entity is never desired and is frustrating in most scenarios.

We can keep our case synchronous, so that the requester waits for the response and accepts the processing time. That is possible, but not in every case. (In my opinion it is not a good solution, nor even an average one; it is the worst.)

Of course, you can store each request in temporary storage and delete it once it has been processed successfully. But saving to a datastore and deleting at the end just adds latency and some unnecessary dependencies to your architecture.


The following architecture is composed of a simple gateway, a Lambda function and a DynamoDB table.

1- The request comes in

2- Forwarded to Lambda

3- Saved to a DynamoDB table

In this case there is no VM or instance, so we can forget about local storage right away, and the other workarounds are the same as in the first scenario.

We can take advantage of a message bus: poll the messages when we are ready and make the system asynchronous. But what if a failure interrupts the processing? We leave the message in the message bus to be polled again. But how many times? If there is a domain problem with the entity, it will never pass validation. The answer is a dead-letter queue (DLQ), where messages end up after a predefined number of retries. That raises the next question: what do we do with a message in the DLQ? Leaving it there is the same as losing it. We need actions: solving the problem, alerting the requester, or … .
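To make the retry-then-DLQ idea concrete, here is a minimal in-memory sketch. It is not tied to any provider: the `Message` class, the `process` rule (an amount must be positive) and the limit of 3 deliveries are all assumptions made up for illustration, and two `deque` objects stand in for the message bus and the DLQ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: dict
    receive_count: int = 0  # how many times the message has been delivered


def process(message: Message) -> None:
    # Hypothetical domain rule: the amount must be positive.
    if message.body.get("amount", 0) <= 0:
        raise ValueError("amount must be positive")


def drain(queue: deque, dlq: deque, max_receives: int = 3) -> list:
    """Poll until the queue is empty; park repeatedly failing messages in the DLQ."""
    done = []
    while queue:
        message = queue.popleft()
        message.receive_count += 1
        try:
            process(message)
            done.append(message)
        except Exception:
            if message.receive_count >= max_receives:
                dlq.append(message)    # give up: move to the dead-letter queue
            else:
                queue.append(message)  # leave it on the bus to be polled again
    return done


queue = deque([Message({"amount": 10}), Message({"amount": -1})])
dlq = deque()
done = drain(queue, dlq)
```

The invalid message is delivered three times and then lands in `dlq`, which is exactly where the open question starts: something still has to act on it.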

So What To Do?

In the first scenario: protecting our solution with an asynchronous design is a jump in quality and a step toward modern design, but it brings you closer to the second scenario, which also has plenty of problems.

The following image makes the design asynchronous, but it still relies on VMs, servers, etc.

And if I change it as below:

We are a bit more automated, since the Lambda is triggered automatically when an item arrives in DynamoDB, so we don’t pay for an unnecessary VM for batch processing. We also make several attempts to process the message, and if it keeps failing, the queue keeps that message.
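As a rough sketch of what such a stream-triggered function could look like, the handler below walks a DynamoDB Streams batch and reports the first record it cannot process. It assumes partial batch responses are enabled on the trigger; the number of retries and the queue that receives the failed record are configured on the trigger itself, not in this code. `process_item` and its rule are hypothetical placeholders.

```python
def process_item(new_image: dict) -> None:
    # Hypothetical domain step: an item without an "id" attribute is rejected.
    if "id" not in new_image:
        raise ValueError("item has no id")


def handler(event, context):
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # only react to newly arrived items
        sequence_number = record["dynamodb"]["SequenceNumber"]
        try:
            process_item(record["dynamodb"].get("NewImage", {}))
        except Exception:
            # Report the first failing record so the batch is retried from here.
            return {"batchItemFailures": [{"itemIdentifier": sequence_number}]}
    return {"batchItemFailures": []}
```

Returning at the first failure keeps the records in order, which matters for a stream; the trade-off is that one bad record blocks the ones behind it until the retries are exhausted.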

The failure cases could be:

  • Service unavailability
  • Message does not conform to the domain requirements

As you can see, we are getting closer to the second scenario, but we still need a major improvement to our design to be able to act on failures. So far we have avoided losing the messages by keeping them somewhere reliable.

The first solution:

Suppose there is another Lambda function whose only job is to check the status of failed messages and decide whether to retry them. If a service is unavailable, there is no point in retrying right away, so when such a message arrives as failed we keep it delayed by adding a queue in the middle, as below:

Now the message is analyzed and delayed in another queue, for 10 minutes for example, to give the service time to become available.

If the failure has a domain cause and the message is unacceptable, we archive the message in storage and trigger the alerts.
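The decision made by that analyzer function can be sketched in a few lines. This is only an illustration under assumptions: the `reason` values, the `Outcome` container and the 600-second default (the 10-minute example) are invented here, and plain lists stand in for the delay queue, the archive storage and the alerting channel.

```python
from dataclasses import dataclass, field


@dataclass
class FailedMessage:
    body: dict
    reason: str  # hypothetical values: "service_unavailable" or "domain_invalid"


@dataclass
class Outcome:
    delayed: list = field(default_factory=list)   # (message, delay_seconds) pairs
    archived: list = field(default_factory=list)  # messages that will not be retried
    alerts: list = field(default_factory=list)    # alert texts to send


def analyze(messages, delay_seconds: int = 600) -> Outcome:
    outcome = Outcome()
    for message in messages:
        if message.reason == "service_unavailable":
            # The message itself is fine: park it until the service is back.
            outcome.delayed.append((message, delay_seconds))
        else:
            # Anything else is treated as not retryable: archive and alert.
            outcome.archived.append(message)
            outcome.alerts.append(f"rejected message: {message.body}")
    return outcome


result = analyze([
    FailedMessage({"id": 1}, "service_unavailable"),
    FailedMessage({"id": 2}, "domain_invalid"),
])
```

Keeping the classification in one small function is what separates failure analysis from the domain code; the domain function only has to say why it failed.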


Note: To my knowledge, in both Azure and AWS, messages that are retried by the message bus carry an attribute showing how many deliveries have already been attempted, so you can use this attribute to avoid exceeding the maximum threshold.

The problem: if there are lots of functions, you need to add a huge number of queues, which adds complexity to your solution, whereas what we need is a reliable solution that is easy to maintain in the future.

Our design gets better by separating the failure analysis from the domain code, but it is still attached to our design, and this solution is not fully cloud native.
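For example, when an SQS queue triggers a Lambda function, each record carries the counter as the string attribute `ApproximateReceiveCount` (Azure Service Bus exposes a similar `DeliveryCount` on the message). A small helper, sketched here with a made-up threshold of 5, can tell the function that it is looking at the last attempt:

```python
MAX_RECEIVES = 5  # hypothetical threshold; keep it aligned with the queue's own setting


def receive_count(record: dict) -> int:
    # SQS delivers the counter as a string inside the record's "attributes".
    return int(record["attributes"]["ApproximateReceiveCount"])


def is_last_attempt(record: dict, max_receives: int = MAX_RECEIVES) -> bool:
    return receive_count(record) >= max_receives


record = {"messageId": "demo", "attributes": {"ApproximateReceiveCount": "5"}}
last = is_last_attempt(record)
```

On the last attempt the function can archive the message and raise the alert itself, instead of letting the message bus move it silently.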

The Second Solution:

You can trigger snippets of code based on filtered events and keep monitoring and failure handling totally separate from your design.

In this design we trigger a function whenever a failure arrives, and we leave the message in the DLQ with a delay on its availability, so your back end has time to recover and you can then rerun the message.

Note: we focused on the error-handling part of distributed, asynchronous design. We saw the need for retrying and re-evaluating messages, but this is not enough yet. We will talk about other methods in the next articles. Have fun, and viva serverless!

I'm a technical lead and solution/software architect with more than 20 years of experience in the IT industry. I'm a fan of cloud and serverless in practice.
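The filtering idea can be illustrated without any cloud service. In the sketch below, an event is a plain dictionary and a pattern lists the allowed values per field; the field names, the `FAILURE_PATTERN` and the 600-second delay are assumptions chosen for the example, not the format of a specific provider.

```python
def matches(event: dict, pattern: dict) -> bool:
    """True when every field in the pattern holds one of its allowed values."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


# Hypothetical filter: only failed events coming from the "orders" pipeline.
FAILURE_PATTERN = {"status": ["FAILED"], "source": ["orders"]}


def on_failure(event: dict) -> dict:
    # Decide how long the message stays hidden in the DLQ before the rerun.
    return {"message_id": event["message_id"], "delay_seconds": 600}


def dispatch(events, pattern, handler) -> list:
    return [handler(event) for event in events if matches(event, pattern)]


events = [
    {"message_id": "m1", "status": "FAILED", "source": "orders"},
    {"message_id": "m2", "status": "OK", "source": "orders"},
    {"message_id": "m3", "status": "FAILED", "source": "billing"},
]
actions = dispatch(events, FAILURE_PATTERN, on_failure)
```

Only `m1` passes the filter, so the failure handler runs once. The domain functions never see this code, which is the point of keeping failure handling outside the design.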