A few tips for resilient asynchronous workflows

An application that doesn’t need to communicate with the outside world is a rarity. Often a single business process requires several systems to interact. For example, to fulfill an order in e-commerce, one may have to reserve stock in a warehouse, create an invoice in a billing system, and finally send an e-mail notification to the customer. Each of these steps may fail for a number of reasons.

When we model complex workflows and integrate with other systems, an asynchronous approach built on job queues and messaging often seems a reasonable solution. Unfortunately, it’s a somewhat tricky path. I will try to offer a few suggestions that may help avoid at least some of the trouble.

Break down the process

Effectful interactions with external systems are usually irreversible. APIs rarely support transactions, let alone transactions that span several unrelated systems.

If the fulfillment process mentioned above were implemented as a single chunk of code, the trouble would start as soon as something went wrong in the middle of it. Often it isn’t even a matter of an exceptional situation or failure; sometimes it’s just a business constraint, and we don’t know when an operation will be able to succeed (so we have to retry every once in a while). We would end up with a lot of “if” conditions and code branches. Storing and updating the state would be tricky too.

The obvious solution to these issues is to break the whole process down into several smaller, separate steps along natural boundaries. One reason to fail = one action that can be queued, retried without affecting other parts, and easily tracked.
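
As a minimal sketch of that idea, the fulfillment example could be modelled as a list of named steps whose completion is recorded on the order. Every name here (Order, RetryLater, run_pending_steps and the stub steps) is hypothetical and only illustrates the shape; a real system would persist the state and hand the retry over to a job queue.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


class RetryLater(Exception):
    """Raised by a step that cannot succeed yet and should be retried."""


@dataclass
class Order:
    id: int
    done: List[str] = field(default_factory=list)  # names of completed steps


Step = Tuple[str, Callable[[Order], None]]


def run_pending_steps(order: Order, steps: List[Step]) -> Optional[str]:
    """Run unfinished steps in order; return the name of the step that must
    be retried later, or None when the whole process is complete."""
    for name, action in steps:
        if name in order.done:
            continue  # completed in an earlier run, do not repeat it
        try:
            action(order)
        except RetryLater:
            return name  # the caller re-queues the order for a later attempt
        order.done.append(name)
    return None


# Stub steps; the invoice step fails on its first call to simulate a retry.
invoice_calls: List[int] = []


def reserve_stock(order: Order) -> None:
    pass


def create_invoice(order: Order) -> None:
    invoice_calls.append(order.id)
    if len(invoice_calls) == 1:
        raise RetryLater("billing system is busy")


def send_email(order: Order) -> None:
    pass


STEPS: List[Step] = [
    ("reserve_stock", reserve_stock),
    ("create_invoice", create_invoice),
    ("send_email", send_email),
]

order = Order(id=1)
first_run = run_pending_steps(order, STEPS)   # stops at the invoice step
second_run = run_pending_steps(order, STEPS)  # resumes there and finishes
```

Because each step is recorded separately, the second run skips the stock reservation and picks up at the invoice, which is the property the breakdown is meant to buy.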

Traceability

This is one of those things that doesn’t seem difficult or particularly exciting, but it’s an important part of every system: cheap when baked in from the beginning, and troublesome to retrofit.

It’s crucial to be able to say what happened, when, and why. We should be able to verify whether the queue is operating as expected and to discover quickly when it has stopped.

If possible, store incoming and outgoing messages, so that it’s easier later to tell whether a problem was already present in the input data or was caused by our own malfunctioning system.
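
One possible way to keep such a record is sketched below with Python’s built-in sqlite3 module. The table layout and the helper names (open_log, record, history) are assumptions made for illustration, not a prescribed schema.

```python
import json
import sqlite3
import time


def open_log(path=":memory:"):
    """Open (or create) the message log; ':memory:' keeps the demo self-contained."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS message_log "
        "(ts REAL, direction TEXT, job_id TEXT, payload TEXT)"
    )
    return db


def record(db, direction, job_id, payload):
    """Store one incoming ('in') or outgoing ('out') message with a timestamp."""
    db.execute(
        "INSERT INTO message_log VALUES (?, ?, ?, ?)",
        (time.time(), direction, job_id, json.dumps(payload)),
    )
    db.commit()


def history(db, job_id):
    """Return the messages recorded for one job, in insertion order."""
    rows = db.execute(
        "SELECT direction, payload FROM message_log "
        "WHERE job_id = ? ORDER BY rowid",
        (job_id,),
    )
    return [(direction, json.loads(payload)) for direction, payload in rows]


db = open_log()
record(db, "in", "order-1", {"sku": "A1", "qty": 2})
record(db, "out", "order-1", {"invoice": "created"})
trail = history(db, "order-1")
```

With the input stored next to the output, a later investigation can compare what arrived with what was sent, instead of guessing which side introduced the problem.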

Seek idempotence

Simply put, whenever possible, the steps of the pipeline should be implemented so that they can be rerun multiple times without undesirable effects.

This can be achieved in different ways: with additional queries that verify progress (e.g. by checking the status or existence of some document), or with proper handling of exceptions and errors reported by external systems (like “Invoice already exists.”).

It makes the pipeline more resilient and helps it recover automatically from some unexpected failures, without us even knowing about them.
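
A small sketch of the second approach, treating an “already exists” error as success, might look like this. FakeBilling is a hypothetical in-memory stand-in for an external billing system, and ensure_invoice is an illustrative name rather than any real API.

```python
class InvoiceAlreadyExists(Exception):
    """Error the (hypothetical) billing system reports for a duplicate invoice."""


class FakeBilling:
    """In-memory stand-in for an external billing system."""

    def __init__(self):
        self.invoices = {}  # order_id -> amount

    def create_invoice(self, order_id, amount):
        if order_id in self.invoices:
            raise InvoiceAlreadyExists(f"Invoice already exists for order {order_id}.")
        self.invoices[order_id] = amount


def ensure_invoice(billing, order_id, amount):
    """Create the invoice unless it is already there; safe to call repeatedly."""
    try:
        billing.create_invoice(order_id, amount)
        return "created"
    except InvoiceAlreadyExists:
        # An earlier run got this far, so the step counts as done.
        return "already_present"


billing = FakeBilling()
results = [ensure_invoice(billing, 42, 100) for _ in range(3)]
```

Rerunning the step three times still leaves exactly one invoice, so the step can be retried blindly after a crash or a timeout.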

Don’t fail a big thing for a small goal

This is more a rule of thumb than extra code to write. You should be aware of how important each executed task is and what the consequences of its failure are.

Neither sending the new-order notification to the customer twice nor failing to send it at all is as big a problem as charging the customer’s card twice.

Imperfect may be good enough

Spending a week covering an edge case, so that your system can respond automatically to an event that may occur twice during the application’s five-year lifetime and that can easily be dealt with manually, may not be the best allocation of resources.

You can’t predict everything. Systems of this kind tend to require some warm-up time: a period for adjustments and tuning.

Other things to consider

And a few more bits to think about.

Beware of parallelism. Going overboard can easily exhaust resources and thus degrade the performance of other parts of the system.
You may have to use a locking mechanism to avoid doing the same job twice in parallel (worth considering when evaluating different queue engines) or putting the subject of an operation into an inconsistent state.
And when you have locking, you may also want a process that cleans up stale locks automatically. For example, if something started 24 hours ago and hasn’t completed yet, then clearly something went wrong, and there is no need to keep the lock.
Clean up after yourself. Free resources of any type that are no longer used. This is especially important when your program runs continuously (e.g. in an infinite loop). Delete temporary files manually and release large objects. Sometimes mechanisms like garbage collection don’t work as we would expect, or can’t do their job because of our mistakes.
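
The locking and lock-cleaning points above could be sketched roughly as follows. LockTable is a hypothetical, single-process illustration: it is not thread-safe, and a real deployment would more likely rely on a database row or on the queue engine’s own locking. The 24-hour limit is taken from the example above, and the clock is injectable only so that the demo can simulate the passage of time.

```python
import time


class LockTable:
    """Per-subject locks that expire once they are older than max_age_seconds."""

    def __init__(self, max_age_seconds=24 * 60 * 60, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock
        self._locks = {}  # subject -> timestamp of acquisition

    def acquire(self, subject):
        now = self.clock()
        taken_at = self._locks.get(subject)
        if taken_at is not None and now - taken_at < self.max_age:
            return False  # a fresh lock exists, so the job is presumably running
        self._locks[subject] = now  # free, or stale and therefore reclaimed
        return True

    def release(self, subject):
        self._locks.pop(subject, None)


# Demo with a fake clock so that no real waiting is needed.
fake_now = [0.0]
locks = LockTable(clock=lambda: fake_now[0])

first = locks.acquire("order-1")   # lock is free
second = locks.acquire("order-1")  # same job again, refused
fake_now[0] += 25 * 60 * 60        # 25 hours pass without a release
third = locks.acquire("order-1")   # stale lock is reclaimed
```

Expiring the lock on the next acquire attempt avoids a separate cleaning job, at the cost of leaving stale entries in place until somebody asks for them.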