Handling errors in Step Functions workflows

To Nha Notes | Nov. 18, 2024, 11:32 a.m.

Types of Errors in AWS Step Functions
  1. States.ALL: Wildcard that matches any error name; use it as a catch-all for errors not explicitly caught by other patterns.

  2. States.Timeout: Raised when a Task state runs longer than its TimeoutSeconds value or fails to send a heartbeat within its HeartbeatSeconds value.

  3. States.TaskFailed: Raised when a task state fails.

  4. States.Permissions: Raised when a Task state lacks sufficient privileges (IAM permissions) to run the specified code.

  5. States.ResultPathMatchFailure: Raised when a state's ResultPath field cannot be applied to the input the state received.

  6. States.BranchFailed: Raised when a branch of a Parallel state fails.

  7. States.NoChoiceMatched: Raised when the input to a Choice state matches none of its rules and no Default transition is specified.

  8. States.ParameterPathFailure: Raised when a path in a state's Parameters field (a field whose name ends in .$) cannot be resolved against the input.

Error Handling Strategies
  • Retry: Automatically retry a failed state.

  • Catch: Capture errors and redirect execution to a recovery path.

  • Timeout: Specify the maximum time a state may run (TimeoutSeconds).
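
As a rough illustration of how the three strategies can sit together on one Task state, here is a minimal sketch. The state names (CallService, HandleTimeout, HandleFailure), the Lambda ARN, and all numeric values are hypothetical placeholders and not taken from a real workflow; the Comment fields mark them as such.

```json
{
  "Comment": "Hypothetical sketch: state names, ARN and values are placeholders",
  "StartAt": "CallService",
  "States": {
    "CallService": {
      "Type": "Task",
      "Comment": "Timeout: fail the attempt if it runs longer than 30 seconds",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CallService",
      "TimeoutSeconds": 30,
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.Timeout"],
          "Next": "HandleTimeout"
        },
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "HandleFailure"
        }
      ],
      "End": true
    },
    "HandleTimeout": {
      "Type": "Fail",
      "Error": "CallServiceTimedOut",
      "Cause": "The task exceeded TimeoutSeconds"
    },
    "HandleFailure": {
      "Type": "Pass",
      "Comment": "Recovery path; the error details are available under $.error",
      "End": true
    }
  }
}
```

In this sketch the States.ALL catcher is listed last and alone in its ErrorEquals array, so the more specific States.Timeout catcher is matched first.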

Retry Definition Block

A Retry block can be added to each step of the state machine. It contains a list of retriers; each retrier names the error conditions that should trigger a retry of the step, along with the parameters that control the retry. The parameters in the Retry block are:

  1. ErrorEquals – a list of error names that should trigger a retry. The simplest choice is the blanket condition States.ALL, which matches every error that could occur while the step runs. Listing specific error names instead makes the handling more granular, as in the retry example further below.
  2. IntervalSeconds – the number of seconds to wait before the first retry attempt. For example, with IntervalSeconds set to 3, the state machine waits 3 seconds after the step fails before retrying it for the first time.
  3. MaxAttempts – the maximum number of retries. With a value of 2, the state machine retries the step up to 2 times, so the step runs at most 3 times in total and the retrier gives up after the third failure.
  4. BackoffRate – the multiplier by which the retry interval (IntervalSeconds) grows after each retry attempt. For example, with IntervalSeconds set to 3 and a BackoffRate of 1.5, the first retry waits 3 seconds and the second retry waits 4.5 seconds.
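
Putting the four parameters together, a Retry block with the values discussed above might look like the following sketch. The state name ProcessOrder and the Lambda ARN are hypothetical placeholders; only the Retry values (States.ALL, 3, 2, and 1.5) come from the description.

```json
{
  "Comment": "Hypothetical sketch: state name and ARN are placeholders",
  "StartAt": "ProcessOrder",
  "States": {
    "ProcessOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessOrder",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "End": true
    }
  }
}
```

With this definition the waits are 3 seconds before the first retry and 3 × 1.5 = 4.5 seconds before the second. Because the sketch has no Catch field, a third failure fails the execution.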

Examples of retrying after an error

"X": {
   "Type": "Task",
   "Resource": "arn:aws:states:us-east-1:123456789012:task:X",
   "Next": "Y",
   "Retry": [ {
      "ErrorEquals": [ "ErrorA", "ErrorB" ],
      "IntervalSeconds": 1,
      "BackoffRate": 2.0,
      "MaxAttempts": 2
   }, {
      "ErrorEquals": [ "ErrorC" ],
      "IntervalSeconds": 5
   } ],
   "Catch": [ {
      "ErrorEquals": [ "States.ALL" ],
      "Next": "Z"
   } ]
}

This task fails four times in succession, outputting these error names: ErrorA, ErrorB, ErrorC, and ErrorB. The following occurs as a result:

  • The first two errors match the first retrier and cause waits of one and two seconds.

  • The third error matches the second retrier and causes a wait of five seconds.

  • The fourth error also matches the first retrier. However, that retrier has already used its maximum of two retries (MaxAttempts), counted across ErrorA and ErrorB together. The retrier therefore gives up, and the Catch field redirects the workflow to the Z state.

References

https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html

https://aws.amazon.com/blogs/developer/handling-errors-retries-and-adding-alerting-to-step-function-state-machine-executions/

https://community.aws/content/2k58xHJ6Maw6BpXmI4pwwliaHk5/mastering-aws-step-functions-error-handling?lang=en