Retry with Resilience4j
Author Ter-Petrosyan Hakob
In this article, we first give a short introduction to Resilience4j. Then we look closely at its Retry module. You will learn when to use retries, how to set them up, and which features Resilience4j offers.
Introduction to Resilience4j
When applications talk to each other over a network, many things can go wrong. A request might time out, connections can break, or an upstream service may be down. Sometimes, a sudden load of requests can slow or crash a service.
Resilience4j is a Java library that helps you make your applications more reliable. It gives you tools to handle failures and keep your system running smoothly.
Let’s have a quick look at the modules and their descriptions:
Resilience4j Modules and Their Descriptions
| Module | Description | Maven artifactId |
|---|---|---|
| Retry | Automatically try a failed call again | resilience4j-retry |
| RateLimiter | Limit how often a call can be made in a given time | resilience4j-ratelimiter |
| TimeLimiter | Set a maximum time for a call to finish | resilience4j-timelimiter |
| Circuit Breaker | Stop calling or use a fallback when failures keep happening | resilience4j-circuitbreaker |
| Bulkhead | Limit how many calls can run at the same time | resilience4j-bulkhead |
| Cache | Save and reuse results of expensive calls | resilience4j-cache |
Let’s set up a Maven project. Please note that I’ve added the resilience4j-all dependency,
but if you prefer, you can include only the modules you need.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.hakobtp.blog</groupId>
<artifactId>blog</artifactId>
<version>0.0.1-SNAPSHOT</version>
<properties>
<java.version>17</java.version>
<lombok.version>1.18.38</lombok.version>
<resilience4j.version>2.3.0</resilience4j.version>
<spring-cloud.version>2024.0.1</spring-cloud.version>
</properties>
<dependencies>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>${lombok.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-all</artifactId>
<version>${resilience4j.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.11.0</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
<annotationProcessorPaths>
<path>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>${lombok.version}</version>
</path>
</annotationProcessorPaths>
</configuration>
</plugin>
</plugins>
</build>
</project>
What Is a Remote Operation?
A remote operation is any request your application makes over a network. Common examples include:
- Sending an HTTP request to a REST API
- Calling a remote procedure (RPC) or web service
- Reading or writing data in a database or object store
- Sending or receiving messages from a broker (RabbitMQ, Kafka, etc.)
What Happens When It Fails?
If a remote operation fails, you have two choices:
- Fail fast – immediately return an error to the client.
- Retry – try the operation again.
Retrying can hide temporary issues so that clients don’t notice a hiccup.
Which Option to Choose?
Deciding between failing fast and retrying depends on:
- Error type
- Transient errors are short‑lived. A retry often succeeds.
- Example: request throttled, network timeout, temporary service outage
- Permanent errors cannot be fixed by retrying.
- Example: hardware failure, HTTP 404 Not Found
- Operation type
- Idempotent operations can run multiple times safely (e.g., reading data).
- Non‑idempotent operations may cause unwanted side effects if repeated (e.g., money transfers).
- Client type
- Applications (cron jobs, background processes) can wait longer.
- People usually expect a quick response.
- Use case
- Some tasks need high reliability more than speed (e.g., booking flights, transferring money).
Why Idempotency Matters
If a service processed your request but failed to send back the answer, a retry might be treated as a brand‑new request. For safe retries, make sure the operation is idempotent—the system can handle the same request more than once without side effects.
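For example, many payment APIs accept a client-generated idempotency key, so a retried request is recognized as a duplicate instead of being processed twice. Here is a minimal sketch (the endpoint and the Idempotency-Key header name are assumptions for illustration, not a specific API):

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

// Generate the key once and reuse it for every retry of the same logical
// request, so the server can detect attempts 2..n as duplicates.
String idempotencyKey = UUID.randomUUID().toString();

HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.example.com/transfers")) // hypothetical endpoint
        .header("Idempotency-Key", idempotencyKey)            // assumed header name
        .POST(HttpRequest.BodyPublishers.ofString("{\"amount\": 100}"))
        .build();
```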
Balancing Speed and Reliability
- Faster feedback is better for human users. Failing quickly lets you show an error message right away.
- More reliability can matter more for critical tasks. In those cases, you might:
- Acknowledge the request immediately (so the user knows it was received).
- Perform retries in the background.
- Notify the user when the work is done.
This way, you keep both reliability and a good user experience.
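As a rough sketch of that flow (`acceptOrder()`, `processWithRetries()`, `notifyUser()`, and `OrderRequest` are hypothetical helpers, not Resilience4j APIs):

```java
import java.util.concurrent.CompletableFuture;

void handleOrder(OrderRequest request) {
    String orderId = acceptOrder(request);               // 1. acknowledge right away (e.g., HTTP 202 Accepted)
    CompletableFuture
            .runAsync(() -> processWithRetries(orderId)) // 2. retries run off the request thread
            .thenRun(() -> notifyUser(orderId));         // 3. tell the user when the work is done
}
```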
Resilience4j Retry Module
Resilience4j’s retry feature uses three simple parts:
- `RetryConfig`: Defines retry rules, such as how many attempts to make and how long to wait between them.
- `RetryRegistry`: Holds one or more `RetryConfig` objects and creates named `Retry` instances from them.
- `Retry`: Uses its `RetryConfig` to wrap your code (a lambda, method reference, or functional interface) so that Resilience4j will automatically retry the operation on failure.
With these building blocks, you can easily add retry logic around any remote call.
Simple Retry
In a simple retry, the operation is retried if a RuntimeException is thrown during the remote call. We can configure the number of attempts, how long to wait between attempts, and so on:
private static void demonstrateRetryWithUncheckedException() {
System.out.println("\n=== Resilience4j Retry Demonstration ===");
// 1. Define retry rules:
// - maxAttempts(3): try up to 3 times
// - waitDuration(2 seconds): pause 2 seconds between tries
//RetryConfig config = RetryConfig.ofDefaults(); // you can use default settings
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.of(2, SECONDS))
.build();
// 2. Create a registry to manage Retry instances
RetryRegistry retryRegistry = RetryRegistry.of(config);
// 3. Create or retrieve a Retry named "userServiceRetry" using our config
Retry retry = retryRegistry.retry("userServiceRetry", config);
// 4. Prepare the remote call as a Supplier (lambda)
// Here we search users with firstName = "Bob"
UserQuery query = UserQuery.builder()
.firstName("Bob")
.build();
Supplier<List<User>> remoteCallSupplier = () -> userService.search(query);
// 5. Decorate the Supplier so it will retry on failure
Supplier<List<User>> retryingUserSearch = Retry.decorateSupplier(retry, remoteCallSupplier);
try {
// 6. Execute the decorated call
List<User> users = retryingUserSearch.get();
// Print each returned user
users.forEach(System.out::println);
} catch (Exception e) {
// 7. Handle the case where all retry attempts failed
System.err.println("All retries failed: " + e.getMessage());
}
}
We would use decorateSupplier() if we wanted to create a decorator and re-use it at a different place in the codebase. If we want to create it and immediately execute it, we can use the executeSupplier() instance method instead:
- `Retry.decorateSupplier()` gives you a reusable function you can call multiple times.
- `retry.executeSupplier()` runs the supplier immediately with retry logic under the hood.
var query = UserQuery.builder().firstName("Bob").build();
List<User> users = retry.executeSupplier(() -> userService.search(query));
Checked Exceptions
Now, suppose we need to retry operations that can throw both checked and unchecked exceptions. For example, calling userService.searchThrowingException()
may throw a checked exception. Because Java’s Supplier interface doesn’t allow checked exceptions, this will cause a compiler error.
Instead, you should use Resilience4j’s io.github.resilience4j.core.functions.CheckedSupplier:
private static void demonstrateRetryWithCheckedException() {
...
var query = UserQuery.builder().firstName("Bob").build();
CheckedSupplier<List<User>> remoteCallSupplier = () -> userService.searchThrowingException(query);
CheckedSupplier<List<User>> retryingUserSearch = Retry.decorateCheckedSupplier(retry, remoteCallSupplier);
...
}
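As with Supplier, there is an execute-style shortcut: the instance method executeCheckedSupplier() runs the call immediately with retry logic. Note that it declares Throwable, so catch accordingly (a short sketch reusing the service from above):

```java
var query = UserQuery.builder().firstName("Bob").build();
try {
    // Runs the checked call with retry logic applied.
    List<User> users = retry.executeCheckedSupplier(() -> userService.searchThrowingException(query));
    users.forEach(System.out::println);
} catch (Throwable e) {
    System.err.println("All retries failed: " + e.getMessage());
}
```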
If you prefer not to use Supplier, the Retry module offers other decorator methods, for example:
decorateFunction(), decorateCheckedFunction(), decorateRunnable(), decorateCallable(), and so on.
The plain decorate* methods only retry on unchecked exceptions (RuntimeException).
The decorateChecked* variants will retry on both checked (Exception) and unchecked exceptions.
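For example, a void operation can be wrapped with decorateRunnable() (a minimal sketch; refreshCache() is a hypothetical method on our service):

```java
// The Runnable is retried on RuntimeException, just like a Supplier.
Runnable refreshWithRetry = Retry.decorateRunnable(retry, () -> userService.refreshCache());
refreshWithRetry.run(); // executes with retry logic
```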
Conditional Retry
In real applications, we don’t always want to retry every error. For example, if an AuthenticationException happens, retrying won’t help. When calling an HTTP service, you can:
- Check the HTTP status code (for example, 401 Unauthorized)
- Look for a specific error code in the response body
Use these checks to decide whether to retry. Next, we’ll see how to set up conditional retries in Resilience4j.
private static void demonstrateRetryPredicateConfig() {
var errorCodes = List.of(5678, 5679);
RetryConfig config = RetryConfig.<UserWithErrorCode>custom()
.maxAttempts(3)
.waitDuration(Duration.of(2, SECONDS))
.retryOnResult(userWithErrorCode -> errorCodes.contains(userWithErrorCode.errorCode()))
.build();
RetryRegistry retryRegistry = RetryRegistry.of(config);
Retry retry = retryRegistry.retry("userServiceRetry", config);
Supplier<UserWithErrorCode> userWithErrorCodeSupplier = Retry.decorateSupplier(retry, () -> userService.findById(1));
try {
var user = userWithErrorCodeSupplier.get().user();
System.out.println(user);
} catch (Exception e) {
System.err.println("All retries failed: " + e.getMessage());
}
}
In this example, userService.findById(1) returns a record:
public record UserWithErrorCode(Integer errorCode, User user) { }
We use this line in our retry configuration:
.retryOnResult(userWithErrorCode -> errorCodes.contains(userWithErrorCode.errorCode()))
This means:
- If the returned `errorCode` is in the list `[5678, 5679]`, Resilience4j treats it as a temporary error. It waits 2 seconds, then retries the call, up to 3 attempts in total.
- If the `errorCode` is not in that list, it stops retrying immediately.
In other words, the retryOnResult predicate lets us retry only for specific error codes that we know are temporary.
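One caveat: when the result keeps matching the retryOnResult predicate, Resilience4j returns the last result after the final attempt instead of throwing. If you prefer an exception in that case, the builder offers failAfterMaxAttempts (a sketch extending the config above):

```java
RetryConfig config = RetryConfig.<UserWithErrorCode>custom()
        .maxAttempts(3)
        .waitDuration(Duration.ofSeconds(2))
        .retryOnResult(user -> errorCodes.contains(user.errorCode()))
        // Throws MaxRetriesExceededException when attempts run out
        .failAfterMaxAttempts(true)
        .build();
```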
Conditional Retry Based on Exception Types
Sometimes we want to retry on general failures but skip retrying for certain cases. For example:
- We throw `UserServiceException` for any unexpected error in the user service.
- But if a `UserNotFoundException` happens, retrying won’t help because the user does not exist.
We can configure this with retryExceptions and ignoreExceptions:
RetryConfig config = RetryConfig.<UserWithErrorCode>custom()
.maxAttempts(3)
.waitDuration(Duration.ofSeconds(2))
.retryExceptions(UserServiceException.class) // retry on general service errors
.ignoreExceptions(UserNotFoundException.class) // do not retry if user not found
.build();
- `retryExceptions(...)` lists the exception types that trigger a retry (including subclasses).
- `ignoreExceptions(...)` lists the exception types that should not be retried.
- Any other exception (for example, `IOException`) will not be retried, because it is neither in `retryExceptions` nor in `ignoreExceptions`.
Conditional Retry with a Predicate
Sometimes even a single exception type needs extra checks. For example, a LimitException may include an error code.
We only want to retry when that code is 34565:
Predicate<Throwable> limitPredicate = ex -> {
if (ex instanceof LimitException le) {
return le.getCode() == 34565; // retry only for code 34565
}
return false;
};
RetryConfig conditionalConfig = RetryConfig.<UserWithErrorCode>custom()
.maxAttempts(3)
.waitDuration(Duration.ofSeconds(2))
.retryOnException(limitPredicate)
.build();
- `retryOnException(...)` takes a test (`Predicate<Throwable>`) that returns true when we should retry.
- You can make this test as simple or as detailed as you need (checking error codes, message text, etc.).
With these options, you decide exactly which errors and conditions cause a retry.
Backoff Strategies
Until now, we used a fixed wait time between retries. In most cases, it is better to increase the delay after each attempt. This is called backoff, and it gives the remote service more time to recover if it is overloaded.
Resilience4j uses an IntervalFunction to control backoff. An IntervalFunction takes the attempt number (1, 2, 3, …)
and returns the wait time in milliseconds.
Use ofRandomized(baseDelay) to add randomness around a base delay. By default, it uses a randomization factor of 0.5:
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofRandomized(2000))
.build();
- Base delay: 2000 ms
- Randomization factor: 0.5 (default)
- Actual delay: between 2000 − 2000 × 0.5 = 1000 ms and 2000 + 2000 × 0.5 = 3000 ms
You can change the factor with ofRandomized(baseDelay, randomizationFactor).
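For example, with a factor of 0.25 the delay stays within ±25% of the base:

```java
// Actual delay: between 1500 ms and 2500 ms (2000 ± 2000 × 0.25)
RetryConfig config = RetryConfig.custom()
        .maxAttempts(4)
        .intervalFunction(IntervalFunction.ofRandomized(2000, 0.25))
        .build();
```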
Use ofExponentialBackoff(initialDelay, multiplier) to multiply the wait time by the chosen factor after each attempt:
RetryConfig config = RetryConfig.custom()
.maxAttempts(6)
.intervalFunction(IntervalFunction.ofExponentialBackoff(1000, 2))
.build();
- After attempt 1 fails: wait 1000 ms
- After attempt 2 fails: wait 1000 × 2 = 2000 ms
- After attempt 3 fails: wait 2000 × 2 = 4000 ms
- …and so on.
Combine both strategies with ofExponentialRandomBackoff(initialDelay, multiplier). This adds randomness to the exponential delays.
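For example:

```java
// Start at 1000 ms, double after each attempt, then randomize each delay by ±50%.
RetryConfig config = RetryConfig.custom()
        .maxAttempts(6)
        .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(1000, 2, 0.5))
        .build();
```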
You can write your own function, for example, to add a small random jitter:
IntervalFunction jitterBackoff = attempt ->
500L * (long) Math.pow(2, attempt - 1)
+ ThreadLocalRandom.current().nextLong(100);
RetryConfig config = RetryConfig.custom()
.maxAttempts(6)
.intervalFunction(jitterBackoff)
.build();
- Base exponential delay: 500 ms, doubled each attempt
- Plus up to 100 ms random extra delay
With IntervalFunction, you have full control over how delays grow. Pick the strategy that best fits your scenario.
Asynchronous Retries
So far, our examples have used synchronous calls. Now let’s see how to retry asynchronous operations.
Imagine we search for users on another thread:
CompletableFuture
.supplyAsync(() -> userService.search(query))
.thenAccept(System.out::println);
Here, supplyAsync() runs search(query) in a separate thread. When it finishes, thenAccept() prints the list of users.
To add retries, use the executeCompletionStage() method on your Retry instance. It needs:
- A `ScheduledExecutorService` for scheduling retry attempts.
- A `Supplier<CompletionStage<…>>` that wraps the async call.
For example:
// 1. Create a scheduler for retries
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
// 2. Wrap the async user search in a Supplier
Supplier<CompletionStage<List<User>>> completionStageSupplier =
() -> CompletableFuture.supplyAsync(() -> userService.search(query));
// 3. Execute with retry logic
retry.executeCompletionStage(scheduler, completionStageSupplier)
.thenAccept(System.out::println);
- Step 1: We use a single-threaded scheduler here, but in real apps you would use a shared pool, e.g. `Executors.newScheduledThreadPool(n)`.
- Step 2: The supplier returns a `CompletionStage` that will run the search.
- Step 3: `executeCompletionStage()` decorates and runs the async call, retrying if needed. The returned `CompletionStage` still lets you use `thenAccept()` or other callbacks.
This approach makes your asynchronous code resilient, automatically retrying failed operations without blocking the main thread.
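One housekeeping note: the scheduler’s threads are non-daemon by default, so shut it down (scheduler.shutdown()) when your application stops, or it can keep the JVM alive.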
Events
By default, retrying is a “black box”—we don’t see when an attempt fails or when a retry happens.
To log these details, Resilience4j lets you listen to Retry events. You can use the EventPublisher to run code on each retry, success, or error.
// Get the publisher from your Retry instance
Retry.EventPublisher publisher = retry.getEventPublisher();
// Log each retry attempt
publisher.onRetry(event ->
System.out.println("Retry #"
+ event.getNumberOfRetryAttempts()
+ ", wait "
+ event.getWaitInterval().toMillis()
+ " ms")
);
// Log when it finally succeeds
publisher.onSuccess(event ->
System.out.println("Succeeded after "
+ event.getNumberOfRetryAttempts()
+ " attempts")
);
// Log if all retries fail
publisher.onError(event ->
System.err.println("Failed after "
+ event.getNumberOfRetryAttempts()
+ ": "
+ event.getLastThrowable().getMessage())
);
You can also watch the RetryRegistry to see when Retry instances are added or removed:
RetryRegistry.EventPublisher<Retry> registryPub = retryRegistry.getEventPublisher();
registryPub.onEntryAdded(entry ->
System.out.println("Added retry: " + entry.getAddedEntry().getName())
);
registryPub.onEntryRemoved(entry ->
System.out.println("Removed retry: " + entry.getRemovedEntry().getName())
);
Metrics
Resilience4j tracks how many calls:
- Succeed on the first try
- Succeed after one or more retries
- Fail without any retry
- Fail even after retries
To collect these metrics, use Micrometer. First, bind your RetryRegistry to a MeterRegistry:
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedRetryMetrics.ofRetryRegistry(retryRegistry).bindTo(meterRegistry);
Then you can read and print each metric:
Consumer<Meter> meterConsumer = meter -> {
String desc = meter.getId().getDescription();
String metricName = meter.getId().getTag("kind");
Double metricValue = StreamSupport.stream(meter.measure().spliterator(), false)
.filter(m -> m.getStatistic().name().equals("COUNT"))
.findFirst()
.map(Measurement::getValue)
.orElse(0.0);
System.out.println(desc + " - " + metricName + ": " + metricValue);
};
meterRegistry.forEachMeter(meterConsumer);
In production, send these metrics to a monitoring system (e.g., Prometheus) for dashboards and alerts.
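Finally, here is a complete example that ties everything together: a retry around a checked supplier, registry and retry event listeners, and Micrometer metrics: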
private static void demonstrateRetryWithMetricsAndEvents() {
// 1. Define retry rules
RetryConfig config = RetryConfig.custom()
.maxAttempts(3) // up to 3 attempts
.waitDuration(Duration.ofSeconds(2)) // 2 seconds between attempts
.build();
// 2. Create registry and bind metrics
RetryRegistry retryRegistry = RetryRegistry.of(config);
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedRetryMetrics.ofRetryRegistry(retryRegistry).bindTo(meterRegistry);
// 3. Listen for registry events
retryRegistry.getEventPublisher()
.onEntryAdded(entry -> System.out.println("Retry created: " + entry.getAddedEntry().getName()))
.onEntryRemoved(entry -> System.out.println("Retry removed: " + entry.getRemovedEntry().getName()));
// 4. Create or retrieve a Retry instance
Retry retry = retryRegistry.retry("userServiceRetry");
// 5. Listen for retry lifecycle events
retry.getEventPublisher()
.onRetry(event -> System.out.printf("Retry #%d after waiting %d ms%n",
event.getNumberOfRetryAttempts(), event.getWaitInterval().toMillis()))
.onSuccess(event -> System.out.printf("Succeeded after %d attempts%n",
event.getNumberOfRetryAttempts()))
.onError(event -> System.err.printf("Failed after %d attempts: %s%n",
event.getNumberOfRetryAttempts(), event.getLastThrowable().getMessage()));
// 6. Prepare the checked supplier for a remote call
UserQuery query = UserQuery.builder().firstName("Bob").build();
CheckedSupplier<List<User>> checkedSupplier = () -> userService.searchThrowingException(query);
// 7. Decorate it with retry logic
CheckedSupplier<List<User>> retryingSupplier = Retry.decorateCheckedSupplier(retry, checkedSupplier);
// 8. Execute and handle the result
try {
List<User> users = retryingSupplier.get();
users.forEach(System.out::println);
} catch (Throwable e) {
System.err.println("All retries failed: " + e.getMessage());
}
// 9. Print out captured metrics
meterRegistry.forEachMeter(meter -> {
String desc = meter.getId().getDescription();
String kind = meter.getId().getTag("kind");
double count = StreamSupport.stream(meter.measure().spliterator(), false)
.filter(ms -> ms.getStatistic().name().equals("COUNT"))
.findFirst()
.map(Measurement::getValue)
.orElse(0.0);
System.out.printf("%s - %s: %.0f%n", desc, kind, count);
});
}