
We recently migrated our platform to Kubernetes. It is without doubt one of the best technological moves we have ever made. It provides stability, scalability, service discovery, centralized configuration, rolling upgrades… the list could go on.

When you move to a platform that automates deployment and scaling, such as Kubernetes or DC/OS, two things you absolutely need to get right are health checks (liveness and readiness) and a graceful shutdown of your application (along with compute resources, but that’s another topic).

We use the Play framework a lot; it’s a great Java/Scala framework for developing web applications. But for deployments and scale-downs to go smoothly on Kubernetes, your application needs a proper graceful shutdown. In a web context, that could mean:

  • Wait for current requests to finish
  • Refuse any new incoming requests
  • Cancel any task scheduled on the Akka actor system scheduler
  • Shut down database connections

Unfortunately, none of this comes for free with Play. To give you the opportunity to shut down gracefully, Kubernetes first sends a SIGTERM signal to your application. Play’s behaviour is to shut everything down immediately upon receiving that signal. That means in-flight requests don’t complete and clients receive a “connection closed” error. Not very “graceful”…

First, we need to handle SIGTERM in a better way: we can’t let the application shut down straight away. That’s the purpose of this signal handler singleton service:

import scala.concurrent.duration.Duration;
import scala.concurrent.duration.FiniteDuration;
import play.api.inject.DefaultApplicationLifecycle;
import akka.actor.ActorSystem;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import sun.misc.Signal;
import java.util.concurrent.TimeUnit;
import javax.inject.Inject;
import javax.inject.Singleton;

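/**
 * Intercepts SIGTERM so that Play does not stop immediately, and schedules the
 * real shutdown after STOP_DELAY. Note that sun.misc.Signal is an internal JDK API.
 */
@Singleton
public class SignalHandler {

  private static final Logger LOG = LoggerFactory.getLogger(SignalHandler.class);
  // How long to keep serving in-flight requests before stopping the application.
  private static final FiniteDuration STOP_DELAY = Duration.create(40, TimeUnit.SECONDS);

  // volatile: written by the signal-dispatch thread, read by request threads.
  private volatile boolean isShuttingDown = false;

  @Inject
  SignalHandler(ActorSystem actorSystem, DefaultApplicationLifecycle lifecycle) {
    // Replace the default SIGTERM behaviour with our own handler.
    Signal.handle(new Signal("TERM"), new sun.misc.SignalHandler() {
      @Override
      public void handle(Signal signal) {
        isShuttingDown = true;
        LOG.debug("Termination required, swallowing SIGTERM to allow current requests to finish");
        // Run Play's stop hooks once the delay has elapsed.
        actorSystem.scheduler().scheduleOnce(STOP_DELAY, () -> {
          lifecycle.stop();
        }, actorSystem.dispatcher());
      }
    });
  }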

  public boolean isShuttingDown() {
    return isShuttingDown;
  }
}

By swallowing SIGTERM we let in-flight requests finish, and we use the Akka scheduler to delay the actual shutdown of the application.
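
To make the mechanism easier to see outside of Play, here is a minimal sketch of the same idea using only the JDK: flip a flag immediately, then run the real stop after a delay. The class name DelayedShutdown and its methods are hypothetical, and a ScheduledExecutorService stands in for the Akka scheduler. The compare-and-set guard is not part of our service; it is an assumption added here so that the stop is scheduled only once if the signal arrives several times.

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper (not a Play API): marks shutdown now, stops later. */
final class DelayedShutdown {
  private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
  private final CountDownLatch stopped = new CountDownLatch(1);
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "delayed-shutdown");
        t.setDaemon(true); // never keep the JVM alive just for this timer
        return t;
      });
  private final Duration delay;
  private final Runnable stopAction;

  DelayedShutdown(Duration delay, Runnable stopAction) {
    this.delay = delay;
    this.stopAction = stopAction;
  }

  /** Intended to be called from the signal handler; only the first call schedules the stop. */
  void requestShutdown() {
    if (shuttingDown.compareAndSet(false, true)) {
      scheduler.schedule(() -> {
        try {
          stopAction.run();
        } finally {
          stopped.countDown();
          scheduler.shutdown();
        }
      }, delay.toMillis(), TimeUnit.MILLISECONDS);
    }
  }

  /** True as soon as shutdown has been requested, even though the stop has not run yet. */
  boolean isShuttingDown() {
    return shuttingDown.get();
  }

  /** Waits until the stop action has run; returns false on timeout or interruption. */
  boolean awaitStopped(long timeoutMillis) {
    try {
      return stopped.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

In a setup like ours, requestShutdown() would be called from the SIGTERM handler and the stop action would be the call to lifecycle.stop().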

One issue remains: the application is still running, so it can still receive new requests. To make sure Kubernetes stops routing traffic to it, we need to use the SignalHandler in the liveness/readiness implementation:

import play.mvc.Controller;
import play.mvc.Http;
import play.mvc.Result;
import javax.inject.Inject;

public class HealthController extends Controller {
  private final SignalHandler signalHandler;

  @Inject
  public HealthController(SignalHandler signalHandler) {
    this.signalHandler = signalHandler;
  }
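
  // Endpoint polled by the Kubernetes probes.
  public Result checkHealth() {
    if (signalHandler.isShuttingDown()) {
      // SIGTERM received: report unhealthy so no new traffic is routed here.
      return status(Http.Status.SERVICE_UNAVAILABLE);
    } else {
      return ok();
    }
  }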
}

Once SIGTERM has been swallowed, the method isShuttingDown starts returning true, which turns the health check into a 503 HTTP response. Kubernetes then considers the application unhealthy and stops sending it new requests.
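
If you want to see the probe’s side of this without a running Play application, the sketch below may help. It uses the JDK’s built-in com.sun.net.httpserver.HttpServer as a stand-in for Play. The class HealthEndpointSketch, the /health path and the probe() helper are assumptions made for illustration; they are not part of our code.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the health controller, built on the JDK HTTP server. */
final class HealthEndpointSketch {
  private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
  private final HttpServer server = bind();

  HealthEndpointSketch() {
    // Answer 200 while healthy and 503 once shutdown has been requested.
    server.createContext("/health", exchange -> {
      int status = shuttingDown.get() ? 503 : 200;
      exchange.sendResponseHeaders(status, -1); // -1: no response body
      exchange.close();
    });
    server.start();
  }

  private static HttpServer bind() {
    try {
      // Port 0 asks the OS for any free port on the loopback interface.
      return HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** What the signal handler would do on SIGTERM. */
  void markShuttingDown() {
    shuttingDown.set(true);
  }

  /** Plays the role of the Kubernetes probe: returns the HTTP status of /health. */
  int probe() {
    try {
      int port = server.getAddress().getPort();
      HttpURLConnection connection = (HttpURLConnection)
          URI.create("http://127.0.0.1:" + port + "/health").toURL().openConnection();
      try {
        return connection.getResponseCode();
      } finally {
        connection.disconnect();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  void stop() {
    server.stop(0);
  }
}
```

Calling probe() before and after markShuttingDown() shows the switch from a healthy to an unhealthy answer while the server itself keeps running, which is exactly the window in-flight requests need.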

Combined with a long terminationGracePeriodSeconds in your pod configuration, this gives the application enough time to finish processing in-flight requests. That’s how we achieved zero-downtime deployments.
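
How long is “long”? The grace period has to outlast our 40-second stop delay plus whatever the stop hooks need, otherwise Kubernetes sends SIGKILL before the application has stopped on its own. The stop delay in turn should outlast the time the probe needs to notice the 503. The sketch below spells out that arithmetic. ShutdownBudget is a hypothetical helper, the probe timing is a rough approximation (period multiplied by failure threshold), and every value other than the 40 seconds is an example rather than our actual configuration.

```java
import java.time.Duration;

/** Hypothetical helper that checks whether the shutdown timings are consistent. */
final class ShutdownBudget {

  /** Rough time for the probe to mark the pod unhealthy: periodSeconds * failureThreshold. */
  static Duration timeToUnready(Duration probePeriod, int failureThreshold) {
    return probePeriod.multipliedBy(failureThreshold);
  }

  /**
   * @param gracePeriod      terminationGracePeriodSeconds of the pod (Kubernetes default: 30 s)
   * @param stopDelay        delay between SIGTERM and the real stop (STOP_DELAY, 40 s in this post)
   * @param stopHooks        assumed time needed by the stop hooks themselves
   * @param probePeriod      periodSeconds of the probe
   * @param failureThreshold failureThreshold of the probe
   */
  static boolean fits(Duration gracePeriod, Duration stopDelay, Duration stopHooks,
                      Duration probePeriod, int failureThreshold) {
    // The pod should be marked unhealthy before the application actually stops...
    boolean unreadyBeforeStop =
        stopDelay.compareTo(timeToUnready(probePeriod, failureThreshold)) > 0;
    // ...and the application should have stopped before Kubernetes sends SIGKILL.
    boolean stoppedBeforeKill =
        gracePeriod.compareTo(stopDelay.plus(stopHooks)) > 0;
    return unreadyBeforeStop && stoppedBeforeKill;
  }
}
```

With a 40-second stop delay, the default grace period of 30 seconds cannot work, since the container would be killed before the delayed stop even starts. That is why the pod needs an explicitly longer value.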
