Rotating RDS SSL Certificates Without Downtime

Dmitriy Royzenberg
WorkMarket Engineering
Dec 16, 2019


The Challenge

If you work at one of the many lucky companies that run RDS databases on AWS, you are probably more than familiar with the following email subject from AWS: “Update Your Amazon RDS SSL/TLS Certificates by February 5, 2020.”

As you read through the recommended procedure for Rotating Your SSL/TLS Certificate, you realize that it requires restarting the RDS server, which in turn causes an outage, even on the AWS Aurora service.

If you are like most people, you will probably do exactly what AWS tells you and plan a downtime window for your application. However, if you have strict SLAs, are fortunate enough to run your platform on MySQL, and have a little creativity, you can avoid the downtime altogether.

Truth be told, you don’t need to declare downtime for your read replicas, as you can update the SSL certs one replica at a time while shifting read-only traffic in a rolling fashion. However, for the writable Master databases, you really do not have many choices. Below we explore the possible options using the AWS tools. This article is not intended to cover third-party solutions.
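The rolling update of read replicas can be sketched roughly as follows. This is a minimal illustration, not the exact procedure we ran: rotate_replicas, drain and restore are hypothetical names, and how you shift read-only traffic away from a replica depends entirely on your own load balancing. The client is passed in so that the sketch does not depend on any particular AWS setup; with boto3 it would be boto3.client("rds"), whose modify_db_instance call accepts the CACertificateIdentifier and ApplyImmediately parameters used here.

```python
from typing import Callable, Iterable, List

# CA identifier for the certificate AWS introduced for the 2020 rotation.
NEW_CA = "rds-ca-2019"


def rotate_replicas(
    rds_client,
    replica_ids: Iterable[str],
    drain: Callable[[str], None],
    restore: Callable[[str], None],
    ca_identifier: str = NEW_CA,
) -> List[str]:
    """Rotate the CA certificate on each read replica, one at a time.

    drain and restore are hypothetical hooks: they stand in for whatever
    mechanism removes a replica from, and returns it to, read-only traffic.
    """
    rotated = []
    for replica_id in replica_ids:
        # Stop sending read traffic to this replica before touching it.
        drain(replica_id)
        rds_client.modify_db_instance(
            DBInstanceIdentifier=replica_id,
            CACertificateIdentifier=ca_identifier,
            ApplyImmediately=True,
        )
        # Wait for the instance to report "available" again. Assumption: a
        # real script would also confirm the new certificate is in place,
        # because the status can lag behind the modify call.
        rds_client.get_waiter("db_instance_available").wait(
            DBInstanceIdentifier=replica_id
        )
        restore(replica_id)
        rotated.append(replica_id)
    return rotated
```

Because the replicas are processed strictly one after another, at most one replica is out of rotation at any moment, which is what keeps the read-only traffic served throughout.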

The Solution

One approach is to try using the recently released RDS Proxy.

Note: At the time that I am writing this article, the RDS Proxy is still in preview and not recommended for production use.

The RDS Proxy works by being placed in front of your Multi-AZ Master. If the Master becomes unavailable, the proxy keeps the existing connections open (except connections that are in the middle of a transaction) and queues new connections for the time specified in the Connection borrow timeout setting until the failover is completed.

This approach allows us to avoid downtime by keeping the client connections alive until RDS is back online after the restart with the new SSL certs. For example, if it takes up to 2 minutes to restart the database to update SSL certs, you can set the Connection borrow timeout to 5 minutes to avoid downtime.

It is worth mentioning that the timeout setting on the client app has to be coordinated with the Connection borrow timeout on the proxy: if the timeout on the client is smaller than the Connection borrow timeout, the client app might consider the proxy dead before the database is back.
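The relationship between the three durations can be written down as a small sanity check. This is only a sketch of the reasoning above; check_timeouts and its parameter names are made up for illustration and are not part of any AWS tooling.

```python
from typing import List


def check_timeouts(
    restart_s: float, borrow_timeout_s: float, client_timeout_s: float
) -> List[str]:
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    # The proxy must be willing to queue new connections for longer than
    # the database takes to restart.
    if borrow_timeout_s <= restart_s:
        problems.append(
            "Connection borrow timeout does not cover the expected restart time"
        )
    # The client must be willing to wait at least as long as the proxy queues.
    if client_timeout_s < borrow_timeout_s:
        problems.append(
            "client timeout is smaller than Connection borrow timeout"
        )
    return problems


if __name__ == "__main__":
    # A 2-minute restart with a 5-minute borrow timeout, as in the example,
    # first with a matching client timeout and then with a 30-second one.
    print(check_timeouts(restart_s=120, borrow_timeout_s=300, client_timeout_s=300))
    print(check_timeouts(restart_s=120, borrow_timeout_s=300, client_timeout_s=30))
```

Running a check like this against your real settings before the maintenance is cheap, and it catches the case where the proxy is still patiently queueing while the application has already given up.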

At first glance, this seems to be an attractive approach to the SSL update issue, but it does have some limitations. For example, if you run into any issue with SSL after the restart, your application will incur downtime for as long as you need to troubleshoot it. Another disadvantage of the RDS Proxy is that it is a very new and not yet very robust product at this time.

However, there is another approach that we are successfully using at WorkMarket to deal with RDS maintenance. I described it in the earlier blog post on How to perform RDS maintenances without downtime. The way it works is that we temporarily launch a new Master and set up Master-Master replication with the existing Master. Then we update SSL on the new Master and gracefully switch application traffic to the new Master, one App server at a time. This approach allowed us to update all our RDS SSL certificates without outages.

The main advantage of this approach over the RDS Proxy is that it gives us the flexibility to test the connection prior to switching all traffic, similar to Canary testing. Once the Master-Master RDS is configured, we take a Canary session and point it to the new Master for testing. Once all tests pass, we gracefully move production traffic to the new Master.
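The canary-then-rolling switch can be sketched as a small control loop. This is an illustration under assumptions, not our actual tooling: cut_over and all four callables are hypothetical placeholders for however you test a session and repoint an App server, and the rollback step is my own addition, made possible by the old Master still replicating with the new one.

```python
from typing import Callable, Iterable, List


def cut_over(
    app_servers: Iterable[str],
    canary_check: Callable[[], bool],
    switch_to_new: Callable[[str], None],
    switch_to_old: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Move App servers to the new Master one at a time, after a canary test."""
    # Test a single canary session against the new Master before moving
    # any production traffic.
    if not canary_check():
        raise RuntimeError("canary failed against the new Master; no traffic moved")

    moved: List[str] = []
    for server in app_servers:
        switch_to_new(server)
        if not is_healthy(server):
            # Assumption: with Master-Master replication the old Master is
            # still current, so every moved server can be pointed back.
            for s in reversed(moved + [server]):
                switch_to_old(s)
            raise RuntimeError(f"{server} unhealthy on the new Master; rolled back")
        moved.append(server)
    return moved
```

The point of the structure is that a failure at any stage leaves production on a working Master: a failed canary moves nothing, and a failed App server sends everything back to the old Master.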

Conclusion

Once again, you can see that using RDS, even with Aurora, comes at the price of facing downtime during scheduled maintenance a few times a year. However, it is absolutely possible to avoid the outages and keep your customers happy and unaffected.

I hope you’ll find the article useful. Please feel free to reach out to me with any questions.
