Zero-Downtime Maintenance on MySQL RDS

Dmitriy Royzenberg
WorkMarket Engineering
6 min read · Feb 9, 2018

--

Achieving zero downtime during database maintenance has been a challenging goal since the dawn of databases. Yet even today (over 30 years later!) many companies, large and small, declare downtime for some of their services in order to perform DB maintenance. It is not surprising, for example, to run into a maintenance page on a bank portal during a weekend night. While NoSQL solutions have come a long way here, providing 24/7 availability on an RDBMS remains very challenging.

Given the downtime challenge, I’ve seen companies select NoSQL as their database not because they need the scale, but to be able to perform zero-downtime upgrades and ensure site availability. Of course, moving to NoSQL comes with the price of a technology transition and the challenges that follow in data modeling, denormalization, data integrity, security, and operational support.

On the other hand, an RDBMS like MySQL has an advantage over many others: it supports master-to-master (M/M) replication. M/M replication offers the flexibility to perform maintenance on one node while the other takes live traffic, and then flip the roles to cut over.

In today’s demanding online world, consumers expect sites to operate 24/7, and WorkMarket is no exception. The TechOps team at WorkMarket therefore puts every effort into making that happen. In the two and a half years I have worked here, none of our database maintenances has required downtime.

Our original approach

To our advantage, we use a Percona MySQL M/M configuration running on EC2 instances in active/passive mode. Typically, the application writes to the active master only; however, during traffic switchovers both masters are briefly active. This allows us to perform upgrades via graceful switchovers and enjoy zero-downtime maintenance.

The move to RDS

Going back a few years, as our platform began to expand and move to a microservices architecture to support WorkMarket’s business growth, we started to look into how we could use managed AWS RDS to scale and automate our database operations without compromising availability. We chose RDS because it let us scale our database platform without hiring additional DBA resources. RDS also provides many benefits we enjoy, including but not limited to automatic DNS failover, advanced security, encryption, SSL connectivity, auditing, automated backups, simplified recovery and upgrade procedures, API support, simplified manageability, and automation, all while freeing database engineers to focus on application-related support. With the move to RDS, we can launch and configure new environments on the fly within minutes, as opposed to hours or days.

The database high availability challenge

The decision to move to RDS, however, wasn’t a straightforward one. Before the move, we had to address one of our critical concerns: given the known RDS availability limitations, how could we continue providing zero-downtime maintenance as we had in the past? By design, RDS requires some downtime during its patching and upgrades. Also, RDS doesn’t officially support M/M replication.

To upgrade a MySQL RDS instance to a new version, for example, you have to upgrade an RDS read replica first, then promote the replica to be the new master, and finally switch your application to the new master. Unfortunately, the procedure results in a few minutes of downtime, which is problematic.

To give Amazon credit, they came up with the Aurora MySQL cluster, which reduced the downtime during upgrades to about 30 seconds. Most recently, Aurora added zero-downtime patching, cutting the downtime during patching to about 5 seconds. While this is great news, we didn’t want to jump straight to Aurora: we run many microservices, and based on our benchmarks, not every database behind them is a good fit for it. In addition, Aurora zero-downtime patching comes with limitations, some of them significant, such as the requirement that binary logging be disabled. While binary logging is not required for a single-region Aurora cluster, it is a must for cross-region DR support, and here you have to start making choices, since disabling it requires a cluster reboot.
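
If you are weighing that trade-off, it is easy to see where a given instance stands; a quick check from any MySQL client:

    SHOW VARIABLES LIKE 'log_bin';        -- ON if binary logging is enabled
    SHOW VARIABLES LIKE 'binlog_format';  -- ROW, STATEMENT, or MIXED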

Therefore, we wanted a high-availability RDS solution that wasn’t limited to Aurora and would allow us to continue providing our clients with zero-downtime database maintenance on RDS.

After some research and a proof of concept (PoC), we implemented a solution that lets us do exactly that.

The solution

While RDS doesn’t officially support MySQL M/M replication, nothing prevents you from implementing it on your own. So our approach was to set up M/M RDS before the maintenance and perform the maintenance in the same fashion as our traditional approach for MySQL hosted on EC2. The PoC worked as expected, and we automated the entire process.

The solution allowed us to work around the RDS availability limitations and continue to provide zero downtime during database maintenances. Just recently we had to upgrade MySQL RDS from 5.7.11 to 5.7.17 after Amazon announced it was discontinuing support for version 5.7.11, and the upgrade went smoothly without any impact on our site.

For those curious about how it works, I will describe the steps we use to configure M/M RDS below. Otherwise, you can skip straight to the Conclusion.

The implementation

Note that steps #1 through #3 are very RDS-specific due to the restrictions on SUPER privileges in RDS. For example, you can’t simply run mysqldump with the --master-data=2 option to build a slave from a backup, as that option requires the SUPER privilege, which RDS does not grant to customer accounts. And in general, running mysqldump on a large DB can be very time-consuming. You are therefore left with the option of launching a new read replica (R1) (or using an existing one, if you can temporarily stop replication on it) to obtain the binlog position of the existing master (M1) from it. Then you create a snapshot (SNAP1) from R1 and launch a new RDS instance from it: the future master, M2.

Before you begin, please ensure that you set auto_increment_increment to at least 2 in the RDS parameter group to support M/M replication. It has to be greater than or equal to the number of masters you will configure.
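
As a sanity check, you can confirm the effective values on each master. With two masters, a common scheme is auto_increment_increment = 2 on both, with auto_increment_offset = 1 on M1 and 2 on M2, so auto-generated keys never collide (the exact values are up to you, as long as the offsets differ):

    SHOW VARIABLES LIKE 'auto_increment_increment';
    SHOW VARIABLES LIKE 'auto_increment_offset';
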
1. Create Read Replica (R1) from M1

a. Create a replication user on M1:

    GRANT EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT ON *.*
    TO 'repl'@'%' IDENTIFIED BY '<secret>';

b. Connect to R1 and stop replication:

    CALL mysql.rds_stop_replication;

c. Obtain M1’s current binlog file and position by running “show slave status\G” on the R1 read replica and noting the values of:

    Master_Log_File: <binlog file>
    Exec_Master_Log_Pos: <binlog position>

2. Create RDS Snapshot (SNAP1) from R1

3. Create New Master (M2) from SNAP1 with the attributes obtained from M1

Assign a parameter group to M2 with a different auto_increment_offset than M1’s to avoid M/M replication key conflicts.

4. Set up M/M replication

a. Configure M2 as a slave of M1, using the replication user from step 1a and the binlog coordinates from step 1c, then start replication:

    -- Arguments: host, port, user, password, binlog file, binlog position, ssl_encryption (0|1)
    CALL mysql.rds_set_external_master ('<M1 endpoint>', 3306,
        'repl', '<secret>', '<Master_Log_File>', <Exec_Master_Log_Pos>, 0);
    CALL mysql.rds_start_replication;

b. Connect to M2 and obtain its current binlog file and position by running “show master status\G” and noting:

    File: <binlog file>
    Position: <binlog position>

c. Connect to M1 and configure it to be a slave of M2, using the coordinates from step 4b, then start replication:

    CALL mysql.rds_set_external_master ('<M2 endpoint>', 3306,
        'repl', '<secret>', '<File>', <Position>, 0);
    CALL mysql.rds_start_replication;

d. Delete R1 and SNAP1, as they’re no longer needed.

5. Perform Graceful Switchover

With M/M replication set up successfully, we are ready to proceed with the DB maintenance without downtime by gracefully switching app servers over one at a time.
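
Before (and during) the switchover, it is worth confirming that replication is healthy in both directions. A minimal check, run on each master:

    SHOW SLAVE STATUS\G
    -- Fields to verify before moving traffic:
    --   Slave_IO_Running: Yes
    --   Slave_SQL_Running: Yes
    --   Seconds_Behind_Master: 0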

Conclusion

While the solution may seem a bit involved at first glance, it is very straightforward, and with automation we can launch an M/M configuration for any of our instances in under 30 minutes. The solution offers more flexibility than Aurora: we have no downtime, we don’t compromise DR, and we can always fall back to the not-yet-upgraded server in case of an issue.

It is worth mentioning that we are not always married to the M/M solution during maintenances. Some maintenance operations can be challenging even on Aurora. For example, resizing a column requires a full table copy, during which writes to the table are blocked. While Aurora has made significant improvements to adding new columns online, there are still many ALTER TABLE operations, such as resizing columns, that are locking. Therefore, we often use Percona Toolkit’s pt-online-schema-change to execute the change when we find it more appropriate, as the example below shows.
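
To make the locking concrete, here is a hypothetical example (the table and column names are made up). Shrinking a VARCHAR is not an in-place operation in MySQL 5.7, so the plain ALTER below rebuilds the table and blocks writes for the duration; pt-online-schema-change performs the same change through a shadow-table copy with only brief metadata locks:

    -- Blocking: InnoDB cannot shrink a VARCHAR in place, so this
    -- statement copies the whole table and blocks writes meanwhile.
    ALTER TABLE orders MODIFY COLUMN note VARCHAR(255);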

The bottom line: the M/M RDS solution gives us additional options for working around RDS limitations and continuing to avoid downtime during DB maintenance, all while keeping our customers happy.

If you’re interested in moving to RDS, you should also check out our article “How to move to MySQL RDS without downtime.”

I hope you’ll find the article helpful. Please feel free to ask any questions!
