If there is one thing without which a backend application becomes completely non-functional, it has to be the database. Working in an e-commerce company, our day-to-day life is mostly about fetching data from the database, processing it, and sending it to the caller service (the front end), which uses the data to display a handful of insights to the end user.
Communication with the database is handled solely by the backend services, and when working at scale one needs to understand how that communication happens underneath and how connections are created and reused.
In this post, we will discuss three flags that define how the MySQL client library maintains and reuses the connections created by the application layer.
Before going deep into the functionality of each flag, let me first give you an overview of what would happen if these flags were not in the picture.
At the end, I will describe a situation that made us realise how costly a MySQL connection is.
Problem Statement-1 (Opening and closing connections on the fly)
To describe this problem, I would like to relate it to the concept of a contract employee. The idea behind contract employment is that once the employer is done with the work, they no longer require the employee's services and can terminate the contract. But the problem with contractual employees is that every time a new one has to be recruited, it requires a lot of process: searching for candidates, interviewing, onboarding and all the other formalities. In the same way, creating a new connection on the fly for every new request involves a lot of work and is a resource-intensive task.
To overcome this, instead of having 1000 new contractual employees for 1000 tasks, I will recruit 100 permanent employees and keep reusing them: once the work assigned to one employee is over, we assign the topmost pending task to the idle employee. This way we don't need to recruit a new employee each time and can reuse an existing one. If each employee is related to a connection, this concept of reusing connections is called connection pooling. Ideally, the maximum number of connections in a pool will be nearly equal to the maximum number of concurrent requests, so that no request waits in a queue to be served.
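The employee analogy above maps directly onto a pool data structure. The following is a minimal, illustrative sketch (not any specific MySQL client's implementation); `FakeConn` and `Pool` are hypothetical names, with `FakeConn.__init__` standing in for the costly connection handshake:

```python
import queue

class FakeConn:
    """Stand-in for a real MySQL connection; __init__ models the costly handshake."""
    created = 0
    def __init__(self):
        FakeConn.created += 1

class Pool:
    """Minimal connection pool: reuse idle connections, create one only when none is free."""
    def __init__(self, size):
        self._idle = queue.Queue(maxsize=size)

    def acquire(self):
        try:
            return self._idle.get_nowait()  # reuse an existing idle connection
        except queue.Empty:
            return FakeConn()               # nothing idle: pay the creation cost

    def release(self, conn):
        self._idle.put(conn)                # hand it back for the next request

pool = Pool(size=100)
for _ in range(1000):                       # 1000 sequential "tasks"
    conn = pool.acquire()
    pool.release(conn)

print(FakeConn.created)                     # 1 -- a single connection served all 1000 tasks
```

A real pool also handles concurrency, health checks and error paths, but the core trade remains the same: pay the handshake cost once and amortise it over many requests.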
Problem Statement-2 (Resource Leakage. Solved using max-open-connections)
Let's say that I, as a developer, keep creating a new connection to the database instance for every new request from the client, with no limit on the number of connections that can be opened. For every new connection created, a pointer to that open connection resides at the application layer, and after some huge number of connections the database gets bloated and stops responding to applications altogether.
Consider three applications using the same database instance in production. Just because of a mistake in one application that never closes its opened connections, the database gets bloated, which blocks the other two applications from contacting it, creating a butterfly effect.
Now let's think about what an ideal solution to this problem could be.
I would go with a constraint on the maximum number of open connections the application can have at any instant. Say we have set max-open-connections to 200 and never close an opened connection: the damage is contained to the particular application that is driving the resource leak. After it reaches 200 connections, no additional requests will be served, giving the development team immediate insight into a possible connection leak without disturbing other services.
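The capping behaviour can be sketched with a semaphore guarding the pool. This is an illustrative model of a max-open-connections limit, not a real client library's code; `BoundedPool` and its timeout are assumptions for the demo:

```python
import threading

class BoundedPool:
    """Pool capped at max_open connections, mirroring a max-open-connections flag."""
    def __init__(self, max_open):
        self._slots = threading.BoundedSemaphore(max_open)
        self.in_use = 0

    def acquire(self, timeout=None):
        # Wait for a free slot; if connections leak, every caller ends up stuck
        # here, which surfaces the bug without bloating the shared database.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("pool exhausted: possible connection leak")
        self.in_use += 1
        return object()                     # stand-in for a real connection

    def release(self, conn):
        self.in_use -= 1
        self._slots.release()

pool = BoundedPool(max_open=2)
a = pool.acquire(timeout=0.1)
b = pool.acquire(timeout=0.1)
try:
    pool.acquire(timeout=0.1)               # leaked connections were never released...
    exhausted = False
except TimeoutError:
    exhausted = True                        # ...so the next caller fails fast, locally
print(exhausted, pool.in_use)
```

Note that the failure is local to the leaking application: the database itself never sees more than `max_open` connections from this pool.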
Problem Statement-3 (Proper Resource Usage. Solved using max-idle-connections)
Going back to the contractual-employee example: we recruited 100 permanent employees thinking that the organisation gets abundant work, keeping all 100 employees busy all the time.
Consider a lockdown period where the number of tasks drops below the number of employees. Though many employees remain idle, the employer still keeps paying them. So the company's HR thinks of a way to limit the number of idle employees and puts in a constraint: at any given point in time there shouldn't be more than 20 idle employees.
When there are fewer tasks, we have to eliminate as much fixed cost as possible. The same is the case with MySQL connections.
We at GoIbibo receive a huge number of requests during the day but very few during the night. So is there any point in keeping the same number of open connections during the day and at night? The answer is no. With the help of max-idle-connections, we can tell the client SDK what the desired number of idle connections is.
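Extending the earlier pool sketch, the idle cap can be modelled as a bounded idle queue: connections released when the queue is full are simply closed. Again, `IdleCappedPool` is an illustrative name, not a real client API:

```python
import queue

class Conn:
    """Stand-in connection; close() models freeing the server-side resources."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class IdleCappedPool:
    """Keep at most max_idle warm connections, mirroring a max-idle-connections flag."""
    def __init__(self, max_idle):
        self._idle = queue.Queue(maxsize=max_idle)

    def acquire(self):
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            return Conn()

    def release(self, conn):
        try:
            self._idle.put_nowait(conn)     # keep it warm for the next burst of traffic
        except queue.Full:
            conn.close()                    # idle cap reached: shed the extra connection

pool = IdleCappedPool(max_idle=20)
daytime = [pool.acquire() for _ in range(100)]  # peak: 100 connections in flight
for conn in daytime:
    pool.release(conn)                          # midnight: everything goes quiet
idle_kept = sum(not c.closed for c in daytime)
print(idle_kept)                                # 20 stay warm, the other 80 are closed
```

During the day the pool can still grow to serve peak traffic; the cap only limits how many connections sit idle once traffic drops.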
Problem Statement-4 (Connection Lifetime)
Now we have limited the maximum number of idle connections from the application side, but databases don't trust application developers. MySQL has its own time limit on how long a connection can stay idle.
Let's say the application sets max idle connections to 20, and 10 of those connections aren't used for 28800 seconds (the wait_timeout value). MySQL then erases these connections on the server side without letting the application know.
On the next database-application handshake, the application presents an idle connection that MySQL has already erased, resulting in an INVALID CONNECTION error.
Holy moly, how do we solve this?
Here the mysql-connection-lifetime property comes to our rescue. If we set mysql-connection-lifetime to a value lower than wait_timeout, the client closes and recreates connections on that schedule, so no connection ever sits idle longer than the wait_timeout period.
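The lifetime check can be sketched as a pool that discards connections older than the configured lifetime before handing them out. This is an illustrative model (a `LifetimePool` with a 50 ms lifetime for demo purposes); real lifetimes are minutes to hours, kept below the server's wait_timeout:

```python
import time
import queue

class Conn:
    """Stand-in connection that remembers when it was created."""
    def __init__(self):
        self.born = time.monotonic()
        self.closed = False
    def close(self):
        self.closed = True

class LifetimePool:
    """Discard idle connections older than max_lifetime before reuse,
    so nothing older than the server's wait_timeout is ever handed out."""
    def __init__(self, max_lifetime):
        self.max_lifetime = max_lifetime
        self._idle = queue.Queue()

    def acquire(self):
        while True:
            try:
                conn = self._idle.get_nowait()
            except queue.Empty:
                return Conn()               # nothing fresh is idle: open a new one
            if time.monotonic() - conn.born < self.max_lifetime:
                return conn                 # still within its lifetime: safe to reuse
            conn.close()                    # expired: the server may have dropped it already

    def release(self, conn):
        self._idle.put(conn)

pool = LifetimePool(max_lifetime=0.05)      # 50 ms for the demo; real values are much larger
stale = pool.acquire()
pool.release(stale)
time.sleep(0.1)                             # the connection sits idle past its lifetime
fresh = pool.acquire()                      # the stale one is quietly recycled
print(stale.closed, fresh is stale)
```

Because expired connections are recycled on the client side before the server's wait_timeout fires, the application never tries to use a connection the server has already dropped.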
Problem Statement-5 (The cost of creating a new connection)
As promised at the start of this article, I am going to detail the cost of creating a connection.
I keep hearing people say that creating a new connection on the fly is costly, but I never understood what "costly" meant until I experienced it live.
Our APIs are near real-time, with an SLA of 20 milliseconds that covers a ton of database calls as well as request processing. So we built a custom tracker module that tracks the execution time of each component; if the API takes more than 20 ms overall, it logs the individual component execution times to the log aggregator (Sumologic in our case).
Once we integrated this logging module, we kept seeing a couple of requests to MySQL taking more than 45 ms. We then executed the same queries on our production databases, which responded in less than 2 ms.
Even if we consider the network a bottleneck, it shouldn't take more than 10 ms. We debugged for days and had finally given up, until one day I found a pattern in the timeouts.
Between any two timeouts our log aggregator reported, there was at least a 3-minute gap, the same period I had set for mysql-connection-lifetime. That means every 3 minutes the connections were getting refreshed, and in the phase of creating new connections after the 3-minute expiry, the client was exceeding the stated SLA; in our case, creating a new connection took around 30–35 ms. Once we extended the wait_timeout, we have not seen the same problem since.
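The incident can be reproduced in miniature with a simulation. The 30 ms handshake cost and the 20 ms SLA are the numbers from our incident; the 500 ms lifetime (standing in for 3 minutes) and the `handle_request` helper are assumptions for the demo:

```python
import time

CONNECT_COST = 0.030      # ~30 ms handshake cost, as measured in the incident above
LIFETIME     = 0.5        # connection refreshed every 500 ms (stands in for 3 minutes)
SLA_MS       = 20

conn_born = None

def handle_request():
    """Serve one request, reconnecting first if the pooled connection has expired."""
    global conn_born
    start = time.perf_counter()
    if conn_born is None or time.monotonic() - conn_born > LIFETIME:
        time.sleep(CONNECT_COST)            # simulated handshake: TCP + auth + setup
        conn_born = time.monotonic()
    # ... the actual query would run here and take ~2 ms ...
    return (time.perf_counter() - start) * 1000   # request latency in ms

latencies = [handle_request() for _ in range(10)]
breaches = [ms for ms in latencies if ms > SLA_MS]
print(len(breaches))      # only the request that pays for a new connection breaches the SLA
```

This is exactly the pattern we saw in production: a latency spike at every lifetime boundary, invisible in query-level timings because the query itself was never the slow part.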