Deep dive on goroutine leaks and best practices to avoid them

One of the fantastic things about Go is the ease with which we can run concurrent tasks using goroutines and channels. But using them in a production environment without properly understanding how they behave can have serious consequences.

We ran into exactly this: a goroutine leak made our application server bloat over time, burning CPU and triggering frequent GC pauses that hurt the SLAs of multiple APIs.

What to expect from this article

  • Understand what a goroutine leak is.
  • Understand multiple ways in which a goroutine can be leaked.
  • Details on one real-world scenario that caused a goroutine leak.
  • How we tracked down the goroutine leak.
  • Best practices to prevent goroutine leaks.

As you can see in the metrics attached above, the number of goroutines kept climbing over time. The only times it came down were when AWS reclaimed our spot instances and new instances were started, or when a new release killed the existing containers and spawned new ones.

If you observe the GC pause time, it keeps increasing with the number of active goroutines. The more GC pauses there are, the higher the CPU utilization and the higher the response times.

Coming back to the issue, what is a goroutine leak?

A goroutine leak happens when we spawn a goroutine to do some async task and it ends up blocked forever on a channel operation, so it never exits. This typically happens in one of the following ways:

  • Nothing ever receives from the channel the goroutine writes to.
func newgoroutine(dataChan chan DataType) {
	data := makeNetworkCall()
	dataChan <- data // blocks until someone receives from dataChan
	return
}

func main() {
	dataChan := make(chan DataType)
	go newgoroutine(dataChan)
	// Some application processing (but we forgot to receive from dataChan)
	return
}

In the above scenario, the code completes execution successfully as if there were no issue at all. (Picture main here as a request handler in a long-running server; when a real main returns, the process exits and takes every goroutine with it.) What actually happens is that a dangling goroutine stays behind, holding on to memory and adding work for the garbage collector.
Why? The send (dataChan <- data) in newgoroutine is the culprit. A send on an unbuffered channel blocks until a receiver takes the message from that channel. Since this channel has no receiver, the return right after the send never executes, and newgoroutine stays stuck for the lifetime of the application.
  • There is some conditional logic between the goroutine start and channel listener.
// Same example as above, with the flow tweaked a bit
func newgoroutine(dataChan chan DataType) {
	data := makeNetworkCall()
	dataChan <- data
	return
}

func main() {
	dataChan := make(chan DataType)
	go newgoroutine(dataChan)
	// Some application processing
	if processingError != nil {
		return // early exit: the receive below is never reached
	}
	data := <-dataChan
	// Do something with data
	return
}
This is a slight improvement: we now have a consumer receiving from dataChan. But between spawning the goroutine and receiving from the channel there is a lot of application code, and any of it can exit the function early (a processing error, a DB error, a nil pointer dereference, a panic), in which case the receive never runs. The goroutine is then left dangling and leaks. We can't simply move the receive to the top, before the application processing, because the receive would block the processing until the data arrives, which defeats the purpose of running the task concurrently.
  • The forgotten sender
The two cases above are about a goroutine blocked on a send, either because there is no receiver at all or because the code path containing the receiver is skipped. Can the same thing happen the other way round, when we pass a channel to a goroutine that receives from it and something goes wrong before the sender sends? Yes, the goroutine is left dangling in this case too. For example:
func newgoroutine(dataChan chan DataType) {
	// Receive data from dataChan
	data := <-dataChan
	// Do some processing on the data
	return
}

func main() {
	dataChan := make(chan DataType)
	go newgoroutine(dataChan)
	data, err := makeNetworkCall()
	if err != nil {
		return
	}
	dataChan <- data // Never executed when makeNetworkCall fails, which leaves newgoroutine dangling
	// Do something with data
	return
}
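
The second scenario can be reproduced with a small self-contained sketch. The names fetch, handle and leakedBy are made up for illustration, the value 42 stands in for the network response, and the short sleep is only an assumption to let finished goroutines exit before they are counted.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

// fetch stands in for the network call; it sends its result on dataChan.
func fetch(dataChan chan int) {
	dataChan <- 42 // unbuffered: blocks until someone receives
}

// handle mimics scenario 2: the receive is skipped when processing fails.
func handle(fail bool) (int, error) {
	dataChan := make(chan int)
	go fetch(dataChan)

	if fail {
		// Early return: nobody will ever receive, so fetch stays blocked.
		return 0, errors.New("processing error")
	}
	return <-dataChan, nil
}

// leakedBy calls handle n times and returns how many extra goroutines
// are still alive afterwards.
func leakedBy(n int, fail bool) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		handle(fail)
	}
	time.Sleep(50 * time.Millisecond) // let finished goroutines exit first
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("left behind on success path:", leakedBy(100, false))
	fmt.Println("left behind on error path:  ", leakedBy(100, true))
}
```

Every call that takes the error path leaves one goroutine behind, which is why the count only ever grows in a long-running server.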

The vast majority of goroutine leaks come down to one of these 3 cases, and in our case it was Scenario-2.

We at GoIbibo-Makemytrip work on the Discounting and Convenience Fee Service.

When a customer applies a promo code, we execute a set of rules to figure out the right discount. We also have another microservice, which we call the Realtime Dynamic Discounter (DD), that tries to compute discounts based on some algorithms (a black box to us).

This dynamic discount is an A/B experiment in which only 10% of the users take part. The override involving the DD discount applies only if our static rules produce a valid discount.

Very rough pseudo-code of what we do:

func loadDDDiscount(ddChan chan DataType) {
	ddRequest := formDDRequest()
	response := callDDService(ddRequest)
	ddChan <- response // blocks until ApplyPromo receives
}

func ApplyPromo() (discount int, err error) {
	ddChan := make(chan DataType)
	if ddEnabledRequest {
		go loadDDDiscount(ddChan) // fetch the DD discount concurrently
	}
	discount, err = validateStaticRules()
	// In reality there is a lot of processing here, with multiple error returns
	if err != nil {
		return 0, err // ddChan is never read on this path
	}
	if ddEnabledRequest {
		ddDiscount := <-ddChan
		discount = overrideDiscount(discount, ddDiscount)
	}
	return discount, nil
}

We need the response from DD only when we are done with processing the static rules. So consumption from the ddChan will only be done at the end.

If there is an issue with static rule evaluation, if no valid rule satisfies the request, or if the user applies some dummy promo, the code where we receive from ddChan is never reached, which leaves loadDDDiscount as a dangling goroutine.

So what are the approaches to solve this problem?

Approach-1

  • Approach -> Identify every exit point between starting the goroutine and receiving from the channel, and place a receive before each of those return statements just to unblock the spawned goroutine.
  • Pitfall -> We have to find all the edge cases manually, and whenever a new error condition is added in the future we have to remember which channels need to be drained before returning. Error-prone.

Approach-2

  • Approach -> Instead of placing a receiver at every error return, have a deferred function receive the data from the channel.
  • Pitfall -> On success, the data has already been read from the channel after processing the static rules. A second receive in the deferred function then has nothing to read and blocks the calling goroutine. Faulty solution.
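
The pitfall of Approach-2 is easy to reproduce. In the sketch below, applyWithDeferredDrain and returnsInTime are hypothetical helpers, the value 42 stands in for the DD response, and the timeout is an arbitrary bound chosen for the demo. On the failure path the deferred receive unblocks the worker; on the success path the value has already been consumed, so the deferred receive waits forever.

```go
package main

import (
	"fmt"
	"time"
)

// timeout is an arbitrary bound chosen for this demo.
const timeout = 200 * time.Millisecond

// applyWithDeferredDrain mimics Approach-2: a deferred receive is meant to
// unblock the worker goroutine whenever the function returns early.
func applyWithDeferredDrain(fail bool) int {
	ch := make(chan int) // unbuffered, as in the original code
	go func() { ch <- 42 }()

	defer func() { <-ch }() // runs on every return path

	if fail {
		return 0 // early return: the deferred receive frees the worker
	}
	// Success path: the value is consumed here, so the deferred
	// receive has nothing left to read and blocks forever.
	return <-ch
}

// returnsInTime reports whether applyWithDeferredDrain finishes before
// the timeout expires.
func returnsInTime(fail bool) bool {
	done := make(chan struct{})
	go func() {
		applyWithDeferredDrain(fail)
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	fmt.Println("failure path returns in time:", returnsInTime(true))
	fmt.Println("success path returns in time:", returnsInTime(false))
}
```

Note that the demo itself leaves the stuck call behind as a blocked goroutine, which is exactly the kind of hang this approach would introduce into the request path.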

Approach-3

  • Approach -> The approach that worked for us, with little to no code change. In all the scenarios above we create an unbuffered channel, which blocks the sender until a receiver takes the data. The real problem is that we can't be sure the receiving code will ever run, because of our application processing. The simple solution is to create a buffered channel with capacity 1. The sender can then write its single value without blocking, even if no consumer exists yet or the consumer code is never reached.
  • Pitfalls -> None that we ran into, as long as the goroutine sends exactly one value. The consumer can still receive the value at any point, and the spawned goroutine finishes without waiting for it. One thing to keep in mind: on the early-return paths the goroutine still runs to completion and its result is simply discarded.
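
A minimal sketch of the difference, assuming a worker that sends exactly one value. spawnAndAbandon and leakedAfter are illustrative names, not part of our service; the only thing that changes between the two runs is the channel capacity.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// spawnAndAbandon starts a worker that sends one value on a channel with
// the given capacity, then returns without receiving (the early-return case).
func spawnAndAbandon(capacity int) {
	ch := make(chan int, capacity)
	go func() {
		// capacity 0: blocks forever; capacity 1: completes immediately.
		ch <- 42
	}()
}

// leakedAfter runs spawnAndAbandon n times and returns how many extra
// goroutines are still alive once the workers have had time to run.
func leakedAfter(n, capacity int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		spawnAndAbandon(capacity)
	}
	time.Sleep(100 * time.Millisecond) // give the workers time to finish
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("unbuffered channel, left behind:", leakedAfter(50, 0))
	fmt.Println("buffered channel,   left behind:", leakedAfter(50, 1))
}
```

With capacity 1 the send completes even though nobody receives, the worker exits, and the abandoned channel (along with the value sitting in its buffer) becomes garbage once nothing references it.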

We took the changes to production with the third approach and you can see a significant impact.

What used to be a steadily increasing number of goroutines came down to 150, and our GC pause frequency dropped along with it.

The most painful part of the whole exercise was finding the piece of code where the goroutine leak lives.

There are packages like https://github.com/uber-go/goleak that help you find goroutine leaks, but I found it difficult to debug our leak using the package. So here is my approach.

  • When the server starts, disable the garbage collector using debug.SetGCPercent(-1).
  • Run every flow in the code where a goroutine is used (in the dev environment).
  • At the entry point of each API, print the number of running goroutines before and after executing the API.
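func ApplyPromo() {
	fmt.Println(runtime.NumGoroutine())
	// Wrap the call in a closure so the count is taken on exit,
	// not at the moment the defer statement is evaluated
	defer func() { fmt.Println(runtime.NumGoroutine()) }()
	// Process your application logic
}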
  • If an API reports a different goroutine count before and after, there is a leak in that flow.
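
The before/after check can also be wrapped in a small helper so that each API needs only one line. This is a sketch under assumptions: trackGoroutines, leakyHandler and cleanHandler are hypothetical names, and the 50 ms settle time is an arbitrary choice to let goroutines that are about to finish exit before the second count is taken.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// trackGoroutines notes the goroutine count on entry and returns a function
// that prints and returns the difference on exit.
// Intended use: defer trackGoroutines("ApplyPromo")()
func trackGoroutines(name string) func() int {
	before := runtime.NumGoroutine()
	return func() int {
		time.Sleep(50 * time.Millisecond) // assumed settle time
		delta := runtime.NumGoroutine() - before
		fmt.Printf("%s: goroutine delta = %d\n", name, delta)
		return delta
	}
}

// leakyHandler abandons a worker that is blocked on an unbuffered send.
func leakyHandler() {
	defer trackGoroutines("leakyHandler")()
	ch := make(chan int)
	go func() { ch <- 1 }()
}

// cleanHandler receives the worker's value, so the worker can exit.
func cleanHandler() {
	defer trackGoroutines("cleanHandler")()
	ch := make(chan int)
	go func() { ch <- 1 }()
	<-ch
}

func main() {
	leakyHandler()
	cleanHandler()
}
```

A non-zero delta that shows up on every call of the same flow is the signal to look for; a one-off difference can also come from unrelated background goroutines starting or stopping.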

We have close to 20 APIs and around 35–40 places where we use goroutines for concurrency. Luckily, I was able to narrow down the leak within the first 3 iterations and found the flow where it exists.

Hope this experience helps you write concurrent code without leaking goroutines.

I will starve to death if you don’t feed me some code. Quora : https://www.quora.com/profile/Mourya-Venkat-1