2022-01-25
- The rabbit hole
- A quick review of the Go garbage collector
- Coming back to our code
- That's cool but wait a minute...
- Extra CPU cycles overhead
- Binary size
- Is the write barrier check race-free?
- Conclusion
Some weeks ago, I was looking at the assembly output of a particular code snippet to understand a missed optimization opportunity. While trying to make sense of how the many compiler layers transformed the program into the resulting output, I found something odd.
The rabbit hole
Let's look at the following Go program:
package main

var sink *int

func main() {
	foo := []int{1, 2, 3}
	sink = &foo[1]
}
Pretty simple, right?
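If you want to follow along, one way (among others) to dump the compiler's generated assembly is:
➜ go tool compile -S main.go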
Looking at that output, we can see ~precisely what the CPU will do when running the program. Here's the relevant part of the main function (cleaned up a bit):
L1 LEAQ type.[3]int(SB), AX
L2 CALL runtime.newobject(SB)
L3 MOVQ $1, (AX)
L4 MOVQ $2, 8(AX)
L5 MOVQ $3, 16(AX)
L6 LEAQ 8(AX), CX
L7 CMPL runtime.writeBarrier(SB), $0
L8 JNE main_pc82
L9 MOVQ CX, "".sink(SB)
L10 JMP main_pc94
main_pc82:
LEAQ "".sink(SB), DI
CALL runtime.gcWriteBarrierCX(SB)
main_pc94:
MOVQ 16(SP), BP
ADDQ $24, SP
RET
Here's a quick explanation of the code above:
- L1-L2: We create a new array object of size three.
- L3-L5: We move the values 1, 2, and 3 into each array slot.
- L6: We save &foo[1] in CX.
- L7-L8: ???
- L9: We store CX in sink.
The above doesn't seem surprising, since most of it maps directly to what the original Go program does. But as you might notice, there are some additional instructions (L7-L8) just before the assignment to the sink global variable. So why is that code there, when it seems unrelated to our program logic?
A quick review of the Go garbage collector
Go is a garbage-collected language. The Go garbage collector is a state-of-the-art piece of engineering and a key part of Go's success. It's mainly designed to achieve low latency, avoiding long stop-the-world (STW) pauses.
The Go GC is a concurrent, non-generational, tri-color mark-and-sweep garbage collector. Many articles explain how this works in detail, so it isn't my goal to do a deep dive here.
A TL;DR is that you can imagine all allocated objects in the heap as nodes in a graph, where each node can reference other nodes (e.g., a struct has a pointer to some other Go object). When running a garbage collection, you start with the roots ("entry points" of references in memory) and walk the graph in all possible directions. While walking, you paint each node using three colors (the mark phase). At the end of the algorithm, any node that was never painted is unreachable and safe to sweep.
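As a rough sketch of the idea (not the runtime's implementation, which is concurrent and far more sophisticated), tri-color marking could look like this in Go, with all names made up for illustration:
package main

import "fmt"

// A sketch of the tri-color idea only; the real GC is concurrent,
// uses bitmaps and per-P work buffers, and is far more involved.

type Color int

const (
	White Color = iota // not seen yet; candidate for sweeping
	Grey               // seen, but its children haven't been scanned
	Black              // seen, and all its children scanned
)

type Object struct {
	color Color
	refs  []*Object // references to other heap objects
}

// mark walks the graph from the roots, painting nodes as it goes.
func mark(roots []*Object) {
	var worklist []*Object
	for _, r := range roots {
		r.color = Grey
		worklist = append(worklist, r)
	}
	for len(worklist) > 0 {
		obj := worklist[len(worklist)-1]
		worklist = worklist[:len(worklist)-1]
		for _, child := range obj.refs {
			if child.color == White {
				child.color = Grey
				worklist = append(worklist, child)
			}
		}
		obj.color = Black // fully scanned
	}
}

func main() {
	a, b, c := &Object{}, &Object{}, &Object{}
	a.refs = []*Object{b} // c is never referenced
	mark([]*Object{a})
	fmt.Println(a.color, b.color, c.color) // 2 2 0: c stays White, safe to sweep
}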
The non-generational part of the definition above stresses the contrast with other garbage collectors, which use generations of objects to prioritize what gets analyzed first in the next GC round. For example, if a particular node in the graph has survived three rounds of GC, it's probably a poor candidate to prioritize in the next round.
Walking the complete graph can take a lot of time, and this is where the concurrent part becomes important. The Go GC tries, as much as possible, to run this algorithm while allowing the program to do useful work. And that's where things get tricky.
If you allow the program to change memory references during the mark phase, many things can go wrong. For example, suppose the GC has already marked A as reachable, and the program suddenly does A.foo = bar. If bar is only reachable through A, then since A was already scanned, we'll never color bar as reachable!
The most important thing to remember is that, since the GC does its work while allowing program code to execute concurrently, it needs a way to keep track of when that code changes references. This is precisely what write barriers are for!
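Conceptually, and only conceptually, a write barrier is a small hook around pointer stores. Here's a hedged Go sketch; the names (gcMarking, shade, writePointer) are invented for illustration, and the runtime's actual barrier is different (and partly written in assembly):
package main

type Object struct {
	marked bool
	refs   []*Object
}

var gcMarking bool     // analogous to runtime.writeBarrier being non-zero
var greyList []*Object // objects pending a scan by the marker

// shade makes sure the marker will eventually visit ptr.
func shade(ptr *Object) {
	if ptr != nil && !ptr.marked {
		greyList = append(greyList, ptr)
	}
}

// writePointer is what a pointer store conceptually becomes while a
// collection is in progress: shade the new target, then do the store.
func writePointer(slot **Object, ptr *Object) {
	if gcMarking {
		shade(ptr) // don't let the marker miss the new reference
	}
	*slot = ptr
}

func main() {
	var slot *Object
	gcMarking = true
	writePointer(&slot, &Object{})
}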
Coming back to our code
Hopefully, now things will start to click.
Let's zoom in again on the hand-wavy part of our original snippet:
L7 CMPL runtime.writeBarrier(SB), $0
L8 JNE main_pc82
L9 MOVQ CX, "".sink(SB)
L10 JMP main_pc94
main_pc82:
LEAQ "".sink(SB), DI
CALL runtime.gcWriteBarrierCX(SB)
main_pc94:
MOVQ 16(SP), BP
ADDQ $24, SP
RET
Since sink and foo in our Go code are variables living in the heap, the compiler needs to inject this write barrier conditional check so it knows whether it should assist the GC in painting this change of memory reference correctly.
In L7 we check whether the write barrier is enabled. Recall that this extra work is only necessary if the GC is running and in a stage where it allows external code to run concurrently. To know that, we check the runtime.writeBarrier global variable:
- If it's 0, we simply do sink = &foo[1]. No extra work.
- If it isn't 0, the write barrier is enabled, so we jump to main_pc82 and call a runtime function that does the marking assist for the running GC. Note that this function also does the sink = &foo[1] store itself.
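Expressed as a runnable Go analogy (the names writeBarrierEnabled and gcAssist are made up; only the shape matches the assembly), the injected check is roughly:
package main

var sink *int

var writeBarrierEnabled int32 // stands in for runtime.writeBarrier

// gcAssist stands in for runtime.gcWriteBarrierCX: it performs the
// store and would additionally tell the GC about the new reference.
func gcAssist(slot **int, ptr *int) {
	// ...mark/shade bookkeeping would go here...
	*slot = ptr
}

func main() {
	foo := []int{1, 2, 3}
	if writeBarrierEnabled != 0 { // CMPL runtime.writeBarrier(SB), $0
		gcAssist(&sink, &foo[1]) // CALL runtime.gcWriteBarrierCX(SB)
	} else {
		sink = &foo[1] // MOVQ CX, "".sink(SB)
	}
}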
Makes sense!
That's cool but wait a minute...
Write barriers are great, since they allow the garbage collector to run concurrently with our program logic, but there's no free lunch. So let's explore some possible drawbacks.
Extra CPU cycles overhead
Whenever a relevant memory reference is changed in our code, the CPU executes two additional instructions (L7 & L8), whether or not the GC is running.
Executing two extra instructions is negligible if the GC isn't running. Current CPUs are superscalar and have branch predictors, and this branch looks hard to mispredict, so it seems like an ~almost free overhead.
If the GC is running, we must take the jump and execute main_pc82. Without getting into details, we can appreciate that this logic is highly optimized: it's written in assembly! For the amd64 architecture, you can see it in the runtime·gcWriteBarrierCX<ABIInternal>() and runtime·gcWriteBarrier<ABIInternal>() implementations. It isn't negligible, but it's probably close to the best it can be.
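If you want to get a rough feeling for the cost of the check itself (the barrier is almost never actually enabled during a run like this), a micro-benchmark along these lines compares a pointer store, which carries the check, with a scalar store, which doesn't. The sink names are made up; take any numbers with a grain of salt, since results depend heavily on the CPU and surrounding code:
package main

import "testing"

// Save as barrier_test.go and run `go test -bench=.`.

var ptrSink *int // pointer store: the compiler injects the barrier check
var intSink int  // scalar store: no write barrier needed

func BenchmarkPointerStore(b *testing.B) {
	v := new(int)
	for i := 0; i < b.N; i++ {
		ptrSink = v
	}
}

func BenchmarkScalarStore(b *testing.B) {
	for i := 0; i < b.N; i++ {
		intSink = i
	}
}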
Binary size
Including two additional CPU instructions in many places adds to the binary size. Of course, two instructions is a small addition, but considering they're injected at every spot where a heap memory reference is changed, maybe it isn't negligible.
I thought that modifying the compiler source code to avoid injecting these write barriers would be a good way to compare binary sizes. Of course, this is a terrible idea for any program that runs garbage collection, since write barriers are necessary for correctness.
I'm in my early days of messing around with the Go compiler, so after some exploration, I found that there's a compiler flag to disable write barriers:
WB bool "help:\"enable write barrier\"" // TODO: remove
The TODO comment is probably a good signal that we're getting into the weeds here, so don't count much on this flag in the future. ;)
So, I compiled our code snippet with go build -gcflags='-wb=false':
LEAQ type.[3]int(SB), AX
CALL runtime.newobject(SB)
MOVQ $1, (AX)
MOVQ $2, 8(AX)
MOVQ $3, 16(AX)
LEAQ 8(AX), CX
MOVQ CX, "".sink(SB)
MOVQ 16(SP), BP
ADDQ $24, SP
RET
The write barrier was removed!
Let's count the number of write barrier calls with both values of -wb:
➜ go build -gcflags='-wb=true' -o main main.go && objdump -d main | grep 'writeBarrier>' | wc -l
850
➜ go build -gcflags='-wb=false' -o main main.go && objdump -d main | grep 'writeBarrier>' | wc -l
849
That makes sense, but we still have 849 other write barrier calls. This happens because, apparently, the -wb flag only applies to the code in our program, not to dependencies or the runtime itself.
Let's go further and create a customized Go compiler that directly skips the SSA pass that inserts write barriers. This compiler would remove write barriers from all places!
The easiest way is to do the same as the -wb=false flag, but simply force it in the compiler code itself. This would have the added benefit that the runtime.gcWriteBarrier<CPU-register> helpers seen before would probably become dead code.
Concretely, I did an early return false in this line. When compiling the compiler, I got an interesting surprise:
Building Go cmd/dist using /usr/local/go. (go1.17 linux/amd64)
Building Go toolchain1 using /usr/local/go.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
runtime: pointer 0xc00071c018 to unused region of span span.base()=0xc000376000 span.limit=0xc000378000 span.state=1
runtime: found in object at *(0xc000366000+0x208)
object=0xc000366000 s.base()=0xc000366000 s.limit=0xc000368000 s.spanclass=76 s.elemsize=2048 s.state=mSpanInUse
*(object+0) = 0x7c6900
*(object+8) = 0xc00000c270
*(object+16) = 0x0
*(object+24) = 0x0
*(object+32) = 0x0
*(object+40) = 0x0
*(object+48) = 0x0
*(object+56) = 0x0
*(object+64) = 0x0
*(object+72) = 0x0
*(object+80) = 0x0
*(object+88) = 0x0
*(object+96) = 0x0
*(object+104) = 0x0
...
Oops, what a beauty :). Something went wrong. After thinking about it for a moment, the problem becomes pretty clear: we're removing write barriers, so the GC will blow up as soon as it runs. Since we're removing write barriers, we shouldn't allow the GC to run at all. For that, we can compile with GOGC=off, which solved this problem!
Now we have our tweaked compiler that never includes write barriers. But, of course, we should always run any binary compiled by this compiler with the GOGC=off environment variable set, to avoid panics.
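(As an aside: if you control the program's source, the same effect as GOGC=off can be achieved programmatically with runtime/debug, which is harder to forget than an environment variable:)
package main

import "runtime/debug"

func main() {
	// Equivalent to running with GOGC=off: disables the garbage
	// collector until SetGCPercent is called with a positive value.
	debug.SetGCPercent(-1)
	// ...program logic...
}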
Let's try to compare again with our original toy program:
➜ go build -gcflags='-wb=true' -o main main.go && objdump -d main | grep 'writeBarrier>' | wc -l
850
➜ GOGC=off /home/ignacio/code/go/bin/go build -o main main.go && objdump -d main | grep 'writeBarrier>' | wc -l
15
That looks better! We still have some gcWriteBarrier calls left, but I suspect that's related to how the toolchain gets built: the original go compiler is part of the bootstrap pipeline, and that compiler still enables write barriers. We could use our modified compiler to recompile itself, but these 15 calls won't add much.
Comparing the sizes of both binaries, I see they're the same. My feeling is that code alignment in the binary offsets the gains (I'd like to know if that's correct). So let's do the same write barrier experiment with kubectl, a bigger program.
➜ go build -o kubectl_on ./cmd/kubectl && wc -c < kubectl_on
76182368
➜ GOGC=off /home/ignacio/code/go/bin/go build -o kubectl_off ./cmd/kubectl && wc -c < kubectl_off
74180160
2MB (2.6%) smaller! This size difference probably owes more to dead code elimination of the GC's write barrier machinery than to the concrete write barrier code injected where memory references change in the program. The binary appears to work correctly: GOGC=off ./kubectl_off.
If you're running a program that doesn't need the GC, you can leverage this idea to save some bytes in the binary and probably gain some performance. How much? It depends, so you should measure.
Is the write barrier check race-free?
I found the CMPL runtime.writeBarrier(SB), $0 instruction rather surprising. If the write barrier is enabled and disabled by the GC goroutine, why is this unguarded CMPL race-free?
From the Go memory model perspective, I'd expect some happens-before relationship to be established somewhere else for every write barrier check.
My gut feeling was that an STW point in the runtime should be the synchronization point ensuring the runtime.writeBarrier global variable is synchronized across all CPU caches. Unfortunately, it took me a while to find the right spot in the code, despite it looking obvious now:
// Enter concurrent mark phase and enable
// write barriers.
//
// Because the world is stopped, all Ps will
// observe that write barriers are enabled by
// the time we start the world and begin
// scanning.
That's it! Since all Ps are stopped, they're guaranteed to see the correct value of runtime.writeBarrier when they start doing work again.
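As a userland analogy (and only an analogy), the same unguarded flag pattern in ordinary Go code is a data race, because user code has no STW point to piggyback on. The race detector will say so:
package main

import "time"

// Try `go run -race` on this: the unguarded flag check is flagged as
// a data race. The runtime gets away with the equivalent pattern
// because the flag only flips while the world is stopped, so every P
// observes the new value before resuming.

var barrierEnabled bool // plays the role of runtime.writeBarrier

func main() {
	go func() { barrierEnabled = true }() // the "GC" flips the flag

	for !barrierEnabled { // the unguarded check, like the CMPL
		time.Sleep(time.Millisecond)
	}
}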
Conclusion
As usually happens, you can find something exciting to dive into even in the most naive code snippet.
I hope I haven't convinced anyone that disabling write barriers is a good idea for increasing performance or reducing binary sizes. But, on the other hand, in some wild cases it might be worth doing this, if saving nanoseconds or bytes is meaningful in any way.
I've gone down this rabbit hole by reading the Go source code, trying to make sense of the pieces. There's a reasonable chance that some things here aren't accurate. If you find that's the case, get in touch and let me know!