Optimizing Go programs with AVX2 using Auto-Vectorization in LLVM

I found an interesting article about optimizations for Arrow Go. It looks like this method can be applied to a wide variety of Go projects that need arithmetic vector operations.

The optimization is clearly explained in the above article. Here I give a more detailed introduction and describe how I fixed the problems I ran into. After reading this article, you will know the following:

  1. What SIMD is
  2. How to use SSE from C programs with intrinsics
  3. What AVX2 is and why it is faster than SSE
  4. How Auto-Vectorization in LLVM works
  5. How to convert x64 assembly into Go Plan9 Assembly to reduce overhead

By applying the method I will explain, your Go functions might become 10 times faster than the same functions written in pure Go. For example, the benchmark scores for summing all the float64 values in an array are below.

The source code is available at https://github.com/c-bata/sample-c2goasm

What is SIMD?

As the name implies, SIMD (Single Instruction, Multiple Data) is a method of processing multiple data elements with a single instruction. Compared to MIMD (Multiple Instruction, Multiple Data), which requires a mechanism to supply different instructions to each processor core, SIMD allows processors to be designed with a smaller area because it requires fewer transistors. So most CPUs and GPUs support SIMD operations (sorry, I don’t cover SIMT here). MMX, SSE and AVX are SIMD instruction sets on Intel CPUs.
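To make the idea concrete, here is a minimal sketch in plain Go that models the difference between element-by-element addition and lane-wise addition. The names addScalar, addLanes and the width of four lanes are assumptions made for illustration. The Go compiler is not expected to emit SIMD instructions for this code; it only mimics the semantics of one instruction handling several elements.

```go
package main

import "fmt"

// lanes is a hypothetical vector width: four float32 values per step.
const lanes = 4

// addScalar adds two slices one element per loop iteration.
// It assumes a and b are at least as long as dst.
func addScalar(dst, a, b []float32) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

// addLanes models a SIMD add: every loop step handles a whole group of
// `lanes` elements. On SIMD hardware the four additions in the loop body
// would be performed by a single instruction.
// It assumes a and b are at least as long as dst.
func addLanes(dst, a, b []float32) {
	n := len(dst) - len(dst)%lanes
	for i := 0; i < n; i += lanes {
		dst[i+0] = a[i+0] + b[i+0]
		dst[i+1] = a[i+1] + b[i+1]
		dst[i+2] = a[i+2] + b[i+2]
		dst[i+3] = a[i+3] + b[i+3]
	}
	// Scalar tail for the elements that do not fill a whole group.
	for i := n; i < len(dst); i++ {
		dst[i] = a[i] + b[i]
	}
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6}
	b := []float32{10, 20, 30, 40, 50, 60}
	dst := make([]float32, len(a))
	addLanes(dst, a, b)
	fmt.Println(dst) // [11 22 33 44 55 66]
}
```

Both functions produce the same result; the only difference is how many elements are conceptually handled per step.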

Matrix addition using SSE instructions

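Because inline assembly is not available with every x64 compiler (MSVC, for example, does not support it when targeting x64), we can use intrinsics instead: functions provided by the compiler that expand into specific assembly instructions. Using intrinsics is also better from the viewpoint of portability.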

Let’s use SSE, one of the SIMD instruction sets of Intel CPUs. The header file xmmintrin.h declares the intrinsics that expand into SSE assembly, so include it and perform the matrix addition as follows.

SSE instructions operate on 128-bit registers. A float consumes 32 bits, so 4 elements can be calculated at the same time. The output of this is:

Matrix addition using AVX (AVX2)

In SSE, the SIMD registers are 128 bits wide, so only four float elements can be calculated at one time. AVX introduced 256-bit registers, which significantly improved arithmetic performance. Furthermore, AVX2, which was added later, extends the 256-bit operations to integer arithmetic as well as floating point.
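The element counts in this section follow from dividing the register width by the element width. The sketch below spells out that arithmetic in Go; lanesPerRegister is a hypothetical helper introduced only for this illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// lanesPerRegister returns how many elements of elemBytes bytes fit into
// a register that is registerBits bits wide. (Hypothetical helper.)
func lanesPerRegister(registerBits int, elemBytes uintptr) int {
	return registerBits / (int(elemBytes) * 8)
}

func main() {
	f32 := unsafe.Sizeof(float32(0)) // 4 bytes
	f64 := unsafe.Sizeof(float64(0)) // 8 bytes

	fmt.Println("SSE, float32:", lanesPerRegister(128, f32)) // 4
	fmt.Println("SSE, float64:", lanesPerRegister(128, f64)) // 2
	fmt.Println("AVX, float32:", lanesPerRegister(256, f32)) // 8
	fmt.Println("AVX, float64:", lanesPerRegister(256, f64)) // 4
}
```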

Basically, it is desirable to use AVX2 in environments where it is supported. Include immintrin.h when using the intrinsics that expand into AVX and AVX2 assembly.

In AVX, eight float elements can be calculated at one time, because 256 bits divided by 32 bits (a float consumes 4 bytes) is 8. Use __m256d if you want to use doubles instead of floats, and __m256i if you want to use integers. Note that a double-precision floating-point value uses 8 bytes, so only 4 elements can be handled at a time. Let’s compile the code:

Compilation fails. According to this article, the features supported by the CPU can be confirmed with the following command.

Certainly SSE, SSE2, SSE4.1, SSE4.2 and AVX1.0 are listed, but AVX2 is not. However, the CPU may still support AVX2: the features that appear in the output of the following command seem to be usable when a special compiler option is given.

It turns out that AVX2 can be used by passing a compiler option. According to this article, gcc accepts the -mavx2 option.

By checking the assembly file, we confirm that AVX2 instructions are used.

AVX-512
According to this article and Wikipedia, AVX-512 instructions can use 512-bit registers. It looks like we could obtain great performance, but most of the projects I’ve seen (e.g. simdjson) use AVX2. Also, AVX-512 doesn’t appear in the output of $ sysctl -a | grep machdep.cpu.leaf7_features on my machine, so I couldn’t use it. It may be worth trying if you want further performance improvements.

Auto vectorization of LLVM

One of the powerful features of LLVM’s optimizer is Auto-Vectorization. For example, the following C function will be optimized to use the SIMD instructions of the CPU.

Compile this with the clang compiler (see https://llvm.org/docs/Vectorizers.html for more details):

After executing this, check the assembly file:

There are instructions that use xmm0, so we can see that this program is optimized with SSE instructions. We get optimized code thanks to LLVM. In the next step, let’s call this assembly from Go.

Calling optimized x64 assembly with little overhead

cgo is the de facto standard tool for calling C functions from Go, but it is not a great solution from the viewpoint of performance (see “Why cgo is slow” @ CapitalGo 2018 on Speaker Deck). Converting x64 assembly to Go Plan9 Assembly with c2goasm is a good way to call C functions with little overhead. The article written by minio (creator of c2goasm) is below:

Before executing c2goasm, we need to define a Go function:

Note that the name of the Go function must be the subroutine name in the assembly file with a _ prefix added. In this case, the subroutine name is _sum_float64, so I named the Go function __sum_float64. After defining our Go function, it’s time to use c2goasm.

The Go Plan9 Assembly file should be named sum_float64.s if your Go program is named sum_float64.go.

OK! Let’s compare the performance with the following simple function written in pure Go:
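A pure-Go baseline of this kind can be as small as a single loop. The following is a minimal sketch; the name sumFloat64 is an assumption made for illustration and may differ from the one used in the sample repository.

```go
package main

import "fmt"

// sumFloat64 adds up all values sequentially, one element per iteration.
// The name is hypothetical; it stands in for the pure-Go baseline.
func sumFloat64(values []float64) float64 {
	var sum float64
	for _, v := range values {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(sumFloat64([]float64{1.5, 2.5, 3})) // 7
}
```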

The benchmark code is here. Let’s run it!
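For readers who want to measure a similar baseline without a separate _test.go file, the standard library’s testing.Benchmark can be called from an ordinary program. This is only a sketch under assumptions: sumFloat64, benchmarkSum and the array size are hypothetical choices, not the benchmark from the sample repository, and the timings depend on the machine.

```go
package main

import (
	"fmt"
	"testing"
)

// sumFloat64 is a hypothetical pure-Go baseline: a sequential sum.
func sumFloat64(values []float64) float64 {
	var sum float64
	for _, v := range values {
		sum += v
	}
	return sum
}

// sink keeps the result alive so the compiler cannot drop the call.
var sink float64

// benchmarkSum measures sumFloat64 on a slice of n elements.
func benchmarkSum(n int) testing.BenchmarkResult {
	values := make([]float64, n)
	for i := range values {
		values[i] = float64(i)
	}
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(n * 8)) // 8 bytes per float64, enables MB/s output
		for i := 0; i < b.N; i++ {
			sink = sumFloat64(values)
		}
	})
}

func main() {
	result := benchmarkSum(1 << 16)
	// Prints the iteration count, ns/op and MB/s for this machine.
	fmt.Println(result)
}
```

The same closure body can be moved into a regular BenchmarkXxx function and run with go test -bench when comparing against an assembly-backed implementation.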

OMG! Our program is 3 times slower than the pure Go function. I suspect the reason is that SSE instructions cannot speed up this program much. Because an SSE register is 128 bits wide, xmm0 can hold only 2 elements of a float64 array, so even a packed addition such as _mm_add_pd offers very little parallelism.

But AVX2 instructions can use 256-bit registers, so we can expect a performance improvement. I tried the following:

  • Add the -ffast-math option, because it was recommended by the output of the -Rpass-analysis=loop-vectorize option.
  • Modify the assembly file by hand so that it passes go build. Sometimes the generated Go Plan9 Assembly files are invalid.
  • Specify pragma hints for clang.
  • Change the clang version (the build failed with the clang installed by default on macOS) and the optimization level (e.g. O2 or O3).
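The -ffast-math item in the list above is related to ordering. Floating-point addition is not associative, so a compiler generally may not split a sequential sum into per-lane partial sums unless it is told that reordering is acceptable. The sketch below models this in plain Go with four accumulators, which stand in for the four float64 lanes of a 256-bit register. The function names and the input values are assumptions chosen to make the difference visible; this is not the code generated by clang.

```go
package main

import "fmt"

// sumSequential adds the values strictly from left to right.
func sumSequential(values []float64) float64 {
	var sum float64
	for _, v := range values {
		sum += v
	}
	return sum
}

// sumFourLanes keeps four independent partial sums, similar to how a
// vectorized reduction keeps one partial sum per lane, and combines
// them at the end.
func sumFourLanes(values []float64) float64 {
	var acc [4]float64
	n := len(values) - len(values)%4
	for i := 0; i < n; i += 4 {
		acc[0] += values[i]
		acc[1] += values[i+1]
		acc[2] += values[i+2]
		acc[3] += values[i+3]
	}
	total := acc[0] + acc[1] + acc[2] + acc[3]
	// Scalar tail for the elements that do not fill a whole group.
	for _, v := range values[n:] {
		total += v
	}
	return total
}

func main() {
	// 1e16 + 1 rounds back to 1e16 in float64, so the order matters.
	values := []float64{1e16, 1, 1, 1, -1e16, 1, 1, 1}
	fmt.Println(sumSequential(values)) // 3
	fmt.Println(sumFourLanes(values))  // 6
}
```

The two results differ even though both functions add the same numbers, which is why the vectorized form is only allowed when relaxed floating-point semantics are requested.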

As a result, it works well with the following compiler version and options. Other combinations of clang version and compiler options break with a segmentation fault or for other reasons. For example, if I just replace the O2 option with O3, the assembly is broken.

or using Clang 3.8.0 and the O3 option with pragma hints. The final benchmark is below:

The performance improved surprisingly. It’s 10 times faster than the function written in pure Go.

Conclusion

Huge performance improvements can be expected from this optimization method. But I ran into a lot of unexpected behavior while optimizing. For now, I suspect that one of the causes might be the use of the red zone on x64. I need to investigate why some compiler options break the program. If you know, please tell me!

Anyway, I want to use SIMD optimization in Go more and more.

Creator of go-prompt and kube-prompt. Optuna committer. Kubeflow/Katib reviewer. GitHub: c-bata