Optimizing Go programs with AVX2 using LLVM Auto-Vectorization

Mar 25, 2019


I found an interesting article about optimizations for Arrow Go. It looks like this method can be applied to a wide variety of Go projects that need arithmetic vector operations.

The optimization is clearly explained in the above article. Here I give a more detailed introduction and describe how I fixed the problems I bumped into. After reading this article, you will know the following:

  1. What is SIMD?
  2. How to use SSE from C programs using intrinsics
  3. What AVX2 is and why it is faster than SSE
  4. How LLVM's auto-vectorization works
  5. How to convert x64 assembly into Go Plan9 assembly to reduce overhead

By applying the method I explain here, your Go programs might run about 10 times faster than the equivalent function written in pure Go. For example, here are benchmark scores for summing all the float64 values in an array:

$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sample-c2goasm
BenchmarkSumFloat64_256-4 5000000 282 ns/op
BenchmarkSumFloat64_1024-4 1000000 1234 ns/op
BenchmarkSumFloat64_8192-4 200000 10021 ns/op
BenchmarkSumFloat64_AVX2_256-4 50000000 23.5 ns/op
BenchmarkSumFloat64_AVX2_1024-4 20000000 95.9 ns/op
BenchmarkSumFloat64_AVX2_8192-4 2000000 904 ns/op
ok github.com/c-bata/sample-c2goasm 10.911s

The source code is available at https://github.com/c-bata/sample-c2goasm

What is SIMD?


As the name implies, SIMD (Single Instruction, Multiple Data) is a method of processing multiple data elements with a single instruction. Compared to MIMD (Multiple Instruction, Multiple Data), which requires a mechanism to supply different instructions to each processor core, SIMD allows processors to be designed with a smaller area because it requires fewer transistors. That is why most CPUs and GPUs support SIMD operations (sorry, I don't cover SIMT here). MMX, SSE, and AVX are SIMD instruction sets on Intel CPUs.

Matrix addition using SSE instructions

Inline assembly is not available in every x64 compiler, so we use intrinsics instead: function-like macros that the compiler expands into specific assembly instructions. Intrinsics are also better from the viewpoint of portability.

Let's use SSE, one of the SIMD instruction sets of Intel CPUs. The header file xmmintrin.h contains the SSE intrinsics, so include it and perform the addition as follows.

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    __m128 a = {1.0f, 2.0f, 3.0f, 4.0f};
    __m128 b = {1.1f, 2.1f, 3.1f, 4.1f};
    float c[4];
    __m128 ps = _mm_add_ps(a, b); // add
    _mm_storeu_ps(c, ps);         // store

    printf(" source: (%5.1f, %5.1f, %5.1f, %5.1f)\n",
           a[0], a[1], a[2], a[3]);
    printf(" dest. : (%5.1f, %5.1f, %5.1f, %5.1f)\n",
           b[0], b[1], b[2], b[3]);
    printf(" result: (%5.1f, %5.1f, %5.1f, %5.1f)\n",
           c[0], c[1], c[2], c[3]);
    return 0;
}
SSE instructions operate on 128-bit registers. A float consumes 32 bits, so 4 elements can be calculated at the same time. The output of this program is:

$ gcc -o avx -Wall -O0 main.c
$ ./avx
source: ( 1.0, 2.0, 3.0, 4.0)
dest. : ( 1.1, 2.1, 3.1, 4.1)
result: ( 2.1, 4.1, 6.1, 8.1)

Matrix addition using AVX (AVX2)

In SSE, the SIMD registers are 128 bits, so for float data only four elements can be calculated at a time. AVX introduced 256-bit registers, which significantly improves arithmetic throughput. Furthermore, AVX2, added later, supports integer arithmetic as well as floating point.

Basically, it is desirable to use AVX2 wherever the environment supports it. Include immintrin.h to use the AVX and AVX2 intrinsics.

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    __m256 a = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    __m256 b = {1.1f, 2.1f, 3.1f, 4.1f, 5.1f, 6.1f, 7.1f, 8.1f};
    __m256 c;

    c = _mm256_add_ps(a, b);

    printf(" source: (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
           a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]);
    printf(" dest. : (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
           b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]);
    printf(" result: (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
           c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
    return 0;
}
In AVX, eight float elements can be calculated at a time, because 256 bits divided by 32 bits (a float consumes 4 bytes) gives eight lanes. Use __m256d if you want doubles instead of floats and __m256i if you want integers. Note that double-precision floating point uses 8 bytes, so only 4 elements can be handled at a time. Let's compile the code:

$ gcc -o avx2 -Wall main.c
main.c:9:9: error: always_inline function '_mm256_add_ps' requires target feature 'xsave', but would be inlined into function 'main' that is compiled without support for 'xsave'
c = _mm256_add_ps(a, b);
1 error generated.

Compilation failed. According to this article, the features supported by the CPU can be confirmed with the following command.

$ sysctl -a | grep machdep.cpu.features

Certainly SSE, SSE2, SSE4.1, SSE4.2, and AVX1.0 are listed, but AVX2 is not. However, the CPU may still support AVX2: the instructions listed by the following command can be used by passing a special compiler option.

$ sysctl -a | grep machdep.cpu.leaf7_features

AVX2 appears there, so it turns out AVX2 can be used by passing the right compiler option. According to this article, gcc supports the -mavx2 option.

$ gcc -o avx2 -mavx2 -Wall main.c 
$ ./avx2
source: ( 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
dest. : ( 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1)
result: ( 2.1, 4.1, 6.1, 8.1, 10.1, 12.1, 14.1, 16.1)

By checking the assembly file, we confirm that AVX2 instructions are used.

$ gcc -S -mavx2 -Wall -O0 avx2.c
$ less avx2.s
## %bb.0:
vmovaps LCPI0_0(%rip), %ymm0 ## ymm0 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00]
vmovaps %ymm0, 160(%rsp)

According to this article and Wikipedia, AVX-512 instructions can use 512-bit registers. We could expect even greater performance, but most of the projects I've seen (e.g. simdjson) use AVX2, and AVX-512 doesn't appear in the output of $ sysctl -a | grep machdep.cpu.leaf7_features on my machine. So I couldn't use it, but it may be worth trying if you want further performance improvements.

Auto vectorization of LLVM

One of the powerful features of LLVM's optimizer is auto-vectorization. For example, the following C function will be optimized using the CPU's SIMD instructions.

void sum_float64(double buf[], int len, double *res) {
    double acc = 0.0;
    for (int i = 0; i < len; i++) {
        acc += buf[i];
    }
    *res = acc;
}
Compile this with the clang compiler (see https://llvm.org/docs/Vectorizers.html for more details):

$ clang -S -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c sum_float64.c

After executing this, check the assembly file:

.section    __TEXT,__text,regular,pure_instructions
.build_version macos, 10, 14
.intel_syntax noprefix
.globl _sum_float64 ## -- Begin function sum_float64
.p2align 4, 0x90
_sum_float64: ## @sum_float64
## %bb.0:
vxorps xmm0, xmm0, xmm0
mov qword ptr [rsp + 40], rdi
mov dword ptr [rsp + 36], esi
mov qword ptr [rsp + 24], rdx
vmovsd qword ptr [rsp + 16], xmm0

There are instructions using xmm0, so we can see this program was optimized with SSE instructions. We got optimized code thanks to LLVM. In the next step, let's call this assembly from Go.

Calling optimized x64 assembly with little overhead

cgo is the de facto standard tool for calling C functions from Go. But it is not a great solution from the viewpoint of performance (see Why cgo is slow @ CapitalGo 2018 — Speaker Deck). Converting x64 assembly to Go Plan9 assembly with c2goasm is a good way to call C functions with only a small overhead. The article by minio (the creators of c2goasm) is below:

Before executing c2goasm, we need to define a Go function:

package c2goasm_sample

import "unsafe"

func __sum_float64(buf, len, res unsafe.Pointer)

func SumFloat64Avx2(a []float64) float64 {
    var (
        p1  = unsafe.Pointer(&a[0])
        p2  = unsafe.Pointer(uintptr(len(a)))
        res float64
    )
    __sum_float64(p1, p2, unsafe.Pointer(&res))
    return res
}
Note that the Go function name must be the assembly subroutine name with a _ prefix added. In this case, the subroutine name is _sum_float64, so I named the Go function __sum_float64. After defining our Go function, it's time to run c2goasm.

$ go get -u github.com/minio/asm2plan9s
$ go get -u github.com/minio/c2goasm
$ go get -u github.com/klauspost/asmfmt/cmd/asmfmt
$ c2goasm -a -f _lib/sum_float64.s sum_float64.s

The Go Plan9 assembly file should be named sum_float64.s if your Go file is named sum_float64.go.

OK! Let's compare the performance against the following simple function written in pure Go:

package c2goasm_sample

func SumFloat64(a []float64) float64 {
    var sum float64
    for i := range a {
        sum += a[i]
    }
    return sum
}
The benchmark code is in the repository linked above. Let's run it!

$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sandbox-go/c2goasm
BenchmarkSumFloat64_256-4 5000000 277 ns/op
BenchmarkSumFloat64_1024-4 1000000 1205 ns/op
BenchmarkSumFloat64_8192-4 100000 10401 ns/op
BenchmarkSumFloat64_AVX2_256-4 2000000 768 ns/op
BenchmarkSumFloat64_AVX2_1024-4 500000 2872 ns/op
BenchmarkSumFloat64_AVX2_8192-4 100000 23946 ns/op
ok github.com/c-bata/sandbox-go/c2goasm 10.474s

OMG! Our program is about 3 times slower than the pure Go function. I suspect the reason is that the SSE instructions couldn't speed this program up: the SSE registers are 128 bits, so xmm0 can hold only 2 elements of a float64 array, and with _mm_add_pd there is little parallelism to exploit.

But AVX2 instructions can use 256-bit registers, so we can expect a performance improvement there. I experimented with the compiler setup:

  • Added the -ffast-math option, because it is recommended by the -Rpass-analysis=loop-vectorize output.
  • Modified the generated assembly file by hand so that go build passes; sometimes the generated Go Plan9 assembly files are invalid.
  • Specified pragma hints for clang.
  • Changed the clang version (the build failed with the clang installed on macOS by default) and the optimization level (e.g. -O2 or -O3).

As a result, it worked well with the following compiler version and options. Other combinations of clang version and compiler options broke things, with segmentation faults or other errors. For example, if I simply replace -O2 with -O3, the generated assembly breaks.

$ clang7.0.1 -S -O2 -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c sum_float64.c

It also worked with clang 3.8.0 and the -O3 option plus pragma hints. The final benchmark is below:

$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sample-c2goasm
BenchmarkSumFloat64_256-4 5000000 282 ns/op
BenchmarkSumFloat64_1024-4 1000000 1234 ns/op
BenchmarkSumFloat64_8192-4 200000 10021 ns/op
BenchmarkSumFloat64_AVX2_256-4 50000000 23.5 ns/op
BenchmarkSumFloat64_AVX2_1024-4 20000000 95.9 ns/op
BenchmarkSumFloat64_AVX2_8192-4 2000000 904 ns/op
ok github.com/c-bata/sample-c2goasm 10.911s

The performance is dramatically improved: about 10 times faster than the function written in pure Go.


Huge performance improvements can be expected from this optimization method, but I bumped into many unexpected behaviors along the way. For now, I suspect one of the causes might be the use of the red zone on x64. I still need to investigate why some compiler options break the program. If you know, please tell me!

Anyway, I want to use SIMD optimizations in Go more and more.




Creator of go-prompt and kube-prompt. Optuna core-dev. GitHub: c-bata