Optimizing Go programs by AVX2 using Auto-Vectorization in LLVM.

8 min readMar 25, 2019

I found an interesting article written about optimizations for Arrow Go. It looks that this method can be applied to wide variety of Go projects which needs arithmetic vector operations.

InfluxData Working on Go Implementation of Apache Arrow | InfluxData

InfluxData is pleased to announce our contribution to the Apache Arrow project. Essentially, we are contributing the…

www.influxdata.com

The optimization is clearly explained in the above article. Here I give more detailed introduction and how I fixed problems that I bumped into. After reading this article, you will know the followings:

What is SIMD?
How to use SSE from C programs by using intrinsics
What is AVX2 and why faster than SSE
Learn about Auto-Vectorization of LLVM
Converting x64 assembly into Go Plan9 Assembly for reducing overheads

By applying the method I will explain, your Go programs might be 10 times faster than the function written in pure Go. For example, the benchmark scores of sum calculation of all the float64 values in array is below.

$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sample-c2goasm
BenchmarkSumFloat64_256-4             5000000    282 ns/op
BenchmarkSumFloat64_1024-4           1000000    1234 ns/op
BenchmarkSumFloat64_8192-4            200000   10021 ns/op
BenchmarkSumFloat64_AVX2_256-4      50000000    23.5 ns/op
BenchmarkSumFloat64_AVX2_1024-4     20000000    95.9 ns/op
BenchmarkSumFloat64_AVX2_8192-4      2000000     904 ns/op
PASS
ok      github.com/c-bata/sample-c2goasm    10.911s

The source code is available at https://github.com/c-bata/sample-c2goasm

What is SIMD?

As the name implies, SIMD (Single Instruction Multiple Data) means a method to process multiple data in single instruction. When compared to MIMD (multiple instruction, multiple data) which requires a mechanism to supply different instructions to each processor core, SIMD can design processors with a smaller area because it requires less transistors. So most of CPUs and GPUs support SIMD operations (Sorry, I don’t mention SIMT here). MMX, SSE and AVX are SIMD operation instructions on Intel CPU.

Matrix addition using SSE instructions

Because inline assembly is not available in x64, we can use intrinsics that is macros to expand assembly. It is also better to use intrinsics from the viewpoint of portability.

Let’s use SSE, which is one of SIMD instructions of Intel CPU. The header file xmmintrin.h contains macros for expanding into an assembly of SSE, so include that and execute matrix addition as follows.

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    __m128 a = {1.0f, 2.0f, 3.0f, 4.0f};
    __m128 b = {1.1f, 2.1f, 3.1f, 4.1f};
    float c[4];
    __m128 ps = _mm_add_ps(a, b); // add
    _mm_storeu_ps(c, ps); // store

    printf("  source: (%5.1f, %5.1f, %5.1f, %5.1f)\n",
            a[0], a[1], a[2], a[3]);
    printf("  dest. : (%5.1f, %5.1f, %5.1f, %5.1f)\n",
           b[0], b[1], b[2], b[3]);
    printf("  result: (%5.1f, %5.1f, %5.1f, %5.1f)\n",
           c[0], c[1], c[2], c[3]);
    return 0;
}

A 128-bits register is available for SSE instructions. A float type consumes 32 bits, so that 4 elements can be calculated at the same time. The output of this is:

$ gcc -o avx -Wall -O0 main.c
$ ./avx-
  source: (  1.0,   2.0,   3.0,   4.0)
  dest. : (  1.1,   2.1,   3.1,   4.1)
  result: (  2.1,   4.1,   6.1,   8.1)

Matrix addition using AVX (AVX2)

In SSE, the SIMD register was 128 bits, so if you want to calculate float data, only four elements could be calculated at one time. The AVX instruction has been embed 256 bits registers which has significantly improved the arithmetic performance. Furthermore, AVX2 added later supports integer arithmetic as well as floating point.

Basically, it is desirable to use this in the environment where AVX2 is supported. Include immintrin.h when using macros for expanding into AVX and AVX2 assembly .

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    __m256 a = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    __m256 b = {1.1f, 2.1f, 3.1f, 4.1f, 5.1f, 6.1f, 7.1f, 8.1f};
    __m256 c;

    c = _mm256_add_ps(a, b);

    printf("  source: (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
            a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]);
    printf("  dest. : (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
            b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]);
    printf("  result: (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
           c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
    return 0;
}

In AVX, eight elements can be calculated at one time because 256 bits divided by 32 bits (float consumes 4 bytes). Use __m128d if you want to use doubles instead of floats and __m128i if you want to use integers. Note that double precision floating point uses 8 bytes and can handle only 4 elements. Let’s compile the code:

$ gcc -o avx2 -Wall main.c
main.c:9:9: error: always_inline function '_mm256_add_ps' requires target feature 'xsave', but would be inlined into function 'main' that is compiled without support for 'xsave'
    c = _mm256_add_ps(a, b);
        ^
1 error generated.

Compilation is failed. According to this article , the functions supported by the CPU can be confirmed by the following command.

$ sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

Certainly SSE, SSE2 , SSE4.1, SSE4.2, AVX1.0 exist, but AVX2 is not here. However, as CPU may support AVX2 and the instruction which appeared with the following command seems to be able to use it by giving a special compiler option.

$ sysctl -a | grep machdep.cpu.leaf7_features
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT

It turns out that AVX2 can be used by giving some compiler options. According to this article gcc can use the -mavx2 option.

$ gcc -o avx2 -mavx2 -Wall main.c 
$ ./avx2
  source: (  1.0,   2.0,   3.0,   4.0,   5.0,   6.0,   7.0,   8.0)
  dest. : (  1.1,   2.1,   3.1,   4.1,   5.1,   6.1,   7.1,   8.1)
  result: (  2.1,   4.1,   6.1,   8.1,  10.1,  12.1,  14.1,  16.1)

By checking the assembly file, we confirm that AVX2 instructions are used.

$ gcc -S -mavx2 -Wall -O0 avx2.c
$ less avx2.s
...
## %bb.0:
        ...
        vmovaps LCPI0_0(%rip), %ymm0    ## ymm0 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00]
        vmovaps %ymm0, 160(%rsp)

AVX-512
According to this article and wikipedia, AVX-512 instructions can use 512 bits register. It looks great performance we can obtain, but the most of projects I’ve ever seen (ex: simdjson) use AVX2. And AVX-512 doesn’t appear in the result of $ sysctl -a | grep machdep.cpu.leaf7_features. So I couldn’t use this, but it may be better to try if you want more performance improvements.

Auto vectorization of LLVM

One of powerful features of LLVM optimization is Auto-Vectorization. For example, following C function will be optimized by using SIMD instructions of CPU.

void sum_float64(double buf[], int len, double *res) {
    double acc = 0.0;
    for(int i = 0; i < len; i++) {
        acc += buf[i];
    }
    *res = acc;
}

Compile this by clang compiler (See https://llvm.org/docs/Vectorizers.html for more details):

$ clang -S -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c sum_float64.c

After executing this, check the assembly file:

.section    __TEXT,__text,regular,pure_instructions
    .build_version macos, 10, 14
    .intel_syntax noprefix
    .globl  _sum_float64            ## -- Begin function sum_float64
    .p2align    4, 0x90
_sum_float64:                           ## @sum_float64
## %bb.0:
    ...
    vxorps  xmm0, xmm0, xmm0
    mov qword ptr [rsp + 40], rdi
    mov dword ptr [rsp + 36], esi
    mov qword ptr [rsp + 24], rdx
    vmovsd  qword ptr [rsp + 16], xmm0
    ...

There are instructions which uses xmm0 , So we can understand this program is optimized SSE instructions. We can get optimized code thanks to LLVM. In the next step, let’s call this assembly from Go.

Calling optimized x64 assembly with slight overheads

cgo is a de facto standard tool to call C functions from Go. But it is not great solution in the viewpoint of performance (see Why cgo is slow @ CapitalGo 2018 — Speaker Deck). Converting x64 assembly to Go Plan9 Assembly by using c2goasm is good solution to call C functions with slight overheads. The article written by minio (creator of c2goasm) is below:

c2goasm: C to Go Assembly

At minio we have been optimizing our Amazon S3 compatible Object Storage minio server using Golang assembly for a…

blog.minio.io

Before executing c2goasm, we need to define a Go function:

package c2goasm_sample

import "unsafe"

//go:noescape
func __sum_float64(buf, len, res unsafe.Pointer)


func SumFloat64Avx2(a []float64) float64 {
    var (
        p1 = unsafe.Pointer(&a[0])
        p2 = unsafe.Pointer(uintptr(len(a)))
        res float64
    )
    __sum_float64(p1, p2, unsafe.Pointer(&res))
    return res
}

Please caution that the function name should add a _ prefix to subroutine name of assembly file. In this case, the subroutine name is _sum_float64, so I named the Go function __sum_float64. After defined an our Go function, it’s time to use c2goasm.

$ go get -u github.com/minio/asm2plan9s
$ go get -u github.com/minio/c2goasm
$ go get -u github.com/klauspost/asmfmt/cmd/asmfmt
$ c2goasm -a -f _lib/sum_float64.s sum_float64.s

The name of Go plan9 assembly file should be sum_float64.s if your Go program is named sum_float64.go.

OK! Let’s compare the performance with following simple function written in pure Go:

package c2goasm_sample

func SumFloat64(a []float64) float64 {
    var sum float64
    for i := range a {
        sum += a[i]
    }
    return sum
}

The code of benchmark is here. Let’s run this benchmark code!

$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sandbox-go/c2goasm
BenchmarkSumFloat64_256-4              5000000         277 ns/op
BenchmarkSumFloat64_1024-4             1000000        1205 ns/op
BenchmarkSumFloat64_8192-4              100000       10401 ns/op
BenchmarkSumFloat64_AVX2_256-4         2000000         768 ns/op
BenchmarkSumFloat64_AVX2_1024-4         500000        2872 ns/op
BenchmarkSumFloat64_AVX2_8192-4         100000       23946 ns/op
PASS
ok      github.com/c-bata/sandbox-go/c2goasm    10.474s

OMG! Our program is 3 times slower than the pure Go function. I expect the reason why slow is that SSE instructions couldn’t make faster this program. We can store only 2 elements of float64 array in xmm0, because the register for SSE is 128 bits. If we use _mm_add_pd, the calculation is not parallelized.

But AVX2 instructions can use 256 bits register, so we can expect the performance improvements. I checked the compiler options:

Add -ffast-math option because it recommended by the -Rpass-analysis=loop-vectorize option.
Modify assembly file by hands to pass go build command. Sometimes generated Go Plan9 Assembly files are invalid.
Specify pragma hints for clang.
Change the clang version(Build failed if using clang installed macOS by default) and the optimization option(ex: O2 or O3)

As the result, It works good in following compiler version and options. Another pattern of clang version and compiler options will break because SegmentationFault error or something another reasons. For example if I just replace O2 option with O3, the assembly will be broken.

$ clang7.0.1 -S -O2 -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c sum_float64.c

or using Clang3.8.0 and O3 option with pragma hints. The final benchmark is below:

$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sample-c2goasm
BenchmarkSumFloat64_256-4              5000000           282 ns/op
BenchmarkSumFloat64_1024-4             1000000          1234 ns/op
BenchmarkSumFloat64_8192-4              200000         10021 ns/op
BenchmarkSumFloat64_AVX2_256-4        50000000            23.5 ns/op
BenchmarkSumFloat64_AVX2_1024-4       20000000            95.9 ns/op
BenchmarkSumFloat64_AVX2_8192-4        2000000           904 ns/op
PASS
ok      github.com/c-bata/sample-c2goasm    10.911s

The performance will be surprisingly improved. It’s 10 times faster than the function written in Pure Go.

Conclusion

Huge performance improvements can be expected by this optimization method. But I bumped into many unexpected behaviors while optimizing. For now, I expected one of the causes of this problem might be using red-zone on x64. I need to investigate the reason why some compiler options will make the program broken. If you know, please tell me!

Anyway, I want to use SIMD optimization in Go, more and more.