Optimizing Go programs with AVX2 using Auto-Vectorization in LLVM

  1. What is SIMD?
  2. How to use SSE from C programs using intrinsics
  3. What is AVX2 and why it is faster than SSE
  4. Learn about Auto-Vectorization in LLVM
  5. Converting x64 assembly into Go Plan9 Assembly to reduce overhead
$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sample-c2goasm
BenchmarkSumFloat64_256-4 5000000 282 ns/op
BenchmarkSumFloat64_1024-4 1000000 1234 ns/op
BenchmarkSumFloat64_8192-4 200000 10021 ns/op
BenchmarkSumFloat64_AVX2_256-4 50000000 23.5 ns/op
BenchmarkSumFloat64_AVX2_1024-4 20000000 95.9 ns/op
BenchmarkSumFloat64_AVX2_8192-4 2000000 904 ns/op
PASS
ok github.com/c-bata/sample-c2goasm 10.911s

What is SIMD?

SIMD (Single Instruction, Multiple Data) is a class of CPU instructions that apply the same operation to multiple data elements at once. On x86 processors it is provided by extensions such as SSE and AVX, whose wide registers hold several values side by side.
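
As a rough illustration (plain Go, not actual SIMD), a single 4-lane SIMD addition does the work of this whole loop in one instruction:

package main

import "fmt"

func main() {
    a := [4]float32{1.0, 2.0, 3.0, 4.0}
    b := [4]float32{1.1, 2.1, 3.1, 4.1}
    var c [4]float32

    // A scalar CPU performs four separate additions here.
    // An SSE instruction such as addps adds all four lanes at once.
    for i := range c {
        c[i] = a[i] + b[i]
    }
    fmt.Println(c)
}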

Matrix addition using SSE instructions

Because inline assembly is not supported by every x64 compiler (64-bit MSVC, for example, does not allow it), we use intrinsics: compiler-provided functions and macros that expand into the corresponding SIMD instructions. Intrinsics are also the better choice from the viewpoint of portability.

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    __m128 a = {1.0f, 2.0f, 3.0f, 4.0f};
    __m128 b = {1.1f, 2.1f, 3.1f, 4.1f};
    float c[4];
    __m128 ps = _mm_add_ps(a, b); // add
    _mm_storeu_ps(c, ps); // store

    printf(" source: (%5.1f, %5.1f, %5.1f, %5.1f)\n",
           a[0], a[1], a[2], a[3]);
    printf(" dest. : (%5.1f, %5.1f, %5.1f, %5.1f)\n",
           b[0], b[1], b[2], b[3]);
    printf(" result: (%5.1f, %5.1f, %5.1f, %5.1f)\n",
           c[0], c[1], c[2], c[3]);
    return 0;
}
$ gcc -o avx -Wall -O0 main.c
$ ./avx
source: ( 1.0, 2.0, 3.0, 4.0)
dest. : ( 1.1, 2.1, 3.1, 4.1)
result: ( 2.1, 4.1, 6.1, 8.1)

Matrix addition using AVX (AVX2)

In SSE the SIMD registers are 128 bits wide, so only four single-precision float values can be processed at a time. AVX introduced 256-bit registers, which significantly improves arithmetic throughput, and AVX2, added later, supports integer arithmetic as well as floating point.

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    __m256 a = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    __m256 b = {1.1f, 2.1f, 3.1f, 4.1f, 5.1f, 6.1f, 7.1f, 8.1f};
    __m256 c;

    c = _mm256_add_ps(a, b);

    printf(" source: (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
           a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]);
    printf(" dest. : (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
           b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]);
    printf(" result: (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
           c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
    return 0;
}
$ gcc -o avx2 -Wall main.c
main.c:9:9: error: always_inline function '_mm256_add_ps' requires target feature 'xsave', but would be inlined into function 'main' that is compiled without support for 'xsave'
c = _mm256_add_ps(a, b);
^
1 error generated.
$ sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
$ sysctl -a | grep machdep.cpu.leaf7_features
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT
$ gcc -o avx2 -mavx2 -Wall main.c 
$ ./avx2
source: ( 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
dest. : ( 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1)
result: ( 2.1, 4.1, 6.1, 8.1, 10.1, 12.1, 14.1, 16.1)
$ gcc -S -mavx2 -Wall -O0 avx2.c
$ less avx2.s
...
## %bb.0:
...
vmovaps LCPI0_0(%rip), %ymm0 ## ymm0 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00]
vmovaps %ymm0, 160(%rsp)
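
Since the final goal is to call this code from Go, the same feature check can also be done at runtime. Here is a minimal sketch using the golang.org/x/sys/cpu package (not part of the original setup, just an alternative to sysctl):

package main

import (
    "fmt"

    "golang.org/x/sys/cpu"
)

func main() {
    // cpu.X86 exposes the CPUID feature flags on amd64.
    fmt.Println("AVX supported: ", cpu.X86.HasAVX)
    fmt.Println("AVX2 supported:", cpu.X86.HasAVX2)
}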

Auto-Vectorization in LLVM

One of the most powerful features of LLVM's optimizer is auto-vectorization. For example, the following C function can be compiled into code that uses the CPU's SIMD instructions.

void sum_float64(double buf[], int len, double *res) {
    double acc = 0.0;
    for (int i = 0; i < len; i++) {
        acc += buf[i];
    }
    *res = acc;
}
$ clang -S -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c sum_float64.c
.section    __TEXT,__text,regular,pure_instructions
.build_version macos, 10, 14
.intel_syntax noprefix
.globl _sum_float64 ## -- Begin function sum_float64
.p2align 4, 0x90
_sum_float64: ## @sum_float64
## %bb.0:
...
vxorps xmm0, xmm0, xmm0
mov qword ptr [rsp + 40], rdi
mov dword ptr [rsp + 36], esi
mov qword ptr [rsp + 24], rdx
vmovsd qword ptr [rsp + 16], xmm0
...

Calling optimized x64 assembly with little overhead

cgo is the de facto standard way to call C functions from Go, but it is not a great solution from a performance point of view (see "Why cgo is slow" @ CapitalGo 2018 on Speaker Deck). Converting the x64 assembly into Go Plan9 Assembly with c2goasm lets us call the optimized code with very little overhead. The article written by minio (the creators of c2goasm) is below:

package c2goasm_sample

import "unsafe"

//go:noescape
func __sum_float64(buf, len, res unsafe.Pointer)

func SumFloat64Avx2(a []float64) float64 {
    var (
        p1  = unsafe.Pointer(&a[0])
        p2  = unsafe.Pointer(uintptr(len(a)))
        res float64
    )
    __sum_float64(p1, p2, unsafe.Pointer(&res))
    return res
}
$ go get -u github.com/minio/asm2plan9s
$ go get -u github.com/minio/c2goasm
$ go get -u github.com/klauspost/asmfmt/cmd/asmfmt
$ c2goasm -a -f _lib/sum_float64.s sum_float64.s
For comparison, the pure Go implementation is a plain loop:

package c2goasm_sample

func SumFloat64(a []float64) float64 {
    var sum float64
    for i := range a {
        sum += a[i]
    }
    return sum
}
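
The benchmark code itself is not shown in this article; a minimal sketch that could drive benchmarks like the ones below might look like this (the helper makeFloat64Slice and the exact input values are illustrative):

package c2goasm_sample

import "testing"

func makeFloat64Slice(n int) []float64 {
    a := make([]float64, n)
    for i := range a {
        a[i] = float64(i)
    }
    return a
}

func benchmarkSum(b *testing.B, n int, sum func([]float64) float64) {
    a := makeFloat64Slice(n)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sum(a)
    }
}

func BenchmarkSumFloat64_256(b *testing.B)       { benchmarkSum(b, 256, SumFloat64) }
func BenchmarkSumFloat64_1024(b *testing.B)      { benchmarkSum(b, 1024, SumFloat64) }
func BenchmarkSumFloat64_8192(b *testing.B)      { benchmarkSum(b, 8192, SumFloat64) }
func BenchmarkSumFloat64_AVX2_256(b *testing.B)  { benchmarkSum(b, 256, SumFloat64Avx2) }
func BenchmarkSumFloat64_AVX2_1024(b *testing.B) { benchmarkSum(b, 1024, SumFloat64Avx2) }
func BenchmarkSumFloat64_AVX2_8192(b *testing.B) { benchmarkSum(b, 8192, SumFloat64Avx2) }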
$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sandbox-go/c2goasm
BenchmarkSumFloat64_256-4 5000000 277 ns/op
BenchmarkSumFloat64_1024-4 1000000 1205 ns/op
BenchmarkSumFloat64_8192-4 100000 10401 ns/op
BenchmarkSumFloat64_AVX2_256-4 2000000 768 ns/op
BenchmarkSumFloat64_AVX2_1024-4 500000 2872 ns/op
BenchmarkSumFloat64_AVX2_8192-4 100000 23946 ns/op
PASS
ok github.com/c-bata/sandbox-go/c2goasm 10.474s
At first the AVX2 version was actually slower than the pure Go loop, as the results above show. To get the expected speedup I tried the following:
  • Add the -ffast-math option, as recommended by -Rpass-analysis=loop-vectorize.
  • Modify the assembly file by hand so that go build passes; the generated Go Plan9 Assembly is sometimes invalid.
  • Specify pragma hints for clang (e.g. #pragma clang loop vectorize(enable)).
  • Change the clang version (the build failed with the clang shipped with macOS by default) and the optimization level (e.g. -O2 or -O3).
$ clang7.0.1 -S -O2 -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c sum_float64.c
$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sample-c2goasm
BenchmarkSumFloat64_256-4 5000000 282 ns/op
BenchmarkSumFloat64_1024-4 1000000 1234 ns/op
BenchmarkSumFloat64_8192-4 200000 10021 ns/op
BenchmarkSumFloat64_AVX2_256-4 50000000 23.5 ns/op
BenchmarkSumFloat64_AVX2_1024-4 20000000 95.9 ns/op
BenchmarkSumFloat64_AVX2_8192-4 2000000 904 ns/op
PASS
ok github.com/c-bata/sample-c2goasm 10.911s

Conclusion

Huge performance improvements can be expected from this optimization method, but I ran into many unexpected behaviors along the way. For now, I suspect one of the causes might be the use of the red zone on x64. I still need to investigate why some compiler options break the program. If you know, please tell me!
