Optimizing Go programs by AVX2 using Auto-Vectorization in LLVM.

  1. What is SIMD?
  2. How to use SSE from C programs by using intrinsics
  3. What is AVX2 and why faster than SSE
  4. Learn about Auto-Vectorization of LLVM
  5. Converting x64 assembly into Go Plan9 Assembly for reducing overheads
$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sample-c2goasm
BenchmarkSumFloat64_256-4 5000000 282 ns/op
BenchmarkSumFloat64_1024-4 1000000 1234 ns/op
BenchmarkSumFloat64_8192-4 200000 10021 ns/op
BenchmarkSumFloat64_AVX2_256-4 50000000 23.5 ns/op
BenchmarkSumFloat64_AVX2_1024-4 20000000 95.9 ns/op
BenchmarkSumFloat64_AVX2_8192-4 2000000 904 ns/op
PASS
ok github.com/c-bata/sample-c2goasm 10.911s

What is SIMD?

Photo by Slejven Djurakovic on Unsplash

Matrix addition using SSE instructions

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
__m128 a = {1.0f, 2.0f, 3.0f, 4.0f};
__m128 b = {1.1f, 2.1f, 3.1f, 4.1f};
float c[4];
__m128 ps = _mm_add_ps(a, b); // add
_mm_storeu_ps(c, ps); // store

printf(" source: (%5.1f, %5.1f, %5.1f, %5.1f)\n",
a[0], a[1], a[2], a[3]);
printf(" dest. : (%5.1f, %5.1f, %5.1f, %5.1f)\n",
b[0], b[1], b[2], b[3]);
printf(" result: (%5.1f, %5.1f, %5.1f, %5.1f)\n",
c[0], c[1], c[2], c[3]);
return 0;
}
$ gcc -o avx -Wall -O0 main.c
$ ./avx-
source: ( 1.0, 2.0, 3.0, 4.0)
dest. : ( 1.1, 2.1, 3.1, 4.1)
result: ( 2.1, 4.1, 6.1, 8.1)

Matrix addition using AVX (AVX2)

#include <stdio.h>
#include <immintrin.h>

int main(void) {
__m256 a = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
__m256 b = {1.1f, 2.1f, 3.1f, 4.1f, 5.1f, 6.1f, 7.1f, 8.1f};
__m256 c;

c = _mm256_add_ps(a, b);

printf(" source: (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]);
printf(" dest. : (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]);
printf(" result: (%5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f, %5.1f)\n",
c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
return 0;
}
$ gcc -o avx2 -Wall main.c
main.c:9:9: error: always_inline function '_mm256_add_ps' requires target feature 'xsave', but would be inlined into function 'main' that is compiled without support for 'xsave'
c = _mm256_add_ps(a, b);
^
1 error generated.
$ sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
$ sysctl -a | grep machdep.cpu.leaf7_features
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT
$ gcc -o avx2 -mavx2 -Wall main.c 
$ ./avx2
source: ( 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
dest. : ( 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1)
result: ( 2.1, 4.1, 6.1, 8.1, 10.1, 12.1, 14.1, 16.1)
$ gcc -S -mavx2 -Wall -O0 avx2.c
$ less avx2.s
...
## %bb.0:
...
vmovaps LCPI0_0(%rip), %ymm0 ## ymm0 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00]
vmovaps %ymm0, 160(%rsp)

Auto vectorization of LLVM

void sum_float64(double buf[], int len, double *res) {
double acc = 0.0;
for(int i = 0; i < len; i++) {
acc += buf[i];
}
*res = acc;
}
$ clang -S -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c sum_float64.c
.section    __TEXT,__text,regular,pure_instructions
.build_version macos, 10, 14
.intel_syntax noprefix
.globl _sum_float64 ## -- Begin function sum_float64
.p2align 4, 0x90
_sum_float64: ## @sum_float64
## %bb.0:
...
vxorps xmm0, xmm0, xmm0
mov qword ptr [rsp + 40], rdi
mov dword ptr [rsp + 36], esi
mov qword ptr [rsp + 24], rdx
vmovsd qword ptr [rsp + 16], xmm0
...

Calling optimized x64 assembly with slight overheads

package c2goasm_sample

import "unsafe"

//go:noescape
func __sum_float64(buf, len, res unsafe.Pointer)


func SumFloat64Avx2(a []float64) float64 {
var (
p1 = unsafe.Pointer(&a[0])
p2 = unsafe.Pointer(uintptr(len(a)))
res float64
)
__sum_float64(p1, p2, unsafe.Pointer(&res))
return res
}
$ go get -u github.com/minio/asm2plan9s
$ go get -u github.com/minio/c2goasm
$ go get -u github.com/klauspost/asmfmt/cmd/asmfmt
$ c2goasm -a -f _lib/sum_float64.s sum_float64.s
package c2goasm_sample

func SumFloat64(a []float64) float64 {
var sum float64
for i := range a {
sum += a[i]
}
return sum
}
$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sandbox-go/c2goasm
BenchmarkSumFloat64_256-4 5000000 277 ns/op
BenchmarkSumFloat64_1024-4 1000000 1205 ns/op
BenchmarkSumFloat64_8192-4 100000 10401 ns/op
BenchmarkSumFloat64_AVX2_256-4 2000000 768 ns/op
BenchmarkSumFloat64_AVX2_1024-4 500000 2872 ns/op
BenchmarkSumFloat64_AVX2_8192-4 100000 23946 ns/op
PASS
ok github.com/c-bata/sandbox-go/c2goasm 10.474s
  • Add -ffast-math option because it recommended by the -Rpass-analysis=loop-vectorize option.
  • Modify assembly file by hands to pass go build command. Sometimes generated Go Plan9 Assembly files are invalid.
  • Specify pragma hints for clang.
  • Change the clang version(Build failed if using clang installed macOS by default) and the optimization option(ex: O2 or O3)
$ clang7.0.1 -S -O2 -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c sum_float64.c
$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/c-bata/sample-c2goasm
BenchmarkSumFloat64_256-4 5000000 282 ns/op
BenchmarkSumFloat64_1024-4 1000000 1234 ns/op
BenchmarkSumFloat64_8192-4 200000 10021 ns/op
BenchmarkSumFloat64_AVX2_256-4 50000000 23.5 ns/op
BenchmarkSumFloat64_AVX2_1024-4 20000000 95.9 ns/op
BenchmarkSumFloat64_AVX2_8192-4 2000000 904 ns/op
PASS
ok github.com/c-bata/sample-c2goasm 10.911s

Conclusion

--

--

--

Creator of go-prompt and kube-prompt. Optuna committer. Kubeflow/Katib reviewer. GitHub: c-bata

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

OLED and PWM on ESP32

Dynamically populate a select field’s choices for Custom Post Type s— ACF

The 7 Best Android App Analytics Tools [2018–2019]

AWS Introduction

4+ years of cracking technical interviews

RChain cutting into the database circuit?

My experience as a Data Science was a tough call.As

What is cloud computing and why you should choose Google Cloud?

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Masashi SHIBATA

Masashi SHIBATA

Creator of go-prompt and kube-prompt. Optuna committer. Kubeflow/Katib reviewer. GitHub: c-bata

More from Medium

Concurrency in Go ( Beginners )

What are the limits of Go channels?

Go modules in mono-repo

Testing the main of a golang http server