Vectorized function application

profile picture

Auto-Vectorized function application

For my many-body simulation I was looking for a way to implement user defined force calculations, which doesn’t break the auto vectorization of the compiler. The user should be able to pass a lambda function that takes in two bodies and returns the pairwise interaction force between them, e.g. due to gravity or electromagnetic repulsion.

Roughly speaking, the code has this structure:

for(int i=0; i<particles.size(); i++){
  double acceleration = 0.0;
  for(int j=0; j<particles.size(); j++){
    acceleration += pairwise_force(particles[i], particles[j]);
  }
}

Turns out that passing the pairwise_force using a std::function<double(Particle, Particle)> is a really bad idea, because it turns this into an entirely scalar loop, even if the same code was vectorizing perfectly fine when inlined. Since this was the hot-loop in my simulation, the resulting performance drop of 8-16x is completely unacceptable.

Quick Bench Link

Quick Bench results of different ways to pass the function

Turns out there is a really nice way to fix this issue! By making the argument pairwise_force of a template type instead of std::function<double(Particle, Particle)>, we force the compiler to inline the function. The result is a more readable, flexible implementation, where the user can pass their own force implementation, but at the performance level of a hardcoded routine.

Compiler Explorer

Looking at the compiled cssembly in Compiler Explorer reveals that the fast implementations were all successfully vectorized and use the YMM registers

.L21:
    vmulpd  ymm0, ymm1, YMMWORD PTR [rbx+rax]
    vmovupd YMMWORD PTR [rdx+rax], ymm0

https://godbolt.org/z/zexbh1vTs

while the std::function implementation is stuck with XMM registers, that only hold one QWORD (64bit double)

std::_Function_handler<double (double, double), main::{lambda(double, double)#1}>::_M_invoke(std::_Any_data const&, double&&, double&&):
        vmovsd  xmm0, QWORD PTR [rsi]
        vmulsd  xmm0, xmm0, QWORD PTR [rdx]

https://godbolt.org/z/afGo1o1f1

References