c - How can a single sqrt() runs twice slower than when it was put in a for loop -
i'm doing experiment of profiling time takes compute single sqrt in c code. have 2 strategies.
one direct measurement of single sqrt call , other execute sqrt multiple times in loop , compute average. c code simple , shown follows:
long long readtsc(void); int main(int argc, char** argv) { int n = atoi(argv[1]); //v input of sqrt() making sure compiler won't //precompute result of sqrt(v) if v constant double v = atof(argv[2]);. long long tm; //track cpu clock cycles double x; //result of sqrt() //-- strategy --- tm = readtsc(); //a function uses rdtsc instruction number of clock cycles intel cpu x = sqrt(v); tm = readtsc() - tm; printf("x=%15.6\n",x); //make sure compiler won't optimize out above sqrt() printf("%lld clocks\n",tm); double sum = 0.0; int i; //-- strategy ii -- tm = readtsc(); ( = 0; < n; i++ ) sum += sqrt((double) i); tm = readtsc() - tm; printf("%lld clocks\n",tm); printf("%15.6e\n",sum); return 0; } long long readtsc(void) { /* read time stamp counter on intel x86 chips */ union { long long complete; unsigned int part[2]; } ticks; __asm__ ("rdtsc; mov %%eax,%0;mov %%edx,%1" : "=mr" (ticks.part[0]), "=mr" (ticks.part[1]) : /* no inputs */ : "eax", "edx"); return ticks.complete; } before running code, expected timing result of strategy might smaller 1 strategy ii, because strategy ii counts overhead incurred loop , sum addition.
i use following command without o3 optimization compile code on intel xeon e5-2680 2.7ghz machine.
gcc -o timing -lm timing.c however, result shows strategy takes 40 clock cycles while strategy ii takes average of 21.8 clock cycles, half of former one.
for reference, pasted related assembly code below comments. occurs me, based on timing result, each iteration executes 2 sqrt()'s. can hardly tell assembly code how cpu can execute 2 sqrt()'s call in parallel?
call atof cvtsi2ss %eax, %xmm0 movss %xmm0, -36(%rbp) //-- timing single sqrt --- call readtsc movq %rax, -32(%rbp) movss -36(%rbp), %xmm1 cvtps2pd %xmm1, %xmm1 //--- sqrtsd instruction sqrtsd %xmm1, %xmm0 ucomisd %xmm0, %xmm0 jp .l8 je .l4 .l8: movapd %xmm1, %xmm0 //--- c function call sqrt() call sqrt .l4: movsd %xmm0, -72(%rbp) movq -72(%rbp), %rax movq %rax, -24(%rbp) call readtsc //-- end of timing single sqrt --- subq -32(%rbp), %rax movq %rax, -32(%rbp) movl $.lc0, %eax movsd -24(%rbp), %xmm0 movq %rax, %rdi movl $1, %eax call printf movl $.lc1, %eax movq -32(%rbp), %rdx movq %rdx, %rsi movq %rax, %rdi movl $0, %eax call printf movl $0, %eax movq %rax, -16(%rbp) call readtsc //-- start of loop---- movq %rax, -32(%rbp) movl $0, -4(%rbp) jmp .l5 .l6: //(double) cvtsi2sd -4(%rbp), %xmm0 //-- c function call sqrt() call sqrt movsd -16(%rbp), %xmm1 //add sqrt(i) sum (%xmm0) addsd %xmm1, %xmm0 movsd %xmm0, -16(%rbp) //i++ addl $1, -4(%rbp) .l5: movl -4(%rbp), %eax //check i<n cmpl -40(%rbp), %eax jl .l6 //-- end of loop-- //you can skip rest of part. call readtsc subq -32(%rbp), %rax movq %rax, -32(%rbp) movl $.lc1, %eax movq -32(%rbp), %rdx movq %rdx, %rsi movq %rax, %rdi movl $0, %eax call printf movl $.lc3, %eax movsd -16(%rbp), %xmm0 movq %rax, %rdi movl $1, %eax call printf
it looks me strategy i, uses sqrtsd instruction. because ucomisd instruction not set parity flag , code jumps directly l4.
the loop, strategy ii uses call sqrt compute square root. may optimized version of sqrt, achieved through approximation, faster call sqrtsd. compilers might optimization.
even if call sqrt uses in sqrtsd instruction there no reason should run faster outside of loop.
please take note measuring latency of single instruction executed once not deterministic. rdtsc instruction has latency of it's own , because nowadays cpus superscalar , out of order cannot know rdtsc sqrtsd rdtsc executed in program order. sure not execute on same port therefore sqrtsd not guaranteed complete @ time second rdtsc completes.
another important thing consider when execute sqrt in loop, you decrease average latency of instructions. that's because you can have multiple instructions executing in parallel in different stages of pipeline. agner fog's instruction tables shows ivy bridge's sqrtsd instruction can have throughput ranging 1/8 1/14 instructions per cycle. loop increases average throughput while single isolated instruction have highest latency.
so because of pipelined parallel execution, instructions in loop run faster on average, when running isolated.
Comments
Post a Comment