What is the meaning of the data32 data32 nopw %cs:0x0(%rax,%rax,1) instruction in gcc inline asm?
While running some tests with the -O2 optimization of the gcc compiler, I observed the following instruction in the disassembled code of a function:
data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
What does this instruction do?
To be more detailed, I was trying to understand how the compiler optimizes useless recursions like the one below with -O2 optimization:
int foo(void)
{
    return foo();
}

int main(void)
{
    return foo();
}
The above code causes a stack overflow when compiled without optimization, but works when compiled with -O2.
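As a rough illustration of why the optimized build no longer overflows the stack: when a call is the last thing a function does, a compiler may replace the call with a jump, which turns the recursion into a loop. The sketch below is only an analogy. `count_down_recursive` and `count_down_loop` are hypothetical names, not anything from the code above, and unlike `foo` they terminate so the two forms can be compared.

```c
/* Hypothetical tail-recursive function: the recursive call is the last
   thing it does, like `return foo();`, but it terminates at n == 0. */
unsigned long count_down_recursive(unsigned long n, unsigned long steps)
{
    if (n == 0)
        return steps;
    return count_down_recursive(n - 1, steps + 1); /* call in tail position */
}

/* Roughly what the function above can be turned into: the call becomes a
   jump back to the top, so no new stack frame is pushed per step. */
unsigned long count_down_loop(unsigned long n, unsigned long steps)
{
    for (;;) {
        if (n == 0)
            return steps;
        n = n - 1;
        steps = steps + 1;
    }
}
```

Both functions return the same value for the same arguments. With `foo` there is no exit condition at all, so the loop form degenerates into a single jump to itself.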
I think -O2 removed pushing the stack frame of the function foo, but why is the data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1) needed?
0000000000400480 <foo>:
foo():
  400480:  eb fe                  jmp    400480 <foo>
  400482:  66 66 66 66 66 2e 0f   data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400489:  1f 84 00 00 00 00 00

0000000000400490 <main>:
main():
  400490:  eb fe                  jmp    400490 <main>
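Two things can be checked directly from the bytes in the listing. The sketch below is my reading of the encoding, not compiler output: it splits the 14 bytes into prefixes, opcode and addressing bytes, and computes how many bytes lie between the end of the jmp at 0x400482 and the next 16-byte boundary. The helper names `nop_length`, `count_leading` and `padding_to_boundary` are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* The 14 bytes shown at 400482..40048f in the listing. */
static const uint8_t nop_bytes[] = {
    0x66, 0x66, 0x66, 0x66, 0x66, /* operand-size prefixes (data32 x4, plus the one behind nopw) */
    0x2e,                         /* CS segment override prefix (%cs:) */
    0x0f, 0x1f,                   /* two-byte opcode of the multi-byte NOP */
    0x84,                         /* ModRM: mod=10, reg=000, rm=100 -> SIB byte + disp32 follow */
    0x00,                         /* SIB: scale=1, index=rax, base=rax */
    0x00, 0x00, 0x00, 0x00        /* disp32 = 0x0 */
};

/* Total length of the instruction in bytes. */
size_t nop_length(void)
{
    return sizeof nop_bytes;
}

/* Number of consecutive bytes equal to `value` at the start of the instruction. */
size_t count_leading(uint8_t value)
{
    size_t n = 0;
    while (n < sizeof nop_bytes && nop_bytes[n] == value)
        n++;
    return n;
}

/* Bytes needed to get from `addr` up to the next multiple of `align`
   (`align` must be a power of two); 0 if `addr` is already aligned. */
uint64_t padding_to_boundary(uint64_t addr, uint64_t align)
{
    return (align - (addr & (align - 1))) & (align - 1);
}
```

With the addresses from the listing, `padding_to_boundary(0x400482, 16)` is 14, which is exactly `nop_length()`, and 0x400482 + 14 is 0x400490, where main starts. So the instruction fills the gap after foo's jmp with a single long no-op that is never reached, which is consistent with main being placed on a 16-byte boundary.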
What you see is an operand forwarding optimization of the CPU pipeline.
Although it is an empty loop, gcc tries to optimize this as well :-).
The CPU you are running on has a superscalar architecture. This means that it has a pipeline in it, and the different phases of the execution of consecutive instructions happen in parallel. For example, if there is a
mov eax, ebx ;(#1)
mov ecx, edx ;(#2)
then the loading and decoding of instruction #2 can happen while #1 is still being executed.
Pipelining has major problems to solve in the case of branches, even if they are unconditional.
For example, while the jmp is being decoded, the next instruction is already prefetched into the pipeline. But the jmp changes the location of the next instruction. In such cases, the pipeline needs to be emptied and refilled, and a lot of valuable CPU cycles are lost.
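To put rough numbers on the lost cycles, here is a toy model only, not a description of any real x86 core. It assumes a hypothetical pipeline with `stages` stages that completes one instruction per cycle once it is full, and that throws away `penalty` cycles of work on every taken jump. The function name and all parameters are assumptions made for this sketch.

```c
/* Toy cycle count for a hypothetical in-order pipeline.
   Without jumps: `stages` cycles to fill the pipeline for the first
   instruction, then one more cycle per remaining instruction.
   Each taken jump adds `penalty` cycles for emptying and refilling. */
unsigned long pipeline_cycles(unsigned long stages,
                              unsigned long instructions,
                              unsigned long taken_jumps,
                              unsigned long penalty)
{
    if (instructions == 0)
        return 0;
    return stages + (instructions - 1) + taken_jumps * penalty;
}
```

In this model, 100 instructions on a 5-stage pipeline take 104 cycles with no jumps, and 164 cycles if 20 of them are taken jumps costing 3 cycles each. Real CPUs reduce this cost with branch prediction, so the figures only show the shape of the problem.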
It looks like this empty loop will run faster if the pipeline is filled with a no-op in this case, even though the no-op will never be executed. It is actually an optimization for an uncommon feature of the x86 pipeline.
Earlier DEC Alphas could even segfault on such things, and empty loops had to have a lot of no-ops in them. x86 would only be slower, because it must stay compatible with the Intel 8086.
Here you can read a lot more about the handling of branching instructions in pipelines.