Why doesn't clang use memory-destination x86 instructions when I compile with optimization disabled? Are they...












18















I wrote this simple assembly code, ran it and looked at the memory location using GDB:



    .text

.global _main

_main:
pushq %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
popq %rbp
ret


It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.



Now writing the same thing in C and compiling it to assembly turns out like this:



...  # clang output
xorl %eax, %eax
movl $0, -4(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %ecx # load a
addl $6, %ecx # a += 6
movl %ecx, -8(%rbp) # store a
....


It's moving them to a register before adding them together.



So why don't we add directly in memory?



Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?



Edit:
Here is the C code for the second assembly block, I have disabled optimization when compiling.



#include <iostream>

int main(){
int a = 5;
a+=6;
return 0;
}









share|improve this question




















  • 2





    Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU

    – Ctx
    Jan 27 at 18:08








  • 10





    The code from C appears to be unoptimised so it has extra loads and stores. Compile with -O3 and see what happens.

    – Michael Petch
    Jan 27 at 18:13






  • 7





    @Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)

    – Ctx
    Jan 27 at 18:26








  • 3





    I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.

    – Michael Petch
    Jan 27 at 18:48








  • 5





    It is not realistic to complain about a compiler's code generation when you disable optimization.

    – Raymond Chen
    Jan 27 at 19:34
















18















I wrote this simple assembly code, ran it and looked at the memory location using GDB:



    .text

.global _main

_main:
pushq %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
popq %rbp
ret


It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.



Now writing the same thing in C and compiling it to assembly turns out like this:



...  # clang output
xorl %eax, %eax
movl $0, -4(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %ecx # load a
addl $6, %ecx # a += 6
movl %ecx, -8(%rbp) # store a
....


It's moving them to a register before adding them together.



So why don't we add directly in memory?



Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?



Edit:
Here is the C code for the second assembly block, I have disabled optimization when compiling.



#include <iostream>

int main(){
int a = 5;
a+=6;
return 0;
}









share|improve this question




















  • 2





    Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU

    – Ctx
    Jan 27 at 18:08








  • 10





    The code from C appears to be unoptimised so it has extra loads and stores. Compile with -O3 and see what happens.

    – Michael Petch
    Jan 27 at 18:13






  • 7





    @Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)

    – Ctx
    Jan 27 at 18:26








  • 3





    I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.

    – Michael Petch
    Jan 27 at 18:48








  • 5





    It is not realistic to complain about a compiler's code generation when you disable optimization.

    – Raymond Chen
    Jan 27 at 19:34














18












18








18


2






I wrote this simple assembly code, ran it and looked at the memory location using GDB:



    .text

.global _main

_main:
pushq %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
popq %rbp
ret


It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.



Now writing the same thing in C and compiling it to assembly turns out like this:



...  # clang output
xorl %eax, %eax
movl $0, -4(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %ecx # load a
addl $6, %ecx # a += 6
movl %ecx, -8(%rbp) # store a
....


It's moving them to a register before adding them together.



So why don't we add directly in memory?



Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?



Edit:
Here is the C code for the second assembly block, I have disabled optimization when compiling.



#include <iostream>

int main(){
int a = 5;
a+=6;
return 0;
}









share|improve this question
















I wrote this simple assembly code, ran it and looked at the memory location using GDB:



    .text

.global _main

_main:
pushq %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
popq %rbp
ret


It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.



Now writing the same thing in C and compiling it to assembly turns out like this:



...  # clang output
xorl %eax, %eax
movl $0, -4(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %ecx # load a
addl $6, %ecx # a += 6
movl %ecx, -8(%rbp) # store a
....


It's moving them to a register before adding them together.



So why don't we add directly in memory?



Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?



Edit:
Here is the C code for the second assembly block, I have disabled optimization when compiling.



#include <iostream>

int main(){
int a = 5;
a+=6;
return 0;
}






c gcc assembly clang compiler-optimization






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 28 at 7:51









Peter Cordes

125k18189320




125k18189320










asked Jan 27 at 18:07









SamSam

973




973








  • 2





    Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU

    – Ctx
    Jan 27 at 18:08








  • 10





    The code from C appears to be unoptimised so it has extra loads and stores. Compile with -O3 and see what happens.

    – Michael Petch
    Jan 27 at 18:13






  • 7





    @Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)

    – Ctx
    Jan 27 at 18:26








  • 3





    I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.

    – Michael Petch
    Jan 27 at 18:48








  • 5





    It is not realistic to complain about a compiler's code generation when you disable optimization.

    – Raymond Chen
    Jan 27 at 19:34














  • 2





    Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU

    – Ctx
    Jan 27 at 18:08








  • 10





    The code from C appears to be unoptimised so it has extra loads and stores. Compile with -O3 and see what happens.

    – Michael Petch
    Jan 27 at 18:13






  • 7





    @Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)

    – Ctx
    Jan 27 at 18:26








  • 3





    I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.

    – Michael Petch
    Jan 27 at 18:48








  • 5





    It is not realistic to complain about a compiler's code generation when you disable optimization.

    – Raymond Chen
    Jan 27 at 19:34








2




2





Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU

– Ctx
Jan 27 at 18:08







Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU

– Ctx
Jan 27 at 18:08






10




10





The code from C appears to be unoptimised so it has extra loads and stores. Compile with -O3 and see what happens.

– Michael Petch
Jan 27 at 18:13





The code from C appears to be unoptimised so it has extra loads and stores. Compile with -O3 and see what happens.

– Michael Petch
Jan 27 at 18:13




7




7





@Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)

– Ctx
Jan 27 at 18:26







@Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)

– Ctx
Jan 27 at 18:26






3




3





I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.

– Michael Petch
Jan 27 at 18:48







I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.

– Michael Petch
Jan 27 at 18:48






5




5





It is not realistic to complain about a compiler's code generation when you disable optimization.

– Raymond Chen
Jan 27 at 19:34





It is not realistic to complain about a compiler's code generation when you disable optimization.

– Raymond Chen
Jan 27 at 19:34












1 Answer
1






active

oldest

votes


















33














You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.



Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp) would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.



(Update: just noticed your compiler output included xor %eax,%eax, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0 doesn't look for the xor-zeroing peephole optimization at -O0, using mov $0, %eax.)



Fun fact: gcc -O0 will actually use addl $6, -4(%rbp) in your main.





You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.



A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6 to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock prefix to make it appear atomic.



There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)



But as it stands now, add $6, -4(%rbp) decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.



Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.



We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb (which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.



Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.






is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?




In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)



Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)



x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.



You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter or loop, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32, and even inc/loop as a 3-byte alternative to 4-byte test ecx,ecx/jnz). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.



Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.





Avoiding slowdowns requires looking at more than 1 instruction at once



Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.



e.g. this code will cause a partial-register stall on Intel P6-family CPUs:



mov  ah, 1
add eax, 123


Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz in an adc loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs



If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.



But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:




  • Assembly - How to score a CPU instruction by latency and throughput

  • How many CPU cycles are needed for each assembly instruction?

  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?




GCC/clang -O0's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).



But even within the block of asm for one statement, clang -O0 apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.



GCC -O0 happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax/ret, because a is unused.)



main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret




How to see clang/LLVM using memory-destination add



I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)



void add_reg_to_mem(int *p, int b) {
*p += b;
}

# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret

void add_imm_to_mem(int *p) {
*p += 3;
}

# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret


The gcc -O0 output is just totally braindead, e.g. reloading p twice because it clobbers the pointer while calculating the +3. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0 for that would probably be a lot less terrible.



    # gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3


GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.



The clang -O0 output happens to be less horrible for this:



 # clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax




See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.





If I compiled with -m32 -mtune=pentium, gcc -O3 would avoid memory-dst add:



The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.



# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention

movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret


You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.



Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem) on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).



In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.



GCC refuses -mtune=pentium without -m32, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium.



The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.






share|improve this answer





















  • 8





    Downvoter: this is long and rambling and takes a long time to get to the point, but I'm pretty sure none of it is actually wrong. Please explain what you think is wrong with this.

    – Peter Cordes
    Jan 28 at 2:47











  • I'm not a downvoter, but I'm pretty sure long and rambling and takes a long time to get to the point is the reason for the downvotes. That's not indicative of a good answer.

    – Stjepan Bakrac
    Jan 28 at 9:39











  • @StjepanBakrac: after re-reading the question, it really is asking about what's efficient, and my answer gets to that point right away. It's long and maybe a bit rambling, but looking over it again I don't think I buried the actual point. The part I wrote first was the code example where gcc and clang emit memory-destination ADD with -O3, that's not the only point this answer makes. I'm hopeful that most of it is comprehensible and useful, and presented in a somewhat-sensible order, especially after I tidied up the question after posting that previous comment. Did you find it hard to follow?

    – Peter Cordes
    Jan 28 at 9:51













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54391268%2fwhy-doesnt-clang-use-memory-destination-x86-instructions-when-i-compile-with-op%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









33














You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.



Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp) would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.



(Update: just noticed your compiler output included xor %eax,%eax, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0 doesn't look for the xor-zeroing peephole optimization at -O0, using mov $0, %eax.)



Fun fact: gcc -O0 will actually use addl $6, -4(%rbp) in your main.





You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.



A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6 to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock prefix to make it appear atomic.



There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)



But as it stands now, add $6, -4(%rbp) decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.



Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.



We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb (which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.



Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.






is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?




In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)



Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)



x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.



You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter or loop, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32, and even inc/loop as a 3-byte alternative to 4-byte test ecx,ecx/jnz). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.



Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.





Avoiding slowdowns requires looking at more than 1 instruction at once



Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.



e.g. this code will cause a partial-register stall on Intel P6-family CPUs:



mov  ah, 1
add eax, 123


Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz in an adc loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs



If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.



But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:




  • Assembly - How to score a CPU instruction by latency and throughput

  • How many CPU cycles are needed for each assembly instruction?

  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?




GCC/clang -O0's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).



But even within the block of asm for one statement, clang -O0 apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.



GCC -O0 happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax/ret, because a is unused.)



main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret




How to see clang/LLVM using memory-destination add



I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)



void add_reg_to_mem(int *p, int b) {
*p += b;
}

# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret

void add_imm_to_mem(int *p) {
*p += 3;
}

# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret


The gcc -O0 output is just totally braindead, e.g. reloading p twice because it clobbers the pointer while calculating the +3. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0 for that would probably be a lot less terrible.



    # gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3


GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.



The clang -O0 output happens to be less horrible for this:



 # clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax




See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.





If I compiled with -m32 -mtune=pentium, gcc -O3 would avoid memory-dst add:



The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.



# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention

movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret


You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.



Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem) on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).



In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.



GCC refuses -mtune=pentium without -m32, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium.



The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.






share|improve this answer





















  • 8





    Downvoter: this is long and rambling and takes a long time to get to the point, but I'm pretty sure none of it is actually wrong. Please explain what you think is wrong with this.

    – Peter Cordes
    Jan 28 at 2:47











  • I'm not a downvoter, but I'm pretty sure long and rambling and takes a long time to get to the point is the reason for the downvotes. That's not indicative of a good answer.

    – Stjepan Bakrac
    Jan 28 at 9:39











  • @StjepanBakrac: after re-reading the question, it really is asking about what's efficient, and my answer gets to that point right away. It's long and maybe a bit rambling, but looking over it again I don't think I buried the actual point. The part I wrote first was the code example where gcc and clang emit memory-destination ADD with -O3, that's not the only point this answer makes. I'm hopeful that most of it is comprehensible and useful, and presented in a somewhat-sensible order, especially after I tidied up the question after posting that previous comment. Did you find it hard to follow?

    – Peter Cordes
    Jan 28 at 9:51


















33














You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.



Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp) would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.



(Update: just noticed your compiler output included xor %eax,%eax, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0 doesn't look for the xor-zeroing peephole optimization at -O0, using mov $0, %eax.)



Fun fact: gcc -O0 will actually use addl $6, -4(%rbp) in your main.





You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.



A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6 to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock prefix to make it appear atomic.



There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)



But as it stands now, add $6, -4(%rbp) decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.



Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.



We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb (which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.



Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.






is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?




In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)



Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)



x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.



You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter or loop, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32, and even inc/loop as a 3-byte alternative to 4-byte test ecx,ecx/jnz). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.



Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.





Avoiding slowdowns requires looking at more than 1 instruction at once



Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.



e.g. this code will cause a partial-register stall on Intel P6-family CPUs:



mov  ah, 1
add eax, 123


Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz in an adc loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs



If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.



But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:




  • Assembly - How to score a CPU instruction by latency and throughput

  • How many CPU cycles are needed for each assembly instruction?

  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?




GCC/clang -O0's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).



But even within the block of asm for one statement, clang -O0 apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.



GCC -O0 happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax/ret, because a is unused.)



main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret




How to see clang/LLVM using memory-destination add



I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)



void add_reg_to_mem(int *p, int b) {
*p += b;
}

# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret

void add_imm_to_mem(int *p) {
*p += 3;
}

# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret


The gcc -O0 output is just totally braindead, e.g. reloading p twice because it clobbers the pointer while calculating the +3. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0 for that would probably be a lot less terrible.



    # gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3


GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.



The clang -O0 output happens to be less horrible for this:



 # clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax




See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.





If I compiled with -m32 -mtune=pentium, gcc -O3 would avoid memory-dst add:



The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.



# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention

movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret


You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.



Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem) on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).



In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.



GCC refuses -mtune=pentium without -m32, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium.



The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.






share|improve this answer





















  • 8





    Downvoter: this is long and rambling and takes a long time to get to the point, but I'm pretty sure none of it is actually wrong. Please explain what you think is wrong with this.

    – Peter Cordes
    Jan 28 at 2:47











  • I'm not a downvoter, but I'm pretty sure long and rambling and takes a long time to get to the point is the reason for the downvotes. That's not indicative of a good answer.

    – Stjepan Bakrac
    Jan 28 at 9:39











  • @StjepanBakrac: after re-reading the question, it really is asking about what's efficient, and my answer gets to that point right away. It's long and maybe a bit rambling, but looking over it again I don't think I buried the actual point. The part I wrote first was the code example where gcc and clang emit memory-destination ADD with -O3, that's not the only point this answer makes. I'm hopeful that most of it is comprehensible and useful, and presented in a somewhat-sensible order, especially after I tidied up the question after posting that previous comment. Did you find it hard to follow?

    – Peter Cordes
    Jan 28 at 9:51
















33












33








33







You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.



Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp) would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.



(Update: just noticed your compiler output included xor %eax,%eax, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0 doesn't look for the xor-zeroing peephole optimization at -O0, using mov $0, %eax.)



Fun fact: gcc -O0 will actually use addl $6, -4(%rbp) in your main.





You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.



A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6 to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock prefix to make it appear atomic.



There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)



But as it stands now, add $6, -4(%rbp) decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.



Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.



We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb (which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.



Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.






is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?




In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)



Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)



x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.



You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter or loop, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32, and even inc/loop as a 3-byte alternative to 4-byte test ecx,ecx/jnz). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.



Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.





Avoiding slowdowns requires looking at more than 1 instruction at once



Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.



e.g. this code will cause a partial-register stall on Intel P6-family CPUs:



mov  ah, 1
add eax, 123


Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz in an adc loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs



If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.



But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:




  • Assembly - How to score a CPU instruction by latency and throughput

  • How many CPU cycles are needed for each assembly instruction?

  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?




GCC/clang -O0's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).



But even within the block of asm for one statement, clang -O0 apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.



GCC -O0 happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax/ret, because a is unused.)



main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret




How to see clang/LLVM using memory-destination add



I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)



void add_reg_to_mem(int *p, int b) {
*p += b;
}

# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret

void add_imm_to_mem(int *p) {
*p += 3;
}

# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret


The gcc -O0 output is just totally braindead, e.g. reloading p twice because it clobbers the pointer while calculating the +3. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0 for that would probably be a lot less terrible.



    # gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3


GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.



The clang -O0 output happens to be less horrible for this:



 # clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax




See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.





If I compiled with -m32 -mtune=pentium, gcc -O3 would avoid memory-dst add:



The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.



# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention

movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret


You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.



Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem) on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).



In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.



GCC refuses -mtune=pentium without -m32, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium.



The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.






share|improve this answer















You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.



Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp) would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.



(Update: just noticed your compiler output included xor %eax,%eax, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0 doesn't look for the xor-zeroing peephole optimization at -O0, using mov $0, %eax.)



Fun fact: gcc -O0 will actually use addl $6, -4(%rbp) in your main.





You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.



A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6 to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock prefix to make it appear atomic.



There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)



But as it stands now, add $6, -4(%rbp) decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.



Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.



We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb (which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.



Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.






is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?




In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)



Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)



x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.



You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter or loop, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32, and even inc/loop as a 3-byte alternative to 4-byte test ecx,ecx/jnz). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.



Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.





Avoiding slowdowns requires looking at more than 1 instruction at once



Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.



e.g. this code will cause a partial-register stall on Intel P6-family CPUs:



mov  ah, 1
add eax, 123


Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz in an adc loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs



If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.



But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:




  • Assembly - How to score a CPU instruction by latency and throughput

  • How many CPU cycles are needed for each assembly instruction?

  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?




GCC/clang -O0's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).



But even within the block of asm for one statement, clang -O0 apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.



GCC -O0 happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax/ret, because a is unused.)



main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret




How to see clang/LLVM using memory-destination add



I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)



void add_reg_to_mem(int *p, int b) {
*p += b;
}

# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret

void add_imm_to_mem(int *p) {
*p += 3;
}

# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret


The gcc -O0 output is just totally braindead, e.g. reloading p twice because it clobbers the pointer while calculating the +3. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0 for that would probably be a lot less terrible.



    # gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3


GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.



The clang -O0 output happens to be less horrible for this:



 # clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax




See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.





If I compiled with -m32 -mtune=pentium, gcc -O3 would avoid memory-dst add:



The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.



# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention

movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret


You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.



Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem) on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).



In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.



GCC refuses -mtune=pentium without -m32, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium.



The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 27 at 23:32

























answered Jan 27 at 19:35









Peter CordesPeter Cordes

125k18189320




125k18189320








  • 8





    Downvoter: this is long and rambling and takes a long time to get to the point, but I'm pretty sure none of it is actually wrong. Please explain what you think is wrong with this.

    – Peter Cordes
    Jan 28 at 2:47











  • I'm not a downvoter, but I'm pretty sure long and rambling and takes a long time to get to the point is the reason for the downvotes. That's not indicative of a good answer.

    – Stjepan Bakrac
    Jan 28 at 9:39











  • @StjepanBakrac: after re-reading the question, it really is asking about what's efficient, and my answer gets to that point right away. It's long and maybe a bit rambling, but looking over it again I don't think I buried the actual point. The part I wrote first was the code example where gcc and clang emit memory-destination ADD with -O3, that's not the only point this answer makes. I'm hopeful that most of it is comprehensible and useful, and presented in a somewhat-sensible order, especially after I tidied up the question after posting that previous comment. Did you find it hard to follow?

    – Peter Cordes
    Jan 28 at 9:51
















  • 8





    Downvoter: this is long and rambling and takes a long time to get to the point, but I'm pretty sure none of it is actually wrong. Please explain what you think is wrong with this.

    – Peter Cordes
    Jan 28 at 2:47











  • I'm not a downvoter, but I'm pretty sure long and rambling and takes a long time to get to the point is the reason for the downvotes. That's not indicative of a good answer.

    – Stjepan Bakrac
    Jan 28 at 9:39











  • @StjepanBakrac: after re-reading the question, it really is asking about what's efficient, and my answer gets to that point right away. It's long and maybe a bit rambling, but looking over it again I don't think I buried the actual point. The part I wrote first was the code example where gcc and clang emit memory-destination ADD with -O3, that's not the only point this answer makes. I'm hopeful that most of it is comprehensible and useful, and presented in a somewhat-sensible order, especially after I tidied up the question after posting that previous comment. Did you find it hard to follow?

    – Peter Cordes
    Jan 28 at 9:51










8




8





Downvoter: this is long and rambling and takes a long time to get to the point, but I'm pretty sure none of it is actually wrong. Please explain what you think is wrong with this.

– Peter Cordes
Jan 28 at 2:47





Downvoter: this is long and rambling and takes a long time to get to the point, but I'm pretty sure none of it is actually wrong. Please explain what you think is wrong with this.

– Peter Cordes
Jan 28 at 2:47













I'm not a downvoter, but I'm pretty sure long and rambling and takes a long time to get to the point is the reason for the downvotes. That's not indicative of a good answer.

– Stjepan Bakrac
Jan 28 at 9:39





I'm not a downvoter, but I'm pretty sure long and rambling and takes a long time to get to the point is the reason for the downvotes. That's not indicative of a good answer.

– Stjepan Bakrac
Jan 28 at 9:39













@StjepanBakrac: after re-reading the question, it really is asking about what's efficient, and my answer gets to that point right away. It's long and maybe a bit rambling, but looking over it again I don't think I buried the actual point. The part I wrote first was the code example where gcc and clang emit memory-destination ADD with -O3, that's not the only point this answer makes. I'm hopeful that most of it is comprehensible and useful, and presented in a somewhat-sensible order, especially after I tidied up the question after posting that previous comment. Did you find it hard to follow?

– Peter Cordes
Jan 28 at 9:51







@StjepanBakrac: after re-reading the question, it really is asking about what's efficient, and my answer gets to that point right away. It's long and maybe a bit rambling, but looking over it again I don't think I buried the actual point. The part I wrote first was the code example where gcc and clang emit memory-destination ADD with -O3, that's not the only point this answer makes. I'm hopeful that most of it is comprehensible and useful, and presented in a somewhat-sensible order, especially after I tidied up the question after posting that previous comment. Did you find it hard to follow?

– Peter Cordes
Jan 28 at 9:51




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54391268%2fwhy-doesnt-clang-use-memory-destination-x86-instructions-when-i-compile-with-op%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

How to make a Squid Proxy server?

Is this a new Fibonacci Identity?

19世紀