Useless test instruction?












24














I got the following assembly listing from the JIT compilation of my Java program.



mov    0x14(%rsp),%r10d
inc %r10d

mov 0x1c(%rsp),%r8d
inc %r8d

test %eax,(%r11) ; <--- this instruction

mov (%rsp),%r9
mov 0x40(%rsp),%r14d
mov 0x18(%rsp),%r11d
mov %ebp,%r13d
mov 0x8(%rsp),%rbx
mov 0x20(%rsp),%rbp
mov 0x10(%rsp),%ecx
mov 0x28(%rsp),%rax

movzbl 0x18(%r9),%edi
movslq %r8d,%rsi

cmp 0x30(%rsp),%rsi
jge 0x00007fd3d27c4f17


My understanding is that the test instruction is useless here, because the main idea of test is:




The flags SF, ZF, PF are modified while the result of the AND is discarded.




and here we don't use these result flags.



Is it a bug in the JIT, or am I missing something?
If it is a bug, where is the best place to report it?
Thanks!






























  • 2




    This instruction does indeed seem useless.
    – fuz
    yesterday






  • 6




    FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context.
    – another-dave
    yesterday






  • 2




    Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have used mov (%r11), %r9d because r9 is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline so it doesn't hurt front-end throughput.
    – Peter Cordes
    yesterday












  • But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge & later have fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: Mentioned in a row in table 3 in this paper: publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
    – Peter Cordes
    yesterday












  • @PeterCordes That's pretty counterintuitive and strange. I always thought micro-fused uops stay fused until they are dispatched to an execution port. I double-checked Agner Fog's manual, and it also says the uop stays fused in the RS. It even says on page 92 that saving a ROB entry has been an advantage of micro-fusion since PM, which is quite reasonable. Are you sure the ROB is unfused-domain until IvyBridge?
    – liliscent
    yesterday
















java assembly jvm jit jvm-hotspot














edited 20 hours ago









Henrik Schumacher

1433














asked yesterday









QIvan

1586
























1 Answer


















35














That must be the thread-local handshake poll.
Look at where %r11 is loaded from. If it is loaded from some offset from %r15 (thread-local storage), that is the poll. See the example here:



  0.31%  ↗  ...70: movzbl 0x94(%r9),%r10d
  0.19%  │  ...78: mov    0x108(%r15),%r11   ; read the thread-local page addr
 25.62%  │  ...7f: add    $0x1,%rbp
 35.10%  │  ...83: test   %eax,(%r11)        ; thread-local handshake poll
 34.91%  │  ...86: test   %r10d,%r10d
         ╰  ...89: je     ...70


It is not useless: it causes a SEGV once the guard page is marked non-readable, and that transfers control to the JVM's SEGV handler. This is part of the JVM's mechanism for bringing Java threads to a safepoint, e.g. for GC.



UPD: Hopefully, more details here.














        edited yesterday

























        answered yesterday









        Aleksey Shipilev

        13.8k23769

































