What kind of metadata is primarily being loaded/evicted from ARC, in my ZFS system?



























I'm trying to check how my ZFS pool's ARC is used, motivated by the question "do I need more RAM, even though it's expensive?"



I have 128 GB of fast ECC RAM and NVMe SSDs for L2ARC, but the system is still doing a lot of small-IO read thrashing related to a mix of DDT and spacemap metadata. The system is specced for dedup and gets roughly a 3.5x to 4x dedup ratio, so please, no "DDT is bad" replies (I know, but I need it); I'm working to minimise the remaining read thrash by ensuring that at least the DDT/spacemap metadata is pretty much all retained in ARC once the system is warm.



The way I expect RAM to be used: the DDT is about 35 to 40 GB, and I've used sysctls to reserve 85 GB of ARC for metadata. I've also set the spacemap block size to a larger value and defragmented the pool (by copying it to a new pool), which looks like it's helping a lot. But because I can't see metrics for how much of each type of metadata (DDT / spacemap / other) is loaded or evicted, and there aren't tools to set the ZFS DDT block size or preload DDT entries into ARC, I'm in the dark about the exact impact, whether more RAM will help, or other systematic ways to do better.
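
Roughly, the sysctls I mean are along these lines in /boot/loader.conf (treat the exact names as illustrative; these are the pre-OpenZFS-2.0 FreeBSD tunables and they vary by version):

    # metadata reservation in the ARC (85 GiB = 85 * 2^30 bytes)
    vfs.zfs.arc_meta_limit="91268055040"   # ceiling on metadata held in ARC
    vfs.zfs.arc_meta_min="91268055040"     # amount of metadata the ARC tries not to evict below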



I've looked for solutions: zdb, arcstats and so on don't expose a breakdown of the metadata situation in ARC, just a lump sum for all metadata.
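
For example, on FreeBSD the ARC counters are exposed as sysctls under kstat.zfs.misc.arcstats, and the metadata figures there are single totals (names vary slightly by release):

    sysctl kstat.zfs.misc.arcstats.arc_meta_used \
           kstat.zfs.misc.arcstats.arc_meta_limit \
           kstat.zfs.misc.arcstats.metadata_size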



Is there a straightforward way, even an imprecise one, to get a sense of what's going on, i.e. some breakdown of how much DDT / spacemap / "other" metadata is being loaded, cached (MRU/MFU) and evicted, so that I can assess whether more RAM will help?










performance freebsd zfs freenas dtrace

asked Feb 12 at 12:39 – Stilez
1 Answer
I don’t think there’s any built-in tool like arcstats for this purpose, especially since you’re on FreeBSD, which I’m guessing doesn’t have mdb (from illumos / Solaris).



The simplest solution would be to add a bit more RAM and see if that helps. Of course that costs money, but it may well be cheaper than the time you’d spend figuring out the answer (if you’re being paid for this work).



The next easiest thing to try would be adjusting the ARC memory limits while running a test workload. The impact these have on workloads is often unintuitive, so my recommendation is to start from all-default settings and then change things gradually. It’s possible, for example, that when you tried to set the minimum memory reserved for metadata you actually set the maximum by mistake, which would cause exactly the thrashing you describe, so read the descriptions of these settings carefully before changing them.
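
On FreeBSD, assuming the vfs.zfs.arc_meta_* sysctls exist on your build, a quick way to confirm which knob you actually changed is to print each one's description along with its current value:

    # -d prints the one-line description for each sysctl
    sysctl -d vfs.zfs.arc_meta_min vfs.zfs.arc_meta_limit
    sysctl    vfs.zfs.arc_meta_min vfs.zfs.arc_meta_limit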



Finally, if you’re a bit intrepid and fairly confident in your ability to analyze complicated data, you can also use DTrace for this. This probe fires each time there’s a cache miss in the ARC:



    DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
        uint64_t, lsize, zbookmark_phys_t *, zb);


So you could write a D script that listens for this probe (it shows up in DTrace as sdt:::arc-miss, since the double underscore in the macro becomes a dash) and uses args[0], args[1], and/or the backtrace to figure out what kind of requests are missing the cache.
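
As a quick sanity check before doing anything clever, something like this (a sketch, assuming the probe really is exposed as sdt:::arc-miss on your kernel) will tell you whether misses are firing and how fast:

    # count ARC misses and print the total every 10 seconds
    dtrace -n '
    sdt:::arc-miss { @misses = count(); }
    tick-10s      { printa("ARC misses in the last 10s: %@d\n", @misses); clear(@misses); }'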



I suspect the easiest way to go is to look at the type of the blkptr_t in args[1]. Unfortunately that’s somewhat annoying to extract because it’s packed into a bitfield. The block pointer layout is defined in sys/spa.h; you want your DTrace script to output the same thing that BP_GET_TYPE(args[1]) would, and then interpret those values by comparing them against the dmu_object_type_t enum in sys/dmu.h.
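
A sketch of that approach (assumptions: the probe is sdt:::arc-miss, blk_prop is the fourth 64-bit word of blkptr_t, i.e. byte offset 48 after the three DVAs, and BP_GET_TYPE() is bits 48 to 55 of blk_prop; check these against your source tree before trusting the numbers):

    # count ARC misses by DMU object type number (compare against dmu_object_type_t)
    dtrace -n '
    sdt:::arc-miss
    /arg1 != 0/
    {
        /* blk_prop sits at byte offset 48 in blkptr_t (after dva_t blk_dva[3]);
           BP_GET_TYPE() is BF64_GET(blk_prop, 48, 8), i.e. bits 48..55 */
        this->prop = *(uint64_t *)((uintptr_t)arg1 + 48);
        @miss_by_type[(this->prop >> 48) & 0xff] = count();
    }
    END { printa("type %d: %@d misses\n", @miss_by_type); }'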



Alternatively, a simpler script to write (though potentially more involved to interpret) would collect the kernel backtrace every time the probe fires; you can then post-process the traces into a flame graph for easier reading. ZFS function names are generally descriptive (at the very least they use searchable acronyms, e.g. ddt for “dedup table”), so you can probably figure out what the callers are doing that way.
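
A sketch of that (same sdt:::arc-miss assumption; stackcollapse.pl and flamegraph.pl are the post-processing scripts from Brendan Gregg's FlameGraph repository):

    # collect miss backtraces for 60 seconds, then fold them into a flame graph
    dtrace -x stackframes=100 -n '
    sdt:::arc-miss { @[stack()] = count(); }
    tick-60s      { exit(0); }' -o arc-miss-stacks.txt

    stackcollapse.pl arc-miss-stacks.txt | flamegraph.pl > arc-miss.svg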



If a lot of what shows up isn’t plain file or directory data, you probably need to keep more metadata in cache. You can do that either by dedicating more space to it via the tunables, or by giving the machine more RAM.






answered Feb 13 at 13:27, edited Feb 13 at 13:39 – Dan
• This is a fantastically promising answer. I'm happy to explore with dtrace (hence the tag). Rather than a long ramble in comments, would you be willing to join me in chat, to narrow it down and to ask a couple of technical questions from your reply? If so, I've set up a room at chat.stackexchange.com/rooms/89701/… and (time zones allowing) hope this is good for you too – Stilez Feb 13 at 23:09











