Why has the size of L1 cache not increased very much over the last 20 years?

The Intel i486 has 8 KB of L1 cache. The Intel Nehalem has 32 KB L1 instruction cache and 32 KB L1 data cache per core.

The amount of L1 cache hasn't increased at nearly the rate the clock rate has.

Why not?

cpu architecture cpu-cache progress

asked Nov 18 '09 at 16:45 by eleven81; edited Dec 15 '15 at 12:39 by Hennes
  • You are comparing apples to oranges. Clock rates have increased, but there is no correlation to the need for more cache. Just because you can do something faster doesn't mean you benefit from a bigger bucket.

    – Keltari
    May 26 '13 at 4:45











  • Excess cache and the management overhead can slow a system down. They've found the sweet spot and there it shall remain.

    – Fiasco Labs
    May 26 '13 at 4:54
6 Answers


















14














30K of Wikipedia text isn't as helpful as an explanation of why too large a cache is less optimal. When the cache gets too large, the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory. I don't know what proportions CPU designers aim for, but I would think it is something analogous to the 80-20 guideline: you'd like to find your most common data in the cache 80% of the time, and the other 20% of the time you'll have to go to main memory to find it (or whatever proportions the CPU designers actually intend).



EDIT: I'm sure it's nowhere near 80%/20%, so substitute X and 1-X. :)
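To make this tradeoff concrete, here is a minimal back-of-the-envelope sketch using the standard average-memory-access-time formula (AMAT = hit time + miss rate × miss penalty). All cycle counts and miss rates below are hypothetical, chosen only to illustrate how a larger-but-slower L1 can end up no faster than a smaller-but-faster one despite a better hit rate:

```python
# All numbers are hypothetical, for illustration only.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty

MISS_PENALTY = 200  # assumed cycles to fetch from main memory

small_fast_l1 = amat(hit_time=4,  miss_rate=0.10, miss_penalty=MISS_PENALTY)
large_slow_l1 = amat(hit_time=12, miss_rate=0.07, miss_penalty=MISS_PENALTY)

print(f"small, fast L1: {small_fast_l1:.1f} cycles")  # 24.0
print(f"large, slow L1: {large_slow_l1:.1f} cycles")  # 26.0
```

With these assumed numbers, the larger cache's extra hit latency eats up its hit-rate advantage; real designs balance the same terms against L2 rather than main memory.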






share|improve this answer



















  • 6

    "When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example, doubling the amount of installed RAM will certainly not increase its latency, so why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :)

    – sYnfo
    Nov 18 '09 at 19:18






  • 2





    I had prepared a big, long description of caching in software, and measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In either case, I suspect the answer can be summed up by the law of diminishing returns. I.e. more is not always better.

    – JMD
    Nov 18 '09 at 19:48






  • 3





    From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not the size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere to overall performance.

    – Brian Knoblauch
    Nov 18 '09 at 20:09






  • 1

    @JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for this, true. @Brian So, if I understand it correctly, they decided to put fewer transistors in L1 cache and at the same time put many more in L2, which is significantly slower? Please take no offense, I'm just curious :)

    – sYnfo
    Nov 18 '09 at 20:30



















9














One factor is that L1 fetches start before the TLB translation is complete, so as to decrease latency. With a small enough cache and high enough associativity, the index bits for the cache are the same between virtual and physical addresses. This probably decreases the cost of maintaining memory coherency with a virtually-indexed, physically-tagged cache.
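To see the arithmetic behind this, here is a minimal sketch using the figures b_jonas quotes in the comments below (4096-byte x86 pages, 64-byte lines, 8-way associativity, 32 KB cache); the constraint is that set-index bits plus line-offset bits must fit within the page-offset bits if the set is to be chosen before translation finishes:

```python
import math

# Figures from the comments below; the cache size is what the
# constraint ends up pinning down.
page_size  = 4096        # bytes per x86 page   -> 12 page-offset bits
line_size  = 64          # bytes per cache line -> 6 line-offset bits
ways       = 8           # associativity
cache_size = 32 * 1024   # 32 KB L1 data cache

sets             = cache_size // (line_size * ways)  # 64 sets
index_bits       = int(math.log2(sets))              # 6
offset_bits      = int(math.log2(line_size))         # 6
page_offset_bits = int(math.log2(page_size))         # 12

# Index + offset bits must not exceed the page-offset bits for the
# set selection to be identical in virtual and physical addresses.
print(index_bits + offset_bits, "<=", page_offset_bits,
      "->", index_bits + offset_bits <= page_offset_bits)
```

Growing the cache at fixed page and line size then means adding ways rather than sets, which is exactly why 32 KB (64 sets × 8 ways × 64 bytes) shows up so consistently.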






answered May 26 '13 at 4:19 by AJW
  • 1





    most interesting answer :)

    – GameDeveloper
    Jan 3 '14 at 16:52






  • 1





    I believe this is the reason, but let me give the number. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the cache bucket in which to look for the entry of the cache line (64 bytes) before the page translation is complete. It would be expensive to have to decide between too many entries in a bucket, so each bucket only has 8 entries in it. As a result, for the last ten years, all the expensive x86 cpus have exactly 32768 bytes (512 cache lines) in their L1 data cache.

    – b_jonas
    Sep 3 '15 at 19:53











  • As this is so hard to increase, the cpus add a middle level of cache, so we have separate L2 and L3 caches now. Also, the L1 code cache and L1 data cache are separate, because the CPU knows if it's accessing code or data.

    – b_jonas
    Sep 3 '15 at 19:54



















7














Cache size is influenced by many factors:

  1. Speed of electrical signals (if not the speed of light, something of the same order of magnitude):

    • 300 meters in one microsecond.

    • 30 centimeters in one nanosecond.

  2. Economic cost (circuits at different cache levels may be different, and certain cache sizes may not be worthwhile):

    • Doubling cache size does not double performance (even if physics allowed that size to work): for small sizes, doubling gives much more than double the performance; for big sizes, it gives almost no extra performance.

    • On Wikipedia you can find a chart showing, for example, how little is gained by making caches bigger than 1 MB (bigger caches do exist, but keep in mind that those serve multiple processor cores).

    • For L1 caches there are presumably similar charts (that vendors don't publish) that make 64 KB a convenient size.

If the L1 cache size didn't change after 64 KB, it's because growing it further was no longer worthwhile. Also note that there is now a greater "culture" of cache awareness: many programmers write "cache-friendly" code and/or use prefetch instructions to reduce latency.

I once tried writing a simple program that accessed random locations in an array of several megabytes: it almost froze the computer, because each random read pulled a whole cache line in from RAM, and doing that constantly drained all the memory bandwidth, leaving very few resources for the OS.
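For the curious, here is a minimal sketch in the spirit of that experiment (assuming numpy is available; the array size and timings are illustrative and machine-dependent), comparing a cache-friendly sequential pass with the random-access pattern described:

```python
import time
import numpy as np

N = 16 * 1024 * 1024                      # 16M int64 elements, about 128 MB
data = np.arange(N, dtype=np.int64)

seq_idx  = np.arange(N)                   # cache/prefetch-friendly order
rand_idx = np.random.permutation(N)       # defeats caching and prefetching

for name, idx in (("sequential", seq_idx), ("random", rand_idx)):
    t0 = time.perf_counter()
    checksum = data[idx].sum()            # gather, then reduce
    dt = time.perf_counter() - t0
    print(f"{name:>10}: {dt:.3f} s (checksum {checksum})")
```

The random pass touches a different cache line on nearly every read, so it runs many times slower than the sequential pass even though both do the same amount of arithmetic.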






answered Jan 3 '14 at 16:39 by GameDeveloper


6














I believe it can be summed up simply by stating that the bigger the cache, the slower the access will be. So a larger cache simply doesn't help, since a cache's whole purpose is to avoid slow bus communication with RAM.

Since the speed of the processor has been increasing rapidly, the same-sized cache must perform faster and faster in order to keep up with it. So the caches may be significantly better in terms of speed, but not in terms of storage.

(I'm a software guy, so hopefully this isn't woefully wrong.)






answered Nov 18 '09 at 16:57 by Andrew Flanagan


3














From L1 cache:

    The Level 1 cache, or primary cache, is on the CPU and is used for temporary storage of instructions and data organised in blocks of 32 bytes. Primary cache is the fastest form of storage. Because it's built in to the chip with a zero wait-state (delay) interface to the processor's execution unit, it is limited in size.

    SRAM uses two transistors per bit and can hold data without external assistance, for as long as power is supplied to the circuit. This is contrasted to dynamic RAM (DRAM), which must be refreshed many times per second in order to hold its data contents.

    Intel's P55 MMX processor, launched at the start of 1997, was noteworthy for the increase in size of its Level 1 cache to 32 KB. The AMD K6 and Cyrix M2 chips launched later that year upped the ante further by providing Level 1 caches of 64 KB. 64 KB has remained the standard L1 cache size, though various multiple-core processors may utilise it differently.

EDIT: Please note that this answer is from 2009 and CPUs have evolved enormously in the last 10 years. If you have arrived at this post, don't take all our answers here too seriously.






answered Nov 18 '09 at 16:55 by harrymc; edited Jan 23 at 9:18

  • A typical SRAM cell is made up of six MOSFETs. Each bit in an SRAM is stored on four transistors (M1, M2, M3, M4) that form two cross-coupled inverters. Source Second Source

    – lukecampbell
    May 28 '13 at 16:44

  • This is just a description of the situation, and does not explain anything about why.

    – Eonil
    Jan 23 at 3:45

  • @Eonil - We could not provide the “why” answer even if we wanted to. However, diminishing returns on performance is a reasonable explanation. When the question was written nearly a decade ago, it was much more expensive to increase the size without incurring a performance hit. This answer attempted to at least address the intended question that was asked.

    – Ramhound
    Jan 23 at 10:07


-4














Actually, L1 cache size IS the biggest bottleneck for speed in modern computers. The pathetically tiny L1 cache sizes may be the sweet spot for price, but not for performance. L1 cache can be accessed at GHz frequencies, the same as processor operations, unlike RAM access, which is 400x slower. It is expensive and difficult to implement in the current two-dimensional design; however, it is technically doable, and the first company that does it successfully will have computers hundreds of times faster that still run cool, something which would produce major innovations in many fields that are currently accessible only through expensive and difficult-to-program ASIC/FPGA configurations.

Some of these issues stem from proprietary/IP concerns and corporate greed spanning decades now, where a puny and ineffectual cadre of engineers are the only ones with access to the inner workings, and who are mostly given marching orders to squeeze out cost-effective, obfuscated, protectionist nonsense. Overly privatized research always leads to such technological stagnation or throttling (as we have seen in aerospace and autos from the big manufacturers, and soon pharma). Open source and more sensible patent and trade-secret regulation benefiting the inventors and the public (rather than company bosses and stockholders) would help a lot here. It should be a no-brainer to develop much larger L1 caches, and this could and should have been done decades ago. We would be a lot further ahead in computers, and in the many scientific fields using them, if we had.






      share|improve this answer
























        protected by harrymc Jan 23 at 9:18



        Thank you for your interest in this question.
        Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



        Would you like to answer one of these unanswered questions instead?














        6 Answers
        6






        active

        oldest

        votes








        6 Answers
        6






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        14














        30K of Wikipedia text isn't as helpful as an explanation of why too large of a cache is less optimal. When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory. I don't know what proportions CPU designers aim for, but I would think it is something analogous to the 80-20 guideline: You'd like to find your most common data in the cache 80% of the time, and the other 20% of the time you'll have to go to main memory to find it. (or whatever the CPU designers intended proportions may be.)



        EDIT: I'm sure it's nowhere near 80%/20%, so substitute X and 1-X. :)






        share|improve this answer



















        • 6





          "When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example doubling the amount of installed RAM will certainly not increase it's latency, why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :)

          – sYnfo
          Nov 18 '09 at 19:18






        • 2





          I had prepared a big, long description of caching in software, and measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In either case, I suspect the answer can be summed up by the law of diminishing returns. I.e. more is not always better.

          – JMD
          Nov 18 '09 at 19:48






        • 3





          From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not the size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere to overall performance.

          – Brian Knoblauch
          Nov 18 '09 at 20:09






        • 1





          @JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for this, true. @Brian So, if I understand it correctly, they decided to put less transistors in L1 cache and in the same time put much more in L2, which is significantly slower? Please take no offense, I'm just curious :)

          – sYnfo
          Nov 18 '09 at 20:30
















        14














        30K of Wikipedia text isn't as helpful as an explanation of why too large of a cache is less optimal. When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory. I don't know what proportions CPU designers aim for, but I would think it is something analogous to the 80-20 guideline: You'd like to find your most common data in the cache 80% of the time, and the other 20% of the time you'll have to go to main memory to find it. (or whatever the CPU designers intended proportions may be.)



        EDIT: I'm sure it's nowhere near 80%/20%, so substitute X and 1-X. :)






        share|improve this answer



















        • 6





          "When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example doubling the amount of installed RAM will certainly not increase it's latency, why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :)

          – sYnfo
          Nov 18 '09 at 19:18






        • 2





          I had prepared a big, long description of caching in software, and measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In either case, I suspect the answer can be summed up by the law of diminishing returns. I.e. more is not always better.

          – JMD
          Nov 18 '09 at 19:48






        • 3





          From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not the size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere to overall performance.

          – Brian Knoblauch
          Nov 18 '09 at 20:09






        • 1





          @JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for this, true. @Brian So, if I understand it correctly, they decided to put less transistors in L1 cache and in the same time put much more in L2, which is significantly slower? Please take no offense, I'm just curious :)

          – sYnfo
          Nov 18 '09 at 20:30














        14












        14








        14







        30K of Wikipedia text isn't as helpful as an explanation of why too large of a cache is less optimal. When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory. I don't know what proportions CPU designers aim for, but I would think it is something analogous to the 80-20 guideline: You'd like to find your most common data in the cache 80% of the time, and the other 20% of the time you'll have to go to main memory to find it. (or whatever the CPU designers intended proportions may be.)



        EDIT: I'm sure it's nowhere near 80%/20%, so substitute X and 1-X. :)






        share|improve this answer













        30K of Wikipedia text isn't as helpful as an explanation of why too large of a cache is less optimal. When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory. I don't know what proportions CPU designers aim for, but I would think it is something analogous to the 80-20 guideline: You'd like to find your most common data in the cache 80% of the time, and the other 20% of the time you'll have to go to main memory to find it. (or whatever the CPU designers intended proportions may be.)



        EDIT: I'm sure it's nowhere near 80%/20%, so substitute X and 1-X. :)







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 18 '09 at 16:59









        JMDJMD

        4,16411724




        4,16411724








        • 6





          "When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example doubling the amount of installed RAM will certainly not increase it's latency, why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :)

          – sYnfo
          Nov 18 '09 at 19:18






        • 2





          I had prepared a big, long description of caching in software, and measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In either case, I suspect the answer can be summed up by the law of diminishing returns. I.e. more is not always better.

          – JMD
          Nov 18 '09 at 19:48






        • 3





          From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not the size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere to overall performance.

          – Brian Knoblauch
          Nov 18 '09 at 20:09






        • 1





          @JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for this, true. @Brian So, if I understand it correctly, they decided to put less transistors in L1 cache and in the same time put much more in L2, which is significantly slower? Please take no offense, I'm just curious :)

          – sYnfo
          Nov 18 '09 at 20:30














        • 6





          "When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example doubling the amount of installed RAM will certainly not increase it's latency, why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :)

          – sYnfo
          Nov 18 '09 at 19:18






        • 2





          I had prepared a big, long description of caching in software, and measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In either case, I suspect the answer can be summed up by the law of diminishing returns. I.e. more is not always better.

          – JMD
          Nov 18 '09 at 19:48






        • 3





          From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not the size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere to overall performance.

          – Brian Knoblauch
          Nov 18 '09 at 20:09






        • 1





          @JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for this, true. @Brian So, if I understand it correctly, they decided to put less transistors in L1 cache and in the same time put much more in L2, which is significantly slower? Please take no offense, I'm just curious :)

          – sYnfo
          Nov 18 '09 at 20:30








        6




        6





        "When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example doubling the amount of installed RAM will certainly not increase it's latency, why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :)

        – sYnfo
        Nov 18 '09 at 19:18





        "When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example doubling the amount of installed RAM will certainly not increase it's latency, why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :)

        – sYnfo
        Nov 18 '09 at 19:18




        2




        2





        I had prepared a big, long description of caching in software, and measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In either case, I suspect the answer can be summed up by the law of diminishing returns. I.e. more is not always better.

        – JMD
        Nov 18 '09 at 19:48





        I had prepared a big, long description of caching in software, and measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In either case, I suspect the answer can be summed up by the law of diminishing returns. I.e. more is not always better.

        – JMD
        Nov 18 '09 at 19:48




        3




        3





        From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not the size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere to overall performance.

        – Brian Knoblauch
        Nov 18 '09 at 20:09





        From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not the size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere to overall performance.

        – Brian Knoblauch
        Nov 18 '09 at 20:09




        1




        1





        @JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for this, true. @Brian So, if I understand it correctly, they decided to put less transistors in L1 cache and in the same time put much more in L2, which is significantly slower? Please take no offense, I'm just curious :)

        – sYnfo
        Nov 18 '09 at 20:30





        @JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for this, true. @Brian So, if I understand it correctly, they decided to put less transistors in L1 cache and in the same time put much more in L2, which is significantly slower? Please take no offense, I'm just curious :)

        – sYnfo
        Nov 18 '09 at 20:30













        9














        One factor is that L1 fetches start before the TLB translations are complete so as to decrease latency. With a small enough cache and high enough way the index bits for the cache will be the same between virtual and physical addresses. This probably decreases the cost of maintaining memory coherency with a virtually-indexed, physically-tagged cache.






        share|improve this answer



















        • 1





          most interesting answer:)

          – GameDeveloper
          Jan 3 '14 at 16:52






        • 1





          I believe this is the reason, but let me give the number. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the cache bucket in which to look for the entry of the cache line (64 bytes) before the page translation is complete. It would be expensive to have to decide between too many entries in a bucket, so each bucket only has 8 entries in it. As a result, for the last ten years, all the expensive x86 cpus have exactly 32768 bytes (512 cache lines) in their L1 data cache.

          – b_jonas
          Sep 3 '15 at 19:53











        • As this is so hard to increase, the cpus add a middle level of cache, so we have separate L2 and L3 caches now. Also, the L1 code cache and L1 data cache are separate, because the CPU knows if it's accessing code or data.

          – b_jonas
          Sep 3 '15 at 19:54
















        9














        One factor is that L1 fetches start before the TLB translations are complete so as to decrease latency. With a small enough cache and high enough way the index bits for the cache will be the same between virtual and physical addresses. This probably decreases the cost of maintaining memory coherency with a virtually-indexed, physically-tagged cache.






        share|improve this answer



















        • 1





          most interesting answer:)

          – GameDeveloper
          Jan 3 '14 at 16:52






        • 1





          I believe this is the reason, but let me give the number. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the cache bucket in which to look for the entry of the cache line (64 bytes) before the page translation is complete. It would be expensive to have to decide between too many entries in a bucket, so each bucket only has 8 entries in it. As a result, for the last ten years, all the expensive x86 cpus have exactly 32768 bytes (512 cache lines) in their L1 data cache.

          – b_jonas
          Sep 3 '15 at 19:53











        • As this is so hard to increase, the cpus add a middle level of cache, so we have separate L2 and L3 caches now. Also, the L1 code cache and L1 data cache are separate, because the CPU knows if it's accessing code or data.

          – b_jonas
          Sep 3 '15 at 19:54














        9












        9








        9







        One factor is that L1 fetches start before the TLB translations are complete so as to decrease latency. With a small enough cache and high enough way the index bits for the cache will be the same between virtual and physical addresses. This probably decreases the cost of maintaining memory coherency with a virtually-indexed, physically-tagged cache.






        share|improve this answer













        One factor is that L1 fetches start before the TLB translations are complete so as to decrease latency. With a small enough cache and high enough way the index bits for the cache will be the same between virtual and physical addresses. This probably decreases the cost of maintaining memory coherency with a virtually-indexed, physically-tagged cache.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered May 26 '13 at 4:19









        AJWAJW

        9111




        9111








        • 1





          most interesting answer:)

          – GameDeveloper
          Jan 3 '14 at 16:52






        • 1





          I believe this is the reason, but let me give the number. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the cache bucket in which to look for the entry of the cache line (64 bytes) before the page translation is complete. It would be expensive to have to decide between too many entries in a bucket, so each bucket only has 8 entries in it. As a result, for the last ten years, all the expensive x86 cpus have exactly 32768 bytes (512 cache lines) in their L1 data cache.

          – b_jonas
          Sep 3 '15 at 19:53











        • As this is so hard to increase, the cpus add a middle level of cache, so we have separate L2 and L3 caches now. Also, the L1 code cache and L1 data cache are separate, because the CPU knows if it's accessing code or data.

          – b_jonas
          Sep 3 '15 at 19:54














        • 1





          most interesting answer:)

          – GameDeveloper
          Jan 3 '14 at 16:52






        • 1





          I believe this is the reason, but let me give the number. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the cache bucket in which to look for the entry of the cache line (64 bytes) before the page translation is complete. It would be expensive to have to decide between too many entries in a bucket, so each bucket only has 8 entries in it. As a result, for the last ten years, all the expensive x86 cpus have exactly 32768 bytes (512 cache lines) in their L1 data cache.

          – b_jonas
          Sep 3 '15 at 19:53











        • As this is so hard to increase, the cpus add a middle level of cache, so we have separate L2 and L3 caches now. Also, the L1 code cache and L1 data cache are separate, because the CPU knows if it's accessing code or data.

          – b_jonas
          Sep 3 '15 at 19:54








        1




        1





        most interesting answer:)

        – GameDeveloper
        Jan 3 '14 at 16:52





        most interesting answer:)

        – GameDeveloper
        Jan 3 '14 at 16:52




        1




        1





        I believe this is the reason, but let me give the number. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the cache bucket in which to look for the entry of the cache line (64 bytes) before the page translation is complete. It would be expensive to have to decide between too many entries in a bucket, so each bucket only has 8 entries in it. As a result, for the last ten years, all the expensive x86 cpus have exactly 32768 bytes (512 cache lines) in their L1 data cache.

        – b_jonas
        Sep 3 '15 at 19:53





        I believe this is the reason, but let me give the number. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the cache bucket in which to look for the entry of the cache line (64 bytes) before the page translation is complete. It would be expensive to have to decide between too many entries in a bucket, so each bucket only has 8 entries in it. As a result, for the last ten years, all the expensive x86 cpus have exactly 32768 bytes (512 cache lines) in their L1 data cache.

        – b_jonas
        Sep 3 '15 at 19:53













        As this is so hard to increase, the cpus add a middle level of cache, so we have separate L2 and L3 caches now. Also, the L1 code cache and L1 data cache are separate, because the CPU knows if it's accessing code or data.

        – b_jonas
        Sep 3 '15 at 19:54





        As this is so hard to increase, the cpus add a middle level of cache, so we have separate L2 and L3 caches now. Also, the L1 code cache and L1 data cache are separate, because the CPU knows if it's accessing code or data.

        – b_jonas
        Sep 3 '15 at 19:54











        7














        Cache size is influenced by many factors:





        1. Speed of electric signals (should be if not the speed of light, something of same order of magnitude):




          • 300 meters in one microsecond.

          • 30 centimeters in one nanosecond.




        2. Economic cost (circuits at different cache levels may be different and certain cache sizes may be unworth)




          • Doubling cache size does not double performance (even if physics allowed that size to work) for small sizes doubling gives much more than double performance, for big sizes doubling cache size gives almost no extra performance.

          • At wikipedia you can find a chart showing for example how unworth is making caches bigger than 1MB (actually bigger caches exist but you must keep in count that those are multiprocessor cores.)

          • For L1 caches there should be some other charts (that vendors don't show) that make convenient 64 Kb as size.




        If L1 cache size didn't changed after 64kb it's because it was no longer worth. Also note that now there's a greater "culture" about cache and many programmers write "cache-friendly" code and/or use prefetech instructions to reduce latency.



        I tried once creating a simple program that was accessing random locations in an array (of several MegaBytes):
        that program almost freezed the computer because for each random read a whole page was moved from RAM to cache and since that was done very often that simple program was draining out all bandwith leaving really few resources for the OS.






        share|improve this answer




























          7














          Cache size is influenced by many factors:





          1. Speed of electric signals (should be if not the speed of light, something of same order of magnitude):




            • 300 meters in one microsecond.

            • 30 centimeters in one nanosecond.




          2. Economic cost (circuits at different cache levels may be different and certain cache sizes may be unworth)




            • Doubling cache size does not double performance (even if physics allowed that size to work) for small sizes doubling gives much more than double performance, for big sizes doubling cache size gives almost no extra performance.

            • At wikipedia you can find a chart showing for example how unworth is making caches bigger than 1MB (actually bigger caches exist but you must keep in count that those are multiprocessor cores.)

            • For L1 caches there should be some other charts (that vendors don't show) that make convenient 64 Kb as size.




          If L1 cache size didn't changed after 64kb it's because it was no longer worth. Also note that now there's a greater "culture" about cache and many programmers write "cache-friendly" code and/or use prefetech instructions to reduce latency.



          I tried once creating a simple program that was accessing random locations in an array (of several MegaBytes):
          that program almost freezed the computer because for each random read a whole page was moved from RAM to cache and since that was done very often that simple program was draining out all bandwith leaving really few resources for the OS.






          share|improve this answer


























            7












            7








            7







            Cache size is influenced by many factors:





            1. Speed of electric signals (should be if not the speed of light, something of same order of magnitude):




              • 300 meters in one microsecond.

              • 30 centimeters in one nanosecond.




            2. Economic cost (circuits at different cache levels may be different and certain cache sizes may be unworth)




              • Doubling cache size does not double performance (even if physics allowed that size to work) for small sizes doubling gives much more than double performance, for big sizes doubling cache size gives almost no extra performance.

              • At wikipedia you can find a chart showing for example how unworth is making caches bigger than 1MB (actually bigger caches exist but you must keep in count that those are multiprocessor cores.)

              • For L1 caches there should be some other charts (that vendors don't show) that make convenient 64 Kb as size.




            If L1 cache size didn't changed after 64kb it's because it was no longer worth. Also note that now there's a greater "culture" about cache and many programmers write "cache-friendly" code and/or use prefetech instructions to reduce latency.



            I tried once creating a simple program that was accessing random locations in an array (of several MegaBytes):
            that program almost freezed the computer because for each random read a whole page was moved from RAM to cache and since that was done very often that simple program was draining out all bandwith leaving really few resources for the OS.






            share|improve this answer













            Cache size is influenced by many factors:





            1. Speed of electric signals (should be if not the speed of light, something of same order of magnitude):




              • 300 meters in one microsecond.

              • 30 centimeters in one nanosecond.




            2. Economic cost (circuits at different cache levels may be different and certain cache sizes may be unworth)




              • Doubling cache size does not double performance (even if physics allowed that size to work) for small sizes doubling gives much more than double performance, for big sizes doubling cache size gives almost no extra performance.

              • At wikipedia you can find a chart showing for example how unworth is making caches bigger than 1MB (actually bigger caches exist but you must keep in count that those are multiprocessor cores.)

              • For L1 caches there should be some other charts (that vendors don't show) that make convenient 64 Kb as size.




            If L1 cache size didn't changed after 64kb it's because it was no longer worth. Also note that now there's a greater "culture" about cache and many programmers write "cache-friendly" code and/or use prefetech instructions to reduce latency.



            I tried once creating a simple program that was accessing random locations in an array (of several MegaBytes):
            that program almost freezed the computer because for each random read a whole page was moved from RAM to cache and since that was done very often that simple program was draining out all bandwith leaving really few resources for the OS.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Jan 3 '14 at 16:39









            GameDeveloperGameDeveloper

            16914




            16914























                6














                I believe it can be summed up simply by stating that the bigger the cache, the slower the access will be. So a larger cache simply doesn't help as a cache is designed to reduce slow bus communication to RAM.



                Since the speed of the processor has been increasing rapidly, the same-sized cache must perform faster and faster in order to keep up with it. So the caches may be significantly better (in terms of speed) but not in terms of storage.



                (I'm a software guy so hopefully this isn't woefully wrong)






                share|improve this answer




























                  6














                  I believe it can be summed up simply by stating that the bigger the cache, the slower the access will be. So a larger cache simply doesn't help as a cache is designed to reduce slow bus communication to RAM.



                  Since the speed of the processor has been increasing rapidly, the same-sized cache must perform faster and faster in order to keep up with it. So the caches may be significantly better (in terms of speed) but not in terms of storage.



                  (I'm a software guy so hopefully this isn't woefully wrong)






                  share|improve this answer


























                    6












                    6








                    6







                    I believe it can be summed up simply by stating that the bigger the cache, the slower the access will be. So a larger cache simply doesn't help as a cache is designed to reduce slow bus communication to RAM.



                    Since the speed of the processor has been increasing rapidly, the same-sized cache must perform faster and faster in order to keep up with it. So the caches may be significantly better (in terms of speed) but not in terms of storage.



                    (I'm a software guy so hopefully this isn't woefully wrong)






                    share|improve this answer













                    I believe it can be summed up simply by stating that the bigger the cache, the slower the access will be. So a larger cache simply doesn't help as a cache is designed to reduce slow bus communication to RAM.



                    Since the speed of the processor has been increasing rapidly, the same-sized cache must perform faster and faster in order to keep up with it. So the caches may be significantly better (in terms of speed) but not in terms of storage.



                    (I'm a software guy so hopefully this isn't woefully wrong)







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Nov 18 '09 at 16:57









                    Andrew FlanaganAndrew Flanagan

                    1,4051415




                    1,4051415























                        3














                        From L1 cache:




                        The Level 1 cache, or primary cache,
                        is on the CPU and is used for
                        temporary storage of instructions and
                        data organised in blocks of 32 bytes.
                        Primary cache is the fastest form of
                        storage. Because it's built in to the chip with a zero wait-state (delay)
                        interface to the processor's execution unit, it is limited in size.



                        SRAM uses two transistors per bit and
                        can hold data without external
                        assistance, for as long as power is
                        supplied to the circuit. This is
                        contrasted to dynamic RAM (DRAM),
                        which must be refreshed many times per
                        second in order to hold its data
                        contents.



                        Intel's P55 MMX processor, launched at
                        the start of 1997, was noteworthy for
                        the increase in size of its Level 1
                        cache to 32KB. The AMD K6 and Cyrix M2
                        chips launched later that year upped
                        the ante further by providing Level 1
                        caches of 64KB. 64Kb has remained the
                        standard L1 cache size, though various
                        multiple-core processors may utilise
                        it differently.




                        EDIT: Please note that this answer is from 2009 and CPUs have evolved
                        enormously in the last 10 years. If you have arrived to this post,
                        don't take all our answers here too seriously.






                        share|improve this answer


























                        • A typical SRAM cell is made up of six MOSFETs. Each bit in an SRAM is stored on four transistors (M1, M2, M3, M4) that form two cross-coupled inverters. Source Second Source

                          – lukecampbell
                          May 28 '13 at 16:44











                        • This is just description of situation, and does not explain anything about why.

                          – Eonil
                          Jan 23 at 3:45











                        • @Eonil - We could not provide the “why” answer if we wanted to. However, diminishing returns on the performance is a viable reasonable explanation. When the question was written nearly a decade ago, it was much more expensive, to increase the size without including a performance hit. This answer attempted to at the very least answer the intended question that was asked.

                          – Ramhound
                          Jan 23 at 10:07
















                        3














                        From L1 cache:




                        The Level 1 cache, or primary cache,
                        is on the CPU and is used for
                        temporary storage of instructions and
                        data organised in blocks of 32 bytes.
                        Primary cache is the fastest form of
                        storage. Because it's built in to the chip with a zero wait-state (delay)
                        interface to the processor's execution unit, it is limited in size.



                        SRAM uses two transistors per bit and
                        can hold data without external
                        assistance, for as long as power is
                        supplied to the circuit. This is
                        contrasted to dynamic RAM (DRAM),
                        which must be refreshed many times per
                        second in order to hold its data
                        contents.



                        Intel's P55 MMX processor, launched at
                        the start of 1997, was noteworthy for
                        the increase in size of its Level 1
                        cache to 32KB. The AMD K6 and Cyrix M2
                        chips launched later that year upped
                        the ante further by providing Level 1
                        caches of 64KB. 64Kb has remained the
                        standard L1 cache size, though various
                        multiple-core processors may utilise
                        it differently.




                        EDIT: Please note that this answer is from 2009 and CPUs have evolved
                        enormously in the last 10 years. If you have arrived to this post,
                        don't take all our answers here too seriously.






                        share|improve this answer


























                        • A typical SRAM cell is made up of six MOSFETs. Each bit in an SRAM is stored on four transistors (M1, M2, M3, M4) that form two cross-coupled inverters. Source Second Source

                          – lukecampbell
                          May 28 '13 at 16:44











                        • This is just description of situation, and does not explain anything about why.

                          – Eonil
                          Jan 23 at 3:45











                        • @Eonil - We could not provide the “why” answer if we wanted to. However, diminishing returns on the performance is a viable reasonable explanation. When the question was written nearly a decade ago, it was much more expensive, to increase the size without including a performance hit. This answer attempted to at the very least answer the intended question that was asked.

                          – Ramhound
                          Jan 23 at 10:07














                        3












                        3








                        3







                        From L1 cache:




                        The Level 1 cache, or primary cache,
                        is on the CPU and is used for
                        temporary storage of instructions and
                        data organised in blocks of 32 bytes.
                        Primary cache is the fastest form of
                        storage. Because it's built in to the chip with a zero wait-state (delay)
                        interface to the processor's execution unit, it is limited in size.



                        SRAM uses two transistors per bit and
                        can hold data without external
                        assistance, for as long as power is
                        supplied to the circuit. This is
                        contrasted to dynamic RAM (DRAM),
                        which must be refreshed many times per
                        second in order to hold its data
                        contents.



                        Intel's P55 MMX processor, launched at
                        the start of 1997, was noteworthy for
                        the increase in size of its Level 1
                        cache to 32KB. The AMD K6 and Cyrix M2
                        chips launched later that year upped
                        the ante further by providing Level 1
                        caches of 64KB. 64Kb has remained the
                        standard L1 cache size, though various
                        multiple-core processors may utilise
                        it differently.




                        EDIT: Please note that this answer is from 2009 and CPUs have evolved
                        enormously in the last 10 years. If you have arrived to this post,
                        don't take all our answers here too seriously.






                        share|improve this answer















                        From L1 cache:




                        The Level 1 cache, or primary cache,
                        is on the CPU and is used for
                        temporary storage of instructions and
                        data organised in blocks of 32 bytes.
                        Primary cache is the fastest form of
                        storage. Because it's built in to the chip with a zero wait-state (delay)
                        interface to the processor's execution unit, it is limited in size.



                        SRAM uses two transistors per bit and
                        can hold data without external
                        assistance, for as long as power is
                        supplied to the circuit. This is
                        contrasted to dynamic RAM (DRAM),
                        which must be refreshed many times per
                        second in order to hold its data
                        contents.



                        Intel's P55 MMX processor, launched at
                        the start of 1997, was noteworthy for
                        the increase in size of its Level 1
                        cache to 32KB. The AMD K6 and Cyrix M2
                        chips launched later that year upped
                        the ante further by providing Level 1
                        caches of 64KB. 64Kb has remained the
                        standard L1 cache size, though various
                        multiple-core processors may utilise
                        it differently.




                        EDIT: Please note that this answer is from 2009 and CPUs have evolved
                        enormously in the last 10 years. If you have arrived to this post,
                        don't take all our answers here too seriously.







                        share|improve this answer














                        share|improve this answer



                        share|improve this answer








                        edited Jan 23 at 9:18

























                        answered Nov 18 '09 at 16:55









                        harrymc

                        258k14271573













                        • A typical SRAM cell is made up of six MOSFETs. Each bit in an SRAM is stored on four transistors (M1, M2, M3, M4) that form two cross-coupled inverters. Source Second Source

                          – lukecampbell
                          May 28 '13 at 16:44











                        • This is just a description of the situation; it does not explain anything about why.

                          – Eonil
                          Jan 23 at 3:45











                        • @Eonil - We could not provide the “why” answer even if we wanted to. However, diminishing returns on performance is a reasonable explanation. When the question was written nearly a decade ago, it was much more expensive to increase the size without incurring a performance hit. This answer attempted, at the very least, to answer the intended question that was asked.

                          – Ramhound
                          Jan 23 at 10:07
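
                        Ramhound's diminishing-returns point can be made concrete with the textbook average-memory-access-time formula: AMAT = hit time + miss rate × miss penalty. The C sketch below plugs in made-up but plausible numbers (a 40-cycle miss penalty to L2, hit time growing with cache size); none of them are measurements of any real chip:

                        /* Illustrative only: assumed hit times and miss rates show how
                         * a bigger L1's extra hit latency can outweigh its lower miss
                         * rate, so AMAT gets worse as the cache grows. */
                        #include <stdio.h>

                        int main(void) {
                            double miss_penalty = 40.0;  /* cycles to L2 (assumed) */

                            struct { int kb; double hit; double miss; } l1[] = {
                                { 32, 4.0, 0.040 },   /* all values assumed */
                                { 64, 5.0, 0.030 },
                                {128, 6.0, 0.024 },
                            };

                            for (int i = 0; i < 3; i++) {
                                double amat = l1[i].hit + l1[i].miss * miss_penalty;
                                printf("%3d KB L1: AMAT = %.2f cycles\n", l1[i].kb, amat);
                            }
                            /* Prints 5.60, 6.20, 6.96 cycles: every hit pays the extra
                             * latency, while misses were already rare. */
                            return 0;
                        }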






























                        -4














                        Actually, L1 cache size IS the biggest bottleneck for speed in modern computers. The pathetically tiny L1 cache sizes may be the sweet spot for the price, but not for performance. L1 cache can be accessed at GHz frequencies, the same as processor operations, unlike RAM access, which is 400x slower. It is expensive and difficult to implement in the current two-dimensional design; however, it is technically doable, and the first company that does it successfully will have computers hundreds of times faster that still run cool, something which would produce major innovations in many fields that are currently accessible only through expensive and difficult-to-program ASIC/FPGA configurations.

                        Some of these issues are to do with proprietary/IP issues and corporate greed spanning decades, where a puny and ineffectual cadre of engineers are the only ones with access to the inner workings, and who are mostly given marching orders to squeeze out cost-effective, obfuscated, protectionist nonsense. Overly privatized research always leads to such technological stagnation or throttling (as we have seen in aerospace and autos from the big manufacturers, and soon in pharma). Open source and more sensible patent and trade-secret regulation, benefiting the inventors and the public rather than company bosses and stockholders, would help a lot here. It should be a no-brainer to develop much larger L1 caches; this could and should have been done decades ago. We would be a lot further ahead in computers, and in the many scientific fields that use them, if we had.
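
                        Whatever one makes of the rhetoric, the latency claim itself is easy to probe. Below is a minimal pointer-chasing microbenchmark in C: Sattolo's algorithm builds a single random cycle through an array, so every load depends on the previous one and the prefetcher cannot hide the latency. The working-set sizes are illustrative assumptions; the absolute numbers depend entirely on the machine it runs on:

                        /* Measure average nanoseconds per dependent load for working
                         * sets that fit in L1, in L2/L3, and only in DRAM. */
                        #include <stdio.h>
                        #include <stdlib.h>
                        #include <time.h>

                        static double chase(size_t bytes, size_t steps) {
                            size_t n = bytes / sizeof(size_t);
                            size_t *next = malloc(n * sizeof(size_t));
                            if (!next) return -1.0;
                            for (size_t i = 0; i < n; i++) next[i] = i;
                            /* Sattolo's algorithm: one random cycle over all slots. */
                            for (size_t i = n - 1; i > 0; i--) {
                                size_t j = (size_t)rand() % i;
                                size_t t = next[i]; next[i] = next[j]; next[j] = t;
                            }
                            volatile size_t idx = 0;  /* defeat dead-code elimination */
                            clock_t t0 = clock();
                            for (size_t s = 0; s < steps; s++) idx = next[idx];
                            double ns = 1e9 * (double)(clock() - t0)
                                        / CLOCKS_PER_SEC / (double)steps;
                            free(next);
                            return ns;
                        }

                        int main(void) {
                            /* 16 KB fits in L1; 64 MB spills to DRAM (assumed sizes). */
                            size_t sizes[] = { 16u << 10, 256u << 10, 64u << 20 };
                            for (int i = 0; i < 3; i++)
                                printf("%8zu KB: %6.1f ns per dependent load\n",
                                       sizes[i] >> 10, chase(sizes[i], 10000000u));
                            return 0;
                        }

                        On typical desktop hardware the first line lands in the low single-digit nanoseconds and the last somewhere in the tens to around a hundred, so the gap is real and large, though usually nearer 20-100x than the 400x quoted above.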














                            edited Jun 16 '17 at 8:19

























                            answered Jun 16 '17 at 7:49









                            Zack Barkley

                            11



















