Why has the size of L1 cache not increased very much over the last 20 years?
The Intel i486 has 8 KB of L1 cache. The Intel Nehalem has 32 KB L1 instruction cache and 32 KB L1 data cache per core.
The amount of L1 cache hasn't increased at nearly the rate the clock rate has increased.
Why not?
cpu architecture cpu-cache progress
You are comparing apples to oranges. Clock rates have increased, but there is no correlation to the need for more cache. Just because you can do something faster doesn't mean you benefit from a bigger bucket.
– Keltari
May 26 '13 at 4:45
Excess cache and the management overhead can slow a system down. They've found the sweet spot and there it shall remain.
– Fiasco Labs
May 26 '13 at 4:54
6 Answers
30K of Wikipedia text isn't as helpful as an explanation of why too large a cache is suboptimal. When the cache gets too large, the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking the item up in main memory. I don't know what proportions CPU designers aim for, but I would think it is something analogous to the 80-20 guideline: you'd like to find your most common data in the cache 80% of the time, and the other 20% of the time you'll have to go to main memory for it (or whatever proportions the CPU designers actually intend).
EDIT: I'm sure it's nowhere near 80%/20%, so substitute X and 1-X. :)
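To put a rough number on the diminishing-returns intuition, here is a back-of-envelope sketch using the standard average-memory-access-time formula, AMAT = hit time + miss rate × miss penalty. All of the latencies and hit rates below are made-up illustrative assumptions, not the designers' actual figures:

```c
#include <stdio.h>

int main(void) {
    /* All numbers here are illustrative assumptions, not measured values. */
    const double hit_time = 4.0;        /* assumed L1 hit latency, cycles       */
    const double miss_penalty = 200.0;  /* assumed cost of going to RAM, cycles */
    const double hit_rates[] = { 0.80, 0.90, 0.95, 0.98, 0.99, 0.995 };

    for (size_t i = 0; i < sizeof hit_rates / sizeof hit_rates[0]; i++) {
        double h = hit_rates[i];
        double amat = hit_time + (1.0 - h) * miss_penalty;  /* average access time */
        printf("hit rate %5.1f%%  ->  AMAT %6.1f cycles\n", 100.0 * h, amat);
    }
    /* Going from 80% to 95% hits saves ~30 cycles per access; going from
     * 98% to 99% saves only ~2. Each extra KB of cache buys less and less. */
    return 0;
}
```

Whatever the real proportions are, the shape of the curve is the point: once the hit rate is already high, extra capacity barely moves the average access time, while it can hurt the hit latency paid on every access.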
"When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example doubling the amount of installed RAM will certainly not increase it's latency, why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :)
– sYnfo
Nov 18 '09 at 19:18
I had prepared a big, long description of caching in software, and measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In either case, I suspect the answer can be summed up by the law of diminishing returns. I.e. more is not always better.
– JMD
Nov 18 '09 at 19:48
From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not the size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere to overall performance.
– Brian Knoblauch
Nov 18 '09 at 20:09
@JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for it, true. @Brian So, if I understand correctly, they decided to put fewer transistors into the L1 cache and at the same time put many more into L2, which is significantly slower? Please take no offense, I'm just curious :)
– sYnfo
Nov 18 '09 at 20:30
One factor is that L1 fetches start before the TLB translation is complete, so as to decrease latency. With a small enough cache and high enough associativity, the index bits of the cache are the same in the virtual and physical addresses. This probably decreases the cost of maintaining memory coherency with a virtually-indexed, physically-tagged cache.
most interesting answer:)
– GameDeveloper
Jan 3 '14 at 16:52
I believe this is the reason, but let me give the numbers. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the bucket in which to look for a cache line's entry (64 bytes) before the page translation is complete. It would be expensive to decide between too many entries in a bucket, so each bucket has only 8 entries. As a result, for the last ten years, all the expensive x86 CPUs have had exactly 32768 bytes (512 cache lines) in their L1 data cache.
– b_jonas
Sep 3 '15 at 19:53
Since this is so hard to increase, CPUs add intermediate levels of cache, so we now have separate L2 and L3 caches. Also, the L1 code cache and L1 data cache are separate, because the CPU knows whether it's accessing code or data.
– b_jonas
Sep 3 '15 at 19:54
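Plugging in the numbers from the comments above: with 4 KiB pages, the virtual and physical addresses agree in their low 12 bits, so a cache whose line-offset plus index bits fit within those 12 bits can be indexed in parallel with the TLB lookup. A quick sketch of that arithmetic, using the common 64-byte-line / 8-way figures (the exact parameters vary by design):

```c
#include <stdio.h>

int main(void) {
    /* Assumed, typical parameters - actual designs vary. */
    const unsigned page_size = 4096;  /* x86 page size, bytes                 */
    const unsigned line_size = 64;    /* common cache-line size, bytes        */
    const unsigned ways      = 8;     /* assumed L1 associativity             */

    /* In a virtually-indexed, physically-tagged (VIPT) cache the set index
     * must come entirely from the page-offset bits, so each way can cover
     * at most one page. */
    unsigned sets        = page_size / line_size;  /* 64 sets                 */
    unsigned max_l1_size = ways * page_size;       /* 8 * 4096 = 32768 bytes  */

    printf("sets per way: %u\n", sets);
    printf("max VIPT L1 size: %u bytes (%u KiB)\n", max_l1_size, max_l1_size / 1024);
    /* Growing past 32 KiB means either more ways (slower, hotter tag compares),
     * larger pages, or dealing with virtual-address aliasing. */
    return 0;
}
```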
Cache size is influenced by many factors:

Speed of electric signals (if not the speed of light, then something of the same order of magnitude):
- 300 meters in one microsecond.
- 30 centimeters in one nanosecond.

Economic cost (circuits at different cache levels may be built differently, and certain cache sizes may simply not be worth it):
- Doubling the cache size does not double performance (even if physics allowed a cache that big to run at full speed). For small sizes, doubling gives much more than double the benefit; for big sizes, doubling gives almost no extra performance.
- On Wikipedia you can find charts showing, for example, how little is gained by making caches bigger than 1 MB (bigger caches do exist, but keep in mind that they serve multiple processor cores).
- For L1 caches there are presumably other charts (that vendors don't show) that make 64 KB a convenient size.

If the L1 cache size hasn't changed since 64 KB, it's because growing it was no longer worth it. Also note that there is now a stronger "culture" around caches, and many programmers write cache-friendly code and/or use prefetch instructions to reduce latency.

I once tried writing a simple program that accessed random locations in an array of several megabytes: that program almost froze the computer, because each random read pulled a whole block from RAM into the cache, and since this happened very often the program drained all the memory bandwidth, leaving very few resources for the OS.
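A minimal sketch of that kind of experiment (the original program isn't available, and the sizes below are assumptions): walking a multi-megabyte array at random offsets defeats both the caches and the hardware prefetcher, so nearly every read pays the full trip to RAM.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* A cache-hostile access pattern. Sizes are illustrative assumptions. */
#define N     (16u * 1024u * 1024u)   /* 16M ints = 64 MiB, far bigger than any cache */
#define READS (16u * 1024u * 1024u)

/* Tiny xorshift PRNG so the random index covers the whole array portably. */
static unsigned long long rng_state = 88172645463325252ULL;
static unsigned long long xorshift64(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;

    long long sum = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < READS; i++)
        sum += a[xorshift64() % N];   /* random stride: nearly every read misses */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("checksum %lld, %u random reads in %.2f s\n", sum, READS, secs);
    /* Change the index to 'i % N' (sequential) and the caches plus the
     * prefetcher make the same loop run many times faster. */
    free(a);
    return 0;
}
```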
I believe it can be summed up simply by stating that the bigger the cache, the slower the access will be. So a larger cache doesn't necessarily help, since the whole point of a cache is to avoid slow bus trips to RAM.
Since the speed of the processor has been increasing rapidly, the same-sized cache must perform faster and faster in order to keep up with it. So the caches may be significantly better (in terms of speed) but not in terms of storage.
(I'm a software guy so hopefully this isn't woefully wrong)
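As a hedged back-of-envelope illustration of that trade-off (all numbers are assumptions, not measurements, and it reuses the AMAT formula from the sketch under the first answer): suppose doubling L1 adds one cycle of hit latency but shaves a little off the miss rate; whether that pays off depends on the miss penalty.

```c
#include <stdio.h>

/* Average memory access time for a hypothetical L1. All numbers are assumed. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    const double miss_penalty = 200.0;               /* assumed cycles to memory        */
    double small = amat(4.0, 0.050, miss_penalty);   /* "32 KB": 4-cycle hit, 5.0% miss */
    double big   = amat(5.0, 0.045, miss_penalty);   /* "64 KB": 5-cycle hit, 4.5% miss */

    printf("smaller, faster L1: %.1f cycles per access\n", small);
    printf("bigger, slower L1:  %.1f cycles per access\n", big);
    /* 4 + 0.050*200 = 14.0 vs 5 + 0.045*200 = 14.0: a wash. The extra capacity
     * only pays off if it cuts misses by more than the added hit latency
     * costs on every single access. */
    return 0;
}
```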
From L1 cache:
The Level 1 cache, or primary cache, is on the CPU and is used for temporary storage of instructions and data organised in blocks of 32 bytes. Primary cache is the fastest form of storage. Because it's built in to the chip with a zero wait-state (delay) interface to the processor's execution unit, it is limited in size.

SRAM uses two transistors per bit and can hold data without external assistance, for as long as power is supplied to the circuit. This is contrasted to dynamic RAM (DRAM), which must be refreshed many times per second in order to hold its data contents.

Intel's P55 MMX processor, launched at the start of 1997, was noteworthy for the increase in size of its Level 1 cache to 32 KB. The AMD K6 and Cyrix M2 chips launched later that year upped the ante further by providing Level 1 caches of 64 KB. 64 KB has remained the standard L1 cache size, though various multiple-core processors may utilise it differently.

EDIT: Please note that this answer is from 2009 and CPUs have evolved enormously in the last 10 years. If you have arrived at this post, don't take all our answers here too seriously.
A typical SRAM cell is made up of six MOSFETs. Each bit in an SRAM is stored on four transistors (M1, M2, M3, M4) that form two cross-coupled inverters. Source Second Source
– lukecampbell
May 28 '13 at 16:44
This is just a description of the situation, and does not explain anything about the why.
– Eonil
Jan 23 at 3:45
@Eonil - We could not provide the "why" answer even if we wanted to. However, diminishing returns on performance is a reasonable explanation. When the question was written nearly a decade ago, it was much more expensive to increase the size without incurring a performance hit. This answer attempted to at least address the question as it was intended.
– Ramhound
Jan 23 at 10:07
Actually, L1 cache size IS the biggest bottleneck for speed in modern computers. The pathetically tiny L1 cache sizes may be the sweet spot for price, but not for performance. L1 cache can be accessed at GHz frequencies, the same as processor operations, unlike RAM access, which is around 400x slower. It is expensive and difficult to implement in the current two-dimensional designs; however, it is technically doable, and the first company that does it successfully will have computers hundreds of times faster that still run cool, something that would produce major innovations in many fields and that is currently only accessible through expensive and difficult-to-program ASIC/FPGA configurations.

Some of these issues are to do with proprietary/IP concerns and corporate greed spanning decades, where a small and ineffectual cadre of engineers are the only ones with access to the inner workings, and who are mostly given marching orders to squeeze out cost-effective, obfuscated, protectionist nonsense. Overly privatized research always leads to this kind of technological stagnation or throttling (as we have seen in aerospace and autos from the big manufacturers, and soon in pharma). Open source and more sensible patent and trade-secret regulation benefiting the inventors and the public (rather than company bosses and stockholders) would help a lot here. It should be a no-brainer to develop much larger L1 caches; this could and should have been done decades ago. We would be a lot further ahead in computers, and in the many scientific fields that use them, if it had been.