Mark van Gulik wrote:
> Someone in this thread (I think it was Ian Piumarta) mentioned that basically there is no such thing as "locality of reference" between cache lines. This is due to the fact that this kind of cache uses associative lookup (typically with some kind of "wired-or", if MY associative memory serves me).
Cache associativity has some effect, here, but it's rare to see highly- or fully-associative caches (the ones I can think of are TLBs on MIPS processors, and the L1 caches on some ARM designs). Most caches these days have a small amount of associativity, from 1-way "direct mapped" to 4-way or 8-way. The future trend is towards on-chip cache hierarchies with small, direct-mapped caches at the top and large, 4 to 8-way caches at the bottom.
Any cache with low associativity has pathological access patterns that cause conflict misses, which can greatly reduce performance even when the working set being accessed is smaller than the total cache size. An interesting idea would be to collect "hot spot" access statistics at run time, which the garbage collector could use to dynamically remap heavily-used areas to different cache sets and avoid thrashing.
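To make the set-conflict mechanism concrete, here is a small Python sketch (the 32 KB size, 64-byte lines, and direct mapping are illustrative assumptions, not any particular processor):

```python
# Sketch: how an address maps to a cache set. With 64-byte lines, a
# 32 KB direct-mapped cache has 512 sets, so any two addresses exactly
# 32 KB apart land in the same set and evict each other.

LINE_SIZE = 64            # bytes per cache line (assumption)
CACHE_SIZE = 32 * 1024    # total cache size (assumption)
WAYS = 1                  # direct mapped
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)

def cache_set(addr):
    """Set index = line number modulo the number of sets."""
    return (addr // LINE_SIZE) % NUM_SETS

a = 0x10000
b = a + CACHE_SIZE
print(cache_set(a) == cache_set(b))  # -> True: guaranteed conflict
```

A working set touching many such colliding addresses thrashes even though it is far smaller than the cache.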
> He then stated that until a memory boundary was reached, the cache miss cost stayed pretty low, then suddenly spiked way up.
That's the transition from conflict misses, where the working set is small enough to fit in the cache, to capacity misses, where the short-term working set exceeds the cache's total capacity.
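The conflict regime is easy to reproduce with a toy cache model. The following Python sketch (illustrative parameters, LRU replacement assumed) shows a working set of just two lines thrashing a direct-mapped cache purely through conflicts:

```python
# Toy LRU set-associative cache to illustrate conflict misses: a tiny
# working set that collides in one set misses on every access, even
# though the cache as a whole is nearly empty. Parameters are
# illustrative, not from any real processor.
from collections import OrderedDict

class Cache:
    def __init__(self, num_sets, ways, line=64):
        self.num_sets, self.ways, self.line = num_sets, ways, line
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, addr):
        tag = addr // self.line
        s = self.sets[tag % self.num_sets]
        if tag in s:
            s.move_to_end(tag)         # LRU hit
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)  # evict least recently used
            s[tag] = True

# 4 KB direct-mapped cache: 64 sets of one 64-byte line each.
c = Cache(num_sets=64, ways=1)
# Two addresses 4 KB apart map to the same set; alternate between them.
for _ in range(100):
    c.access(0x0)
    c.access(0x1000)
print(c.misses)  # -> 200: every access misses, with 63 sets unused
```

Raising `ways` to 2 makes both lines coexist and the misses drop to the two initial fills; that is the sense in which these are conflict misses rather than capacity misses.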
> If a CPU running in User Mode is allowed to test the current permissions of a page (without triggering a page fault), a garbage collector could simply do its best to mark what's currently in memory while building up a list of what to fetch later from disk.
> If the rate of increase of processor speeds continues to be significantly higher than the rate of increase of memory speed, this idea might soon become applicable at the cache/main-memory boundary too.
It already is: consider the 2GHz processors that will be available in the not-too-distant future. An L1 data cache access will take 1 to 2 cycles (0.5ns to 1ns), an on-chip L2 access will take around 20 cycles (10ns), but main memory access to a non-open DRAM page will still be at least 60ns.
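Converting those figures into cycles at 2GHz makes the gap vivid; a quick back-of-envelope in Python:

```python
# Back-of-envelope: latencies from the post expressed in 2 GHz cycles.
# Cycle time at 2 GHz = 1 / 2e9 s = 0.5 ns.
CYCLE_NS = 0.5
latencies_ns = {"L1": 1.0, "on-chip L2": 10.0, "DRAM (page miss)": 60.0}
for name, ns in latencies_ns.items():
    print(f"{name}: {ns:g} ns = {ns / CYCLE_NS:.0f} cycles")
# DRAM (page miss): 60 ns = 120 cycles -- two orders of magnitude
# slower than an L1 hit.
```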
Future processor designs attempt to address this in a way similar to what you propose, by incorporating the DRAM controller directly on chip and coordinating it with the cache, reordering the cache <-> memory traffic.
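For what it's worth, the residency-probing collector idea from earlier in the thread might look roughly like the following Python sketch. The `is_resident` primitive is hypothetical (there is no standard non-faulting user-mode test; mincore() on Unix is an approximation), and the object graph is abstracted into a `children` function:

```python
# Sketch of residency-aware marking: trace everything already in memory
# first, deferring references into non-resident pages so they can be
# faulted in later as a batch instead of scattered through the trace.
# `is_resident` is a hypothetical non-faulting page-presence test.

def mark(roots, is_resident, children):
    marked = set()
    work = list(roots)
    while work:
        deferred = []
        while work:
            obj = work.pop()
            if obj in marked:
                continue
            if is_resident(obj):
                marked.add(obj)
                work.extend(children(obj))
            else:
                deferred.append(obj)   # fetch this page later
        # Resident subgraph exhausted: now fault in the deferred pages
        # (ideally as batched, scheduled reads) and keep tracing.
        for obj in deferred:
            if obj not in marked:
                marked.add(obj)
                work.extend(children(obj))
    return marked
```

The payoff is that the page faults are clustered at phase boundaries, where the OS or a smart memory controller could overlap and reorder them, rather than being interleaved one at a time with the marking work.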
-- Tim Olson
squeak-dev@lists.squeakfoundation.org