The persistent cache approach.
Alexander Artiushin (alexnikart@m...)
Jul 31, 2002
Ideally, memory should be able to supply a processor by data in
such a way of avoiding any extra data delays. Unfortunately, there
are unknown an appropriate compilation methods to provide a proper
data transport. In modern computer systems the decreasing of
data retrieving time is achieved via a hierarchical memory
organization where any following memory (cache) level larger
and slower then the previous one, at that only the last level is
addressable. First level cache size is commonly 8Kwords (1.5Mb at
HP8500), second and third (if exists) levels are much greater.
When an instruction gets as operand a virtual memory address, it
is converted into a physical one. Using this physical address as a
key (tag), data are requested from memory and if these data exist
in cache, we get some speed-up. Let's think, why cache works with
physical addresses? The problem is in probable intersection of
address spaces allocated by OS for different processes. It is
possible to enlarge virtual address (tag) with the process ID and
convert it into the physical one only if cache lookup is
unsuccessful but in this case we have significant cache complexity
increasing against the small saving of address conversion costs.
Moreover, address conversion is usually executed in processor
whereas the upper cache levels are commonly separated from
the processor. The shortcomings of existing caching approach
are obvious and mostly are concluded in shared usage of cache
memory by all running processes. Using any strategy of caching, at
a control returning after a series of context (processes) switches
the process should consider its own cached data context totally
lost (so-called cold start problem). When the regular number of
simultaneously running processes is evaluated in hundreds, to save
(even partially) some personal data of process we need caches of
huge sizes.
Virtual pages swapping causes still one problem. When during a
virtual address conversion we got a lack of corresponding virtual
page, it will be read but some other page will be lost. Obviously,
we should destroy all the cache lines corresponding to this
condemned page.
The advantages of existing caching technique are sometimes given
to be moot points. The average retrieval time is better especially
in case of consecutive and localized work. In any other case a
program should not expect any significant help from a caching
system. At that such an "average" behavior plays a mean trick with
programmers when it makes no sense to take care of algorithm's
"beauty" and to optimize programs by hands - cache levels
off all code and frequently "bad" code turned out faster then
"good". An attempts to use a cache particularities in optimization
yield in the lack of portability even between different versions
of the same processor. There are attempts to make cache more
predictable e.g. speculative loading technique. The compiler
places the instruction of loading data into cache before their
necessity (SPARC V9, IBM POWER3 and HP PA-8xxx). The ability to
explicitly invalidate an unnecessary cache line also could be an
useful feature. Some systems (e.g. TMS320C6xxx) allow to configure
the cache as directly addressed super-fast memory but in this
case there exist difficulties with its sharing between tasks.
Additional words should be devoted to the cache coherency problem.
Such as the coherency is supported by inter-processor messages and
the number of the messages is significantly nonlinear against the
number of processors, there exist an objective upper limit for the
number of processors in the system. Let's note what the majority
of the mentioned messages are useless such as certainly will not
be ever used. This is a payment for a caching in physical
addresses, which causes extra dependencies between processor
units. And the method to avoid these extra dependencies will let
to significantly increase above-mentioned upper limit of the
number of processors in a system.
Let's advert to the essence of supposed approach. The basic ideas
are:
- the separation of virtual addresses into global and
local which can be done on basis of the value of this address
- caching is performed in virtual address space
- each task works with own cache context and
its cached local data can't be affected by other tasks
- the buffer of cache memory is spliced into two equal parts (pages)
- at a task switch, cache pages are switched,
at this a new task immediately runs with one page and
the other page is used firstly for the saving of its content
into main memory and, secondly, for a restoring of data
for a next task from main memory
- the global data tag is extended with the identifier
given by OS, e.g. the process number
Let the discussing computing system consists of at least one
processor module. Let this system is intended to run in parallel a
number of processes each of which can contain more then one
parallel thread. We will call the data, which can be used by the
only thread as local, and the data, which are shared between
different threads of one process as global. The data sharing
between processes is allowable but requires a special treatment
e.g. they are uncacheable or unswapable or accessed only via
system calls.
The program loader is implemented to place a process global data
into segments with easily discriminate virtual addresses e.g. the
result of the binary '&' should give true at testing this address
against some predefined or software re-definable bit mask. The OS
kernel (later simply kernel) is responsible for the threads
scheduling and switching.
Each processor module contains local caching mechanism (later
cache) of the same size and organization. The cache contains
following logical blocks:
- two equal data pages one of which we call active and the second one - shadowed
- pages access block
- global data management block
- global data page
- at least one DMA channel
Depending on the implementation, data and command caches can be
placed either on the same cache page with sharing of all caching
mechanism or on independent pages with duplication of some
(or all) mentioned mechanism.
The pages access block realizes one of known caching strategies
and algorithms e.g. it can be fully associative, write through and
LRU extrusion.
The global data management block is used for the maintenance of
data coherency in a processor module and also in all system
scale.
The global data page is shared between all threads, this page
doesn't have to be of the same size as a local data page and is
used directly by the global data management block as an
associative memory pool.
The DMA channels are meant for copying data from cache shadowed
local page to basic memory and back. Every of existing DMA
channels works independently with its own (at least in this
moment) page region.
The active page is dedicated for usage by active on this processor
module task, during that the content of this page is the caching
context of the current task. The kernel contains the mechanism for
downloading of the shadowed page content into a main memory and
for uploading back. In the moment, when the kernel changes the
current task, the cache pages switching is performed, so the
active page becomes shadowed and vice versa. Which page is
currently active and which is shadowed can be determined by the
value of some control register and this value is changed either
automatically or explicitly by kernel.
In this moment the shadowed page should contain caching context of
the next task fully uploaded by the kernel. Just after the task
switch, the kernel begins to download the content of new shadowed
page. After that and when kernel defines the next task to switch,
the caching context of this task should be uploaded into the
shadowed page. The uploading and downloading processes can be
performed in the same time with the time shift to exclude their
collision.
When the cache line is set in the global page, the line tag is
expanded by some identifier, which is got from the OS (which gets
it from e.g. dedicated register). An example of such identifier
can be the process ID, in any way this extended tag should be
unique in the scale of all system. When the processor requests a
global data, which are not registered in the global page, one
should process the retrieval from the main memory and registration
in the global page and put data to the local (active) page
In case of global page repletion different implementations are
possible. E.g. we can reject to perform caching for this value or
substitute some any value from an active and global page in any
way all global data in both local page should be coherent with
those in the global page. In the process of data uploading to the
shadowed page, the global data management block passes through all
local data (which, as we remember, can be easily separated from
the global ones by virtual address) and brings global data to
conformity with the content of the global page. In case of a
multiprocessor system, the global data management block should
realize some mechanism of distributed transactions to maintain the
all-system data coherence. Is there a coincidence of global pages
contents or not, depends on the supported protocol of distributed
transactions.
The caching context of each task is stored in some memory,
allocated by the kernel for this task to describe its behavior.
Touching some control register can stop the caching activity. In
the same way can be stopped the caching of global data. In case of
an apparatus interrupt, e.g. in case of a virtual page miss, the
caching activity is stopped up to the end of an interrupt handler
work.
The active page additionally has a direct addressing mode. In such
a way every task can (if necessary) use the active page not as a
caching buffer but as an own area of super-fast memory.
What are the advantages of supposed cache organization? First of
all, a super-fast memory can have own address space and own
support instructions or it can be mapped into a virtual address
space. In any way the existence of such memory by no means is
considered in high level languages. The objective method of their
utilization is a temporary variables (appeared during a
compilation) storing. Alas, there is no necessity to store
thousands of temporary variables so we need some new ideas. There
are at least three variants.
- A stack is arranged in this memory.
Stack contains local variables (and temporary too) which are
owned only by the current task. Traditionally, a thread can pass to
other thread of the same process an address of variable from
its own stack, but this technique is rather an evil deed and
creates more problems then solves. Also this data are grouped,
the caching of the top of a stack is always useful and an actual
stack top is usually not too large. When the stack overgrows an allocated for it memory,
the exception occurs, OS allocates an additional buffer for data
saving and uses the super-fast memory as an window of accelerated
access to a stack. If there is an request out of window borders,
the conversion of a virtual address to physical gives an
appropriate physical address, pointing into
one of previously allocated saving buffers.
- This super-fast memory can be manually managed via OS calls.
- This method assumes some changes in high level languages
and/or creation of new compilation methods supporting mentioned
architecture details. This is especially important for the
architectures with explicit parallelism (ILP) where the caching
in general can be harmful due an inherent unpredictability.
Second advantage is in explicit division of data into local and
global. This allows a directly managing of an intended for
coherency support inter-processor messages. The entire of spectrum
is possible. If we prohibit the global data caching, the
synchronization occurs on the level of main memory access.
This is not so bad if global data are rear. In case of absence of
global data the coherency support is not necessary. In
situation where all data are global, we have the same as when we
use traditional cache with physical tags. It is important for a
programmer to clear understand which data are used in this
moment and what is the payment for it. The choice here can be
realized by using of two system calls for memory allocations - for
local and process-global data.
Third advantage. A saving task context can contain some additional
data such as branch history table (BHT) or physical addresses
cache (TLB). Some systems keep BHT together with an instructions
cache, in any way if this table becomes task-personal, this allow
to increase jump predictions for speculative calculations without
significant costs (e.g. in MIPS R10000 BHT contains just 512
items). A common TLB size is in hundreds of items. There are no
problems to arrange it inside savable cache page. The obvious
difficulty is in necessity to care about swapped out pages during
the context uploading.
This letter is an essence of the patent application:
"B. Muratshin, A. Artiushin The persistent cache approach for
multitasking and SMP systems. Novosibirsk, Jul 2002, patent
application RUS N2002121880/20(022788)". But the OpenCores members can use it freely under the OpenCores license.