|
Message
From: cvs at opencores.org<cvs@o...>
Date: Sat Aug 25 19:59:22 CEST 2007
Subject: [cvs-checkins] MODIFIED: jop ...
Date: 00/07/08 25:19:59 Added: jop/doc/book/cache cache.tex Log: Handbook update Revision Changes Path 1.1 jop/doc/book/cache/cache.tex http://www.opencores.org/cvsweb.shtml/jop/doc/book/cache/cache.tex?rev=1.1&content-type=text/x-cvsweb-markup Index: cache.tex =================================================================== \label{sec:cache} Worst-case execution time (WCET) analysis \cite{84850} of real-time programs is essential for any schedulability analysis. To provide a low WCET value, a good processor model is necessary. However, the architectural advancement in modern processor designs is dominated by the rule: '\emph{Make the common case fast}`. This is the opposite of '\emph{Reduce the worst case}` and complicates WCET analysis. Cache memory for the instructions and data is a classic example of this paradigm. Avoiding or ignoring this feature in real-time systems, due to its unpredictable behavior, results in a very pessimistic WCET value. Plenty of effort has gone into research into integrating the instruction cache in the timing analysis of tasks \cite{Arnold1994, Healy1995, 225068} and the influence of the cache on task preemption \cite{279589, RTAS96*204}. The influence of different cache architectures on WCET analysis is described in \cite{Heckmann:IEEE2003}. We will tackle this problem from the architectural side -- an instruction cache organization in which simpler and more accurate WCET analysis is more important than average case performance. In this section, we will propose a method cache with a novel replacement policy. In Java bytecode only relative branches exist, and a method is therefore only left when a return instruction has been executed\footnote{An uncaught exception also results in a method exit.}. It has been observed that methods are typically short (see Section~\ref{sec:bench:jvm:methods}) in Java applications. These properties are utilized by a cache architecture that stores complete methods. A complete method is loaded into the cache on both invocation and return. This cache fill strategy lumps all cache misses together and is very simple to analyze. \subsection{Cache Performance} In real-time systems we prefer time-predictable architectures over those with a high average performance. However, performance is still important. In this section, we will give a short overview of the formulas from \cite{Hennessy02} that are used to calculate the cache's influence on execution time. We will extend the single measurement \emph{miss rate} to a two value set, memory read and transaction rate, that is architecture independent and better reflects the two properties (bandwidth and latency) of the main memory. To evaluate cache performance, \(MEM_{clk}\) memory stall cycles are added to the CPU execution time (\(t_{exe}\)) equation: % \begin{align} \nonumber t_{exe} & = (CPU_{clk}+MEM_{clk}) \times t_{clk} \\ \nonumber MEM_{clk}& = Misses \times MP_{clk} \end{align} % The miss penalty \(MP_{clk}\) is the cost per miss, measured in clock cycles. When the instruction count $IC$ is given as the number of instructions executed, $CPI$ the average clock cycles per instruction and the number of misses per instruction, we obtain the following result: % \begin{align} \nonumber CPU_{clk} & = IC \times CPI_{exe} \\ \nonumber MEM_{clk} & = IC \times \frac{Misses}{Instruction} \times MP_{clk} \\ \nonumber t_{exe} & = IC \times (CPI_{exe}+\frac{Misses}{Instruction} \times MP_{clk}) \times t_{clk} \end{align} % As this section is only concerned with the instruction cache, we will split the memory stall cycles into misses caused by the instruction fetch and misses caused by data access. % \[CPI = CPI_{exe} + CPI_{IM} + CPI_{DM}\] % $CPI_{exe}$ is the average number of clock cycles per instruction, given an ideal memory system without any stalls. $CPI_{IM}$ are the additional clock cycles caused by instruction cache misses and $CPI_{DM}$ the data miss portion of the CPI. This split between instruction and data portions of the CPI better reflects the split of the cache between instruction and data cache found in actual processors. The misses per instruction are often given as misses per 1000 instructions. However, there are several drawbacks to using a single number: % \begin{description}
\item[Architecture dependent:] The average number of memory accesses per
instruction differs greatly between a RISC processor and the Java
Virtual Machine (JVM). A typical RISC processor needs one memory
word (4 bytes) per instruction word, and about 40\% of the
instructions \cite{Hennessy02} are \emph{load} or \emph{store}
instructions. Using the example of a 32-bit RISC processor, this
results in 5.6 bytes memory access per instruction. The average
length of a JVM bytecode instruction is 1.7 bytes and about 18\% of
the instructions access the memory for data load and store.
\item[Block size dependent:] Misses per instruction depends subtly
on the block size. On a single cache miss, a whole block of the
cache is filled. Therefore, the probability that a future
instruction request is a hit is higher with a larger block size.
However, a larger block size results in a higher miss penalty as
more memory is transferred.
\end{description}
%
Main memory is usually composed of DRAMs. Access time to this memory
is measured in terms of latency (the time taken to access the first
word of a larger block) and bandwidth (the number of bytes read or
written in a single request per time unit). These two values, along
with the block size of a cache, are used to calculate the miss
penalty:
%
\[MP_{clk} = Latency + \frac{Block\;size}{Bandwidth}\]
%
To better evaluate different cache organizations and different
instruction sets (RISC versus JVM), we will introduce two
performance measurements -- memory bytes read per instruction byte
and memory transactions per instruction byte:
%
\begin{align}
\nonumber
MBIB & = \frac{Memory\;bytes\;read}{Instruction\;bytes}\\
\nonumber
MTIB & = \frac{Memory\;transactions}{Instruction\;bytes}
\end{align}
%
These two measures are closely related to memory bandwidth and
latency. With these two values and the properties of the main
memory, we can calculate the average memory cycles per instruction
byte \(MCIB\) and \(CPI_{IM}\), i.e.\ the values we are concerned in
this section.
%
\begin{align}
\nonumber
MCIB & = (\frac{MBIB}{Bandwith}+MTIB \times Latency)\\
\nonumber
CPI_{IM} & = MCIB \times {Instruction\;length}
\end{align}
%
The misses per instruction can be converted to MBIB and MTIB when
the following parameters are known: the average instruction length
of the architecture, the block size of the cache and the miss
penalty in latency and bandwidth. We will examine this further in
the following example:
We use the following architecture to illustrate the conversion: a
RISC architecture with a 4 bytes instruction length, an 8KB
instruction cache with 64-byte blocks and a miss rate of 8.16 per
1000 instructions \cite{Hennessy02}. The miss penalty is 100 clock
cycles. The memory system is assumed to deliver one word (4 bytes)
per cycle.
Firstly, we need to calculate the latency of the memory system.
\begin{equation}
\nonumber
\begin{split}
Latency & = MP_{clk} - \frac{Block size}{Bandwidth} \\
& = 100 - \frac{64}{4} = 84 \mbox{ clock cycles}
\end{split}
\end{equation}
%
With \(Miss\;rate=\frac{Cache\;miss}{Cache\;access}\), we obtain
MBIB.
%
\begin{equation}
\nonumber
\begin{split}
MBIB & = \frac{Memory\;bytes\;read}{Instruction\;bytes} \\
& = \frac{Cache\;miss \times Block\;size}
{Cache\;access \times Instruction\;length} \\
& = Miss\;rate \times \frac{Block\;size}{Instruction\;length} \\
& = 8.16 \times 10^{-3} \times \frac{65}{4}\\
& = 0.131
\end{split}
\end{equation}
%
MTIB is calculated in a similar way:
%
\begin{equation}
\nonumber
\begin{split}
MTIB & = \frac{Memory\;transactions}{Instruction\;bytes}\\
& = \frac{Cache\;miss}
{Cache\;access \times Instruction\;length} \\
& = \frac{Miss\;rate}{Instruction\;length}\\
& = \frac{8.16 \times 10^{-3}}{4}\\
& = 2.04 \times 10^{-3}
\end{split}
\end{equation}
%
For a quick check, we can calculate \(CPI_{IM}\):
%
\begin{equation}
\nonumber
\begin{split}
MCIB & = \frac{MBIB}{Bandwith}+MTIB \times Latency\\
& = \frac{0.131}{4}+2.04 \times 10^{-3} \times 84\\
& = 0.204\\
CPI_{IM} & = MCIB \times {Instruction\;length} \\
& = 0.204 \times 4\\
& = 0.816
\end{split}
\end{equation}
%
This is the same value as that which we get from using the miss rate
with the miss penalty:
%
\begin{equation}
\nonumber
\begin{split}
CPI_{IM} & = Miss\;rate \times Miss\;penalty\\
& = 8.16 \times 10^{-3} \times 100\\
& = 0.816
\end{split}
\end{equation}
%
However, MBIB and MTIB are architecture independent and better
reflect the latency and bandwidth of the main memory.
\subsection{Proposed Cache Solution}
In this section, we will develop a solution for a predictable cache.
Typical Java programs consist of short methods. There are no
branches out of the method and all branches inside are relative. In
the proposed architecture, the full code of a method is loaded into
the cache before execution. The cache is filled on invocations and
returns. This means that all cache fills are lumped together with a
known execution time. The full loaded method and relative addressing
inside a method also result in a simpler cache. Tag memory and
address translation are not necessary.
However, we will first discuss an even simpler solution -- no
caching at all. Without an instruction cache, prefetching is
mandatory, especially with a variable length instruction set. The
issues surrounding prefetching are discussed in the next section.
\subsubsection{Instruction Prefetching}
A simple prefetch queue, as found in older processors, can increase
instruction throughput after a multi-cycle bytecode is executed.
However, for a stream of single-cycle bytecodes, prefetching is
useless and the frequent occurrence of branches, method invocations,
and method returns (see Section~\ref{sec:bench:jvm}) reduces the
performance gain. Using a prefetch queue also results in execution
time dependencies over a stream of instructions, which complicates
timing analysis.
For a variable length instruction set, prefetching is also not a
straightforward option. The prefetching unit needs to guarantee the
availability of a complete instruction for the fetch unit. As the
actual length of the instruction is not known at this stage, the
prefetch unit must be a minimum of $maximum\;length - 1$ bytes ahead
of the requested instruction. This can lead to unnecessary memory
transfers. The return instruction is a typical example of this. It
is 1 byte long and the additional prefetched instruction bytes are
never used.
A memory interface with a bus width greater than one byte adds an
artificial boundary to the instruction stream. For the purpose of
this example, we are assuming a 4 byte memory interface. In this
case we need an 8 byte prefetch buffer. On a branch to an address
$address\;mod\;4
> 4-maximum\;instruction\;length$, two words need to be loaded from
main memory before the processor can continue.
A memory technology, such as synchronous DRAM, has a large latency
for the first accessed word and then a high bandwidth for the
following words. Prefetching that only loads small quantities (one
or two words) from the memory is therefore impracticable with these
memory technologies.
\subsubsection{Single Method Cache}
A single method cache, although less efficient than a conventional
instruction cache, can be incorporated very easily into the WCET
analysis. The time needed for the memory transfer must be added to
the invoke and return instructions.
% Wiederholung von Einlietung der proposed cache solution
% könnte nach vorne gelegt werden,
% dann aber ein, zwei Sätze über den single method cache
The method cache also simplifies the hardware of the cache, as it
means that no tag memory or address translation is necessary. Other
parts of the processor are also smaller. The program counter, the
associated adders and multiplexer are shorter than in a standard
cache solution. For example, for a 1KB cache, the size of these
units is only 10 bits, instead of 32 bits.
The main disadvantage of this single method cache is the high
overhead when a complete method is loaded into the cache and only a
small fraction of the code is executed. This issue is similar to
that encountered with unused data in a cache line. However, in
extreme cases, this overhead can be very high. The second problem
can be seen in following example:
%
\begin{samepage}
\begin{lstlisting}
foo() {
a();
b();
}
\end{lstlisting}
\end{samepage}
%
This code sequence results in the following cache loads:
%
\begin{enumerate}
\item method \code{foo} is loaded on invocation of \code{foo()}
\item method \code{a} is loaded on invocation of \code{a()}
\item method \code{foo} is loaded on return from \code{a()}
\item method \code{b} is loaded on invocation of \code{b()}
\item method \code{foo} is loaded on return from \code{b()}
\end{enumerate}
%
The main drawback of the single method cache is the multiple cache
fill of \code{foo()} on return from methods \code{a()} and
\code{b()}. In a conventional cache design, if these three methods
fit in the cache memory at the same time and there is no placement
conflict, each method is only loaded once. This issue can be
overcome by caching more than one method. The simplest solution is a
two-block cache.
\subsubsection{Two-Block Cache}
The two-block cache can hold up to two methods in the cache. This
results in having to decide which block is replaced on a cache miss.
With only two blocks, Least-Recently Used (LRU) is trivial to
implement. The code sequence now results in the cache loads and hits
as shown in Table~\ref{tab_cache_two_block_example}.
%
\begin{table}[htbp]
\centering
\begin{tabular}{llll}
\toprule
Instruction & Block 1 & Block 2 & Cache \\
\midrule
foo() & foo & -- & load \\
a() & foo & a & load \\
return & foo & a & hit \\
b() & foo & b & load \\
return & foo & b & hit \\
\bottomrule
\end{tabular}
\caption{Cache load and hit example with the two-block cache}
\label{tab_cache_two_block_example}
\end{table}
%
With the two-block cache, we have to double the cache memory or use
both blocks for a single large method. The WCET analysis is slightly
more complex than with a single block. A short history of the
invocation sequence has to be used to find the cache fills and hits.
However, a cache that can only hold two methods is still very
restrictive. The next code sequence shows the conflict.
\tablename~\ref{tab_cache_two_block_conflict} shows the resulting
cache loads.
%
\begin{samepage}
\begin{lstlisting}
foo() {
a();
}
a() {
b();
}
\end{lstlisting}
%
\begin{table}[htbp]
\centering
\begin{tabular}{llll}
\toprule
Instruction & Block 1 & Block 2 & Cache \\
\midrule
foo() & foo & -- & load \\
a() & foo & a & load \\
b() & b & a & load \\
return & b & a & hit \\
return & foo & a & load \\
\bottomrule
\end{tabular}
\caption{Cache conflict example with the two-block cache}
\label{tab_cache_two_block_conflict}
\end{table}
\end{samepage}
%
A memory (similar to the tag memory) with one word per block is used
to store a reference to the cached method. However, this memory can
be slower than the tag memory as it is only accessed on invocation
or return, rather than on every cache access.
\subsubsection{More Blocks}
We can improve the hit rate by adding more blocks to the cache. If
only one block per method is used, the cache size increases with the
number of blocks. With more than two blocks, LRU replacement policy
means that another word is needed for every block containing a use
counter that is updated on every invoke and return. During
replacement, this list is searched for the LRU block. Hit detection
involves a search through the list of the method references of the
blocks. If this search is done in microcode, it imposes a limit on
the maximum number of blocks.
\subsubsection{Variable Block Cache}
\label{sec:cache:var}
Several cache blocks, all of the size as the largest method, are a
waste of cache memory. Using smaller block sizes and allowing a
method to span over several blocks, the blocks become very similar
to cache lines. The main difference from a conventional cache is
that the blocks for a method are all loaded at once and need to be
consecutive.
Choosing the block size is now a major design decision. Smaller
block sizes allow better memory usage, but the search time for a hit
also increases.
With varying block numbers per method, an LRU replacement becomes
impractical. When the method found to be LRU is smaller than the
loaded method, this new method invalidates two cached methods.
For the replacement, we will use a pointer $next$ that indicates the
start of the blocks to be replaced on a cache miss. Two practical
replace policies are:
%
\begin{description}
\item [Next block:]At the very first beginning, $next$ points to the
first block. When a method of length $l$ is loaded into the block
$n$, $next$ is updated to $(n+l)\;mod\;block\;count$.
\item [Stack oriented:]$next$ is updated in the same way as before
on a method load. It is also updated on a method return --
independent of a resulting hit or miss -- to point to the first
block of the leaving method.
\end{description}
%
We will show the operation of these different replacement policies
in an example with three methods: a(), b() and c() of block sizes 2,
2 and 1. The cache consists of 4 blocks and is therefore too small
to hold all the methods during the execution of the code fragment
shown in Listing~\ref{lst:cache:replace}.
%
\begin{samepage}
\begin{lstlisting}[float,caption={Code fragment for the replacement example},
label=lst:cache:replace]
a() {
for (;;) {
b();
c();
}
...
}
\end{lstlisting}
\end{samepage}
%
Tables \ref{tab_cache_replace_next} and
\ref{tab_cache_replace_stack} show the cache content during program
execution for both replacement policies. The content of the cache
blocks is shown after the execution of the invoke or return
instruction. An uppercase letter indicates that this block has been
newly loaded. A right arrow depicts the block to be replaced on a
cache miss (the \emph{next} pointer). The last row shows the number
of blocks that are filled during the execution of the program.
% too long for book format
%
%\begin{table*}
% \centering
%\begin{tt}
% \begin{tabular}{lrrrrrrrrrrrrr}
% \toprule
%
% &a() &b() &ret &c() &ret &b() &ret &c() &ret &b() &ret &c() &ret \\
% \midrule
% \rm{Block 1}&A &$\to$a &$\to$a &C &c &B &b &b &$\to$- &B &b &b &A \\
% \rm{Block 2}&A &a &a &$\to$- &A &$\to$a &$\to$a &C &c &B &b &b &$\to$- \\
% \rm{Block 3}&$\to$- &B &b &b &A &a &a &$\to$- &A &$\to$a &$\to$a &C &c \\
% \rm{Block 4}&- &B &b &b &$\to$- &B &b &b &A &a &a &$\to$- &A \\
% \midrule
% \rm{Fill} &2 &4 & &5 &7 &9 & &11 &13 &15 & &16 &18 \\
% \bottomrule
%
% \end{tabular}
%\end{tt}
% \caption{Next block replacement policy}
% \label{tab_cache_replace_next}
%\end{table*}
%
%\begin{table*}
% \centering
%\begin{tt}
% \begin{tabular}{lrrrrrrrrrrrrr}
% \toprule
%
% &a() &b() &ret &c() &ret &b() &ret &c() &ret &b() &ret &c() &ret \\
% \midrule
% \rm{Block 1}&A &$\to$a &a &a &a &$\to$a &a &a &a &$\to$a &a &a &a \\
% \rm{Block 2}&A &a &a &a &a &a &a &a &a &a &a &a &a \\
% \rm{Block 3}&$\to$- &B &$\to$b &C &$\to$c &B &$\to$b &C &$\to$c &B &$\to$b &C &$\to$c \\
% \rm{Block 4}&- &B &b &$\to$- &- &B &b &$\to$- &- &B &b &$\to$- &- \\
% \midrule
% \rm{Fill} &2 &4 & &5 & &7 & &8 & &10 & &11 & \\
% \bottomrule
%
% \end{tabular}
%\end{tt}
% \caption{Stack oriented replacement policy}
% \label{tab_cache_replace_stack}
%\end{table*}
\begin{table}
\centering
\begin{tt}
\begin{tabular}{lrrrrrrrrrrr}
\toprule
&a() &b() &ret &c() &ret &b() &ret &c() &ret &b() &ret \\
\midrule
\rm{Block 1}&A &$\to$a &$\to$a &C &c &B &b &b &$\to$- &B &b \\
\rm{Block 2}&A &a &a &$\to$- &A &$\to$a &$\to$a &C &c &B &b \\
\rm{Block 3}&$\to$- &B &b &b &A &a &a &$\to$- &A &$\to$a &$\to$a \\
\rm{Block 4}&- &B &b &b &$\to$- &B &b &b &A &a &a \\
\midrule
\rm{Fill} &2 &4 & &5 &7 &9 & &11 &13 &15 & \\
\bottomrule
\end{tabular}
\end{tt}
\caption{Next block replacement policy}
\label{tab_cache_replace_next}
\end{table}
\begin{table}
\centering
\begin{tt}
\begin{tabular}{lrrrrrrrrrrr}
\toprule
&a() &b() &ret &c() &ret &b() &ret &c() &ret &b() &ret \\
\midrule
\rm{Block 1}&A &$\to$a &a &a &a &$\to$a &a &a &a &$\to$a &a \\
\rm{Block 2}&A &a &a &a &a &a &a &a &a &a &a \\
\rm{Block 3}&$\to$- &B &$\to$b &C &$\to$c &B &$\to$b &C &$\to$c &B &$\to$b \\
\rm{Block 4}&- &B &b &$\to$- &- &B &b &$\to$- &- &B &b \\
\midrule
\rm{Fill} &2 &4 & &5 & &7 & &8 & &10 & \\
\bottomrule
\end{tabular}
\end{tt}
\caption{Stack oriented replacement policy}
\label{tab_cache_replace_stack}
\end{table}
In this example, the stack oriented approach needs fewer fills, as
only methods b() and c() are exchanged and method a() stays in the
cache. However, if, for example, method b() is the size of one
block, all methods can be held in the cache using the the \emph{next
block} policy, but b() and c() would be still exchanged using the
\emph{stack} policy. Therefore, the first approach is used in the
proposed cache.
\subsection{WCET Analysis}
\label{sec:cache:wcet}
The proposed instruction cache is designed to simplify WCET
analysis. Due to the fact that all cache misses are only included in
two instructions (\emph{invoke} and \emph{return}), the instruction
cache can be ignored on all other instructions. The time needed to
load a complete method is calculated using the memory properties
(latency and bandwidth) and the length of the method. On an invoke,
the length of the invoked method is used, and on a return, the
method length of the caller is used to calculate the load time.
With a single method cache this calculation can be further
simplified. For every invoke there is a corresponding return. That
means that the time needed for the cache load on return can be
included in the time for the invoke instruction. This is simpler
because both methods, the caller and the callee, are known at the
occurrence of the invoke instruction. The information about which
method was the caller need not be stored for the return instruction
to be analyzed.
With more than one method in the cache, a cache hit detection has to
be performed as part of the WCET analysis. If there are only two
blocks, this is trivial, as (i) a hit on invoke is only possible if
the method is the same as the last invoked (e.g.\ a single method in
a loop) and (ii) a hit on return is only possible when the method is
a leaf in the call tree. In the latter case, it is always a hit.
When the cache contains more blocks (i.e.\ more than two methods can
be cached), a part of the call tree has to be taken into account for
hit detection. The variable block cache further complicates the
analysis, as the method length also determines the cache content.
However, this analysis is still simpler than a cache modeling of a
direct-mapped instruction cache, as cache block replacement depends
on the call tree instead of instruction addresses.
In traditional caches, data access and instruction cache fill
requests can compete for the main memory bus. For example, a load or
store at the end of the processor pipeline competes with an
instruction fetch that results in a cache miss. One of the two
instructions is stalled for additional cycles by the other
instruction. With a data cache, this situation can be even worse.
The worst-case scenario for the memory stall time for an instruction
fetch or a data load is two miss penalties when both cache reads are
a miss. This unpredictable behavior leads to very pessimistic WCET
bounds.
A \emph{method cache}, with cache fills only on invoke and return,
does not interfere with data access to the main memory. Data in the
main memory is accessed with \emph{getfield} and \emph{putfield},
instructions that never overlap with \emph{invoke} and
\emph{return}. This property removes another uncertainty found in
traditional cache designs.
\subsection{Caches Compared}
In this section, we will compare the different cache architectures
in a quantitative way. Although our primary concern is
predictability, performance remains important. We will therefore
first present the results from a conventional direct-mapped
instruction cache. These measurements will then provide a baseline
for the evaluation of the proposed architecture.
Cache performance varies with different application domains. As the
proposed system is intended for real-time applications, the
benchmark for these tests should reflect this fact. However, there
are no standard benchmarks available for embedded real-time systems.
A real-time application was therefore adapted to create this
benchmark. The application is from one node of a distributed motor
control system \cite{jop:wises03} (see also
Section~\ref{sec:app:kfl}). A simulation of the environment (sensors
and actors) and the communication system (commands from the master
station) forms part of the benchmark for simulating the real-world
workload.
The data for all measurements was captured using a simulation of JOP
and running the application for 500,000 clock cycles. During this
time, the major loop of the application was executed several hundred
times, effectively rendering any misses during the initialization
code irrelevant to the measurements.
\subsubsection{Direct-Mapped Cache}
\tablename~\ref{tab_cache_direct} gives the memory bytes and memory
transactions per instruction byte for a standard direct-mapped
cache. As we can see from the values for a cache size of 4KB, the
kernel of the application is small enough to fit completely into the
4KB cache. The cache performs better (i.e.\ fewer bytes are
transferred) with smaller block sizes. With smaller block sizes, the
chance of unused data being read is reduced and the larger number of
blocks reduces conflict misses. However, reducing the block size
also increases memory transactions (MTIB), which directly relates to
memory latency.
\begin{table}
\centering
\begin{tabular}{cd{2.0}cc}
\toprule
Cache size & \cc{Block size} & MBIB & MTIB \\
\midrule
1 KB & 8 & 0.28 & 0.035 \\
1 KB & 16 & 0.38 & 0.024 \\
1 KB & 32 & 0.58 & 0.018 \\
2 KB & 8 & 0.17 & 0.022 \\
2 KB & 16 & 0.25 & 0.015 \\
2 KB & 32 & 0.41 & 0.013 \\
4 KB & 8 & 0.00 & 0.001 \\
4 KB & 16 & 0.01 & 0.000 \\
4 KB & 32 & 0.01 & 0.000 \\
\bottomrule
\end{tabular}
\caption{Direct-mapped cache}
\label{tab_cache_direct}
\end{table}
Which configuration performs best depends on the relationship
between memory bandwidth and memory latency. Examples of average
memory access times in cycles per instruction byte for different
memory technologies are provided in
Table~\ref{tab_cache_direct_mem}. The third column shows the cache
performance for a Static RAM (SRAM) that is very common in embedded
systems. A latency of 1 clock cycle and an access time of 2 clock
cycles per 32-bit word are assumed. For the synchronous DRAM (SDRAM)
in the forth column, a latency of 5 cycles (3 cycle for the row
address and 2 cycle CAS latency) is assumed. The memory delivers one
word (4 bytes) per cycle. The Double Data Rate (DDR) SDRAM in the
last column has an enhanced latency of 4.5 cycles and transfers data
on both the rising and falling edge of the clock signal.
%
% simulate 32-bits (DDR) SDRAM
%
% SDRAM: 5 cycle latency: 3 cycle row address and 2 cycle CAS latency
% 4 Bytes / cycle (0.25 / Byte)
%
% DDR: 4.5 cycle latency: 2 cycle row address and 2.5 cycle CAS latency
% 4 Bytes / 0.5 cycle (0.125 / Byte)
%
% SRAM 15 ns at 100 MHz: 1 cycle latency, 2 cycles / word
%
\begin{table}
\centering
\begin{tabular}{cd{2.0}ccc}
\toprule
Cache size & \cc{Block size} & SRAM & SDRAM & DDR \\
\midrule
1 KB & 8 & \textbf{0.18} & 0.25 & 0.19 \\
1 KB & 16 & 0.22 & \textbf{0.22} & 0.16 \\
1 KB & 32 & 0.31 & 0.24 & \textbf{0.15} \\
2 KB & 8 & \textbf{0.11} & 0.15 & 0.12 \\
2 KB & 16 & 0.14 & \textbf{0.14} & \textbf{0.10} \\
2 KB & 32 & 0.22 & 0.17 & 0.11 \\
\bottomrule
\end{tabular}
\caption{Direct-mapped cache, average memory access time}
\label{tab_cache_direct_mem}
\end{table}
The data in bold give the best block size for different memory
technologies. As expected, memories with a higher latency and
bandwidth perform better with larger block sizes. For small block
sizes, the latency clearly dominates the access time. Although the
SRAM has half the bandwidth of the SDRAM and a quarter of the DDR,
with a block size of 8 bytes, it is faster than the DRAM memories.
In most cases a block size of 16 bytes is the fastest solution and
we will therefore use this configuration for comparison with the
following cache solutions.
\subsubsection{Fixed Block Cache}
Cache performance for single method per block architectures is shown
in Table~\ref{tab_cache_block}. The measurements for a simple 8 byte
prefetch queue are also given, for reference. With prefetching, we
would expect to see an MBIB of about 1. The 37\% overhead results
from the fact that the prefetch queue fetches 4 bytes a time and has
to buffer a minimum of 3 bytes for the instruction fetch stage. On a
branch or return, the queue is flushed and these bytes are lost.
A single block that has to be filled on every invoke and return
requires considerable overheads. More than twice the amount of data
is read from the main memory than is consumed by the processor.
However, the memory transaction count is 16 times lower than with
simple prefetching, which can compensate for the large MBIB for main
memories with high latency.
The solution with two blocks for two methods performs almost twice
as well as the simple one method cache. This is due to the fact
that, for all leaves in the call tree, the caller method can be
found on return. If the block count is doubled again, the number of
misses is reduced by a further 25\%, but the cache size also
doubles. For this measurement, an LRU replacement policy applies for
the two and four block caches.
\begin{table}
\centering
\begin{tabular}{lcccc}
\toprule
Type & Cache size & MBIB & MTIB \\
\midrule
Prefetch & 8 B & 1.37 & 0.342 \\
Single method & 1 KB & 2.32 & 0.021 \\
Two blocks & 2 KB & 1.21 & 0.013 \\
Four blocks & 4 KB & 0.90 & 0.010 \\
\bottomrule
\end{tabular}
\caption{Fixed block cache}
\label{tab_cache_block}
\end{table}
The same memory parameters as in the previous section are also used
in Table~\ref{tab_cache_block_mem}. With the high latency of the
DRAMs, even the simple one block cache is a faster (and more
accurately predictable) solution than a prefetch queue. As MBIB and
MTBI show the same trend as a function of the number of blocks, this
is reflected in the access time in all three memory examples.
\begin{table}
\centering
\begin{tabular}{lcccc}
\toprule
Type & Cache size & SRAM & SDRAM & DDR \\
\midrule
Prefetch & 8 B & 1.02 & 2.05 & 1.71 \\
Single Method & 1 KB & 1.18 & 0.69 & 0.39 \\
Two blocks & 2 KB & 0.62 & 0.37 & 0.21 \\
Four blocks & 4 KB & 0.46 & 0.27 & 0.16 \\
\bottomrule
\end{tabular}
\caption{Fixed block cache, average memory access time}
\label{tab_cache_block_mem}
\end{table}
\subsubsection{Variable Block Cache}
\tablename~\ref{tab_cache_var_block} shows the cache performance of
the proposed solution, i.e.\ of a method cache with several blocks
per method, for different cache sizes and number of blocks. For this
measurement, a \emph{next block} replacement policy applies.
\begin{table}
\centering
\begin{tabular}{cd{2.0}cc}
\toprule
Cache size & \cc{Block count} & MBIB & MTIB \\
\midrule
1 KB & 8 & 0.80 & 0.009 \\
1 KB & 16 & 0.71 & 0.008 \\
1 KB & 32 & 0.70 & 0.008 \\
1 KB & 64 & 0.70 & 0.008 \\
2 KB & 8 & 0.73 & 0.008 \\
2 KB & 16 & 0.37 & 0.004 \\
2 KB & 32 & 0.24 & 0.003 \\
2 KB & 64 & 0.12 & 0.001 \\
4 KB & 8 & 0.73 & 0.008 \\
4 KB & 16 & 0.25 & 0.003 \\
4 KB & 32 & 0.01 & 0.000 \\
4 KB & 64 & 0.00 & 0.000 \\
\bottomrule
\end{tabular}
\caption{Variable block cache}
\label{tab_cache_var_block}
\end{table}
In this scenario, as the MBIB is very high at a cache size of 1KB
and almost independent of the block count, the cache capacity is
seen to be clearly dominant. The most interesting cache size with
this benchmark is 2KB. Here, we can see the influence of the number
of blocks on both performance parameters. Both values benefit from
more blocks. However, a higher block count requires more time or
more hardware for the hit detection. With a cache size of 4KB and
enough blocks, the kernel of the application completely fits into
the variable block cache, as we have seen with a 4KB traditional
cache. From the gap between 16 and 32 blocks (within the 4KB cache),
we can say that the application consists of fewer than 32 different
methods.
It can be seen that even the smallest configuration with a cache
size of 1KB and only 8 blocks outperforms fixed block caches with 2
or 4KB in both parameters (MBIB and MTIB). Compared with the fixed
block solutions, MTIB is low in all configurations. This is due to
the better hit rate, as indicated by the lower MBIB.
In most configurations, MBIB is higher than for the direct-mapped
cache. It is very interesting to note that, in all configurations
(even the small 1KB cache), MTIB is lower than in all 1KB and 2KB
configurations of the direct-mapped cache. This is a result of the
complete method transfers when a miss occurs and is clearly an
advantage for main memory systems with high latency.
As in the previous examples, Table~\ref{tab_cache_var_block_mem}
shows the average memory access time per instruction byte for three
different main memories.
\begin{table}
\centering
\begin{tabular}{cd{2.0}ccc}
\toprule
Cache size & \cc{Block count} & SRAM & SDRAM & DDR \\
\midrule
1 KB & 8 & 0.41 & 0.24 & 0.14 \\
1 KB & 16 & 0.36 & 0.22 & 0.12 \\
1 KB & 32 & 0.36 & 0.21 & 0.12 \\
1 KB & 64 & 0.36 & 0.21 & 0.12 \\
2 KB & 8 & 0.37 & 0.22 & 0.13 \\
2 KB & 16 & 0.19 & 0.11 & 0.06 \\
2 KB & 32 & 0.12 & 0.08 & 0.04 \\
2 KB & 64 & 0.06 & 0.04 & 0.02 \\
\bottomrule
\end{tabular}
\caption{Variable block cache, average memory access time}
\label{tab_cache_var_block_mem}
\end{table}
In the DRAM configurations, the variable block cache directly
benefits from the low MTBI. When comparing the values between SDRAM
and DDR, we can see that the bandwidth affects the memory access
time in a way that is approximately linear. The high latency of
these memories is completely hidden. The configuration with 16 or
more blocks and dynamic RAMs outperforms the direct-mapped cache of
the same size. As expected, a memory with low latency (the SRAM in
this example) depends on the MBIB values. The variable block cache
is slower than the direct-mapped cache in the 1KB configuration
because of the higher MBIB (0.7 compared to 0.3-0.6), and performs
very similarly at a cache size of 2KB.
In Table~\ref{tab_cache_compared}, the different cache solutions
with a size of 2KB are summarized. The detail results of all caches
can be found in Appendix~\ref{appx:bench}.
%
\begin{table}
\centering
\begin{tabular}{lcc}
\toprule
Cache type & MBIB & MTIB \\
\midrule
Single method & 2.32 & 0.021 \\
Two blocks & 1.21 & 0.013 \\
Variable block (16) & 0.37 & 0.004 \\
Variable block (32) & 0.24 & 0.003 \\
Direct-mapped & 0.25 & 0.015 \\
\bottomrule
\end{tabular}
\caption{Caches compared}
\label{tab_cache_compared}
\end{table}
%
All full method caches with two or more blocks have a lower MTIB
than a conventional cache solution. This becomes more significant
with increasing latency in main memories. The MBIB value is only
quite high for one or two methods in the cache. However, the most
surprising result is that the variable block cache with 32 blocks
outperforms a direct-mapped cache of the same size at both values.
We can see that predictability is indirectly related to performance
-- a trend we had anticipated. The most predictable solution with a
single method cache performs very poorly compared to a conventional
direct-mapped cache. If we accept a slightly more complex WCET
analysis (taking a small part of the call tree into account), we can
use the two-block cache that is about two times better.
With the variable block cache, it could be argued that the WCET
analysis becomes too complex, but it is nevertheless simpler than
that with the direct-mapped cache. However, every hit in the
two-block cache will also be a hit in a variable block cache (of the
same size). A tradeoff might be to analyze the program by assuming a
two-block cache but using a version of the variable block cache. The
additional performance gain can than be used by non- or soft
real-time parts of an application.
\subsection{Summary}
In this section, we have extended the single cache performance
measurement \emph{miss rate} to a two value set, memory read and
transaction rate, in order to perform a more detailed evaluation of
different cache architectures. From the properties of the Java
language -- usually small methods and relative branches -- we
derived the novel idea of a \emph{method cache}, i.e.\ a cache
organization in which whole methods are loaded into the cache on
method invocation and the return from a method. This cache
organization is time-predictable, as all cache misses are lumped
together in these two instructions. Using only one block for a
single method introduces considerable overheads in comparison with a
conventional cache, but is very simple to analyze. We extended this
cache to hold more methods, with one block per method and several
smaller blocks per method.
Comparing these organizations quantitatively with a benchmark
derived from a real-time application, we have seen that the
variable block cache performs similarly to (and in one configuration
even better than) a direct-mapped cache, in respect of the bytes
that have to be filled on a cache miss. In all configurations and
sizes of the variable block cache, the number of memory
transactions, which relates to memory latency, is lower than in a
traditional cache.
Only filling the cache on method invocation and return simplifies
WCET analysis and removes another source of uncertainty, as there is
no competition for the main memory access between instruction cache
and data cache.
%\subsubsection{Future Work}
%\begin{itemize}
% \item Variation of direct-mapped with link addresses
% Varies between version of paper and now
% \item cache graphs: cycles per loop
% \item more benchmarks
%\end{itemize}
%\bibliographystyle{plain}
%\bibliography{../bib/mybib}
%
%\begin{table*}
% \centering
% \begin{tabular}{lcccccc}
% \toprule
% & & & & \multicolumn{3}{c}{Memory access time} \\
% \cmidrule{5-7}
% Type & Size & MBIB & MTIB & SRAM & SDRAM & DDR SDRAM \\
% \midrule
%
% Prefetch buffer & 8 B & 1.37 & 0.342 & 1.02 & 2.05 & 1.71 \\
% Single method cache & 1 KB & 2.32 & 0.021 & 1.18 & 0.69 & 0.39 \\
% Two block cache & 2 KB & 1.21 & 0.013 & 0.62 & 0.37 & 0.21 \\
% Four block cache & 4 KB & 0.90 & 0.010 & 0.46 & 0.27 & 0.16 \\
% Direct mapped 8 bytes & 1 KB & 0.28 & 0.035 & 0.18 & 0.25 & 0.19 \\
% Direct mapped 16 bytes & 1 KB & 0.38 & 0.024 & 0.22 & 0.22 & 0.16 \\
% Direct mapped 32 bytes & 1 KB & 0.58 & 0.018 & 0.31 & 0.24 & 0.15 \\
% Direct mapped 8 bytes & 2 KB & 0.17 & 0.022 & 0.11 & 0.15 & 0.12 \\
% Direct mapped 16 bytes & 2 KB & 0.25 & 0.015 & 0.14 & 0.14 & 0.10 \\
% Direct mapped 32 bytes & 2 KB & 0.41 & 0.013 & 0.22 & 0.17 & 0.11 \\
% Direct mapped 8 bytes & 4 KB & 0.00 & 0.001 & 0.00 & 0.00 & 0.00 \\
% Direct mapped 16 bytes & 4 KB & 0.01 & 0.000 & 0.00 & 0.00 & 0.00 \\
% Direct mapped 32 bytes & 4 KB & 0.01 & 0.000 & 0.00 & 0.00 & 0.00 \\
% Variable block cache 8 blocks & 1 KB & 0.80 & 0.009 & 0.41 & 0.24 & 0.14 \\
% Variable block cache 16 blocks & 1 KB & 0.71 & 0.008 & 0.36 & 0.22 & 0.12 \\
% Variable block cache 32 blocks & 1 KB & 0.70 & 0.008 & 0.36 & 0.21 & 0.12 \\
% Variable block cache 64 blocks & 1 KB & 0.70 & 0.008 & 0.36 & 0.21 & 0.12 \\
% Variable block cache 8 blocks & 2 KB & 0.73 & 0.008 & 0.37 & 0.22 & 0.13 \\
% Variable block cache 16 blocks & 2 KB & 0.37 & 0.004 & 0.19 & 0.11 & 0.06 \\
% Variable block cache 32 blocks & 2 KB & 0.24 & 0.003 & 0.12 & 0.08 & 0.04 \\
% Variable block cache 64 blocks & 2 KB & 0.12 & 0.001 & 0.06 & 0.04 & 0.02 \\
% Variable block cache 8 blocks & 4 KB & 0.73 & 0.008 & 0.37 & 0.22 & 0.13 \\
% Variable block cache 16 blocks & 4 KB & 0.25 & 0.003 & 0.13 & 0.08 & 0.05 \\
% Variable block cache 32 blocks & 4 KB & 0.01 & 0.000 & 0.00 & 0.00 & 0.00 \\
% Variable block cache 64 blocks & 4 KB & 0.00 & 0.000 & 0.00 & 0.00 & 0.00 \\
%
% \bottomrule
%
% \end{tabular}
% \caption[Cache performance compared]{Cache performance in MBIB and MTIB of all variations of
% the method cache and a conventional direct mapped cache. Average
% memory access time per instruction byte for three different main
% memory technologies. All numbers are in clock cycles.
% }
% \label{tab_cache_all}
%\end{table*}
%
%
%
%\end{document}
|
 |