LOGIN   :::   RECOVER PASS   :::   GET ACCOUNT    
Browse
  • Projects
  • Code (CVS)
  • Forums
  • News
  • Articles
  • Polls
  •  
    OpenCores
  • FAQ
  • CVS HowTo
  • Mission
  • Media
  • Tools
  • Advertise
  • Mirrors
  • Logos
  • Contact us
  • Job Opportunity
  •  
    Tools
  • Search
      
  • Download Cores (CVSGet)
  •  
    More
  • Wishbone
  • Perlilog
  • EDA tools
  • OpenTech CD
  •  
    Navigation: All forums > Cvs-checkins > Message List > Message Post

    Message

    Reply | Reply all
    Date Prev | Date Next | Thread Prev | Thread Next Date Index | Thread Index

    From: cvs at opencores.org<cvs@o...>
    Date: Sat Aug 25 19:59:22 CEST 2007
    Subject: [cvs-checkins] MODIFIED: jop ...
    Top
    Date: 00/07/08 25:19:59

    Added: jop/doc/book/cache cache.tex
    Log:
    Handbook update


    Revision Changes Path
    1.1 jop/doc/book/cache/cache.tex

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/cache/cache.tex?rev=1.1&content-type=text/x-cvsweb-markup

    Index: cache.tex
    ===================================================================
    \label{sec:cache}


    Worst-case execution time (WCET) analysis \cite{84850} of real-time
    programs is essential for any schedulability analysis. To provide a
    low WCET value, a good processor model is necessary. However, the
    architectural advancement in modern processor designs is dominated
    by the rule: '\emph{Make the common case fast}`. This is the
    opposite of '\emph{Reduce the worst case}` and complicates WCET
    analysis.

    Cache memory for the instructions and data is a classic example of
    this paradigm. Avoiding or ignoring this feature in real-time
    systems, due to its unpredictable behavior, results in a very
    pessimistic WCET value. Plenty of effort has gone into research into
    integrating the instruction cache in the timing analysis of tasks
    \cite{Arnold1994, Healy1995, 225068} and the influence of the cache
    on task preemption \cite{279589, RTAS96*204}. The influence of
    different cache architectures on WCET analysis is described in
    \cite{Heckmann:IEEE2003}.

    We will tackle this problem from the architectural side -- an
    instruction cache organization in which simpler and more accurate
    WCET analysis is more important than average case performance.

    In this section, we will propose a method cache with a novel
    replacement policy. In Java bytecode only relative branches exist,
    and a method is therefore only left when a return instruction has
    been executed\footnote{An uncaught exception also results in a
    method exit.}. It has been observed that methods are typically short
    (see Section~\ref{sec:bench:jvm:methods}) in Java applications.
    These properties are utilized by a cache architecture that stores
    complete methods. A complete method is loaded into the cache on both
    invocation and return. This cache fill strategy lumps all cache
    misses together and is very simple to analyze.

    \subsection{Cache Performance}

    In real-time systems we prefer time-predictable architectures over
    those with a high average performance. However, performance is still
    important. In this section, we will give a short overview of the
    formulas from \cite{Hennessy02} that are used to calculate the
    cache's influence on execution time. We will extend the single
    measurement \emph{miss rate} to a two value set, memory read and
    transaction rate, that is architecture independent and better
    reflects the two properties (bandwidth and latency) of the main
    memory. To evaluate cache performance, \(MEM_{clk}\) memory stall
    cycles are added to the CPU execution time (\(t_{exe}\)) equation:
    %
    \begin{align}
    \nonumber
    t_{exe} & = (CPU_{clk}+MEM_{clk}) \times t_{clk} \\
    \nonumber
    MEM_{clk}& = Misses \times MP_{clk}
    \end{align}
    %
    The miss penalty \(MP_{clk}\) is the cost per miss, measured in
    clock cycles. When the instruction count $IC$ is given as the number
    of instructions executed, $CPI$ the average clock cycles per
    instruction and the number of misses per instruction, we obtain the
    following result:
    %
    \begin{align}
    \nonumber
    CPU_{clk} & = IC \times CPI_{exe} \\
    \nonumber
    MEM_{clk} & = IC \times \frac{Misses}{Instruction} \times MP_{clk} \\
    \nonumber
    t_{exe} & = IC \times (CPI_{exe}+\frac{Misses}{Instruction} \times
    MP_{clk}) \times t_{clk}
    \end{align}
    %
    As this section is only concerned with the instruction cache, we
    will split the memory stall cycles into misses caused by the
    instruction fetch and misses caused by data access.
    %
    \[CPI = CPI_{exe} + CPI_{IM} + CPI_{DM}\]
    %
    $CPI_{exe}$ is the average number of clock cycles per instruction,
    given an ideal memory system without any stalls. $CPI_{IM}$ are the
    additional clock cycles caused by instruction cache misses and
    $CPI_{DM}$ the data miss portion of the CPI. This split between
    instruction and data portions of the CPI better reflects the split
    of the cache between instruction and data cache found in actual
    processors. The misses per instruction are often given as misses per
    1000 instructions. However, there are several drawbacks to using a
    single number:
    %
    \begin{description} \item[Architecture dependent:] The average number of memory accesses per instruction differs greatly between a RISC processor and the Java Virtual Machine (JVM). A typical RISC processor needs one memory word (4 bytes) per instruction word, and about 40\% of the instructions \cite{Hennessy02} are \emph{load} or \emph{store} instructions. Using the example of a 32-bit RISC processor, this results in 5.6 bytes memory access per instruction. The average length of a JVM bytecode instruction is 1.7 bytes and about 18\% of the instructions access the memory for data load and store. \item[Block size dependent:] Misses per instruction depends subtly on the block size. On a single cache miss, a whole block of the cache is filled. Therefore, the probability that a future instruction request is a hit is higher with a larger block size. However, a larger block size results in a higher miss penalty as more memory is transferred. \end{description} % Main memory is usually composed of DRAMs. Access time to this memory is measured in terms of latency (the time taken to access the first word of a larger block) and bandwidth (the number of bytes read or written in a single request per time unit). These two values, along with the block size of a cache, are used to calculate the miss penalty: % \[MP_{clk} = Latency + \frac{Block\;size}{Bandwidth}\] % To better evaluate different cache organizations and different instruction sets (RISC versus JVM), we will introduce two performance measurements -- memory bytes read per instruction byte and memory transactions per instruction byte: % \begin{align} \nonumber MBIB & = \frac{Memory\;bytes\;read}{Instruction\;bytes}\\ \nonumber MTIB & = \frac{Memory\;transactions}{Instruction\;bytes} \end{align} % These two measures are closely related to memory bandwidth and latency. With these two values and the properties of the main memory, we can calculate the average memory cycles per instruction byte \(MCIB\) and \(CPI_{IM}\), i.e.\ the values we are concerned in this section. % \begin{align} \nonumber MCIB & = (\frac{MBIB}{Bandwith}+MTIB \times Latency)\\ \nonumber CPI_{IM} & = MCIB \times {Instruction\;length} \end{align} % The misses per instruction can be converted to MBIB and MTIB when the following parameters are known: the average instruction length of the architecture, the block size of the cache and the miss penalty in latency and bandwidth. We will examine this further in the following example: We use the following architecture to illustrate the conversion: a RISC architecture with a 4 bytes instruction length, an 8KB instruction cache with 64-byte blocks and a miss rate of 8.16 per 1000 instructions \cite{Hennessy02}. The miss penalty is 100 clock cycles. The memory system is assumed to deliver one word (4 bytes) per cycle. Firstly, we need to calculate the latency of the memory system. \begin{equation} \nonumber \begin{split} Latency & = MP_{clk} - \frac{Block size}{Bandwidth} \\ & = 100 - \frac{64}{4} = 84 \mbox{ clock cycles} \end{split} \end{equation} % With \(Miss\;rate=\frac{Cache\;miss}{Cache\;access}\), we obtain MBIB. % \begin{equation} \nonumber \begin{split} MBIB & = \frac{Memory\;bytes\;read}{Instruction\;bytes} \\ & = \frac{Cache\;miss \times Block\;size} {Cache\;access \times Instruction\;length} \\ & = Miss\;rate \times \frac{Block\;size}{Instruction\;length} \\ & = 8.16 \times 10^{-3} \times \frac{65}{4}\\ & = 0.131 \end{split} \end{equation} % MTIB is calculated in a similar way: % \begin{equation} \nonumber \begin{split} MTIB & = \frac{Memory\;transactions}{Instruction\;bytes}\\ & = \frac{Cache\;miss} {Cache\;access \times Instruction\;length} \\ & = \frac{Miss\;rate}{Instruction\;length}\\ & = \frac{8.16 \times 10^{-3}}{4}\\ & = 2.04 \times 10^{-3} \end{split} \end{equation} % For a quick check, we can calculate \(CPI_{IM}\): % \begin{equation} \nonumber \begin{split} MCIB & = \frac{MBIB}{Bandwith}+MTIB \times Latency\\ & = \frac{0.131}{4}+2.04 \times 10^{-3} \times 84\\ & = 0.204\\ CPI_{IM} & = MCIB \times {Instruction\;length} \\ & = 0.204 \times 4\\ & = 0.816 \end{split} \end{equation} % This is the same value as that which we get from using the miss rate with the miss penalty: % \begin{equation} \nonumber \begin{split} CPI_{IM} & = Miss\;rate \times Miss\;penalty\\ & = 8.16 \times 10^{-3} \times 100\\ & = 0.816 \end{split} \end{equation} % However, MBIB and MTIB are architecture independent and better reflect the latency and bandwidth of the main memory. \subsection{Proposed Cache Solution} In this section, we will develop a solution for a predictable cache. Typical Java programs consist of short methods. There are no branches out of the method and all branches inside are relative. In the proposed architecture, the full code of a method is loaded into the cache before execution. The cache is filled on invocations and returns. This means that all cache fills are lumped together with a known execution time. The full loaded method and relative addressing inside a method also result in a simpler cache. Tag memory and address translation are not necessary. However, we will first discuss an even simpler solution -- no caching at all. Without an instruction cache, prefetching is mandatory, especially with a variable length instruction set. The issues surrounding prefetching are discussed in the next section. \subsubsection{Instruction Prefetching} A simple prefetch queue, as found in older processors, can increase instruction throughput after a multi-cycle bytecode is executed. However, for a stream of single-cycle bytecodes, prefetching is useless and the frequent occurrence of branches, method invocations, and method returns (see Section~\ref{sec:bench:jvm}) reduces the performance gain. Using a prefetch queue also results in execution time dependencies over a stream of instructions, which complicates timing analysis. For a variable length instruction set, prefetching is also not a straightforward option. The prefetching unit needs to guarantee the availability of a complete instruction for the fetch unit. As the actual length of the instruction is not known at this stage, the prefetch unit must be a minimum of $maximum\;length - 1$ bytes ahead of the requested instruction. This can lead to unnecessary memory transfers. The return instruction is a typical example of this. It is 1 byte long and the additional prefetched instruction bytes are never used. A memory interface with a bus width greater than one byte adds an artificial boundary to the instruction stream. For the purpose of this example, we are assuming a 4 byte memory interface. In this case we need an 8 byte prefetch buffer. On a branch to an address $address\;mod\;4 > 4-maximum\;instruction\;length$, two words need to be loaded from main memory before the processor can continue. A memory technology, such as synchronous DRAM, has a large latency for the first accessed word and then a high bandwidth for the following words. Prefetching that only loads small quantities (one or two words) from the memory is therefore impracticable with these memory technologies. \subsubsection{Single Method Cache} A single method cache, although less efficient than a conventional instruction cache, can be incorporated very easily into the WCET analysis. The time needed for the memory transfer must be added to the invoke and return instructions. % Wiederholung von Einlietung der proposed cache solution % könnte nach vorne gelegt werden, % dann aber ein, zwei Sätze über den single method cache The method cache also simplifies the hardware of the cache, as it means that no tag memory or address translation is necessary. Other parts of the processor are also smaller. The program counter, the associated adders and multiplexer are shorter than in a standard cache solution. For example, for a 1KB cache, the size of these units is only 10 bits, instead of 32 bits. The main disadvantage of this single method cache is the high overhead when a complete method is loaded into the cache and only a small fraction of the code is executed. This issue is similar to that encountered with unused data in a cache line. However, in extreme cases, this overhead can be very high. The second problem can be seen in following example: % \begin{samepage} \begin{lstlisting} foo() { a(); b(); } \end{lstlisting} \end{samepage} % This code sequence results in the following cache loads: % \begin{enumerate} \item method \code{foo} is loaded on invocation of \code{foo()} \item method \code{a} is loaded on invocation of \code{a()} \item method \code{foo} is loaded on return from \code{a()} \item method \code{b} is loaded on invocation of \code{b()} \item method \code{foo} is loaded on return from \code{b()} \end{enumerate} % The main drawback of the single method cache is the multiple cache fill of \code{foo()} on return from methods \code{a()} and \code{b()}. In a conventional cache design, if these three methods fit in the cache memory at the same time and there is no placement conflict, each method is only loaded once. This issue can be overcome by caching more than one method. The simplest solution is a two-block cache. \subsubsection{Two-Block Cache} The two-block cache can hold up to two methods in the cache. This results in having to decide which block is replaced on a cache miss. With only two blocks, Least-Recently Used (LRU) is trivial to implement. The code sequence now results in the cache loads and hits as shown in Table~\ref{tab_cache_two_block_example}. % \begin{table}[htbp] \centering \begin{tabular}{llll} \toprule Instruction & Block 1 & Block 2 & Cache \\ \midrule foo() & foo & -- & load \\ a() & foo & a & load \\ return & foo & a & hit \\ b() & foo & b & load \\ return & foo & b & hit \\ \bottomrule \end{tabular} \caption{Cache load and hit example with the two-block cache} \label{tab_cache_two_block_example} \end{table} % With the two-block cache, we have to double the cache memory or use both blocks for a single large method. The WCET analysis is slightly more complex than with a single block. A short history of the invocation sequence has to be used to find the cache fills and hits. However, a cache that can only hold two methods is still very restrictive. The next code sequence shows the conflict. \tablename~\ref{tab_cache_two_block_conflict} shows the resulting cache loads. % \begin{samepage} \begin{lstlisting} foo() { a(); } a() { b(); } \end{lstlisting} % \begin{table}[htbp] \centering \begin{tabular}{llll} \toprule Instruction & Block 1 & Block 2 & Cache \\ \midrule foo() & foo & -- & load \\ a() & foo & a & load \\ b() & b & a & load \\ return & b & a & hit \\ return & foo & a & load \\ \bottomrule \end{tabular} \caption{Cache conflict example with the two-block cache} \label{tab_cache_two_block_conflict} \end{table} \end{samepage} % A memory (similar to the tag memory) with one word per block is used to store a reference to the cached method. However, this memory can be slower than the tag memory as it is only accessed on invocation or return, rather than on every cache access. \subsubsection{More Blocks} We can improve the hit rate by adding more blocks to the cache. If only one block per method is used, the cache size increases with the number of blocks. With more than two blocks, LRU replacement policy means that another word is needed for every block containing a use counter that is updated on every invoke and return. During replacement, this list is searched for the LRU block. Hit detection involves a search through the list of the method references of the blocks. If this search is done in microcode, it imposes a limit on the maximum number of blocks. \subsubsection{Variable Block Cache} \label{sec:cache:var} Several cache blocks, all of the size as the largest method, are a waste of cache memory. Using smaller block sizes and allowing a method to span over several blocks, the blocks become very similar to cache lines. The main difference from a conventional cache is that the blocks for a method are all loaded at once and need to be consecutive. Choosing the block size is now a major design decision. Smaller block sizes allow better memory usage, but the search time for a hit also increases. With varying block numbers per method, an LRU replacement becomes impractical. When the method found to be LRU is smaller than the loaded method, this new method invalidates two cached methods. For the replacement, we will use a pointer $next$ that indicates the start of the blocks to be replaced on a cache miss. Two practical replace policies are: % \begin{description} \item [Next block:]At the very first beginning, $next$ points to the first block. When a method of length $l$ is loaded into the block $n$, $next$ is updated to $(n+l)\;mod\;block\;count$. \item [Stack oriented:]$next$ is updated in the same way as before on a method load. It is also updated on a method return -- independent of a resulting hit or miss -- to point to the first block of the leaving method. \end{description} % We will show the operation of these different replacement policies in an example with three methods: a(), b() and c() of block sizes 2, 2 and 1. The cache consists of 4 blocks and is therefore too small to hold all the methods during the execution of the code fragment shown in Listing~\ref{lst:cache:replace}. % \begin{samepage} \begin{lstlisting}[float,caption={Code fragment for the replacement example}, label=lst:cache:replace] a() { for (;;) { b(); c(); } ... } \end{lstlisting} \end{samepage} % Tables \ref{tab_cache_replace_next} and \ref{tab_cache_replace_stack} show the cache content during program execution for both replacement policies. The content of the cache blocks is shown after the execution of the invoke or return instruction. An uppercase letter indicates that this block has been newly loaded. A right arrow depicts the block to be replaced on a cache miss (the \emph{next} pointer). The last row shows the number of blocks that are filled during the execution of the program. % too long for book format % %\begin{table*} % \centering %\begin{tt} % \begin{tabular}{lrrrrrrrrrrrrr} % \toprule % % &a() &b() &ret &c() &ret &b() &ret &c() &ret &b() &ret &c() &ret \\ % \midrule % \rm{Block 1}&A &$\to$a &$\to$a &C &c &B &b &b &$\to$- &B &b &b &A \\ % \rm{Block 2}&A &a &a &$\to$- &A &$\to$a &$\to$a &C &c &B &b &b &$\to$- \\ % \rm{Block 3}&$\to$- &B &b &b &A &a &a &$\to$- &A &$\to$a &$\to$a &C &c \\ % \rm{Block 4}&- &B &b &b &$\to$- &B &b &b &A &a &a &$\to$- &A \\ % \midrule % \rm{Fill} &2 &4 & &5 &7 &9 & &11 &13 &15 & &16 &18 \\ % \bottomrule % % \end{tabular} %\end{tt} % \caption{Next block replacement policy} % \label{tab_cache_replace_next} %\end{table*} % %\begin{table*} % \centering %\begin{tt} % \begin{tabular}{lrrrrrrrrrrrrr} % \toprule % % &a() &b() &ret &c() &ret &b() &ret &c() &ret &b() &ret &c() &ret \\ % \midrule % \rm{Block 1}&A &$\to$a &a &a &a &$\to$a &a &a &a &$\to$a &a &a &a \\ % \rm{Block 2}&A &a &a &a &a &a &a &a &a &a &a &a &a \\ % \rm{Block 3}&$\to$- &B &$\to$b &C &$\to$c &B &$\to$b &C &$\to$c &B &$\to$b &C &$\to$c \\ % \rm{Block 4}&- &B &b &$\to$- &- &B &b &$\to$- &- &B &b &$\to$- &- \\ % \midrule % \rm{Fill} &2 &4 & &5 & &7 & &8 & &10 & &11 & \\ % \bottomrule % % \end{tabular} %\end{tt} % \caption{Stack oriented replacement policy} % \label{tab_cache_replace_stack} %\end{table*} \begin{table} \centering \begin{tt} \begin{tabular}{lrrrrrrrrrrr} \toprule &a() &b() &ret &c() &ret &b() &ret &c() &ret &b() &ret \\ \midrule \rm{Block 1}&A &$\to$a &$\to$a &C &c &B &b &b &$\to$- &B &b \\ \rm{Block 2}&A &a &a &$\to$- &A &$\to$a &$\to$a &C &c &B &b \\ \rm{Block 3}&$\to$- &B &b &b &A &a &a &$\to$- &A &$\to$a &$\to$a \\ \rm{Block 4}&- &B &b &b &$\to$- &B &b &b &A &a &a \\ \midrule \rm{Fill} &2 &4 & &5 &7 &9 & &11 &13 &15 & \\ \bottomrule \end{tabular} \end{tt} \caption{Next block replacement policy} \label{tab_cache_replace_next} \end{table} \begin{table} \centering \begin{tt} \begin{tabular}{lrrrrrrrrrrr} \toprule &a() &b() &ret &c() &ret &b() &ret &c() &ret &b() &ret \\ \midrule \rm{Block 1}&A &$\to$a &a &a &a &$\to$a &a &a &a &$\to$a &a \\ \rm{Block 2}&A &a &a &a &a &a &a &a &a &a &a \\ \rm{Block 3}&$\to$- &B &$\to$b &C &$\to$c &B &$\to$b &C &$\to$c &B &$\to$b \\ \rm{Block 4}&- &B &b &$\to$- &- &B &b &$\to$- &- &B &b \\ \midrule \rm{Fill} &2 &4 & &5 & &7 & &8 & &10 & \\ \bottomrule \end{tabular} \end{tt} \caption{Stack oriented replacement policy} \label{tab_cache_replace_stack} \end{table} In this example, the stack oriented approach needs fewer fills, as only methods b() and c() are exchanged and method a() stays in the cache. However, if, for example, method b() is the size of one block, all methods can be held in the cache using the the \emph{next block} policy, but b() and c() would be still exchanged using the \emph{stack} policy. Therefore, the first approach is used in the proposed cache. \subsection{WCET Analysis} \label{sec:cache:wcet} The proposed instruction cache is designed to simplify WCET analysis. Due to the fact that all cache misses are only included in two instructions (\emph{invoke} and \emph{return}), the instruction cache can be ignored on all other instructions. The time needed to load a complete method is calculated using the memory properties (latency and bandwidth) and the length of the method. On an invoke, the length of the invoked method is used, and on a return, the method length of the caller is used to calculate the load time. With a single method cache this calculation can be further simplified. For every invoke there is a corresponding return. That means that the time needed for the cache load on return can be included in the time for the invoke instruction. This is simpler because both methods, the caller and the callee, are known at the occurrence of the invoke instruction. The information about which method was the caller need not be stored for the return instruction to be analyzed. With more than one method in the cache, a cache hit detection has to be performed as part of the WCET analysis. If there are only two blocks, this is trivial, as (i) a hit on invoke is only possible if the method is the same as the last invoked (e.g.\ a single method in a loop) and (ii) a hit on return is only possible when the method is a leaf in the call tree. In the latter case, it is always a hit. When the cache contains more blocks (i.e.\ more than two methods can be cached), a part of the call tree has to be taken into account for hit detection. The variable block cache further complicates the analysis, as the method length also determines the cache content. However, this analysis is still simpler than a cache modeling of a direct-mapped instruction cache, as cache block replacement depends on the call tree instead of instruction addresses. In traditional caches, data access and instruction cache fill requests can compete for the main memory bus. For example, a load or store at the end of the processor pipeline competes with an instruction fetch that results in a cache miss. One of the two instructions is stalled for additional cycles by the other instruction. With a data cache, this situation can be even worse. The worst-case scenario for the memory stall time for an instruction fetch or a data load is two miss penalties when both cache reads are a miss. This unpredictable behavior leads to very pessimistic WCET bounds. A \emph{method cache}, with cache fills only on invoke and return, does not interfere with data access to the main memory. Data in the main memory is accessed with \emph{getfield} and \emph{putfield}, instructions that never overlap with \emph{invoke} and \emph{return}. This property removes another uncertainty found in traditional cache designs. \subsection{Caches Compared} In this section, we will compare the different cache architectures in a quantitative way. Although our primary concern is predictability, performance remains important. We will therefore first present the results from a conventional direct-mapped instruction cache. These measurements will then provide a baseline for the evaluation of the proposed architecture. Cache performance varies with different application domains. As the proposed system is intended for real-time applications, the benchmark for these tests should reflect this fact. However, there are no standard benchmarks available for embedded real-time systems. A real-time application was therefore adapted to create this benchmark. The application is from one node of a distributed motor control system \cite{jop:wises03} (see also Section~\ref{sec:app:kfl}). A simulation of the environment (sensors and actors) and the communication system (commands from the master station) forms part of the benchmark for simulating the real-world workload. The data for all measurements was captured using a simulation of JOP and running the application for 500,000 clock cycles. During this time, the major loop of the application was executed several hundred times, effectively rendering any misses during the initialization code irrelevant to the measurements. \subsubsection{Direct-Mapped Cache} \tablename~\ref{tab_cache_direct} gives the memory bytes and memory transactions per instruction byte for a standard direct-mapped cache. As we can see from the values for a cache size of 4KB, the kernel of the application is small enough to fit completely into the 4KB cache. The cache performs better (i.e.\ fewer bytes are transferred) with smaller block sizes. With smaller block sizes, the chance of unused data being read is reduced and the larger number of blocks reduces conflict misses. However, reducing the block size also increases memory transactions (MTIB), which directly relates to memory latency. \begin{table} \centering \begin{tabular}{cd{2.0}cc} \toprule Cache size & \cc{Block size} & MBIB & MTIB \\ \midrule 1 KB & 8 & 0.28 & 0.035 \\ 1 KB & 16 & 0.38 & 0.024 \\ 1 KB & 32 & 0.58 & 0.018 \\ 2 KB & 8 & 0.17 & 0.022 \\ 2 KB & 16 & 0.25 & 0.015 \\ 2 KB & 32 & 0.41 & 0.013 \\ 4 KB & 8 & 0.00 & 0.001 \\ 4 KB & 16 & 0.01 & 0.000 \\ 4 KB & 32 & 0.01 & 0.000 \\ \bottomrule \end{tabular} \caption{Direct-mapped cache} \label{tab_cache_direct} \end{table} Which configuration performs best depends on the relationship between memory bandwidth and memory latency. Examples of average memory access times in cycles per instruction byte for different memory technologies are provided in Table~\ref{tab_cache_direct_mem}. The third column shows the cache performance for a Static RAM (SRAM) that is very common in embedded systems. A latency of 1 clock cycle and an access time of 2 clock cycles per 32-bit word are assumed. For the synchronous DRAM (SDRAM) in the forth column, a latency of 5 cycles (3 cycle for the row address and 2 cycle CAS latency) is assumed. The memory delivers one word (4 bytes) per cycle. The Double Data Rate (DDR) SDRAM in the last column has an enhanced latency of 4.5 cycles and transfers data on both the rising and falling edge of the clock signal. % % simulate 32-bits (DDR) SDRAM % % SDRAM: 5 cycle latency: 3 cycle row address and 2 cycle CAS latency % 4 Bytes / cycle (0.25 / Byte) % % DDR: 4.5 cycle latency: 2 cycle row address and 2.5 cycle CAS latency % 4 Bytes / 0.5 cycle (0.125 / Byte) % % SRAM 15 ns at 100 MHz: 1 cycle latency, 2 cycles / word % \begin{table} \centering \begin{tabular}{cd{2.0}ccc} \toprule Cache size & \cc{Block size} & SRAM & SDRAM & DDR \\ \midrule 1 KB & 8 & \textbf{0.18} & 0.25 & 0.19 \\ 1 KB & 16 & 0.22 & \textbf{0.22} & 0.16 \\ 1 KB & 32 & 0.31 & 0.24 & \textbf{0.15} \\ 2 KB & 8 & \textbf{0.11} & 0.15 & 0.12 \\ 2 KB & 16 & 0.14 & \textbf{0.14} & \textbf{0.10} \\ 2 KB & 32 & 0.22 & 0.17 & 0.11 \\ \bottomrule \end{tabular} \caption{Direct-mapped cache, average memory access time} \label{tab_cache_direct_mem} \end{table} The data in bold give the best block size for different memory technologies. As expected, memories with a higher latency and bandwidth perform better with larger block sizes. For small block sizes, the latency clearly dominates the access time. Although the SRAM has half the bandwidth of the SDRAM and a quarter of the DDR, with a block size of 8 bytes, it is faster than the DRAM memories. In most cases a block size of 16 bytes is the fastest solution and we will therefore use this configuration for comparison with the following cache solutions. \subsubsection{Fixed Block Cache} Cache performance for single method per block architectures is shown in Table~\ref{tab_cache_block}. The measurements for a simple 8 byte prefetch queue are also given, for reference. With prefetching, we would expect to see an MBIB of about 1. The 37\% overhead results from the fact that the prefetch queue fetches 4 bytes a time and has to buffer a minimum of 3 bytes for the instruction fetch stage. On a branch or return, the queue is flushed and these bytes are lost. A single block that has to be filled on every invoke and return requires considerable overheads. More than twice the amount of data is read from the main memory than is consumed by the processor. However, the memory transaction count is 16 times lower than with simple prefetching, which can compensate for the large MBIB for main memories with high latency. The solution with two blocks for two methods performs almost twice as well as the simple one method cache. This is due to the fact that, for all leaves in the call tree, the caller method can be found on return. If the block count is doubled again, the number of misses is reduced by a further 25\%, but the cache size also doubles. For this measurement, an LRU replacement policy applies for the two and four block caches. \begin{table} \centering \begin{tabular}{lcccc} \toprule Type & Cache size & MBIB & MTIB \\ \midrule Prefetch & 8 B & 1.37 & 0.342 \\ Single method & 1 KB & 2.32 & 0.021 \\ Two blocks & 2 KB & 1.21 & 0.013 \\ Four blocks & 4 KB & 0.90 & 0.010 \\ \bottomrule \end{tabular} \caption{Fixed block cache} \label{tab_cache_block} \end{table} The same memory parameters as in the previous section are also used in Table~\ref{tab_cache_block_mem}. With the high latency of the DRAMs, even the simple one block cache is a faster (and more accurately predictable) solution than a prefetch queue. As MBIB and MTBI show the same trend as a function of the number of blocks, this is reflected in the access time in all three memory examples. \begin{table} \centering \begin{tabular}{lcccc} \toprule Type & Cache size & SRAM & SDRAM & DDR \\ \midrule Prefetch & 8 B & 1.02 & 2.05 & 1.71 \\ Single Method & 1 KB & 1.18 & 0.69 & 0.39 \\ Two blocks & 2 KB & 0.62 & 0.37 & 0.21 \\ Four blocks & 4 KB & 0.46 & 0.27 & 0.16 \\ \bottomrule \end{tabular} \caption{Fixed block cache, average memory access time} \label{tab_cache_block_mem} \end{table} \subsubsection{Variable Block Cache} \tablename~\ref{tab_cache_var_block} shows the cache performance of the proposed solution, i.e.\ of a method cache with several blocks per method, for different cache sizes and number of blocks. For this measurement, a \emph{next block} replacement policy applies. \begin{table} \centering \begin{tabular}{cd{2.0}cc} \toprule Cache size & \cc{Block count} & MBIB & MTIB \\ \midrule 1 KB & 8 & 0.80 & 0.009 \\ 1 KB & 16 & 0.71 & 0.008 \\ 1 KB & 32 & 0.70 & 0.008 \\ 1 KB & 64 & 0.70 & 0.008 \\ 2 KB & 8 & 0.73 & 0.008 \\ 2 KB & 16 & 0.37 & 0.004 \\ 2 KB & 32 & 0.24 & 0.003 \\ 2 KB & 64 & 0.12 & 0.001 \\ 4 KB & 8 & 0.73 & 0.008 \\ 4 KB & 16 & 0.25 & 0.003 \\ 4 KB & 32 & 0.01 & 0.000 \\ 4 KB & 64 & 0.00 & 0.000 \\ \bottomrule \end{tabular} \caption{Variable block cache} \label{tab_cache_var_block} \end{table} In this scenario, as the MBIB is very high at a cache size of 1KB and almost independent of the block count, the cache capacity is seen to be clearly dominant. The most interesting cache size with this benchmark is 2KB. Here, we can see the influence of the number of blocks on both performance parameters. Both values benefit from more blocks. However, a higher block count requires more time or more hardware for the hit detection. With a cache size of 4KB and enough blocks, the kernel of the application completely fits into the variable block cache, as we have seen with a 4KB traditional cache. From the gap between 16 and 32 blocks (within the 4KB cache), we can say that the application consists of fewer than 32 different methods. It can be seen that even the smallest configuration with a cache size of 1KB and only 8 blocks outperforms fixed block caches with 2 or 4KB in both parameters (MBIB and MTIB). Compared with the fixed block solutions, MTIB is low in all configurations. This is due to the better hit rate, as indicated by the lower MBIB. In most configurations, MBIB is higher than for the direct-mapped cache. It is very interesting to note that, in all configurations (even the small 1KB cache), MTIB is lower than in all 1KB and 2KB configurations of the direct-mapped cache. This is a result of the complete method transfers when a miss occurs and is clearly an advantage for main memory systems with high latency. As in the previous examples, Table~\ref{tab_cache_var_block_mem} shows the average memory access time per instruction byte for three different main memories. \begin{table} \centering \begin{tabular}{cd{2.0}ccc} \toprule Cache size & \cc{Block count} & SRAM & SDRAM & DDR \\ \midrule 1 KB & 8 & 0.41 & 0.24 & 0.14 \\ 1 KB & 16 & 0.36 & 0.22 & 0.12 \\ 1 KB & 32 & 0.36 & 0.21 & 0.12 \\ 1 KB & 64 & 0.36 & 0.21 & 0.12 \\ 2 KB & 8 & 0.37 & 0.22 & 0.13 \\ 2 KB & 16 & 0.19 & 0.11 & 0.06 \\ 2 KB & 32 & 0.12 & 0.08 & 0.04 \\ 2 KB & 64 & 0.06 & 0.04 & 0.02 \\ \bottomrule \end{tabular} \caption{Variable block cache, average memory access time} \label{tab_cache_var_block_mem} \end{table} In the DRAM configurations, the variable block cache directly benefits from the low MTBI. When comparing the values between SDRAM and DDR, we can see that the bandwidth affects the memory access time in a way that is approximately linear. The high latency of these memories is completely hidden. The configuration with 16 or more blocks and dynamic RAMs outperforms the direct-mapped cache of the same size. As expected, a memory with low latency (the SRAM in this example) depends on the MBIB values. The variable block cache is slower than the direct-mapped cache in the 1KB configuration because of the higher MBIB (0.7 compared to 0.3-0.6), and performs very similarly at a cache size of 2KB. In Table~\ref{tab_cache_compared}, the different cache solutions with a size of 2KB are summarized. The detail results of all caches can be found in Appendix~\ref{appx:bench}. % \begin{table} \centering \begin{tabular}{lcc} \toprule Cache type & MBIB & MTIB \\ \midrule Single method & 2.32 & 0.021 \\ Two blocks & 1.21 & 0.013 \\ Variable block (16) & 0.37 & 0.004 \\ Variable block (32) & 0.24 & 0.003 \\ Direct-mapped & 0.25 & 0.015 \\ \bottomrule \end{tabular} \caption{Caches compared} \label{tab_cache_compared} \end{table} % All full method caches with two or more blocks have a lower MTIB than a conventional cache solution. This becomes more significant with increasing latency in main memories. The MBIB value is only quite high for one or two methods in the cache. However, the most surprising result is that the variable block cache with 32 blocks outperforms a direct-mapped cache of the same size at both values. We can see that predictability is indirectly related to performance -- a trend we had anticipated. The most predictable solution with a single method cache performs very poorly compared to a conventional direct-mapped cache. If we accept a slightly more complex WCET analysis (taking a small part of the call tree into account), we can use the two-block cache that is about two times better. With the variable block cache, it could be argued that the WCET analysis becomes too complex, but it is nevertheless simpler than that with the direct-mapped cache. However, every hit in the two-block cache will also be a hit in a variable block cache (of the same size). A tradeoff might be to analyze the program by assuming a two-block cache but using a version of the variable block cache. The additional performance gain can than be used by non- or soft real-time parts of an application. \subsection{Summary} In this section, we have extended the single cache performance measurement \emph{miss rate} to a two value set, memory read and transaction rate, in order to perform a more detailed evaluation of different cache architectures. From the properties of the Java language -- usually small methods and relative branches -- we derived the novel idea of a \emph{method cache}, i.e.\ a cache organization in which whole methods are loaded into the cache on method invocation and the return from a method. This cache organization is time-predictable, as all cache misses are lumped together in these two instructions. Using only one block for a single method introduces considerable overheads in comparison with a conventional cache, but is very simple to analyze. We extended this cache to hold more methods, with one block per method and several smaller blocks per method. Comparing these organizations quantitatively with a benchmark derived from a real-time application, we have seen that the variable block cache performs similarly to (and in one configuration even better than) a direct-mapped cache, in respect of the bytes that have to be filled on a cache miss. In all configurations and sizes of the variable block cache, the number of memory transactions, which relates to memory latency, is lower than in a traditional cache. Only filling the cache on method invocation and return simplifies WCET analysis and removes another source of uncertainty, as there is no competition for the main memory access between instruction cache and data cache. %\subsubsection{Future Work} %\begin{itemize} % \item Variation of direct-mapped with link addresses % Varies between version of paper and now % \item cache graphs: cycles per loop % \item more benchmarks %\end{itemize} %\bibliographystyle{plain} %\bibliography{../bib/mybib} % %\begin{table*} % \centering % \begin{tabular}{lcccccc} % \toprule % & & & & \multicolumn{3}{c}{Memory access time} \\ % \cmidrule{5-7} % Type & Size & MBIB & MTIB & SRAM & SDRAM & DDR SDRAM \\ % \midrule % % Prefetch buffer & 8 B & 1.37 & 0.342 & 1.02 & 2.05 & 1.71 \\ % Single method cache & 1 KB & 2.32 & 0.021 & 1.18 & 0.69 & 0.39 \\ % Two block cache & 2 KB & 1.21 & 0.013 & 0.62 & 0.37 & 0.21 \\ % Four block cache & 4 KB & 0.90 & 0.010 & 0.46 & 0.27 & 0.16 \\ % Direct mapped 8 bytes & 1 KB & 0.28 & 0.035 & 0.18 & 0.25 & 0.19 \\ % Direct mapped 16 bytes & 1 KB & 0.38 & 0.024 & 0.22 & 0.22 & 0.16 \\ % Direct mapped 32 bytes & 1 KB & 0.58 & 0.018 & 0.31 & 0.24 & 0.15 \\ % Direct mapped 8 bytes & 2 KB & 0.17 & 0.022 & 0.11 & 0.15 & 0.12 \\ % Direct mapped 16 bytes & 2 KB & 0.25 & 0.015 & 0.14 & 0.14 & 0.10 \\ % Direct mapped 32 bytes & 2 KB & 0.41 & 0.013 & 0.22 & 0.17 & 0.11 \\ % Direct mapped 8 bytes & 4 KB & 0.00 & 0.001 & 0.00 & 0.00 & 0.00 \\ % Direct mapped 16 bytes & 4 KB & 0.01 & 0.000 & 0.00 & 0.00 & 0.00 \\ % Direct mapped 32 bytes & 4 KB & 0.01 & 0.000 & 0.00 & 0.00 & 0.00 \\ % Variable block cache 8 blocks & 1 KB & 0.80 & 0.009 & 0.41 & 0.24 & 0.14 \\ % Variable block cache 16 blocks & 1 KB & 0.71 & 0.008 & 0.36 & 0.22 & 0.12 \\ % Variable block cache 32 blocks & 1 KB & 0.70 & 0.008 & 0.36 & 0.21 & 0.12 \\ % Variable block cache 64 blocks & 1 KB & 0.70 & 0.008 & 0.36 & 0.21 & 0.12 \\ % Variable block cache 8 blocks & 2 KB & 0.73 & 0.008 & 0.37 & 0.22 & 0.13 \\ % Variable block cache 16 blocks & 2 KB & 0.37 & 0.004 & 0.19 & 0.11 & 0.06 \\ % Variable block cache 32 blocks & 2 KB & 0.24 & 0.003 & 0.12 & 0.08 & 0.04 \\ % Variable block cache 64 blocks & 2 KB & 0.12 & 0.001 & 0.06 & 0.04 & 0.02 \\ % Variable block cache 8 blocks & 4 KB & 0.73 & 0.008 & 0.37 & 0.22 & 0.13 \\ % Variable block cache 16 blocks & 4 KB & 0.25 & 0.003 & 0.13 & 0.08 & 0.05 \\ % Variable block cache 32 blocks & 4 KB & 0.01 & 0.000 & 0.00 & 0.00 & 0.00 \\ % Variable block cache 64 blocks & 4 KB & 0.00 & 0.000 & 0.00 & 0.00 & 0.00 \\ % % \bottomrule % % \end{tabular} % \caption[Cache performance compared]{Cache performance in MBIB and MTIB of all variations of % the method cache and a conventional direct mapped cache. Average % memory access time per instruction byte for three different main % memory technologies. All numbers are in clock cycles. % } % \label{tab_cache_all} %\end{table*} % % % %\end{document}

     
    Copyright (c) 1999 OPENCORES.ORG. All rights reserved.