LOGIN   :::   RECOVER PASS   :::   GET ACCOUNT    
Browse
  • Projects
  • Code (CVS)
  • Forums
  • News
  • Articles
  • Polls
  •  
    OpenCores
  • FAQ
  • CVS HowTo
  • Mission
  • Media
  • Tools
  • Advertise
  • Mirrors
  • Logos
  • Contact us
  • Find Resources
  • Job Opportunity
  •  
    Tools
  • Search
      
  • Download Cores (CVSGet)
  •  
    More
  • Wishbone
  • Perlilog
  • EDA tools
  • OpenTech CD
  •  
    Navigation: All forums > Cvs-checkins > Message List > Message Post

    Message

    Reply | Reply all
    Date Prev | Date Next | Thread Prev | Thread Next Date Index | Thread Index

    From: cvs at opencores.org<cvs@o...>
    Date: Sat Aug 25 19:58:56 CEST 2007
    Subject: [cvs-checkins] MODIFIED: jop ...
    Top
    Date: 00/07/08 25:19:58

    Added: jop/doc/book/arch arch_bcfetch.pdf arch_decaddr.pdf
    arch_execute.pdf arch_fetch.pdf
    arch_indirection.pdf arch_jop_block.pdf
    arch_meth300.pdf arch_meth32.pdf arch_pipeline.pdf
    arch_intro.tex arch_jop.tex hwsw_codesign.tex
    jvm_bench.tex rt_predict.tex
    Log:
    Handbook update


    Revision Changes Path
    1.1 jop/doc/book/arch/arch_bcfetch.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_bcfetch.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_decaddr.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_decaddr.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_execute.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_execute.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_fetch.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_fetch.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_indirection.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_indirection.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_jop_block.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_jop_block.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_meth300.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_meth300.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_meth32.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_meth32.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_pipeline.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_pipeline.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/arch/arch_intro.tex

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_intro.tex?rev=1.1&content-type=text/x-cvsweb-markup

    Index: arch_intro.tex
    ===================================================================
    This chapter presents the architecture for JOP and the motivation
    behind the various different design decisions we faced. First, we
    benchmark the JVM, in order to extract execution frequencies for the
    different bytecodes. These values will then guide the processor
    design.

    Pipelined instruction processing calls for a high memory bandwidth.
    Caches are needed in order to avoid bottlenecks resulting from the
    main memory bandwidth. As seen in Chapter~\ref{chap:java}, there are
    two memory areas that are frequently accessed by the JVM: the stack
    and the method area. In this chapter, we will present
    time-predictable cache solutions for both areas.



    1.1 jop/doc/book/arch/arch_jop.tex

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_jop.tex?rev=1.1&content-type=text/x-cvsweb-markup

    Index: arch_jop.tex
    =================================================================== \section{Overview of JOP} This section gives an overview of JOP architecture. \figurename~\ref{fig:arch:jop:block} shows JOP's major function units. A typical configuration of JOP contains the processor core, a memory interface and a number of IO devices. The module extension provides the link between the processor core, and the memory and IO modules. \begin{figure} \centering \includegraphics[scale=\picscale]{arch/arch_jop_block} \caption{Block diagram of JOP} \label{fig:arch:jop:block} \end{figure} The processor core contains the four pipeline stages \emph{bytecode fetch}, \emph{microcode fetch}, \emph{decode} and \emph{execute}. The ports to the other modules are the address and data bus for the bytecode instructions, the two top elements of the stack (A and B), input to the top-of-stack (Data) and a number of control signals. There is no direct connection between the processor core and the external world. The memory interface provides a connection between the main memory and the processor core. It also contains the bytecode cache. The extension module controls data read and write. The \emph{busy} signal is used by the microcode instruction \code{wait}\footnote{The busy signal can also be used to stall the whole processor pipeline. This was the change made to JOP by Flavius Gruian \cite{jop:sac05}. However, in this synchronization mode, the concurrency between the memory access module and the main pipeline is lost.} to synchronize the processor core with the memory unit. The core reads bytecode instructions through dedicated buses (BC address and BC data) from the memory subsystem. The request for a method to be placed in the cache is performed through the extension module, but the cache hit detection and load is performed by the memory interface independently of the processor core (and therefore concurrently). The I/O interface contains peripheral devices, such as the system time and timer interrupt, a serial interface and application-specific devices. Read and write to and from this module are controlled by the extension module. All external devices\footnote{The external device can be as simple as a line driver for the serial interface that forms part of the interface module, or a complete bus interface, such as the ISA bus used to connect e.g.\ an Ethernet chip.} are connected to the I/O interface. The extension module performs three functions: (a) it contains hardware accelerators (such as the multiplier unit in this example), (b) the control for the memory and the I/O module, and (c) the multiplexer for the read data that is loaded in the top-of-stack register. The write data from the top-of-stack (A) is connected directly to all modules. The division of the processor into those four modules greatly simplifies the adaptation of JOP for different application domains or hardware platforms. Porting JOP to a new FPGA board usually results in changes in the memory module alone. Using the same board for different applications only involves making changes to the I/O module. JOP has been ported to several different FPGAs and prototyping boards and has been used in different applications (see Chapter~\ref{chap:results}), but it never proved necessary to change the processor core. \section{Microcode} \label{sec:microcode} The following discussion concerns two different instruction sets: \emph{bytecode} and \emph{microcode}. Bytecodes are the instructions that make up a compiled Java program. These instructions are executed by a Java virtual machine. The JVM does not assume any particular implementation technology. Microcode is the native instruction set for JOP. Bytecodes are translated, during their execution, into JOP microcode. Both instruction sets are designed for an extended\footnote{An extended stack machine is one in which there are instructions available to access elements deeper down in the stack.} stack machine. \subsection{Translation of Bytecodes to Microcode} To date, no hardware implementation of the JVM exists that is capable of executing \emph{all} bytecodes in hardware alone. This is due to the following: some bytecodes, such as \code{new}, which creates and initializes a new object, are too complex to implement in hardware. These bytecodes have to be emulated by software. To build a self contained JVM without an underlying operating system, direct access to the memory and I/O devices is necessary. There are no bytecodes defined for low-level access. These low-level services are usually implemented in \emph{native} functions, which means that another language (C) is native to the processor. However, for a Java processor, bytecode is the \emph{native} language. One way to solve this problem is to implement simple bytecodes in hardware and to emulate the more complex and \emph{native} functions in software with a different instruction set (sometimes called microcode). However, a processor with two different instruction sets results in a complex design. Another common solution, used in Sun's picoJava \cite{pjMicroArch}, is to execute a subset of the bytecode native and to use a software trap to execute the remainder. This solution entails an overhead (a minimum of 16 cycles in picoJava, see \ref{subsec:related:picojava}) for the software trap. In JOP, this problem is solved in a much simpler way. JOP has a single \emph{native} instruction set, the so-called microcode. During execution, every Java bytecode is translated to either one, or a sequence of microcode instructions. This translation merely adds one pipeline stage to the core processor and results in no execution overheads. With this solution, we are free to define the JOP instruction set to map smoothly to the stack architecture of the JVM, and to find an instruction coding that can be implemented with minimal hardware. \figurename~\ref{fig_arch_data_flow} gives an example of this data flow from the Java program counter to JOP microcode. The fetched bytecode acts as an index for the jump table. The jump table contains the start addresses for the JVM implementation in microcode. This address is loaded into the JOP program counter for every bytecode executed. \begin{figure} \centering \includegraphics[scale=\picscale]{arch/arch_indirection} \caption{Data flow from the Java program counter to JOP microcode} \label{fig_arch_data_flow} \end{figure} Every bytecode is translated to an address in the microcode that implements the JVM. If there exists an equivalent JOP instruction for the bytecode, it is executed in one cycle and the next bytecode is translated. For a more complex bytecode, JOP just continues to execute microcode in the subsequent cycles. The end of this sequence is coded in the microcode instruction (as the \emph{nxt} bit). %This translation needs an extra pipeline stage but has zero %overheads for complex JVM instructions. \subsection{Compact Microcode} For the JVM to be implemented efficiently, the microcode has to \emph{fit} to the Java bytecode. Since the JVM is a stack machine, the microcode is also stack-oriented. However, the JVM is not a pure stack machine. Method parameters and local variables are defined as \emph{locals}. These locals can reside in a stack frame of the method and are accessed with an offset relative to the start of this \emph{locals} area. Additional local variables (16) are available at the microcode level. These variables serve as scratch variables, like registers in a conventional CPU. However, arithmetic and logic operations are performed on the stack. Some bytecodes, such as ALU operations and the short form access to \emph{locals}, are directly implemented by an equivalent microcode instruction (with a different encoding). Additional instructions are available to access internal registers, main memory and I/O devices. A relative conditional branch (zero/non zero of TOS) performs control flow decisions at the microcode level. For optimum use of the available memory resources, all instructions are 8 bits long. There are no variable-length instructions and every instruction, with the exception of \code{wait}, is executed in a single cycle. To keep the instruction set this dense, two concepts are applied: Two types of operands, immediate values and branch distances, normally force an instruction set to be longer than 8 bits. The instruction set is either expanded to 16 or 32 bits, as in typical RISC processors, or allowed to be of variable length at byte boundaries. A first implementation of the JVM with a 16-bit instruction set showed that only a small number of different constants are necessary for immediate values and relative branch distances. In the current realization of JOP, the different immediate values are collected while the microcode is being assembled and are put into the initialization file for the local RAM. These constants are accessed indirectly in the same way as the local variables. They are similar to initialized variables, apart from the fact that there are no operations to change their value during runtime, which would serve no purpose and would waste instruction codes. A similar solution is used for branch distances. The assembler generates a VHDL file with a table for all found branch constants. This table is indexed using instruction bits during runtime. These indirections during runtime make it possible to retain an 8-bit instruction set, and provide 16 different immediate values and 32 different branch constants. For a general purpose instruction set, these indirections would impose too many restrictions. As the microcode only implements the JVM, this solution is a viable option. To simplify the logic for instruction decoding, the instruction coding is carefully chosen. For example, one bit in the instruction specifies whether the instruction will increment or decrement the stack pointer. The offset to access the \emph{locals} is directly encoded in the instruction. This is not the case for the original encoding of the equivalent bytecodes (e.g. \emph{iload\_0} is 0x1a and \emph{iload\_1} is 0x1b). Whenever a multiplexer depends on an instruction, the selection is directly encoded in the instruction. \subsection{Instruction Set} JOP implements 43 different microcode instructions. These instructions are encoded in 8 bits. With the addition of the \emph{nxt} and \emph{opd} bits in every instruction, the effective instruction length is 10 bits. \begin{description} \item[Bytecode equivalent:] These instructions are direct implementations of bytecodes and result in one cycle execution time for the bytecode (except \code{st} and \code{ld}): \code{pop}, \code{and}, \code{or}, \code{xor}, \code{add}, \code{sub}, \code{st$<$n$>$}, \code{st}, \code{ushr}, \code{shl}, \code{shr}, \code{nop}, \code{ld$<$n$>$}, \code{ld}, \code{dup} \item[Local memory access:] The first 16 words in the internal stack memory are reserved for internal variables. The next 16 words contain constants. These memory locations are accessed using the following instructions: \code{stm}, \code{ldm}, \code{ldi} \item[Register manipulation:] The stack pointer, the variable pointer and the Java program counter are loaded or stored with: \code{stvp}, \code{stjpc}, \code{stsp}, \code{ldvp}, \code{ldjpc}, \code{ldsp} \item[Bytecode operand:] The operand is loaded from the bytecode RAM, converted to a 32-bit word and pushed on the stack with: \code{ld\_opd\_8s}, \code{ld\_opd\_8u}, \code{ld\_opd\_16s}, \code{ld\_opd\_16u} \item[External memory access:] The autonomous memory subsystem is accessed using the following instructions: \code{stmra}, \code{stmwa}, \code{stmwd}, \code{wait}, \code{ldmrd}, \code{stbcrd}, \code{ldbcstart} \item[IO device access:] The following instructions permit access to the IO subsystem: \code{stioa}, \code{stiod}, \code{ldiod} \item[Multiplier:] The multiplier is accessed with: \code{stmul}, \code{ldmul} \item[Microcode branches:] Two conditional branches in microcode are available: \code{bz}, \code{bnz} \item[Bytecode branch:] All 17 bytecode branch instructions are mapped to one instruction: \code{jbr} \end{description} % A detailed description of the microcode instructions can be found in Appendix~\ref{appx:jop:instr}. \subsection{Bytecode Example} The example in Listing~\ref{lst:arch:micro1} shows the implementation of a single cycle bytecode and an infrequent bytecode as a sequence of JOP instructions. In this example, the \code{dup} bytecode is mapped to the equivalent \code{dup} microcode and executed in a single cycle, whereas \code{dup\_x1} takes five cycles to execute, and after the last instruction (\code{ldm a nxt}), the first instruction for the next bytecode is executed. \begin{lstlisting}[caption={Implementation of \code{dup} and \code{dup\_x1}}, label=lst:arch:micro1] dup: dup nxt // 1 to 1 mapping // a and b are scratch variables for the // JVM code. dup_x1: stm a // save TOS stm b // and TOS-1 ldm a // duplicate former TOS ldm b // restore TOS-1 ldm a nxt // restore TOS and fetch next bytecode \end{lstlisting} Some bytecodes are followed by operands of between one and three bytes in length (except \code{lookupswitch} and \code{tableswitch}). Due to pipelining, the first operand byte that follows the bytecode instruction is available when the first microcode instruction enters the execution stage. If this is a one-byte long operand, it is ready to be accessed. The increment of the Java program counter after the read of an operand byte is coded in the JOP instruction (an \emph{opd} bit similar to the \emph{nxt} bit). In Listing~\ref{lst:arch:micro2}, the implementation of \code{sipush} is shown. The bytecode is followed by a two-byte operand. Since the access to bytecode memory is only one byte per cycle, \emph{opd} and \emph{nxt} are not allowed at the same time. This implies a minimum execution time of $n+1$ cycles for a bytecode with $n$ operand bytes. \begin{lstlisting}[caption={Bytecode operand load}, label=lst:arch:micro2] sipush: nop opd // fetch next byte nop opd // and one more ld_opd_16s nxt // load 16 bit operand \end{lstlisting} \subsection{Flexible Implementation of Bytecodes} As mentioned above, some Java bytecodes are very complex. One solution already described is to emulate them through a sequence of microcode instructions. However, some of the more complex bytecodes are very seldom used. To further reduce the resource implications for JOP, in this case local memory, bytecodes can even be implemented by \emph{using} Java bytecodes. During the assembly of the JVM, all labels that represent an entry point for the bytecode implementation are used to generate the translation table. For all bytecodes for which no such label is found, i.e.\ there is no implementation in microcode, a \emph{not-implemented} address is generated. The instruction sequence at this address invokes a static method from a system class (\code{com.jopdesign.sys.JVM}). This class contains 256 static methods, one for each possible bytecode, ordered by the bytecode value. The bytecode is used as the index in the method table of this system class. As described in Section~\ref{sec:hwsw:co}, this feature also allows for the easy configuration of resource usage versus performance. \subsection{Summary} In order to handle the great variation in the complexity of Java bytecodes we have proposed a translation to a different instruction set, the so-called microcode. This microcode is still an instruction set for a stack machine, but more RISC-like than the CISC-like JVM bytecodes. In the next section we will see how this translation is handled in JOP's pipeline and how it can simplify interrupt handling. %\subsubsection{old text:} %There is a great variation in complexity of Java bytecodes, the %instructions of the JVM. There are simple instructions like %arithmetic and logic operations on the stack. However, the semantics %of instructions like \emph{new} or \emph{invokestatic} can result in %class loading (even over a network) and verification. Because of %this variation, not every JVM instruction can be implemented in %hardware. One common solution, used in Suns picoJava %\cite{pjMicroArch}, is to execute a subset of the bytecode native %and trap on the more complex one. This solution has a minimum %overhead of 16 cycles in picoJava for the software trap. % %The approach to this problem in JOP is different. JOP has its own %instruction set (the so called microcode). Some bytecodes have a 1 %to 1 mapping to JOP instructions, for the more complex a sequence of %JOP instructions is necessary. Every bytecode is translated to an %address in the microcode that implements the JVM. \section{The Processor Pipeline} \label{sec:pipeline} JOP is a fully pipelined architecture with single cycle execution of microcode instructions and a novel approach to mapping Java bytecode to these instructions. Figure~\ref{fig_arch_pipeline} shows the datapath for JOP. \begin{figure} \centering \includegraphics[scale=\picscale]{arch/arch_pipeline} \caption{Datapath of JOP} \label{fig_arch_pipeline} \end{figure} Three stages form the JOP core, executing microcode instructions. An additional stage in the front of the core pipeline fetches Java bytecodes -- the instructions of the JVM -- and translates these bytecodes into addresses in microcode. Bytecode branches are also decoded and executed in this stage. The second pipeline stage fetches JOP instructions from the internal microcode memory and executes microcode branches. Besides the usual decode function, the third pipeline stage also generates addresses for the stack RAM. As every stack machine instruction has either \emph{pop} or \emph{push} characteristics, it is possible to generate fill or spill addresses for the \emph{following} instruction at this stage. The last pipeline stage performs ALU operations, load, store and stack spill or fill. At the execution stage, operations are performed with the two topmost elements of the stack. The stack architecture allows for a short pipeline, which results in short branch delays. Two branch delay slots are available after a conditional microcode branch. The method cache (\emph{Bytecode RAM}), microcode ROM, and stack RAM are implemented with single cycle access in the FPGA's internal memories. %Through full pipelining each JOP instruction is executed in one %cycle. %Every JOP instruction takes one cycle. Conditional branches have an %additional delay of two cycles. These two branch delay slots can be %used for instructions or \emph{nop}. \subsection{Java Bytecode Fetch} In the first pipeline stage, as shown in Figure~\ref{fig_arch_bc_fetch}, the Java bytecodes are fetched from the internal memory (\emph{Bytecode RAM}). The bytecode is mapped through the translation table into the address (\emph{jpaddr}) for the microcode ROM. \begin{figure} \centering \includegraphics[scale=\picscale]{arch/arch_bcfetch} \caption{Java bytecode fetch} \label{fig_arch_bc_fetch} \end{figure} The fetched bytecode results in an absolute jump in the microcode (the second stage). If the bytecode is mapped one-to-one with a JOP instruction, the following fetched bytecode again results in a jump in the microcode in the following cycle. If the bytecode is a complex one, JOP continues to execute microcode. At the end of this instruction sequence the next bytecode, and therefore the new jump address, is requested (signal \emph{nxt}). The bytecode RAM serves as instruction cache and is filled on method invoke and return. Details about this time-predictable instruction cache can be found in Section~\ref{sec:cache}. The bytecode is also stored in a register for later use as an operand (requested by signal \emph{opd}). Bytecode branches are also decoded and executed in this stage. Since \emph{jpc} is also used to read the operands, the program counter is saved in \emph{jpcbr} during an instruction fetch. \emph{jinstr} is used to decode the branch type and \emph{jpcbr} to calculate the branch target address. \subsection{JOP Instruction Fetch} The second pipeline stage, as shown in Figure~\ref{fig_arch_fetch}, fetches JOP instructions from the internal microcode memory and executes microcode branches. \begin{figure} \centering \includegraphics[scale=\picscale]{arch/arch_fetch} \caption{JOP instruction fetch} \label{fig_arch_fetch} \end{figure} The JOP microcode, which implements the JVM, is stored in the microcode ROM. The program counter \emph{pc} is incremented during normal execution. If the instruction is labeled with \emph{nxt} a new bytecode is requested from the first stage and \emph{pc} is loaded with \emph{jpaddr}. \emph{jpaddr} is the starting address for the implementation of that bytecode. The label \emph{nxt} is the flag that marks the end of the microcode instruction stream for one bytecode. Another flag, \emph{opd}, indicates that a bytecode operand needs to be fetched in the first pipeline stage. Both flags are stored in a table that is indexed by the program counter. \emph{brdly} contains the target address for a conditional branch. The same offset is shared by a number of branch destinations. A table (\emph{branch offset}) is used to store these relative offsets. This indirection means that only 5 bits need to be used in the instruction coding for branch targets and thereby allow greater offsets. The three tables \emph{BC fetch table}, \emph{branch offset} and \emph{translation table} (from the bytecode fetch stage) are generated during the assembly of the JVM code. The outputs are plain VHDL files. For an implementation in an FPGA, recompiling the design after changing the JVM implementation is a straightforward operation. For an ASIC with a loadable JVM, it is necessary to implement a different solution. FPGAs available to date do not allow asynchronous memory access. They therefore force us to use the registers in the memory blocks. However, the output of these registers is not accessible. To avoid having to create an additional pipeline stage just for a register-register move, the read address register of the microcode ROM is clocked on the negative edge. An alternative solution for this problem would be to use the output of the multiplexer for the \emph{pc} and the read address register of the memory. However, this solution results in a longer critical path, as the multiplexer can no longer be combined with the flip-flops that form the \emph{pc} in the same LCs. This is an example of how implementation technology (the FPGA) can influence the architecture. \subsection{Decode and Address Generation} Besides the usual decode function, the third pipeline, as shown in Figure~\ref{fig_arch_decode}, also generates addresses for the stack RAM. \begin{figure} \centering \includegraphics[scale=\picscale]{arch/arch_decaddr} \caption{Decode and address generation} \label{fig_arch_decode} \end{figure} As we can see in Section~\ref{sec:stack} Table~\ref{tab_stack_address}, read and write addresses are either relative to the stack pointer or to the variable pointer. The selection of the pre-calculated address can be performed in the decode stage. When an address relative to the stack pointer is used (either as read or as write address, never for both) the stack pointer is also decremented or incremented in the decode stage. Stack machine instructions can be categorized from a stack manipulation perspective as either \emph{pop} or \emph{push}. This allows us to generate fill or spill TOS-1 addresses for the \emph{following} instruction during the decode stage, thereby saving one extra pipeline stage. \subsection{Execute} At the execution stage, as shown in Figure~\ref{fig_arch_exe}, operations are performed using two discrete registers: TOS and TOS-1, labeled \emph{A} and \emph{B}. \begin{figure} \centering \includegraphics[scale=\picscale]{arch/arch_execute} \caption{Execution stage} \label{fig_arch_exe} \end{figure} Each arithmetic/logical operation is performed with registers \emph{A} and \emph{B} as the source, and register \emph{A} as the destination. All load operations (local variables, internal register, external memory and periphery) result in a value being loaded into register \emph{A}. There is therefore no need for a write-back pipeline stage. Register \emph{A} is also the source for the store operations. Register \emph{B} is never accessed directly. It is read as an implicit operand or for stack spill on push instructions. It is written during the stack spill with the content of the stack RAM or the stack fill with the content of register \emph{A}. Beside the Java stack, the stack RAM also contains microcode variables and constants. This resource-sharing arrangement not only reduces the number of memory blocks needed for the processor, but also the number of data paths to and from the register \nolinebreak \emph{A}. The inverted clock on data-in and on the write address register of the stack RAM is used, for the same reason, as on the read address register of the microcode ROM. A stack machine with two explicit registers for the two topmost stack elements and automatic fill/spill needs neither an extra write-back stage nor any data forwarding. Details of this two-level stack architecture are described in Section~\ref{sec:stack}. %\subsection{Pipeline Example} % %\emph{\textbf{Show from Java code downto FPGA HW.}} % %Is in brown bock. \subsection{Interrupt Logic} \label{sec:interrupt} Interrupts are considered hard to handle in a pipelined processor, meaning implementation tends to be complex (and therefore resource consuming). In JOP, the bytecode-microcode translation is used cleverly to avoid having to handle interrupts in the core pipeline. Interrupts are implemented as special bytecodes. These bytecodes are inserted by the hardware in the Java instruction stream. When an interrupt is pending and the next fetched byte from the bytecode RAM is an instruction (as indicated by the \emph{nxt} bit in the microcode), the associated special bytecode is used instead of the instruction from the bytecode RAM. The result is that interrupts are accepted at bytecode boundaries. The worst-case preemption delay is the execution time of the \emph{slowest} bytecode that is implemented in microcode. Bytecodes that are implemented in Java can be interrupted. The implementation of interrupts at the bytecode-microcode mapping stage keeps interrupts transparent in the core pipeline and avoids complex logic. Interrupt handlers can be implemented in the same way as standard bytecodes are implemented i.e.\ in microcode or Java. This special bytecode can result in a call of a JVM internal method in the context of the interrupted thread. This mechanism implicitly stores almost the complete context of the current active thread on the stack. \subsection{Summary} In this section, we have analyzed JOP's pipeline. The core of the stack machine constitutes a three-stage pipeline. In the following section, we will see that this organization is an optimal solution for the stack access pattern of the JVM. An additional pipeline stage in front of this core pipeline stage performs bytecode fetch and the translation to microcode. This organization has zero overheads for more complex bytecodes and results in the short pipeline that is necessary for any processor without branch prediction. This additional translation stage also presents an elegant way of incorporating interrupts virtually \emph{for free}. 1.1 jop/doc/book/arch/hwsw_codesign.tex http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/hwsw_codesign.tex?rev=1.1&content-type=text/x-cvsweb-markup Index: hwsw_codesign.tex =================================================================== \section{HW/SW Codesign} \label{sec:hwsw:co} Using a hardware description language and loading the design in an FPGA the former strict border between hardware and software gets blurred. Is configuring an FPGA not more like loading a program for execution? This looser distinction makes it possible to move functions easily between hardware and software resulting in a highly configurable design. If speed is an issue, more functions are realized in hardware. If cost is the primary concern these functions are moved to software and a smaller FPGA can be used. Let us examine these possibilities on a relatively expensive function: multiplication. In Java bytecode \code{imul} performs a 32 bit signed multiplication with a 32 bit result. There are no exceptions on overflow. Since 32 bit single cycle multiplications are far beyond the possibilities of current, mainstream FPGAs the first solution is a sequential multiplier. \paragraph{Sequential Booth Multiplier in VHDL} \begin{lstlisting}[float, caption={Booth multiplier in VHDL}, language=VHDL, label=lst:arch:hwsw:vhdl] process(clk, wr_a, wr_b) variable count : integer range 0 to width; variable pa : signed(64) downto 0); variable a_1 : std_logic; alias p : signed(32 downto 0) is pa(64 downto 32); begin if rising_edge(clk) then if wr_a='1' then p := (others => '0'); pa(width-1 downto 0) := signed(din); elsif wr_b='1' then b <= din; a_1 := '0'; count := width; else if count > 0 then case std_ulogic_vector'(pa(0), a_1) is when "01" => p := p + signed(b); when "10" => p := p - signed(b); when others => null; end case; a_1 := pa(0); pa := shift_right(pa, 1); count := count - 1; end if; end if; end if; dout <= std_logic_vector(pa(31 downto 0)); end process; \end{lstlisting} % Listing~\ref{lst:arch:hwsw:vhdl} shows the VHDL code of the multiplier. Two microcode instructions are used to access this function: \code{stmul} stores the two operands (from TOS and TOS-1) and starts the sequential multiplier. After 33 cycles, the result is loaded with \code{ldmul}. Listing~\ref{lst:arch:hwsw:micro} shows the microcode for \code{imul}. \begin{lstlisting}[float, caption={Microcode to access the Booth multiplier}, label=lst:arch:hwsw:micro] imul: stmul // store both operands and start pop // pop second operand ldi 5 // 6*5+3 cycles wait imul_loop: // wait loop dup nop bnz imul_loop ldi -1 // decrement in branch slot add pop // remove counter ldmul nxt // load result \end{lstlisting} \paragraph{Multiplication in Microcode} If we run out of resources in the FPGA, we can move the function to microcode. The implementation of \code{imul} is almost identical with the Java code in Listing~\ref{lst:arch:hwsw:java} and needs 73 microcode instructions. \paragraph{Bytecode imul in Java} Microcode is stored in an embedded memory block of the FPGA. This is also a resource of the FPGA. We can move the code to external memory by implementing \code{imul} in Java bytecode. Bytecodes not implemented in microcode result in a static Java method call from a special class (\code{com.jopdesign.sys.JVM}). This class has prototypes for each bytecode ordered by the bytecode value. This allows us to find the right method by indexing the method table with the value of the bytecode. Listing~\ref{lst:arch:hwsw:java} shows the Java method for \code{imul}. The additional overhead for this implementation is a call and return with cache refills. \begin{lstlisting}[float, caption={Implementation of bytecode \code{imul} in Java}, label=lst:arch:hwsw:java] public static int imul(int a, int b) { int c, i; boolean neg = false; if (a<0) { neg = true; a = -a; } if (b<0) { neg = !neg; b = -b; } c = 0; for (i=0; i<32; ++i) { c <<= 1; if ((a & 0x80000000)!=0) c += b; a <<= 1; } if (neg) c = -c; return c; } \end{lstlisting} \paragraph{Implementations Compared} \tablename~\ref{tab_arch_hwsw_compared} lists the resource usage and execution time for the three implementations. Execution time is measured with both operands negative, the worst-case execution time for the software implementations. The implementation in Java is slower than the microcode implementation as the Java method is loaded from main memory into the bytecode cache. \begin{table} \centering \begin{tabular}{ld{2}d{3}d{0}} \toprule & \cc{Hardware} & \cc{Microcode} & \cc{Time} \\ & \cc{[LC]} & \cc{[Byte]} & \cc{[Cycle]} \\ \midrule VHDL & 156 & 10 & 35 \\ Microcode & 0 & 73 & 750 \\ Java & 0 & 0 & ~2,300 \\ \bottomrule \end{tabular} \caption{Different implementations of \code{imul} compared} \label{tab_arch_hwsw_compared} \end{table} Only a few lines of code have to be changed to select one of the three implementations. The shown principle can also be applied to other expensive bytecodes: e.g.\ \code{idiv}, \code{ishr}, \code{iushr} and \code{ishl}. As a result, the resource usage of JOP is highly configurable and can be selected for each application according to the needs of the application. Treating VHDL as a software language allows easy movement of function blocks between hardware and software. 1.1 jop/doc/book/arch/jvm_bench.tex http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/jvm_bench.tex?rev=1.1&content-type=text/x-cvsweb-markup Index: jvm_bench.tex =================================================================== \section{Benchmarking the JVM} \label{sec:bench:jvm} The rationale behind this section is best introduced with the warning from Computer Architecture: A Quantitative Approach \cite{Hennessy02} p. 63: \begin{quote} Virtually every practicing computer architect knows Amdahl’s Law. Despite this, we almost all occasionally fall into the trap of expending tremendous effort optimizing some aspect of a system before we measure its usage. Only when the overall speedup is unrewarding do we recall that we should have measured the usage of that feature before we spent so much effort enhancing it! \end{quote} % We measured how Java programs use the bytecode instruction set and explored the typical and worst-case method sizes. Our measurements and other reports are presented in the following sections. \subsection{Bytecode Frequency} The dynamic instruction frequency is the main measurement for determining a processor implementation. We can identify those instructions that should be fast. For seldom-used instructions, a trade-off can be made between performance and hardware resources. Many reports have been written about JVM bytecode frequencies (e.g.\ \cite{Greg2002, 365338, 624084}). Most of these reports provide only a coarse categorization of the bytecodes. For example, the bytecodes \code{iload\_n} (load an \code{int} from a local variable) and \code{getfield} (fetch a field from an object) are combined in one instruction category. However, these instructions are very different in terms of their implementation complexity. We have chosen a fine-grained categorization of the bytecodes to gain greater insight into the bytecode usage. In Table~\ref{tab_java_instr_cat} all 201 bytecode instructions are listed by category. \begin{table} \centering \begin{tabular}{ll} \toprule Type & Bytecode \\ \midrule load & aload, dload, fload, iload, lload \\ load (short) & aload\_0, aload\_1, aload\_2, aload\_3, \\ & dload\_0, dload\_1, dload\_2, dload\_3, \\ & fload\_0, fload\_1, fload\_2, fload\_3, \\ & iload\_0, iload\_1, iload\_2, iload\_3, \\ & lload\_0, lload\_1, lload\_2, lload\_3 \\ store & astore, dstore, fstore, istore, lstore \\ store (short) & astore\_0, astore\_1, astore\_2, astore\_3, \\ & dstore\_0, dstore\_1, dstore\_2, dstore\_3, \\ & fstore\_0, fstore\_1, fstore\_2, fstore\_3, \\ & istore\_0, istore\_1, istore\_2, istore\_3, \\ & lstore\_0, lstore\_1, lstore\_2, lstore\_3 \\ const & bipush, ldc, ldc\_w, ldc2\_w, sipush \\ const (short) & aconst\_null, dconst\_0, dconst\_1, fconst\_0, fconst\_1, fconst\_2, \\ & iconst\_0, iconst\_1, iconst\_2, iconst\_3, iconst\_4, iconst\_5, \\ & iconst\_m1, lconst\_0, lconst\_1 \\ get & getfield, getstatic \\ put & putfield, putstatic \\ alu & dadd, ddiv, dmul, dneg, drem, dsub, \\ & fadd, fdiv, fmul, fneg, frem, fsub, \\ & iadd, iand, idiv, imul, ineg, ior, irem, ishl, ishr, isub, iushr, ixor, \\ & ladd, land, ldiv, lmul, lneg, lor, lrem, lshl, lshr, lsub, lushr, lxor \\ iinc & iinc \\ stack & dup, dup\_x1, dup\_x2, dup2, dup2\_x1, dup2\_x2, pop, pop2, swap \\ array & aaload, aastore, baload, bastore, caload, castore, daload, dastore, \\ & faload, fastore, iaload, iastore, laload, lastore, saload, sastore \\ branch & goto, goto\_w, if\_acmpeq, if\_acmpne, if\_icmpeq, \\ & if\_icmpge, if\_icmpgt, if\_icmple, if\_icmplt, if\_icmpne, \\ & ifeq, ifge, ifgt, ifle, iflt, ifne, ifnonnull, ifnull \\ compare & dcmpg, dcmpl, fcmpg, fcmpl, lcmp \\ switch & lookupswitch, tableswitch \\ call & invokeinterface, invokespecial, invokestatic, invokevirtual \\ return & areturn, dreturn, freturn, ireturn, lreturn, return \\ conversion & d2f, d2i, d2l, f2d, f2i, f2l, i2b, i2c, i2d, i2f, i2l, i2s, l2d, l2f, l2i \\ new & anewarray, multianewarray, new, newarray \\ other & arraylength, athrow, checkcast, instanceof, jsr, jsr\_w, \\ & monitorenter, monitorexit, nop, ret, wide \\ \bottomrule \end{tabular} \caption[Categories of JVM bytecodes]{The 201 Java bytecodes and their assignment to different categories} \label{tab_java_instr_cat} \end{table} Three different applications were run on an instrumented JVM to measure dynamic bytecode frequency. The results were compared with the results from the above-mentioned reports. In Table~\ref{tab_java_instr_frequ} the dynamic instruction count for the three different benchmarks is shown. The last column is the average of the three tests weighted by the individual instructions count. Kaffe \cite{kaffe} is an independent implementation of the JVM distributed under the GNU Public License. Kaffe was instrumented to collect data on dynamic bytecode usage. Three different applications were used as benchmarks to obtain the dynamic instruction count: JLex, KCJ and javac. JLex \cite{jlex} is a lexical analyzer generator, written for Java in Java. The data was collected by running JLex with the provided \code{sample.lex} as the input file. KJC \cite{kcj} is a Java compiler in Java, freely available under the terms of the GNU General Public License. javac is the Sun Java compiler. Both compilers were compiling part of the KJC sources during the benchmark. These benchmarks are similar to the benchmarks used in other reports and the results are therefore comparable. However, typical embedded applications can result in a slightly different instruction set usage pattern. Embedded applications are usually tightly connected with the environment and are therefore not available as stand-alone programs to serve as benchmark. An embedded application that was developed on JOP was adapted to serve as benchmark for Section~\ref{sec:cache} and Chapter~\ref{chap:results}. %19,572,165 + 951,138,375 + 341,926,231 = 1,312,636,771 % JLex.Main kaffe JLex.Main JLex/sample.lex % at.dms.kjc.Main: % kaffe -cp kjc-2.1B-bin.jar at.dms.kjc.Main -d classes kopi-2.1B/src/kjc/*.java % com.sun.tools.javac.Main: % kaffe -cp /usr/lib/java/lib/tools.jar com.sun.tools.javac.Main -d classes kopi-2.1B/src/kjc/*.java % stops with compile error \begin{table} \centering \begin{tabular}{lrrrr} \toprule & JLex & KJC & javac & Average \\ \midrule load (short) & 32.72 & 31.45 & 27.24 & 30.37 \\ get & 12.02 & 14.39 & 17.04 & 15.04 \\ branch & 11.26 & 10.40 & 10.71 & 10.49 \\ invoke & 6.87 & 6.31 & 4.24 & 5.77 \\ return & 6.82 & 6.20 & 4.17 & 5.68 \\ load & 7.59 & 4.19 & 7.48 & 5.09 \\ alu & 2.60 & 4.43 & 4.74 & 4.48 \\ const (short) & 4.61 & 4.26 & 4.74 & 4.39 \\ array & 4.22 & 4.07 & 3.22 & 3.85 \\ put & 0.78 & 2.14 & 3.65 & 2.52 \\ iinc & 1.81 & 2.38 & 1.41 & 2.12 \\ stack & 1.30 & 2.11 & 2.11 & 2.10 \\ store (short) & 2.61 & 2.18 & 1.71 & 2.06 \\ other & 1.63 & 2.22 & 1.21 & 1.95 \\ const & 0.85 & 1.56 & 2.80 & 1.87 \\ store & 2.05 & 0.85 & 1.94 & 1.15 \\ conversion & 0.02 & 0.36 & 0.58 & 0.42 \\ switch & 0.00 & 0.20 & 0.60 & 0.30 \\ new & 0.08 & 0.28 & 0.20 & 0.25 \\ compare & 0.14 & 0.03 & 0.22 & 0.08 \\ \bottomrule \end{tabular} \caption{Dynamic bytecode frequency in \%} \label{tab_java_instr_frequ} \end{table} \begin{table} \centering \begin{tabular}{lrlr} \toprule \multicolumn{2}{c}{JLex, KJC and javac} & \multicolumn{2}{c}{SPEC and Java Grande} \\ \cmidrule(lr){1-2} \cmidrule(lr){3-4} Instruction & Frequency & Instruction & Frequency \\ \midrule load (short) & 30.37 & acnst & 0.07 \\ load & 5.09 & aload & 16.23 \\ const (short) & 4.39 & fcnst & 0.33 \\ const & 1.87 & fload & 6.33 \\ & & icnst & 3.21 \\ & & iload & 18.06 \\ \midrule load \& const & \textbf{41.72}& & \textbf{44.77} \\ \midrule get & 15.04 & field & 11.12 \\ put & 2.52 & & \\ \midrule field access & \textbf{17.56} & & \textbf{11.12} \\ \midrule branch & 10.49 & cjump & 5.67 \\ compare & 0.08 & ujump & 0.51 \\ \midrule control & \textbf{10.57}& & \textbf{6.18} \\ \midrule invoke & \textbf{5.77} & fcall & \textbf{3.63} \\ \midrule return & \textbf{5.68} & retrn & \textbf{2.07} \\ \bottomrule \end{tabular} \caption[Dynamic bytecode frequency comared] {Dynamic bytecode frequency compared with the measurements from \cite{Dowling2002}} \label{tab:java:instr:frequ:comp} \end{table} In \cite{Dowling2002} the relationship between static and dynamic instruction frequency of 19 programs from the SPECjvm98 \cite{SPECJvm98} and Java Grande benchmark suits were measured. The bytecodes categories were chosen different from the above measurements, but detailed enough to verify our own measurements. \tablename~\ref{tab:java:instr:frequ:comp} shows the average dynamic execution frequency in percent\footnote{The values do not add up to 100\% as only the most significant bytecode categories are shown} of selected bytecode categories from the SPEC and Java Grande benchmarks, compared with the results obtained by our measurements. The numbers in bold are categories or sums of categories that are comparable. The frequency of the load \& const instructions is very similar to that in our measurements. However, field access, control instructions and method invocations are more frequent in our measurements. The higher count on field access instructions and method invocation can result from a more object oriented programming style in our selected applications than in the SPEC and Java Grande benchmarks. The big difference, not seen in our measurements, between the invoke and return frequency in the SPEC and Java Grande benchmarks is not explained in \cite{Dowling2002}. In all measurements, the load of local variables and constants onto the stack accounts for more than 40\% of instructions executed. This feature shows that an efficient realization of the local variable memory area, the stack and the transfer between these memory areas is mandatory. The next most executed bytecodes (\code{getfield} and \code{getstatic}) are the instructions that load an object or class field onto the operand stack. To account for these frequent instructions, the class layout for the runtime system has to be optimized for quick resolution of field addresses (i.e.\ minimum memory indirections). The frequency of branches is comparable with the SPECint2000 measurements on RISC processors \cite{Hennessy02}. With such a high branch frequency, a processor without branch prediction logic is put under pressure in terms of pipeline length. It is interesting to note that there are more method invoke instructions than return instructions. Two facts are responsible for this difference: native methods are invoked by a bytecode, but the return is inside the native methods; and an exception can result in a method exit without return. \subsection{Methods Types and Length} \label{sec:bench:jvm:methods} \tablename~\ref{tab_java_meth_type} shows the number of dynamic method calls of the Java Grande and \linebreak[4]SPECjvm98 benchmarks. It can be seen that the distribution of method types depends on the application type. Usage of virtual methods and interfaces is common in OO programming. Static methods result from the simple translation of procedural programs to Java. \begin{table} \centering \begin{tabular}{lrrrr} \toprule & virtual & special & static & interface \\ \midrule Java Grande & 57.1 & 8.7 & 34.2 & 0.0 \\ SPEC JVM98 & 81.0 & 10.9 & 2.9 & 5.2 \\ \bottomrule \end{tabular} \caption{Types of different dynamic method calls for two benchmarks (from \cite{Power2002})} \label{tab_java_meth_type} \end{table} As a basis for the proposed cache solution in Section~\ref{sec:cache}, we will explore static distribution of method sizes. In the JVM, only relative branches are defined. The conditional branches and goto have an offset of 16 bits, resulting in a practical limit of the method length of 32KB. Although there is a goto instruction with a wide index (\emph{goto\_w}) that takes a 4-byte branch offset, other factors (e.g.\ indices in the exception table) limit the size of a method to 65535 bytes. Radhakrishnan et al.\ \cite{365338} measured the dynamic method size of the SPEC suit. They observed a `tri-nodal' distribution, where most of the methods were 1, 9, or 26 bytecodes long. No explanation is given for the sizes of 9 or 26. The explanation of the 1 bytecode long methods as \emph{wrapper methods} is wrong. For a wrapper method, the method needs to contain a minimum of two instructions (an invoke and a return). A single instruction method can \emph{only} contain a return. However, this observation is in sharp contrast to the measurements obtained by Power and Waldron in \cite{Power2002}. \begin{table} \centering \begin{tabular}{rrrr} \toprule Length & Methods & Percentage & Cumulative \\ & & & percentage \\ \midrule 1 & 1,388 & 1.94 & 1.94 \\ 2 & 1,580 & 2.21 & 4.16 \\ 4 & 1,871 & 2.62 & 6.78 \\ 8 & 16,192 & 22.67 & 29.45 \\ 16 & 12,363 & 17.31 & 46.76 \\ 32 & 12,638 & 17.70 & 64.45 \\ 64 & 11,178 & 15.65 & 80.10 \\ 128 & 7,287 & 10.20 & 90.31 \\ 256 & 4,304 & 6.03 & 96.33\\ 512 & 1,727 & 2.42 & 98.75 \\ 1,024 & 592 & 0.83 & 99.58 \\ 2,048 & 175 & 0.25 & 99.83 \\ 4,096 & 75 & 0.11 & 99.93 \\ 8,192 & 37 & 0.05 & 99.98 \\ 16,384 & 11 & 0.02 & 100.00 \\ 32,768 & 1 & 0.00 & 100.00 \\ 65,536 & 0 & 0.00 & 100.00 \\ \bottomrule \end{tabular} \caption{Static method count of different sizes from the runtime library (JDK 1.4).} \label{tab_java_jdk_static_size} \end{table} In Table~\ref{tab_java_jdk_static_size}, the number of methods of different sizes in the Java runtime library (JDK 1.4) is shown. The library consists of 71419 methods, the largest being 16706 bytes. The size is classified by powers of 2 because we are interested in the size of cache memory for complete methods. In the table, the row of, for example, size 32 includes all methods of a size from 17 to 32 bytes. It can be seen that methods are typically very short. In fact, 99\% of the methods are less than 513 bytes in size. This property is important for the proposed method cache in Section~\ref{sec:cache}, where a complete method has to fit into the instruction cache. All larger methods are different kinds of initialization functions, in most cases \code{$<$clinit$>$()}\footnote{The class or interface initialization method is static and the special name \codefoot{$<$clinit$>$} is supplied by the compiler. These initialization methods are invoked implicitly by the JVM. The definition when these methods get invoked is problematic for the WCET analysis (see Section~\ref{para:restrict:clinit}).}. The large class initialization methods typically result from the initialization of arrays with constant data. This is necessary because of the lack of initialized data segments, such as the BSS in C, in the Java class file. These initialization methods contain straight-line code and can therefore be split to smaller methods automatically, if necessary. \begin{figure} \centering \includegraphics[width=\excelwidth]{arch/arch_meth32} \caption[Static method count for methods of size up to 32 bytes] {Static method count for methods of size up to 32 bytes in the JDK 1.4 runtime library. The horizontal axis indicates the method size.} \label{fig_java_meth32} \end{figure} \begin{figure} \centering \includegraphics[width=\excelwidth]{arch/arch_meth300} \caption[Static method count from the JDK 1.4 runtime library] {Static method count from the JDK 1.4 runtime library. The horizontal axis indicates the method size in bytes.} \label{fig_java_meth300} \end{figure} In Figure~\ref{fig_java_meth32}, the distribution of small methods up to a size of 32 bytes is shown. \figurename~\ref{fig_java_meth300} shows the method count for methods up to 300 bytes. As expected, we see fewer methods as size increases. We observed no surprise in the distribution, unlike the `tri-nodal' distribution in \cite{365338}. The only method size that is very common is 5 bytes. These methods are the typical setter and getter methods in object-oriented programming as shown in Listing~\ref{lst:arch:java:getval}. \begin{lstlisting}[float, caption={Bytecodes for a getter method},label=lst:arch:java:getval] private int val; public int getVal() { return val; } public int getVal(); Code: 0: aload_0 1: getfield #2; //Field val:I 4: ireturn \end{lstlisting} The method \code{getVal()} translates to three bytecodes of 1, 3 and 1 bytes in length respectively. These methods should show up in \cite{365338} as a peak at 3 bytecodes. The static distribution of method sizes in an application (javac, the Java compiler) is quite similar to the distribution in the library. In the class file that contains the Java compiler, 98\% of the methods are smaller than 513 bytes, and the larger methods are class initializers. \subsection{Summary} In this section, we performed dynamic measurements on the JVM instruction set. We saw that more than 40\% of the executed instructions are local variables or constants loads onto the stack. This high frequency of stack access calls for an efficient implementation of the stack, as described in Section~\ref{sec:stack}. In addition, we have statically measured method sizes. Methods are typically very short. 30\% of the methods are shorter than 9 bytes and 99\% account for methods of up to 512 bytes. The maximum length is further limited by the definition of the class file. We will use this property in the proposed \emph{method cache} in Section~\ref{sec:cache}. Instruction-usage data is an important input for the design of a processor architecture, as seen in the following sections. %\subsection{Some more properties} %Number of local variables per method. %\subsubsection{Some Comments} % %* Clarification of the quick instructions %* A word about javac: absolute not optimized code generations. %Simplifies JIT, but not so good for Java processors or an %interpreting JVM %\dots CPI is not so easy to measure for the JVM. JVM = runtime %infrastructure (GC, new). What is measured with JIT? 1.1 jop/doc/book/arch/rt_predict.tex http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/rt_predict.tex?rev=1.1&content-type=text/x-cvsweb-markup Index: rt_predict.tex =================================================================== \section{Real-Time Predictability} \label{sec:rtpredict} %\subsection{Time Predictable Architecture} % %Worst case execution time (WCET) analysis of real-time programs is %essential for any schedulability analysis. To provide a tight WCET %value a good model of the processor is necessary. However, the %architectural advancement in modern processor designs is dominated %by the rule: '\emph{Make the common case fast}`. This is the %opposite to: '\emph{Reduce the worst case}` and complicates WCET %analysis. JOP was designed to provide an architecture that can be %exactly modeled. Execution time of bytecodes is known cycle %accurate. It is possible to analyze the timing on the bytecode level %without the uncertainties of an interpreting JVM or generated native %code from ahead-of-time compilers for Java. % % %Common architectural components, such as branch prediction and %branch target buffers enhance average performance, but have usually %a very pessimistic WCET. In JOP, branch prediction is avoided. This %results in pressure on the pipeline length. The processor has a %minimal pipeline length of four stages resulting in a four cycle %execution time for a bytecode branch. % %The stack is a frequently accessed memory area and therefore %implemented in on-chip memory and serving as data cache. Data %exchange between internal stack and main memory is under microcode %control and therefore WCET analyzable. % %Instruction cache is mandatory to bridge the growing gap between CPU %speed and main memory access time. Standard cache organizations %improve average execution time, but are hard to predict for worst %case execution time (WCET) analysis. We tackled this problem from %the architectural side: An instruction cache organization where %simpler and more accurate WCET analysis is more important than %average case performance. % %JOP has a novel instruction cache: A \emph{method cache}. A complete %method is loaded in the cache on invocation and on return. This %cache fill strategy lumps all cache misses together and is very %simple to analyze. Cache block replacement depends on the call tree %instead of instruction addresses and is therefore WCET analyzable. % % %Comparing this cache organization quantitative with a benchmark %derived from a real-time application we have seen that the proposed %\emph{method cache} performs similar or even better to a traditional %direct mapped cache with respect to the bytes that have to be filled %an a cache miss. The number of memory transactions, which result in a %high miss penalty on memories with high latency, are lower with the %proposed cache solution than in a traditional cache. % % General-purpose processors are optimized for average throughput and non real-time operating systems are responsible for fair and efficient scheduling of resources. Real-time systems need a processor with low and known WCET of instructions. Real-time operating systems have properties, such as fast interrupt response, rapid context switch, short blocking times and a scheduler that implements a simple, in most cases strict priority driven, scheduling algorithm. This section describes design decisions for JOP to support such real-time systems. \subsection{Interrupts} Interrupts are usually associated with low-level programming of device drivers. The priorities of interrupts and their handler functions are above task priorities and yield to an immediate context switch. In this form, interrupts cannot be integrated in a schedule with \emph{normal} tasks. The execution time of the interrupt handler has to be integrated in the schedulability analysis as additional blocking time. A better solution is to handle interrupts, which represent external events, as schedulable objects with priority levels in the range of real-time tasks, as suggested in the RTSJ. \paragraph{The Timer Interrupt} The timer or clock interrupt has a different semantic than other interrupts. The main purpose of the timer interrupt is representation of time and release of periodic or time triggered tasks. One common implementation is a clock tick. The interrupt occurs at a regular interval (e.g.\ 10 ms) and a decision has to be taken whether a task has to be released. This approach is simple to implement, but there are two major drawbacks: The resolution of timed events is bound by the resolution of the clock tick and clock ticks without a task switch are a waste of execution time. A better approach, used in JOP, is to generate timer interrupts at the release times of the tasks. The scheduler is now responsible for reprogramming the timer after each occurrence of a timer interrupt. The list of sleeping threads has to be searched to find the nearest release time in the future of a higher priority thread than the one that will be released now. This time is used for the next timer interrupt. \paragraph{External Events} Hardware interrupts, other than the timer interrupt, are represented as asynchronous events with an associated thread. This means that the event is a \emph{normal} schedulable object under the control of the scheduler. With a minimum interarrival time, enforced by hardware, these events can be incorporated into the priority assignment and schedulability analysis in the same way as periodic tasks. \paragraph{Software Interrupts} The common software generated interrupts, such as illegal memory access or divide by zero, are represented by Java runtime exceptions and need no special handler. They can be detected with a try-catch block. Asynchronous notification from the program is supported, in the same way as an external event, as a schedulable object with an associated thread. The event is triggered through the call of \code{fire()}. The thread with the handler is placed in the runnable state and scheduled according to priority. \paragraph{Hardware Failures} Serious hardware failures, such as illegal opcode or parity error from the memory systems, lead to a shutdown of the system. However, a \emph{last try} to call a handler that changes the state of the system to a safe state and inform an upper level system, can improve the integrity of the overall system. \subsection{Task Switch} An important issue in real-time systems is the time for a task switch. A task switch consists of two actions: \begin{itemize} \item \emph{Scheduling} is the selection of the task order and timing \item \emph{Dispatching} is the term for the context switch between tasks \end{itemize} \paragraph{Scheduling} Most real-time systems use a fixed-priority preemptive scheduler. Tasks with the same priority are usually scheduled in a FIFO order. Two common ways to assign priorities are rate monotonic or, in a more general form, deadline monotonic assignment. When two tasks get the same priority, we can choose one of them and assign a higher priority to that task and the task set is still schedulable. We get a strictly monotonic priority order and do not have to deal with FIFO order. This eliminates queues for each priority level and results in a single, priority ordered task list. Strictly fixed priority schedulers suffer from a problem called \emph{priority inversion} \cite{626613}. The problem where a low priority task blocks a high priority task on a shared resource is solved by raising the priority of the low priority task. Two standard priority inversion avoidance protocols are common: % \begin{description} \item[Priority Inheritance Protocol:] A lock assigns the priority of the highest-priority waiting task to the task holding the lock until that task releases the resource. \item[Priority Ceiling Emulation Protocol:] A lock gets a priority assigned above the priority of the highest-priority task that will ever acquire the lock. Every task will be immediately assigned the priority of that lock when acquiring it. \end{description} % The priority inheritance protocol is more complex to implement and the time when the priority of a task is raised is not so obvious. It is not raised because the task does anything, but because another task reaches some point in its execution path. Using priority ceiling emulation with unique priorities, different from task priorities, the priority order is still strictly monotonic. The priority ordered task list is expanded with slots for each lock. If a task acquires a lock, it is placed in the corresponding slot. With this extension to the task list, scheduling is still simple and can be efficiently implemented. \paragraph{Dispatching} The time for a context switch depends on the size of the state of the tasks. For a stack machine it is not so obvious what belongs to the state of a task. If the stack resides in main memory, only a few registers (e.g. program counter and stack pointer) need to be saved and restored. However, the stack is a frequently accessed memory region of the JVM. The stack can be seen as a data cache and should be placed near the execution unit (in this case, \emph{near} means on the chip and not in external memory). However, on-chip memory is usually too small to hold the stack content for all tasks. This means that the stack is part of the execution context and has to be saved and restored on a context switch. In JOP, the stack is placed in local (on-chip) FPGA memory with single cycle access time. With this configuration, the next question is how much of the stack to place there. Either the complete stack of a thread or only the stack frame of the current method can reside locally. If the complete stack of a thread is stored in local memory, the invocation of methods and returns are fast, but the context is large. For fast context switches, it is preferable to have only a short stack in local memory. This results in less data being transferred to and from main memory, but more memory transfers on method invocation and return. The local stack can be further divided into small pieces, each holding only one stack frame of one thread. During the context switch, only the stack pointer needs to be saved and restored. The outcome of this is a very fast context switch, although the size of the local memory limits the maximum number of threads. Since JOP is a soft-core processor, these different solutions can be configured for different application requirements. It is even possible to mix of these policies: some stack slots can be assigned to \emph{important} threads, while the remaining threads share one slot. This stack slot only needs to be exchanged with the main memory when switching \emph{to} a less \emph{important} thread. \subsection{Architectural Design Decisions} In hard real-time systems, meeting temporal requirements is of the same importance as functional correctness. This results in different architectural constraints than in a design for a non real-time system. A low upper bound of the execution time is of premium importance. Good average execution time is useless for a pure hard real-time system. Common architectural components, found in general purpose processors to enhance average performance, are usually problematic for the WCET analysis. A pragmatic approach to this problem is to ignore these features for the analysis. With a processor designed for real-time applications, these features have to be substituted by predictable architecture enhancements. \paragraph{Branch Prediction} As the pipelines of current general-purpose processors get longer to support higher clock rates the penalty of branches get too high. This is compensated by branch prediction logic with branch target buffers. However, the upper bound of the branch execution time is the same as without this feature. In JOP, branch prediction is avoided. This results in pressure on the pipeline length. The core processor has a pipeline length of as little as three stages resulting in a branch execution time of three cycles in microcode. The two slots in the branch delay can be filled with instructions or \emph{nop}. With the additional bytecode fetch and translation stage, the overall pipeline is four stages and results in a four cycle execution time for a bytecode branch. \paragraph{Caches and Instruction Prefetch} To reduce the growing gap between the clock frequency of the processor and memory access times multi-level cache architectures are commonly used. Since even a single level cache is problematic for WCET analysis, more levels in the memory architecture are almost not analyzable. The additional levels also increase the latency of memory access on a cache miss. In a stack machine, the stack is a frequently accessed memory area. This makes the stack an ideal candidate to be placed near the execution unit in the memory hierarchy. In JOP the stack is implemented as internal memory with the two top elements as explicit registers. This single cycle memory can be seen as a data cache. However, unlike in picoJava, this limited memory is not automatically spilled and filled. Automatically spill and fill introduces unpredictable access to the main memory. Data exchange between internal stack and main memory is under program control and can be done on method invocation/return or on a thread switch. The next most accessed memory area is the code area. A simple prefetch queue, as it is found in older processors, could increase instruction throughput after executing a multi-cycle bytecode. For a stream of single cycle bytecodes, prefetching is useless and the frequent occurrence of branches and method invocations, about 12--23\% (see Section~\ref{sec:bench:jvm}) in typical Java programs, reduces the performance gain. The prefetch queue also results in (probably unbounded) execution time dependencies over a stream of instructions, which complicates timing analysis. JOP has a method cache with a novel replace policy. Since typical methods in Java programs are short and there are only relative branches in a method, a complete method is loaded in the cache on invocation and on return. This cache fill strategy lumps all cache misses together and is very simple to analyze. It also simplifies the hardware of the cache since no tag memory or address translation is necessary. The \emph{romizer} tool JavaCodeCompact checks the maximum allowed method size. Section~\ref{sec:cache} describes the proposed cache solution in detail. Memory areas for the heap and class description with the constant pool are not cached in JOP. \paragraph{Superscalar Processors} A superscalar processor consists of several execution units and tries to extract instruction level parallelism (ILP) with out of order execution. Again, this is a nightmare for timing analysis. The code for a stack machine has less implicit parallelism than a register machine. One form of enhancement, usually implemented in stack machines, is instruction folding. The instruction stream is scanned to find frequent patterns like load-load-add-store and substitutes these four instructions with one, RISC-like, operation. There are two issues with instruction folding in JOP: The combined instruction needs two read and one write access to the stack in a single cycle. This would result in doubling of the internal memory usage in the FPGA. It also needs, at minimum, four bytes read access to the method cache. To overcome word boundaries, prefetching has to be introduced after the method cache. This results in an additional pipeline stage, time dependency of instructions with a more complex analysis and more hardware resources for the multiplexers. Programs for embedded and real-time systems are usually multi-threaded. In future work, it will be investigated if the additional hardware resources needed for ILP can be better used with additional processor cores utilizing this implicit thread-level parallelism. \paragraph{Garbage Collection} As use of the heap is avoided in hard real-time systems, no garbage collector is implemented. Without a garbage collector, the memory layout of objects can be simplified. Every reference points directly to the object. No indirection through a handle, which would simplify memory compaction in the garbage collector, is needed. This reduces access time to object fields and methods. \paragraph{Time-Predictable Instructions} A good model of a processor with accurate timing information is essential for a tight WCET analysis. The architecture of JOP and the microcode are designed with this in mind. Execution time of bytecodes is known cycle accurately (see Section~\ref{sec:wcet} and Appendix~\ref{appx:bytecode}). It is possible to analyze the WCET on the bytecode level \cite{R:Bernat:2000a} without the uncertainties of an interpreting JVM \cite{R:Bate:2000a} or generated native code from ahead-of-time compilers for Java. \subsection{Summary} In this section, we argued that, while common techniques in processor architectures increase average throughput, they are not feasible for real-time systems. The influence of these architectural enhancements is at best hardly WCET-analyzable. The proposed alternatives influence the processor architecture, as described in earlier sections, as well as the software architecture that will be described in Section~\ref{sec:rtprof}. However, the most important architectural enhancement for pipelined machines is caching, which is necessary even in embedded systems. We have shown in Section~\ref{sec:stack} how a time-predictable data cache for a stack machine can be implemented. In the following section, we will propose a time-predictable cache for instructions.

     
    Copyright (c) 1999 OPENCORES.ORG. All rights reserved.