|
Message
From: cvs at opencores.org<cvs@o...>
Date: Sat Aug 25 19:58:56 CEST 2007
Subject: [cvs-checkins] MODIFIED: jop ...
Date: 00/07/08 25:19:58 Added: jop/doc/book/arch arch_bcfetch.pdf arch_decaddr.pdf arch_execute.pdf arch_fetch.pdf arch_indirection.pdf arch_jop_block.pdf arch_meth300.pdf arch_meth32.pdf arch_pipeline.pdf arch_intro.tex arch_jop.tex hwsw_codesign.tex jvm_bench.tex rt_predict.tex Log: Handbook update Revision Changes Path 1.1 jop/doc/book/arch/arch_bcfetch.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_bcfetch.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_decaddr.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_decaddr.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_execute.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_execute.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_fetch.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_fetch.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_indirection.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_indirection.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_jop_block.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_jop_block.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_meth300.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_meth300.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_meth32.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_meth32.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_pipeline.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_pipeline.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/arch/arch_intro.tex http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_intro.tex?rev=1.1&content-type=text/x-cvsweb-markup Index: arch_intro.tex =================================================================== This chapter presents the architecture for JOP and the motivation behind the various different design decisions we faced. First, we benchmark the JVM, in order to extract execution frequencies for the different bytecodes. These values will then guide the processor design. Pipelined instruction processing calls for a high memory bandwidth. Caches are needed in order to avoid bottlenecks resulting from the main memory bandwidth. As seen in Chapter~\ref{chap:java}, there are two memory areas that are frequently accessed by the JVM: the stack and the method area. In this chapter, we will present time-predictable cache solutions for both areas. 1.1 jop/doc/book/arch/arch_jop.tex http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/arch_jop.tex?rev=1.1&content-type=text/x-cvsweb-markup Index: arch_jop.tex ===================================================================
\section{Overview of JOP}
This section gives an overview of JOP architecture.
\figurename~\ref{fig:arch:jop:block} shows JOP's major function
units. A typical configuration of JOP contains the processor core, a
memory interface and a number of IO devices. The module extension
provides the link between the processor core, and the memory and IO
modules.
\begin{figure}
\centering
\includegraphics[scale=\picscale]{arch/arch_jop_block}
\caption{Block diagram of JOP}
\label{fig:arch:jop:block}
\end{figure}
The processor core contains the four pipeline stages \emph{bytecode
fetch}, \emph{microcode fetch}, \emph{decode} and \emph{execute}.
The ports to the other modules are the address and data bus for the
bytecode instructions, the two top elements of the stack (A and B),
input to the top-of-stack (Data) and a number of control signals.
There is no direct connection between the processor core and the
external world.
The memory interface provides a connection between the main memory
and the processor core. It also contains the bytecode cache. The
extension module controls data read and write. The \emph{busy}
signal is used by the microcode instruction \code{wait}\footnote{The
busy signal can also be used to stall the whole processor pipeline.
This was the change made to JOP by Flavius Gruian \cite{jop:sac05}.
However, in this synchronization mode, the concurrency between the
memory access module and the main pipeline is lost.} to synchronize
the processor core with the memory unit. The core reads bytecode
instructions through dedicated buses (BC address and BC data) from
the memory subsystem. The request for a method to be placed in the
cache is performed through the extension module, but the cache hit
detection and load is performed by the memory interface
independently of the processor core (and therefore concurrently).
The I/O interface contains peripheral devices, such as the system
time and timer interrupt, a serial interface and
application-specific devices. Read and write to and from this module
are controlled by the extension module. All external
devices\footnote{The external device can be as simple as a line
driver for the serial interface that forms part of the interface
module, or a complete bus interface, such as the ISA bus used to
connect e.g.\ an Ethernet chip.} are connected to the I/O interface.
The extension module performs three functions: (a) it contains
hardware accelerators (such as the multiplier unit in this example),
(b) the control for the memory and the I/O module, and (c) the
multiplexer for the read data that is loaded in the top-of-stack
register. The write data from the top-of-stack (A) is connected
directly to all modules.
The division of the processor into those four modules greatly
simplifies the adaptation of JOP for different application domains
or hardware platforms. Porting JOP to a new FPGA board usually
results in changes in the memory module alone. Using the same board
for different applications only involves making changes to the I/O
module. JOP has been ported to several different FPGAs and
prototyping boards and has been used in different applications (see
Chapter~\ref{chap:results}), but it never proved necessary to change
the processor core.
\section{Microcode}
\label{sec:microcode}
The following discussion concerns two different instruction sets:
\emph{bytecode} and \emph{microcode}. Bytecodes are the instructions
that make up a compiled Java program. These instructions are
executed by a Java virtual machine. The JVM does not assume any
particular implementation technology. Microcode is the native
instruction set for JOP. Bytecodes are translated, during their
execution, into JOP microcode. Both instruction sets are designed
for an extended\footnote{An extended stack machine is one in which
there are instructions available to access elements deeper down in
the stack.} stack machine.
\subsection{Translation of Bytecodes to Microcode}
To date, no hardware implementation of the JVM exists that is
capable of executing \emph{all} bytecodes in hardware alone. This is
due to the following: some bytecodes, such as \code{new}, which
creates and initializes a new object, are too complex to implement
in hardware. These bytecodes have to be emulated by software.
To build a self contained JVM without an underlying operating
system, direct access to the memory and I/O devices is necessary.
There are no bytecodes defined for low-level access. These low-level
services are usually implemented in \emph{native} functions, which
means that another language (C) is native to the processor. However,
for a Java processor, bytecode is the \emph{native} language.
One way to solve this problem is to implement simple bytecodes in
hardware and to emulate the more complex and \emph{native} functions
in software with a different instruction set (sometimes called
microcode). However, a processor with two different instruction sets
results in a complex design.
Another common solution, used in Sun's picoJava \cite{pjMicroArch},
is to execute a subset of the bytecode native and to use a software
trap to execute the remainder. This solution entails an overhead (a
minimum of 16 cycles in picoJava, see \ref{subsec:related:picojava})
for the software trap.
In JOP, this problem is solved in a much simpler way. JOP has a
single \emph{native} instruction set, the so-called microcode.
During execution, every Java bytecode is translated to either one,
or a sequence of microcode instructions. This translation merely
adds one pipeline stage to the core processor and results in no
execution overheads. With this solution, we are free to define the
JOP instruction set to map smoothly to the stack architecture of the
JVM, and to find an instruction coding that can be implemented with
minimal hardware.
\figurename~\ref{fig_arch_data_flow} gives an example of this data
flow from the Java program counter to JOP microcode. The fetched
bytecode acts as an index for the jump table. The jump table
contains the start addresses for the JVM implementation in
microcode. This address is loaded into the JOP program counter for
every bytecode executed.
\begin{figure}
\centering
\includegraphics[scale=\picscale]{arch/arch_indirection}
\caption{Data flow from the Java program counter to JOP microcode}
\label{fig_arch_data_flow}
\end{figure}
Every bytecode is translated to an address in the microcode that
implements the JVM. If there exists an equivalent JOP instruction
for the bytecode, it is executed in one cycle and the next bytecode
is translated. For a more complex bytecode, JOP just continues to
execute microcode in the subsequent cycles. The end of this sequence
is coded in the microcode instruction (as the \emph{nxt} bit).
%This translation needs an extra pipeline stage but has zero
%overheads for complex JVM instructions.
\subsection{Compact Microcode}
For the JVM to be implemented efficiently, the microcode has to
\emph{fit} to the Java bytecode. Since the JVM is a stack machine,
the microcode is also stack-oriented. However, the JVM is not a pure
stack machine. Method parameters and local variables are defined as
\emph{locals}. These locals can reside in a stack frame of the
method and are accessed with an offset relative to the start of this
\emph{locals} area.
Additional local variables (16) are available at the microcode
level. These variables serve as scratch variables, like registers in
a conventional CPU. However, arithmetic and logic operations are
performed on the stack.
Some bytecodes, such as ALU operations and the short form access to
\emph{locals}, are directly implemented by an equivalent microcode
instruction (with a different encoding). Additional instructions are
available to access internal registers, main memory and I/O devices.
A relative conditional branch (zero/non zero of TOS) performs
control flow decisions at the microcode level. For optimum use of
the available memory resources, all instructions are 8 bits long.
There are no variable-length instructions and every instruction,
with the exception of \code{wait}, is executed in a single cycle. To
keep the instruction set this dense, two concepts are applied:
Two types of operands, immediate values and branch distances,
normally force an instruction set to be longer than 8 bits. The
instruction set is either expanded to 16 or 32 bits, as in typical
RISC processors, or allowed to be of variable length at byte
boundaries. A first implementation of the JVM with a 16-bit
instruction set showed that only a small number of different
constants are necessary for immediate values and relative branch
distances.
In the current realization of JOP, the different immediate values
are collected while the microcode is being assembled and are put
into the initialization file for the local RAM. These constants are
accessed indirectly in the same way as the local variables. They are
similar to initialized variables, apart from the fact that there are
no operations to change their value during runtime, which would
serve no purpose and would waste instruction codes.
A similar solution is used for branch distances. The assembler
generates a VHDL file with a table for all found branch constants.
This table is indexed using instruction bits during runtime. These
indirections during runtime make it possible to retain an 8-bit
instruction set, and provide 16 different immediate values and 32
different branch constants. For a general purpose instruction set,
these indirections would impose too many restrictions. As the
microcode only implements the JVM, this solution is a viable option.
To simplify the logic for instruction decoding, the instruction
coding is carefully chosen. For example, one bit in the instruction
specifies whether the instruction will increment or decrement the
stack pointer. The offset to access the \emph{locals} is directly
encoded in the instruction. This is not the case for the original
encoding of the equivalent bytecodes (e.g. \emph{iload\_0} is 0x1a
and \emph{iload\_1} is 0x1b). Whenever a multiplexer depends on an
instruction, the selection is directly encoded in the instruction.
\subsection{Instruction Set}
JOP implements 43 different microcode instructions. These
instructions are encoded in 8 bits. With the addition of the
\emph{nxt} and \emph{opd} bits in every instruction, the effective
instruction length is 10 bits.
\begin{description}
\item[Bytecode equivalent:]
These instructions are direct implementations of bytecodes and
result in one cycle execution time for the bytecode (except
\code{st} and \code{ld}): \code{pop}, \code{and}, \code{or},
\code{xor}, \code{add}, \code{sub}, \code{st$<$n$>$}, \code{st},
\code{ushr}, \code{shl}, \code{shr}, \code{nop}, \code{ld$<$n$>$},
\code{ld}, \code{dup}
\item[Local memory access:]
The first 16 words in the internal stack memory are reserved for
internal variables. The next 16 words contain constants. These
memory locations are accessed using the following instructions:
\code{stm}, \code{ldm}, \code{ldi}
\item[Register manipulation:]
The stack pointer, the variable pointer and the Java program counter
are loaded or stored with: \code{stvp}, \code{stjpc}, \code{stsp},
\code{ldvp}, \code{ldjpc}, \code{ldsp}
\item[Bytecode operand:]
The operand is loaded from the bytecode RAM, converted to a 32-bit
word and pushed on the stack with: \code{ld\_opd\_8s},
\code{ld\_opd\_8u}, \code{ld\_opd\_16s}, \code{ld\_opd\_16u}
\item[External memory access:]
The autonomous memory subsystem is accessed using the following
instructions: \code{stmra}, \code{stmwa}, \code{stmwd}, \code{wait},
\code{ldmrd}, \code{stbcrd}, \code{ldbcstart}
\item[IO device access:]
The following instructions permit access to the IO subsystem:
\code{stioa}, \code{stiod}, \code{ldiod}
\item[Multiplier:]
The multiplier is accessed with: \code{stmul}, \code{ldmul}
\item[Microcode branches:]
Two conditional branches in microcode are available: \code{bz},
\code{bnz}
\item[Bytecode branch:]
All 17 bytecode branch instructions are mapped to one instruction:
\code{jbr}
\end{description}
%
A detailed description of the microcode instructions can be found in
Appendix~\ref{appx:jop:instr}.
\subsection{Bytecode Example}
The example in Listing~\ref{lst:arch:micro1} shows the
implementation of a single cycle bytecode and an infrequent bytecode
as a sequence of JOP instructions. In this example, the \code{dup}
bytecode is mapped to the equivalent \code{dup} microcode and
executed in a single cycle, whereas \code{dup\_x1} takes five cycles
to execute, and after the last instruction (\code{ldm a nxt}), the
first instruction for the next bytecode is executed.
\begin{lstlisting}[caption={Implementation of \code{dup} and \code{dup\_x1}},
label=lst:arch:micro1]
dup: dup nxt // 1 to 1 mapping
// a and b are scratch variables for the
// JVM code.
dup_x1: stm a // save TOS
stm b // and TOS-1
ldm a // duplicate former TOS
ldm b // restore TOS-1
ldm a nxt // restore TOS and fetch next bytecode
\end{lstlisting}
Some bytecodes are followed by operands of between one and three
bytes in length (except \code{lookupswitch} and \code{tableswitch}).
Due to pipelining, the first operand byte that follows the bytecode
instruction is available when the first microcode instruction enters
the execution stage. If this is a one-byte long operand, it is ready
to be accessed. The increment of the Java program counter after the
read of an operand byte is coded in the JOP instruction (an
\emph{opd} bit similar to the \emph{nxt} bit).
In Listing~\ref{lst:arch:micro2}, the implementation of
\code{sipush} is shown. The bytecode is followed by a two-byte
operand. Since the access to bytecode memory is only one byte per
cycle, \emph{opd} and \emph{nxt} are not allowed at the same time.
This implies a minimum execution time of $n+1$ cycles for a bytecode
with $n$ operand bytes.
\begin{lstlisting}[caption={Bytecode operand load},
label=lst:arch:micro2]
sipush: nop opd // fetch next byte
nop opd // and one more
ld_opd_16s nxt // load 16 bit operand
\end{lstlisting}
\subsection{Flexible Implementation of Bytecodes}
As mentioned above, some Java bytecodes are very complex. One
solution already described is to emulate them through a sequence of
microcode instructions. However, some of the more complex bytecodes
are very seldom used. To further reduce the resource implications
for JOP, in this case local memory, bytecodes can even be
implemented by \emph{using} Java bytecodes. During the assembly of
the JVM, all labels that represent an entry point for the bytecode
implementation are used to generate the translation table. For all
bytecodes for which no such label is found, i.e.\ there is no
implementation in microcode, a \emph{not-implemented} address is
generated. The instruction sequence at this address invokes a static
method from a system class (\code{com.jopdesign.sys.JVM}). This
class contains 256 static methods, one for each possible bytecode,
ordered by the bytecode value. The bytecode is used as the index in
the method table of this system class. As described in
Section~\ref{sec:hwsw:co}, this feature also allows for the easy
configuration of resource usage versus performance.
\subsection{Summary}
In order to handle the great variation in the complexity of Java
bytecodes we have proposed a translation to a different instruction
set, the so-called microcode. This microcode is still an instruction
set for a stack machine, but more RISC-like than the CISC-like JVM
bytecodes.
In the next section we will see how this translation is handled in
JOP's pipeline and how it can simplify interrupt handling.
%\subsubsection{old text:}
%There is a great variation in complexity of Java bytecodes, the
%instructions of the JVM. There are simple instructions like
%arithmetic and logic operations on the stack. However, the semantics
%of instructions like \emph{new} or \emph{invokestatic} can result in
%class loading (even over a network) and verification. Because of
%this variation, not every JVM instruction can be implemented in
%hardware. One common solution, used in Suns picoJava
%\cite{pjMicroArch}, is to execute a subset of the bytecode native
%and trap on the more complex one. This solution has a minimum
%overhead of 16 cycles in picoJava for the software trap.
%
%The approach to this problem in JOP is different. JOP has its own
%instruction set (the so called microcode). Some bytecodes have a 1
%to 1 mapping to JOP instructions, for the more complex a sequence of
%JOP instructions is necessary. Every bytecode is translated to an
%address in the microcode that implements the JVM.
\section{The Processor Pipeline}
\label{sec:pipeline}
JOP is a fully pipelined architecture with single cycle execution of
microcode instructions and a novel approach to mapping Java bytecode
to these instructions. Figure~\ref{fig_arch_pipeline} shows the
datapath for JOP.
\begin{figure}
\centering
\includegraphics[scale=\picscale]{arch/arch_pipeline}
\caption{Datapath of JOP}
\label{fig_arch_pipeline}
\end{figure}
Three stages form the JOP core, executing microcode instructions. An
additional stage in the front of the core pipeline fetches Java
bytecodes -- the instructions of the JVM -- and translates these
bytecodes into addresses in microcode. Bytecode branches are also
decoded and executed in this stage. The second pipeline stage
fetches JOP instructions from the internal microcode memory and
executes microcode branches. Besides the usual decode function, the
third pipeline stage also generates addresses for the stack RAM. As
every stack machine instruction has either \emph{pop} or \emph{push}
characteristics, it is possible to generate fill or spill addresses
for the \emph{following} instruction at this stage. The last
pipeline stage performs ALU operations, load, store and stack spill
or fill. At the execution stage, operations are performed with the
two topmost elements of the stack.
The stack architecture allows for a short pipeline, which results in
short branch delays. Two branch delay slots are available after a
conditional microcode branch.
The method cache (\emph{Bytecode RAM}), microcode ROM, and stack RAM
are implemented with single cycle access in the FPGA's internal
memories.
%Through full pipelining each JOP instruction is executed in one
%cycle.
%Every JOP instruction takes one cycle. Conditional branches have an
%additional delay of two cycles. These two branch delay slots can be
%used for instructions or \emph{nop}.
\subsection{Java Bytecode Fetch}
In the first pipeline stage, as shown in
Figure~\ref{fig_arch_bc_fetch}, the Java bytecodes are fetched from
the internal memory (\emph{Bytecode RAM}). The bytecode is mapped
through the translation table into the address (\emph{jpaddr}) for
the microcode ROM.
\begin{figure}
\centering
\includegraphics[scale=\picscale]{arch/arch_bcfetch}
\caption{Java bytecode fetch}
\label{fig_arch_bc_fetch}
\end{figure}
The fetched bytecode results in an absolute jump in the microcode
(the second stage). If the bytecode is mapped one-to-one with a JOP
instruction, the following fetched bytecode again results in a jump
in the microcode in the following cycle. If the bytecode is a
complex one, JOP continues to execute microcode. At the end of this
instruction sequence the next bytecode, and therefore the new jump
address, is requested (signal \emph{nxt}).
The bytecode RAM serves as instruction cache and is filled on method
invoke and return. Details about this time-predictable instruction
cache can be found in Section~\ref{sec:cache}.
The bytecode is also stored in a register for later use as an
operand (requested by signal \emph{opd}). Bytecode branches are also
decoded and executed in this stage. Since \emph{jpc} is also used to
read the operands, the program counter is saved in \emph{jpcbr}
during an instruction fetch. \emph{jinstr} is used to decode the
branch type and \emph{jpcbr} to calculate the branch target address.
\subsection{JOP Instruction Fetch}
The second pipeline stage, as shown in Figure~\ref{fig_arch_fetch},
fetches JOP instructions from the internal microcode memory and
executes microcode branches.
\begin{figure}
\centering
\includegraphics[scale=\picscale]{arch/arch_fetch}
\caption{JOP instruction fetch}
\label{fig_arch_fetch}
\end{figure}
The JOP microcode, which implements the JVM, is stored in the
microcode ROM. The program counter \emph{pc} is incremented during
normal execution. If the instruction is labeled with \emph{nxt} a
new bytecode is requested from the first stage and \emph{pc} is
loaded with \emph{jpaddr}. \emph{jpaddr} is the starting address for
the implementation of that bytecode. The label \emph{nxt} is the
flag that marks the end of the microcode instruction stream for one
bytecode. Another flag, \emph{opd}, indicates that a bytecode
operand needs to be fetched in the first pipeline stage. Both flags
are stored in a table that is indexed by the program counter.
\emph{brdly} contains the target address for a conditional branch.
The same offset is shared by a number of branch destinations. A
table (\emph{branch offset}) is used to store these relative
offsets. This indirection means that only 5 bits need to be used in
the instruction coding for branch targets and thereby allow greater
offsets. The three tables \emph{BC fetch table}, \emph{branch
offset} and \emph{translation table} (from the bytecode fetch stage)
are generated during the assembly of the JVM code. The outputs are
plain VHDL files. For an implementation in an FPGA, recompiling the
design after changing the JVM implementation is a straightforward
operation. For an ASIC with a loadable JVM, it is necessary to
implement a different solution.
FPGAs available to date do not allow asynchronous memory access.
They therefore force us to use the registers in the memory blocks.
However, the output of these registers is not accessible. To avoid
having to create an additional pipeline stage just for a
register-register move, the read address register of the microcode
ROM is clocked on the negative edge.
An alternative solution for this problem would be to use the output
of the multiplexer for the \emph{pc} and the read address register
of the memory. However, this solution results in a longer critical
path, as the multiplexer can no longer be combined with the
flip-flops that form the \emph{pc} in the same LCs. This is an
example of how implementation technology (the FPGA) can influence
the architecture.
\subsection{Decode and Address Generation}
Besides the usual decode function, the third pipeline, as shown in
Figure~\ref{fig_arch_decode}, also generates addresses for the stack
RAM.
\begin{figure}
\centering
\includegraphics[scale=\picscale]{arch/arch_decaddr}
\caption{Decode and address generation}
\label{fig_arch_decode}
\end{figure}
As we can see in Section~\ref{sec:stack}
Table~\ref{tab_stack_address}, read and write addresses are either
relative to the stack pointer or to the variable pointer. The
selection of the pre-calculated address can be performed in the
decode stage. When an address relative to the stack pointer is used
(either as read or as write address, never for both) the stack
pointer is also decremented or incremented in the decode stage.
Stack machine instructions can be categorized from a stack
manipulation perspective as either \emph{pop} or \emph{push}. This
allows us to generate fill or spill TOS-1 addresses for the
\emph{following} instruction during the decode stage, thereby saving
one extra pipeline stage.
\subsection{Execute}
At the execution stage, as shown in Figure~\ref{fig_arch_exe},
operations are performed using two discrete registers: TOS and
TOS-1, labeled \emph{A} and \emph{B}.
\begin{figure}
\centering
\includegraphics[scale=\picscale]{arch/arch_execute}
\caption{Execution stage}
\label{fig_arch_exe}
\end{figure}
Each arithmetic/logical operation is performed with registers
\emph{A} and \emph{B} as the source, and register \emph{A} as the
destination. All load operations (local variables, internal
register, external memory and periphery) result in a value being
loaded into register \emph{A}. There is therefore no need for a
write-back pipeline stage. Register \emph{A} is also the source for
the store operations. Register \emph{B} is never accessed directly.
It is read as an implicit operand or for stack spill on push
instructions. It is written during the stack spill with the content
of the stack RAM or the stack fill with the content of register
\emph{A}.
Beside the Java stack, the stack RAM also contains microcode
variables and constants. This resource-sharing arrangement not only
reduces the number of memory blocks needed for the processor, but
also the number of data paths to and from the register \nolinebreak
\emph{A}.
The inverted clock on data-in and on the write address register of
the stack RAM is used, for the same reason, as on the read address
register of the microcode ROM.
A stack machine with two explicit registers for the two topmost
stack elements and automatic fill/spill needs neither an extra
write-back stage nor any data forwarding. Details of this two-level
stack architecture are described in Section~\ref{sec:stack}.
%\subsection{Pipeline Example}
%
%\emph{\textbf{Show from Java code downto FPGA HW.}}
%
%Is in brown bock.
\subsection{Interrupt Logic}
\label{sec:interrupt}
Interrupts are considered hard to handle in a pipelined processor,
meaning implementation tends to be complex (and therefore resource
consuming). In JOP, the bytecode-microcode translation is used
cleverly to avoid having to handle interrupts in the core pipeline.
Interrupts are implemented as special bytecodes. These bytecodes are
inserted by the hardware in the Java instruction stream. When an
interrupt is pending and the next fetched byte from the bytecode RAM
is an instruction (as indicated by the \emph{nxt} bit in the
microcode), the associated special bytecode is used instead of the
instruction from the bytecode RAM. The result is that interrupts are
accepted at bytecode boundaries. The worst-case preemption delay is
the execution time of the \emph{slowest} bytecode that is
implemented in microcode. Bytecodes that are implemented in Java can
be interrupted.
The implementation of interrupts at the bytecode-microcode mapping
stage keeps interrupts transparent in the core pipeline and avoids
complex logic. Interrupt handlers can be implemented in the same way
as standard bytecodes are implemented i.e.\ in microcode or Java.
This special bytecode can result in a call of a JVM internal method
in the context of the interrupted thread. This mechanism implicitly
stores almost the complete context of the current active thread on
the stack.
\subsection{Summary}
In this section, we have analyzed JOP's pipeline. The core of the
stack machine constitutes a three-stage pipeline. In the following
section, we will see that this organization is an optimal solution
for the stack access pattern of the JVM.
An additional pipeline stage in front of this core pipeline stage
performs bytecode fetch and the translation to microcode. This
organization has zero overheads for more complex bytecodes and
results in the short pipeline that is necessary for any processor
without branch prediction. This additional translation stage also
presents an elegant way of incorporating interrupts virtually
\emph{for free}.
1.1 jop/doc/book/arch/hwsw_codesign.tex
http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/hwsw_codesign.tex?rev=1.1&content-type=text/x-cvsweb-markup
Index: hwsw_codesign.tex
===================================================================
\section{HW/SW Codesign}
\label{sec:hwsw:co}
Using a hardware description language and loading the design in an
FPGA the former strict border between hardware and software gets
blurred. Is configuring an FPGA not more like loading a program for
execution?
This looser distinction makes it possible to move functions easily
between hardware and software resulting in a highly configurable
design. If speed is an issue, more functions are realized in
hardware. If cost is the primary concern these functions are moved
to software and a smaller FPGA can be used. Let us examine these
possibilities on a relatively expensive function: multiplication.
In Java bytecode \code{imul} performs a 32 bit signed multiplication
with a 32 bit result. There are no exceptions on overflow. Since 32
bit single cycle multiplications are far beyond the possibilities of
current, mainstream FPGAs the first solution is a sequential
multiplier.
\paragraph{Sequential Booth Multiplier in VHDL}
\begin{lstlisting}[float, caption={Booth multiplier in VHDL},
language=VHDL, label=lst:arch:hwsw:vhdl]
process(clk, wr_a, wr_b)
variable count : integer range 0 to width;
variable pa : signed(64) downto 0);
variable a_1 : std_logic;
alias p : signed(32 downto 0)
is pa(64 downto 32);
begin
if rising_edge(clk) then
if wr_a='1' then
p := (others => '0');
pa(width-1 downto 0) := signed(din);
elsif wr_b='1' then
b <= din;
a_1 := '0';
count := width;
else
if count > 0 then
case std_ulogic_vector'(pa(0), a_1) is
when "01" =>
p := p + signed(b);
when "10" =>
p := p - signed(b);
when others =>
null;
end case;
a_1 := pa(0);
pa := shift_right(pa, 1);
count := count - 1;
end if;
end if;
end if;
dout <= std_logic_vector(pa(31 downto 0));
end process;
\end{lstlisting}
%
Listing~\ref{lst:arch:hwsw:vhdl} shows the VHDL code of the
multiplier. Two microcode instructions are used to access this
function: \code{stmul} stores the two operands (from TOS and TOS-1)
and starts the sequential multiplier. After 33 cycles, the result is
loaded with \code{ldmul}. Listing~\ref{lst:arch:hwsw:micro} shows
the microcode for \code{imul}.
\begin{lstlisting}[float, caption={Microcode to access the Booth multiplier},
label=lst:arch:hwsw:micro]
imul:
stmul // store both operands and start
pop // pop second operand
ldi 5 // 6*5+3 cycles wait
imul_loop: // wait loop
dup
nop
bnz imul_loop
ldi -1 // decrement in branch slot
add
pop // remove counter
ldmul nxt // load result
\end{lstlisting}
\paragraph{Multiplication in Microcode}
If we run out of resources in the FPGA, we can move the function to
microcode. The implementation of \code{imul} is almost identical
with the Java code in Listing~\ref{lst:arch:hwsw:java} and needs 73
microcode instructions.
\paragraph{Bytecode imul in Java}
Microcode is stored in an embedded memory block of the FPGA. This is
also a resource of the FPGA. We can move the code to external memory
by implementing \code{imul} in Java bytecode. Bytecodes not
implemented in microcode result in a static Java method call from a
special class (\code{com.jopdesign.sys.JVM}). This class has
prototypes for each bytecode ordered by the bytecode value. This
allows us to find the right method by indexing the method table with
the value of the bytecode. Listing~\ref{lst:arch:hwsw:java} shows
the Java method for \code{imul}. The additional overhead for this
implementation is a call and return with cache refills.
\begin{lstlisting}[float, caption={Implementation of bytecode \code{imul} in Java},
label=lst:arch:hwsw:java]
public static int imul(int a, int b) {
int c, i;
boolean neg = false;
if (a<0) {
neg = true;
a = -a;
}
if (b<0) {
neg = !neg;
b = -b;
}
c = 0;
for (i=0; i<32; ++i) {
c <<= 1;
if ((a & 0x80000000)!=0) c += b;
a <<= 1;
}
if (neg) c = -c;
return c;
}
\end{lstlisting}
\paragraph{Implementations Compared}
\tablename~\ref{tab_arch_hwsw_compared} lists the resource usage and
execution time for the three implementations. Execution time is
measured with both operands negative, the worst-case execution time
for the software implementations. The implementation in Java is
slower than the microcode implementation as the Java method is
loaded from main memory into the bytecode cache.
\begin{table}
\centering
\begin{tabular}{ld{2}d{3}d{0}}
\toprule
& \cc{Hardware} & \cc{Microcode} & \cc{Time} \\
& \cc{[LC]} & \cc{[Byte]} & \cc{[Cycle]} \\
\midrule
VHDL & 156 & 10 & 35 \\
Microcode & 0 & 73 & 750 \\
Java & 0 & 0 & ~2,300 \\
\bottomrule
\end{tabular}
\caption{Different implementations of \code{imul} compared}
\label{tab_arch_hwsw_compared}
\end{table}
Only a few lines of code have to be changed to select one of the
three implementations. The shown principle can also be applied to
other expensive bytecodes: e.g.\ \code{idiv}, \code{ishr},
\code{iushr} and \code{ishl}. As a result, the resource usage of JOP
is highly configurable and can be selected for each application
according to the needs of the application. Treating VHDL as a
software language allows easy movement of function blocks between
hardware and software.
1.1 jop/doc/book/arch/jvm_bench.tex
http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/jvm_bench.tex?rev=1.1&content-type=text/x-cvsweb-markup
Index: jvm_bench.tex
===================================================================
\section{Benchmarking the JVM}
\label{sec:bench:jvm}
The rationale behind this section is best introduced with the
warning from Computer Architecture: A Quantitative Approach
\cite{Hennessy02} p. 63:
\begin{quote}
Virtually every practicing computer architect knows Amdahls Law.
Despite this, we almost all occasionally fall into the trap of
expending tremendous effort optimizing some aspect of a system
before we measure its usage. Only when the overall speedup is
unrewarding do we recall that we should have measured the usage of
that feature before we spent so much effort enhancing it!
\end{quote}
%
We measured how Java programs use the bytecode instruction set and
explored the typical and worst-case method sizes. Our measurements
and other reports are presented in the following sections.
\subsection{Bytecode Frequency}
The dynamic instruction frequency is the main measurement for
determining a processor implementation. We can identify those
instructions that should be fast. For seldom-used instructions, a
trade-off can be made between performance and hardware resources.
Many reports have been written about JVM bytecode frequencies (e.g.\
\cite{Greg2002, 365338, 624084}). Most of these reports provide only
a coarse categorization of the bytecodes. For example, the bytecodes
\code{iload\_n} (load an \code{int} from a local variable) and
\code{getfield} (fetch a field from an object) are combined in one
instruction category. However, these instructions are very different
in terms of their implementation complexity. We have chosen a
fine-grained categorization of the bytecodes to gain greater insight
into the bytecode usage. In Table~\ref{tab_java_instr_cat} all 201
bytecode instructions are listed by category.
\begin{table}
\centering
\begin{tabular}{ll}
\toprule
Type & Bytecode \\
\midrule
load
& aload, dload, fload, iload, lload \\
load (short)
& aload\_0, aload\_1, aload\_2, aload\_3, \\
& dload\_0, dload\_1, dload\_2, dload\_3, \\
& fload\_0, fload\_1, fload\_2, fload\_3, \\
& iload\_0, iload\_1, iload\_2, iload\_3, \\
& lload\_0, lload\_1, lload\_2, lload\_3 \\
store
& astore, dstore, fstore, istore, lstore \\
store (short)
& astore\_0, astore\_1, astore\_2, astore\_3, \\
& dstore\_0, dstore\_1, dstore\_2, dstore\_3, \\
& fstore\_0, fstore\_1, fstore\_2, fstore\_3, \\
& istore\_0, istore\_1, istore\_2, istore\_3, \\
& lstore\_0, lstore\_1, lstore\_2, lstore\_3 \\
const
& bipush, ldc, ldc\_w, ldc2\_w, sipush \\
const (short)
& aconst\_null, dconst\_0, dconst\_1, fconst\_0, fconst\_1, fconst\_2, \\
& iconst\_0, iconst\_1, iconst\_2, iconst\_3, iconst\_4, iconst\_5, \\
& iconst\_m1, lconst\_0, lconst\_1 \\
get
& getfield, getstatic \\
put
& putfield, putstatic \\
alu
& dadd, ddiv, dmul, dneg, drem, dsub, \\
& fadd, fdiv, fmul, fneg, frem, fsub, \\
& iadd, iand, idiv, imul, ineg, ior, irem, ishl, ishr, isub, iushr, ixor, \\
& ladd, land, ldiv, lmul, lneg, lor, lrem, lshl, lshr, lsub, lushr, lxor \\
iinc
& iinc \\
stack
& dup, dup\_x1, dup\_x2, dup2, dup2\_x1, dup2\_x2, pop, pop2, swap \\
array
& aaload, aastore, baload, bastore, caload, castore, daload, dastore, \\
& faload, fastore, iaload, iastore, laload, lastore, saload, sastore \\
branch
& goto, goto\_w, if\_acmpeq, if\_acmpne, if\_icmpeq, \\
& if\_icmpge, if\_icmpgt, if\_icmple, if\_icmplt, if\_icmpne, \\
& ifeq, ifge, ifgt, ifle, iflt, ifne, ifnonnull, ifnull \\
compare
& dcmpg, dcmpl, fcmpg, fcmpl, lcmp \\
switch
& lookupswitch, tableswitch \\
call
& invokeinterface, invokespecial, invokestatic, invokevirtual \\
return
& areturn, dreturn, freturn, ireturn, lreturn, return \\
conversion
& d2f, d2i, d2l, f2d, f2i, f2l, i2b, i2c, i2d, i2f, i2l, i2s, l2d, l2f, l2i \\
new
& anewarray, multianewarray, new, newarray \\
other
& arraylength, athrow, checkcast, instanceof, jsr, jsr\_w, \\
& monitorenter, monitorexit, nop, ret, wide \\
\bottomrule
\end{tabular}
\caption[Categories of JVM bytecodes]{The 201 Java bytecodes and
their assignment to different categories}
\label{tab_java_instr_cat}
\end{table}
Three different applications were run on an instrumented JVM to
measure dynamic bytecode frequency. The results were compared with
the results from the above-mentioned reports. In
Table~\ref{tab_java_instr_frequ} the dynamic instruction count for
the three different benchmarks is shown. The last column is the
average of the three tests weighted by the individual instructions
count.
Kaffe \cite{kaffe} is an independent implementation of the JVM
distributed under the GNU Public License. Kaffe was instrumented to
collect data on dynamic bytecode usage. Three different applications
were used as benchmarks to obtain the dynamic instruction count:
JLex, KCJ and javac. JLex \cite{jlex} is a lexical analyzer
generator, written for Java in Java. The data was collected by
running JLex with the provided \code{sample.lex} as the input file.
KJC \cite{kcj} is a Java compiler in Java, freely available under
the terms of the GNU General Public License. javac is the Sun Java
compiler. Both compilers were compiling part of the KJC sources
during the benchmark. These benchmarks are similar to the benchmarks
used in other reports and the results are therefore comparable.
However, typical embedded applications can result in a slightly
different instruction set usage pattern. Embedded applications are
usually tightly connected with the environment and are therefore not
available as stand-alone programs to serve as benchmark. An embedded
application that was developed on JOP was adapted to serve as
benchmark for Section~\ref{sec:cache} and
Chapter~\ref{chap:results}.
%19,572,165 + 951,138,375 + 341,926,231 = 1,312,636,771
% JLex.Main kaffe JLex.Main JLex/sample.lex
% at.dms.kjc.Main:
% kaffe -cp kjc-2.1B-bin.jar at.dms.kjc.Main -d classes kopi-2.1B/src/kjc/*.java
% com.sun.tools.javac.Main:
% kaffe -cp /usr/lib/java/lib/tools.jar com.sun.tools.javac.Main -d classes kopi-2.1B/src/kjc/*.java
% stops with compile error
\begin{table}
\centering
\begin{tabular}{lrrrr}
\toprule
& JLex & KJC & javac & Average \\
\midrule
load (short) & 32.72 & 31.45 & 27.24 & 30.37 \\
get & 12.02 & 14.39 & 17.04 & 15.04 \\
branch & 11.26 & 10.40 & 10.71 & 10.49 \\
invoke & 6.87 & 6.31 & 4.24 & 5.77 \\
return & 6.82 & 6.20 & 4.17 & 5.68 \\
load & 7.59 & 4.19 & 7.48 & 5.09 \\
alu & 2.60 & 4.43 & 4.74 & 4.48 \\
const (short) & 4.61 & 4.26 & 4.74 & 4.39 \\
array & 4.22 & 4.07 & 3.22 & 3.85 \\
put & 0.78 & 2.14 & 3.65 & 2.52 \\
iinc & 1.81 & 2.38 & 1.41 & 2.12 \\
stack & 1.30 & 2.11 & 2.11 & 2.10 \\
store (short) & 2.61 & 2.18 & 1.71 & 2.06 \\
other & 1.63 & 2.22 & 1.21 & 1.95 \\
const & 0.85 & 1.56 & 2.80 & 1.87 \\
store & 2.05 & 0.85 & 1.94 & 1.15 \\
conversion & 0.02 & 0.36 & 0.58 & 0.42 \\
switch & 0.00 & 0.20 & 0.60 & 0.30 \\
new & 0.08 & 0.28 & 0.20 & 0.25 \\
compare & 0.14 & 0.03 & 0.22 & 0.08 \\
\bottomrule
\end{tabular}
\caption{Dynamic bytecode frequency in \%}
\label{tab_java_instr_frequ}
\end{table}
\begin{table}
\centering
\begin{tabular}{lrlr}
\toprule
\multicolumn{2}{c}{JLex, KJC and javac} &
\multicolumn{2}{c}{SPEC and Java Grande} \\
\cmidrule(lr){1-2} \cmidrule(lr){3-4}
Instruction & Frequency & Instruction & Frequency \\
\midrule
load (short) & 30.37 & acnst & 0.07 \\
load & 5.09 & aload & 16.23 \\
const (short) & 4.39 & fcnst & 0.33 \\
const & 1.87 & fload & 6.33 \\
& & icnst & 3.21 \\
& & iload & 18.06 \\
\midrule
load \& const & \textbf{41.72}& & \textbf{44.77} \\
\midrule
get & 15.04 & field & 11.12 \\
put & 2.52 & & \\
\midrule
field access & \textbf{17.56} & & \textbf{11.12} \\
\midrule
branch & 10.49 & cjump & 5.67 \\
compare & 0.08 & ujump & 0.51 \\
\midrule
control & \textbf{10.57}& & \textbf{6.18} \\
\midrule
invoke & \textbf{5.77} & fcall & \textbf{3.63} \\
\midrule
return & \textbf{5.68} & retrn & \textbf{2.07} \\
\bottomrule
\end{tabular}
\caption[Dynamic bytecode frequency comared]
{Dynamic bytecode frequency compared with the
measurements from \cite{Dowling2002}}
\label{tab:java:instr:frequ:comp}
\end{table}
In \cite{Dowling2002} the relationship between static and dynamic
instruction frequency of 19 programs from the SPECjvm98
\cite{SPECJvm98} and Java Grande benchmark suits were measured. The
bytecodes categories were chosen different from the above
measurements, but detailed enough to verify our own measurements.
\tablename~\ref{tab:java:instr:frequ:comp} shows the average dynamic
execution frequency in percent\footnote{The values do not add up to
100\% as only the most significant bytecode categories are shown} of
selected bytecode categories from the SPEC and Java Grande
benchmarks, compared with the results obtained by our measurements.
The numbers in bold are categories or sums of categories that are
comparable. The frequency of the load \& const instructions is very
similar to that in our measurements. However, field access, control
instructions and method invocations are more frequent in our
measurements. The higher count on field access instructions and
method invocation can result from a more object oriented programming
style in our selected applications than in the SPEC and Java Grande
benchmarks. The big difference, not seen in our measurements,
between the invoke and return frequency in the SPEC and Java Grande
benchmarks is not explained in \cite{Dowling2002}.
In all measurements, the load of local variables and constants onto
the stack accounts for more than 40\% of instructions executed. This
feature shows that an efficient realization of the local variable
memory area, the stack and the transfer between these memory areas
is mandatory.
The next most executed bytecodes (\code{getfield} and
\code{getstatic}) are the instructions that load an object or class
field onto the operand stack. To account for these frequent
instructions, the class layout for the runtime system has to be
optimized for quick resolution of field addresses (i.e.\ minimum
memory indirections).
The frequency of branches is comparable with the SPECint2000
measurements on RISC processors \cite{Hennessy02}. With such a high
branch frequency, a processor without branch prediction logic is put
under pressure in terms of pipeline length.
It is interesting to note that there are more method invoke
instructions than return instructions. Two facts are responsible for
this difference: native methods are invoked by a bytecode, but the
return is inside the native methods; and an exception can result in
a method exit without return.
\subsection{Methods Types and Length}
\label{sec:bench:jvm:methods}
\tablename~\ref{tab_java_meth_type} shows the number of dynamic
method calls of the Java Grande and \linebreak[4]SPECjvm98
benchmarks. It can be seen that the distribution of method types
depends on the application type. Usage of virtual methods and
interfaces is common in OO programming. Static methods result from
the simple translation of procedural programs to Java.
\begin{table}
\centering
\begin{tabular}{lrrrr}
\toprule
& virtual & special & static & interface \\
\midrule
Java Grande & 57.1 & 8.7 & 34.2 & 0.0 \\
SPEC JVM98 & 81.0 & 10.9 & 2.9 & 5.2 \\
\bottomrule
\end{tabular}
\caption{Types of different dynamic method calls for two benchmarks (from \cite{Power2002})}
\label{tab_java_meth_type}
\end{table}
As a basis for the proposed cache solution in
Section~\ref{sec:cache}, we will explore static distribution of
method sizes. In the JVM, only relative branches are defined. The
conditional branches and goto have an offset of 16 bits, resulting
in a practical limit of the method length of 32KB. Although there is
a goto instruction with a wide index (\emph{goto\_w}) that takes a
4-byte branch offset, other factors (e.g.\ indices in the exception
table) limit the size of a method to 65535 bytes.
Radhakrishnan et al.\ \cite{365338} measured the dynamic method size
of the SPEC suit. They observed a `tri-nodal' distribution, where
most of the methods were 1, 9, or 26 bytecodes long. No explanation
is given for the sizes of 9 or 26. The explanation of the 1 bytecode
long methods as \emph{wrapper methods} is wrong. For a wrapper
method, the method needs to contain a minimum of two instructions
(an invoke and a return). A single instruction method can
\emph{only} contain a return. However, this observation is in sharp
contrast to the measurements obtained by Power and Waldron in
\cite{Power2002}.
\begin{table}
\centering
\begin{tabular}{rrrr}
\toprule
Length & Methods & Percentage & Cumulative \\
& & & percentage \\
\midrule
1 & 1,388 & 1.94 & 1.94 \\
2 & 1,580 & 2.21 & 4.16 \\
4 & 1,871 & 2.62 & 6.78 \\
8 & 16,192 & 22.67 & 29.45 \\
16 & 12,363 & 17.31 & 46.76 \\
32 & 12,638 & 17.70 & 64.45 \\
64 & 11,178 & 15.65 & 80.10 \\
128 & 7,287 & 10.20 & 90.31 \\
256 & 4,304 & 6.03 & 96.33\\
512 & 1,727 & 2.42 & 98.75 \\
1,024 & 592 & 0.83 & 99.58 \\
2,048 & 175 & 0.25 & 99.83 \\
4,096 & 75 & 0.11 & 99.93 \\
8,192 & 37 & 0.05 & 99.98 \\
16,384 & 11 & 0.02 & 100.00 \\
32,768 & 1 & 0.00 & 100.00 \\
65,536 & 0 & 0.00 & 100.00 \\
\bottomrule
\end{tabular}
\caption{Static method count of different sizes from the runtime library (JDK 1.4).}
\label{tab_java_jdk_static_size}
\end{table}
In Table~\ref{tab_java_jdk_static_size}, the number of methods of
different sizes in the Java runtime library (JDK 1.4) is shown. The
library consists of 71419 methods, the largest being 16706 bytes.
The size is classified by powers of 2 because we are interested in
the size of cache memory for complete methods. In the table, the row
of, for example, size 32 includes all methods of a size from 17 to
32 bytes. It can be seen that methods are typically very short. In
fact, 99\% of the methods are less than 513 bytes in size. This
property is important for the proposed method cache in
Section~\ref{sec:cache}, where a complete method has to fit into the
instruction cache.
All larger methods are different kinds of initialization functions,
in most cases \code{$<$clinit$>$()}\footnote{The class or interface
initialization method is static and the special name
\codefoot{$<$clinit$>$} is supplied by the compiler. These
initialization methods are invoked implicitly by the JVM. The
definition when these methods get invoked is problematic for the
WCET analysis (see Section~\ref{para:restrict:clinit}).}. The large
class initialization methods typically result from the
initialization of arrays with constant data. This is necessary
because of the lack of initialized data segments, such as the BSS in
C, in the Java class file. These initialization methods contain
straight-line code and can therefore be split to smaller methods
automatically, if necessary.
\begin{figure}
\centering
\includegraphics[width=\excelwidth]{arch/arch_meth32}
\caption[Static method count for methods of size up to 32 bytes]
{Static method count for methods of size up to 32 bytes in the JDK 1.4 runtime library.
The horizontal axis indicates the method size.}
\label{fig_java_meth32}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=\excelwidth]{arch/arch_meth300}
\caption[Static method count from the JDK 1.4 runtime library]
{Static method count from the JDK 1.4 runtime library.
The horizontal axis indicates the method size in bytes.}
\label{fig_java_meth300}
\end{figure}
In Figure~\ref{fig_java_meth32}, the distribution of small methods
up to a size of 32 bytes is shown.
\figurename~\ref{fig_java_meth300} shows the method count for
methods up to 300 bytes. As expected, we see fewer methods as size
increases. We observed no surprise in the distribution, unlike the
`tri-nodal' distribution in \cite{365338}. The only method size that
is very common is 5 bytes. These methods are the typical setter and
getter methods in object-oriented programming as shown in
Listing~\ref{lst:arch:java:getval}.
\begin{lstlisting}[float, caption={Bytecodes for a getter method},label=lst:arch:java:getval]
private int val;
public int getVal() {
return val;
}
public int getVal();
Code:
0: aload_0
1: getfield #2; //Field val:I
4: ireturn
\end{lstlisting}
The method \code{getVal()} translates to three bytecodes of 1, 3 and
1 bytes in length respectively. These methods should show up in
\cite{365338} as a peak at 3 bytecodes.
The static distribution of method sizes in an application (javac,
the Java compiler) is quite similar to the distribution in the
library. In the class file that contains the Java compiler, 98\% of
the methods are smaller than 513 bytes, and the larger methods are
class initializers.
\subsection{Summary}
In this section, we performed dynamic measurements on the JVM
instruction set. We saw that more than 40\% of the executed
instructions are local variables or constants loads onto the stack.
This high frequency of stack access calls for an efficient
implementation of the stack, as described in
Section~\ref{sec:stack}.
In addition, we have statically measured method sizes. Methods are
typically very short. 30\% of the methods are shorter than 9 bytes
and 99\% account for methods of up to 512 bytes. The maximum length
is further limited by the definition of the class file. We will use
this property in the proposed \emph{method cache} in
Section~\ref{sec:cache}.
Instruction-usage data is an important input for the design of a
processor architecture, as seen in the following sections.
%\subsection{Some more properties}
%Number of local variables per method.
%\subsubsection{Some Comments}
%
%* Clarification of the quick instructions
%* A word about javac: absolute not optimized code generations.
%Simplifies JIT, but not so good for Java processors or an
%interpreting JVM
%\dots CPI is not so easy to measure for the JVM. JVM = runtime
%infrastructure (GC, new). What is measured with JIT?
1.1 jop/doc/book/arch/rt_predict.tex
http://www.opencores.org/cvsweb.shtml/jop/doc/book/arch/rt_predict.tex?rev=1.1&content-type=text/x-cvsweb-markup
Index: rt_predict.tex
===================================================================
\section{Real-Time Predictability}
\label{sec:rtpredict}
%\subsection{Time Predictable Architecture}
%
%Worst case execution time (WCET) analysis of real-time programs is
%essential for any schedulability analysis. To provide a tight WCET
%value a good model of the processor is necessary. However, the
%architectural advancement in modern processor designs is dominated
%by the rule: '\emph{Make the common case fast}`. This is the
%opposite to: '\emph{Reduce the worst case}` and complicates WCET
%analysis. JOP was designed to provide an architecture that can be
%exactly modeled. Execution time of bytecodes is known cycle
%accurate. It is possible to analyze the timing on the bytecode level
%without the uncertainties of an interpreting JVM or generated native
%code from ahead-of-time compilers for Java.
%
%
%Common architectural components, such as branch prediction and
%branch target buffers enhance average performance, but have usually
%a very pessimistic WCET. In JOP, branch prediction is avoided. This
%results in pressure on the pipeline length. The processor has a
%minimal pipeline length of four stages resulting in a four cycle
%execution time for a bytecode branch.
%
%The stack is a frequently accessed memory area and therefore
%implemented in on-chip memory and serving as data cache. Data
%exchange between internal stack and main memory is under microcode
%control and therefore WCET analyzable.
%
%Instruction cache is mandatory to bridge the growing gap between CPU
%speed and main memory access time. Standard cache organizations
%improve average execution time, but are hard to predict for worst
%case execution time (WCET) analysis. We tackled this problem from
%the architectural side: An instruction cache organization where
%simpler and more accurate WCET analysis is more important than
%average case performance.
%
%JOP has a novel instruction cache: A \emph{method cache}. A complete
%method is loaded in the cache on invocation and on return. This
%cache fill strategy lumps all cache misses together and is very
%simple to analyze. Cache block replacement depends on the call tree
%instead of instruction addresses and is therefore WCET analyzable.
%
%
%Comparing this cache organization quantitative with a benchmark
%derived from a real-time application we have seen that the proposed
%\emph{method cache} performs similar or even better to a traditional
%direct mapped cache with respect to the bytes that have to be filled
%an a cache miss. The number of memory transactions, which result in a
%high miss penalty on memories with high latency, are lower with the
%proposed cache solution than in a traditional cache.
%
%
General-purpose processors are optimized for average throughput and
non real-time operating systems are responsible for fair and
efficient scheduling of resources. Real-time systems need a
processor with low and known WCET of instructions. Real-time
operating systems have properties, such as fast interrupt response,
rapid context switch, short blocking times and a scheduler that
implements a simple, in most cases strict priority driven,
scheduling algorithm. This section describes design decisions for
JOP to support such real-time systems.
\subsection{Interrupts}
Interrupts are usually associated with low-level programming of
device drivers. The priorities of interrupts and their handler
functions are above task priorities and yield to an immediate
context switch. In this form, interrupts cannot be integrated in a
schedule with \emph{normal} tasks. The execution time of the
interrupt handler has to be integrated in the schedulability
analysis as additional blocking time. A better solution is to handle
interrupts, which represent external events, as schedulable objects
with priority levels in the range of real-time tasks, as suggested
in the RTSJ.
\paragraph{The Timer Interrupt}
The timer or clock interrupt has a different semantic than other
interrupts. The main purpose of the timer interrupt is
representation of time and release of periodic or time triggered
tasks. One common implementation is a clock tick. The interrupt
occurs at a regular interval (e.g.\ 10 ms) and a decision has to be
taken whether a task has to be released. This approach is simple to
implement, but there are two major drawbacks: The resolution of
timed events is bound by the resolution of the clock tick and clock
ticks without a task switch are a waste of execution time.
A better approach, used in JOP, is to generate timer interrupts at
the release times of the tasks. The scheduler is now responsible for
reprogramming the timer after each occurrence of a timer interrupt.
The list of sleeping threads has to be searched to find the nearest
release time in the future of a higher priority thread than the one
that will be released now. This time is used for the next timer
interrupt.
\paragraph{External Events}
Hardware interrupts, other than the timer interrupt, are represented
as asynchronous events with an associated thread. This means that
the event is a \emph{normal} schedulable object under the control of
the scheduler. With a minimum interarrival time, enforced by
hardware, these events can be incorporated into the priority
assignment and schedulability analysis in the same way as periodic
tasks.
\paragraph{Software Interrupts}
The common software generated interrupts, such as illegal memory
access or divide by zero, are represented by Java runtime exceptions
and need no special handler. They can be detected with a try-catch
block.
Asynchronous notification from the program is supported, in the same
way as an external event, as a schedulable object with an associated
thread. The event is triggered through the call of \code{fire()}.
The thread with the handler is placed in the runnable state and
scheduled according to priority.
\paragraph{Hardware Failures}
Serious hardware failures, such as illegal opcode or parity error
from the memory systems, lead to a shutdown of the system. However,
a \emph{last try} to call a handler that changes the state of the
system to a safe state and inform an upper level system, can improve
the integrity of the overall system.
\subsection{Task Switch}
An important issue in real-time systems is the time for a task
switch. A task switch consists of two actions:
\begin{itemize}
\item \emph{Scheduling} is the selection of the task order and timing
\item \emph{Dispatching} is the term for the context switch between tasks
\end{itemize}
\paragraph{Scheduling}
Most real-time systems use a fixed-priority preemptive scheduler.
Tasks with the same priority are usually scheduled in a FIFO order.
Two common ways to assign priorities are rate monotonic or, in a
more general form, deadline monotonic assignment. When two tasks get
the same priority, we can choose one of them and assign a higher
priority to that task and the task set is still schedulable. We get
a strictly monotonic priority order and do not have to deal with
FIFO order. This eliminates queues for each priority level and
results in a single, priority ordered task list.
Strictly fixed priority schedulers suffer from a problem called
\emph{priority inversion} \cite{626613}. The problem where a low
priority task blocks a high priority task on a shared resource is
solved by raising the priority of the low priority task. Two
standard priority inversion avoidance protocols are common:
%
\begin{description}
\item[Priority Inheritance Protocol:] A lock assigns the priority
of the highest-priority waiting task to the task holding the lock
until that task releases the resource.
\item[Priority Ceiling Emulation Protocol:] A lock gets a priority
assigned above the priority of the highest-priority task that will
ever acquire the lock. Every task will be immediately assigned the
priority of that lock when acquiring it.
\end{description}
%
The priority inheritance protocol is more complex to implement and
the time when the priority of a task is raised is not so obvious. It
is not raised because the task does anything, but because another
task reaches some point in its execution path.
Using priority ceiling emulation with unique priorities, different
from task priorities, the priority order is still strictly
monotonic. The priority ordered task list is expanded with slots for
each lock. If a task acquires a lock, it is placed in the
corresponding slot. With this extension to the task list, scheduling
is still simple and can be efficiently implemented.
\paragraph{Dispatching}
The time for a context switch depends on the size of the state of
the tasks. For a stack machine it is not so obvious what belongs to
the state of a task. If the stack resides in main memory, only a few
registers (e.g. program counter and stack pointer) need to be saved
and restored. However, the stack is a frequently accessed memory
region of the JVM. The stack can be seen as a data cache and should
be placed near the execution unit (in this case, \emph{near} means
on the chip and not in external memory). However, on-chip memory is
usually too small to hold the stack content for all tasks. This
means that the stack is part of the execution context and has to be
saved and restored on a context switch.
In JOP, the stack is placed in local (on-chip) FPGA memory with
single cycle access time. With this configuration, the next question
is how much of the stack to place there. Either the complete stack
of a thread or only the stack frame of the current method can reside
locally. If the complete stack of a thread is stored in local
memory, the invocation of methods and returns are fast, but the
context is large. For fast context switches, it is preferable to
have only a short stack in local memory. This results in less data
being transferred to and from main memory, but more memory transfers
on method invocation and return. The local stack can be further
divided into small pieces, each holding only one stack frame of one
thread. During the context switch, only the stack pointer needs to
be saved and restored. The outcome of this is a very fast context
switch, although the size of the local memory limits the maximum
number of threads.
Since JOP is a soft-core processor, these different solutions can be
configured for different application requirements. It is even
possible to mix of these policies: some stack slots can be assigned
to \emph{important} threads, while the remaining threads share one
slot. This stack slot only needs to be exchanged with the main
memory when switching \emph{to} a less \emph{important} thread.
\subsection{Architectural Design Decisions}
In hard real-time systems, meeting temporal requirements is of the
same importance as functional correctness. This results in different
architectural constraints than in a design for a non real-time
system. A low upper bound of the execution time is of premium
importance. Good average execution time is useless for a pure hard
real-time system.
Common architectural components, found in general purpose processors
to enhance average performance, are usually problematic for the WCET
analysis. A pragmatic approach to this problem is to ignore these
features for the analysis. With a processor designed for real-time
applications, these features have to be substituted by predictable
architecture enhancements.
\paragraph{Branch Prediction}
As the pipelines of current general-purpose processors get longer to
support higher clock rates the penalty of branches get too high.
This is compensated by branch prediction logic with branch target
buffers. However, the upper bound of the branch execution time is
the same as without this feature. In JOP, branch prediction is
avoided. This results in pressure on the pipeline length. The core
processor has a pipeline length of as little as three stages
resulting in a branch execution time of three cycles in microcode.
The two slots in the branch delay can be filled with instructions or
\emph{nop}. With the additional bytecode fetch and translation
stage, the overall pipeline is four stages and results in a four
cycle execution time for a bytecode branch.
\paragraph{Caches and Instruction Prefetch}
To reduce the growing gap between the clock frequency of the
processor and memory access times multi-level cache architectures
are commonly used. Since even a single level cache is problematic
for WCET analysis, more levels in the memory architecture are almost
not analyzable. The additional levels also increase the latency of
memory access on a cache miss.
In a stack machine, the stack is a frequently accessed memory area.
This makes the stack an ideal candidate to be placed near the
execution unit in the memory hierarchy. In JOP the stack is
implemented as internal memory with the two top elements as explicit
registers. This single cycle memory can be seen as a data cache.
However, unlike in picoJava, this limited memory is not
automatically spilled and filled. Automatically spill and fill
introduces unpredictable access to the main memory. Data exchange
between internal stack and main memory is under program control and
can be done on method invocation/return or on a thread switch.
The next most accessed memory area is the code area. A simple
prefetch queue, as it is found in older processors, could increase
instruction throughput after executing a multi-cycle bytecode. For a
stream of single cycle bytecodes, prefetching is useless and the
frequent occurrence of branches and method invocations, about
12--23\% (see Section~\ref{sec:bench:jvm}) in typical Java programs,
reduces the performance gain. The prefetch queue also results in
(probably unbounded) execution time dependencies over a stream of
instructions, which complicates timing analysis.
JOP has a method cache with a novel replace policy. Since typical
methods in Java programs are short and there are only relative
branches in a method, a complete method is loaded in the cache on
invocation and on return. This cache fill strategy lumps all cache
misses together and is very simple to analyze. It also simplifies
the hardware of the cache since no tag memory or address translation
is necessary. The \emph{romizer} tool JavaCodeCompact checks the
maximum allowed method size. Section~\ref{sec:cache} describes the
proposed cache solution in detail. Memory areas for the heap and
class description with the constant pool are not cached in JOP.
\paragraph{Superscalar Processors}
A superscalar processor consists of several execution units and
tries to extract instruction level parallelism (ILP) with out of
order execution. Again, this is a nightmare for timing analysis. The
code for a stack machine has less implicit parallelism than a
register machine.
One form of enhancement, usually implemented in stack machines, is
instruction folding. The instruction stream is scanned to find
frequent patterns like load-load-add-store and substitutes these
four instructions with one, RISC-like, operation. There are two
issues with instruction folding in JOP: The combined instruction
needs two read and one write access to the stack in a single cycle.
This would result in doubling of the internal memory usage in the
FPGA. It also needs, at minimum, four bytes read access to the
method cache. To overcome word boundaries, prefetching has to be
introduced after the method cache. This results in an additional
pipeline stage, time dependency of instructions with a more complex
analysis and more hardware resources for the multiplexers.
Programs for embedded and real-time systems are usually
multi-threaded. In future work, it will be investigated if the
additional hardware resources needed for ILP can be better used with
additional processor cores utilizing this implicit thread-level
parallelism.
\paragraph{Garbage Collection}
As use of the heap is avoided in hard real-time systems, no garbage
collector is implemented. Without a garbage collector, the memory
layout of objects can be simplified. Every reference points directly
to the object. No indirection through a handle, which would simplify
memory compaction in the garbage collector, is needed. This reduces
access time to object fields and methods.
\paragraph{Time-Predictable Instructions}
A good model of a processor with accurate timing information is
essential for a tight WCET analysis. The architecture of JOP and the
microcode are designed with this in mind. Execution time of
bytecodes is known cycle accurately (see Section~\ref{sec:wcet} and
Appendix~\ref{appx:bytecode}). It is possible to analyze the WCET on
the bytecode level \cite{R:Bernat:2000a} without the uncertainties
of an interpreting JVM \cite{R:Bate:2000a} or generated native code
from ahead-of-time compilers for Java.
\subsection{Summary}
In this section, we argued that, while common techniques in
processor architectures increase average throughput, they are not
feasible for real-time systems. The influence of these architectural
enhancements is at best hardly WCET-analyzable.
The proposed alternatives influence the processor architecture, as
described in earlier sections, as well as the software architecture
that will be described in Section~\ref{sec:rtprof}.
However, the most important architectural enhancement for pipelined
machines is caching, which is necessary even in embedded systems. We
have shown in Section~\ref{sec:stack} how a time-predictable data
cache for a stack machine can be implemented. In the following
section, we will propose a time-predictable cache for instructions.
|
 |