From: cvs at opencores.org<cvs@o...>
Date: Sat Aug 25 20:01:52 CEST 2007
Subject: [cvs-checkins] MODIFIED: jop ...
Date: 00/07/08 25:20:01
Added: jop/doc/book/related related_picojava.pdf related_tree.pdf related.tex
Log:
Handbook update

Revision Changes Path
1.1 jop/doc/book/related/related_picojava.pdf
http://www.opencores.org/cvsweb.shtml/jop/doc/book/related/related_picojava.pdf?rev=1.1&content-type=text/x-cvsweb-markup
<<Binary file>>
1.1 jop/doc/book/related/related_tree.pdf
http://www.opencores.org/cvsweb.shtml/jop/doc/book/related/related_tree.pdf?rev=1.1&content-type=text/x-cvsweb-markup
<<Binary file>>
1.1 jop/doc/book/related/related.tex
http://www.opencores.org/cvsweb.shtml/jop/doc/book/related/related.tex?rev=1.1&content-type=text/x-cvsweb-markup
Index: related.tex
===================================================================
Two different approaches can be found to improve Java bytecode
execution by hardware. The first type operates as a Java coprocessor
in conjunction with a general-purpose microprocessor. This
coprocessor is placed in the instruction fetch path of the main
processor and translates Java bytecodes to sequences of instructions
for the host CPU or directly executes basic Java bytecodes. The
complex instructions are emulated by the main processor. Java chips
in the second category replace the general-purpose CPU. All
applications therefore have to be written in Java. While the first
type enables systems with mixed code capabilities, the additional
component significantly raises costs.

\tablename~\ref{tab_related_proc} provides an overview of the
described Java hardware. Blank fields in the table indicate that the
information is not available or not applicable (e.g. for
simulation-only projects). Minimum CPI is the number of clock cycles
for a simple instruction such as \code{nop}. One entry, the TINI
system, is not real Java hardware, but is included in the table
since it is often incorrectly\footnote{TINI is a standard
interpreting JVM running on an enhanced 8051 processor.} cited as an
embedded Java processor.
\begin{table}
\centering
{\footnotesize
\begin{tabular}
{|>{\bfseries}p{1.6cm}|m{1.5cm}|>{\raggedright}m{1.6cm}|>{\raggedright}m{1.6cm}
|r|>{\raggedright}m{1.5cm}|r|}
\hline
& Type & Target & Size & Speed & Java & Min. \\
& & technology & & [MHz] & standard & CPI \\
\hline
Hard-Int & Translation & Simulation only & & & & \\
\hline
DELFT & Translation & Simulation only & & & & \\
\hline
JIFFY & Translation & Xilinx FPGA & 3800 LCs, 1KB RAM & & & \\
\hline
Jazelle & Co-processor & ASIC 0.18$\mu$ & 12K gates & 200 & & \\
\hline
JSTAR & Co-processor & ASIC 0.18$\mu$ Softcore & 30K gates + 7KB & 104 & J2ME CLDC\footnotemark[2] & \\
\hline
TINI & Software JVM & Enhanced 8051 clone & & & Java 1.1 subset & \\
\hline
picoJava & Processor & No realization & 128K gates + memory & & Full & 1 \\
\hline
aJile & Processor & ASIC 0.25$\mu$ & 25K gates + ROM & 100 & J2ME CLDC\footnotemark[2] & \\
\hline
Cjip & Processor & ASIC 0.35$\mu$ & 70K gates + ROM, RAM & 67 & J2ME CLDC\footnotemark[2] & 6 \\
\hline
Ignite & Stack processor & Xilinx FPGA & 9700 LCs & & & \\
\hline
Moon & Processor & Altera FPGA & 3660 LCs, 4KB RAM & & & \\
\hline
Lightfoot & Processor & Xilinx FPGA & 3400 LCs & 40 & & \\
\hline
LavaCORE & Processor & Xilinx FPGA & 3800 LCs 30K gates & 20 & & \\
\hline
Komodo & Processor & Xilinx FPGA & 2600 LCs & 20 & Subset: 50 bytecodes & 4 \\
\hline
FemtoJava & Processor & Altera Flex 10K & 2000 LCs & 4 & Subset: 69 bytecodes, 16-bit ALU & 3 \\
\hline
JSM \cite{JSM01} & Processor & Xilinx FPGA & & 3.5 & Java Card & \\
\hline
% \hline
% JOP & Processor & Altera, Xilinx FPGA & 2100 LCs + 3KB RAM & 100 & J2ME CLDC & 1 & Typical configuration on a Cyclone FPGA \\
\end{tabular}
}
\caption{Java hardware}
\label{tab_related_proc}
\end{table}
\footnotetext[2]{J2ME CLDC stands for Java2 Micro Edition, Connected
Limited Device Configuration, which is described in
Section~\ref{subsec:cldc}.}
% Change this: \emph{JOP is included with a typical configuration as a
% reference. Further details of the resource usage of JOP is described
% in Section~xxx.}
\section{Hardware Translation and Coprocessors}
The simplest enhancement for Java is a translation unit, which
replaces the switch statement of an interpreting JVM (bytecode
decoding) with hardware and/or translates simple bytecodes
to a sequence of RISC instructions on the fly.

A standard JVM interpreter contains a loop with a large switch
statement that decodes the bytecode (see
Listing~\ref{lst:intro:java:intprt}). This switch statement is
compiled to an indirect branch. The destinations of these indirect
branches change frequently and do not benefit from branch-prediction
logic. This is the main overhead for simple bytecodes on modern
processors. The following approaches enhance the execution of Java
programs on a standard processor by replacing the memory read and
switch statement with bytecode fetch and decode in hardware.
\subsection{Hard-Int}
Radhakrishnan \cite{HardInt} proposes an additional architecture for
a standard RISC processor to speed up a JVM interpreter. The
architecture, called Hard-Int, is placed between the cache and
instruction fetch of the RISC processor. Simple Java bytecodes are
translated to a sequence of RISC instructions. For native RISC code,
the unit is bypassed. This architecture implements the expensive
switch statement of a typical interpreter in hardware. A simulation
of a SPARC processor with four execution units shows a speedup by
a factor of 2.6 over the JDK 1.2 JIT with SPECjvm98. Since the
architecture is only evaluated in a software simulation, the impact
of the inserted hardware on the clock frequency of the RISC
processor is unknown. No estimation of the additional hardware cost
for the translation unit is given.
\subsection{DELFT-JAVA Engine}
In his thesis \cite{DELFT}, Glossner describes a processor for
multimedia applications in Java. A RISC processor is extended with
DSP capabilities and Java specific instructions. This combination
results in a very complex processor. Simple JVM instructions are
dynamically translated to the DELFT instruction set. However, no
explanation is given as to how this is done. A new
register-addressing mode, indirect register addressing with auto
increment or decrement, provides support for stack caching in the
register file. The translation of JVM bytecode to the DELFT
instruction set maps stack-based dependencies into pipeline
dependencies. The author expects that these dependencies can be
resolved with standard techniques such as register renaming and
out-of-order execution. To accelerate dynamic linking, a link
translation buffer caches resolved entries from the constant pool.
The processor is validated through a C++ model. An experiment with a
synthetic benchmark (vector multiplication) compared a stack machine
with an ideal register machine. The ideal register machine performs
register renaming and out-of-order execution on multiple execution
units. The achieved speedup in this experiment was 2.7. The
high-level simulation model is more a proof of concept and no
estimation is given for the resources needed to implement this
complex processor. Since only a restricted subset of the JVM was
simulated, no Java applications could be used to estimate the
expected speedup.
\subsection{JIFFY}
An interesting approach to enhance Java execution in embedded
systems is presented in Acher's thesis \cite{JIFFY}. He states that
JIT-compilation in software is not possible on most embedded devices
because of resource constraints. JIFFY, a JIT in an FPGA, is
proposed as a solution to this problem. The compilation is done in
the following steps:
The Java bytecode is translated into an intermediate language with
three registers and a stack. The restriction to three registers
reflects the fact that bytecodes use at most three stack
operands, and it simplifies translation to CISC architectures with a
low register count. In the next step, this instruction sequence,
which is still stack-based, is optimized. The main effect of this
optimization is to transform stack-based operations into
register-based operations. These optimized instructions in the
intermediate language are translated to native instructions of the
target architecture in the last step.
The quality of the generated code was tested with software versions
of JIFFY for a CISC (80586) and a RISC (Alpha 21164) architecture.
The resulting code is about 1.1 to 7.5 times faster than
interpreting Java bytecode on the x86 architecture. The speedup is
similar to that of Sun's first JIT compiler (sunwjit in JDK 1.1). The
compilation time is estimated at 50 to 70 clock cycles per
bytecode. This is 10 times faster than the efficient CACAO JIT
\cite{Krall98}. A first prototype implementation in an FPGA used
3800 LCs and 8 Kbit of RAM (80\% of a Xilinx XC2S200).
\subsection{Jazelle}
Jazelle \cite{Jazelle} is an extension of the ARM 32-bit RISC
processor, similar to the Thumb state (a 16-bit mode for reduced
memory consumption). The Jazelle coprocessor is integrated into the
same chip as the ARM processor. The hardware bytecode decoder logic
is implemented in less than 12K gates. It accelerates, according to
ARM, some 95\% of the executed bytecodes. 140 bytecodes are executed
directly in hardware, while the remaining 94 are emulated by
sequences of ARM instructions. This solution also uses code
modification with \textit{quick} instructions to substitute certain
object-related instructions after link resolution. All Java
bytecodes, including the emulated sequences, are re-startable to
enable a fast interrupt response time.
A new ARM instruction puts the processor into Java state. Bytecodes
are fetched and decoded in two stages, compared to a single stage in
ARM state. Four registers of the ARM core are used to cache the top
stack elements. Stack spill and fill is handled automatically by the
hardware. Additional registers are reused for the Java stack
pointer, the variable pointer, the constant pool pointer and local
variable 0 (the \textit{this} pointer in methods). Keeping the
complete state of the Java mode in ARM registers simplifies its
integration into existing operating systems.
\subsection{JSTAR, JA108}
Nozomi's JA108 Java coprocessor \cite{JSTAR}, previously known as
JSTAR, sits between the native processor and the memory
subsystem. JA108 fetches Java bytecodes from memory and translates
them into native microprocessor instructions. JA108 acts as a
pass-through when the core processor's native instructions are being
executed. The JA108 is targeted at mobile phones, where it is meant
to increase the performance of Java multimedia applications. The
coprocessor is available as a standalone package or with included
memory and can be operated at up to 104MHz. The resource usage of the
JSTAR is given as about 30K gates plus 45Kbits for the microcode.
\subsection{A Co-Designed Virtual Machine}
In his thesis \cite{KentPhD}, Kent proposes an interesting new form
of Java coprocessor. He investigates hardware/software co-design for
a JVM within the context of a desktop workstation. The execution of
the JVM is partitioned between an FPGA and the host processor. An
FPGA board with local memory is connected via the PCI bus to the
host. This solution provides an add-on accelerator without changing
the system. Moreover, as the FPGA can be configured for a different
task, the add-on hardware can be used for non-Java applications.
The critical issue in this approach is the partitioning of the JVM
and the memory regions between hardware and software. Not all Java
bytecodes can be executed in hardware. All object-oriented bytecodes
are performed in software. However, once these bytecodes are
replaced by their \textit{quick} variants, some of them can then be
executed in hardware. The most accessed data structures, i.e.\ the
method's bytecode, execution stack and local variables, are placed
in the FPGA board memory. The constant pool and the heap reside in
the PC's main memory. The software part of the JVM decides during
runtime which instruction sequences can be executed by the hardware.
Due to the high cost of a context switch, this is a critical
decision. Kent explored various algorithms with different block
sizes to find the optimum partitioning of the instructions between
the host processor and the FPGA. Tests with small benchmarks on a
simulation showed performance gains by a factor of 6 to 11, when
compared with an interpreting JVM. Kent is now working on the
concurrent use of the FPGA and the host system to execute Java
applications. Additional performance increases are expected for
multi-threaded applications.
In our view, there are two potential problems with this approach.
Firstly, the execution context for the hardware is too small. As
\code{invokevirtual} and the quick version are implemented in the
software partition, the maximum context is one method body. As shown
in Section~\ref{sec:bench:jvm:methods}, Java methods are usually
small (about 30\% are less than 9 bytes long), resulting in many
context switches. The second issue is the raw speedup, without
communication overhead, of the FPGA solution. This speedup is stated
to be about a factor of 10 at the same clock frequency.
However, the clock rate of an FPGA will never reach the clock rate of
a general-purpose processor. For a non-trivial design, such as a CPU,
the clock rate of an FPGA is about 20 to 50 times lower. Admittedly,
everyone who uses an FPGA as target technology for a processor
design faces this problem. It is better not to try to compete
against mainstream PC technology.
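
To put rough numbers on the second issue, assume (as a simplification
that is not part of Kent's evaluation) that the raw speedup
$S_{\mathit{raw}} \approx 10$ scales linearly with the clock frequency
and that communication overhead is ignored. A clock that is 20 to 50
times slower than that of the host processor then yields a net speedup
of
\[
S_{\mathit{net}} = \frac{S_{\mathit{raw}}}{f_{\mathit{CPU}}/f_{\mathit{FPGA}}}
\approx \frac{10}{20} = 0.5 \quad \mbox{to} \quad \frac{10}{50} = 0.2 ,
\]
that is, a slowdown by a factor of 2 to 5 relative to the host
processor.
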
\section{Java Processors}
Java processors are primarily used in embedded systems. In such a
system, Java is the native programming language and all operating
system related code, such as device drivers, is implemented in
Java. Java processors are simple or extended stack architectures
with an instruction set that more or less resembles the bytecodes
of the JVM.
\subsection{picoJava}
\label{subsec:related:picojava}
Sun's picoJava is the Java processor most often cited in research
papers. It is used as a reference for new Java processors and as the
basis for research into improving various aspects of a Java
processor. Ironically, this processor was never released as a
product by Sun. After Sun decided not to produce picoJava in
silicon, it licensed picoJava to Fujitsu, IBM, LG Semicon and NEC.
However, these companies also did not produce a chip and Sun finally
provided the full Verilog code under an open-source license.
Sun introduced the first version of picoJava \cite{624084} in 1997.
The processor was targeted at the embedded systems market as a pure
Java processor with restricted support of C. picoJava-I contains
four pipeline stages. A redesign followed in 1999, known as
picoJava-II. This is the version described below. picoJava-II is now
freely available with a rich set of documentation \cite{pjMicroArch,
pjProgRef}.
Simple Java bytecodes are directly implemented in hardware; most of
them execute in one to three cycles. Other performance-critical
instructions, for instance invoking a method, are implemented in
microcode. picoJava traps on the remaining complex instructions,
such as creation of an object, and emulates them. To access memory
and internal registers and for cache management, picoJava
implements 115 extended instructions with 2-byte opcodes. These
instructions are necessary to write system-level code to support the
JVM.
Traps are generated on interrupts, exceptions and for instruction
emulation. A trap is rather expensive and has a minimum overhead of
16 clock cycles:
\begin{verbatim}
6 clocks trap execution
n clocks trap code
2 clocks set VARS register
8 clocks return from trap
\end{verbatim}
This minimum value can only be achieved if the trap table entry is
in the data cache and the first instruction of the trap routine is
in the instruction cache. The worst-case interrupt latency is 926
clock cycles \cite{pjProgRef}.
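
Written as a formula, with $n$ denoting the number of clock cycles
spent in the trap code itself, the breakdown above gives a total trap
cost of
\[
t_{\mathit{trap}} = 6 + n + 2 + 8 = 16 + n \quad \mbox{clock cycles},
\]
so the fixed part of 16 cycles is paid regardless of how short the
trap routine is.
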
\begin{figure*}
\centering
% \includegraphics[scale=\picscale]{related/related_picojava}
\includegraphics[scale=0.85]{related/related_picojava}
\caption[Block diagram of picoJava-II]
{Block diagram of picoJava-II (from \cite{pjMicroArch})}
\label{fig_related_picojava}
\end{figure*}
\figurename~\ref{fig_related_picojava} shows the major function
units of picoJava. The integer unit decodes and executes picoJava
instructions. The instruction cache is direct-mapped, while the data
cache is two-way set-associative, both with a line size of 16 bytes.
The caches can be configured between 0 and 16 Kbytes. An instruction
buffer decouples the instruction cache from the decode unit. The FPU
is organized as a microcode engine with a 32-bit datapath supporting
single- and double-precision operations. Most single-precision
operations require four cycles. Double-precision operations require
four times the number of cycles as single-precision operations. For
low-cost designs, the FPU can be removed and the core traps on
floating-point instructions to a software routine to emulate these
instructions. picoJava provides a 64-entry stack cache as a register
file. The core manages this register file as a circular buffer, with
a pointer to the top of stack. The stack management unit
automatically performs spill to and fill from the data cache to
avoid overflow and underflow of the stack buffer. To provide this
functionality the register file contains five memory ports.
Computation needs two read ports and one write port; the concurrent
spill and fill operations need the additional read port and write port.
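
The circular-buffer organization of the stack cache described above
can be illustrated in software. The following Java sketch is a
simplification under stated assumptions, not the picoJava stack
management unit: the class name \code{StackCacheSketch} and its
methods are hypothetical, the array \code{memory} stands in for the
data cache, and spill and fill happen on demand, one entry at a time,
rather than concurrently with computation.
\begin{verbatim}
// Illustrative sketch; the spill/fill policy is an assumption.
final class StackCacheSketch {
    static final int SIZE = 64;          // entries in the stack cache
    private final int[] cache = new int[SIZE];
    private final int[] memory;          // stands in for the data cache
    private int top = 0;     // total number of elements on the stack
    private int spilled = 0; // oldest elements, held only in memory

    StackCacheSketch(int memoryWords) { memory = new int[memoryWords]; }

    void push(int value) {
        if (top - spilled == SIZE) {     // buffer full: spill oldest entry
            memory[spilled] = cache[spilled % SIZE];
            spilled++;
        }
        cache[top % SIZE] = value;       // element i lives in slot i mod 64
        top++;
    }

    int pop() {
        if (top == 0) throw new IllegalStateException("stack underflow");
        if (top == spilled) {            // buffer empty: fill from memory
            spilled--;
            cache[spilled % SIZE] = memory[spilled];
        }
        top--;
        return cache[top % SIZE];
    }

    int cachedEntries() { return top - spilled; }

    public static void main(String[] args) {
        StackCacheSketch s = new StackCacheSketch(1024);
        for (int i = 0; i < 100; i++) s.push(i);
        System.out.println(s.cachedEntries()); // prints 64
    }
}
\end{verbatim}
Pushing 100 values leaves the newest 64 in the buffer and the oldest
36 in memory; popping returns all of them in reverse order.
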
The processor core consists of the following six pipeline stages:
%
\begin{description}
\item[Fetch:]
Fetch 8 bytes from the instruction cache or 4 bytes from the bus
interface to the 16-byte-deep prefetch buffer.
\item[Decode:]
Group and precode instructions (up to 7 bytes) from the prefetch
buffer. Instruction folding is performed on up to four bytecodes.
\item[Register:]
Read up to two operands from the register file (stack cache).
\item[Execute:]
Execute simple instructions in one cycle or microcode for
multi-cycle instructions.
\item[Cache:]
Access the data cache.
\item[Writeback:]
Write the result back into the register file.
\end{description}
%
The integer unit together with the stack unit provides a mechanism,
called instruction folding, to speed up common code patterns found
in stack architectures, as shown in
\figurename~\ref{fig_related_folding}.
%
\begin{figure}
A Java instruction
\begin{verbatim}
c = a + b;
\end{verbatim}
translates to the following bytecodes:
\begin{verbatim}
iload_1
iload_2
iadd
istore_3
\end{verbatim}
\caption{A common folding pattern that is executed in a single cycle}
\label{fig_related_folding}
\end{figure}
%
When all entries are contained in the stack cache, the picoJava core
can fold these four instructions to one RISC-style single cycle
operation.
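
The folding pattern of \figurename~\ref{fig_related_folding} can be
mimicked in software. The following Java fragment is only an
illustration, not the picoJava decode logic: the class and method
names are hypothetical, and it recognizes just the single pattern
load, load, \code{iadd}, store with the short one-byte load and store
forms. The opcode values are those of the JVM specification.
\begin{verbatim}
// Illustrative sketch only; not the picoJava decode logic.
final class FoldingSketch {
    static final int ILOAD_0 = 0x1a, ILOAD_3 = 0x1d;   // iload_0..3
    static final int ISTORE_0 = 0x3b, ISTORE_3 = 0x3e; // istore_0..3
    static final int IADD = 0x60;

    static boolean isIload(int op) {
        return op >= ILOAD_0 && op <= ILOAD_3;
    }

    static boolean isIstore(int op) {
        return op >= ISTORE_0 && op <= ISTORE_3;
    }

    /** True if the four bytecodes at pc are load, load, add, store. */
    static boolean foldable(byte[] code, int pc) {
        if (pc < 0 || pc + 4 > code.length) return false;
        return isIload(code[pc] & 0xff)
            && isIload(code[pc + 1] & 0xff)
            && (code[pc + 2] & 0xff) == IADD
            && isIstore(code[pc + 3] & 0xff);
    }

    /** Executes the folded group as one register-style operation. */
    static void executeFolded(byte[] code, int pc, int[] locals) {
        int a = (code[pc] & 0xff) - ILOAD_0;
        int b = (code[pc + 1] & 0xff) - ILOAD_0;
        int c = (code[pc + 3] & 0xff) - ISTORE_0;
        locals[c] = locals[a] + locals[b]; // no operand stack traffic
    }

    public static void main(String[] args) {
        // iload_1, iload_2, iadd, istore_3
        byte[] code = {0x1b, 0x1c, 0x60, 0x3e};
        int[] locals = {0, 5, 7, 0};
        if (foldable(code, 0)) executeFolded(code, 0, locals);
        System.out.println(locals[3]); // prints 12
    }
}
\end{verbatim}
The folded operation reads two local variables and writes a third,
which is exactly the three-operand form of a RISC instruction.
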
picoJava contains a simple mechanism to speed up the common case for
monitor enter and exit. The two low-order bits of an object
reference indicate that the lock is held or that a
lock held by another thread has been requested. These bits are examined by
\code{monitorenter} and \code{monitorexit}. For all other operations
on the reference, these two bits are masked out by the hardware.
Hardware registers cache up to two locks held by a single thread.
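
How the two lock bits are hidden from ordinary accesses can be
sketched in a few lines of Java. This is an illustration only: the
class name, the constants and in particular the assignment of the two
bits (bit 0 for a held lock, bit 1 for a pending request) are
assumptions and are not taken from the picoJava documentation.
\begin{verbatim}
// Illustrative sketch; the bit assignment is an assumption.
final class LockBitsSketch {
    static final int LOCK_HELD = 0x1;      // assumed: lock is held
    static final int LOCK_REQUESTED = 0x2; // assumed: another thread waits
    static final int LOCK_MASK = LOCK_HELD | LOCK_REQUESTED;

    /** Address seen by all operations except monitor enter and exit. */
    static int address(int ref) { return ref & ~LOCK_MASK; }

    static boolean isHeld(int ref) {
        return (ref & LOCK_HELD) != 0;
    }

    static boolean isRequested(int ref) {
        return (ref & LOCK_REQUESTED) != 0;
    }

    public static void main(String[] args) {
        int ref = 0x1000 | LOCK_HELD;
        System.out.println(Integer.toHexString(address(ref))); // 1000
    }
}
\end{verbatim}
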
To efficiently implement a generational or an incremental garbage
collector, picoJava offers hardware support for write barriers
through memory segments. On every store of an object reference, the
hardware checks whether this reference points to a different segment
than the store address. In this case, a trap is generated and the
garbage collector can take the appropriate action. Two additional
reserved bits in the object reference can be used for a write
barrier trap.
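
The segment check performed on a reference store can be written down
as a one-line comparison. The Java sketch below is a minimal
illustration under assumptions: the class name and the segment size
(a hypothetical 64\,KB, i.e.\ the upper 16 address bits select the
segment) are chosen for the example and do not describe the actual
picoJava configuration registers.
\begin{verbatim}
// Illustrative sketch; the segment size is a hypothetical choice.
final class WriteBarrierSketch {
    static final int SEGMENT_SHIFT = 16; // assumed 64 KB segments

    static int segment(int address) {
        return address >>> SEGMENT_SHIFT;
    }

    /** True if storing ref at storeAddress crosses a segment boundary. */
    static boolean trapsOnStore(int storeAddress, int ref) {
        return segment(storeAddress) != segment(ref);
    }

    public static void main(String[] args) {
        // Store into segment 1 of a reference that points to segment 2.
        System.out.println(trapsOnStore(0x00010010, 0x00020000)); // true
    }
}
\end{verbatim}
A generational collector would, for example, place the old and the
young generation in different segments, so that only stores creating
an inter-generation pointer cause a trap.
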
The architecture of picoJava is a stack-based CISC processor
implementing 341 different instructions \cite{624084} and is the
most complex Java processor available. The processor can be
implemented \cite{Sekar2000} in about 440K gates (128K for the logic
and 314K for the memory components: 284x80 bits microcode ROM,
2x192x64 bits FPU ROM and 2x16KB caches).
\subsection{aJile JEMCore}
aJile's JEMCore is a direct-execution Java processor that is
available as both an IP core and a stand-alone processor
\cite{aJile, 880720}. It is based on the 32-bit JEM2 Java chip
developed by Rockwell-Collins. JEM2 is an enhanced version of JEM1,
created in 1997 by the Rockwell-Collins Advanced Architecture
Microprocessor group. Rockwell-Collins originally developed JEM for
avionics applications by adapting an existing design for a
stack-based embedded processor. Rockwell-Collins decided not to sell
the chip on the open market. Instead, it licensed the design
exclusively to aJile Systems Inc., which was founded in 1999 by
engineers from Rockwell-Collins, Centaur Technologies, Sun
Microsystems, and IDT.
The core contains 24 32-bit wide registers. Six of them are used to
cache the top elements of the stack. The datapath consists of a
32-bit ALU, a 32-bit barrel shifter and the support for floating
point operations (disassembly/assembly, overflow and NaN detection).
The control store is a 4K by 56 ROM to hold the microcode that
implements the Java bytecode. An additional RAM control store can be
used for custom instructions. This feature is used to implement the
basic synchronization and thread scheduling routines in microcode.
This results in low execution overheads with thread-to-thread yield
of less than one $\mu$s (at 100MHz). An optional Multiple JVM
Manager (MJM) supports two independent, memory protected JVMs. The
two JVMs execute time-sliced on the processor. According to aJile,
the processor can be implemented in 25K gates (without the microcode
ROM). The MJM needs an additional 10K gates.
Two silicon versions of JEM exist today: the aJ-80 and the aJ-100.
Both versions comprise a JEM2 core, the MJM, 48KB zero wait state
RAM and peripheral components, such as timer and UART. 16KB of the
RAM is used for the writable control store. The remaining 32KB is
used for storage of the processor stack. The aJ-100 provides a
generic 8-bit, 16-bit or 32-bit external bus interface, while the
aJ-80 only provides an 8-bit interface. The aJ-100 can be clocked up
to 100MHz and the aJ-80 up to 66MHz. The power consumption is about
1mW per MHz.
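
For orientation, the figures quoted above translate into cycle and
power budgets as follows (plain arithmetic on these numbers, not
additional vendor data). At 100\,MHz one clock cycle takes 10\,ns, so
a thread-to-thread yield of less than 1\,$\mu$s corresponds to fewer
than
\[
\frac{1\,\mu\mathrm{s}}{10\,\mathrm{ns}} = 100
\]
clock cycles. A power consumption of about 1\,mW per MHz gives roughly
\[
100\,\mathrm{MHz} \times 1\,\mathrm{mW/MHz} = 100\,\mathrm{mW}
\]
for the aJ-100 and about 66\,mW for the aJ-80, each at its maximum
clock rate.
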
Since aJile was a member of the Real-Time for Java Expert Group, the
complete RTSJ will be available in the near future. One nice feature
of this processor is its availability. A relatively cheap
development system, the JStamp \cite{JStamp}, was used to compare
this processor with JOP.
\subsection{Cjip}
The Cjip processor \cite{Imsys, Cjip} supports multiple instruction
sets, allowing Java, C, C++ and assembler to coexist. Internally,
the Cjip uses 72-bit-wide microcode instructions to support the
different instruction sets. At its core, Cjip is a 16-bit CISC
architecture with on-chip 36KB ROM and 18KB RAM for fixed and
loadable microcode. Another 1KB RAM is used for eight independent
register banks, string buffer and two stack caches. Cjip is
implemented in 0.35-micron technology and can be clocked up to
66MHz. The logic core consumes about 20\% of the
1.4-million-transistor chip. The Cjip has 40 program controlled I/O
pins, a high-speed 8 bit I/O bus with hardware DMA and an 8/16 bit
DRAM interface.
The JVM is implemented largely in microcode (about 88\% of the Java
bytecodes). Java thread scheduling and garbage collection are
implemented as processes in microcode. Microcode is also used to
implement virtual peripherals such as watchdog timers, display and
keyboard interfaces, sound generators and multimedia codecs.
Microcode instructions execute in two or three cycles. A JVM
bytecode requires several microcode instructions. The Cjip Java
instruction set and the extensions are described in detail in
\cite{CjipRef}. For example: a bytecode \code{nop} executes in 6
cycles while an \code{iadd} takes 12 cycles. Conditional bytecode
branches are executed in 33 to 36 cycles. Object-oriented
instructions such as \code{getfield}, \code{putfield} or
\code{invokevirtual} are not part of the instruction set.
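
To give a feeling for the cycle counts quoted above, the execution
time $t = c/f$ of a bytecode that needs $c$ cycles at the maximum
clock frequency of $f = 66$\,MHz is (plain arithmetic on the figures
above, not measured data):
\[
t_{\mathit{nop}} = \frac{6}{66\,\mathrm{MHz}} \approx 91\,\mathrm{ns},
\qquad
t_{\mathit{iadd}} = \frac{12}{66\,\mathrm{MHz}} \approx 182\,\mathrm{ns},
\qquad
t_{\mathit{branch}} = \frac{33 \ldots 36}{66\,\mathrm{MHz}}
\approx 500 \ldots 545\,\mathrm{ns}.
\]
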
\subsection{Ignite, PSC1000}
The PSC1000 \cite{IGNITE} is a stack processor, based on ShBoom
(originally designed by Chuck Moore \cite{ShBoom}), designed for
high-speed Forth applications. The PSC1000 was later renamed
Ignite and promoted as a Java processor, though it has its roots in
Forth. The instruction set, called ROSC (Removed Operand Set
Computer), is different from Java bytecodes. A small JVM driver
converts Java bytecode into the stack instruction set of the
processor.
The processor contains two on-chip stacks, as usual in Forth
processors \cite{Koopman89}, and 16 additional global registers. The
first elements of the stacks are directly accessible. The bottleneck
of instruction fetching without a cache is avoided by fetching up to
four 8-bit instructions from a 32-bit memory. To simplify
instruction decoding, immediate values and branch offsets are placed
right-aligned in such an instruction group. The PSC1000 is available
as an ASIC at 80MHz and as a soft-core for Xilinx FPGAs (9700 LCs).
\subsection{Moon}
Vulcan ASIC's Moon processor is an implementation of the JVM to run
in an FPGA. The execution model is the often-used mix of direct,
microcode and trapped execution. As described in \cite{Vulcan2000},
a simple stack folding is implemented in order to reduce five memory
cycles to three for instruction sequences like
\textit{push-push-add}. The first version of Moon uses 3840 LCs and
10 embedded memory blocks in an Altera FPGA. The Moon2 processor
\cite{Vulcan2003} is available as an encrypted HDL source for Altera
FPGAs (22\% of an APEX 20K400E equates to 3660 LCs) or as VHDL or
Verilog source code. The minimum silicon cost is given as 27K gates
plus 3KB ROM and 1KB single port RAM. The single port RAM is used to
implement 256 entries of the stack.
\subsection{Lightfoot}
The Lightfoot 32-bit core \cite{Lightfoot} is a hybrid 8/32-bit
processor based on the Harvard architecture. Program memory is 8
bits wide and data memory is 32 bits wide. The core contains a
3-stage pipeline with an integer ALU, a barrel shifter and a 2-bit
multiply step unit. There are two different stacks with top elements
implemented as registers and memory extension. The data stack is
used to hold temporary data -- it is not used to implement the JVM
stack frame. As the name implies, the return stack holds return
addresses for subroutines and it can be used as an auxiliary stack.
The TOS element is also used to access memory. The processor
architecture specifies three different instruction formats: soft
bytecodes, non-returnable instructions and single-byte instructions
that can be folded with a return instruction. Soft bytecode
instructions cause the processor to branch to one of 128 locations
in low program memory, where the implementation of the soft
bytecodes resides. This operation has a single cycle overhead and
the address of the following instruction is pushed onto the return
stack. The instruction set implies that it is optimized to write an
efficient interpreted JVM.
The core is available in VHDL and can be implemented in less than
30K gates. According to DCT, the performance is typically 8 times
better than RISC interpreters running at the same clock speed. The
core is also provided as an EDIF netlist for dedicated Xilinx
devices. It needs 1710 CLBs (= 3400 LCs) and 2 Block RAMs. In a
Virtex-II (2V1000-5), it can be clocked up to 40MHz.
\subsection{LavaCORE}
LavaCORE \cite{LavaCORE} is another Java processor targeted at
Xilinx FPGA architectures. It implements a set of instructions in
hardware and firmware. Floating-point operations are not
implemented. A 32x32-bit dual-ported RAM implements a register-file.
For specialized embedded applications, a tool is provided to analyze
which subset of the JVM instructions is used. The unused
instructions can be omitted from the design. The core can be
implemented in 1926 CLBs (= 3800 LCs) in a Virtex-II (2V1000-5) and
runs at 20MHz.
\subsection{Komodo}
\label{subsec:related:komodo}
Komodo \cite{Zulauf00} is a multithreaded Java processor with a
four-stage pipeline. It is intended as a basis for research on
real-time scheduling on a multithreaded microcontroller
\cite{komodo2003}. Simple bytecodes are directly implemented, while
more complex bytecodes, such as \code{iaload}, are implemented as a
microcode sequence. The unique feature of Komodo is the instruction
fetch unit with four independent program counters and status flags
for four threads. A priority manager is responsible for hardware
real-time scheduling and can select a new thread after each bytecode
instruction.
The first version of Komodo in an FPGA implements a very restricted
subset of the JVM (only 50 bytecodes). The design can be clocked at
20MHz. However, the pipeline runs at 5MHz for single cycle external
memory access and three-port access of stack memory in one pipeline
stage. The resource usage is 1300 CLBs (= 2600 LCs) in a Xilinx
XC4036XL.
\subsection{FemtoJava}
FemtoJava \cite{Femto01} is a research project to build an
application specific Java processor. The bytecode usage of the
embedded application is analyzed and a customized version of
FemtoJava is generated. FemtoJava implements up to 69 bytecode
instructions for an 8- or 16-bit datapath. These instructions take 3,
4, 7 or 14 cycles to execute. Analysis of small applications (50 to
280 byte code) showed that between 22 and 69 distinct bytecodes are
used. The resulting resource usage of the FPGA varies between 1000
and 2000 LCs. With the reduction of the datapath to 16 bits, the
processor is not Java conformant.
\section{Additional Comments}
The two classes of hardware accelerators for Java can be further
subdivided as shown in \figurename~\ref{fig_related_tree}. Many of
the Java processors are stack machines that have been derived from
Forth processors. The two separate stacks in these so-called Java
processors (Cjip, Ignite and Lightfoot) are not a good fit for the
JVM. Although stack based, Forth is different from Java bytecode.
The instruction mix in Forth shows about 25\% calls and returns
\cite{Koopman89}, so Forth processors are optimized for fast call
and return. In Java, the percentage of call/return instructions is
only about 6\% (see Section~\ref{sec:bench:jvm}). With subroutine
exits so common in Forth, it is no wonder that most of the Forth
stack machines have a mechanism for combining subroutine exits with
other instructions and provide two stacks to avoid mixing parameters
and return addresses. However, a JVM stack frame is more complex than
in Forth (see Section~\ref{sec:stack}) and there is no use for such a
mechanism. An additional return stack provides no advantage for the
JVM.
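The effect of the different call/return frequencies can be
illustrated with Amdahl's law. This is a rough sketch, not a
measurement: it assumes that all instructions take the same time, so
that the fraction $f$ of call and return instructions equals their
share of the execution time, and it assumes a hypothetical speedup of
$s = 2$ for call and return through dedicated hardware. Both $f$ and
$s$ are symbols introduced here for illustration.
\[
S(f) = \frac{1}{(1-f) + f/s}, \qquad
S_{\mathrm{Forth}} = \frac{1}{0.75 + 0.125} \approx 1.14, \qquad
S_{\mathrm{Java}} = \frac{1}{0.94 + 0.03} \approx 1.03 .
\]
Under these assumptions, a mechanism that gains about 14\% for a
Forth instruction mix ($f = 0.25$) gains only about 3\% for Java
bytecode ($f = 0.06$).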
In Forth, only the top elements can be accessed, which results in a
simple stack design with only one access port. In the JVM, parameters
for a method are explicitly pushed on the stack before invocation.
These parameters are then accessed in the method relative to a
variable pointer. This mechanism needs a dual-ported memory with
simultaneous read and write access. These basic differences between
Forth and the JVM lead to a sub-optimal implementation of the JVM on
a Forth-based processor.
\begin{figure*}
\centering
\includegraphics[scale=\picscale]{related/related_tree}
\caption{Java hardware}
\label{fig_related_tree}
\end{figure*}
Information about commercial products is hard to obtain.
When new companies started developing Java processors, a lot of
information was available. This information was usually more of a
presentation of the concept; nevertheless, it gave some insights into
how they approached the different design problems. However, at the
point at which the projects reached production quality, this
information quietly disappeared from their websites. It was replaced
with colorful marketing prospectuses about the wonderful world of
the new Java-enabled mobile phones. Only one company, aJile Ltd.,
presented information about their product in a refereed conference
paper.
Many research projects for a Java processor in an FPGA exist.
Examples can be found in \cite{Femto01}, \cite{Kim2000} and
\cite{368445}. These projects have much in common -- the basic
implementation of a stack machine with integer instructions is easy.
However, the realization of the complete JVM is the hard part and
therefore beyond the scope of these projects.
Other than the aJile processor and the Komodo project, no solution
addresses the problem of real-time predictability. For this reason,
as well as its availability, the aJile processor is used for
comparison with JOP.
\section{Summary}
\label{sec:related:summary}
In Table~\ref{tab:related:plus:minus}, features of selected Java
processors are compared. The `Predictability' category indicates how
time-predictable the processor is. The `Size' category gives an
estimate of the chip size, and `Performance' refers to average
performance. The `JVM conformance' category indicates how complete
the implementation
of the JVM specification \cite{jvm} is. The `Flexibility' parameter
indicates how well the processor can be adapted to different
application domains.
The assessment of the various parameters is, however, somewhat
subjective as the information is mainly derived from written
documentation. In Section~\ref{sec:performance}, the overall
performance of various Java systems, including the aJile processor,
is compared with JOP.
The last column of the table shows the features required for JOP.
This is, therefore, our research objective in a nutshell.
\begin{table}[htp]
\centering
\begin{tabular}{lccccc}
\toprule
& picoJava & aJile & Komodo & FemtoJava & JOP \\
\midrule
Predictability & $--$ & $\cdot$ & $-$ & $\cdot$ & $++$ \\
Size & $--$ & $-$ & $+$ & $-$ & $++$ \\
Performance & $++$ & $+$ & $-$ & $--$ & $+$ \\
JVM conformance & $++$ & $+$ & $-$ & $--$ & $\cdot$ \\
Flexibility & $--$ & $--$ & $+$ & $++$ & $++$ \\
\bottomrule
\end{tabular}
\caption{Feature comparison of selected Java processors}
\label{tab:related:plus:minus}
\end{table}
Due to the great variation in execution times for a trap, picoJava
is given a double minus in the `Predictability' category. picoJava
is also the largest processor in the list. However, its performance
and JVM compatibility are expected to be superior to those of other
processors.
The aJile processor is intended as a solution for real-time systems.
However, no information is available about bytecode execution times.
As this processor is a commercial product and has been on the market
for some time, it is expected that its JVM implementation would
conform to Java standards, as defined by Sun.
Komodo's multithreading is similar to hyper-threading in modern
processors that are trying to hide latencies in instruction
fetching. However, this feature leads to very pessimistic WCET
values (in effect rendering the performance gain useless). The fact
that the pipeline clock is only a quarter of the system clock also
wastes a considerable amount of potential performance.
FemtoJava is given a double plus for flexibility, due to the
application-dependent generation of the processor. However,
FemtoJava is only a 16-bit processor and therefore not JVM
compliant. The resource usage is also very high, compared to the
minimal Java subset implemented and the low performance of the
processor.
So far, all processors in the list perform weakly in the area of
time-predictable execution of Java bytecodes. However, a low-level
analysis of execution times is of primary importance for WCET
analysis. Therefore, the main objective of JOP is to define and
implement a processor architecture that is as predictable as
possible. However, it is equally important that this does not result
in a low performance solution. Performance shall not suffer as a
result of the time-predictable architecture.
The second main aim of this work is to design a small processor.
Size and the resulting energy consumption are main concerns in
embedded systems. The proposed Java processor needs to be small
enough to be implemented in a low-cost FPGA device. With this
constraint, an implementation in an ASIC will also result in a very
small core that can be part of a larger system-on-a-chip.
The embedded market is diverse and one size does not fit all. A
configurable processor in which we can trade size for performance
provides the flexibility for a variety of application domains. The
aim of the architecture of JOP is to support this flexibility.
\section{Derived Work}
\label{sec:derived}
Derived projects are quite common for open-source projects, and the
research community in particular appreciates open-source projects.
The following list describes projects that are either completely
based on JOP or influenced by it to a great extent.
JOP triggered research on the implementation of the JVM in hardware
for real-time systems. The publications on JOP, and the fact that
JOP is open-source, made the project and its ideas easily accessible
to other researchers. Several research projects are directly or
indirectly based on JOP:
\begin{itemize}
\item Lund -- Flavius
\item Dresden
\item Graz
\item Alberto's MS thesis
\item \cite{conf/iscas/KoT07} JOP based dual-issue Java processor
\item WCET work by Rasmus, Trevor, Elena, and upcoming CISS
\end{itemize}