LOGIN   :::   RECOVER PASS   :::   GET ACCOUNT    
Browse
  • Projects
  • Code (CVS)
  • Forums
  • News
  • Articles
  • Polls
  •  
    OpenCores
  • FAQ
  • CVS HowTo
  • Mission
  • Media
  • Tools
  • Sponsors
  • Mirrors
  • Logos
  • Contact us
  •  
    Tools
  • Search
      
  • Download Cores (CVSGet)
  •  
    More
  • Wishbone
  • Perlilog
  • EDA tools
  • OpenTech CD
  •  
    Navigation: All forums > Cvs-checkins > Message List > Message Post

    Message

    Reply | Reply all
    Date Prev | Date Next | Thread Prev | Thread Next Date Index | Thread Index

    From: cvs at opencores.org<cvs@o...>
    Date: Sat Aug 25 20:03:12 CEST 2007
    Subject: [cvs-checkins] MODIFIED: jop ...
    Top
    Date: 00/07/08 25:20:03

    Added: jop/doc/book/results sort.dot results_app_bench.pdf
    results_app_bench_double_scaled.pdf
    results_app_bench_scaled.pdf results_kfl_exe.pdf
    results_kfl_exe_detail.pdf results_kfl_mast1.pdf
    results_kfl_mast2.pdf results_sigdel.pdf
    results_sigdel_real.pdf results_wcet_cfg.pdf
    results_wcet_sort.pdf bubble.tex results.tex
    Log:
    Handbook update


    Revision Changes Path
    1.1 jop/doc/book/results/sort.dot

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/sort.dot?rev=1.1&content-type=text/x-cvsweb-markup

    Index: sort.dot
    ===================================================================

    digraph sort {
    /* node [shape = box]; */
    B1 -> B2 [label="1"];
    B2 -> B3 [label="4"];
    B2 -> B9 [label="1"];
    B3 -> B4 [label="4"];
    B4 -> B5 [label="10"];
    B4 -> B8 [label="4"];
    B5 -> B6 [label="0-10"];
    B5 -> B7 [label="0-10"];
    B6 -> B7 [label="0-10"];
    B7 -> B4 [label="10"];
    B8 -> B2 [label="4"];
    }



    1.1 jop/doc/book/results/results_app_bench.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_app_bench.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_app_bench_double_scaled.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_app_bench_double_scaled.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_app_bench_scaled.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_app_bench_scaled.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_kfl_exe.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_kfl_exe.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_kfl_exe_detail.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_kfl_exe_detail.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_kfl_mast1.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_kfl_mast1.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_kfl_mast2.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_kfl_mast2.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_sigdel.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_sigdel.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_sigdel_real.pdf

    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_sigdel_real.pdf?rev=1.1&content-type=text/x-cvsweb-markup

    <<Binary file>>


    1.1 jop/doc/book/results/results_wcet_cfg.pdf
    http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_wcet_cfg.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/results/results_wcet_sort.pdf http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results_wcet_sort.pdf?rev=1.1&content-type=text/x-cvsweb-markup <<Binary file>> 1.1 jop/doc/book/results/bubble.tex http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/bubble.tex?rev=1.1&content-type=text/x-cvsweb-markup Index: bubble.tex =================================================================== B1 & & & 2 & 1 & 2 & 1 & 2 \\ & 0: & iconst\_4 & 1 & & \\ & 1: & istore\_1 & 1 & & \\ B2 & & & 5 & 5 & 25 & 5 & 25 \\ & 2: & iload\_1 & 1 & & \\ & 3: & ifle 53 & 4 & & \\ B3 & & & 2 & 4 & 8 & 4 & 8 \\ & 6: & iconst\_1 & 1 & & \\ & 7: & istore\_2 & 1 & & \\ B4 & & & 6 & 14 & 84 & 14 & 84 \\ & 8: & iload\_2 & 1 & & \\ & 9: & iload\_1 & 1 & & \\ & 10: & if\_icmpgt 47 & 4 & & \\ B5 & & & 74 & 10 & 740 & 10 & 740 \\ & 13: & aload\_0 & 1 & & \\ & 14: & iload\_2 & 1 & & \\ & 15: & iconst\_1 & 1 & & \\ & 16: & isub & 1 & & \\ & 17: & iaload & 29 & & \\ & 18: & istore\_3 & 1 & & \\ & 19: & aload\_0 & 1 & & \\ & 20: & iload\_2 & 1 & & \\ & 21: & iaload & 29 & & \\ & 22: & istore 4 & 2 & & \\ & 24: & iload\_3 & 1 & & \\ & 25: & iload 4 & 2 & & \\ & 27: & if\_icmple 41 & 4 & & \\ B6 & & & 73 & 10 & 730 & 0 & 0 \\ & 30: & aload\_0 & 1 & & \\ & 31: & iload\_2 & 1 & & \\ & 32: & iload\_3 & 1 & & \\ & 33: & iastore & 32 & & \\ & 34: & aload\_0 & 1 & & \\ & 35: & iload\_2 & 1 & & \\ & 36: & iconst\_1 & 1 & & \\ & 37: & isub & 1 & & \\ & 38: & iload 4 & 2 & & \\ & 40: & iastore & 32 & & \\ B7 & & & 15 & 10 & 150 & 10 & 150 \\ & 41: & iinc 2, 1 & 11 & & \\ & 44: & goto 8 & 4 & & \\ B8 & & & 15 & 4 & 60 & 4 & 60 \\ & 47: & iinc 1, -1 & 11 & & \\ & 50: & goto 2 & 4 & & \\ B9 & & & & 1 & & 1 & \\ & 53: & return & & & \\ \midrule \multicolumn{4}{l}{Execution time calculated} & & 1,799 & & 1,069 \\ \multicolumn{4}{l}{Execution time measured} & & 1,799 & & 1,069 \\ 1.1 jop/doc/book/results/results.tex http://www.opencores.org/cvsweb.shtml/jop/doc/book/results/results.tex?rev=1.1&content-type=text/x-cvsweb-markup Index: results.tex =================================================================== %\input{../preamble} %\chapter{Results} In this chapter, we will present the evaluation results for JOP, with respect to size, performance and WCET. \tablename~\ref{tab_results_proc} compares JOP with other Java hardware solutions (see also Chapter~\ref{chap:related}). The column year indicates the first date at which the processor became available or the first publication about the processor. The research project Komodo has now ceased, while FemtoJava is still being used as a basis for active research. We can see that JOP is the smallest realization in an FPGA and also has the highest clock frequency. JOP also has a minimum CPI of 1 while, for Komodo and FemtoJava, the minimum CPIs are four and three respectively. \begin{table}[tbh] \centering {\footnotesize \begin{tabular} {|>{\bfseries}m{1.6cm}|>{\raggedright}m{1.6cm}|>{\raggedright}m{1.7cm}|r|>{\raggedright}m{2cm}|r|r|} % {p{1.5cm}p{1.5cm}p{1.4cm}p{1.6cm}cp{1.4cm}cp{2.5cm}} \hline & Target & Size & Speed & Java & Min. & Year \\ & technology & & [MHz] & standard & CPI & \\ \hline JOP & Altera, Xilinx FPGA & 1830 LCs, 3KB RAM & 100 & J2ME CLDC & 1 & 2001 \\ \hline picoJava & No realization & 128K gates + memory & & Full & 1 & 1999 \\ \hline aJile & ASIC 0.25$\mu$ & 25K gates + ROM & 100 & J2ME CLDC & & 2000 \\ \hline Moon & Altera FPGA & 3660 LCs, 4KB RAM & & & & 2000 \\ \hline Lightfoot & Xilinx FPGA & 3400 LCs & 40 & & & 2001 \\ \hline Komodo & Xilinx FPGA & 2600 LCs & 33 & & 4 & 2000 \\ \hline FemtoJava & Altera Flex 10K & 2000 LCs & 4 & Subset: 69 bytecodes, 16-bit ALU & 3 & 2001 \\ \hline \end{tabular} } \caption{Comparison of Java hardware with JOP} \label{tab_results_proc} \end{table} In the following section, the hardware platform that is used for benchmarking is described. This is followed by a comparison of JOP's resource usage with other soft-core processors. In the `General Performance' section, a number of different solutions for embedded Java are compared at the bytecode level and at the application level. The basic properties of the real-time scheduler are evaluated using the Reference Implementation (RI) of the RTSJ on a Linux system and the real-time profile from Section~\ref{sec:rtprof} on top of JOP. It is also shown that our objective of providing an easy target for WCET analysis has been achieved. This chapter concludes with a short description of real-world applications that use JOP. \section{Hardware Platforms} During the development of JOP and its predecessors, several different FPGA boards were developed. The first experiments involved using Altera FPGAs EPF8282,\linebreak[4] EPF8452, EPF10K10 and ACEX 1K30 on boards that were connected to the printer port of a PC for configuration, download and communication. The next step was the development of a stand-alone board with FLASH memory and static RAM. This board was developed in two variants, one with an ACEX 1K50 and the other with a Cyclone EP1C6 or EP1C12. Both boards are pin-compatible and are used in commercial applications of JOP. The Cyclone board is the hardware that is used for the following evaluations. This board is an ideal development system for JOP. Static RAM and FLASH are connected via independent buses to the FPGA. All unused FPGA pins and the serial line are available via four connectors. The FLASH can be used to store configuration data for the FPGA and application program/data. The FPGA can be configured with a ByteBlasterMV download cable or loaded from the flash (with a small CPLD on board). As the FLASH is also connected to the FPGA, it can be programmed from the FPGA. This allows for upgrades of the Java program and even the processor core itself in the field. The board is slightly different from other FPGA prototyping boards, in that its connectors are on the bottom side. Therefore, it can be used as a module (60mm x 48mm), i.e.\ as part of a larger board that contains the periphery. The Cyclone board contains: % %\begin{samepage} \begin{itemize} \item Altera Cyclone EP1C6Q240 or EP1C12Q240 \item Step Down voltage regulator (1V5) \item Crystal clock (20MHz) at the PLL input (up to 640MHz internal) \item 512KB FLASH (for FPGA configuration and program code) \item 1MB fast asynchronous RAM (15 ns) \item Up to 128MB NAND FLASH \item ByteBlasterMV port \item Watchdog with LED \item EPM7064 PLD to configure the FPGA from the FLASH on watchdog reset \item Serial interface driver (MAX3232) \item 56 general-purpose IO pins \end{itemize} %\end{samepage} % The RAM consists of two independent 16-bit banks (with their own address and control lines). Both RAM chips are on the bottom side of the PCB, directly under the FPGA pins. As the traces are very short (under 10mm), it is possible to use the RAMs at full speed without reflection problems. The two banks can be combined to form 32-bit RAM or support two independent CPU cores. Pictures and the schematic of the board can be found in Appendix~\ref{appx:cycore}. An expansion board hosts the CPU module and provides a complete Java processor system with Internet connection. A step down switching regulator with a large AC/DC input range supplies the core board. All input and output pins are EMC/ESD-protected and routed to large connectors (5.08mm Phoenix). Analog comparators can be used to build sigma-delta ADCs. For FPGA projects with a network connection, a CS8900 Ethernet controller with an RJ45 connector is included on the expansion board. \section{Resource Usage} Cost, alongside energy consumption, is an important issue for embedded systems. The cost of a chip is directly related to the die size (the cost per die is roughly proportional to the square of the die area \cite{Hennessy02}). Chips with fewer gates also consume less energy. Processors for embedded systems are therefore optimized for minimum chip size. In this section, we will compare JOP with different processors in terms of size. One major design objective in the development of JOP was to create a small system that could be implemented in a low-cost FPGA. \tablename~\ref{tab_results_compare} shows the resource usage for different configurations of JOP and different soft-core processors implemented in an Altera EP1C6 FPGA \cite{AltCyc}. Estimating equivalent gate counts for designs in an FPGA is problematic. It is therefore better to compare the two basic structures, LC (logic cell) and memory. \begin{table} \centering \begin{tabular}{lcD{.}{.}{1.2}D{.}{.}{1}} \toprule Processor & Resources & \multicolumn{1}{c}{Memory} & \multicolumn{1}{c}{fmax} \\ & [LC] & \multicolumn{1}{c}{[KB]} & \multicolumn{1}{c}{[MHz]} \\ \midrule JOP Minimal & 1,077 & 3.25 & 98 \\ JOP Basic & 1,452 & 3.25 & 98 \\ JOP Typical & 1,831 & 3.25 & 101 \\ Lightfoot\footnotemark & 3,400 & 1 & 40 \\ NIOS A & 1,828 & 6.2 & 120 \\ NIOS B & 2,923 & 5.5 & 119 \\ SPEAR\footnotemark & 1,700 & 8 & 80 \\ \bottomrule \end{tabular} \caption{FPGA soft-core processors} \label{tab_results_compare} \end{table} \addtocounter{footnote}{-1} \footnotetext[1]{The data for the Lightfoot processor is taken from the data sheet \cite{Lightfoot}. The frequency used is that in a Vertex-II device from Xilinx. JOP can be clocked at 100MHz in the Vertex-II device, making this comparison valid.} \stepcounter{footnote} \footnotetext[2]{As SPEAR uses internal memory blocks in asynchronous mode it is not possible to synthesize it without modification for the Cyclone FPGA. The clock frequency of SPEAR in an Altera Cyclone is an estimate based on following facts: SPEAR can be clocked at 40MHz in an APEX device and JOP can be clocked at 50MHz in the same device.} All configurations of JOP contain a memory interface to a 32-bit static RAM and an 8-bit FLASH for the Java program and configuration data. The minimum configuration implements multiplication and the shift operations in microcode. In the basic configuration, these operations are implemented as a sequential Booth multiplier and a single-cycle barrel shifter. The typical configuration contains a variable block instruction cache (1KB, 4 blocks -- see Section~\ref{sec:cache:var}) and some useful I/O devices such as an UART and a timer with interrupt logic for multi-threading. The typical configuration of JOP needs about 30\% of the LCs in a Cyclone EP1C6, thus leaving enough resources free for application-specific logic. Lightfoot \cite{Lightfoot} is a commercial Java processor, targeted at Xilinx FPGA architectures. We can see from Table~\ref{tab_results_compare} that this processor needs about twice the resources of JOP. As a reference, NIOS \cite{NIOS}, Altera's popular RISC soft-core, is also included in the list. NIOS has a 16-bit instruction set, a 5-stage pipeline and can be configured with a 16 or 32-bit datapath. Version A is the minimum configuration of NIOS. Version B adds an external memory interface, multiplication support and a timer. Version A is comparable with the minimal configuration of JOP, and Version B with its typical configuration. SPEAR \cite{Delvai:ECRTS2003} (Scalable Processor for Embedded Applications in Real-time Environments) is a 16-bit processor with deterministic execution times. SPEAR contains predicated instructions to support single-path programming. SPEAR is included in the list as it is also a processor designed for real-time systems. To prove that the VHDL code for JOP is as portable as possible, JOP was also implemented in a Xilinx Spartan-3 FPGA. Only the instantiation and initialization code for the on-chip memories is vendor-specific, whilst the rest of the VHDL code can be shared for the different targets. JOP consumes about the same LC count (1844 LCs) in the Spartan device, but has a slower clock frequency (83MHz). From this comparison we can see that we have achieved our objective of designing a small processor. The Java processor, Lightfoot, is 2.3 times larger (and 2.5 times slower) than JOP in the basic configuration. A typical 32-bit RISC processor consumes about 1.6 to 1.8 times the resources of JOP. However, the RISC processor can be clocked 20\% faster than JOP in the same technology. The only processor that is similar in size is SPEAR. However, while SPEAR is a 16-bit processor, JOP contains a 32-bit datapath. %\begin{table} % \centering % \begin{tabular}{llrr} % \toprule % JOP & Optimization & Resource [LC] & fmax [MHz] \\ % \midrule % Typical (iomin) & default & 1930 & 101 \\ % & Reg.pack: min. Area w/Chains & 1764 & 99 \\ % & + Synth. Area & 1746 & 98 \\ % & Reg.pack: min. Are w/Chains + Synth speed & 1800 % & 100 \\ % Core (no io) & Reg.pack, Synth Area & 1452 & 98 \\ % - mul & & 1296 & 99 \\ % - shift & with single bit sar & 1077 & 98 \\ % \midrule % Typical, Spartan-3 & speed & 1844 & 83 \\ % Typical, Spartan-3 & area & 1689 & 74 \\ % \bottomrule % \end{tabular} % \caption{FPGA Soft Core Processors} % \label{tab_results_jop_versions} %\end{table} %Different LC/gate values: % from tow-level stack: 1LC = 5.4 gates % from ram stack: 1LC = 5.5 gates % from register stack: 1LC = 5.9 gates % Moon: 3660LCs - 27K gates + 3KB ROM + 1KB RAM: 1LC = 7.4 gates % Lightfoot: 3400LCs - 25K gates: 1LC = 7.4 gates % picoJava: ROM/RAM - 1Bit = 1 gate % % JOP: 1831 LCs, 3.25KB: 1831*6 + 26624*1.5 = 11K + 39k % Pentium MMX: 4.5 Mio. Transistors -> 1100K Table~\ref{tab:results:gate:count} provides gate count estimates for JOP, picoJava, the aJile processor, and the Intel Pentium MMX processor that is used in the benchmarks in the next section. Equivalent gate count for an LC\footnote{The factors are derived from the data provided for various processors in Chapter~\ref{chap:related} and from the resource estimates in Section~\ref{sec:stack}.} varies between 5.5 and 7.4 -- we chose a factor of 6 gates per LC and 1.5 gates per memory bit for the estimated gate count for JOP in the table. JOP is listed in the typical configuration that consumes 1831 LCs. The Pentium MMX contains 4.5M transistors \cite{pentium:mmx} that are equivalent to 1125K gates. \begin{table} \centering \begin{tabular}{lrrr} \toprule Processor & \multicolumn{1}{c}{Core} & \multicolumn{1}{c}{Memory} & \multicolumn{1}{c}{Sum.} \\ & \multicolumn{1}{c}{[gate]} & \multicolumn{1}{c}{[gate]} & \multicolumn{1}{c}{[gate]}\\ \midrule JOP & 11K & 39K & 50K\\ picoJava & 128K & 314K & 442K\\ aJile & 25K & 912K & 937K\\ Pentium MMX & & & 1125K\\ \bottomrule \end{tabular} \caption{Gate count estimates for various processors} \label{tab:results:gate:count} \end{table} We can see from the table that the on-chip memory dominates the overall gate count of JOP, and to an even greater extent, of the aJile processor. The aJile processor is roughly the same size as the Pentium MMX, and both are about 20 times larger than JOP. \section{Performance} \label{sec:performance} In this section, we will evaluate the performance of JOP in relation to other Java systems. Although JOP is intended as a processor with a low WCET for all operations, its general performance is still important. In the first section, we will evaluate JOP's average performance. In the section that follows it, the implementation of the simple real-time profile, as described in Section~\ref{sec:rtprof}, on JOP is compared to the RI of the RTSJ on top of Linux. \subsection{General Performance} Running benchmarks is problematic, both generally and especially in the case of embedded systems. The best benchmark would be the application that is intended to run on the system being tested. To get comparable results SPEC provides benchmarks for various systems. However, the one for Java, the SPECjvm98 \cite{SPECJvm98}, is usually too large for embedded systems. Due to the absence of a \emph{standard} Java benchmark for embedded systems, a small benchmark suit that should run on even the smallest device is provided here. It contains several micro-benchmarks for evaluating CPI for single bytecodes or short sequences of bytecodes, a synthetic benchmark (the Sieve of Eratosthenes) and two application benchmarks. To provide a realistic workload for embedded systems, a real-time application was adapted to create the first application benchmark (Kfl). The application is taken from one of the nodes of a distributed motor control system \cite{jop:wises03} (see Section~\ref{sec:app:kfl}). A simulation of both the environment (sensors and actors) and the communication system (commands from the master station) forms part of the benchmark, so as to simulate the real-world workload. The second application benchmark is an adaptation of a tiny TCP/IP stack (Ejip) for embedded Java. This benchmark contains two UDP server/clients, exchanging messages via a loopback device. As we will see, there is a great variation in processing power across different embedded systems. To cater for this variation, all benchmarks are `self adjusting'. Each benchmark consists of an aspect that is benchmarked in a loop and an `overhead' loop that contains any overheads from the benchmark that should be subtracted from the result (this feature is designed for the micro-benchmarks). The loop count adapts itself until the benchmark runs for more than a second. The number of iterations per second is then calculated, which means that higher values indicate better performance. The benchmark framework only needs two system functions: one to measure time in millisecond resolution and one to print the results. These functions are encapsulated in \code{LowLevel.java} and can be adapted to environments, in which the full Java library is not available. For example, the leJOS system has very limited output capabilities and there is therefore a special \code{LowLevel.java} for this device. The following list gives a brief description of the Java systems that were benchmarked: \begin{description} \item[JOP] is implemented in a Cyclone FPGA, running at 100MHz. The main memory is a 32-bit static RAM (15ns) with an access time of 3 clock cycles. \item[leJOS] As an example for a low-end embedded device we use the RCX robot controller from the LEGO MindStorms series. It contains a 16-bit Hitachi H8300 microcontroller \cite{hitachi:h8}, running at 16MHz. leJOS \cite{lejos} is a tiny interpreting JVM for the RCX. \item[TINI] is an enhanced 8051 clone running a software JVM. The results were taken from a custom board with a 20MHz crystal, and the chip's PLL is set to a factor of 2. The TINIOS firmware revision running on the board is 1.12p9. \item[Komodo] Komodo \cite{komodo2003} is a Java processor as a basis for research on real-time scheduling on a multithreaded microcontroller (see Section~\ref{subsec:related:komodo}). The benchmark results were obtained by Matthias Pfeffer \cite{Pfeffer} on a cycle-accurate simulation of Komodo. The values are obtained without garbage collection. According to Pfeffer, Komodo can be clocked with 33MHz in a Xlinix XCV800. \item[JStamp] aJile's JEMCore is a direct-execution Java processor that is available in two different versions: the aJ-80 and the aJ-100 \cite{aJile}. The aJ-100 provides a generic 8-bit, 16-bit or 32-bit external bus interface, while the aJ-80 only provides an 8-bit interface. A development system, the JStamp \cite{JStamp}, was used for this benchmark. It contains the aJ-80, clocked at 74MHz. \item[SaJe] is a board that contains the aJ-100 clocked with 100MHz and 10ns SRAM. \item[EJC] The EJC (Embedded Java Controller) platform \cite{EJC} is a typical example of a JIT system on a RISC processor. The system is based on a 32-bit ARM720T processor running at 74MHz. It contains up to 64 MB SDRAM and up to 16 MB of NOR flash. \item[SUN jvm] is the Sun JVM 1.4.1, running on a 266MHz Pentium MMX under Linux. \item[gcj] is the GNU compiler for Java. This configuration represents the batch compiler solution, running on a 266MHz Pentium. \item[Xint] As a reference the benchmark is also run with the Sun JVM in interpreting mode (with option -Xint). \item[MB] is the realization of Java on a RISC processor for an FPGA (Xilinx MicroBlaze \cite{microblaze}). Java is compiled to C with a Java compiler for real-time systems \cite{Java2C} and the C program is compiled with the standard GNU toolchain. \end{description} % In Figure~\ref{fig:results:app:bench}, the geometric mean of the two application benchmarks is shown. The unit used for the result is iterations per second. Note that the vertical axis is logarithmic, in order to obtain useful figures to show the great variation in performance. The top diagram shows absolute performance, while the bottom diagram shows the same results scaled to a 1MHz clock frequency. The results of the application benchmarks and the geometric mean are shown in \tablename~\ref{tab:results:bench:app}. The raw data for all benchmarks can be found in Appendix~\ref{appx:bench}. % at \tablename~\ref{tab:appendix:bench:all1} and % \ref{tab:appendix:bench:all2}. It should be noted that scaling to a single clock frequency could prove problematic. The relation between processor clock frequency and memory access time cannot always be maintained. To give an example, if we were to increase the results of the 100MHz JOP to 1GHz, this would also involve reducing the memory access time from 15ns to 1.5ns. Processors with 1GHz clock frequency are already available, but the fastest asynchronous SRAM to date has an access time of 10ns. To compare the performance relatively to the size of the different systems, Figure~\ref{fig:results:app:bench:double:scaled} shows the performance of JOP, the aJ100 and the two PC versions relative to the gate count (from Table~\ref{tab:results:gate:count}) and clock frequency. Relative to size and clock frequency, JOP outperforms the aJile processor by a factor of 19 and even the JIT-compiler on the Pentium MMX by a factor of 4. All the benchmarks measure how often a function is executed per second. Therefore, execution time is only measured indirectly -- a higher value means shorter execution time. In the Kfl benchmark, this function contains the main loop of the application (see Listing~\ref{lst:results:main}) that is executed in a periodic cycle in the original application. In the benchmark the wait for the next period is omitted, so that the time measured solely represents execution time. The UDP benchmark contains the generation of a request, transmitting it through the UDP/IP stack, generating the answer and transmitting it back as a benchmark function. The iteration count is the number of received answers per second. In the application benchmarks, the main function is executed in a loop until one second (or a longer period of time) has elapsed. For the application benchmark, there is no `overhead' loop. This feature is only used in the micro-benchmarks. As the benchmark is self-adjusting, the measured time can also be longer than one second. The result is the iteration count, scaled to one second. The accuracy of the measurement depends on the resolution of the system time. For the measurements under Linux, the system time has a resolution of 10ms, resulting in an inaccuracy of 1\%. The accuracy of the system time on leJOS, TINI and the aJile is not known, but is considered to be in the same range. For JOP, a $\mu$s counter is used for time measurement. \begin{figure} \centering \includegraphics[width=\excelwidth]{results/results_app_bench} \\ \vspace{0.5cm} \includegraphics[width=\excelwidth]{results/results_app_bench_scaled} \caption{Performance comparison of different Java systems with application benchmarks. The diagrams show the geometric mean of the two benchmarks in iterations per second -- a higher value means higher performance. The top diagram shows absolute performance, while the bottom diagram shows the result scaled to 1MHz clock frequency. } \label{fig:results:app:bench} \end{figure} \begin{table} \centering \begin{tabular}{lD{.}{.}{2}rrD{.}{.}{2}D{.}{.}{2}} \toprule & \multicolumn{1}{c}{Frequency} & \multicolumn{1}{c}{Kfl} & \multicolumn{1}{c}{UDP/IP} & \multicolumn{1}{c}{Geom. Mean} & \multicolumn{1}{c}{Per MHz}\\ & \multicolumn{1}{c}{[MHz]} & \multicolumn{4}{c}{[Iterations/s]}\\ \midrule JOP & 100 & 14,222 & 6,050 & 9,276 & 93\\ leJOS & 16 & 25 & 13 & 18 & 1\\ TINI & 40 & 64 & 29 & 43 & 1\\ Komodo & 33 & 924 & 520 & 693 & 21\\ JStamp & 74 & 2,221 & 1,004 & 1,493 & 20\\ SaJe & 103 & 14,148 & 6,415 & 9,527 & 92\\ EJC & 74 & 9,893 & 2,822 & 5,284 & 71\\ Sun jvm & 266 & 212,952 & 91,851 & 139,857 & 526\\ gcj & 266 & 139,884 & 38,460 & 73,348 & 276\\ Xint & 266 & 17,310 & 8,747 & 12,305 & 46\\ MB 2KB/0KB & 100 & 3,792 & & & \\ % MB 8KB/8KB & 100 & 25,090 & & & \\ \bottomrule \end{tabular} \caption{Application benchmarks on different Java systems. The table shows the benchmark results in iterations per second -- a higher value means higher performance. } \label{tab:results:bench:app} \end{table} \begin{figure} \centering \includegraphics[width=\excelwidth]{results/results_app_bench_double_scaled} \caption{Performance comparison of different Java systems with application benchmarks. The diagram shows the result scaled to the chip size (Kgates) and clock frequency (MHz). } \label{fig:results:app:bench:double:scaled} \end{figure} \subsubsection{Discussion} When comparing JOP and the aJile processor against leJOS and TINI, we can see that a Java processor is up to 500 times faster than an interpreting JVM on a standard processor for an embedded system. The average performance of JOP is a little bit better than a JIT-compiler solution on an embedded system, as represented by the EJC system. Even when scaled to the same clock frequency, each compiling JVM on a PC (Sun jvm and gcj) is much faster than either embedded solution. However, as we saw in Section~\ref{sec:cache}, the kernel of the application is smaller than 4KB. It therefore fits in the level one cache of the Pentium MMX (16KB + 16KB level one cache). For a comparison with a Pentium class processor we would need a larger application. JOP is about 6 times faster than the aJ80 Java processor on the popular JStamp board. However, the aJ80 processor only contains an 8-bit memory interface, and suffers from this bottleneck. The SaJe system contains the aJ100 with 32-bit, 10ns SRAMs and is as fast as JOP with its 15ns SRAMs. The MicroBlaze system is a representation of a Java batch-compilation system for a RISC processor. MicroBlaze is configured with the same cache\footnote{The MicroBlaze with a 8KB data and 8KB instruction cache is about 2.5 times faster than JOP. However, a 16KB memory is not available in low-cost FPGAs and is an unbalanced system with respect to the LC/memory relation. Furthermore, the benchmark fits into a 4KB cache and the resulting measurement does not include main memory access.} as JOP and clocked at the same frequency. JOP is about three times faster than this solution, thus showing that native execution of Java bytecodes is faster than batch-compiled Java on a similar system. However, the results of the MicroBlaze solution are at a preliminary stage\footnote{As not all language constructs can be compiled, only the Kfl benchmark was measured.}, as the Java2C compiler \cite{Java2C} is still under development. The micro-benchmarks are intended to give insight into the implementation of the JVM. In Table~\ref{tab:results:bench:clock}, we can see the execution time in clock cycles of various bytecodes. As almost all bytecodes manipulate the stack, it is not possible to measure the execution time for a single bytecode. As a minimum requirement, a second instruction is necessary to reverse the stack operation. \begin{table} \centering \begin{tabular}{lrrrrrrr} \toprule & JOP & leJOS & TINI & Komodo & JStamp & SaJe & Xint\\ \midrule iload iadd & 2 & 836 & 789 & 8 & 38 & 8 & 17\\ iinc & 11 & 422 & 388 & 4 & 41 & 11 & 2\\ ldc & 10 & 1,340 & 1,128 & 40 & 67 & 9 & 31\\ if\_icmplt taken & 6 & 1,609 & 1,265 & 24 & 42 & 18 & 36\\ if\_icmplt not taken & 6 & 1,520 & 1,211 & 24 & 40 & 14 & 37\\ getfield & 25 & 1,879 & 2,398 & 48 & 142 & 23 & 39\\ getstatic & 17 & 1,676 & 4,463 & 80 & 102 & 15 & 40\\ iaload & 30 & 1,082 & 1,543 & 28 & 74 & 13 & 30\\ invoke & 128 & 4,759 & 6,495 & 384 & 349 & 112 & 182\\ invoke static & 101 & 3,875 & 5,869 & 680 & 271 & 92 & 164\\ invoke interface & 146 & 5,094 & 6,797 & 1617 & 531 & 148 & 193\\ \bottomrule \end{tabular} \caption{Execution time in clock cycles for various JVM bytecodes} \label{tab:results:bench:clock} \end{table} For JOP we can deduce that the WCET for simple bytecodes (as given in Appendix~\ref{appx:bytecode}) is also the average execution time. We can see that the combination of \code{iload} and \code{iadd} executes in two cycles, which means that each of these two operations is executed in a single cycle. The \code{iinc} bytecode is one of the few instructions that do not manipulate the stack and can be measured alone. As \code{iinc} is not implemented in hardware, we have a total of 11 cycles that are executed in microcode. It is fair to assume that this comprises too great an overhead for an instruction that is found in every iterative loop with an integer index. However, the decision to implement this instruction in microcode was derived from the observation that the dynamic instruction count for \code{iinc} is only 2\% (see Section~\ref{sec:bench:jvm}). The sequence for the branch benchmark (\code{if\_icmplt}) contains the two load instructions that push the arguments onto the stack. The arguments are then consumed by the branch instruction. This benchmark verifies that a branch requires a constant four cycles on JOP, whether it is taken or not. For compiling versions of the JVM, these micro-benchmarks do not produce useful results. The compiler performs optimizations that make it impossible to measure execution times at this fine a granularity. During the evaluation of the aJile system, unexpected behavior was observed. The aJ80 on the JStamp board is clocked at 7.3728MHz and the internal frequency can be set with a PLL. The aJ80 is rated for 80MHz and the maximum PLL factor that can be used is therefore ten. Running the benchmarks with different PLL settings gave some strange results. For example, with a PLL multiplier setting of ten, the aJ80 was about 12.8 times faster! Other PLL factors also resulted in a greater than linear speedup. The only explanation we could find was that the internal time, \code{System.currentTimeMillis()}, used for the benchmarks depends on the PLL setting. A comparison with the wall clock time showed that the internal time of the aJ80 is 23\% faster with a PLL factor of 1 and 2.4\% faster with a factor of ten -- a property we would not expect on a processor that is marketed for real-time systems. The SaJe board is also clocked with 7.3728MHz and the PLL factor is set to 14. This gives a 103.2192MHz internal clock frequency. However, it is not known how accurate the internal time is in this setting. The results for the SaJe board can also suffer from the problem described above. % aJ80 at 7.3728MHz: internal 11' are external 8'57'' ... 1.23 times too fast % aJ80 at 73.728MHz: internal 21' are external 20'30'' ... 1.024 times too fast \subsubsection{Execution Time Jitter} For real-time systems, the worst-case of the execution time is of primary importance. We have measured the execution times of several iterations of the main function from the Kfl benchmark. \figurename~\ref{fig:results:kfl:exe} shows the measurements, scaled to the minimum execution time. \begin{figure} \centering \includegraphics[width=\excelwidth]{results/results_kfl_exe} \\ \vspace{0.5cm} \includegraphics[width=\excelwidth]{results/results_kfl_exe_detail} \caption{Execution time of the main function for the Kfl benchmark. The values are scaled to the minimum execution time. The bottom figure shows a detail of the top figure. } \label{fig:results:kfl:exe} \end{figure} %\begin{figure} % \centering % \includegraphics{results/results_kfl_exe_detail} % \caption{Execution time with KFL benchmark (detail).} % \label{fig:results:kfl:exe:detail} %\end{figure} A period of four iterations can be seen. This period results from simulating the commands from the base station that are executed every fourth iteration. At iteration 10, a command to start the motor is issued. We see the resulting rise in execution time at iteration 12 to process this command. At iteration 54, the simulation triggers the end sensor and the motor is stopped. The different execution times in the different modes of the application are inherent in the design of the simulation. However, the ratio between the longest and the shortest period is five for the JStamp, four for the gcj system and only three for JOP. Therefore, a system with an aJile processor needs to be 1.7 times faster than JOP in order to provide the same WCET for this measurement. At iteration 33, we can see a higher execution time for the JStamp system that is not seen on JOP. This variation at iteration 33 is not caused by the benchmark. The execution time under gcj on the Linux system showed some very high peaks (up to ten times the minimum, not shown in the figures). This observation was to be expected, as the gcj/Linux system is not a real-time solution. The Sun JIT-solution is omitted from the figure. As a result of the invocation of the compiler at some point during the simulation, the worst-case ratio between the maximum and minimum execution time was 1313 -- showing that a JIT-compiler is impractical for real-time applications. It should be noted that execution time measurement is not a safe method for obtaining WCET estimates. However, in situations where no WCET analysis tool is available, it can give some insight into the WCET behavior of different systems. \subsection{Real-Time Performance} \label{subsec:rt:perf} In this section, the implementation of the simple real-time profile (from Section~\ref{sec:rtprof}) with JOP is compared with the Reference Implementation (RI) of the RTSJ (see Section~~\ref{sec:rtsj}) on top of Linux. We use the Linux platform for the comparison, as it is the only platform for which the RTSJ is available. The RI is an interpreting implementation of the JVM that is, however, not optimized for performance. A commercial version of the RTSJ, JTime by TimeSys, should perform better. However, it was not possible to get a license of JTime for research purposes. JOP is implemented in Altera's low-cost Cyclone EP1C6 FPGA, and clocked with 100MHz. The test results for the RI were obtained on an Intel Pentium MMX 266MHz, running Linux with two different kernels: a generic kernel version 2.4.20 and the real-time kernel from TimeSys \cite{TimeSysLinux}, as recommended for the RI. For each test, 500 measurements were taken. Time was measured using a hardware counter in JOP and the time stamp counter of the Pentium processor under Linux. \subsubsection{Periodic Threads} Many activities in real-time systems must be performed periodically. Low release jitter is of major importance for tasks such as control loops. The test setting is similar to the periodic thread test in \cite{828497}. A single real-time thread only calls \code{waitForNextPeriod()} in a loop and records the time between subsequent calls. A second idle thread, with a lower priority, merely consumes processing time. This test setting results in two context switches per period. \tablename~\ref{tab_results_periodic_jop} shows the average, standard deviation and extreme values for different period times on JOP. The same values are shown in \tablename~\ref{tab_results_periodic_ri} for the RI. Please note that the values are in $\mu$s for JOP and in ms for the RI. \begin{table} \centering \begin{tabular}{rrd{2.0}rr} \toprule \cc{Period} & \cc{Avg.} & \cc{Std. Dev.} & \cc{Min.} & \cc{Max.} \\ \cc{[$\mu$s]} & \cc{[$\mu$s]} & \cc{[$\mu$s]} & \cc{[$\mu$s]} & \cc{[$\mu$s]} \\ \midrule % 50 & 50 & 14 & 36 & 66\\ & pre cache version 50 & 50 & 13 & 35 & 63\\ 70 & 70 & 0 & 70 & 70\\ 100 & 100 & 0 & 100& 100\\ 500 & 500 & 0 & 500 & 500\\ 1,000 & 1,000& 0 & 1,000 & 1,000\\ \bottomrule \end{tabular} \caption{Jitter of periodic threads with JOP} \label{tab_results_periodic_jop} \end{table} \begin{table} \centering \begin{tabular}{d{1}d{2}d{2.3}d{3}d{2}} \toprule \cc{Period} & \cc{Avg.} & \cc{Std. Dev.} & \cc{Min.} & \cc{Max.} \\ \cc{[ms]} & \cc{[ms]} & \cc{[ms]} & \cc{[ms]} & \cc{[ms]} \\ \midrule 5 & 4.0 & 7.92 &0.017 & 19.90 \\ 10 & 6.6 & 9.34 &0.019 & 19.94 \\ 20 & 20.0 & 0.015 & 19.87 & 20.14 \\ 35 & 35.0 & 5.001 & 29.75 & 40.25 \\ 50 & 50.0 & 0.018 & 49.95 & 50.06 \\ 100 & 100.0 & 0.002 & 99.94 & 100.1 \\ \bottomrule \end{tabular} \caption{Jitter of periodic threads with RI/RTSJ} \label{tab_results_periodic_ri} \end{table} Using microsecond accurate timer interrupts, programmed by the scheduler, results in excellent performance of periodic threads in JOP. No jitter from the scheduler can be seen with a single thread at periods longer than 70$\mu$s. The measurement for the RI excludes the first values measured. The first values are misleading as the RI behaves unpredictably at \emph{startup}. The RI performs inaccurately at periods below 20ms. This effect has also been observed in \cite{701668}. Larger periods that are multiples of 10ms have very low jitter. However, using a period such as 35ms shows a standard deviation of five ms. A detailed look into the collected samples only shows values of 30 and 40ms. This implies a timer tick of 10ms in the underlying operating system. No significant difference is observed when running this test on the generic Linux kernel and the TimeSys kernel. The commercial version of the TimeSys Linux kernel should perform better as the resolution of the timer tick is 1ms and a programmable time can be used for periodic threads. However, it was not possible to obtain a license to evaluate the combination of JTime on the commercial Linux kernel. \tablename~\ref{tab_results_periodic_ri} represents the measurements on the generic kernel. This comparison shows the advantage of an adjustable timer interrupt over a fixed timer tick. \subsubsection{Context Switch} This test setting consists of two threads. A low priority thread continuously stores the current time in a shared variable. A high priority periodic thread measures the time difference between this value and the time immediately after \code{waitForNextPeriod()}. \tablename~\ref{tab_results_context} gives the times for the context switch in processor clock cycles. \begin{table} \centering \begin{tabular}{lrd{1}rr} \toprule & \cc{Avg.} & \cc{Std. Dev.} & \cc{Min.} & \cc{Max.} \\ \midrule % JOP & 2,878 & 7.97 & 2,876 & 2,909 \\ % JOP & 2,878 & 8 & 2,876 & 2,909 \\ % pre cache version JOP & 2,686 & 14 & 2,676 & 2,709 \\ RI Linux & 4,253 & 1,239 & 3,232 & 19,628 \\ RI TS Linux & 12,923 & 1,145 & 11,529 & 21,090 \\ \bottomrule \end{tabular} \caption{Time for a thread switch in clock cycles} \label{tab_results_context} \end{table} This test did not produce the expected behavior from the RI on the generic Linux kernel. When the low priority thread ran in this tight loop, the high priority thread was not scheduled. After inserting a \code{Thread.yield()} and an operating system call, such as \code{System.out.print()}, in this loop, the test performed as expected. This indicates a major problem in either the RI or the operating system scheduler. This problem did not occur when the RI was run on the TimeSys Linux kernel. However, the context switch time on the TimeSys kernel is three times longer than on the standard kernel. \subsubsection{Asynchronous Event Handler} In this test setting, a high priority event handler is triggered by a low priority periodic thread. As \code{AsynchEventHandler} performs poorly in the RI (see \cite{701668}), a \code{BoundAsynchEventHandler} is used for the RI test program. The time elapsed between the invocation of \code{fire()} and the first statement of the event handler was measured. \tablename~\ref{tab_results_event} shows the elapsed times in clock cycles for JOP and the RTSJ RI. \begin{table} \centering \begin{tabular}{lrd{1}rr} \toprule & \cc{Avg.} & \cc{Std. Dev.} & \cc{Min.} & \cc{Max.} \\ \midrule % JOP & 2,986 & 7.3 & 2,822 & 2,986 \\ % JOP & 2,986 & 7 & 2,822 & 2,986 \\ % pre cache version JOP & 2,935 & 7 & 2,773 & 2,935 \\ RI Linux & 53,685 & 7,014 & 47,400 & 87,196 \\ RI TS Linux & 69,273 & 7,832 & 63,060 & 101,292 \\ \bottomrule \end{tabular} \caption{Dispatch latency of event handlers in clock cycles} \label{tab_results_event} \end{table} The time taken to dispatch an asynchronous event is similar to the context switch time in JOP. This is to be expected as events are scheduled and dispatched as threads. The minimum value only occurred in the first event, all following events having been dispatched in the maximum time. In the RI, the dispatch time is about 12 times larger than a context switch with a significant variation in time. This indicates that the implementation of \code{fire()} and the communication of the event to the underlying operating system are not optimal. The time factor between context switch and event handling on the TimeSys kernel is lower than on the standard kernel, but is nevertheless still significant. \subsubsection{Summary} In this section, we have compared the RTSJ on top of Linux with the implementation of a simple real-time profile on top of JOP. The RTSJ addresses several issues relating to the use of Java for real-time systems. However, the RTSJ is a specification too large and complex to be implemented in small embedded systems. We therefore use the simpler real-time profile for JOP. Tight integration of the real-time scheduler with the supporting processor results in an efficient platform for Java in embedded real-time systems. A performance comparison between this implementation and the RTSJ showed that a dedicated Java processor without an underlying operating system is more predictable than trying to adopt a general purpose OS for real-time systems. Time will show if an implementation of the RTSJ on a \emph{real} RTOS will outperform the presented solution. \section{WCET} \label{sec:wcet} Worst-case execution time (WCET) estimates of tasks are essential for designing and verifying real-time systems. WCET estimates can be obtained either by measurement or static analysis. The problem with using measurements is that the execution times of tasks tend to be sensitive to their inputs. As a rule, measurement does not guarantee safe WCET estimates. Instead, static analysis is necessary for hard real-time systems. Static analysis is usually divided into a number of different phases: % \begin{description} \item[Path analysis] generates the control flow graph (a directed graph of basic blocks) of the program and annotates (manual or automatic) loops with bounds. \item[Low-level analysis] determines the execution time of basic blocks obtained by the path analysis. A model of the processor and the pipeline provides the execution time for the instruction sequence. \item[Global low-level analysis] determines the influence of hardware features such as caches on program execution time. This analysis can use information from the path analysis to provide less pessimistic values. \item[WCET Calculation] collapses the control flow graph to provide the final WCET estimate. Alternative paths in the graph are collapsed to a single value (the largest of the alternatives) and loops are collapsed once the loop bound is known. \end{description} % For the low-level analysis, a good timing model of the processor is needed. The main problem for the low-level analysis is the execution time dependency of instructions in modern processors that are not designed for real-time systems. JOP is designed to be an easy target for WCET analysis. The WCET of each bytecode can be predicted in terms of number of cycles it requires. There are no dependencies between bytecodes. Each bytecode is implemented by microcode. We can obtain the WCET of a single bytecode by performing WCET analysis at the microcode level. To prove that there are no time dependencies between bytecodes, we have to show that no processor states are \emph{shared} between different bytecodes. \subsection{Microcode Path Analysis} To obtain the WCET values for the individual bytecodes we perform the path analysis at the microcode level. First, we have to ensure that a number of restrictions (from \cite{84850}) of the code are fulfilled: % \begin{itemize} \item Programs must not contain unbounded recursion. This property is satisfied by the fact that there exists no call instruction in microcode. \item Function pointers and computed \code{gotos} complicate the path analysis and should therefore be avoided. Only simple conditional branches are available at the microcode level. \item The upper bound of each loop has to be known. This is the only point that has to be verified by inspection of the microcode. \end{itemize} % To detect loops in the microcode we have to find all backward branches (e.g.\ with a negative branch offset). The branch offsets can be found in a VHDL file (\code{offtbl.vhd}) that is generated during microcode assembly. In the current implementation of the JVM there are ten different negative offsets. However, not each offset represents a loop. Most of these branches are used to share common code. All backward branches found in \code{jvm.asm} are summarized below: \begin{itemize} \item Three branches are found in the initialization code of the JVM. They are not part of a bytecode implementation and can be ignored. \item Five branches are used by exceptions, the interrupt bytecode, and for the call of Java implemented bytecodes. The target of these branches is found in the implementation of \code{invoke} to share part of the microcode sequence. These branches are therefore not part of a loop. \item One branch is found in the implementation of \code{imul} to perform a fixed delay. The iteration count for this loop is constant. \item Two backward branches share the same offset and are used in loops to move data between the stack memory and main memory. This loop is not part of a regular bytecode. It is contained in a system function used by the scheduler for the task switch. The bound for this loop has to be determined in the scheduler code. \end{itemize} % A few bytecodes are implemented in Java. The implementation can be found in the class \code{com.jopdesign.sys.JVM} and can be analyzed in the same way as application code. The bytecodes \code{idiv} and \code{irem} contain a constant loop. The bytecodes \code{new} and \code{anewarray} contain loops to initialize (with zero values) new objects or arrays. The loop is bound by the size of the object or array. The bytecode \code{lookupswitch}\footnote{\codefoot{lookupswitch} is one way of implementing the Java \codefoot{switch} statement. The other bytecode, \codefoot{tableswitch}, uses an index in the table of branch offsets and has therefore a constant execution time.} performs a linear search through a table of branch offsets. The WCET depends on the table size that can be found as part of the instruction. As the microcode sequences are very short, the calculation of the control flow graph for each bytecode is done manually. \subsection{Microcode Low-level Analysis} To calculate the execution time of basic blocks in the microcode, we need to establish the timing of microcode instructions on JOP. All microcode instructions except \code{wait} execute in a single cycle, reducing the low-level analysis to a case of merely counting the instructions. The \code{wait} instruction is used to stall the processor and wait for the memory subsystem to finish a memory transaction. The execution time of the \code{wait} instruction depends on the memory system and, if the memory system is predictable, has a known WCET. A main memory consisting of SRAM chips can provide this predictability and this solution is therefore advised. The predictable handling of DMA, which is used for the instruction cache fill, is explained in Section~\ref{sec:cache:wcet}. The \code{wait} instruction is the only way to stall the processor. Hardware events, such as interrupts (see Section~\ref{sec:interrupt}), do not stall the processor. Microcode is stored in on-chip memory with single cycle access. Each microcode instruction is a single word long and there is no need for either caching or prefetching at this stage. We can therefore omit performing a low-level analysis. No pipeline analysis \cite{EngblomPhD}, with its possible unbound timing effects, is necessary. \subsection{Bytecode Independency} We have seen that all microcode instructions except \code{wait} take one cycle to execute and are therefore independent of other instructions. This property directly translates to independency of bytecode instructions. The \code{wait} microcode instruction provides a convenient way to hide memory access time. A memory read or write can be triggered in microcode (with \code{stmra} and \code{stmwd}) and the processor can continue with microcode instructions. When the data from a memory read is needed, the processor explicitly waits until it becomes available. For a memory store, this wait can be deferred until the memory system is used next. It is possible to initiate the store in a bytecode such as \code{putfield} and continue with the execution of the next bytecode, even when the store has not been completed. In this case, we introduce a dependency over bytecode boundaries, as the state of the memory system is \emph{shared}. To avoid these dependencies that are difficult to analyze, each bytecode that accesses memory waits (preferably at the end of the microcode sequence) for the memory system. Furthermore, the deferring of \code{wait} in a store operation results in an additional \code{wait} in every read operation. Since read operations are more frequent than write operations (15\% vs. 2.5\%, see Section~\ref{sec:bench:jvm}), the performance gain from the hidden memory store is lost. \subsection{WCET of Bytecodes} The control flow of the individual bytecodes together with the basic block length (that directly corresponds with the execution time) and the time for memory access result in the WCET (and BCET) values of the bytecodes. These values can be found in Appendix~\ref{appx:bytecode}. \subsection{Evaluation} \begin{lstlisting}[float,caption={Bubble Sort in Java}, label=lst:results:wcet:sort:prog] final static int N = 5; static void sort(int[] a) { int i, j, v1, v2; // loop count = N-1 for (i=N-1; i>0; --i) { // loop count = (N-1)*N/2 for (j=1; j<=i; ++j) { v1 = a[j-1]; v2 = a[j]; if (v1 > v2) { a[j] = v1; a[j-1] = v2; } } } } \end{lstlisting} We conclude this section with a worst and best case analysis of a classic example, the Bubble Sort algorithm. The values calculated are compared with the measurements of the execution time on JOP on all permutations of the input data. \figurename~\ref{lst:results:wcet:sort:prog} shows the test program in Java. The algorithm contains two nested loops and one condition. We use an array of five elements to perform the measurements for all permutations (i.e. $5!=120$) of the input data. The number of iterations of the outer loop is one less than the array size: $c_1=N-1$, in this case four. The inner loop is executed $c_2 = \sum_{i=1}^{c_1}i = c_1(c_1+1)/2$ times, i.e.\ ten times in our example. The compiled version, i.e.\ the bytecodes of the test program, split into basic blocks, is given in \tablename~\ref{tab:results:bubble:wcet}. The fourth column contains the execution time of the bytecodes and the basic blocks in clock cycles. The annotated control flow graph (CFG) of the example is shown in Figure~\ref{fig:results:wcet:cfg}. The edges contain labels showing how often the path between two nodes is taken. We can identify the outer loop, containing the blocks B2, B3, B4 and B8. The inner loop consists of blocks B4, B5, B6 and B7. Block B6 is executed when the condition of the \code{if} statement is true. The path from B5 to B7 is the only path that depends on the input data. % the long table 'should' begin on a new page %\pagebreak[4] \pagebreak[3] \begin{longtable}{lllrrrrr} \toprule & & & & \multicolumn{2}{c}{WCET} & \multicolumn{2}{c}{BCET} \\ Block & Addr. & Bytecode & Cycles & Count & Total & Count & Total \\ \midrule \endhead \caption{WCET and BCET in clock cycles of the Bubble Sort test program\label{tab:results:bubble:wcet}} \endfoot % table is two pages long, so don't use last caption in list % of tables works \caption[]{WCET and BCET in clock cycles of the Bubble Sort test program} \endlastfoot \input{results/bubble} \bottomrule \end{longtable} \begin{figure} \centering \includegraphics[height=\excelwidth]{results/results_wcet_cfg} \caption{The control flow graph of the Bubble Sort example} \label{fig:results:wcet:cfg} \end{figure} The values in the fifth and seventh columns (Count) of \tablename~\ref{tab:results:bubble:wcet} are derived from the CFG and show how often the basic blocks are executed in the worst and best cases. The WCET and BCET value for each block is calculated by multiplying the clock cycles by the execution frequency. The overall WCET and BCET values are calculated by summing the values of the individual blocks B1 to B8. The last block (B9) is omitted, as the measurement does not contain the return statement. The execution time of the program is measured using the cycle counter in JOP. The current time is taken at both the entry of the method and at the end, resulting in a measurement spanning from block B1 to the beginning of block B9. The last statement, the \code{return}, is not part of the measurement. The difference between these two values (less the additional 8 cycles introduced by the measurement itself) is given as the execution time in clock cycles (the last row in \tablename~\ref{tab:results:bubble:wcet}). The measured WCET and BCET values are exactly the same as the calculated values. In Figure~\ref{fig:results:wcet:sort}, the measured execution times for all 120 permutations of the input data are shown. The vertical axis shows the execution time in clock cycles and the horizontal axis the number of the test run. The first input sample is an already sorted array and results in the lowest execution time. The last sample is the worst-case value resulting from the reversely ordered input data. We can also see the 11 different execution times that result from executing basic block B6 (which performs the element exchange and takes 73 clock cycles) between 0 and 10 times. \begin{figure} \centering \includegraphics[width=\excelwidth]{results/results_wcet_sort} \caption{Execution time in clock cycles of the Bubble Sort program} \label{fig:results:wcet:sort} \end{figure} This example has demonstrated that JOP is a simple target for the WCET analysis. Most bytecodes have a single execution time (WCET = BCET), and the WCET of a task depends only on the control flow. No pipeline or data dependencies complicate the low-level part of the WCET analysis. %\subsection{Notes for targets} % %\subsubsection{JStamp} % %\begin{verbatim} % aJile project in \usr2\ajile\bench % Sources from JOP target % LowLevel.java from directory aJile % Remove .class in ...\dist\classes % Generate .class with \bat\ajc.bat in source directory % destination is ...\dist\classes % e.g. in ...\src\bench: ajc jbe/DoAll % aJile ChemBuilder (bench.ajp) use COM1 for System.out % Terminal on Serial A, 9600 baud % Did not get the System.out in Charade (but it worked some % time ago) % Charade: Reset, File-Load \usr2\ajile\bench\build.bin %\end{verbatim} \section{Applications} \label{sec:applications} During the research for this thesis, the first working version of JOP was used in a real-world application. Using an architecture under development in a commercial project entails risks. Nevertheless, this was deemed to be the best way to prove the feasibility of the processor. In this section, the experiences of the first project involving JOP are summarized. \subsection{Motor Control} \label{sec:app:kfl} In rail cargo, a large amount of time is spent on loading and unloading of goods wagons. The contact wire above the wagons is the main obstacle. Balfour Beatty Austria developed and patented a technical solution, the so-called \emph{Kippfahrleitung}, to tilt up the contact wire. This is done on a line up to one kilometer. An asynchrony motor on each mast is used for this tilting. However, it has to be done synchronously on the whole line. Each motor is controlled by an embedded system. This system also measures the position and communicates with a base station. \figurename~\ref{fig:results:kfl:mast} shows the mast with the motor and the control system in the `down' and `up' positions. The base station has to control the deviation of individual positions during the tilt. It also includes the user interface for the operator. In technical terms, this is a distributed, embedded real-time control system, communicating over an RS485 network. \begin{figure} \centering \includegraphics[width=7cm]{results/results_kfl_mast1} \hspace{1cm} \includegraphics[width=4cm]{results/results_kfl_mast2} \caption{Picture of a \emph{Kippfahrleitung} mast in down and up position} \label{fig:results:kfl:mast} \end{figure} %\begin{figure} % \centering % \includegraphics[width=4cm]{results/results_kfl_mast2} % \caption{Picture of the mast in up position} % \label{fig:results:kfl:mast2} %\end{figure} \subsubsection{Real Hardware} Although this system is not mass-produced, there were nevertheless cost constraints. Even a small FPGA is more expensive than a general purpose CPU. To compensate for this, additional chips for the memory and the FPGA configuration were optimized for cost. One standard 128KB Flash was used to hold FPGA configuration data, the Java program and a logbook. External main memory was reduced to 128KB with an 8-bit data bus. To reduce external components, the boot process is a little complicated. A watchdog circuit delivers a reset signal to a 32 macro-cell PLD. This PLD loads the configuration data into the FPGA. When the FPGA starts, it disables the PLD and loads the Java program from the Flash into the external RAM. After the JVM is initialized, the program starts at \code{main()}. The motor is controlled by silicon switches connected to the FPGA with opto couplers. The position is measured with two end sensors and a revolving sensor. The processor supervises the voltage and current of the motor supply. A display and keyboard are attached to the base station for user interface. The communication bus (up to one kilometer) is attached via an isolated RS485 data interface. \subsubsection{Synthesized Hardware} The following I/O modules were added to the JOP core in the FPGA: % \begin{itemize} \item Timer \item UART for debugging \item UART with FIFO for the RS485 line \item Four sigma delta ADCs \item I/O ports \end{itemize} % Five switches in the power line needed to be controlled by the program. A wrong setting of the switches due to a software error could result in a short circuit. Ensuring that this could not happen was a straightforward task at the VHDL level. The sigma-delta ADCs are used to measure the temperature of the silicon switches and the current through the motor. \subsubsection{Software Architecture} The main task of the program was to measure the position using the revolving sensor and communicate with the base station. This has to be done under real-time constraints. This is not a very complicated task. However, at the time of development, many features from a full-blown JVM implementation, such as threads or objects, were missing in JOP. The resulting Java was more like a \emph{tiny Java}. It had to be kept in mind which Java constructs were supported by JOP. Because there was no multi-threading capability, and in the interests of simplicity, a simple infinite loop with constant time intervals was used. Listing~\ref{lst:results:main} shows the simplified program structure. After initialization and memory allocation, this loop was entered and did never exit. \begin{lstlisting}[float,caption=Simplified program structure, label=lst:results:main] public static void main(String[] args) { init(); Timer.start(); forever(); // this point is NEVER reached } private static void forever() { for (;;) { Msg.loop(); Triac.loop(); if (Msg.available) { handleMsg(); } else { chkMsgTimeout(); } handleWatchDog(); Timer.waitForNextInterval(); } } \end{lstlisting} \subsubsection{Communication} Communication is based on a client server structure. Only the base station is allowed to send a request to a single mast station. This station is then required to reply. The maximum reply time is bounded by two time intervals. The base station handles timeout and retry. If an irrecoverable error occurs, the base station switches off the power to the mast stations, including the power supply to the motor. This is the safe state of the whole system. From the mast station perspective, every mast station supervises the base station. The base station is required to send requests on a regular basis. If this requirement is violated, the mast station switches off its motor. The data is exchanged in small packets of four bytes, including a one-byte CRC. To simplify the development, commands to program the Flash in the mast stations and force a reset were included. It is therefore possible to update the program, or even change the FPGA configuration, over the network. %\subsubsection{Benefits from using an FPGA} % %The flexibility of FPGAs made it possible to postpone some design %decisions after production of the PCB. Since the production of the %PBC was on the critical time line this helped to finish the complete %project in time. During development, there have been situations %where problems showed up that have not been foreseen. Two examples %are given where it was possible to find simple solutions: % %The routing of the PCB was almost finished. A question about cooling %the switches (TRIAC) has arisen. The electronic development team %insisted on a temperature sensor. The requirements in resolution and %conversion time were low. A sigma-delta ADC with only two external %passive components (a NTC thermistor and a capacitor) was %implemented in the FPGA. % %When the sample rate for an ADC is low compared to the clock %frequency of the digital system it is possible to transfer the AD %conversion problem from the analogue to the time domain. In %\figurename~\ref{fig_results_sigdel} the principle of a Sigma-Delta %ADC is shown. Only the adder and the integrator have to be analogue %components. A single bit DAC is just the FPGA digital output driving %the signal between GND and VCCIO. The single bit ADC can be built %with a comparator. For low resolution the threshold of the FPGA %input is practical. The simplest form of an ADC built with an FPGA %is shown in \figurename~\ref{fig_results_sigdel_real}. The filter %averages $2^n$ samples and can be implemented as a counter. % %\begin{figure*} % \centering % \includegraphics{results/results_sigdel} % \caption{A sigma-delta ADC} % \label{fig_results_sigdel} %\end{figure*} % % %\begin{figure*} % \centering % \includegraphics{results/results_sigdel_real} % \caption{A minimal sigma-delta ADC with an FPGA} % \label{fig_results_sigdel_real} %\end{figure*} % % %The AC current of the motor had to be monitored. The solution for %this was a shunt resistor in every power line and an opto-coupler %for isolation. However, it turned out that the shunt resistors got %too hot and delivered to little voltage for the opto-couplers to %work reliable. Having seen that it is possible to build an ADC in %the FPGA a new idea was born. For EMC reasons, there is an inductor %in every AC line. With a few windings of wire, a simple transformer %can be built. The resulting voltage was amplified, rectified and %used for current measurement. An additional comparator was used for %an exacter threshold than the input buffer of the FPGA. This %solution kept the board cool and added extra functionality. It is %now possible to define two thresholds for too little and too much %current. % \subsection{Further Projects} TAL, short for TeleAlarm, is a remote tele-control and data logging system. TAL communicates via a modem or an Ethernet bus with a SCADA system or via SMS with a mobile phone. For this application, a minimal TCP/IP stack needed to be implemented. This stack was the reason for implementing threads and a simple real-time system in JOP. Another application of JOP is in a communication device with soft real-time properties -- Austrian Railways' (\"OBB) new security system for single-track lines. Each locomotive is equipped with a GPS receiver and a communication device. The position of the train, differential correction data for GPS and commands are exchanged with a server at the central station over a GPRS virtual private network. JOP is the heart of the communication device in the locomotive. The flexibility of the FPGA and an Internet connection to the embedded system make it possible to upgrade the software and even the processor in the field. \section{Summary} In this chapter, we presented an evaluation of JOP. We have seen that JOP is the smallest hardware realization of the JVM available to date. Due to the efficient implementation of the stack architecture, JOP is also smaller than a \emph{comparable} RISC processor in an FPGA. Implemented in an FPGA, JOP has the highest clock frequency of all known Java processors. We compared JOP against several embedded Java systems and, as a reference, with Java on a standard PC. A Java processor is up to 500 times faster than an interpreting JVM on a standard processor for an embedded system. JOP is about six times faster than the aJ80 Java processor and as fast as the aJ100\footnote{The measured aJ100 system contained faster SRAMs than the FPGA board for JOP.}. Preliminary results using compiled Java for a RISC processor in an FPGA, with a similar resource usage and maximum clock frequency to JOP, showed that native execution of Java bytecodes is faster than compiled Java. We compared the basic properties of the real-time scheduler on JOP against the RTSJ implementation on Linux. The integration of the scheduler in the JVM, and the timer interrupt under scheduler control, results in an efficient platform for Java in embedded real-time systems. JOP performs better and more predictably than the reference implementation of the RTSJ under Linux. We also performed WCET analysis of the implemented JVM at the microcode level. This analysis provides the WCET and BCET values for the individual bytecodes. We have also shown that there are no dependencies between individual bytecodes. This feature, in combination with the method cache (see Section~\ref{sec:cache}), makes JOP an easy target for low-level WCET analysis of Java applications. Usage of JOP in three real-world applications showed that the processor is mature enough to be used in commercial projects. %\bibliographystyle{plain} %\bibliography{../bib/mybib} %\end{document}

     
    Copyright (c) 1999 OPENCORES.ORG. All rights reserved.