LOGIN   :::   RECOVER PASS   :::   GET ACCOUNT    
Browse
  • Projects
  • Code (CVS)
  • Forums
  • News
  • Articles
  • Polls
  •  
    OpenCores
  • FAQ
  • CVS HowTo
  • Mission
  • Media
  • Tools
  • Sponsors
  • Mirrors
  • Logos
  • Contact us
  •  
    Tools
  • Search
      
  • Download Cores (CVSGet)
  •  
    More
  • Wishbone
  • Perlilog
  • EDA tools
  • OpenTech CD
  •  
    Browse articles    

    The persistent cache approach.

    by Boris Muratshin on 21-Oct-2002
    source: Boris Muratshin



    The persistent cache approach.


    Boris Muratshin (zzeng@m...),

    Alexander Artiushin (alexnikart@m...)

    Jul 31, 2002


    Top



    Ideally, memory should be able to supply a processor by data in
    such a way of avoiding any extra data delays. Unfortunately, there
    are unknown an appropriate compilation methods to provide a proper
    data transport. In modern computer systems the decreasing of
    data retrieving time is achieved via a hierarchical memory
    organization where any following memory (cache) level larger
    and slower then the previous one, at that only the last level is
    addressable. First level cache size is commonly 8Kwords (1.5Mb at
    HP8500), second and third (if exists) levels are much greater.

    When an instruction gets as operand a virtual memory address, it
    is converted into a physical one. Using this physical address as a
    key (tag), data are requested from memory and if these data exist
    in cache, we get some speed-up. Let's think, why cache works with
    physical addresses? The problem is in probable intersection of
    address spaces allocated by OS for different processes. It is
    possible to enlarge virtual address (tag) with the process ID and
    convert it into the physical one only if cache lookup is
    unsuccessful but in this case we have significant cache complexity
    increasing against the small saving of address conversion costs.



    Moreover, address conversion is usually executed in processor
    whereas the upper cache levels are commonly separated from
    the processor. The shortcomings of existing caching approach
    are obvious and mostly are concluded in shared usage of cache
    memory by all running processes. Using any strategy of caching, at
    a control returning after a series of context (processes) switches
    the process should consider its own cached data context totally
    lost (so-called cold start problem). When the regular number of
    simultaneously running processes is evaluated in hundreds, to save
    (even partially) some personal data of process we need caches of
    huge sizes.

    Virtual pages swapping causes still one problem. When during a
    virtual address conversion we got a lack of corresponding virtual
    page, it will be read but some other page will be lost. Obviously,
    we should destroy all the cache lines corresponding to this
    condemned page.

    The advantages of existing caching technique are sometimes given
    to be moot points. The average retrieval time is better especially
    in case of consecutive and localized work. In any other case a
    program should not expect any significant help from a caching
    system. At that such an "average" behavior plays a mean trick with
    programmers when it makes no sense to take care of algorithm's
    "beauty" and to optimize programs by hands - cache levels
    off all code and frequently "bad" code turned out faster then
    "good". An attempts to use a cache particularities in optimization
    yield in the lack of portability even between different versions
    of the same processor. There are attempts to make cache more
    predictable e.g. speculative loading technique. The compiler
    places the instruction of loading data into cache before their
    necessity (SPARC V9, IBM POWER3 and HP PA-8xxx). The ability to
    explicitly invalidate an unnecessary cache line also could be an
    useful feature. Some systems (e.g. TMS320C6xxx) allow to configure
    the cache as directly addressed super-fast memory but in this
    case there exist difficulties with its sharing between tasks.

    Additional words should be devoted to the cache coherency problem.
    Such as the coherency is supported by inter-processor messages and
    the number of the messages is significantly nonlinear against the
    number of processors, there exist an objective upper limit for the
    number of processors in the system. Let's note what the majority
    of the mentioned messages are useless such as certainly will not
    be ever used. This is a payment for a caching in physical
    addresses, which causes extra dependencies between processor
    units. And the method to avoid these extra dependencies will let
    to significantly increase above-mentioned upper limit of the
    number of processors in a system.

    Let's advert to the essence of supposed approach. The basic ideas
    are:

    1. the separation of virtual addresses into global and
      local which can be done on basis of the value of this address
    2. caching is performed in virtual address space
    3. each task works with own cache context and
      its cached local data can't be affected by other tasks
    4. the buffer of cache memory is spliced into two equal parts (pages)
    5. at a task switch, cache pages are switched,
      at this a new task immediately runs with one page and
      the other page is used firstly for the saving of its content
      into main memory and, secondly, for a restoring of data
      for a next task from main memory
    6. the global data tag is extended with the identifier
      given by OS, e.g. the process number

    Let the discussing computing system consists of at least one
    processor module. Let this system is intended to run in parallel a
    number of processes each of which can contain more then one
    parallel thread. We will call the data, which can be used by the
    only thread as local, and the data, which are shared between
    different threads of one process as global. The data sharing
    between processes is allowable but requires a special treatment
    e.g. they are uncacheable or unswapable or accessed only via
    system calls.

    The program loader is implemented to place a process global data
    into segments with easily discriminate virtual addresses e.g. the
    result of the binary '&' should give true at testing this address
    against some predefined or software re-definable bit mask. The OS
    kernel (later simply kernel) is responsible for the threads
    scheduling and switching.

    Each processor module contains local caching mechanism (later
    cache) of the same size and organization. The cache contains
    following logical blocks:

    1. two equal data pages one of which we call active and the second one - shadowed
    2. pages access block
    3. global data management block
    4. global data page
    5. at least one DMA channel

    Depending on the implementation, data and command caches can be
    placed either on the same cache page with sharing of all caching
    mechanism or on independent pages with duplication of some
    (or all) mentioned mechanism.

    The pages access block realizes one of known caching strategies
    and algorithms e.g. it can be fully associative, write through and
    LRU extrusion.

    The global data management block is used for the maintenance of
    data coherency in a processor module and also in all system
    scale.

    The global data page is shared between all threads, this page
    doesn't have to be of the same size as a local data page and is
    used directly by the global data management block as an
    associative memory pool.

    The DMA channels are meant for copying data from cache shadowed
    local page to basic memory and back. Every of existing DMA
    channels works independently with its own (at least in this
    moment) page region.

    The active page is dedicated for usage by active on this processor
    module task, during that the content of this page is the caching
    context of the current task. The kernel contains the mechanism for
    downloading of the shadowed page content into a main memory and
    for uploading back. In the moment, when the kernel changes the
    current task, the cache pages switching is performed, so the
    active page becomes shadowed and vice versa. Which page is
    currently active and which is shadowed can be determined by the
    value of some control register and this value is changed either
    automatically or explicitly by kernel.

    In this moment the shadowed page should contain caching context of
    the next task fully uploaded by the kernel. Just after the task
    switch, the kernel begins to download the content of new shadowed
    page. After that and when kernel defines the next task to switch,
    the caching context of this task should be uploaded into the
    shadowed page. The uploading and downloading processes can be
    performed in the same time with the time shift to exclude their
    collision.

    When the cache line is set in the global page, the line tag is
    expanded by some identifier, which is got from the OS (which gets
    it from e.g. dedicated register). An example of such identifier
    can be the process ID, in any way this extended tag should be
    unique in the scale of all system. When the processor requests a
    global data, which are not registered in the global page, one
    should process the retrieval from the main memory and registration
    in the global page and put data to the local (active) page

    In case of global page repletion different implementations are
    possible. E.g. we can reject to perform caching for this value or
    substitute some any value from an active and global page in any
    way all global data in both local page should be coherent with
    those in the global page. In the process of data uploading to the
    shadowed page, the global data management block passes through all
    local data (which, as we remember, can be easily separated from
    the global ones by virtual address) and brings global data to
    conformity with the content of the global page. In case of a
    multiprocessor system, the global data management block should
    realize some mechanism of distributed transactions to maintain the
    all-system data coherence. Is there a coincidence of global pages
    contents or not, depends on the supported protocol of distributed
    transactions.

    The caching context of each task is stored in some memory,
    allocated by the kernel for this task to describe its behavior.

    Touching some control register can stop the caching activity. In
    the same way can be stopped the caching of global data. In case of
    an apparatus interrupt, e.g. in case of a virtual page miss, the
    caching activity is stopped up to the end of an interrupt handler
    work.

    The active page additionally has a direct addressing mode. In such
    a way every task can (if necessary) use the active page not as a
    caching buffer but as an own area of super-fast memory.

    What are the advantages of supposed cache organization? First of
    all, a super-fast memory can have own address space and own
    support instructions or it can be mapped into a virtual address
    space. In any way the existence of such memory by no means is
    considered in high level languages. The objective method of their
    utilization is a temporary variables (appeared during a
    compilation) storing. Alas, there is no necessity to store
    thousands of temporary variables so we need some new ideas. There
    are at least three variants.

    1. A stack is arranged in this memory.
      Stack contains local variables (and temporary too) which are
      owned only by the current task. Traditionally, a thread can pass to
      other thread of the same process an address of variable from
      its own stack, but this technique is rather an evil deed and
      creates more problems then solves. Also this data are grouped,
      the caching of the top of a stack is always useful and an actual
      stack top is usually not too large. When the stack overgrows an allocated for it memory,
      the exception occurs, OS allocates an additional buffer for data
      saving and uses the super-fast memory as an window of accelerated
      access to a stack. If there is an request out of window borders,
      the conversion of a virtual address to physical gives an
      appropriate physical address, pointing into
      one of previously allocated saving buffers.
    2. This super-fast memory can be manually managed via OS calls.
    3. This method assumes some changes in high level languages
      and/or creation of new compilation methods supporting mentioned
      architecture details. This is especially important for the
      architectures with explicit parallelism (ILP) where the caching
      in general can be harmful due an inherent unpredictability.

    Second advantage is in explicit division of data into local and
    global. This allows a directly managing of an intended for
    coherency support inter-processor messages. The entire of spectrum
    is possible. If we prohibit the global data caching, the
    synchronization occurs on the level of main memory access.
    This is not so bad if global data are rear. In case of absence of
    global data the coherency support is not necessary. In
    situation where all data are global, we have the same as when we
    use traditional cache with physical tags. It is important for a
    programmer to clear understand which data are used in this
    moment and what is the payment for it. The choice here can be
    realized by using of two system calls for memory allocations - for
    local and process-global data.

    Third advantage. A saving task context can contain some additional
    data such as branch history table (BHT) or physical addresses
    cache (TLB). Some systems keep BHT together with an instructions
    cache, in any way if this table becomes task-personal, this allow
    to increase jump predictions for speculative calculations without
    significant costs (e.g. in MIPS R10000 BHT contains just 512
    items). A common TLB size is in hundreds of items. There are no
    problems to arrange it inside savable cache page. The obvious
    difficulty is in necessity to care about swapped out pages during
    the context uploading.


    Top


    This letter is an essence of the patent application:
    "B. Muratshin, A. Artiushin The persistent cache approach for
    multitasking and SMP systems. Novosibirsk, Jul 2002, patent
    application RUS N2002121880/20(022788)". But the OpenCores members can use it freely under the OpenCores license.


     
    Copyright (c) 1999 OPENCORES.ORG. All rights reserved.