5.1 Introduction

The transaction-accurate architecture design consists of adapting the software to a specific implementation of the communication protocol. At this phase, aspects related to the communication protocol are detailed; for example, the synchronization mechanism between the different processors running in parallel becomes explicit. The software code is adapted to the synchronization method, such as events or semaphores. The adaptation is performed by integrating the tasks code with the OS and the communication components of the software stack. The result of the transaction-accurate architecture design is the transaction-accurate architecture model.

5.1.1 Definition of the Transaction-Accurate Architecture

The third abstraction level of the hardware–software architecture is called transaction-accurate architecture level (TA). The transaction-accurate architecture details the local architecture of each subsystem and makes explicit the communication protocol. On the software side, the tasks code is integrated with an operating system and communication library to form the software stack. Each processor subsystem executes a software stack. The transaction-accurate architecture model may be manually coded or automatically generated by different tools.

The objectives of the transaction-accurate architecture are as follows:

  • Early verification of the tasks code execution upon an operating system

  • Early performance validation of the communication mapping scheme

The transaction-accurate architecture is composed of processor and hardware subsystems that are interconnected using an explicit interconnection component, such as a bus or a NoC. The processor subsystems include the local components of the subsystem, such as local memories, peripherals, and network interfaces, and an abstract model of the processor cores.

Figure 5.1 illustrates a global view of the transaction-accurate architecture, composed of two abstract processor subsystems, one memory hardware subsystem, and the network component. The left part of the figure corresponds to the hardware architecture, while the right part represents the software stack at the transaction-accurate architecture level running on one of the processor subsystems.

Fig. 5.1 Global view of the transaction-accurate architecture

5.1.2 Global Organization of the Transaction-Accurate Architecture

The transaction-accurate architecture model is a hierarchical model. The transaction-accurate architecture is composed of software and hardware subsystems that are interconnected using an explicit network component, e.g., bus, NoC, or dedicated hardware components like the hardware FIFO.

The software subsystem represents the processor subsystem. The hardware subsystem represents a memory subsystem or a dedicated hardware subsystem that accelerates the computation of specific application functions.

Each subsystem integrates local components that are interconnected using a local and simple bus. Usually the processor subsystems are made of one or more abstract computation models of the processor cores, local memories such as program code memory, data memory, or dedicated registers, network interfaces for the connection with the external world, and other processor-specific peripherals. The selection of these components relies on the target architecture and the software requirements at this level.

Each abstract processor model executes a specific software stack made of the tasks code, operating system, and communication library. The software stack uses hardware abstraction layer primitives (HAL APIs) for the interaction with the hardware part of the system. In fact, the abstract processor with the implementation of the HAL APIs represents the hardware–software interface.

At the transaction-accurate architecture level, the intra-subsystem communication units become communication channels implemented by the communication and operating system components of the software stack. Therefore, the communication between the tasks running on the same processor is managed totally by the OS and the communication software libraries.

The inter-subsystem communication units are mapped on full end-to-end communication paths through the architecture. Hence, the communication protocol and the synchronization between the processors become explicit. The different communication paths are characterized by different performance indicators, such as throughput of the buses, delay of the communication path, or overhead of the HdS layer (device drivers, resource sharing mechanism).

The adopted communication path and the topology of the network infrastructure are implemented according to the annotation of the system architecture model and the performance estimation through the simulation of the virtual architecture model.

Example 23. Transaction-accurate architecture for the token ring application  Figure 5.1 shows a conceptual representation of the transaction-accurate architecture for the token ring application mapped on the 1AX architecture.

Figure 5.1 illustrates that for the token ring application running on the 1AX architecture, the transaction-accurate architecture contains two processor subsystems, corresponding to the ARM and XTENSA processors, respectively, and the global memory subsystem. All these subsystems are interconnected by an explicit AMBA bus.

The ARM-SS processor subsystem includes an abstract ARM module, local memory, programmable interrupt controller (PIC), mailbox for the communication synchronization, and bridge for the interface to the AMBA bus, all interconnected through a local bus.

The local architecture of the XTENSA-SS subsystem is similar to that of the ARM-SS subsystem, except that it includes an abstract model of the XTENSA processor instead of the ARM7 processor. The global memory subsystem includes the global memory and the bridge for the connection with the global bus.

The communication through a FIFO between the tasks T1 and T2 mapped on the ARM-SS is implemented by the software components of the ARM software stack.

At the transaction-accurate architecture level, the inter-subsystem communication units COMM1 and COMM2 are mapped on full communication paths. Therefore, data sent by the ARM and received by the XTENSA processor, using the global memory as storage buffer, follows the data path illustrated below:

ARM -> BUS_ARMSS -> BRIDGE_ARMSS -> AMBA -> BRIDGE_MEMSS -> BUS_MEMSS -> MEM -> BUS_MEMSS -> BRIDGE_MEMSS -> AMBA -> BRIDGE_XTENSASS -> BUS_XTENSASS -> XTENSA

where

BUS_ARMSS represents the local bus of the ARM-SS

BRIDGE_ARMSS is the bridge of the ARM-SS

BUS_MEMSS is the local bus of the MEM-SS

BRIDGE_MEMSS is the interface of the global memory to the AMBA bus

BUS_XTENSASS specifies the local bus of the XTENSA-SS

BRIDGE_XTENSASS represents the bridge inside the XTENSA-SS.

This kind of data transfer requires a synchronization mechanism between the two processors, which here relies on the mailbox components. Once the data to transmit is stored in the global memory, the ARM sends an event to the mailbox of the XTENSA to notify it that data is available. After checking the corresponding status register of the mailbox, the XTENSA processor may transfer the data from the global memory.

Another communication path between the processors offered by the architecture involves the following route:
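The notification protocol described above can be sketched in plain C. This is only an illustration under stated assumptions: the global memory and the mailbox status register are modeled as ordinary variables, and the names gmem, xtensa_mailbox, arm_send, and xtensa_try_recv are hypothetical, not part of the book's platform code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the hardware: a small buffer in global memory
 * and the status register of the XTENSA mailbox. */
enum { GMEM_SIZE = 64 };
static uint8_t gmem[GMEM_SIZE];
static volatile uint32_t xtensa_mailbox;   /* 1 = data available */

/* Producer side (ARM): store the data first, then raise the event. */
static void arm_send(const void *src, size_t size)
{
    assert(size <= GMEM_SIZE);
    memcpy(gmem, src, size);
    xtensa_mailbox = 1;
}

/* Consumer side (XTENSA): check the mailbox status; copy the data and
 * acknowledge the event only if data is available. Returns 1 on success. */
static int xtensa_try_recv(void *dst, size_t size)
{
    assert(size <= GMEM_SIZE);
    if (xtensa_mailbox == 0)
        return 0;
    memcpy(dst, gmem, size);
    xtensa_mailbox = 0;
    return 1;
}
```

The ordering matters: the producer writes the buffer before raising the event, so the consumer never observes the event without the data being in place.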

ARM -> BUS_ARMSS -> HWFIFO -> BUS_XTENSASS -> XTENSA

The communication through the hardware FIFO does not require explicit synchronization because the hardware resource also manages the synchronization between the processors.

The transaction-accurate architecture model may be represented using different design languages, such as SystemC [62] or SpecC [54]. The following paragraphs will present the transaction-accurate architecture using SystemC as design language.

5.2 Basic Components of the Transaction-Accurate Architecture Model

The basic components of the transaction-accurate architecture model are the software and hardware components. The software components consist of the tasks code, operating system, communication library, and HAL APIs, while the hardware components represent the detailed subsystems and explicit communication network.

5.2.1 Software Components

At the transaction-accurate architecture level, a software stack is built for each processor subsystem. This software stack comprises the previously generated tasks code enriched with an OS and a communication library (Fig. 5.2). The HdS software is the assembly of the OS, communication library, and HAL APIs. The HdS refines the communication APIs (HdS APIs) to custom hardware-specific low-level APIs (HAL APIs) and is responsible for managing the tasks and the hardware resources. The HAL APIs abstract the underlying hardware architecture. Their implementation is not yet defined for the target processor, which keeps the software code processor independent. Because it is based on OS and communication libraries, the proposed approach allows flexible building and configuration of the software stack and therefore easy customization for specific architectures and/or applications. At this level, the data transfers use explicit addresses, e.g., read_mem(addr, dst, size)/write_mem(addr, src, size).

Fig. 5.2 Software components of the transaction-accurate architecture

Example 24. Software components for the token ring application at the transaction-accurate architecture level  For the token ring application, a software stack is executed by each processor (ARM7 and XTENSA). The software stack running on the ARM7 is made of the code of two application tasks (T1 and T2), the OS, and the communication library. The software stack running on the XTENSA is made of the task code of T3, the OS for the interrupt management, and the communication software component. For both processors, the software stack runs the same OS, namely DwarfOS, uses the same communication library that implements the primitives send_data(…)/recv_data(…), and is based on the same HAL APIs (read_mem(…)/write_mem(…), ctx_switch(…)).
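The layering of HdS APIs over HAL APIs can be illustrated with a small C sketch. The read_mem/write_mem argument order follows the text; everything else is an assumption made for illustration: the channel_t type, the send_data/recv_data signatures, and the simulated memory window (sim_mem, SIM_BASE, SIM_SIZE) that stands in for real hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simulated memory window standing in for the hardware. */
#define SIM_BASE 0x40500000u
#define SIM_SIZE 256u
static uint8_t sim_mem[SIM_SIZE];

/* Translate a physical address into the simulated window, with bounds check. */
static uint8_t *sim_ptr(uint32_t addr, uint32_t size)
{
    assert(addr >= SIM_BASE && size <= SIM_SIZE);
    assert(addr - SIM_BASE <= SIM_SIZE - size);
    return &sim_mem[addr - SIM_BASE];
}

/* HAL level: transfers use explicit addresses. */
static void write_mem(uint32_t addr, const void *src, uint32_t size)
{
    memcpy(sim_ptr(addr, size), src, size);
}

static void read_mem(uint32_t addr, void *dst, uint32_t size)
{
    memcpy(dst, sim_ptr(addr, size), size);
}

/* HdS level: the task names only a channel; the communication library
 * supplies the buffer address (assumed signatures). */
typedef struct { uint32_t buf_addr; } channel_t;

static void send_data(const channel_t *ch, const void *src, uint32_t size)
{
    write_mem(ch->buf_addr, src, size);
}

static void recv_data(const channel_t *ch, void *dst, uint32_t size)
{
    read_mem(ch->buf_addr, dst, size);
}
```

The task code never sees an address, so retargeting the channel to another buffer only changes the channel description, not the tasks.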

5.2.2 Hardware Components

The hardware architecture at the transaction-accurate level represents a more detailed platform than the virtual architecture level. It includes the components explicitly used by the HAL APIs. The different subsystems of the architecture are detailed with explicit peripherals and abstract computation model for the processor cores. Design decisions such as subsystems positioning over the global interconnect component, NoC size definition, NoC topology, NoC routing algorithm, and communication buffer size are implemented at the transaction-accurate architecture level.

Example 25. Hardware components for the token ring application at the transaction-accurate architecture level  For the token ring application, the hardware platform has a detailed local architecture for each subsystem (Fig. 5.3). Thus, the ARM-SS and XTENSA-SS contain an abstract ARM and XTENSA processor, respectively, a local memory, an interrupt controller, a local bus, and a bridge for the interface with the AMBA. The global memory subsystem contains the global memory and the bridge for the connection to the AMBA. The hardware FIFO is connected directly to the local bus of each processor subsystem.

Fig. 5.3 Hardware components of the transaction-accurate architecture

5.3 Modeling Transaction-Accurate Architecture in SystemC

The transaction-accurate architecture model is described using SystemC TLM language and is designed according to the annotated architecture parameters of the initial system architecture model and the results of the virtual architecture model simulation.

5.3.1 Software at Transaction-Accurate Architecture Level

The software design at the transaction-accurate architecture level consists of integrating the tasks code with an OS and a communication implementation for each processor subsystem. In the following examples, the considered operating system is called DwarfOS, a tiny operating system which supports a set of basic services, such as interrupt management, a software FIFO communication protocol, a cooperative scheduling policy based on static priority, and application task initialization [63, 126]. The communication primitives are based on blocking message-passing interface semantics. The synchronization is made using events. At this level, the generated tasks are dynamically scheduled by the OS scheduler according to the availability of data for read operations or the availability of space for write operations.

The tasks C code remains unchanged from the virtual architecture level and it uses HdS APIs such as send_data(…)/recv_data(…). Compared with the virtual architecture, the implementation of these APIs is no longer handled by the SystemC architecture. The implementation relies on the OS and communication libraries. Hence, the tasks are blocked on communication and scheduled by the OS scheduler and not by the SystemC scheduler as at the virtual architecture level.

The OS and communication components make use of HAL APIs. At this level, the implementation of the HAL APIs is not yet defined for the target processors. Therefore, the software code is still processor independent at the transaction-accurate architecture level, but it is adapted to specific hardware communication implementation such as synchronization. Figure 5.4 shows a part of the software code at the transaction-accurate architecture level.

Fig. 5.4 Software at the transaction-accurate architecture level

The HAL APIs, e.g., __ctx_switch(…), give the operating system, communication, and application software an abstraction of the underlying architecture. Furthermore, the HAL APIs ease porting the OS to a new hardware architecture.

There are different categories of HAL APIs [174]:

  • Kernel HAL APIs, such as task context management APIs (e.g., context creation, deletion or context switch APIs, task initialization), stack pointer and program counter management APIs (get/set_IP(), get/set_SP()), or processor mode change APIs (enable_kernel/ user_mode())

  • Interrupt management APIs, e.g., APIs which enable/disable interrupt request from an interrupt source (vector_enable/disable(vector_id)), configure interrupt vector (vector_configure(vector_id, level, up)), mask/unmask interrupt for a processor (interrupt_enable/disable()), implement the interrupt service routines (interrupt_attach/ detach(vector_id, isr)) or HAL APIs that acknowledge to the interrupt source that the interrupt request has been processed (clear_interrupt(vector_id))

  • I/O HAL APIs, which configure the I/O devices and allow their access. For example, to configure an MMU device, the following I/O HAL APIs may be required: APIs for page management (enable/disable_paging()), address translation (virtual_to_physical()), TLB (translation lookaside buffer) management, such as set a TLB entry (TLB_add()) or get TLB entry virtual/physical page frame (get_TLB_entry()). Other I/O HAL API examples can be considered the APIs for the cache memory management, such as Instruction/Data_Cache_Enable/Disable()

  • Resource management APIs, such as APIs for power management (check battery status, set CPU clock frequency) or APIs to configure the timer (set/reset_timer(), wait_cpu_cyle())

  • Design time HAL APIs, which facilitate the software design process, more precisely the simulation. An example of such an API is consume_cpu_cyle(), which simulates the advance of the software execution time.
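As an illustration of the interrupt management category, the following C sketch models a vector table on the host. The function names come from the list above; their return types, the table size NUM_VECTORS, the dispatch_interrupt helper, and the demo handler mailbox_isr are assumptions introduced here for the example.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_VECTORS = 8 };             /* hypothetical number of vectors */
typedef void (*isr_t)(void);

static isr_t vector_table[NUM_VECTORS];
static unsigned enabled_mask;         /* bit i set = vector i enabled */

/* Attach or detach an interrupt service routine; -1 on an invalid vector. */
static int interrupt_attach(unsigned vector_id, isr_t isr)
{
    if (vector_id >= NUM_VECTORS)
        return -1;
    vector_table[vector_id] = isr;
    return 0;
}

static int interrupt_detach(unsigned vector_id)
{
    return interrupt_attach(vector_id, NULL);
}

static void vector_enable(unsigned vector_id)
{
    assert(vector_id < NUM_VECTORS);
    enabled_mask |= 1u << vector_id;
}

static void vector_disable(unsigned vector_id)
{
    assert(vector_id < NUM_VECTORS);
    enabled_mask &= ~(1u << vector_id);
}

/* Called by the simulated interrupt controller; returns 1 if an ISR ran. */
static int dispatch_interrupt(unsigned vector_id)
{
    if (vector_id >= NUM_VECTORS)
        return 0;
    if (!(enabled_mask & (1u << vector_id)) || vector_table[vector_id] == NULL)
        return 0;
    vector_table[vector_id]();
    return 1;
}

/* Demo handler: counts mailbox events. */
static int mailbox_events;
static void mailbox_isr(void) { mailbox_events++; }
```

Attaching a handler and enabling its vector are kept separate, mirroring the distinction the list makes between interrupt_attach/detach and vector_enable/disable.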

Example 26. Software code for the token ring application at the transaction-accurate architecture level  Figure 5.5 illustrates an example of software code for the token ring at the transaction-accurate architecture level.

Fig. 5.5 Initialization of the tasks running on ARM7

Figure 5.5 shows the main file. The main file contains the function “thread_main”, which is the first function executed on the processor after boot. The main file is responsible for initializing the application tasks and the software communication channels. It includes the OS-dependent header files, declares the software FIFO communication channels, attaches the interrupt service routines to the interrupt numbers, and adds the tasks to the operating system's list of tasks ready for execution. As illustrated in Fig. 5.5, for the token ring application, the initialization file of the ARM7 processor declares the two tasks running on the ARM7 and the software FIFO used for the communication between them. It also attaches the interrupt service routine of the mailbox to the interrupt number 0.

Figure 5.6 shows a fragment of code implementing the communication primitive recv_data(…). If the protocol of the communication channel is based on a FIFO mechanism, the implementation checks the status of the FIFO. If the FIFO is empty, the scheduler of the OS is called (__schedule(…)).
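The blocking behavior described above can be sketched as follows. This is not the DwarfOS code from Fig. 5.6: the sw_fifo_t type, the os_schedule stand-in for __schedule(…), the schedule_hook used to simulate another task running, and the demo producer are all assumptions made so that the example can run on a host.

```c
#include <assert.h>
#include <stddef.h>

enum { FIFO_DEPTH = 4 };

typedef struct {
    int    data[FIFO_DEPTH];
    size_t head;    /* index of the oldest element */
    size_t count;   /* number of stored elements   */
} sw_fifo_t;

/* Returns 1 if the value was stored, 0 if the FIFO is full. */
static int fifo_put(sw_fifo_t *f, int v)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return 0;
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return 1;
}

/* Stand-in for the OS scheduler: in this sketch, "running another task"
 * means calling a hook that may produce data. */
static void (*schedule_hook)(void);
static unsigned schedule_calls;

static void os_schedule(void)
{
    schedule_calls++;
    if (schedule_hook != NULL)
        schedule_hook();
}

/* Blocking receive: while the FIFO is empty, give the CPU to other tasks. */
static int recv_data(sw_fifo_t *f)
{
    assert(f != NULL);
    while (f->count == 0)
        os_schedule();
    int v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return v;
}

/* Demo producer task used as the scheduler hook. */
static sw_fifo_t demo_fifo;
static void demo_producer(void) { fifo_put(&demo_fifo, 42); }
```

With schedule_hook set to demo_producer, a receive on the empty demo_fifo calls the scheduler once and then returns the produced value. Without a producer the loop would spin forever, which is the host-side analogue of a task blocked indefinitely on communication.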

Fig. 5.6 Implementation of recv_data(…) API

The communication primitives access the logic ports of the tasks that are declared in the header files of each task. Figure 5.7 shows the header file of task T2 running on the ARM7 processor in case of the token ring application.

Fig. 5.7 Example of task header file

Task T2 has two logic ports:

  • One input port (In1_Task2) bound to the software FIFO channel that connects tasks T1 and T2, which was declared in the main file of the ARM7 processor as shown in Fig. 5.5.

  • One output port (Out1_Task2) for the external communication with the task T3 running on the XTENSA processor.

The logic ports have type port_t, as illustrated in Fig. 5.8. The port_t represents the data structure which implements the logic port in case of the DwarfOS. It combines the following fields: communication protocol associated with the port, status of the local synchronization register, status of the remote synchronization register, destination buffer used to store the data to be exchanged, list of tasks that are waiting for the port to acquire a synchronization event, and a specific field which stores special protocol characteristics.

Fig. 5.8 Data structure of tasks’ ports

The input port of task T2 is characterized by a software FIFO protocol and uses the synchronization and buffer associated with the software FIFO channel. The output port of task T2 uses a global FIFO protocol, with the communication buffer mapped onto the external memory at address 0x40500000 and the synchronization relying on the registers of the local and remote mailboxes corresponding to the communication channel. The local mailbox is the mailbox of the ARM processor, accessed at address 0x300808. The remote mailbox is the mailbox of the XTENSA-SS, at address 0x700808.
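A possible C layout for such a port descriptor, filled in for the output port of task T2, is sketched below. The field list follows the description of port_t given earlier, and the three addresses are the ones quoted above; the field names, the protocol_t enumeration, and the choice to store register addresses rather than register values are assumptions, since the actual definition appears only in Fig. 5.8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical protocol identifiers. */
typedef enum { PROTO_SW_FIFO, PROTO_GLOBAL_FIFO } protocol_t;

struct task;   /* task descriptor, opaque in this sketch */

typedef struct port {
    protocol_t   protocol;       /* communication protocol of the port   */
    uint32_t     local_sync;     /* local synchronization register       */
    uint32_t     remote_sync;    /* remote synchronization register      */
    uint32_t     buffer;         /* buffer holding the data to exchange  */
    struct task *waiting;        /* tasks waiting for a sync event       */
    void        *protocol_data;  /* protocol-specific characteristics    */
} port_t;

/* Output port of T2: global FIFO, buffer in external memory, mailboxes
 * of the ARM-SS (local) and XTENSA-SS (remote). */
static const port_t Out1_Task2 = {
    PROTO_GLOBAL_FIFO, 0x300808u, 0x700808u, 0x40500000u, NULL, NULL
};

/* Returns 1 if the port leaves the processor subsystem. */
static int port_is_inter_subsystem(const port_t *p)
{
    assert(p != NULL);
    return p->protocol == PROTO_GLOBAL_FIFO;
}
```

The two mailbox addresses differ by 0x400000, which matches the XTENSA-SS being mapped at 0x400000 in the global address space: the remote mailbox is the same register offset seen through the bridge.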

Figure 5.9 shows a portion of the OS scheduler implementation. The scheduler searches for a new task that is ready for execution. If there is a new ready task, the scheduler performs a context switch by calling the HAL API __ctx_switch(…). During the context switch, the OS saves the status and registers (program counter, stack pointer, etc.) of the processor running the current task and loads those of the new task.

Fig. 5.9 Implementation of the __schedule() service of OS
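The selection step of such a scheduler can be sketched in C as follows. It is a simplified illustration, not the code of Fig. 5.9: the task_t layout, the convention that a larger number means a higher priority, the use of array indices as task identifiers, and the ctx_switch stub that only records its arguments are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TASK_READY, TASK_WAITING } task_state_t;

typedef struct {
    int          priority;   /* static priority, larger = more urgent */
    task_state_t state;
} task_t;

/* Stub for the HAL context switch: records the call instead of switching. */
static int ctx_switch_calls, last_old, last_new;

static void ctx_switch(int old_tid, int new_tid)
{
    ctx_switch_calls++;
    last_old = old_tid;
    last_new = new_tid;
}

/* Pick the highest-priority ready task other than the current one.
 * Returns the index of the task that runs next. */
static int schedule(task_t *tasks, size_t n, int current)
{
    assert(tasks != NULL);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if ((int)i == current || tasks[i].state != TASK_READY)
            continue;
        if (best < 0 || tasks[i].priority > tasks[best].priority)
            best = (int)i;
    }
    if (best < 0)
        return current;          /* no other ready task: keep running */
    ctx_switch(current, best);
    return best;
}
```

Because the policy is cooperative, schedule is only entered when the running task calls it, for example from a blocking communication primitive.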

5.3.2 Hardware at Transaction-Accurate Architecture Level

The hardware at the transaction-accurate architecture level consists of a set of hardware and software subsystems interconnected using an explicit communication network. The hardware architecture implements the communication protocol, including buffer mapping, synchronization mechanism used by the processors, and the entire communication path for inter-subsystem communication.

The different subsystems represent SystemC modules (SC_MODULE) which include the local components. A top module includes the declaration, instantiation, interconnection, and address space allocation of these subsystems. Each subsystem incorporates the local hardware modules. The local components are also SystemC modules.

The transaction-accurate architecture makes use of a library of transaction-accurate components. This library implements parametric hardware components such as mailbox, bridge, network interface, interrupt controller, interrupt signals, buses, and abstract execution model for distinct types of processor.

Example 27. Hardware code for the token ring application at the transaction-accurate architecture level  Figure 5.10 details the top module for the token ring application running on the 1AX architecture.

Fig. 5.10 SystemC code for the top module

The top module is an SC_MODULE which includes the declaration and the instantiation of the ARM-SS (varm7_ss in Fig. 5.10), XTENSA-SS (vxtensa_ss), AMBA bus (vAMBA), global memory subsystem MEM-SS (vgmem_ss), and hardware FIFO (vhwfifo). It also interconnects these different subsystems by linking the bridge of each subsystem to the AMBA bus. A 4 MB address space is allocated to each processor subsystem. Thus, the ARM-SS has the address space 0x800000–0xBFFFFF and the XTENSA-SS has the address space 0x400000–0x7FFFFF. The global memory occupies the addresses 0x40000000–0x40FFFFFF.

Figure 5.11 shows the SystemC module of the ARM7 subsystem of the 1AX architecture.

Fig. 5.11 SystemC code for the ARM7-SS module

The ARM7 subsystem includes a local bus (sys_bus), an abstract execution model of the processor core (ArmUnixCore), a local memory (mem), a bridge (bridge) for the connection to the AMBA bus, a programmable interrupt controller (PIC) (pic), the mailbox synchronization component (sync), and some interrupt signals (sign_sync, s1, and s2). The local peripherals have associated address spaces. Thus, the local memory is addressable between addresses 0x0–0x2FFFFF, the PIC between addresses 0x300000–0x30001F, and the mailbox between addresses 0x300800–0x300BFF. Each processor subsystem has a 4 MB local address space, from 0x0 up to 0x400000. Accesses to addresses beyond this local space are forwarded by the local bus to the bridge for external access through the AMBA bus.
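The address decoding performed by the local bus can be written out as a small C function using the ranges quoted above. The target_t enumeration and the function names are hypothetical, and the sketch assumes that address 0x400000 itself already belongs to the forwarded range, which the text leaves open.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    TARGET_MEM, TARGET_PIC, TARGET_MAILBOX, TARGET_BRIDGE, TARGET_UNMAPPED
} target_t;

#define MAILBOX_BASE 0x300800u

/* Decode a local-bus address into the component that serves it. */
static target_t decode_local(uint32_t addr)
{
    if (addr <= 0x2FFFFFu)
        return TARGET_MEM;
    if (addr <= 0x30001Fu)
        return TARGET_PIC;
    if (addr >= MAILBOX_BASE && addr <= 0x300BFFu)
        return TARGET_MAILBOX;
    if (addr >= 0x400000u)
        return TARGET_BRIDGE;      /* external access through the AMBA bus */
    return TARGET_UNMAPPED;        /* hole in the local address map        */
}

/* Register offset inside the mailbox for a mailbox address. */
static uint32_t mailbox_offset(uint32_t addr)
{
    assert(decode_local(addr) == TARGET_MAILBOX);
    return addr - MAILBOX_BASE;
}
```

With this map, the local mailbox address 0x300808 decodes to the mailbox at offset 0x8, while the remote mailbox address 0x700808 and the global buffer address 0x40500000 are both routed to the bridge.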

As illustrated in Fig. 5.12, the transaction-accurate architecture of the 1AX architecture contains a global clock used by all the processors. This clock has a period of one time unit, where a time unit represents 1 ns.

Fig. 5.12 SystemC clock

5.3.3 Hardware–Software Interface at Transaction-Accurate Architecture Level

The hardware–software interface at the transaction-accurate architecture level is represented by the abstract model of each processor core and the implementation of the HAL APIs. This is responsible to guarantee the software access to the hardware and implements the interaction between hardware and software.

The abstract model of the processor defines an execution environment of the software stack [138]. This is implemented as a SystemC module which interacts with the software. The abstract processor is modeled as a bus functional model, which allows operations onto the local bus, such as read and write operations [142].

The implementation of the HAL APIs allows a simulation model of the OS and inter-processor communication on the host machine [11]. For example, the implementation of the HAL API ctx_switch (old_tid, cur_tid) to perform a context switch between two tasks relies on the APIs provided by the operating system running on the host machine (Windows, Linux, UNIX, etc.). Figure 5.13 exemplifies the implementation of the context switch on the host machine running Linux OS that uses sigsetjmp and siglongjmp APIs to save and switch the context of a task.
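To give a feel for a host-based context switch, the sketch below switches between two cooperative tasks on a Linux host. Note that it deliberately uses a different mechanism from the one in the text: it relies on the ucontext functions (getcontext, makecontext, swapcontext) provided by glibc instead of sigsetjmp/siglongjmp, because they make stack setup explicit in a few lines. The ctx_switch signature, the task body, and the trace buffer are assumptions for illustration only.

```c
#include <assert.h>
#include <ucontext.h>

enum { STACK_BYTES = 64 * 1024, TRACE_LEN = 8 };

static ucontext_t main_ctx, task_ctx[2];
static char stacks[2][STACK_BYTES];
static int trace[TRACE_LEN];      /* order in which the tasks ran */
static int trace_len;

/* Save the context of the current task and resume the new one. */
static void ctx_switch(ucontext_t *old_ctx, ucontext_t *new_ctx)
{
    int rc = swapcontext(old_ctx, new_ctx);
    assert(rc == 0);
    (void)rc;
}

/* Each task logs its id twice, yielding to the other task in between. */
static void task_body(int tid)
{
    for (int round = 0; round < 2; round++) {
        trace[trace_len++] = tid;
        ctx_switch(&task_ctx[tid], &task_ctx[1 - tid]);
    }
}

/* Create both task contexts on private stacks and start task 0.
 * Control returns here (through uc_link) when task 0 finishes. */
static void run_tasks(void)
{
    for (int i = 0; i < 2; i++) {
        getcontext(&task_ctx[i]);
        task_ctx[i].uc_stack.ss_sp = stacks[i];
        task_ctx[i].uc_stack.ss_size = sizeof stacks[i];
        task_ctx[i].uc_link = &main_ctx;
        makecontext(&task_ctx[i], (void (*)(void))task_body, 1, i);
    }
    swapcontext(&main_ctx, &task_ctx[0]);
}
```

Calling run_tasks() interleaves the two tasks, so the trace alternates between task 0 and task 1. Each switch saves the registers and stack pointer of the running task and restores those of the other, which is the same role the HAL context switch plays for the simulated OS.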

Fig. 5.13 Implementation of the __ctx_switch HAL API

5.4 Execution Model of the Transaction-Accurate Architecture

The full hardware–software executable model is based on co-simulation between SystemC for the hardware components including the abstract processors and the native execution of the software stacks [110]. The main advantage of the native simulation is the simulation speed.

A native software simulation usually runs as a separate process on the host machine. In order to control and access the OS/HAL simulation model, the IPC (inter-process communication) is used. The SystemC simulation environment is used as co-simulation backplane for the software and hardware simulators. Thus, each software stack is a SystemC thread which creates a Linux process for the software execution. At the beginning of the simulation, the SystemC platform launches a GNU standard debugger (gdb) Linux process for each software stack in order to start its execution. The software stack interacts with the corresponding SystemC abstract processor module through the Linux IPC layer. The hardware–software interface uses Linux shared memory (IPC Linux shm) for the interaction, data, and synchronization exchange between the software and the hardware.

The abstract processor module represents a bus functional model (BFM). The BFM has two sides: one faces the native HAL simulation model (i.e., a Linux process) and the other is a pin-level interface of the processor. Figure 5.14 shows an example of BFM for a native OS/HAL simulation model. In this example, the native HAL simulation model uses shared memory, signals, and semaphores as IPC mechanisms for the external access.

Fig. 5.14 Hardware–software co-simulation

The BFM is used to transform a functional memory access (e.g., read a data item from a specific physical address) to a sequence of memory accesses. Thus, the BFM transfers external access from the HAL simulation model to the hardware simulation engine (e.g., SystemC hardware simulation). This transfer is performed by polling the IPC interface of the HAL simulation model for read/write access. If there is a requested read/write data operation, the BFM transforms the access request into signal transitions on the processor’s pin interface. Besides the data transfer, the BFM also transmits a processor interrupt to the HAL simulation model. When an interrupt arrives at the processor’s interrupt pins, the BFM sends a signal (e.g., Linux signal) to the HAL simulation model, more precisely to the Linux process.
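The polling step of the BFM can be sketched in C under strong simplifications. A plain structure, ipc_request_t, stands in for the shared-memory area between the software process and the hardware simulator, and the functions bus_read and bus_write stand in for the signal transitions on the processor's pin interface. All these names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { REQ_NONE, REQ_READ, REQ_WRITE } req_kind_t;

/* Stand-in for the IPC area written by the HAL simulation model. */
typedef struct {
    req_kind_t kind;
    uint32_t   addr;
    uint32_t   data;   /* value to write, or value read back   */
    int        done;   /* set by the BFM when the access ended */
} ipc_request_t;

/* Hypothetical bus model: a few words of memory and a transaction counter. */
enum { BUS_WORDS = 16 };
static uint32_t bus_mem[BUS_WORDS];
static unsigned bus_transactions;

static uint32_t bus_read(uint32_t addr)
{
    bus_transactions++;
    return bus_mem[(addr / 4u) % BUS_WORDS];
}

static void bus_write(uint32_t addr, uint32_t data)
{
    bus_transactions++;
    bus_mem[(addr / 4u) % BUS_WORDS] = data;
}

/* One polling step: serve a pending request, if any. Returns 1 if served. */
static int bfm_poll(ipc_request_t *req)
{
    assert(req != NULL);
    switch (req->kind) {
    case REQ_READ:
        req->data = bus_read(req->addr);
        break;
    case REQ_WRITE:
        bus_write(req->addr, req->data);
        break;
    default:
        return 0;              /* nothing requested */
    }
    req->kind = REQ_NONE;      /* request consumed */
    req->done = 1;             /* let the software side continue */
    return 1;
}
```

In a real co-simulation the software side would block until done is set, and the hardware simulator would call the polling step from the processor module, advancing simulated time for each bus transaction.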

The simulation at the transaction-accurate architecture level allows validating the integration of the tasks code with the OS and the communication protocol, and debugging the HdS access to the hardware resources (e.g., access to the AMBA bus, interrupt lines assignment, OS scheduling). On the software side, it makes it possible to debug the access of the OS functions to the hardware resources through the HAL APIs, e.g., read(…)/write(…) from/to the memory, explicit synchronization using mailboxes, or the interrupt service routines. On the hardware side, it gives more precise statistics on the communication and computation performance, such as the number of data bytes exchanged during the application execution, network congestion, or an estimate of the processor cycles spent on communication.

Example 28. Execution model for the token ring application at the transaction-accurate architecture level  Figure 5.15 shows the execution model of the software stacks running on the ARM7 and XTENSA processors in the case of the 1AX architecture. This represents a co-simulation between the gdb Linux processes of each software stack gdb1 and gdb2 (one gdb for each software stack) and one SystemC Linux process for the whole hardware platform simulation. The interface between the three Linux processes is performed using the Linux IPC shared memory.

Fig. 5.15 Execution model of the software stacks running on the ARM7 and XTENSA processors

5.5 Design Space Exploration of Transaction-Accurate Architecture

5.5.1 Goal of Performance Evaluation

The goal of the performance evaluation at the transaction-accurate architecture level is to allow profiling the communication requirements and improve the overall performances of the system. The objective is to provide through simulation statistical information, such as utilization of the global interconnect component or the degree of contention in the network component, and validate the communication protocol and the execution of the tasks under the control of a dedicated operating system.

Based on the communication traffic observed during the transaction-accurate architecture simulation, the designer can fix hardware and software architecture decisions. Examples of hardware architecture decisions are the entire end-to-end communication path used for the data exchange between the processors, the size of the NoC in number of routers, the positioning of the IP cores over the NoC, the final topology of the interconnect component, the routing algorithm used in a NoC, the buffer size inside the NoC routers, and the communication protocol between the different subsystems, which fixes the mapping of the communication buffers onto the storage resources and the synchronization mechanism. Examples of software architecture decisions are the operating system used for scheduling the tasks running on the same processing unit, the implementation of the communication primitives, and the synchronization mechanism managed by software.

These different decisions influence the overall execution time of the system, cost, and power consumption. Therefore, good decisions are required to be able to control the MPSoC design process.

5.5.2 Architecture/Application Parameters

The transaction-accurate architecture validates some hardware and software architecture characteristics specified at the system architecture level, such as the following:

  • Integration of the tasks code with the OS and communication libraries

  • Implementation of the communication protocol: buffers mapping, synchronization mechanism, and end-to-end data path between the processors

  • Adaptation of the software to specific hardware communication implementation

  • Type of the scheduling algorithm for the tasks

  • Type of global interconnection component with its configuration parameters, such as topology, buffer size, routing algorithm, and arbitration algorithm

The transaction-accurate architecture still keeps the implementation of the communication protocol independent of the type of processor cores. Therefore, the CPUCoreType represents an architecture parameter that will be considered only at the next abstraction level, the virtual prototype level. This will determine the adaptation of the software to a particular CPU through the explicit implementation of the low-level processor-specific HAL software layer.

5.5.3 Performance Measurements

At the transaction-accurate architecture level, the performance measurement consists of profiling the interconnect component and the communication and computation requirements for each processor.

Using annotation of the transaction-accurate architecture model with adequate execution delays, the simulation at this level can estimate the total clock cycles spent on communication or computation by each processor. The achieved precision can be cycle accurate only for the inter-subsystem communication, since all the hardware components of the communication path are explicit. The accuracy of the software execution is transaction level.

On the hardware side, the transaction-accurate architecture may give more precise statistics on the communication architecture, such as the number of conflicts on the shared global bus caused by simultaneous access requests in a bus-based topology. For a NoC-based topology, useful information gathered during the simulation includes the amount of NoC congestion, the number of routing requests, the number of transmitted packets, the average number of transmitted bytes per packet, and the number of times a router failed to transmit a packet because of conflicts. For both topologies (bus and NoC), the transaction-accurate architecture simulation allows extracting the total number of bytes transmitted through the global interconnect component and the amount of data transferred between the different processors.

Example 29. Performance measurements for the token ring application at the transaction-accurate architecture level  The simulation of the whole token ring application took 12 s in total, and the bus was requested 108 times to transfer data. In this example, however, the model is not annotated with the timing information required to estimate the operating system and communication overhead accurately.

5.5.4 Design Space Exploration

At the transaction-accurate architecture level, the design space exploration consists of communication mapping exploration. The designer can experiment with different communication mapping schemes, different communication protocols, and diverse global interconnect components in distinct configurations. For example, the designer may adopt a bus such as STBus or AMBA bus or a NoC such as Hermes or STNoC. Moreover, the NoC may support different topologies (mesh, torus, hypercube, ring, tree), the routers may be positioned in different dimensions (2D, 3D), the number of routers is configurable, and the IP cores may be attached to the NoC through different access points. Thus, the NoC offers flexibility and scalability in terms of number of routers, number of network interfaces, and interconnected IP cores.

Example 30. Design space exploration for the token ring application at the transaction-accurate architecture level  At this level, the designer can still map the communication buffers onto different storage resources provided by the architecture, such as the local memories of both ARM and XTENSA processors, or the shared global memory, or on the hardware FIFO in case of the 1AX architecture running the token ring application. These different communication mapping schemes involve different communication paths and synchronization mechanisms between the processors.

5.6 Application Examples at the Transaction-Accurate Architecture Level

The following sections present the transaction-accurate architecture model for the two case studies: the Motion JPEG decoder application running on the Diopsis RDT architecture with AMBA bus and the H.264 encoder application running on the Diopsis R2DT architecture with Hermes NoC in torus and mesh topologies.

5.6.1 Motion JPEG Application on Diopsis RDT

The transaction-accurate architecture design consists of two steps: software and hardware design. The software design consists of linking the tasks code with an operating system and communication library. For the Motion JPEG application, in order to produce an executable software code, the tasks code is compiled with the DwarfOS operating system and the communication library that implements the send_data(…)/recv_data(…) communication primitives. The tasks are scheduled by the OS. The communication between the tasks of the same processor is implemented by the OS and communication library.

The hardware architecture of the Diopsis RDT tile contains the components that can be accessed by the HAL APIs (Fig. 5.16).

Fig. 5.16

Transaction-accurate architecture model of the Diopsis RDT architecture running motion JPEG decoder application

The ARM subsystem includes the abstract processor core, local data memory (SRAM), local bus, and bridge for the connection with the AMBA bus. The DSP subsystem includes the DSP core, data memory (DMEM), registers (REG), DMA, interrupt controller (PIC), mailbox, local bus, and the bridge for external connection. The POT includes the system peripherals of the RISC processor, e.g., timer, interrupt controller (AIC), and synchronization component (mailbox), as well as I/O components such as the serial peripheral interface (SPI).

The AMBA bus implementation is based on the implementation at the virtual architecture level. The main components of the AMBA bus are illustrated in Fig. 5.17.

Fig. 5.17

AMBA bus at transaction-accurate architecture level

The synchronization between the different subsystems connected to the global bus is handled explicitly through the operating system and dedicated hardware components. The AMBA supports burst mode transfer at this level and fully models the arbitration strategy.

The assignment of addresses and mapping of the communication buffers into the memories with the corresponding interrupt mechanism used for synchronization is performed during the hardware platform design. The address space of the components is different from the virtual architecture platform, because the generated platform at the transaction-accurate level is more detailed and fully implements the communication protocol.

The full hardware–software executable model is based on co-simulation between SystemC, which models the hardware components including the abstract processors, and native execution of the software stacks. Each software stack is a UNIX process that the SystemC platform creates and launches at the beginning of the simulation. The software stack interacts with the corresponding SystemC abstract processor module through the Unix IPC layer. Besides software debugging, the execution model at this level also gives a more precise idea of the performance, which allows some architecture experimentation, as detailed below. The simulation of 10 QVGA frames at the transaction-accurate level takes 5 min 10 s.
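The interaction between a natively executed software stack and its abstract processor module can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions: the request/response message format, the names hal_write, hal_read, and abstract_processor, and the use of a thread in place of a separate UNIX process are all hypothetical and are not taken from the platform described in this book.

```python
import socket
import struct
import threading

# Hypothetical wire format (an assumption, not the book's IPC protocol):
# request = operation byte ('R' or 'W'), 32-bit address, 32-bit data.
REQ = struct.Struct("!cII")
RSP = struct.Struct("!I")

def recv_exact(sock, size):
    """Read exactly size bytes or raise if the peer closed the channel."""
    buf = b""
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def abstract_processor(sock, memory):
    """Platform side: turn IPC requests into accesses to a modeled memory."""
    while True:
        try:
            op, addr, data = REQ.unpack(recv_exact(sock, REQ.size))
        except ConnectionError:
            return  # software side finished
        if op == b"W":
            memory[addr] = data
        sock.sendall(RSP.pack(memory.get(addr, 0)))

def hal_write(sock, addr, data):
    """Software side: write one word through the IPC channel."""
    sock.sendall(REQ.pack(b"W", addr, data))
    return RSP.unpack(recv_exact(sock, RSP.size))[0]

def hal_read(sock, addr):
    """Software side: read one word through the IPC channel."""
    sock.sendall(REQ.pack(b"R", addr, 0))
    return RSP.unpack(recv_exact(sock, RSP.size))[0]

platform_end, software_end = socket.socketpair()
memory = {}
worker = threading.Thread(target=abstract_processor, args=(platform_end, memory))
worker.start()

hal_write(software_end, 0x1000, 42)
value = hal_read(software_end, 0x1000)

software_end.close()   # lets the platform-side loop terminate
worker.join()
platform_end.close()
print(value)
```

In this sketch every memory access of the software side becomes one request/response exchange, which is the point at which a platform model could account for the access on its bus or NoC.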

Figure 5.18 shows a screenshot taken during the simulation, which captures the execution of the two software stacks running on the ARM, respectively, DSP, and the SystemC simulation of the platform with the POT displaying the decoded image.

Fig. 5.18

MJPEG simulation screenshot

Using transaction-accurate simulation, in this book, three experiments are conducted with different communication schemes between the DSP and RISC. The results are summarized in Table 5.1. In the first scheme, the data exchange is made only via DXM. This generated 5,256,000 transactions to the DXM. The second communication scheme makes use of DXM and REG communication units between the processors and DMEM between the DSP and the POT. This generated 4,608,000 transactions to the DXM, 72,000 to the register, and 576,000 to the DMEM. The third case uses the SRAM as communication unit between the processors and DMEM between the DSP and POT and needs 4,680,000 transactions to the SRAM and 576,000 to the DMEM. One transaction to the memory means one read/write operation of one word (4 bytes) to the memory.
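As a quick consistency check on these figures, the short Python sketch below tallies the transactions of each scheme. The dictionary layout and the scheme labels are illustrative assumptions; only the counts and the 4-byte word size come from the text.

```python
# Transaction counts per communication scheme, taken from the text.
# The dictionary layout and the scheme labels are illustrative only.
WORD_BYTES = 4  # one transaction = one read/write of one 4-byte word

schemes = {
    "DXM only": {"DXM": 5_256_000},
    "DXM+REG+DMEM": {"DXM": 4_608_000, "REG": 72_000, "DMEM": 576_000},
    "SRAM+DMEM": {"SRAM": 4_680_000, "DMEM": 576_000},
}

totals = {name: sum(units.values()) for name, units in schemes.items()}

for name, total in totals.items():
    print(f"{name}: {total:,} transactions, {total * WORD_BYTES:,} bytes")
```

All three schemes add up to the same 5,256,000 transactions, so they move the same volume of data and differ only in which storage resources serve it.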

Table 5.1 Memory accesses

Starting from quantitative estimators provided by ATMEL Inc., the number of clock cycles needed by the ARM and DSP to access data buffers of length N words located in different memories can be estimated. The DMA engine of the DSP needs 14+(N–1) cycles for DXM read, 10+(N–1) for DXM write, 5+(N–1) for SRAM read, and 8+(N–1) for SRAM write. A data movement between REG and SRAM driven by the DSP core costs N/4 cycles plus a movement to/from the SRAM driven by the DMA engine. The ARM processor is not natively equipped with a DMA engine. The cost of isolated ARM accesses is 11×N for DXM read and 8×N for DXM write. By forcing the compiler to use the assembler instruction that moves blocks of 8 registers, the cost of a burst can be reduced to 11×(N/8)+N for DXM read and 2×N for DXM write. On the Diopsis tile, the ARM processor runs at a clock frequency that is twice that of the AMBA bus, whose clock is used as the unit of measure. This factor of 2 can be taken into account when estimating the time of ARM accesses to the SRAM. The DSP data memory can be accessed by the ARM in 6×(N/8)+N cycles for write and 8×N cycles for read.
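The estimators above can be transcribed into a few small functions, as in the following Python sketch. The function names and argument conventions are assumptions made for illustration, and N/8 is left unrounded because the text does not say how partial 8-word blocks are counted.

```python
# Cycle-cost estimators transcribed from the formulas in the text.
# n is the buffer length in words. Function names are illustrative, and
# n / 8 is left unrounded (how partial blocks are counted is not stated).

def dsp_dma_cycles(memory: str, op: str, n: int) -> int:
    """Cycles for the DSP DMA engine to move n words to/from DXM or SRAM."""
    setup = {("DXM", "read"): 14, ("DXM", "write"): 10,
             ("SRAM", "read"): 5, ("SRAM", "write"): 8}
    return setup[(memory, op)] + (n - 1)

def dsp_reg_sram_cycles(sram_op: str, n: int) -> float:
    """REG <-> SRAM move by the DSP core: n/4 cycles plus the DMA SRAM access."""
    return n / 4 + dsp_dma_cycles("SRAM", sram_op, n)

def arm_dxm_cycles(op: str, n: int, burst: bool = False) -> float:
    """ARM access to DXM, word by word or with 8-register block moves."""
    if burst:
        return 11 * (n / 8) + n if op == "read" else 2 * n
    return 11 * n if op == "read" else 8 * n

def arm_dmem_cycles(op: str, n: int) -> float:
    """ARM access to the DSP data memory (DMEM)."""
    return 6 * (n / 8) + n if op == "write" else 8 * n

n = 64
print(dsp_dma_cycles("DXM", "read", n))        # 14 + 63
print(arm_dxm_cycles("read", n))               # 11 * 64
print(arm_dxm_cycles("read", n, burst=True))   # 11 * 8 + 64
print(arm_dmem_cycles("write", n))             # 6 * 8 + 64
```

For a 64-word buffer, the sketch makes the gap visible: a DMA-driven DXM read costs far fewer cycles than an isolated ARM read of the same buffer, and the burst variant recovers a large part of that difference.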

The performance estimation results are summarized in Table 5.1. The overall number of cycles required for the communication using AMBA burst mode is approximately 8,856 k when all the data transfer is made via DXM; 7,884 k in the second case using REG, DXM, and DMEM storage resources; and 3,960 k in the third case using the SRAM and DMEM local memories. Thus, if the software code makes use of the existing hardware resources, an improvement in communication performance can be obtained. This improvement corresponds to 11% in the second communication mapping case and 55% in the third case. The communication protocol is specified in the initial Simulink model by annotating the communication units.
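The improvement percentages follow directly from the three cycle counts, as the following Python sketch shows. The labels are illustrative; the cycle counts (in thousands) are the ones quoted above, and the first mapping is taken as the baseline.

```python
# Communication cycles (in thousands) per buffer mapping, from the text.
cycles_k = {"DXM only": 8856, "REG+DXM+DMEM": 7884, "SRAM+DMEM": 3960}

baseline = cycles_k["DXM only"]
improvement = {
    name: 100.0 * (baseline - value) / baseline
    for name, value in cycles_k.items()
}

for name, pct in improvement.items():
    print(f"{name}: {pct:.1f}% fewer communication cycles than the baseline")
```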

5.6.2 H.264 Application on Diopsis R2DT

The transaction-accurate architecture of the Diopsis R2DT tile with Hermes NoC is illustrated in Fig. 5.19.

Fig. 5.19

Global view of the transaction-accurate architecture for Diopsis R2DT with Hermes NoC running H.264 encoder application

The tasks code is combined with the DwarfOS operating system and the implementation of the send_data(…)/recv_data(…) communication primitives to build each software stack running on the processors. The processors execute a single task on top of the operating system. The OS is required for the interrupt service routines and the application boot.

The hardware platform is comprised of the detailed three processor subsystems (ARM9-SS, DSP1-SS, and DSP2-SS), one global memory subsystem (MEM-SS), and the peripherals on tile subsystem (POT-SS). The different subsystems are interconnected through an explicit Hermes NoC available in torus and mesh topologies.

Figure 5.19 presents the transaction-accurate architecture of the Diopsis R2DT tile with NoC running the H.264 encoder application. The local architectures of each subsystem are detailed, including network interfaces, local bus, data memories and registers, abstract processor models, synchronization components, interrupt controller, or DMA engines.

The Hermes NoC at the transaction-accurate architecture level adds more architectural details such as topology, routing algorithm, and router buffer size. The Hermes NoC model is comprised of the same basic elements as at the virtual architecture level (network interface, mapping table, and routers) but with a more detailed implementation. Topology (e.g., mesh, torus), routing algorithm (e.g., pure XY, west first), arbiter algorithm (e.g., round robin, priority based), and buffer size (e.g., number of flits) can be varied. The packet structure in this model is comprised of destination address, size, and body fields, similar to the one assumed in the synthesizable NoC description. At the transaction-accurate architecture level, the Hermes NoC allows extracting information from the system communication architecture such as (i) the number of routing requests; (ii) the number of packets inserted into the NoC; (iii) the number of bytes exchanged; (iv) the average number of bytes per packet; (v) the number of packets transmitted; and (vi) the number of routing requests that failed due to NoC congestion.
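The packet fields and the six statistics listed above can be pictured with the following Python sketch. It is a hypothetical bookkeeping model, not the Hermes implementation: the class names, the callback names, and the choice of deriving the size field from the body length in bytes are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Packet:
    """Packet with destination and body; size is derived from the body."""
    destination: Tuple[int, int]  # address of the target network interface
    body: bytes

    @property
    def size(self) -> int:
        return len(self.body)  # assumption: size counted in bytes

@dataclass
class NocStats:
    """Counters matching items (i) to (vi) of the text (names are illustrative)."""
    routing_requests: int = 0
    failed_routing_requests: int = 0
    packets_inserted: int = 0
    packets_transmitted: int = 0
    bytes_exchanged: int = 0

    def on_insert(self, packet: Packet) -> None:
        self.packets_inserted += 1

    def on_routing_request(self, granted: bool) -> None:
        self.routing_requests += 1
        if not granted:
            self.failed_routing_requests += 1

    def on_delivery(self, packet: Packet) -> None:
        self.packets_transmitted += 1
        self.bytes_exchanged += packet.size

    @property
    def average_bytes_per_packet(self) -> float:
        if self.packets_transmitted == 0:
            return 0.0
        return self.bytes_exchanged / self.packets_transmitted

stats = NocStats()
for packet in (Packet((0, 0), bytes(8)), Packet((1, 1), bytes(24))):
    stats.on_insert(packet)
    stats.on_routing_request(granted=True)
    stats.on_delivery(packet)
stats.on_routing_request(granted=False)  # one request blocked by congestion
print(stats, stats.average_bytes_per_packet)
```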

At the transaction-accurate architecture level, the DMA components belonging to the DSP subsystems become explicit and have direct link to the interconnect component. Thus, the Hermes NoC for the Diopsis R2DT architecture requires seven access points: five for the different subsystems, as previously presented in the virtual architecture model, and two additional for the DMA components.

The different subsystems can be mapped over the NoC in different ways. The following paragraphs describe in detail an example IP core mapping scheme. In this first scheme, the network interfaces connect the following IP cores to the NoC:

  • The ARM9-SS is connected to the network interface with address 1×0.

  • The network interface with address 2×1 connects the DSP1-SS.

  • The network interface with address 1×1 connects the DMA of the DSP1-SS.

  • The network interface with address 1×2 connects the DMA of the DSP2-SS.

  • The network interface with address 2×2 connects the DSP2-SS to the NoC.

  • The network interface corresponding to the MEM-SS has address 0×0.

  • The network interface connecting the POT-SS has address 0×1.

The NoC was adopted in two topologies: mesh and torus. In both cases, the NoC has nine routers (3×3). Each router is connected to the corresponding network interface and the neighbor routers.

Figure 5.20 shows the NoC employing a 2D mesh topology, a pure XY routing algorithm, and a round-robin arbiter algorithm at each router and wormhole as packet-switching strategy.

Fig. 5.20

Hermes NoC in mesh topology at transaction-accurate level
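Pure XY routing on the 3×3 mesh can be sketched in a few lines of Python using the first IP core mapping scheme. Two assumptions are made here: an address written a×b is read as the coordinate pair (x, y), and the labels DSP1-DMA and DSP2-DMA are shorthand for the DMA components of the two DSP subsystems. The function name xy_route is illustrative.

```python
# First mapping scheme from the text; "a×b" is read as (x, y), an assumption.
SCHEME_A = {
    "MEM-SS": (0, 0), "POT-SS": (0, 1), "ARM9-SS": (1, 0),
    "DSP1-DMA": (1, 1), "DSP2-DMA": (1, 2),
    "DSP1-SS": (2, 1), "DSP2-SS": (2, 2),
}

def xy_route(src, dst):
    """Return the routers visited from src to dst: X direction first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

route = xy_route(SCHEME_A["DSP2-SS"], SCHEME_A["MEM-SS"])
print(route, "hops:", len(route) - 1)
```

Under these assumptions, a packet from the DSP2-SS to the MEM-SS crosses five routers (four hops), whereas the ARM9-SS reaches the MEM-SS through two routers only.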

Table 5.2 shows the results captured during the transaction-accurate architecture mesh model simulation in case of the H.264 encoder application. The first and the second columns represent the correspondence between the different subsystems and the NoC access points. A routing request is performed at least once per packet in each router that the packet crosses. Depending on the application, the NoC structure, the routing algorithm, and the NoC congestion state, a routing request can be repeated as many times as needed inside a router. For the H.264 encoder simulation with 10-frame QCIF YUV 420 format, 96,618,508 routing requests were issued. The third column of Table 5.2 presents the percentage of the routing requests at each router, while the other columns detail this information for each router port (local to the corresponding network interface, north, south, east, or west). These results were captured in the case of mapping all the communication buffers onto the external memory.

Table 5.2 Mesh NoC routing requests

Figure 5.21 shows the amount of data that traverses each router in the mesh NoC for the H.264 encoder application when the external memory is used for the communication between the processors. The local port of each router inserts packets into the NoC, while the remaining ports transfer them inside the NoC. The value assigned to the local port of the router 0×0 (MEM-SS) corresponds to response packets due to read requests or confirmation packets due to write requests. Block transfers, in which several operations are carried in one packet, reduce the amount of control data and thus optimize the amount of data exchanged inside the NoC.

Fig. 5.21

Total kilobytes transmitted through the mesh

In the second topology, the adopted NoC was a 2D torus using a deadlock-free version of the non-minimal west-first routing algorithm proposed by Glass and Ni [60]. Figure 5.22 presents the Hermes 3×3 torus NoC.

Fig. 5.22

Hermes NoC in torus topology at transaction-accurate level

The H.264 encoder simulation with 10-frame QCIF YUV 420 format using the torus NoC topology involved 78,217,542 routing requests, a reduction of 19% compared with the mesh NoC. This was possible because the longest minimal paths in a 2D torus are only half as long as those in a 2D mesh. Also, torus networks have better path diversity than meshes, which, if exploitable by the routing algorithm, leads to diminished network congestion, thus reducing routing requests.
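Both claims can be checked numerically with the following Python sketch, which computes the longest minimal path (the diameter) of a 3×3 mesh and a 3×3 torus and the relative reduction in routing requests. The helper names are illustrative, and the hop counts assume minimal paths, whereas the west-first algorithm used in the torus is non-minimal and may take longer routes.

```python
from itertools import product

def hops(a, b, size, torus):
    """Minimal hop count between routers a and b on a size x size grid."""
    total = 0
    for p, q in zip(a, b):
        d = abs(p - q)
        # Wraparound links of the torus allow going the short way round.
        total += min(d, size - d) if torus else d
    return total

def diameter(size, torus):
    """Longest minimal path over all router pairs."""
    nodes = list(product(range(size), repeat=2))
    return max(hops(a, b, size, torus) for a in nodes for b in nodes)

mesh_diameter = diameter(3, torus=False)
torus_diameter = diameter(3, torus=True)

# Routing request totals reported in the text for the two topologies.
mesh_requests, torus_requests = 96_618_508, 78_217_542
reduction_pct = 100.0 * (mesh_requests - torus_requests) / mesh_requests

print(mesh_diameter, torus_diameter, f"{reduction_pct:.1f}%")
```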

Table 5.3 presents these results. The first columns represent the correspondence between the IP cores and the network interfaces, while the others show the distribution of the routing requests along the local, north, south, east, and west ports of each router. The results were captured in the case of mapping all the communication buffers onto the external memory.

Table 5.3 Torus NoC routing requests

Table 5.4 sums up the amount of data transferred through the torus NoC during the H.264 encoder simulation. The third column of the table represents the amount of data and control information exchanged (e.g., operation request, confirmation response). The other columns of the table show the amount of data transmitted per router port.

Table 5.4 Torus NoC amount of transmitted data (bytes)

Figure 5.23 shows a screenshot captured during the simulation of the H.264 encoder running on the Diopsis R2DT architecture with torus NoC.

Fig. 5.23

Simulation screenshot of H.264 encoder application running on Diopsis R2DT with torus NoC

To analyze the communication performance, the AMBA bus was also evaluated as the global interconnect instead of the Hermes NoC. The average throughput required of the interconnect component to execute the H.264 encoder in real time (25 frames/s) was 235 MB/s for the NoC and 115 MB/s for the AMBA bus.
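Dividing these throughput figures by the frame rate gives the average interconnect traffic per encoded frame, as in the short Python sketch below. The variable names are illustrative, and the per-frame interpretation is an assumption derived only from the two figures and the 25 frames/s rate.

```python
FRAME_RATE = 25  # frames per second for real-time H.264 encoding

# Average interconnect throughput in MB/s, as reported in the text.
throughput_mb_s = {"Hermes NoC": 235, "AMBA bus": 115}

per_frame_mb = {name: rate / FRAME_RATE for name, rate in throughput_mb_s.items()}
ratio = throughput_mb_s["Hermes NoC"] / throughput_mb_s["AMBA bus"]

for name, mb in per_frame_mb.items():
    print(f"{name}: {mb:.1f} MB per frame")
print(f"NoC/bus throughput ratio: {ratio:.2f}")
```

The NoC figure is thus roughly twice the bus figure for the same application.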

The NoC allows various mapping schemes of the IPs over the NoC, with different impacts on performance. In this book, two different mappings of the IP cores over the mesh and torus NoC are evaluated: scheme A, detailed in the previous paragraphs, and scheme B, with the MEM-SS connected to the network interface with address 1×1 (both x- and y-coordinates are 1). Figure 5.24 summarizes the correspondence between the network interface and the IP core in case of these two IP mapping schemes.

Fig. 5.24

IP core mapping schemes A and B over the NoC

Table 5.5 presents the results of the transaction-accurate simulation: estimated execution cycles of the H.264 encoder, the simulation time using the different interconnect components on a PC running at 1.73 GHz with 1 GB RAM, and the total routing requests for the NoC. These results were evaluated for the two considered IP mapping schemes shown in Fig. 5.24 (A and B) and for three communication buffer mapping schemes: DXM+DXM+DXM, DMEM1+DMEM2+SRAM, and DMEM1+SRAM+DXM. The AMBA bus had the best performance, as it required the fewest clock cycles during the execution for all the communication mapping schemes. The mesh NoC had the worst performance when all the communication buffers were mapped onto the DXM and a performance similar to that of the torus when local memories were used.

Table 5.5 Execution and simulation times of the H.264 encoder for different interconnect, communication, and IP mappings

This is explained by the small number of subsystems interconnected through the NoC. In fact, NoCs are very efficient in architectures with more than 10 interconnected IP cores, while their performance can be comparable with that of the AMBA bus in less complex architectures. Between the NoCs, the torus has better path diversity than the mesh. Thus, the torus reduces network congestion and decreases the routing requests. Also, scheme A of IP cores mapping provided better results than scheme B for the DMEM1+DMEM2+SRAM buffer mapping. For the other buffer mappings the performance of scheme A was superior to scheme B. In fact, the ideal IP cores mapping scheme would have the communicating IPs separated by only one hop (number of intermediate routers) over the network to reduce latency.

Compared with the virtual architecture, the transaction-accurate interconnects fully implement the bus or the NoC protocol, respectively, and thus provide more accurate characteristics. Consequently, simulating the transaction-accurate interconnects takes longer than simulating the virtual architecture. At both design steps, the NoC needs more time for the application simulation than the bus because of its higher complexity.

5.7 State of the Art and Research Perspectives

5.7.1 State of the Art

Current literature offers a large set of references dealing with transaction-accurate architecture design and native software execution on an abstract hardware platform.

ChronoSym [11] presents a fast and accurate SoC co-simulation that allows verification of the integration of the tasks code with the operating system. It is based on an OS simulation model and annotation of the software with execution delays. The abstract execution model of the processors in the transaction-accurate architecture presented in this book is similar to the timed bus functional model used in the ChronoSym approach, but it is not annotated for accurate estimation.

Reference [25] presents an abstract simulation model of the processor subsystem. In this work, the processor subsystem is not defined as a set of hardware components, but it is viewed from a software point of view. Thus, the processor subsystem is made of execution, access, and data unit elements to allow early validation of the MPSoC architecture and native time-accurate simulation of the software.

Reference [56], based on the work described in [23], summarizes a hardware–software interface modeling approach in SystemC at the transaction-accurate architecture level. This work uses the concept of required and provided services in the modeling of the hardware–software interfaces. The hardware–software interface is assembled using software, hardware, and hybrid elements.

Reference [74] illustrates a configurable event-driven virtual processing unit (VPU) to capture timing behavior of multiprocessor multithreaded platforms through flexible timing annotation. The VPU enables investigation of the mapping of the application tasks with respect to time and space and early design space exploration.

Reference [138] deals with abstract modeling of the embedded processors using TLM. This work develops a high-level abstract processor model that allows fast simulation, acceptable accuracy in simulated timing, and exposing the structure of the software architecture (e.g., drivers and interrupts). This approach is similar to the abstract execution model of the processor belonging to the transaction-accurate architecture.

Reference [19] details the Synopsys System Studio design tool that allows a SoC design flow from system level to implementation by passing through several abstraction levels. One of the intermediate refinement steps corresponds to the development at the platform level, which represents a TLM platform of the hardware that allows starting the development of the software. The software development itself uses a specific development and simulation kernel such as RTLinux, together with an interface layer to the virtual processors on the platform.

References [67, 68] present a simulation model of μTRON-based RTOS kernels in SystemC. They developed a library of APIs that supports preemption, task priority assignment, or scheduling RTOS services by native execution and a SystemC wrapper to encapsulate the OS simulation model into the bus functional model (BFM) of the hardware platform. Their approach is similar to the presented approach, but they do not give details on the hardware side.

Reference [143] presents a communication design flow based on automatic TLM model generation. They allow generation and refinement of bus-based communication architectures, including bus bridges and transducers. But they do not address software code adaptation to specific communication protocol implementation, in order to optimize the overall communication performance.

Reference [77] proposes a hardware procedure call (HPC) protocol to abstract the platform-dependent details of the TLM communication between the different subsystems, by providing an additional layer for the software modeling on top of transaction-level models.

5.7.2 Research Perspectives

The most important research perspective regarding the transaction-accurate architecture design consists of annotating the software code with execution delays for accurate software performance estimation and annotating the hardware code for accurate communication architecture performance estimation. This could be achieved by applying an approach similar to the timed bus functional model used in ChronoSym [11].

Another research perspective is the automatic generation of the transaction-accurate architecture. The generation could be made possible by applying a service-based modeling of the hardware–software interface as described in [56]. The composition of the services eases the work of the automatic generation tools and reduces design time. The generation can be performed from the system architecture or virtual architecture. Generation from the system architecture enables generation of different detail levels from the same specification (virtual architecture, transaction-accurate architecture, and virtual prototype). Generation from the virtual architecture enables gradual refinement of the hardware/software architecture based on the performance estimation performed at this level.

Another proposed research perspective concerns the design, at the transaction-accurate architecture level, of more complex multi-tile architectures such as Tile64 [157] or AM2000 [4] running massively parallel applications.

5.8 Conclusions

This chapter defined the transaction-accurate architecture design. It presented the software organization as final application tasks code running upon a real-time OS and the hardware organization in detailed subsystems interconnected through an explicit network component.

The transaction-accurate architecture design was performed using SystemC for three case studies: token ring mapped on the 1AX architecture, Motion JPEG running on the Diopsis RDT architecture, and H.264 encoder running on the Diopsis R2DT architecture.

The simulation of the transaction-accurate architecture model made it possible to verify the integration of the final application tasks code with an OS and communication software adapted to the synchronization protocol. It also gave more precise information on the interconnect model. This includes the number of conflicts in the global bus, the amount of NoC congestion, the number of bytes transmitted through the bus or NoC, the number of routing requests, the number of times some routers failed to transmit a packet due to conflicts inside the NoC, and the average number of bytes per packet.

The transaction-accurate architecture design also allows exploration of different IP cores mapping over the NoC in order to analyze their impact on the overall performances.