This is a self-archived, parallel-published version of this article in the publication archive of the University of Vaasa. It might differ from the original.

Author(s): Hautala, Ilkka; Boutellier, Jani; Silvén, Olli
Title: TTADF: Power Efficient Dataflow-Based Multicore Co-Design Flow
Year: 2019
Version: Accepted manuscript
Copyright: Institute of Electrical and Electronics Engineers (IEEE)

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Please cite the original version: Hautala, I., Boutellier, J., & Silvén, O. (2019). TTADF: Power Efficient Dataflow-Based Multicore Co-Design Flow. IEEE Transactions on Computers, online 27 April, 1–14. https://doi.org/10.1109/TC.2019.2937867

TTADF: Power Efficient Dataflow-Based Multicore Co-Design Flow

Ilkka Hautala, Jani Boutellier, Senior Member, IEEE, and Olli Silvén, Member, IEEE

Abstract—The era of mobile communications and the Internet of Things (IoT) has introduced numerous challenges for mobile processing platforms that are responsible for increasingly complex signal processing tasks from different application domains. In recent years, the power efficiency of computing has been improved by adding more parallelism and workload-specific computing resources to such platforms. However, programming of parallel systems can be time-consuming and challenging if only low-level programming methods are used. This work presents TTADF, a dataflow-based co-design framework that reduces the design effort of both the software and the hardware of mobile processing platforms. The paper presents three application examples from the fields of video coding, machine vision, and wireless communications. The application examples are mapped and profiled both on a pipelined and a shared-memory multicore platform generated by TTADF. The results of the TTADF co-design-based solutions are compared against previous manually created designs and a recent dataflow-based design flow, showing that TTADF provides very high energy efficiency together with a high level of automation in software and hardware design.

Index Terms—Dataflow, application specific processor, design flow, low power.

I. Hautala and O. Silvén are with the Center for Machine Vision and Signal Analysis Research Group, University of Oulu, Oulu, Finland. E-mail: {ilkka.hautala, olli.silven}@oulu.fi
J. Boutellier is with the School of Technology and Innovations, University of Vaasa, Finland, and with the Faculty of Information Technology and Communications, Tampere University, Finland. E-mail: jani.boutellier@univaasa.fi
Manuscript received –, –, 2018; revised –, –, 2019.

1 INTRODUCTION

Rapidly evolving technology has been shrinking the time-to-market window of mobile software and computing platforms. To this extent, the design time should be as short as possible and the lifetime of a design should be maximized [1]. To meet these requirements, automated design flows are needed, and the resulting system should be programmable by high-level programming languages, enabling fixes and inclusion of additional functionality after deployment. Consequently, software has a significant role in contemporary mobile computing platforms.

Advances in semiconductor manufacturing processes have continuously increased the efficiency of programmable processing platforms. Therefore, software programmable off-the-shelf digital signal processors (DSPs) and general purpose processors (GPPs) have increasingly replaced application specific integrated circuits (ASICs) in mobile computing platforms.
However, application requirements, including power consumption, throughput, and latency, have led to the trend where multiprocessor Systems-on-Chip (MPSoC) have become mainstream [2]. A heterogeneous MPSoC can integrate, for example, a multicore GPP, a graphics processing unit (GPU), DSPs, a field programmable gate array (FPGA) and multiple ASICs or application-specific instruction-set processors (ASIPs) for wireless communications, computer vision, and security.

ASIPs have been used for applications where low power consumption and high performance are essential, but programmability is also needed [3]. The energy efficiency of ASIPs has usually been achieved by tailoring the instruction set for a single application domain, which enables simplifying the architecture compared to GPPs. Compared to ASICs, ASIPs have a time-to-market advantage thanks to programmability and highly automated ASIP design tools [4]. ASIP design tools commonly include a retargetable compiler, which automatically adapts to the ASIP architecture and enables machine code generation from a high-level language without compiler redesign.

Programming of MPSoCs is challenging due to the lack of high-productivity programming environments that can compile application descriptions into useful implementations for parallel and heterogeneous target platforms [5]. Also, existing legacy code has hindered the adoption of more intuitive and suitable software development methods for these platforms. Utilizing the potential of MPSoCs is a complex task requiring correct specification, implementation, decomposition, and mapping of applications [5].

Dataflow [6], a well-known and widely used programming paradigm for expressing the functionality of signal processing and streaming applications, has been proposed as a next-generation programming solution for heterogeneous MPSoCs [7] [8] [9] [10] [11]. Dataflow programming is an intuitive approach for describing the parallelism of application software, while at the same time increasing its modularity, flexibility and re-usability.

In this paper, the dataflow-based design framework for transport triggered architectures (TTADF¹) is presented. The main contributions of this work are as follows:

• TTADF integrates ASIP development and C-language based dataflow programming for the design of power-efficient and high-performance multicore ASIPs and their software.
• TTADF has tools for both rapid and accurate simulation of multicore ASIPs and their software.
• The outcome of the TTADF design flow is a synthesizable register transfer level (RTL) description of the designed platform and executable program binaries.
Hardware synthesis for an FPGA or for a standard cell library can be done using existing Electronic Design Automation (EDA) tools.
• Comprehensive experiments confirm high energy efficiency. The example applications come from three different domains and include new dataflow representations of High Efficiency Video Coding (HEVC) in-loop filters and stereo depth estimation.

1. The source code of TTADF is available for download at http://github.com/ithauta/ttadf

This paper is organized as follows: Section 2 presents the theory of the dataflow paradigm and the Transport Triggered Architecture, whereas Section 3 reviews related work. In Section 4, implementation details of the proposed TTADF framework are described. In Section 5, experimental results of the designs produced using the proposed framework are shown, and these are compared to related work in Section 6. Finally, a brief discussion and the conclusions of the achieved results can be found in Sections 7 and 8.

2 BACKGROUND

First, we present a brief introduction to transport triggered architectures, followed by the basic concepts of dataflow programming.

2.1 Transport Triggered Architecture

Transport Triggered Architecture (TTA) processors resemble VLIW processors concerning the fetching, decoding and execution of multiple instructions each clock cycle, and by providing instruction level parallelism [12]. Fig. 1 shows the fundamental differences between the datapaths of TTA and VLIW architectures. In the case of TTA, each FU and RF is connected to the bypass network, whereas in VLIW each FU is directly connected to the RF.

Fig. 1. Comparison between TTA and VLIW architectures

The fully exposed datapath of TTAs allows direct control of data transfers by the programmer (or compiler), in contrast to VLIWs that are programmed by operations. In TTA, operations take place as side effects of data transfers, controlled by move instructions; move is the only instruction of TTA processors. Using move, data is transferred between function units (FU) and register files (RF) via the bypass network, which can consist of multiple transport buses.

FUs are logic blocks that implement different operations, such as additions or multiplications. Depending on the set of operations it implements, an FU has one or more input ports and registered output ports. In every FU, one input port is a triggering port, and a data move to it triggers an operation of the FU in question. If an operation has multiple operands, it is assumed that all the other operands are transferred to FU input ports before or at the same time as the operand that is written to the trigger port. After triggering an operation, the operation result can be moved from an FU output port to the input ports of one or more FUs/RFs, which makes TTAs exposed-datapath processors and enables many compiler optimizations.

Fig. 2. A simple TTA processor

Fig. 2 presents a simple TTA processor, which has three transport buses (black horizontal lines), a load-store unit (LSU), an adder, a multiplier, and one RF. Small squares are FU ports, and a cross inside a square indicates the triggering port. The instruction fetch and decode unit is responsible for loading one instruction per bus from the instruction memory and executing them. Three transport buses enable executing three instructions in parallel each clock cycle.
For example, in one clock cycle, we can simultaneously move a result from the LSU to an input port of the adder unit using bus 0, move the multiplier result to the adder's triggering port using bus 1, and move the result of the previous add operation to the RF using bus 2. The previous example can be represented by one TTA assembly instruction word that executes in a single clock cycle:

    LSU.out1 -> ADD.in2, MUL.out1 -> ADD.in1t.add, ADD.out1 -> RF.1;

FUs and RFs are connected to the transport buses via sockets (rectangular vertical blocks in Fig. 2), and the arrows above the sockets indicate the input/output direction of a socket. Black dots on sockets indicate where FUs/RFs are connected to the transport buses.

Similar to many DSPs, TTAs use the Harvard architecture, which has separate program and data memories. Moreover, TTAs can have multiple data memories, each of them appearing as a different address space to the programmer.

Implementing full context switch support for TTAs is regarded as infeasible due to the high number of registers within (pipelined) TTA FUs. Hence, interrupts and pre-emptive multitasking are commonly unsupported, and TTAs are used as slave co-processors of a master GPP that runs control-intensive software, such as the operating system [13].

TTA processors can be designed using the open source TTA-based Co-design Environment (TCE) [14]. By using the TCE compiler tcecc, programs written in high-level programming languages (C/C++, OpenCL) can be compiled to TTA machine code. Tcecc is a retargetable compiler that adapts itself to the designed architecture and automatically takes advantage of added processing resources, which allows fast experimentation with different processor designs and design space exploration. TCE also offers a cycle-accurate simulator for program execution analysis on a given TTA processor. Moreover, a synthesizable RTL description of a given processor design can be created using the TCE processor generator tool. TCE currently supports the design and simulation of single-core and GPU-style data-parallel multicores. TTADF builds on TCE and adds a design and simulation framework for task-parallel streaming applications.

2.2 Dataflow paradigm

The dataflow programming paradigm is used in this work for top-level application descriptions. The dataflow programming paradigm started forming at the end of the 1960s, inspired by highly parallel computer architectures [15]. Dataflow applications are represented as directed graphs, which consist of actors as vertices and unidirectional first-in-first-out (FIFO) channels as edges. The data, transferred between actors via channels, is quantized into tokens of arbitrary size. A dataflow description of an application improves modularity and re-usability and exposes the application's concurrency, which simplifies distributing the application execution across multiple processing elements.

Several different formal dataflow Models of Computation (MoC) have been proposed, including synchronous dataflow [16] (SDF) and dataflow process networks [17] (DPN). Dataflow MoCs can be classified as static or dynamic depending on whether data can affect the behavior of actors. In a static MoC, the token production and consumption rates of actors are statically defined, whereas in dynamic MoCs production and consumption rates can be data-dependent. Static MoCs enable more compile-time optimizations than dynamic MoCs and can therefore lead to more efficient machine code. However, the limited expressiveness of static MoCs reduces the set of applications that can be described, compared to dynamic MoCs. Thus, different dataflow MoCs offer a tradeoff between efficiency and expressiveness.
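As a concrete illustration of what a static MoC fixes at compile time (a textbook SDF example, not taken from the paper), consider a single edge from actor a to actor b, where a produces p tokens per firing and b consumes c tokens per firing. A schedule that returns the FIFO to its initial state must satisfy the balance equation

\[ r_a \, p = r_b \, c , \]

where r_a and r_b are the numbers of firings of a and b per schedule iteration. For p = 2 and c = 3, the smallest solution is r_a = 3 and r_b = 2, so the compiler can fix the schedule (a, a, a, b, b) and the required FIFO capacity entirely at compile time. A dynamic MoC such as DPN offers no such guarantee, which is exactly the efficiency/expressiveness tradeoff described above.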
The dataflow framework used in this work is designed to support the DPN MoC [17], which is highly dynamic. However, it also allows using more restricted MoCs such as SDF.

3 RELATED WORK

In [18], Park et al. review different MPSoC design methods and categorize them into four approaches: compiler-based, language-extension (OpenMP, OpenCL), model-based and platform-based approaches. In this paper we focus on the model-based approach, where the designer utilizes a MoC for implementing applications. Several works are based on the dynamic dataflow programming paradigm, and some of them also target TTA architectures. The works most relevant to this paper are briefly presented below.

Ptolemy [19], a simulation and prototyping framework for heterogeneous systems, was the pioneering software development framework for dataflow-based design. In addition to dataflow, Ptolemy also supports a variety of non-dataflow MoCs.

Orcc [20] is a recent open source dataflow development environment that is based on the DPN MoC. Orcc applications are written using the RVC-CAL language, from which software and hardware code is generated. By using Orcc, various video coding applications such as High Efficiency Video Coding have been implemented for various target platforms [21] [22] [23].

Distributed Application Layer (DAL) [24] is a scenario-based design flow, which can map Kahn Process Network (KPN) applications onto heterogeneous many-core systems. Differing from many other frameworks, DAL offers support for OpenCL-capable platforms, which enables the use of GPUs [25].

PRUNE [11] is an open source framework for the design and execution of dynamic and decidable dataflow applications on heterogeneous platforms. The PRUNE framework defines its own dynamic MoC, which is developed specifically for the requirements of heterogeneous platforms. PRUNE allows the execution of dynamic actors on OpenCL platforms, which is not possible in DAL.

Dardaillon et al. [10] proposed an LLVM-based compilation flow that can compile parameterized dataflow graphs for a heterogeneous MPSoC platform. They use actor-based C++ for expressing dataflow graphs. Their framework is dedicated to software-defined radio applications that require parallel processing and fast dynamic reconfiguration.

In [9], Bezati et al. introduce a tool that compiles RVC-CAL dataflow programs into RTL descriptions, targeting MPSoCs. They translate the RVC-CAL description to C and feed it to the Xilinx Vivado High-Level Synthesis compiler with automatically generated constraints and directives to obtain RTL descriptions.

PREESM [26] is a dataflow-based framework for multicore DSP programming. PREESM exploits the Parameterized and Interfaced Synchronous Dataflow (PiSDF) MoC. The behavior of PREESM actors is described using the C language, while XML is used for describing the target architecture, the algorithm graph, and the scenario. In [26], PREESM is used to execute a stereo image matching application [27].

In [28] and [8], the authors present automatic synthesis of TTA processor networks from dynamic dataflow programs using the Orcc framework. They propose a design flow where the RVC-CAL dataflow language is used to describe an actor network.
Orcc [20] compiles the actor network into LLVM (Low Level Virtual Machine) [29] assembly code in the case of [8], or into C code in the case of [28]. In both works, TCE is then used to generate RTL descriptions of the processor cores and the machine code for each core. In both works, a dedicated TTA processor is generated for each actor, and inter-processor communication is implemented via hardware FIFOs between the TTA cores. In [8], the authors define three TTA processor configurations, standard, custom and huge, with different numbers of function units and transport buses. Finally, the work of [28] is generalized in [30] by allowing an arbitrary number of actors to be mapped to a TTA core.

Similarly, Yviquel et al. [23] refine the basic ideas presented in [8] by introducing a hybrid memory architecture designed for dataflow programs. Instead of using hardware FIFOs for inter-processor communication, the authors exploit shared memories. As in [28], RVC-CAL is used as the input language, which is first transformed into the LLVM intermediate representation and then into binary code suitable for the target TTA processor. In [23], Yviquel et al. demonstrate their work by implementing HEVC and MPEG-4 video decoders on top of custom, fast and huge TTA processors. Currently, their work is a part of Orcc [20], and it is named the Orcc TTA Backend.

4 TTA DATAFLOW FRAMEWORK

This section describes the proposed TTADF framework. First, a brief overview of the framework is given, followed by a high-level description of its usage. Finally, the central framework components are explained in detail.

In the proposed framework, TTAs are used as co-processors, which communicate with the host processor using shared memory, as shown in Fig. 3. The TTA cores and the memories connected to them form the TTA co-processing system. The host architecture encompasses all other processing cores, which have a memory address space in common with the TTA co-processing system.

Fig. 3. An example of a TTA co-processing system

By using TTADF, the designer can specify both the host architecture and the TTA co-processing system and synthesize a unified dataflow program for the whole host/co-processing ensemble. A dataflow program that is described using the TTADF API can be executed on numerous different target platforms without modifications to the software code, by merely switching the architecture description. TTADF also allows automatic RTL generation of the TTA co-processing system, which can be imported to EDA tools for hardware synthesis.

4.1 Design flow

The proposed design flow is presented in Fig. 4, and it starts with the application design step (Actor Descriptions and Actor Network items). As explained in Section 2.2, a dataflow application consists of actors and FIFO communication channels between them. In the proposed framework, the actor network is the top-level description of the dataflow application; it defines the actors and the FIFO connections, creating the application datapath.

Fig. 4. The proposed TTADF design flow
The detailed behavior of each actor is defined in its actor description.

In the proposed framework, the system architecture file defines all computational resources available for the execution of the software. Similarly as the actor network defines the connections between different actors, the system architecture file defines the processing resources and their interconnections. The system architecture file can refer to a processor definition, which is the detailed specification of a TTA processor architecture.

These five description files are the inputs of the TTADF Compiler, which analyses the inputs and generates a consistent description of the software and the hardware. Using that description, the compiler creates the necessary inputs for the TCE tools to produce TTA machine code and processor RTL descriptions. The TTADF compiler also produces C++ and SystemC simulation code, which can be compiled using GCC. The SystemC simulation model is cycle-accurate, whereas the C++ simulation model uses a simplified memory model for faster simulation. TTADF generates all VHDL models needed for the synthesis of the TTA co-processing system, as well as a SystemC testbench to ease simulation and verification.

4.2 Actor network

The dataflow application is specified using the actor network file, which lists all actors and FIFOs used in the dataflow software. Fig. 5 presents the actor network of a simple dataflow program that generates numbers (Source), multiplies them by a constant factor (CMultiply) and then prints the values (Sink). The corresponding actor network description of the constant multiply application is presented in Listing 1 as pseudocode.

Fig. 5. The constant multiply dataflow application example

Each actor is defined by its name and has a behavior description written in C. Actor descriptions, including the number of input and output ports, can be parameterized. For example, in Listing 1 the constant FACTOR is defined for the actor CMultiply. The FIFO descriptions of the actor network define the token size (in bytes) and the token capacity of each FIFO.

    NETWORK multiply
      DEFINE parallelism=1

      ACTOR source_0
        MODELNAME source
        MODELSOURCE actors/source.c
        PARAMETER PARALLELISM=parallelism
        GENERATE i TO parallelism
          OUTPUTPORT port_$i
        END GENERATE
        STOPNETWORK 1
      END ACTOR

      GENERATE i TO parallelism
        ACTOR cmultiply_$i
          MODELNAME cmultiply
          MODELSOURCE actors/cmultiply.c
          PARAMETER FACTOR=2
          INPUTPORT in_port
          OUTPUTPORT out_port
        END ACTOR
      END GENERATE

      ACTOR sink_0
        ...
      END ACTOR

      GENERATE i TO parallelism
        FIFO fifo1_$i
          TOKENSIZE 4
          CAPACITY 10
          SOURCE source_0 port_$i
          TARGET cmultiply_$i in_port
        END FIFO
        FIFO fifo2
          ...
        END FIFO
      END GENERATE
    END NETWORK

Listing 1. An example of a parameterized actor network description of the constant multiply dataflow program

For each FIFO, the source and target ports are determined using an actor id and a port name. Parameterization can be used for fast adaptation of the actor network to the system architecture. Considering Fig. 5 as an example, the degree of parallelism can be adjusted by adding more input and output ports to the Source and Sink actors, and by replicating the CMultiply actor. To implement this, the TTADF compiler supports a compiler pragma for static code generation, presented in Listing 1.
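To make the pragma's effect concrete, the following sketch shows how the CMultiply part of Listing 1 could expand with parallelism=2, assuming the generator iterates i over 0 ... parallelism-1 as the C-style generator bounds used later in Listing 2 suggest (the exact expansion rules are not spelled out in the paper):

      ACTOR cmultiply_0
        MODELNAME cmultiply
        MODELSOURCE actors/cmultiply.c
        PARAMETER FACTOR=2
        INPUTPORT in_port
        OUTPUTPORT out_port
      END ACTOR
      ACTOR cmultiply_1
        ...
      END ACTOR

      FIFO fifo1_0
        TOKENSIZE 4
        CAPACITY 10
        SOURCE source_0 port_0
        TARGET cmultiply_0 in_port
      END FIFO
      FIFO fifo1_1
        ...
      END FIFO

The Source actor correspondingly gains the output ports port_0 and port_1, so the same actor network file scales from one to many CMultiply instances by changing a single DEFINE.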
4.3 Actor description

TTADF provides a high-level template that has to be used for describing actors, as well as an API (the TTADF API) for implementing framework-specific behavior such as FIFO I/O. The detailed behavior of actors is described using the C language, which also allows the use of legacy code and C compiler tools. In the case of TTA processors, using C gives the designer the possibility of efficiently using the special function units (SFUs) of TTA processors through TCE C macro calls for custom operations.

As can be seen from Listing 2, which shows the Source actor, the actor description file contains four different structural elements that are defined by the actor description name and source code:

• ACTORSTATE is a structural element that contains information about the actor state. The designer defines all actor state variables that need to be maintained between actor firings in this structure.
• INIT is a function that is executed once by the TTADF runtime before the actor is fired for the first time. The function is useful, for example, for opening input and output file streams.
• FIRE is an element that defines a function that is executed whenever the actor is fired. A pointer to the ACTORSTATE structure is passed to the FIRE function so that the actor can preserve its state between firings.
• FINISH is an element that defines a function that is called once when the execution of the actor network is terminated.

    // Library includes here
    #include

    ACTORSTATE source {
        // Actor state variable declarations
        int number;
        #PRAGMA GENERATE(i=0, i++, i<PARALLELISM)
        TTADF_PORT_VAR("port_$i", data_$i, "int");
        #PRAGMA END GENERATE
    }

    INIT source(source_STATE *state) {
        // All initialization setup here
        state->number = 0;
    }

    FIRE source(source_STATE *state) {
        #PRAGMA GENERATE(i=0, i++, i<PARALLELISM)
        state->number++;
        TTADF_PORT_WRITE_START("port_$i", state->data_$i);
        *state->data_$i = state->number;
        TTADF_PORT_WRITE_END("port_$i");
        #PRAGMA END GENERATE
        // After 12 generated numbers,
        // stop execution of the actor network
        if (state->number > 12) {
            TTADF_STOP();
        }
    }

    FINISH source(source_STATE *state) {
        // Cleanup things here
        return 0;
    }

Listing 2. The behaviour model of the Source actor

A similar INIT / FIRE / FINISH approach has also been used in other dataflow-flavored frameworks (e.g. [24]). Inside these three functions it is possible to use TTADF API function calls or macros. The TTADF API offers basic FIFO I/O operations, including reading, writing, peeking and querying the number of tokens in a FIFO. Pragma commands for static code generation are supported in the actor description source files, enabling parameterized actor specification.
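Listing 2 only exercises the write side of the TTADF API. For the consumer side, a CMultiply actor could look roughly as follows; note that the read macros below are assumed by symmetry with the write macros of Listing 2, and the visibility of the network parameter FACTOR as a preprocessor constant is likewise an assumption, not confirmed by the paper:

    ACTORSTATE cmultiply {
        TTADF_PORT_VAR("in_port",  in_data,  "int");
        TTADF_PORT_VAR("out_port", out_data, "int");
    }

    INIT cmultiply(cmultiply_STATE *state) {
        // Nothing to initialize for this actor
    }

    FIRE cmultiply(cmultiply_STATE *state) {
        // Request the oldest token from the input FIFO
        // (hypothetical read counterpart of the write macros)
        TTADF_PORT_READ_START("in_port", state->in_data);
        // Request free space from the output FIFO and write the
        // result directly into it, avoiding an intermediate copy
        TTADF_PORT_WRITE_START("out_port", state->out_data);
        *state->out_data = FACTOR * (*state->in_data);
        // On a TTA core with a suitable SFU, the multiplication
        // could instead be issued via a TCE custom-operation macro
        TTADF_PORT_WRITE_END("out_port");
        TTADF_PORT_READ_END("in_port");
    }

    FINISH cmultiply(cmultiply_STATE *state) {
        return 0;
    }

The copy-free style of processing a token directly through the returned buffer pointers matches the FIFO access pattern described in Section 4.5.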
4.4 System architecture model

This work proposes a system architecture model (SAM) to specify the ensemble of the host system and the TTA cores. The SAM can be divided into the host architecture and the TTA co-processing architecture parts, as shown in Fig. 3 (below, host in italics refers to the host architecture). In TTADF, the host architecture is required to have a) memory-mapped access to the co-processing architecture, and b) a global shared memory architecture, where all cores can access the same address space. In practice, the host can be connected to the shared memory through a memory interface such as AMBA (Advanced Microcontroller Bus Architecture), PCI-E (Peripheral Component Interconnect Express), Ethernet, etc.

The TTA co-processing system comprises all TTA cores, memory components and other components that are directly connected to them. The memory architecture of the TTA co-processing system resembles the hybrid memory architecture presented in [23]. In the hybrid memory architecture, each TTA core has its own private data memory and instruction memory. Inter-core communication is performed through shared memories that form a communication network between the cores. This kind of memory architecture is natural for dataflow programs, since the local data of actors can be stored in the private memory while incoming and outgoing tokens reside in shared memory. The memory organization also divides memory components into subcomponents, which reduces memory pressure, provides simultaneous R/W access and reduces power consumption compared to the global shared memory architecture used in many GPPs [23]. An example of a generated system can be seen as a block diagram in Fig. 6.

Fig. 6. System architecture model and actor mapping

The designer can set a core of the host architecture to be of type X86-64, ARM or ARM64, whereas in the TTA co-processing system all cores are of the type TTA. Additionally, the clock frequency and all memory connections are defined for each core. A core can be connected to a memory component directly or via a memory arbiter. The capacity of each memory component is defined in bytes. Memory arbiters can be used to connect multiple cores to one single-port memory.

For each TTA core, a TTA Architecture Definition File (ADF) needs to be provided. The ADF describes all resources of the TTA core instance, including FUs, RFs, etc. ProDe, a TCE tool, can be used to design the TTA cores and produce their ADFs for the co-processing system. For enabling RTL generation, an Implementation Definition File (IDF) also needs to be provided for each TTA. The IDF file defines which RTL description is used for each processor component.

Memory and arbiter instances are connected to dedicated LSUs of the TTA cores. Therefore, the ADF of each TTA instance has to include an LSU for each memory or arbiter instance that is connected to the TTA core. The TTADF framework automatically sets the address spaces for these LSUs.

The designer can also define a SAM that does not include any TTA cores at all. An example of this is presented later in Section 5, where the ARM-based embedded Odroid XU3 platform is used to execute TTADF applications.

4.5 FIFO communication channels

In this work, FIFO channels are implemented using a lock-free circular buffer structure that is placed into addressable memory. Each FIFO must be connected to exactly one input port and exactly one output port of an actor. Each FIFO has a user-defined token size and capacity, the latter defining the maximum number of tokens the FIFO can hold. The token size can be freely chosen with an accuracy of one byte.

For synchronizing FIFO-read and FIFO-write transactions, the writing actor has write access to the FIFO write pointer, and the reading actor has write access to the FIFO read pointer. Both the reading actor and the writing actor have read access to both pointers, which enables querying (peeking) the FIFO state.

Each FIFO operation is started by requesting a pointer to the token buffer for reading or writing. If the requested FIFO operation is feasible (the FIFO has room / the FIFO has tokens), a pointer to the token is returned. An actor can read data from one port, process it and simultaneously write the processed data directly to another port without additional data copying. After all needed data has been read or written, the FIFO access is ended.
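This is the classic single-producer/single-consumer ring buffer scheme: because each pointer has exactly one writer, no locks are needed. A minimal C sketch of the idea follows; the names and the one-slot-reserved full/empty convention are illustrative and do not claim to reproduce TTADF's actual implementation:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        volatile uint32_t rd;  /* read index, written only by the reader  */
        volatile uint32_t wr;  /* write index, written only by the writer */
        uint32_t capacity;     /* number of token slots                   */
        uint32_t token_size;   /* token size in bytes                     */
        uint8_t *buf;          /* capacity * token_size bytes             */
    } fifo_t;

    /* Writer: request a pointer to a free token slot; NULL if full. */
    void *fifo_write_start(fifo_t *f) {
        uint32_t next = (f->wr + 1) % f->capacity;
        if (next == f->rd)
            return NULL;                 /* full (one slot kept free) */
        return f->buf + (size_t)f->wr * f->token_size;
    }

    /* Writer: end the access, publishing the token to the reader. */
    void fifo_write_end(fifo_t *f) {
        f->wr = (f->wr + 1) % f->capacity;
    }

    /* Reader: request a pointer to the oldest token; NULL if empty. */
    void *fifo_read_start(fifo_t *f) {
        if (f->rd == f->wr)
            return NULL;                 /* empty */
        return f->buf + (size_t)f->rd * f->token_size;
    }

    /* Reader: end the access, releasing the slot back to the writer. */
    void fifo_read_end(fifo_t *f) {
        f->rd = (f->rd + 1) % f->capacity;
    }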
Choosing a suitable FIFO buffer size is essential for reaching high performance. The buffer capacity should be two or more tokens, since single buffering prevents simultaneous execution of the producer and consumer actors. With the set of applications used to test TTADF, no significant performance increase was observed when increasing the FIFO capacity beyond three tokens.

4.5.1 Blocking and non-blocking communication

The proposed framework supports blocking FIFO communication channels, where actor execution is halted until the required number of tokens for reading, or suitable space for writing, is available in the FIFO. The implementation of blocking communication is not straightforward, because when multiple actors have to be executed on the same core, there has to be a mechanism for transferring execution from a blocked actor to another actor to prevent system deadlock.

Since TCE does not offer support for preemptive multitasking, the proposed framework addresses the problem by using protothreads (PT) [31], a non-preemptive multitasking concept that does not rely on context switches. The idea of using protothreads in graph-based processing has previously been presented in [24].

The proposed framework implements actor firing functions so that blocking FIFO operations are labeled and can act as entry points of the function. The actor state holds the current entry point, and when the actor is fired, execution jumps to the current entry point. If a FIFO operation blocks, the FIFO operation in question is stored as the entry point.
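The mechanism can be pictured with the classic protothreads trick from [31]: the firing function is wrapped in a switch statement over a stored entry point, so a blocked FIFO operation records its own location and returns to the scheduler, and the next firing resumes exactly there. The code below paraphrases that idea (reusing the fifo_t sketch above); it is an illustration, not the TTADF runtime:

    /* Local-continuation macros in the style of protothreads [31]. */
    #define PT_BEGIN(st)      switch ((st)->entry) { case 0:
    #define PT_WAIT_UNTIL(st, cond)                          \
        do {                                                 \
            (st)->entry = __LINE__; case __LINE__:           \
            if (!(cond))                                     \
                return 0;  /* blocked: yield to scheduler */ \
        } while (0)
    #define PT_END(st)        } (st)->entry = 0; return 1;

    typedef struct { int entry; /* ... other actor state ... */ } actor_state_t;

    extern fifo_t in_fifo, out_fifo;  /* FIFOs of this actor */

    int fire_actor(actor_state_t *st) {
        PT_BEGIN(st);
        /* Blocking read: if no token is available, this location is
           stored as the entry point and the function returns. */
        PT_WAIT_UNTIL(st, fifo_read_start(&in_fifo) != NULL);
        /* ... process the token ... */
        /* Blocking write: likewise an entry point. */
        PT_WAIT_UNTIL(st, fifo_write_start(&out_fifo) != NULL);
        /* ... produce the output token, then end both accesses ... */
        fifo_write_end(&out_fifo);
        fifo_read_end(&in_fifo);
        PT_END(st);
    }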
In some cases, blocking communication can be used to reduce the overhead caused by checking actor firing rules. If the firing rules are independent, it is not necessary to check all rules; execution can start directly with the blocking one [32]. As TTA processors generally do not have branch predictors, this is a considerable advantage.

The framework also supports non-blocking communication, where actor firing continues even when a FIFO cannot provide enough tokens or free space. Non-blocking communication is especially needed for supporting the DPN MoC.

4.5.2 Hardware accelerated FIFO operations

Low-level FIFO access operations are well suited for acceleration by custom instructions. The default TTADF FIFO access SFU includes two custom operations:

• get_population, for calculating the token population of a FIFO.
• update_fifo_pointer, an operation for finishing a FIFO access.

TTADF detects if a TTA core is equipped with the FIFO access SFU that implements these hardware-accelerated FIFO operations, and automatically makes use of the custom operations. The speedup advantage of the FIFO operation SFU is case dependent; however, use of the FIFO SFU can significantly improve code density in cases where the compiler would otherwise inline a high number of software-based FIFO access functions.

4.6 Mapping and scheduling

In the TTADF framework, the actor mapping file defines for each actor which core takes responsibility for the execution of that actor. The actor mapping influences the actions of the TTADF compiler, and therefore actor mapping is static and unmodifiable at runtime. Based on the actor-to-core mapping, the TTADF compiler automatically maps the FIFOs of the actor network to the memory components. If connected actors are assigned to different cores, their FIFO is mapped to a shared memory component. If two (or more) connected actors are mapped to the same TTA core, the TTADF compiler maps the FIFO to the private data memory of that core.

For each core, a dedicated actor firing scheduler is created. The actor firing scheduler handles all actors that are assigned to the core in question. Currently, actor firings are scheduled in a round-robin fashion: each actor attempts to fire and, regardless of success (token availability / free FIFO space), the scheduler proceeds to the next actor until an actor triggers a stop condition. Michalska et al. [33] show that despite its simplicity, round-robin scheduling is a very efficient scheduling methodology for TTA processors, whereas more complex scheduling methodologies provide only minimal improvements or can even degrade performance.
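In outline, the generated per-core runtime then behaves as sketched below (hypothetical names; a paraphrase of the described behavior, not code emitted by TTADF):

    /* Per-core round-robin actor firing scheduler. */
    typedef struct {
        void *state;             /* ACTORSTATE instance              */
        void (*init)(void *);    /* INIT function                    */
        int  (*fire)(void *);    /* FIRE; returns at once if blocked */
        void (*finish)(void *);  /* FINISH function                  */
    } actor_t;

    extern actor_t *core_actors[];      /* actors mapped to this core  */
    extern int core_actor_count;
    extern volatile int ttadf_stopped;  /* set when TTADF_STOP() fires */

    void core_runtime(void) {
        for (int i = 0; i < core_actor_count; i++)
            core_actors[i]->init(core_actors[i]->state);

        while (!ttadf_stopped) {
            /* One round: try to fire every actor once; a blocked actor
               simply yields (see the protothread sketch above). */
            for (int i = 0; i < core_actor_count; i++)
                core_actors[i]->fire(core_actors[i]->state);
        }

        for (int i = 0; i < core_actor_count; i++)
            core_actors[i]->finish(core_actors[i]->state);
    }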
4.7 The TTADF Compiler

The TTADF Compiler is the main component of the framework, and it is written in the Python programming language. The TTADF Compiler deserializes the actor network, actor mapping and system model files, and constructs a unified object representation, which is used to generate C software code for the cores and hardware-synthesizable VHDL code of the TTA co-processing system, including the SystemC testbench. One of the main tasks of the compiler is to translate the behavior models of the actors into plain C code so that they can be compiled to binary code for the target processor: the compiler processes the ACTORSTATE, INIT, FIRE, and FINISH elements and the TTADF API calls, translating them to the unified object representation. For each core in the system, the TTADF compiler generates runtime code, which is responsible for the initialization, scheduling and cleanup of the actors mapped to the core in question.

4.7.1 Simulation and testing

The TTADF Compiler can generate three different simulation models with different accuracy and speed tradeoffs:

• C/C++ simulation – TTA cores run on top of a cycle-accurate simulator, sharing a memory with GPP core threads that run on the host computer. This model assumes that memory instances are ideal in the sense that a core can always access them in constant time, and it is the fastest simulation model of the framework. This model is targeted towards performance evaluation in cases where only a few TTA cores are connected to the same shared memory.
• SystemC simulation – system-level simulation of the design. TTA cores are instantiated as a SystemC model, and host GPP cores run as threads on the host machine of the framework. Memory and arbiter instances have generic cycle-accurate SystemC simulation models. This simulation model is primarily intended for cases where the designer needs to know accurately how simultaneous memory accesses influence the TTA co-processor performance.
• HDL simulation – a SystemC testbench where the TTA co-processing system is instantiated at the RTL level. The simulation needs mixed-language support from the HDL tools. The testbench can be used for RTL, gate-level and post-layout verification, and for power estimation of the TTA co-processing system.

4.7.2 Actor mapping in testing and analysis

The TTADF feature that allows assigning actors to the host cores, and simulating these actors rapidly on the host system, can be exploited in many ways in testing. At the beginning of application development, it is useful to map all actors to the host machine, since this allows application debugging using conventional C/C++ debuggers. When the application prototype works on the host machine, the designer can proceed by mapping actors to TTA cores. Finally, tailoring a TTA processor for a specific actor is an iterative process, where the designer modifies the processor and profiles actor performance repeatedly. The possibility of mapping actors to the host accelerates testing, as these actors do not need to be run on slow cycle-accurate simulators. Finally, the ease of actor-to-host mapping also speeds up design space exploration: discovering the best execution core type for each actor can be performed merely by changing the mapping, without code modifications.

5 EXPERIMENTS

In this section, three test case workloads from different application domains are presented. After that, the TTA co-processing architectures used to execute the workloads are introduced. Finally, the experimental results are presented.

5.1 HEVC inloop filtering

The High Efficiency Video Coding standard [34] introduces two inloop filters for reducing the coding artifacts caused by image transforms and quantization. These filters are the deblocking filter (DF) [35] and the sample adaptive offset filter (SAO) [36]. The dataflow description of the inloop filters is based on the authors' prior work [37], and it consists of five actors in the simplest case, where only one tile is used. As shown in Fig. 7, the DF filter is divided into two actors: vertical edge filtering (DF-VE) and horizontal edge filtering (DF-HE). SAO is performed in the SAO actor after the DF-VE and DF-HE filters.

Fig. 7. HEVC Inloop filtering dataflow application

In the case where the video is coded using multiple tiles and filtering over tile boundaries is not allowed, the filtering pipeline can be parallelized up to the number of tiles. In the experiments, one, two and four tiles are used for a video size of 1920 × 1080 pixels. The actor network processes the video on a coding tree block (CTB) basis, and the token sizes of the FIFOs are selected based on the size of the CTB. In the case of a 64 × 64 CTB, the token size is about 5 kB, depending on the needed coding parameters, which are explained in detail in [37]. The HEVC Inloop Filtering application uses TTA special function units, which is not the case for the other test applications.

5.2 Stereo depth estimation

The dataflow implementation of Stereo Depth Estimation (SDE) is based on the open-source computer vision (OpenCV) implementation of block matching for camera calibration and 3D reconstruction. The dataflow implementation of SDE includes five different actors for image reading (imageread), image grayscaling (grayscale), Sobel filtering (sobel), Sum of Absolute Differences (SAD) [38] calculation, and image writing (imagewrite). As shown in Fig. 8, image reading, grayscaling and Sobel filtering are performed for both the left and right images before the SAD is calculated between the images to determine the depth map.

Fig. 8. SAD based depth estimation dataflow application network and a resulting depth map image

The SDE dataflow application processes the images line-by-line or multiple lines at a time, depending on the available memory. Because of memory constraints, all experiments presented in this paper use line-by-line processing. In the experiments, stereo images with a size of 450 × 375 pixels are used. The search window size is set to 9, and the maximum disparity is limited to 64 pixels. The token size of the FIFOs is set to be the same as the input image width, and the FIFO capacity is defined as one.
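For reference, the SAD matching cost used in this kind of block matching has the standard form (a textbook formulation; the OpenCV-derived implementation adds refinements such as the uniqueness thresholding mentioned in Section 6): for a pixel (x, y), a candidate disparity d and a 9 × 9 window W,

\[ \mathrm{SAD}(x, y, d) = \sum_{(i,j) \in W} \bigl| I_L(x+i,\, y+j) - I_R(x+i-d,\, y+j) \bigr| , \]

and the selected disparity is the cost minimizer

\[ d^{*}(x, y) = \arg\min_{0 \le d \le 64} \mathrm{SAD}(x, y, d) , \]

where I_L and I_R are the Sobel-filtered grayscale left and right images. This cost is what the SAD actor of Fig. 8 evaluates for every pixel of an image line.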
5.3 Dynamic predistortion filter

The Dynamic Predistortion Filter (DPD) is a wireless communications application tailored to suppress the most harmful spurious emissions at the output of the mobile transmitter power amplifier [39]. The DPD dataflow application mainly consists of parallel 10-tap complex-valued FIR filters, which are implemented using fixed-point arithmetic. The configure actor controls at runtime which set of FIR filters is used for processing the input signal, by notifying the poly and the add actors. There can be two to ten active filters at each time instant, depending on the adaptive runtime configuration. Since an external input controls the configuration, the network behavior is truly dynamic. The dataflow network of the DPD is presented in Fig. 9.

Fig. 9. Dataflow network of the dynamic predistortion filter application

The fixed-point DPD dataflow network has been designed for TTADF from the floating-point version of the DPD network presented with PRUNE [11]. In the experiments, the token size is set to 256 bytes, which equals 64 32-bit integer numbers. The FIFO capacity is set to two to enable double buffering. The DPD is a dynamic dataflow application, containing actors that have dynamic token rates, whereas the other test applications use only fixed token rate actors.
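Each FIR actor implements the standard direct form (shown here for orientation; the paper does not spell out the equation): with complex-valued coefficients h_0, ..., h_9 and a complex input sequence x[n],

\[ y[n] = \sum_{k=0}^{9} h_k \, x[n-k] , \]

where each complex multiply-accumulate expands to four real fixed-point multiplications and two additions, since the signal's I and Q components are carried as fixed-point integers (the I+Q edges of Fig. 9).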
5.4 TTA co-processing system architectures

For the experiments, eight different TTA co-processing system configurations are defined. The configurations can be divided into two categories based on the TTA processor core architecture that they use. The first category of co-processing systems is referred to as Inloop, and consists of TTA cores tailored for HEVC inloop filtering. This core type is presented in detail in [37]². In the second category, referred to as Shared, the co-processing system is based on one of the predefined TTA core architectures from [23], where it is called custom². Both categories include configurations with core counts of 1, 3, 6 and 12 TTA cores. From here on, a specific configuration is referred to by the name of the category followed by the number of cores, such as Inloop3, for example.

2. For this work, the processor endianness has been changed from big-endian to little-endian and the SFU for FIFO operations (Section 4.5.2) has been added.

Inloop1 and Shared1 are simple system configurations, where only a single TTA core is connected to the host processor using a shared SRAM memory. Shared3, Shared6, Shared12, Inloop6 and Inloop12 consist of clusters of three TTA cores. Each cluster is connected to the host via a shared SRAM, which is also used for inter-cluster communication. Each TTA core is connected to the shared memory through a memory arbiter. In Fig. 10, the four-cluster Shared12 TTA co-processing system is presented.

Fig. 10. The four-cluster triple-TTA core architecture (Shared12)

Inloop3, presented in Fig. 11, is a special case that was originally tailored for HEVC inloop filtering [40]. Its TTA cores are connected using shared SRAM memories so that a three-stage pipeline is formed. Since the host is connected to the SRAMs at both ends of the pipeline, the architecture follows a ring topology, where data can be moved in both directions.

Fig. 11. The pipelined triple-core Inloop TTA architecture (Inloop3)

In [40], the authors show that the Inloop TTA co-processing system can achieve a 1.2 GHz clock frequency (1.0 V operating voltage) when placed and routed using a 28 nm standard cell technology. Based on that result, a clock frequency of 1.2 GHz is assumed in all Inloop system configurations in the experiments.

In all system configurations, each core has a 32 kB private data memory and a 64 kB instruction memory. Shared memory blocks are of size 64 kB, except for Inloop3, which has four 16 kB shared memory blocks. All test case dataflow applications are configured to fit these tight memory constraints. The size of the memory blocks has been minimized to keep the on-chip SRAM size realistic and to improve power efficiency.

The authors of this paper have already shown in [40] that the power consumption of a TTA co-processing system such as Inloop3 is between 66 mW and 207 mW when the clock frequency ranges between 530 MHz and 1.2 GHz. To get power estimates for the Shared system configurations, Shared3 was placed and routed using a 28 nm standard cell library, which yielded a power consumption of 154 mW at a 1.0 V operating voltage and a 1.0 GHz clock frequency. In the experiments, a 1.0 GHz clock is assumed in all Shared system configurations.
TABLE 1
Comparison of Inloop and Shared TTA cores

Processor            Inloop [37]   Shared [23]
Bitwidth             32            32
ALUs                 0             2
Adders               3             0
Relational ops FUs   1             0
Logic ops FUs        2             0
Bitwise ops FUs      2             0
Multipliers          1             1
LSUs                 3             2
Int RFs              5×16          3×12
Bool RFs (1 bit)     1×2           1×2
Special ops          9             0
Buses                5             6
Connectivity         Partial       Full
Instruction width    131           258
Gate count (NAND2)   106K [40]     66K

5.5 Mappings

In the case of the HEVC Inloop application, the Source and Sink actors are always mapped to the host processor. For the architectures Inloop1 and Shared1, the actors DF-VE, DF-HE and SAO are mapped to the same TTA core. In the triple-TTA core cases (Inloop3 and Shared3), the actors are mapped in a pipelined manner: DF-VE to core 0, DF-HE to core 1 and SAO to core 2. In the cases of two and four triple-TTA clusters, the actor pipeline is replicated so that the input video is divided into 2 or 4 tiles, respectively (see Fig. 7).

The SDE application has two Imageread actors and an Imagewrite actor, which are mapped to the host processor in all cases. In the single TTA core cases, the input images are processed by two Grayscale actors, two Sobel actors and one SAD actor, which are all mapped to the same core. In the multicore cases, the input images are divided into blocks so that the block count matches the number of TTA cores (see Fig. 8), and each core gets the same (but replicated) group of actors as in the single core case.

When considering Fig. 10 or Fig. 11, it can be seen that the TTA co-processing system configurations are not fully connected, meaning that there is no direct connection between all cores. Therefore, there is a set of actor-to-core mappings that are not feasible (e.g., in the case of connected actors that are mapped to different clusters). However, it is possible to reduce the set of unfeasible mappings by using data repeater actors.

Data repeater actors can be used when there is an indirect connection between processors through other processors. Data repeater actors are added to the actor network and mapped to processors solely for enabling communication between specific actors. For example, if actor A communicates with actor B, and actor A is assigned to TTA core 0 and B to TTA core 2 in Inloop3, there is no way to transfer data directly from TTA core 0 to TTA core 2. Because of this, the actor network is changed so that a new data repeater actor DR is placed between A and B, and it is mapped to TTA core 1.
DR actors are needed in the DPD application when triple-core TTA clusters are used. The actor-to-core mappings of the Inloop3 and Shared3 cases are shown in Fig. 12.

Fig. 12. The DPD application mapped to the triple-core TTA using data repeater (DR) actors

5.6 Results

The performance and power consumption results of the different TTA co-processing architectures are presented in Table 2. The place and route results of Inloop3 and Shared3 are used to estimate the power figures for the other Inloop and Shared configurations. Fig. 13, Fig. 14 and Fig. 15 show the energy efficiency of the architectures for all test applications.

TABLE 2
Performance and power measurements

             HEVC Inloop          SDE                  DPD
Platform     Perf.     Power     Perf.      Power     Perf.    Power
             (fps)     (W)       (Mde/s)    (W)       (MS/s)   (W)
Inloop1      60.102    0.068     14.074     0.068     1.816    0.068
Inloop3      148.355   0.211     42.273     0.211     4.280    0.211
Inloop6      271.291   0.422     86.977     0.422     7.458    0.422
Inloop12     536.237   0.843     143.819    0.843     9.031    0.843
Shared1      11.657    0.050     7.314      0.050     1.409    0.050
Shared3      24.013    0.154     21.344     0.154     3.197    0.154
Shared6      49.156    0.309     42.963     0.309     5.252    0.309
Shared12     93.103    0.617     85.783     0.617     7.362    0.617
Odroid1      19.187    2.618     26.836     2.752     3.650    2.702
Odroid2      32.783    4.447     51.933     4.253     6.886    4.042
Odroid3      33.378    5.590     71.349     5.651     9.777    7.119
Odroid4      49.152    5.101     82.692     5.862     6.986    5.866
Odroid6      33.609    5.843     47.786     6.156     5.276    6.112
Odroid8      25.650    5.693     63.920     5.643     6.039    6.041
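The power columns of Table 2 scale essentially linearly with the core count from the placed-and-routed three-core results, i.e. P(n) ≈ (n/3) · P_3 (for instance, Inloop: 4 × 0.211 W ≈ 0.843 W for twelve cores). The energy-efficiency figures plotted in Figs. 13–15 then follow directly as energy per unit of work,

\[ E = \frac{P}{\text{throughput}} , \]

e.g., for Inloop12 on HEVC inloop filtering: 0.843 W / 536.237 fps ≈ 1.57 mJ per Full HD frame, which matches the value reported later in Table 3.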
As explained in Section 4, TTADF enables executing applications on the host cores only. To demonstrate this possibility, performance and power consumption results for the Odroid XU3 platform have been included in the results. The Odroid XU3 is powered by the mobile Samsung Exynos 5422 SoC, which includes four ARM Cortex-A15 cores @ 2.0 GHz and four Cortex-A7 cores @ 1.4 GHz, and utilizes the ARM big.LITTLE heterogeneous multi-processing solution. The dataflow applications are compiled for the Odroid using GCC 5.4 with optimization level 3. The operating system of the platform was Ubuntu Mate 16.04. The power figures of the Odroid XU3 platform contain only the power consumption of the processor cores, measured by the current/voltage sensors integrated into the platform.

Fig. 13. Energy efficiency of the HEVC Inloop Filter application
Fig. 14. Energy efficiency of the Stereo Depth Estimation application
Fig. 15. Energy efficiency of the Dynamic Predistortion Filter application

Table 2 and Fig. 13 show the performance and energy efficiency results for the HEVC inloop filtering workload. Thanks to the special function units, the Inloop configurations show superior performance and energy efficiency when compared to the Odroid XU3 platform or the Shared configurations. For example, in the three-core case, Inloop can filter 148 frames per second, whereas the Odroid XU3 can filter only 49 frames per second at best. Since HEVC inloop filtering can easily be parallelized using tiles, the speedups are about 2×, 4× or 8× when the number of cores is increased from 1 to 3, 6 or 12 in the Inloop and Shared configurations.

In stereo depth estimation, the performance advantage of the Inloop-based TTA co-processing system over the Shared configurations is narrow, but still notable. This was expected, since the special instructions of the Inloop cores cannot be utilized in SDE. On the other hand, when the same instruction memory size (64 kB) is used in all TTA cores, more aggressive loop unrolling can be used in the case of the Inloop core, its instruction width (131 b) being significantly narrower than that of the Shared core (258 b). When the FIFO SFU is exploited, instruction memory requirements decrease by up to 55% per core.

The dataflow implementation of SDE can easily be parallelized by processing multiple rows at the same time. The parallelization enables almost linear speedups when increasing the core count from a single core to 3, 6, or 12 in the Inloop (3.0×, 6.1× and 10.2×) and Shared (2.9×, 5.8× and 11.7×) configurations. In the single core cases, where all actors are mapped to the same core, actor scheduling overhead and conservative loop unrolling (due to program memory limits) decrease single-core throughput, with the consequence that multicore speedups can even become superlinear. The Odroid platform can utilize four cores efficiently, but increasing the number of processing cores beyond that has only a negative impact. In the best case, the Odroid can compute 82 Mde/s.

The Dynamic Predistortion Filter performance results are presented in Table 2 and the energy efficiency figures in Fig. 15. The Inloop-based TTA co-processing architectures outpace the Shared architectures by a small margin regardless of the number of cores. The Inloop configurations can filter from 1.8 up to 9.0 megasamples per second (MS/s), depending on the number of cores. Both Inloop and Shared show considerable speedup when the core count is 3 or 6, but in the case of 12 cores, the DPD application structure limits the achievable speedup. This is caused by the fact that the workloads of the cores vary due to the dynamic application behavior: FIR filters can be switched on and off dynamically.

The workload of the DPD application is more suitable for the Odroid XU3 than the other test applications. The best result for the Odroid is 9.8 MS/s when using three cores. This is more than double compared to the Inloop3 or Shared3 configurations. However, increasing the number of cores beyond three does not give any performance advantage. As with other workloads, the TTA-based platforms show over 10× higher energy efficiency than the Odroid.
In the case of Inloop3, TTADF incurs only a 3% overhead, which corresponds to five frames per second.

Nyländen et al. [41] implemented a highly optimized OpenCL-based SAD depth estimation algorithm for a tailored data-parallel SIMD TTA accelerator. In their work, 16-bit floating-point arithmetic is used instead of 32-bit integer arithmetic to decrease the memory requirements and to achieve better energy efficiency at the cost of slightly reduced image quality. Compared to the SDE application presented in this work, their algorithm does not include Sobel filtering or uniqueness thresholding. Their implementation runs on a single tailored SIMD TTA core clocked at 800 MHz and computes 117 Mde/s, whereas Inloop1 can compute only 14.9 Mde/s, about eight times less. Inloop12 can compute 30% more Mde/s than the single SIMD TTA core, which shows that the TTADF SDE implementation is scalable. Because the Inloop architecture is not optimized for the SDE workload, the moderate performance results are not a surprise. On the other hand, Inloop3 can compete with the general-purpose Intel Core i5-480M mobile processor, while Inloop6 reaches the performance of the Qualcomm Adreno 330 mobile GPU. Surprisingly, the Odroid XU3 running the TTADF implementation of SDE performs better than the OpenCL SDE implementation running on the Intel Core i5-480M.

Finally, the DPD application offers a fair comparison between the proposed TTADF framework and its closest competitor, the Orcc TTA Backend [23]. The DPD application was written in the RVC-CAL language, and a multicore TTA implementation was generated for the application using the Orcc TTA Backend. Very similar TTA processor cores were used, as our Shared core is essentially the same as the Custom core available in the Orcc TTA Backend. The performance results show that the Orcc TTA Backend based implementation produces 2.1, 2.9 and 4.0 MS/s for 3, 6 and 12 Custom TTA cores, respectively. In comparison, the DPD implementations produced by TTADF using the Shared TTA configuration were on average 2x faster. Also, performance scaling as a function of the core count was slightly better for TTADF: Shared6 and Shared12 were 1.9x and 2.3x faster than Shared3, while the corresponding speedups were 1.4x and 1.6x for the Orcc TTA Backend.

7 DISCUSSION

The TTADF and Orcc TTA Backend frameworks serve essentially the same purpose, but their design flows differ substantially. In TTADF, the designer separately specifies the system architecture (TTA core definition, core connections and host connections), after which the dataflow application is mapped onto the architecture. In contrast, in the Orcc TTA Backend the system architecture (TTA core interconnections) is derived from the dataflow application. From this viewpoint, TTADF can be considered more generic. On the other hand, the Orcc TTA Backend provides more automation thanks to its automatic interconnect generation and actor mapping features. In addition, Orcc offers high-level dataflow analysis features that are currently not available in TTADF. TTADF also makes it possible to map actors to the host processor; in simulations, these actors are executed on the host system of the framework. Testing an individual actor on a particular TTA core is easy and fast, since the test data for the actor is created at runtime by the other actors of the application.
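To make the difference between the two flows concrete, the sketch below shows what a separate TTADF-style architecture-and-mapping specification could look like for a three-core inloop filter system in the spirit of Fig. 12. It is purely illustrative: the element and attribute names are invented here, as the actual TTADF XML schema is not reproduced in this paper.

    <!-- Hypothetical sketch of a TTADF-style system description; the real
         schema may differ. Three identical TTA cores, the FIFO channels
         between them, a host connection, and an explicit actor mapping. -->
    <system name="inloop3">
      <core id="core0" definition="inloop_tta.adf"/>
      <core id="core1" definition="inloop_tta.adf"/>
      <core id="core2" definition="inloop_tta.adf"/>

      <fifo name="ch0" from="core0" to="core1" depth="4"/>
      <fifo name="ch1" from="core1" to="core2" depth="4"/>
      <host connection="core0"/>

      <mapping> <!-- chosen by the designer, not derived from the graph -->
        <actor name="deblocking" core="core0"/>
        <actor name="sao"        core="core1"/>
        <actor name="sink"       core="core2"/>
        <actor name="source"     core="host"/> <!-- host-mapped actor -->
      </mapping>
    </system>

In the Orcc TTA Backend, by contrast, the interconnect corresponding to the fifo elements above is derived automatically from the dataflow program, which explains both its higher degree of automation and its lower flexibility.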
A previous paper [40] by the authors of the present work presented a 3-core TTA accelerator for HEVC inloop filtering with manually optimized C code, achieving a processing performance of 153 HD frames/s at 207 mW. Since TTADF has now been measured to achieve a throughput of 148 HD frames/s with practically the same processor core, but generated from a generic dataflow-based design framework, it is justified to state that TTADF offers a way to raise the abstraction level of multicore co-design with a negligible impact on performance. Additionally, the experimental results suggest that TTADF outperforms the current state of the art, the Orcc TTA Backend, by a clear margin in terms of performance. Especially the possibility of exploiting special function units gives TTADF a substantial competitive advantage over the Orcc TTA Backend.

In the future, several new features can be added to TTADF: automatic mapping of actors to cores and automatic creation of data repeater actors. On the hardware side, power savings could be achieved by observing FIFO fill counts. There are also plans to directly support SoC FPGAs, so that the TTA co-processing system is placed on the programmable logic and the host-mapped actors are executed on hard processor cores.

TABLE 3
Comparison of different programmable implementations of the test case applications

Application                   Architecture                  Framework  Tech. (nm)  Clk (MHz)  Performance      Power (mW)  Energy eff. (mJ/perf. unit)
HEVC Inloop                   [23] Fast TTA x 12            Orcc       40          1000       14.67 fps¹       -           -
                              Shared12                      TTADF      28          1200       93.1 fps         617         6.63
                              Inloop12                      TTADF      28          1200       536 fps          843²        1.57
                              [40] Inloop TTA x 3           None       28          1200       153 fps          207         1.35
                              Inloop3                       TTADF      28          1200       148 fps          211²        1.43
                              Shared3                       TTADF      28          1000       24.0 fps         154         6.42
Stereo Depth Estimation       [41] OpenCL SIMD TTA          OpenCL     28          800        117 Mde/s³       33          0.28
                              [41] Intel Core i5-480M       OpenCL     32          2600       30.3 Mde/s³      35000       1155
                              [41] Qualcomm Adreno 330      OpenCL     28          578        99.1 Mde/s³      1800        18.16
                              Inloop1                       TTADF      28          1200       14.9 Mde/s       69²         4.63
                              Inloop3                       TTADF      28          1200       44.8 Mde/s       211²        4.71
                              Inloop12                      TTADF      28          1200       152 Mde/s        843²        5.55
                              Odroid XU3                    TTADF      28          2000       82.1 Mde/s       5861        71.4
Dynamic Predistortion Filter  Shared3 (Orcc TTA Backend)    Orcc       28          1000       1.75 Msample/s   154         88.0
                              Shared6 (Orcc TTA Backend)    Orcc       28          1000       2.38 Msample/s   309         135.5
                              Shared12 (Orcc TTA Backend)   Orcc       28          1000       3.31 Msample/s   617         248.79
                              Shared6                       TTADF      28          1000       5.2 Msample/s    309         59.42
                              Inloop6                       TTADF      28          1200       7.46 Msample/s   422²        56.57
¹ Estimated assuming that the share of inloop filtering is 22% of the total HEVC decoding workload [23]
² Estimate based on [40]
³ 16-bit floating point, no Sobel filtering or uniqueness thresholding

8 CONCLUSION

In this paper, TTADF, a dataflow framework dedicated to transport triggered architectures, was presented. TTADF enables software synthesis of dynamic dataflow applications onto a TTA-based co-processing system, which can be co-designed together with the dataflow software. The dataflow descriptions of the applications are written using C and XML. The design flow achieves energy-efficient implementations by offering several design options:
• special (custom) operations invoked by calls from C code (see the sketch below),
• a hybrid memory architecture for reduced congestion and improved access times,
• hardware-accelerated FIFO access operations that save instruction memory.
TTADF supports three simulation approaches with different accuracy and speed trade-offs: C++, SystemC and mixed HDL simulations.
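As a concrete example of the first design option above, the fragment below sketches how a special operation could be invoked from actor C code. It assumes the TCE-style _TCE_<OPNAME> intrinsic macros of the underlying TTA co-design toolset [13]; the SAO_OFFSET operation name is invented for illustration and does not refer to an actual SFU of the Inloop core.

    /* Hedged sketch: calling a custom SFU operation from C. Assumes the
       TCE-style intrinsic macros (tceops.h) of the underlying TTA
       toolchain [13]; the SAO_OFFSET operation name is hypothetical. */
    #include "tceops.h"

    static inline int sao_offset(int sample, int band)
    {
        int result;
        /* Expands to a single transport-triggered SFU operation instead
           of a multi-instruction sequence on a general-purpose ALU. */
        _TCE_SAO_OFFSET(sample, band, result);
        return result;
    }

When the target core lacks the SFU, the same actor code can fall back to a plain-C implementation of the operation behind a preprocessor guard, so that one source still compiles for both the Inloop and Shared configurations.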
TTADF was evaluated using three applications from different fields: video coding, machine vision and wireless communications. The experimental results show that the energy efficiency of the TTADF-generated system falls within 3% of a manually designed baseline, and that the generated multiprocessing platform surpasses a commercial multicore by 10x in energy efficiency while providing similar or better performance.

ACKNOWLEDGMENTS

The work was partially funded by the Academy of Finland project 309693 UNICODE and the Tauno Tönning Foundation.

REFERENCES

[1] S. Bhattacharyya, R. Leupers, and P. Marwedel, “Software synthesis and code generation for signal processing systems,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 9, pp. 849–875, 2000.
[2] W. Wolf, A. A. Jerraya, and G. Martin, “Multiprocessor system-on-chip (MPSoC) technology,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701–1713, 2008.
[3] P. G. Paulin, C. Liem, M. Cornero, F. Nacabal, and G. Goossens, “Embedded software in real-time signal processing systems: application and architecture trends,” Proceedings of the IEEE, vol. 85, no. 3, pp. 419–435, 1997.
[4] D. Goodwin and D. Petkov, “Automatic generation of application specific processors,” in Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. ACM, 2003, pp. 137–147.
[5] G. Martin, “Overview of the MPSoC design challenge,” in 2006 43rd ACM/IEEE Design Automation Conference. IEEE, 2006, pp. 274–279.
[6] W. M. Johnston, J. Hanna, and R. J. Millar, “Advances in dataflow programming languages,” ACM Computing Surveys (CSUR), vol. 36, no. 1, pp. 1–34, 2004.
[7] J. Castrillon, R. Leupers, and G. Ascheid, “MAPS: Mapping Concurrent Dataflow Applications to Heterogeneous MPSoCs,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 527–545, 2013.
[8] H. Yviquel, J. Boutellier, M. Raulet, and E. Casseau, “Automated design of networks of transport-triggered architecture processors using dynamic dataflow programs,” Signal Processing: Image Communication, vol. 28, no. 10, pp. 1295–1302, 2013.
[9] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “High-level synthesis of dynamic dataflow programs on heterogeneous MPSoC platforms,” in 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE, 2016, pp. 227–234.
[10] M. Dardaillon, K. Marquet, T. Risset, J. Martin, and H.-P. Charles, “A new compilation flow for software-defined radio applications on heterogeneous MPSoCs,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 13, no. 2, p. 19, 2016.
[11] J. Boutellier, J. Wu, H. Huttunen, and S. S. Bhattacharyya, “PRUNE: Dynamic and Decidable Dataflow for Signal Processing on Heterogeneous Platforms,” IEEE Transactions on Signal Processing, vol. 66, no. 3, pp. 654–665, 2017.
[12] H. Corporaal, Microprocessor Architectures: From VLIW to TTA. New York, NY, USA: John Wiley & Sons, Inc., 1997.
[13] P. Jääskeläinen, V. Guzma, A. Cilio, T. Pitkänen, and J. Takala, “Codesign toolset for application-specific instruction set processors,” in Multimedia on Mobile Devices 2007, vol. 6507. International Society for Optics and Photonics, 2007.
[14] O. Esko, P. Jääskeläinen, P. Huerta, C. S. de La Lama, J. Takala, and J. I. Martinez, “Customized Exposed Datapath Soft-Core Design Flow with Compiler Support,” in Proceedings of the 2010 International Conference on Field Programmable Logic and Applications. Washington, DC, USA: IEEE Computer Society, 2010, pp. 217–222.
[15] J. B. Dennis, “First version of a data flow procedure language,” in Programming Symposium. Springer, 1974, pp. 362–376.
[16] E. A. Lee and D. G. Messerschmitt, “Synchronous data flow,” Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.
[17] E. A. Lee and T. M. Parks, “Dataflow process networks,” Proceedings of the IEEE, vol. 83, no. 5, pp. 773–801, 1995.
[18] H.-W. Park, H. Oh, and S. Ha, “Multiprocessor SoC design methods and tools,” IEEE Signal Processing Magazine, vol. 26, no. 6, 2009.
[19] J. Eker, J. W. Janneck, E. A. Lee, J. Liu, X. Liu, J. Ludvig, S. Neuendorffer, S. Sachs, and Y. Xiong, “Taming heterogeneity: the Ptolemy approach,” Proceedings of the IEEE, vol. 91, no. 1, pp. 127–144, 2003.
[20] H. Yviquel, A. Lorence, K. Jerbi, G. Cocherel, A. Sanchez, and M. Raulet, “Orcc: Multimedia development made easy,” in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 863–866.
[21] K. Jerbi, H. Yviquel, A. Sanchez, D. Renzi, D. De Saint Jorre, C. Alberti, M. Mattavelli, and M. Raulet, “On the Development and Optimization of HEVC Video Decoders Using High-Level Dataflow Modeling,” Journal of Signal Processing Systems, vol. 87, no. 1, pp. 127–138, 2017.
[22] M. Chavarrias, F. Pescador, M. J. Garrido, E. Juarez, and M. Raulet, “A DSP-Based HEVC decoder implementation using an actor language dataflow model,” IEEE Transactions on Consumer Electronics, vol. 59, no. 4, pp. 839–847, 2013.
[23] H. Yviquel, A. Sanchez, P. Jääskeläinen, J. Takala, M. Raulet, and E. Casseau, “Embedded multi-core systems dedicated to dynamic dataflow programs,” Journal of Signal Processing Systems, vol. 80, no. 1, pp. 121–136, 2015.
[24] L. Schor, I. Bacivarov, D. Rai, H. Yang, S.-H. Kang, and L. Thiele, “Scenario-based design flow for mapping streaming applications onto on-chip many-core systems,” in Proc. International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Oct 2012, pp. 71–80.
[25] L. Schor, A. Tretter, T. Scherer, and L. Thiele, “Exploiting the Parallelism of Heterogeneous Systems using Dataflow Graphs on Top of OpenCL,” in Proc. IEEE Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia). Montreal, Canada: IEEE, Oct 2013, pp. 41–50.
[26] M. Pelcat, K. Desnos, J. Heulot, C. Guy, J.-F. Nezan, and S. Aridhi, “Preesm: A dataflow-based rapid prototyping framework for simplifying multicore DSP programming,” in 2014 6th European Embedded Design in Education and Research Conference (EDERC), Sept 2014, pp. 36–40.
[27] J. Zhang, J.-F. Nezan, M. Pelcat, and J.-G. Cousin, “Real-time GPU-based local stereo matching method,” in 2013 Conference on Design and Architectures for Signal and Image Processing (DASIP). IEEE, 2013, pp. 209–214.
[28] J. Boutellier, O. Silvén, and M. Raulet, “Automatic synthesis of TTA processor networks from RVC-CAL dataflow programs,” in 2011 IEEE Workshop on Signal Processing Systems (SiPS). IEEE, 2011, pp. 25–30.
[29] C. Lattner and V. Adve, “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation,” in Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04), Palo Alto, California, Mar 2004.
[30] J. Boutellier and O. Silvén, “Towards generic embedded multiprocessing for RVC-CAL dataflow programs,” Journal of Signal Processing Systems, vol. 73, no. 2, pp. 137–142, 2013.
[31] A. Dunkels, O. Schmidt, T. Voigt, and M. Ali, “Protothreads: Simplifying Event-Driven Programming of Memory-Constrained Embedded Systems,” in Proceedings of the Fourth ACM Conference on Embedded Networked Sensor Systems (SenSys 2006), Boulder, Colorado, USA, Nov. 2006.
[32] A. Tretter, J. Boutellier, J. Guthrie, L. Schor, and L. Thiele, “Executing Dataflow Actors As Kahn Processes,” in Proceedings of the 12th International Conference on Embedded Software, ser. EMSOFT ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 105–114.
[33] M. Michalska, N. Zufferey, J. Boutellier, E. Bezati, and M. Mattavelli, “Efficient scheduling policies for dynamic dataflow programs executed on multi-core,” in 11th International Meeting on Logistics Research, 2016.
[34] G. J. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[35] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, “HEVC Deblocking Filter,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746–1754, Dec. 2012.
[36] C. M. Fu, E. Alshina, A. Alshin, Y. W. Huang, C. Y. Chen, C. Y. Tsai, C. W. Hsu, S. M. Lei, J. H. Park, and W. J. Han, “Sample Adaptive Offset in the HEVC Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755–1764, Dec. 2012.
[37] I. Hautala, J. Boutellier, J. Hannuksela, and O. Silvén, “Programmable low-power multicore coprocessor architecture for HEVC/H.265 in-loop filtering,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 7, pp. 1217–1230, 2015.
[38] K. Mühlmann, D. Maier, J. Hesser, and R. Männer, “Calculating dense disparity maps from color stereo images, an efficient implementation,” International Journal of Computer Vision, vol. 47, no. 1-3, pp. 79–88, 2002.
[39] M. Abdelaziz, A. Ghazi, L. Anttila, J. Boutellier, T. Lähteensuo, X. Lu, J. R. Cavallaro, S. S. Bhattacharyya, M. Juntti, and M. Valkama, “Mobile transmitter digital predistortion: Feasibility analysis, algorithms and design exploration,” in 2013 Asilomar Conference on Signals, Systems and Computers. IEEE, 2013, pp. 2046–2053.
[40] I. Hautala, J. Boutellier, and O. Silvén, “Programmable 28nm coprocessor for HEVC/H.265 in-loop filters,” in 2016 IEEE International Symposium on Circuits and Systems (ISCAS), May 2016, pp. 1570–1573.
[41] T. Nyländen, H. Kultala, I. Hautala, J. Boutellier, J. Hannuksela, and O. Silvén, “Programmable data parallel accelerator for mobile computer vision,” in 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2015, pp. 624–628.

Ilkka Hautala received his M.Sc. (Tech.) degree from the Department of Electrical and Information Engineering at the University of Oulu, Finland, in 2013. He is currently a doctoral student in the Center for Machine Vision Research at the University of Oulu. His research interests include low-power design, multicore processor architectures and video coding techniques.

Jani Boutellier received the M.Sc. and Ph.D. degrees from the University of Oulu, Finland, in 2005 and 2009, respectively.
Currently he is an Associate Professor at the School of Technology and Innovations, University of Vaasa, Finland. His research interests include dataflow programming, design and implementation of deep learning algorithms, and heterogeneous computing. He is a member of the IEEE Signal Processing Society DISPS Technical Committee.

Olli Silvén received the M.Sc. and Ph.D. degrees in electrical engineering from the University of Oulu, Finland, in 1982 and 1988, respectively. Since 1996, he has been a professor of signal processing engineering at the University of Oulu. His main research interests are in embedded signal processing and machine vision system design. He has contributed to the development of numerous solutions from real-time 3-D imaging in reverse vending machines to IP blocks for mobile video coding.