It is worth mentioning that the proposed energy model is only concerned with the computational workload of the CPU. Since the proposed technique does not consider memory DVFS, and core allocation does not significantly change memory access behavior, this model is sufficient to compare the relative energy consumption of the two control knobs.
In order to assess the effectiveness of frequency scaling, we define EFF_{s,f} and EFF_{p,f} as the predicted performance gains per unit energy increase in the sequential and parallel phases, respectively. The sum of Equations 7 and 8 then serves as an indicator of the effectiveness of frequency scaling, gain_f. We follow the same principle in quantifying the effectiveness of core assignment.
The only difference is that we do not need more than one core in sequential phases; i.e., adding cores brings no performance benefit there. On the other hand, it does increase the energy consumption in parallel phases. Again, adjusting the core allocation does not affect the sequential phase; i.e., only the parallel-phase term contributes to gain_c. Once we have both gain_f and gain_c, we can tell which knob is the more suitable reconfiguration policy: as shown in lines 6-10 of Algorithm 1, the one with the larger gain value is chosen as the reconfiguration policy for the next epoch.
The same principle applies to the case of reconfiguring the system to run slower (lines 11- of Algorithm 1). In such cases, the EFF values can be understood as performance loss per unit of energy saved (lines 12-13), and the option with the smaller value is adopted as the next configuration. That is, when decreasing the clock and releasing a core yield the same energy savings, the one with the smaller increase in execution time is chosen; conversely, when they tie in execution-time increase, the system is adjusted to the one with the larger energy savings.
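The knob-selection step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name and the way the EFF terms are passed in are assumptions, and the EFF values themselves are presumed to come from the performance and energy models of Equations 7 and 8.

```python
def choose_knob(eff_s_f, eff_p_f, eff_s_c, eff_p_c, speed_up):
    """Pick 'frequency' or 'cores' as the reconfiguration policy.

    eff_*_f / eff_*_c: predicted performance gain per unit energy increase
    for frequency scaling / core allocation in the sequential (s) and
    parallel (p) phases. When slowing down (speed_up=False), the same
    quantities read as performance loss per unit of energy saved, so the
    smaller total wins instead of the larger one.
    """
    gain_f = eff_s_f + eff_p_f   # frequency scaling affects both phases
    gain_c = eff_s_c + eff_p_c   # extra cores only matter in parallel phases
    if speed_up:
        return "frequency" if gain_f >= gain_c else "cores"
    return "frequency" if gain_f <= gain_c else "cores"
```

For example, with eff_s_f = 0.3, eff_p_f = 0.5 and a core-allocation gain of only 0.6 concentrated in the parallel phase, speeding up selects frequency scaling (0.8 vs. 0.6), while slowing down with the same numbers selects core release, since the smaller loss per unit of energy saved is preferred.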
While this study focuses on frequency scaling and core assignment, the proposed framework is not limited to these specific control knobs: it can be extended to consider other knobs once the performance and energy of the target system are properly modeled with respect to them. As another example, one may incorporate heterogeneous multi-cores into the reconfiguration by enhancing Equations 3 and 4.
The clock frequency of the processor scales from MHz to 2. Benchmarks: We took an image-processing application (Heart-Wall) from the Rodinia benchmark suite [29] and a particle-filter-based object tracking application as benchmarks, both of which exhibit per-frame workload variations. Image processing and object tracking are among the applications commonly used for high-end WSNs or sensory swarms [30,31].
Note that the workload characteristics of the two benchmarks differ: Heart-Wall is compute-intensive with relatively consistent processor utilization, while the memory access behavior of object tracking is quite nondeterministic due to the stochastic nature of the particle filter. We implemented the proposed technique described in Algorithm 1 by inserting the heartbeat APIs [10] at the beginning of the outermost loop of each benchmark to monitor runtime performance and perform the required adaptation.
Gradually increasing the core frequency, we measured the system power consumption and then built a simple prediction model of core power consumption using linear regression.
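The model-building step above can be sketched as follows. This is a minimal illustration under stated assumptions: the frequency sweep points and power readings are made-up placeholders, not measurements from the paper, and the linear form P = a·f + b is only one plausible choice for the regression.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    return a, mean_y - a * mean_x

# Hypothetical frequency sweep (MHz) and measured system power (W).
freqs_mhz = [600, 1000, 1400, 1800, 2200]
power_w   = [1.1, 1.6, 2.2, 2.7, 3.3]

a, b = fit_linear(freqs_mhz, power_w)

def predict_power(freq_mhz):
    """Predicted core power (W) at the given clock frequency."""
    return a * freq_mhz + b
```

Once fitted offline, such a model lets the runtime estimate the energy cost of a candidate frequency without re-measuring power at every epoch.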
We compare the proposed approach with a state-of-the-art technique [16] in which adaptation is done by exhaustively searching for energy-optimal configurations combining frequency scaling and core allocation. In [16], the design space is pruned beforehand, using the notion of distance, to reduce the computational overhead of the search. In particular, the distance between two configurations is defined as the total disparity of their control-knob settings, i.e., core allocation and frequency scaling.
Following this approach, we consider all configurations within a distance of 8 from the current configuration during adaptation. This method is referred to as Exhaustive hereafter. We also take the default Linux scheduler with the high-performance governor as Baseline. Figure 3 and Figure 4 compare the three approaches in terms of workload adaptation and the corresponding energy consumption over time.
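The distance-bounded candidate enumeration used by Exhaustive can be sketched as follows, based on the description of [16]. The frequency-step table and core counts are illustrative assumptions, as is measuring frequency disparity in table-index steps; the paper does not give the exact encoding.

```python
# Hypothetical platform: 10 ordered frequency steps and up to 4 cores.
FREQ_STEPS = list(range(10))   # indices into an ordered frequency table
CORE_COUNTS = [1, 2, 3, 4]

def distance(cfg_a, cfg_b):
    """Total disparity of the two control knobs between configurations."""
    (f_a, c_a), (f_b, c_b) = cfg_a, cfg_b
    return abs(f_a - f_b) + abs(c_a - c_b)

def candidates(current, max_dist=8):
    """All configurations within max_dist of the current one (excluding it)."""
    return [(f, c) for f in FREQ_STEPS for c in CORE_COUNTS
            if 0 < distance(current, (f, c)) <= max_dist]
```

Exhaustive then evaluates its performance/energy model on every configuration in this set each epoch, which is why its cost grows with the distance bound, whereas Algorithm 1 only compares one candidate per knob.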
The throughput constraints were set to vary as depicted by the dotted lines in Figure 3a and Figure 4a. We also provide the configurations of the two control knobs at the same epochs for each benchmark in Figure 3b and Figure 4b, respectively. Note that we exclude a warm-up stage of the first few epochs, during which the performance of our approach climbed to the lower bound of the constraints, because the configurations were initially set to the lowest possible compute capability.
The performance traces of Baseline are omitted here because it is not designed to adapt to the given constraints; only its energy consumption is shown for comparison in Figure 3c and Figure 4c. When the memory access behavior is stable, as in Heart-Wall, the proposed approach and Exhaustive achieve similar throughput.

Figure 3. Comparison of our proposed approach with the Baseline and Exhaustive approaches under a smooth workload (Heart-Wall): (a) performance, (b) hardware configurations, and (c) energy consumption.
Figure 4. Comparison of our proposed approach with the Baseline and Exhaustive approaches under a heavily varying workload (object tracking): (a) performance, (b) hardware configurations, and (c) energy consumption.

The reason why Exhaustive performs relatively well in parts of Figure 3a is largely that the stable, compute-intensive workload of Heart-Wall is favorable for the performance model in [16], which, unlike ours, ignores the impact of memory intensity on performance.
However, when the memory intensity of the workload changes severely and nondeterministically, as in object tracking, the performance model of Exhaustive becomes inaccurate, and the adaptation tends to oscillate, as shown in Figure 4a. In contrast, our approach adapts to the workload variations smoothly. In turn, this better adaptivity leads to higher energy efficiency, as quantified in Figure 4c. We also observe that different patterns of resource management appear according to the workload characteristics, as shown in Figure 3b and Figure 4b.
The proposed technique adapts to Heart-Wall primarily by changing the core allocation over epochs. Since Heart-Wall is compute-intensive, using more cores while keeping the clock frequency fixed is advantageous in terms of energy efficiency, as demonstrated by our approach. As a result, the gap between the energy consumptions of the two approaches is marginal, as shown in Figure 3c. In contrast, much more complicated adaptation behavior appears in the case of object tracking.
Both control knobs are actively used over the epochs, meaning that the workload of this benchmark has many more memory-centric epochs than Heart-Wall and thus requires careful reconfiguration that accounts for the memory intensity of the workload. Consequently, Figure 4c shows that, unlike the case of Heart-Wall, our approach outperforms Exhaustive in terms of energy by a large margin.
Note that there is still room for further improvement: due to the incremental nature of our approach, resource over-provisioning or constraint violation can occur on a steep change in the constraint, as shown in epochs 7 and 19 of Figure 4a. More aggressive adjustment could alleviate such drawbacks, which is left as future work. Figure 5 shows the energy efficiency of the proposed approach and Exhaustive in terms of performance per watt: we sum the achieved throughput of each epoch and divide it by the power consumption accumulated over the epochs. Note that we take the constraint as the throughput of an epoch whenever the achieved throughput surpasses it, in order to avoid exaggerating the result of the proposed technique.
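The capped performance-per-watt metric described above can be computed as follows. The function name and the example numbers are illustrative, not taken from the paper; only the capping rule follows the text.

```python
def perf_per_watt(throughputs, constraints, powers):
    """Accumulated throughput per accumulated power across epochs.

    Per-epoch throughput is capped at the constraint so that epochs which
    overshoot their target do not inflate the metric.
    """
    capped = [min(t, c) for t, c in zip(throughputs, constraints)]
    return sum(capped) / sum(powers)
```

For instance, an epoch that achieves 12 frames/s against a 10 frames/s constraint contributes only 10 to the numerator, so the metric rewards meeting the constraint cheaply rather than exceeding it.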
As discussed, Exhaustive performs slightly better for the stable, compute-intensive workload Heart-Wall; in spite of its expensive computational cost, Exhaustive is just 4. In terms of CPU utilization, because Algorithm 1 limits the search space of reconfiguration candidates, the overhead of the proposed technique is always negligible.
This reveals that accurate modeling of system performance is a key to effective and efficient runtime resource adaptation. We presented a runtime management technique for DVFS-enabled multi-core sensory swarms that aims to minimize energy consumption under performance constraints. To adjust compute capability in response to the dynamically varying workload of an application, the proposed technique considers the runtime adjustment of two control knobs: task-to-core allocation and clock frequency scaling. To make accurate and effective decisions, we devised a set of simple performance and energy models for each of the adjustment options.
In particular, it proved to be even more effective when the application exhibited highly varying memory intensity, which is a realistic and challenging case in real-life sensory swarm systems. Kim devised the reconfiguration algorithm and designed the experiments; Yang performed the experiments and analyzed the results; S. Kim and H. Yang wrote and revised the paper together.