Tuesday, October 18, 2011

DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB




Carlos Villavieja, UPC






@PACT 2011







Translation Lookaside Buffers (TLBs) are ubiquitously used in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to front stage. TLBs are kept coherent at the software level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherency mechanism. First, we show that both TLB shootdown cost and frequency increase with the number of processors, and we project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, thereby eliminating the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
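To make the cost the paper attacks concrete, here is a toy user-space simulation contrasting the broadcast-style shootdown (the OS must disturb every core) with a directory that knows which cores actually cache a mapping. This is only a sketch of the idea, not the paper's hardware design; all names and structures are invented for illustration.

    /* Toy model: per-core TLBs plus a sharer directory, in the spirit
     * of (but much simpler than) the paper's shared TLB directory. */
    #include <stdio.h>

    #define NCORES   8
    #define TLB_SETS 64

    typedef struct { unsigned vpn; int valid; } tlb_entry_t;

    static tlb_entry_t core_tlb[NCORES][TLB_SETS];   /* per-core TLBs        */
    static unsigned    dir_sharers[1 << 16];         /* vpn -> sharer bitmap */

    static void tlb_fill(int core, unsigned vpn)
    {
        core_tlb[core][vpn % TLB_SETS] = (tlb_entry_t){ vpn, 1 };
        dir_sharers[vpn] |= 1u << core;          /* directory records sharer */
    }

    /* Baseline: the OS over-approximates the sharer set, so every core
     * pays for an interrupt (modeled here as touching every TLB). */
    static int shootdown_broadcast(unsigned vpn)
    {
        int disturbed = 0;
        for (int c = 0; c < NCORES; c++) {
            tlb_entry_t *e = &core_tlb[c][vpn % TLB_SETS];
            if (e->valid && e->vpn == vpn) e->valid = 0;
            disturbed++;                         /* every core takes the IPI */
        }
        return disturbed;
    }

    /* Directory-filtered: only cores recorded as sharers are invalidated. */
    static int shootdown_directory(unsigned vpn)
    {
        int disturbed = 0;
        for (int c = 0; c < NCORES; c++) {
            if (dir_sharers[vpn] & (1u << c)) {
                core_tlb[c][vpn % TLB_SETS].valid = 0;
                disturbed++;
            }
        }
        dir_sharers[vpn] = 0;
        return disturbed;
    }

    int main(void)
    {
        tlb_fill(0, 42); tlb_fill(3, 42);        /* mapping cached on 2 cores */
        printf("broadcast disturbs %d cores\n", shootdown_broadcast(42));
        tlb_fill(0, 42); tlb_fill(3, 42);
        printf("directory disturbs %d cores\n", shootdown_directory(42));
        return 0;
    }

The gap between the two printed counts (8 vs. 2 here) is exactly the waste that grows with core count and that a sharer directory eliminates.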

Optimizing Data Layouts for Parallel Computation on Multicores

Yuanrui Zhang, Wei Ding, Jun Liu, and Mahmut Kandemir

The Pennsylvania State University

@PACT 2011

The emergence of multicore platforms offers several opportunities for boosting application performance. These opportunities, which include parallelism and data locality benefits, require strong support from compilers as well as operating systems. Current compiler research targeting multicores mostly focuses on code restructuring and mapping. In this work, we explore automatic data layout transformation targeting multithreaded applications running on multicores. Our transformation considers both the data access patterns exhibited by different threads of a multithreaded application and the on-chip cache topology of the target multicore architecture. It automatically determines a customized memory layout for each target array to minimize potential cache conflicts across threads. Our experiments show that our optimization brings significant benefits over state-of-the-art data locality optimization strategies when tested using 30 benchmark programs on an Intel multicore machine. The results also indicate that this strategy is able to scale to larger core counts and performs better with increased data set sizes.
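As a rough illustration of what such a layout transformation buys, the sketch below contrasts per-thread counters that all land on the same cache line with a padded layout that gives each thread its own line. The 64-byte line size, thread count, and structures are assumptions for illustration; the paper's compiler derives layouts automatically from access patterns and cache topology rather than by hand.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    10000000L
    #define LINE     64          /* assumed cache-line size */

    long packed[NTHREADS];       /* all counters share one cache line */
    struct { long v; char pad[LINE - sizeof(long)]; } padded[NTHREADS];

    static void *work_packed(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++) packed[id]++;   /* line ping-pongs */
        return NULL;
    }
    static void *work_padded(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++) padded[id].v++; /* private line */
        return NULL;
    }

    static void run(void *(*fn)(void *), const char *label) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, fn, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("%s done\n", label);
    }

    int main(void) {
        run(work_packed, "packed (conflicting) layout");
        run(work_padded, "padded (conflict-free) layout");
        return 0;
    }

Timing the two runs (e.g. with time) on most multicore machines shows the padded layout scaling with thread count while the packed one does not, which is the class of conflict the automatic transformation is designed to remove.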

Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer

Nikolas Ioannou @ PACT 2011

University of Edinburgh


To improve energy efficiency, processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on the fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores.

This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consumption within a performance window, taking into consideration not only the local information for cores within frequency domains but also information that spans multiple frequency and voltage domains.

We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points.

Experimental results with SCC hardware show that our approach provides a significant improvement in Energy-Delay Product (EDP) of as much as 27.2%, and 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.
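The SCC's power-management API is not a public standard, so as a loose analogue, the sketch below drives per-core frequency through the Linux cpufreq "userspace" governor, a real sysfs interface that accepts a target frequency in kHz. The frequencies and phase boundaries here are purely illustrative (a real controller would key them off the detected phases), and writing to sysfs requires root.

    #include <stdio.h>

    /* Write a target frequency (kHz) for one core via cpufreq sysfs;
     * requires the "userspace" governor to be active on that core. */
    static int set_cpu_khz(int cpu, long khz)
    {
        char path[128];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed",
                 cpu);
        FILE *f = fopen(path, "w");
        if (!f) return -1;
        fprintf(f, "%ld", khz);
        return fclose(f);
    }

    /* Phase hook: lower the frequency before a communication-bound
     * phase (e.g. a blocking MPI wait) and raise it again before the
     * next compute phase. */
    int main(void)
    {
        set_cpu_khz(0, 800000);   /* communication phase: 800 MHz      */
        /* ... blocking communication would happen here ...            */
        set_cpu_khz(0, 2400000);  /* compute phase: back to 2.4 GHz    */
        return 0;
    }

Since EDP is simply energy multiplied by execution time, a controller like this wins whenever the energy saved during slow phases outweighs the (here, 7.7% average) slowdown it introduces.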

No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs

PACT Conference, 11 October 2011, 17:00 Spanish Time, Galveston Island TX, USA, Hotel Galvez.

Kishore Kumar Pusukuri, Rajiv Gupta, and Laxmi N. Bhuyan

Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, contention management policies provided by modern operating systems increase context switches and lead to performance degradation for multithreaded applications under high loads. Moreover, this problem is exacerbated by the interaction between contention management policies and OS scheduling policies. Time Share (TS) is the default scheduling policy in a modern OS such as OpenSolaris, and under the TS policy, thread priorities change very frequently to balance load and provide fairness in scheduling. Due to this frequent ping-ponging of priorities, threads of an application are often preempted by threads of the same application. This increases the frequency of involuntary context switches as well as lock-holder thread preemptions and leads to poor performance. The problem becomes especially serious under high loads.

To alleviate this problem, we present a scheduling policy called Faithful Scheduling (FF), which dramatically reduces context switches as well as lock-holder thread preemptions. We implemented FF on a 24-core Dell PowerEdge R905 server running OpenSolaris 2009.06 and evaluated it using 22 programs, including the TATP database application, SPECjbb2005, programs from PARSEC and SPEC OMP, and some microbenchmarks. The experimental results show that the FF policy achieves high performance on both lightly and heavily loaded systems. Moreover, it does not require any changes to application source code or the OS kernel.
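FF itself lives inside the OpenSolaris scheduler, where a fixed-priority class is the natural vehicle; as a portable approximation of the core idea, the sketch below (Linux SCHED_RR, an assumption on my part, and it requires root) parks a whole process at one fixed priority so the time-share policy stops reshuffling priorities and threads of the same program stop preempting each other.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* One fixed priority for the whole application; the value 10
         * is arbitrary for illustration. */
        struct sched_param sp = { .sched_priority = 10 };

        /* Apply to the calling process; threads spawned afterwards
         * inherit the policy, so none of them can out-prioritize a
         * sibling holding a lock. */
        if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("running at a fixed priority; no TS priority ping-pong\n");
        /* ... spawn worker threads and run the parallel workload ... */
        return 0;
    }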

A Unified Scheduler for Recursive and Task Dataflow Parallelism

PACT Conference, 11 October 2011, 17:00 Spanish Time, Galveston Island TX, USA, Hotel Galvez, Music Hall.

Hans Vandierendonck


Task dataflow languages simplify the specification of parallel programs by dynamically detecting and enforcing dependencies between tasks. These languages are, however, often restricted to a single level of parallelism. This language design is reflected in the runtime system, where a master thread explicitly generates a task graph and worker threads execute ready tasks and wake up their dependents. Such an approach is incompatible with state-of-the-art schedulers such as the Cilk scheduler, which minimize the creation of idle tasks (the work-first principle) and place all task creation and scheduling off the critical path. This paper proposes an extension to the Cilk scheduler that reconciles task dependencies with the work-first principle. We discuss the impact of task dependencies on the properties of the Cilk scheduler. Furthermore, we propose a low-overhead ticket-based technique for dependency tracking and enforcement at the object level. Our scheduler also supports renaming of objects in order to increase task-level parallelism. Renaming is implemented using versioned objects, a new type of hyperobject. Experimental evaluation shows that the unified scheduler is as efficient as the Cilk scheduler when tasks have no dependencies. Moreover, the unified scheduler is more efficient than SMPSs, a particular implementation of a task dataflow language.
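A minimal sketch of the ticket idea follows, with invented names and all accesses serialized for simplicity; the real scheduler distinguishes readers from writers, handles renaming via versioned objects, and integrates with Cilk's work stealing. Each task draws a ticket on a shared object when it is created, may touch the object only when the object's "serving" counter reaches its ticket, and admits the next dependent when it finishes.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        atomic_ulong next_ticket;   /* drawn at task-creation time  */
        atomic_ulong serving;       /* bumped when a task completes */
    } versioned_obj_t;

    /* The master thread calls this while building the task graph. */
    static unsigned long obj_register_task(versioned_obj_t *o)
    {
        return atomic_fetch_add(&o->next_ticket, 1);
    }

    /* A worker may execute the task once all predecessors are done. */
    static bool obj_task_ready(versioned_obj_t *o, unsigned long ticket)
    {
        return atomic_load(&o->serving) == ticket;
    }

    /* Completing a task wakes up the next dependent in ticket order. */
    static void obj_task_done(versioned_obj_t *o)
    {
        atomic_fetch_add(&o->serving, 1);
    }

    int main(void)
    {
        versioned_obj_t o = { 0, 0 };
        unsigned long t0 = obj_register_task(&o);  /* producer task     */
        unsigned long t1 = obj_register_task(&o);  /* consumer of t0    */
        (void)t0;

        printf("t1 ready before t0 finished? %d\n", obj_task_ready(&o, t1));
        obj_task_done(&o);                          /* t0 completes      */
        printf("t1 ready now? %d\n", obj_task_ready(&o, t1));
        return 0;
    }

The appeal of the scheme is that registering and releasing a ticket are single atomic increments, which keeps dependency enforcement off the critical path in keeping with the work-first principle.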

Tuesday, October 11, 2011

Extreme Computing



PACT Conference
11 October 2011, 15:50 Spanish Time

Galveston Island TX, USA, Hotel Galvez, Music Hall.

Stuart Feldman
Vice-President, Engineering, Google



Computing at the limits of technology calls for numerous engineering decisions and tradeoffs. General-purpose solutions do not work at the extremes. Traditional HPC has been analyzed for decades, resulting in specialized architectures. Life-critical systems, systems for large enterprises, and systems for tiny devices also present their own special requirements. The area of data-intensive computing is newer, and its computing models are less established. Supporting large numbers (millions) of users doing similar but different computations, who expect access to enormous amounts of information (petabytes, not gigabytes), prompt responses, and global access, calls for different compromises. Different applications present their own requirements and difficulties. This talk will address some of those needs: models of storage and data management appropriate for different types of application, networking demands for parallelism and global access, and management of large numbers of fallible processors and storage devices. Supporting such computing also calls for different approaches to software methodology, system management, and deployment. But massive data also opens new ways to approach science and to get remarkable results that delight and surprise users.



Biography:




Stuart Feldman is responsible for the health and productivity of Google's engineering offices in the eastern part of the Americas, Asia, and Australia. He also has executive responsibility for a number of Google products. Before joining Google, he worked at IBM for eleven years. Most recently, he was Vice President for Computer Science in IBM Research, where he drove the long-term and exploratory worldwide science strategy in computer science and related fields, led programs for open collaborative research with universities, and influenced national and global computer science policy. Prior to that, Feldman served as Vice President for Internet Technology and was responsible for IBM strategies, standards, and policies relating to the future of the Internet, and managed a department that created experimental Internet-based applications. Earlier, he was the founding Director of IBM's Institute for Advanced Commerce, which was dedicated to creating intellectual leadership in e-commerce. Before joining IBM in mid-1995, he was a computer science researcher at Bell Labs and a research manager at Bellcore (now Telcordia). In addition, he was the creator of Make as well as the architect for a large new line of software products at Bellcore. Feldman did his academic work in astrophysics and mathematics and earned his AB at Princeton and his PhD at MIT. He was awarded an honorary Doctor of Mathematics by the University of Waterloo in 2010. He is a former President of ACM (Association for Computing Machinery) and a member of the board of directors of the AACSB (Association to Advance Collegiate Schools of Business). He received the 2003 ACM Software System Award. He is a Fellow of the IEEE, of the ACM, and of the AAAS, and serves on a number of government advisory committees.