This is an archived page of the 2008 conference


Final update: April 29, 2008.



title Meeting the Challenge of Petascale Computing
author(s) Thom H. Dunning and Rob Pennington, National Center for Supercomputing Applications, USA
presenter Thom H. Dunning
abstract An increasing number of fields in science and engineering are using high performance computing to advance scientific discovery and the state-of-the-art in engineering. Within the next few years, petascale computers will be installed at several sites across the U.S. Although the opportunities associated with petascale computing are enormous, the challenges are daunting. Petascale computers will be very complex systems, built from multicore chips with 10,000s of chips and 100,000s of cores, 100s of terabytes of memory and 10,000s of disk drives. Petascale systems may also include accelerators, such as graphics processing units. The rise of petascale computing has significant implications for the design of the next generation of science and engineering applications as well as the methodology to develop them. We will discuss the steps we are taking to meet the challenge of petascale computing.
title From the Petascale to the Exascale
author(s) Pete Beckman, ANL, USA
presenter Pete Beckman
abstract Recently, Argonne National Laboratory installed a half petaflop Blue Gene/P system.  It is the largest Blue Gene platform in the world, and currently the largest platform available for open scientific research. In my presentation, I will cover its architecture, file system, software system, computational science, and our roadmap for exascale computing.
title (Some) Answers to the Challenges of Petascale Climate Simulation
author(s) Rich Loft, NCAR, USA
presenter Rich Loft
abstract Over the past 30 years, the use of supercomputers to obtain numerical solutions of the equations governing weather and climate, along with satellite-based global observations, has led to the realization that the Earth is characterized by a number of complex, highly interactive phenomena that must be understood as a complete system. The problem is complicated by the fact that the sub-components of the Earth system interact on a wide range of spatial and temporal scales, requiring century-long simulations and high resolutions to capture.

The anticipated availability of massively parallel petascale computers in the next few years offers the climate community a golden opportunity to dramatically advance our understanding of the Earth’s climate system and climate change, if these systems can be harnessed to the task, for the fit is not perfect. First, massively parallel systems will impose stringent and unavoidable Amdahl's-law requirements on application scalability. Second, the trade-off between resolution and integration rate, both critical factors in climate modeling, is severe. Third, the increasing complexity of petascale systems, e.g., in terms of the number of cores on a chip, increases the tension between the system architecture and programmability. Finally, the size and complexity of climate applications make them difficult to port, adapt, and validate on new architectures.

This talk will address on-going efforts within the DOE SciDAC and NSF PetaApps programs to both seize this important scientific opportunity and address the increased complexity of petascale systems. Efforts to develop lightweight, incremental, and beneficial scaling improvements on existing climate ocean, land, and sea-ice components will be demonstrated. Similar improvements for the atmosphere will be shown for the High-Order Method Modeling Environment (HOMME), a new dynamical core currently being evaluated within the Community Atmosphere Model (CAM). This progress has improved the scalability and performance of these components to the point that simulations coupling a 50 km atmospheric component to eddy-resolving ocean and sea-ice components are now being attempted at Lawrence Livermore National Laboratory.

Further gains may involve more complex and far-reaching modifications. For example, algorithmic acceleration, through adaptive spatial and temporal schemes for solving non-hydrostatic systems of equations, combined with dynamic load-balancing techniques, may be needed to ultimately obtain practical climate simulation capabilities at cloud resolving scales.
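The Amdahl's-law constraint mentioned in the abstract can be made concrete with a short sketch (the serial fractions below are hypothetical, chosen only to illustrate the scaling limit, not taken from any climate code):

```python
def amdahl_speedup(serial_fraction, p):
    """Amdahl's law: the speedup on p processors when a fixed
    fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Even a 0.1% serial fraction caps speedup near 1000x,
# however many cores a petascale machine provides.
for p in (1_000, 10_000, 100_000):
    print(p, round(amdahl_speedup(0.001, p), 1))
```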


RAS, Fault Tolerance, and Programmability
title A Scalable Unified Fault Tolerance for HPC Environments
author(s) Kulathep Charoenpornwattana, Box Leangsuksun, Anand Tikotekar, Louisiana Tech U.; Geoffroy Vallée and Stephen Scott, Oak Ridge National Lab, USA
presenter Geoffroy Vallée
abstract Reliability is one of the major issues in High Performance Computing (HPC) systems today, and it is expected to become an even greater challenge in next-generation petascale systems. Traditional fault tolerance mechanisms (e.g., checkpoint/restart) may not be efficient in every scenario on such large-scale systems due to scalability and performance issues. In this paper, we aim to provide a distributed, scalable Unified Fault Tolerance (UFT), which combines Proactive Fault Avoidance (PFA) and traditional Reactive Fault Tolerance (RFT) for HPC systems based on fault prediction and virtualization technologies. Simulation results suggest a performance improvement over existing solutions.
title TALC: A Simple C Language Extension For Improved Performance and Code Maintainability
author(s) Jeff Keasler, Terry Jones, and Dan Quinlan, Lawrence Livermore National Laboratory, USA
presenter Jeff Keasler
abstract In this paper, we present TALC -- a small language extension for C and C++ suitable for applications that traverse common data structures such as large meshes or cubes. We make three contributions in this paper. First, we motivate the need for a new C/C++ extension focused on addressing emerging problem areas in performance and code maintainability. Second, we define the language extension and illustrate how it is employed in C. Third, we demonstrate the utility of such an extension by providing comparison code snippets that demonstrate advantages in both software maintainability and performance. Performance benefits of the extension are provided for several experiments resulting in up to 200% speedups over more conventional methods to achieve the same algorithm.
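The data-layout question that motivates an extension like TALC can be illustrated outside of C as well. The sketch below (in Python, with invented field names; TALC itself targets C/C++ mesh fields) contrasts the two classic layouts a mesh traversal might use while keeping the computed result identical:

```python
# Hypothetical mesh with two fields per node, stored two ways.
# A layout-aware extension lets the traversal source stay the
# same while the underlying layout changes; this sketch only
# shows the two layouts themselves.

# Array of structs: one record per node.
aos = [{"x": float(i), "v": 2.0 * i} for i in range(4)]

# Struct of arrays: one contiguous array per field.
soa = {"x": [float(i) for i in range(4)],
       "v": [2.0 * i for i in range(4)]}

def kinetic_sum_aos(mesh):
    # Touches every field of every record, even unused "x".
    return sum(n["v"] * n["v"] for n in mesh)

def kinetic_sum_soa(mesh):
    # Streams through only the field it needs.
    return sum(v * v for v in mesh["v"])
```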
Student Papers I
title Undergraduate Experience in Clustering at the SC07 Cluster Challenge
author(s) Alexander Younts, Andrew Howard, Preston Smith, and Jeffrey Evens, Purdue University, USA
presenter Alexander Younts
abstract The Cluster Challenge held at SC07 in Reno, Nevada provided a unique opportunity for undergraduate students to build, benchmark, and run real science applications on a state-of-the-art cluster of computers, while working with real-world constraints and interacting with experts in the high-performance computing field. After building a cluster, students ran the HPC Challenge benchmark and two days' worth of jobs for the applications POP, GAMESS, and POVRay. In this paper, students who participated in the Cluster Challenge describe the process for preparing for the competition, outline the work performed to understand complex real-world codes, and report on lessons learned while preparing for and competing in the Cluster Challenge.
title Accelerating MPIBLAST on System X by using RAMdisks
author(s) Vivek Venugopal, Kevin Shinpaugh, Geoff Zelenka, and Luke Scharf, Virginia Tech, USA
presenter Vivek Venugopal
abstract Genome sequencing is a major contributor to the high performance computing workload worldwide. Scientists use the Basic Local Alignment Search Tool (BLAST) from the National Center for Biotechnology Information (NCBI) and perform their searches against databases from GenBank, the worldwide repository for genome sequences. With GenBank reporting an exponential growth rate, more emphasis is placed on the development of faster search algorithms and better access techniques for these huge databases. Even though parallel implementations, such as mpiBLAST, have emerged to address this problem, the performance of mpiBLAST depends on many factors, including storage, memory, and the algorithm-to-architecture mapping. This paper describes a Random Access Memory (RAM) disk implementation on Virginia Tech's System X Linux nodes that modifies the shared storage of the mpiBLAST program so that fragments are stored in a RAM-based filesystem before the scheduled mpiBLAST execution. The user-specified genome database is pre-formatted, fragmented, and stored in local storage (e.g., the /tmp partition) on the compute nodes. The fragmented pieces are then transferred from the local storage partitions to the RAMdisk partition. The initial pre-formatting and distribution of databases results in an average time savings of 90% and a speedup of 8x over the normal Network File System (NFS)-based shared-storage mpiBLAST implementation, with sustained results over 16 and 32 cluster nodes for the various query files.
title The Cluster Challenge - 6 Students, 26 Amps, 44 Hours
author(s) Robert Beck, Gordon Klok,and Paul Greidanus, University of Alberta, Canada
presenter Gordon Klok
abstract The University of Alberta was one of six teams of undergraduate students competing at the Cluster Challenge at the 2007 SuperComputing conference in Reno, Nevada. The purpose of the challenge was to interest and educate undergraduate students in cluster computing, and to demonstrate the maturity of high performance computing tools. Our team won the Cluster Challenge with a cluster built from Altix XE310 systems provided by vendor partner SGI. In this paper we discuss some of the techniques that worked for the University of Alberta team and some which did not. Some of the most important lessons we learned were:
- Knowing your application is very important to ensuring correctness and achieving good performance.
- Measure power consumption. In our case, many of our initial assumptions were not borne out by real testing.
- For certain applications, having nodes with extra memory is beneficial.
Performance Tools; Resource Management
title Hardware Assisted Precision Time Protocol (PTP, IEEE 1588) - Design and Case Study.
author(s) Patrick Ohly, Intel Corporation, DE
presenter Patrick Ohly
abstract Keeping system time closely synchronized among all nodes of a cluster is a hard problem. The Network Time Protocol reliably synchronizes only to an accuracy of a few milliseconds. This is too coarse to compare time stamps of fast events on modern clusters, for example, the send and receive times of a message over a low-latency network. The Precision Time Protocol (PTP), defined in IEEE 1588, specifies a protocol which can substantially enhance the time accuracy across nodes in a local area network. An open source implementation of PTP (PTPd) relies on software time stamping, which is susceptible to jitter introduced by the non-realtime OS. An upcoming Ethernet NIC from Intel solves this problem by providing time stamping in hardware. This paper describes our modifications which allow PTPd to make use of this hardware feature, and evaluates several approaches for synchronizing the system time against the PTP time. Without hardware assistance, PTPd achieved accuracy as good as one microsecond; with hardware assistance, accuracy was reliably improved and dependence on real-time packet time stamping in software was virtually eliminated.
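The offset and delay arithmetic at the heart of PTP is simple; the hard part is obtaining accurate timestamps, which is exactly what the hardware assistance improves. Below is a sketch of the standard IEEE 1588 delay request-response exchange (illustrative, not PTPd code; it assumes a symmetric network path):

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """IEEE 1588 two-step calculation, assuming a symmetric path:
      t1: master sends Sync       t2: slave receives Sync
      t3: slave sends Delay_Req   t4: master receives Delay_Req
    Returns (slave clock offset from master, one-way path delay)."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0
    delay = ((t2 - t1) + (t4 - t3)) / 2.0
    return offset, delay

# Scenario: slave clock runs 5 us ahead of the master and the
# one-way path delay is 10 us; all times in seconds.
offset, delay = ptp_offset_and_delay(0.0, 15e-6, 100e-6, 105e-6)
```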
title ARUM: Application Resource Usage Monitor
author(s) Rashawn Knapp, Karen Karavanic, Portland State University; Douglas Pace, IBM, USA
presenter Rashawn Knapp
abstract We present the design and initial implementation of ARUM, an Application Resource Usage Monitor for multi-core and multi-processor AMD and Intel systems running the Linux operating system. ARUM is a lightweight, easy-to-use tool that operates on unmodified binaries, requires no kernel modifications to support access to hardware counters, and is designed to measure both system level and application level metrics. The design contains four measurement aspects: process and thread level resource usage, architecture specific event counting, application level measurements, and measurements of the ambient environment. We describe the design of ARUM and its related goals and design requirements. We have implemented the first two measurement components. In this paper we present the implementation of these components and early results of using ARUM.
title Preemption and Priority Calculation in Production Job Scheduling
author(s) Martin Margo, Kenneth Yoshimoto, SDSC, USA; Patricia Kovatch, University of Tennessee, Knoxville, USA
presenter Martin Margo
abstract In order to offer on-demand computing services as well as improve overall job throughput, the San Diego Supercomputer Center has implemented preemption on its production supercomputers. Preemption allows time-critical applications to run as needed and also allows more jobs to be processed through the queue, since jobs can backfill in several smaller discrete blocks of time rather than in one contiguous block. With our home-grown scheduler, Catalina [4], we implemented preemption to enable these new usage scenarios. We explore the impact on the job expansion factor and utilization when preemption and job priority are taken into account for both long-running and large-scale jobs.
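The job expansion factor mentioned above has a standard definition, sketched here together with a toy priority function (the weights are invented for illustration; the abstract does not give Catalina's actual priority formula):

```python
def expansion_factor(wait_time, run_time):
    """Job expansion factor: (wait + run) / run.
    1.0 means the job started the moment it was submitted."""
    return (wait_time + run_time) / run_time

def priority(queue_wait, node_count, wait_weight=1.0, size_weight=0.1):
    """Toy priority: grows with time spent in the queue and with
    job size, so large jobs are not starved. Weights are invented
    for this sketch."""
    return wait_weight * queue_wait + size_weight * node_count
```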
Applications and Grid
title A General Relativistic Evolution Code on CUDA Architectures
author(s) Burkhard Zink, Louisiana State University, USA
presenter Burkhard Zink
abstract I describe the implementation of a finite-differencing code for solving Einstein's field equations on a GPU, and measure speed-ups compared to a serial code on a CPU for different parallelization and caching schemes. Using the most efficient scheme, the (single precision) GPU code on an NVIDIA Quadro FX 5600 is shown to be up to 26 times faster than a serial CPU code running on a 2.4 GHz AMD Opteron. Even though the actual speed-ups in production codes will vary with the particular problem, the results obtained here indicate that future GPUs supporting double-precision operations can potentially be a very useful platform for solving astrophysical problems.
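For readers unfamiliar with finite differencing, the serial baseline being accelerated has the following general shape. This is a deliberately simple 1D heat-equation step in Python, not the far more complex Einstein-equation stencils of the paper, but it has the same neighbor-access pattern that a GPU caching scheme must handle efficiently:

```python
def fd_step(u, dt, dx, alpha=1.0):
    """One explicit finite-difference step of the 1D heat
    equation u_t = alpha * u_xx, with fixed boundary values.
    Each interior point reads its two neighbors, the access
    pattern that GPU shared-memory caching schemes exploit."""
    c = alpha * dt / (dx * dx)
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + c * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new
```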
title Robust Machine Learning Applied to Terascale Astronomical Datasets
author(s) Nicholas Ball, Robert Brunner, and Adam Myers, University of Illinois at Urbana-Champaign, USA
presenter Nicholas Ball
abstract We present recent results from the LCDM (Laboratory for Cosmological Data Mining) collaboration between UIUC Astronomy and NCSA to deploy supercomputing cluster resources and machine learning algorithms for the mining of terascale astronomical datasets. This is a novel application in the field of astronomy, because we are using such resources for data mining, and not just performing simulations. Via a modified implementation of the NCSA cyberenvironment Data-to-Knowledge, we are able to provide improved classifications for over 100 million stars and galaxies in the Sloan Digital Sky Survey, improved distance measures, and a full exploitation of the simple but powerful k-nearest neighbor algorithm. A driving principle of this work is that our methods should be extensible from current terascale datasets to upcoming petascale datasets and beyond. We discuss issues encountered to date, and further issues for the transition to petascale. In particular, disk I/O will become a major limiting factor unless the necessary infrastructure is implemented.
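The k-nearest neighbor method named above is conceptually simple; the challenge in the paper is applying it at terascale. A toy version is sketched below (the two-feature points and labels are invented stand-ins for photometric measurements, not survey data):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Plain k-nearest-neighbor majority vote. `train` is a list
    of (feature_vector, label) pairs."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda t: sq_dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Invented training set: two well-separated clusters.
train = [((0.1, 0.2), "star"), ((0.2, 0.1), "star"),
         ((0.9, 0.8), "galaxy"), ((0.8, 0.9), "galaxy"),
         ((0.85, 0.85), "galaxy")]
```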
Systems Evaluation
title Early Evaluation of the IBM BG/P
author(s) Patrick Worley, Oak Ridge National Laboratory, USA
presenter Patrick Worley
abstract This paper describes early results of a performance evaluation of the IBM BG/P recently installed at Oak Ridge National Laboratory. We use microkernels to determine computation and communication performance, both with and without contention. We use the Parallel Ocean Program to evaluate application scalability and to distinguish BG/P performance from other High Performance Computing systems.
title Investigating the Balance Between Capacity and Capability Workloads across Large-Scale Computing Platforms
author(s) Mahesh Rajan, Courtenay Vaughan, Robert Leland, Douglas Doerfler, Robert Benner, Sandia National Laboratories, USA
presenter Mahesh Rajan
abstract The focus of this paper is the effectiveness of HEC (high-end computing) systems in meeting engineering and scientific analysis needs. Performance measurement and analysis of the applications constituting the workload, on a large commodity InfiniBand cluster and on a large custom Cray XT3, is used to assess the merits of the competing HEC architectures. Applications with communication-intensive algorithms show a factor of 2 to 10 better performance (on 1024 processors) on the XT3, making the XT3 ideal for long, large capability simulations. However, applications with moderate to low communication needs perform comparably on the cluster, and such commodity clusters eminently meet the need for higher-volume capacity computing cycles. We analyze the reasons for the performance difference seen between the two systems. Since the single-CPU wall clock execution time is very close between the two systems, we use parallel efficiency as a measure to analyze optimal workload mapping onto our capability and capacity computing resources.

Introduction: Understanding the performance of scientific applications on high performance computers is important for setting resource management policies. Application performance can be influenced by the architecture of the computer, software characteristics, and characteristics introduced by the simulation being run. The same application may be used to run very large capability-class simulations, or used with a smaller number of processors in several runs in a capacity context to cover a range of parameter space for analyses such as uncertainty quantification. In the context of current and future major investments in capacity and capability computing systems, it is useful to analyze the mapping of workload onto the available computing resources. Current HEC systems vary in node/processor architecture, interconnect, and system software.

The IDC classification of HEC systems into two broad categories, capability and capacity, is widely used. However, the demarcation is not strictly defined, and the applications and analyses targeted at these HEC systems again cross the definition boundaries. Our experience with a number of applications and analysts' needs clearly indicates the need for large-capacity compute cycles. At the same time, capability computing often addresses the need for interesting new science that was previously not undertaken due to a lack of compute power. A broad guideline for classifying capability-class simulations currently in vogue includes: 1) simulations that use a significant fraction of the total nodes installed; 2) simulations that require large memory, I/O, and storage; 3) simulations with stringent time-to-solution and short design cycle times; 4) some combination of the above characteristics that makes such a system the only means of achieving the goal. In this context, both from a management concern for providing the correct investment to meet an institution's needs and from an analyst's desire to extract optimal performance, there exists a strong need to understand the effectiveness of different classes of HEC systems in meeting engineering and scientific analysis needs. Table 1 is the result of a usage survey done a few years ago, listing the top few applications and their node-hour percentage usage. The current fraction is based on usage logs, and the estimated future fraction is based on user surveys reflecting programmatic needs. The recent availability of large capability computing systems, such as ASC Red Storm at Sandia and ASC Purple at LLNL, has enabled analysts to conceive new approaches and analyses that were difficult, if not impossible, to undertake on a routine basis. The statistics of node-hours for such large capability-class simulations are just beginning to emerge.

However, the question of appropriate allocation of compute cycles on capability and capacity computing systems when the demand for total node-hours exceeds available resources is an area of much interest. In this paper we have measured application scaling characteristics so that the efficiency gains of a capability-class system for each application guide the selection and allocation of limited capability computing cycles. A large InfiniBand cluster with over 8,000 processors and a large Cray XT3 with over 20,000 processors are used to measure the performance of seven applications of interest. The measured parallel efficiency on both systems is used to understand the impact of architectural balance. Parallel efficiency works as a useful measure because the single-CPU performance is very close. In some cases, we used strong scaling with engineering models that do not lend themselves to easy construction of weak-scaling inputs. It is recognized that scaling behavior is data-set dependent, and often bigger models permit scaling to a larger number of processors. However, the performance ratio between the two systems provides broad guidelines on the optimal usage of both systems to meet capability and capacity computing node-hour demands.

Table 1. SNL application node-hour usage and projections

Code     Use                               Numerical Method                            Current  Future
Presto   Crash / solid dynamics            FEM, explicit time integration              34.4%    15%
Salinas  Vibration / structural dynamics   FEM, spectral analysis                      15.8%    10%
LAMMPS   Molecular dynamics                FFT, sparse matrix methods                  12.8%    10%
DSMC     Plasma dynamics                   Discrete Simulation Monte Carlo             10.4%    10%
CTH      Penetration / hydrodynamics       Control volume, explicit time integration    7.4%    10%
ITS      Radiation transport               Monte Carlo                                  0.08%   15%
SAGE     Hydrodynamics                     Finite volume                                0.0%    TBD
TOTAL                                                                                  81%      70%

In the paper we first provide a short description of each application and the analysis that was benchmarked on the two systems. Wall clock run-time ratios and parallel efficiency plots show the scaling characteristics of the applications.
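Parallel efficiency, used throughout the paper as the comparison measure, has a one-line definition (the generic formula; the per-application timings are in the paper, not here):

```python
def parallel_efficiency(t1, tp, p):
    """Parallel efficiency E = T(1) / (p * T(p)). When single-CPU
    times are nearly equal on two machines, comparing E isolates
    interconnect and architectural-balance effects."""
    return t1 / (p * tp)
```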
title Capacity Planning of a Commodity Cluster in an Academic Environment: A Case Study
author(s) Baochuan Lu, Linh Ngo, Hung Bui, Amy Apon, University of Arkansas, USA; Nathan Hamm, Larry Dowdy, Vanderbilt University, USA; Doug Homan and Denny Brewer, Acxiom Corporation, USA
presenter Amy Apon
abstract In this paper, the design of a simulation model for evaluating two alternative supercomputer configurations in an academic environment is presented. The workload is analyzed and modeled, and its effect on the relative performance of both systems is studied. The Integrated Capacity Planning Environment (ICPE) toolkit, developed for commodity cluster capacity planning, is successfully applied to the target environment. The ICPE is a tool for workload modeling, simulation modeling, and what-if analysis. A new characterization strategy is applied to the workload to more accurately model commodity cluster workloads. Through “what-if” analysis, the sensitivity of the baseline system performance to workload change, and also the relative performance of the two proposed alternative systems are compared and evaluated. This case study demonstrates the usefulness of the methodology and the applicability of the tools in gauging system capacity and making design decisions.
Student Papers II
title Fast Two-Point Correlations of Extremely Large Data Sets
author(s) Joshua Dolence and Robert Brunner, University of Illinois at Urbana-Champaign, USA
presenter Joshua Dolence
abstract The two-point correlation function (TPCF) is an important measure of the distribution of a set of points, particularly in astronomy where it is used to measure the clustering of galaxies in the Universe. Current astronomical data sets are sufficiently large to seriously challenge current serial algorithms in computing the TPCF, and within the next few years, the size of data sets will have far exceeded the capabilities of serial machines. Therefore, we have developed an efficient parallel algorithm to compute the TPCF on large data sets. We discuss the algorithm, its implementation using MPI and OpenMP, performance, and issues encountered in developing the code for multiple architectures.
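The computational kernel behind the TPCF is all-pairs separation counting, which is what makes the problem O(n^2) and worth parallelizing. A minimal serial version is sketched below (illustrative only; the paper's MPI/OpenMP implementation partitions exactly this kind of loop):

```python
import math

def pair_count_histogram(points, bin_edges):
    """Brute-force pair counts binned by separation, the O(n^2)
    kernel at the heart of two-point correlation estimators."""
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            r = math.dist(points[i], points[j])
            for b in range(len(bin_edges) - 1):
                if bin_edges[b] <= r < bin_edges[b + 1]:
                    counts[b] += 1
                    break
    return counts
```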
title A Scalable Framework for Offline Parallel Debugging
author(s) Karl Lindekugel, Anthony DiGirolamo, and Dan Stanzione, Arizona State University, USA
presenter Anthony DiGirolamo
abstract Detection and analysis of faults in parallel applications is a difficult and tedious process. Existing tools attempt to solve this problem by extending traditional debuggers to inspect parallel applications. This technique is limited, since it must connect to each compute process, and it will not scale to next-generation systems running on hundreds of thousands of processors. Additionally, existing techniques for monitoring programs and collecting runtime information can scale, but are unable to provide enough interaction to find complex software faults. This paper describes a novel parallel application debugger that combines parallel application debugging and a programmable interface with runtime event gathering and automated offline analysis. This debugger is shown to diagnose several common parallel application faults through offline event analysis.
title SC07 Cluster Challenge: A Student Perspective
author(s) Dustin Leverman, University of Colorado at Boulder, USA
presenter Dustin Leverman
abstract This paper examines the design and performance of a cluster constructed by a team of undergraduate students for the SC07 Cluster Challenge. The objective of the Cluster Challenge is to engineer a system that operates within a strict power envelope while maximizing performance on a set of prescribed benchmarks. A description of the hardware and software configuration is presented, with emphasis on the challenges encountered in optimizing the benchmarks and the techniques employed to stay within the power budget. The paper distills several strategies for success from our experiences that apply to competitions such as the SC07 Cluster Challenge as well as to organizations undertaking general cluster architecture design and implementation.
title Deploying pNFS Across the WAN: First Steps in HPC Grid Computing
author(s) Dean Hildebrand, Marc Eshel, Roger Haskin, IBM Almaden, USA; Phil Andrews, National Institute for Computational Sciences, USA; Patricia Kovatch, University of Tennessee at Knoxville, USA; John White, Revision3, USA
presenter Dean Hildebrand
abstract Global file systems promise great advantages in convenience, performance, and ubiquitous access over more traditional data movement mechanisms for HPC grids. Unfortunately, their widespread acceptance is limited by licensing issues and a lack of interoperability. Using the pNFS extension to the popular NFS standard, we performed the first large-scale test of non-proprietary clients accessing GPFS servers within a real HPC Grid environment. Extremely high transfer rates across the TeraGrid Wide Area Network were achieved, and we see this as a possible harbinger of much wider applicability for global file systems.
title A First Look at Scalable I/O in Linux Commands
author(s) Kenneth Matney, Sr., Sarp Oral, Shane Canon, Oak Ridge National Laboratory, USA
presenter Kenneth Matney, Sr.
abstract The volume of data created and used by terascale and petascale applications continues to increase, but our ability to handle and manage these files is still limited by the capabilities of the standard serialized Linux command set. This paper introduces efforts at the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) to provide parallelized and more efficient versions of commonly used Linux commands. The design and implementation details, as well as a performance analysis, of an in-house distributed, parallelized version of the cp tool, spdcp, are presented. Tests show that our spdcp utility achieves up to 73 times the performance of its serialized counterpart. In addition, we introduce current work to extend this approach to other tools.
title Exploration of Parallel Storage Architectures for a Blue Gene/L on the TeraGrid
author(s) Michael Oberg, National Center for Atmospheric Research, USA; Henry Tufo, U. Colorado & NCAR, USA; Matthew Woitaszek, University of Colorado at Boulder, USA
presenter Michael Oberg
abstract This paper examines the construction of a cost-effective storage cluster for an IBM Blue Gene/L supercomputer on the TeraGrid. The performance of storage clusters constructed using commodity Opteron servers and directly-attached fibre-channel RAID systems is analyzed utilizing the GPFS, Lustre, and PVFS file systems. In addition to traditional client scalability results, server scalability is also examined, identifying the number of servers required to saturate a 10 Gbps infrastructure network uplink. The results highlight test and system configurations that provide scalable throughput and identify configurations where the environment restricts scalability. Moreover, the Blue Gene/L’s multiplexed I/O design introduces an additional point of contention that affects each file system to varying degrees. The paper concludes with a summary of our anticipated future work and recommendations for storage cluster deployment.


title Parallel Performance Evaluation Tools for HPC Systems: PerfSuite, PAPI, TAU, KOJAK and Vampir
author(s) Sameer Shende, University of Oregon, USA & Rick Kufrin, NCSA, USA
presenters Sameer Shende and Rick Kufrin
abstract Performance analysis concepts have manifested in robust software tools for HPC systems. Today, there are frameworks available that offer the collection, analysis, and visualization of runtime profiles, detailed program behavior, and other performance characterizations. Additionally, tools for automatic code generation, stepwise execution, debugging, verification, simulation, and similar purposes are available. Yet, amazingly, the print and time directives are still the most frequently used tools even 35 years after the emergence of the first HPC systems. This tutorial presents the state-of-the-art tool developments for leading-edge HPC, parallel, and cluster systems, focusing on profile- and trace-based program visualization, automatic performance analysis, and programming environments that provide generic interfaces to the tools. The what, where, when, and why questions regarding program execution and performance will form the center of the tutorial, and tool characteristics and features will be presented according to these questions. The overall goal is to show the real potential of modern tools for program analysis. In addition, we will show how these tools can be used to diagnose and locate typical performance bottlenecks in real-world parallel programs. To meet the needs of computational scientists to evaluate the performance of their parallel scientific applications, we present five parallel performance evaluation tools: PerfSuite, PAPI, TAU, KOJAK, and Vampir/VampirServer. This workshop will focus on performance data collection, analysis, and performance optimization. PerfSuite allows users to easily profile an application using sampling. The TAU performance system uses instrumentation, or direct insertion of calls, to measure application performance.
After describing and demonstrating how performance data (both profile and trace data) can be collected using TAU's (Tuning and Analysis Utilities) automated instrumentation, the tutorial will cover how to analyze the performance data collected and drill down to find performance bottlenecks and determine their causes. The workshop will include some sample codes that illustrate the different instrumentation and measurement choices available to users. Topics will cover generating performance profiles and traces with hardware performance counter data using PAPI. Hardware counter data can show not only which routines are taking the most time, but also why: for example, cache misses, TLB misses, excess address arithmetic, or poor branch prediction behavior. Automated analysis of trace data using the KOJAK tool, in conjunction with TAU, can help find and determine the causes of communication inefficiencies such as excessive communication blocking. Parallel trace-based analysis using the Open Trace Format (OTF) traces emitted by TAU will be demonstrated using the Vampir/VampirServer trace visualization tool.
title Debugging Parallel and Distributed Applications
author(s) Ed Hinkel, TotalView Technologies, USA
presenter Ed Hinkel
abstract Tutorial Description: This multi-core debugging tutorial will feature a presentation with hands-on content focusing on ways to solve bugs in parallel and distributed applications. Bugs can be challenging enough to track down in serial environments, and can be even more vexing when they involve an MPI program running in a cluster environment. The proposed tutorial will provide an introduction to debugging parallel applications with the TotalView Debugger, which will cover basic operations as well as introduce important concepts such as subset attach and parallel process control.

Tutorial participants will learn how to use TotalView to debug both serial and parallel applications. Specifically, they will learn techniques for examining program state and program data, such as how to work with complex data structures, standard template library collection class objects, large multi-dimensional arrays and data that is distributed across many different processes. They will also learn how to synchronize and precisely control their parallel application using techniques such as automatically and manually defined process and thread groups, breakpoints, barrier points and evaluation points. The format will include both a prepared PowerPoint presentation to provide context and content, as well as hands-on interactive sessions so that participants can gain confidence that they are fully grasping and can apply the ideas covered in the session.

Tutorial Outline

Introduction to MPI Debugging
* Goals, organization, challenges and context

Getting Started
* Initiating a debugging session
* Orientation to the interface
* Breakpoints, variables and serial process control

Interactive Session 1

TotalView as a Parallel Debugger
* Parallel process control concepts
* Overview of automatically and manually defined process and thread groups, barrier points and evaluation points

Interactive Session 2

Examining Program State and Program Data
* How to work with complex data structures, standard template library collection class objects, large multi-dimensional arrays and data that is distributed across many different processes

Interactive Session 3

MPI Debugging Techniques
* Effectively solving problems in parallel applications
* Interactive techniques that can be applied to MPI applications running on a single node or a large cluster
* Strategies for scalability
* Overview of techniques for working effectively in batch environments and at “production scale”: attaching to running jobs, working with subsets of full jobs, etc.

The Basics of Scripting TotalView
* How to use a script to diagnose a problem semi-interactively with a long-running job in a batch queue environment

Duration: Full-day tutorial

Audience: This tutorial will be appropriate for developers using FORTRAN 90 and C/C++ with MPI to develop parallel applications. The training includes introductory material for participants who are unfamiliar with the highly graphical TotalView debugging interface, then builds to a discussion of parallel debugging and concludes with a few advanced topics.

Content Level: 40% Introductory, 40% Intermediate, 20% Advanced

Prerequisites: Familiarity with programming concepts, the C/C++ or FORTRAN language, and basic tools such as a compiler is assumed. Familiarity with basic concepts of MPI and parallel programming is required. No prior experience with TotalView is required.
Participants should bring an Intel-based laptop (an x86/x64 system) with a CD drive to get the most out of the course. Participants will be provided with a “live” CD that contains a standard Linux environment, the debugger and the tutorial target programs.

Live Demo Content: The proposed tutorial will include several short live demos of the latest version of TotalView running on a simple serial application and an MPI application. Demo content will be in a variety of languages, including FORTRAN 90 and C/C++.

Hands-on Exercise Content: During the proposed tutorial, participants will have the opportunity to reinforce the material delivered in each section by completing hands-on exercises. Exercises will start out very simple and build as the day goes on. Preliminary exercise topics include:
* Getting Started – Using the live CD, compiling the first example and attaching to a process.
* Examining a Parallel Application – Processes, threads and variables.
* Controlling a Parallel Application – Command scope, groups and barriers.
These exercises will be worked through with the whole group in 15-20 minute blocks throughout the day. Access to exercise content will be provided in the form of a “live” Linux distribution CD (potentially based on Ubuntu) that can be used on any Linux or Windows PC. The exercises will be provided in source and binary form, and the TotalView binary and evaluation licenses will be included so that, after the session, users may take the trial license and examples to use on other systems.
title HDF5 – Addressing Data Management Challenges on High-Performance Linux Clusters
author(s) Elena Pourmal, Mike Folk, and Ruth Aydt, The HDF Group, USA
presenter Elena Pourmal
abstract Scalable and manageable I/O is critical for many applications that run on Linux clusters. Applications often require not only hundreds of processors and hours of computation time, but also need an efficient way to store and manage gigabytes or more of complex and diverse data. While parallel filesystems, such as GPFS and PVFS, in combination with parallel I/O libraries, such as MPI-IO, enable scalable access to data, they do not provide convenient high-level interfaces to deal with data complexity, extensibility, and portability. Nor do they provide a straightforward way to manage the complex relationships among the application's data components - relationships that often extend beyond a single execution and that may bridge multiple applications running on different computational platforms. To address the gaps between I/O libraries and the applications' view of their data, in-house data formats and manipulation routines are often developed by individuals or teams to meet the specific needs of their project. While the initial time to develop and deploy such a solution may be quite low, the results are often not portable, not extensible, and not high-performance. In many cases, the time devoted to extending and maintaining the data management portion of the code takes an increasingly large percentage of the total development effort and reduces the time available for the primary objectives of the project. As an alternative to in-house development, many users have turned to HDF (Hierarchical Data Format). Backed by a 20-year history, HDF5 offers a flexible format and powerful library with high-level interfaces that take advantage of the advanced capabilities of underlying filesystems, and supports both sequential and parallel I/O. HDF5 provides an efficient mechanism to store application data structures regardless of size and complexity, employing a binary format that is self-describing, extensible, and portable.
The full day tutorial on HDF5 will provide the participants with the background they need to use HDF5 effectively on High-Performance Linux Clusters. Through a combination of presentations, case studies, hands-on sessions, and informal discussions, HDF5 basic and advanced concepts critical to success in High-Performance Linux Cluster environments will be covered.
title Tuning for Performance: A Case Study Applied to OpenMPI
author(s) George Bosilca, University of Tennessee, USA
presenter George Bosilca
abstract The complexity of parallel machines has increased, both in processor features and in network interconnects. While some of these features can be discovered and accounted for automatically by the MPI implementation, providing the best overall performance can be a difficult task. Additionally, system administrators, as well as power users, may want to tune the behavior of their MPI library at runtime. A relatively new feature, heterogeneous networking -- using multiple networks to communicate between processes -- is becoming increasingly relevant, both in LAN environments, as organizations accumulate different types of networks, and in Grid / WAN environments. This tutorial will focus on Open MPI, an emerging MPI library designed around a pluggable framework, allowing fine-grained tuning of MPI behavior at runtime. We will present the Open MPI modular approach, how to configure it and how to get the best performance from your cluster. Focusing on multi-network support, as well as dynamically selected collective algorithms, this tutorial will benefit system administrators as well as everyday MPI developers and users.
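The runtime tuning described above is exposed through Open MPI's MCA (Modular Component Architecture) parameters, which can be placed in an mca-params.conf file or passed on the mpirun command line. As a hedged sketch only: the parameter names below existed in Open MPI of this era, but the values are purely illustrative, not recommendations for any particular cluster:

```
# Select which byte-transfer-layer (network) components to use,
# e.g. InfiniBand plus TCP plus loopback:
btl = openib,tcp,self

# Restrict TCP traffic to a specific interface:
btl_tcp_if_include = eth0

# Allow user-supplied rules for selecting collective algorithms:
coll_tuned_use_dynamic_rules = 1
```

Because each such setting maps onto a pluggable component, the library's behavior can be changed per job without recompiling either Open MPI or the application.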
title Cluster Carpentry with OCS 5
author(s) Mehdi Bozzo-Rey, Platform Computing Inc., Canada
presenter Mehdi Bozzo-Rey
abstract Every clustering project is unique in its hardware and software design, size, and complexity, and has to be seen as a complete solution. Reliability and integration with commercial software may be the key for some organizations, while others may be more interested in scalability, performance or a highly customized environment. Some projects may be too small to have a dedicated cluster administrator, whereas others will have resources for software development and advanced system tuning and customization. One challenge in building such a solution is the increasing number of hardware and software components. Of course, any clustering solution has to include a workload management component in order to maximize the use of the resources. The goal of this tutorial is to give an overview of the new Platform Open Cluster Stack 5 (OCS 5) framework. We will show how this new framework reduces complexity to a manageable level and increases both stability and performance, and how the new capabilities of OCS 5 can cover a wide variety of solutions, from the “out of the box” solution to the highly customized cluster integrated in a complex environment. We will also see how the Intel Cluster Ready program can bring complexity down to a manageable level, for both architects and users. From stateful to stateless nodes, workload management is a common component and will also be covered, with an overview of the latest version of LSF HPC and the new version of the open-source LAVA.
Duration: half day

Outline:

Introduction
- Cluster management overview
- Workload management overview

OCS 5: key concepts
- Installer Node / Primary Installer Node
- Node Groups
- Kit
- Component
- Snapshot

OCS 5: core system components and node provisioning
- Kits
- CFM
- Yum
- Services
- PXE bootstrap process

OCS 5: cluster administration
- Administrator commands (addhost, boothost, buildimage, netedit, ngedit …)
- Generic commands (kitops, repoman, repopatch, pdsh …)

Workload management with LSF HPC
- LSF framework
- PJL / blaunch framework
- MPI / application integrations

The GPL solution: Kusu and the LAVA kit
- Overview of the LAVA batch system

Conclusion
- OCS Roadmap
- Questions and Answers
title Service-Oriented Computing for HPC Data Centers and Grids
author(s) Jason Cope, University of Colorado at Boulder, USA; Henry Tufo, U. Colorado and NCAR, USA; Matthew Woitaszek, University of Colorado at Boulder, USA
presenter Jason Cope
abstract In this interactive tutorial, we will demonstrate the use of Web and Grid services to develop service-oriented architectures and workflows in Grid computing environments such as the TeraGrid. The material covered will include the basics of Grid and Web service development, the common Grid APIs and tools used by these services, security and authorization considerations for these environments, and supporting the dynamic nature of these systems through information services. This tutorial will focus on how to support this computing paradigm in Linux cluster and Grid computing environments, including the integration of local resource managers and file transfer tools using the Globus APIs. This tutorial is suitable for Linux cluster administrators, HPC software developers, and others interested in learning more about the use of service-oriented computing in these environments. A goal of this tutorial is to familiarize the HPC community with the capabilities of these services and systems so that it can form appropriate policies for managing them.


title Petascale Computing at ORNL
author(s) Buddy Bland, ORNL, USA
presenter Buddy Bland
abstract Oak Ridge National Laboratory (ORNL) will install two petaflops computers over the next year: one for the Department of Energy’s Leadership Computing Facility, and one in collaboration with the University of Tennessee’s National Institute for Computational Sciences. I will discuss the plans and architecture of each system, the computing environment that surrounds these systems, the challenges we face in these deployments, and the lessons learned through deploying a series of leadership systems.
title The Peta- to Exa-Scale Challenge of Modeling Core Collapse Supernovae
author(s) Tony Mezzacappa, ORNL/UTK, USA
presenter Tony Mezzacappa
abstract Available soon.
title Roadrunner: Science, Cell and a Petaflop/s
author(s) Andrew White, LANL, USA
presenter Andrew White
abstract Roadrunner is a hybrid Linux cluster. Each compute node consists of 2 dual-core AMD Opterons connected to 4 IBM PowerXCell 8i (enhanced Cell) chips. The full system will nominally have 18 interconnected sub-clusters, each with 180 compute nodes and 12 I/O nodes. Peak node performance is approximately 450 gigaflop/s; peak system performance is more than 1.4 petaflop/s. The system footprint is approximately 6800 ft^2 and 3.9 MW. Communication among the nodes is via Open MPI over a two-level InfiniBand switch fabric. Intranode communication is via IBM's DaCS library over PCIe. A number of application codes have already been implemented on Roadrunner, and we expect early science runs to include plasma physics, materials science, cosmology, turbulence and biology. The Roadrunner system is a partnership among IBM, LANL and NNSA.
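The peak figures quoted in the abstract are mutually consistent, which a quick back-of-the-envelope check confirms (Python, using only the numbers given above):

```python
# Figures from the abstract.
sub_clusters = 18
nodes_per_sub_cluster = 180
peak_per_node_gflops = 450  # approximate peak per compute node

# 18 sub-clusters x 180 compute nodes = 3,240 nodes.
total_nodes = sub_clusters * nodes_per_sub_cluster

# Convert gigaflop/s to petaflop/s (1 petaflop/s = 1e6 gigaflop/s).
peak_pflops = total_nodes * peak_per_node_gflops / 1e6

print(f"{total_nodes} nodes, ~{peak_pflops:.2f} petaflop/s peak")
```

The result, roughly 1.46 petaflop/s, matches the abstract's claim of "more than 1.4 petaflop/s" for the full system.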
title Software Development Tools for Petascale Systems
author(s) Shirley Moore, UTK, USA
presenter Shirley Moore
abstract Software developers will face a major challenge with the advent of petascale computing systems. The growing use of multicore processors, and the shift towards building systems with larger numbers of lower-speed cores, is resulting in an explosion of hardware parallelism that must be matched by increased concurrency in software. Software developers will need scalable and effective libraries and programming tools to cope with these challenges. This talk will discuss current research and development in scalable math libraries and in correctness and performance analysis tools for petascale systems, as well as further research needed to enable software to meet the petascale challenge.


title Intelligent Environment Scheduling with Moab - Green Computing and Multi-OS Environments
author(s) Douglas Wightman, Cluster Resources, USA
presenter Douglas Wightman
abstract This talk gives a general overview of using scheduling to adapt a compute environment to meet specific objectives. Follow-up discussion will cover the core technology, a sampling of the areas to which it can be applied, and specific details about Green Computing/Power Savings and Dynamic Multi-OS Hybrid Environments.