This is an archived page of the 2006 conference


2006 Abstracts: Plenary Presentations, Applications Track, Systems Track, and Vendor Presentations


Last updated: March 31, 2006


Plenary Presentations

Plenary I

Title

An Overview of High-Performance Computing and Challenges for the Future

Author(s)

Jack Dongarra

Author Inst

University of Tennessee, USA

Presenter

Jack Dongarra

Abstract

In this talk we will examine how high-performance computing has changed over the past ten years and look toward the future in terms of trends. A new generation of software libraries and algorithms is needed for the effective and reliable use of (wide-area) dynamic, distributed, and parallel environments. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques. However, the increased scale of computation, multicore processors, deeper memory hierarchies, wider ranges of latencies, and increased run-time environment variability will make these problems much harder.

Plenary II

Title

High-End Operating Systems: Status and Future

Author(s)

Frederick C. Johnson

Author Inst

DOE, Office of Science, USA

Presenter

Frederick C. Johnson

Abstract

With the exception of microkernels used on the compute nodes of some systems, all operating system (OS) software for high-end systems is based on Unix or Linux. These OSes share some common attributes: the basic design is old (~35 years for Unix), they are littered with system processes unneeded for high-end computing, and neither scalability nor performance was designed in from the start. Microkernels, in conjunction with user-space messaging, offer performance and scalability but do not support a full suite of user services. The success of Unix and Linux has had a detrimental effect on OS research for high-end systems, and very little research has been undertaken for several years. This talk will provide an overview of current high-end OS approaches, discuss some of the barriers that must be overcome for terascale and petascale OSes, and describe some new research activities that have been initiated as part of the Office of Science Next Generation Architecture activity.

Plenary III

Title

Cluster Computing in Everyday Biomedical Research: Past, Present, and Future

Author(s)

Klaus Schulten

Author Inst

University of Illinois at Urbana-Champaign, USA

Presenter

Klaus Schulten

Abstract

This lecture will review 12 years of successful cluster computing in biomedical research, which has permitted large-scale simulations of biomolecular machinery in living cells. We started in 1993 with a few high-performance workstations connected through an optical switch, which permitted 30,000-atom simulations, and in 1998 integrated commodity desktops and switches at a 32-processor scale, which permitted 100,000-atom simulations. In 2003, using rack-server clusters integrating 50 (in the work group) to thousands (at the National Centers, permitting multi-million-atom simulations) of processors, biomedical and computer scientists adapted their software (NAMD, honored with a Gordon Bell award) and its underlying algorithms continuously to the available technical solutions. In 2005, the team of researchers and developers prepared NAMD for automated grid computing, making the grid available from the desktop as a unified resource. Today, the team is preparing NAMD for use on the next generation of clusters with tens of thousands of processors and specialized computational hardware such as FPGAs and GPUs.

This lecture will illustrate, in particular, how technical development advanced biomedical research in quantum leaps and what the advances actually mean from a biomedical perspective.

Applications Track Abstracts:

Applications Papers I: Performance


Title


A Comparison of Single-Core and Dual-Core Opteron Processor Performance for HPC

Author(s)

Douglas M. Pase and Matthew A. Eckl

Author Inst.

IBM

Presenter

Douglas M. Pase

Abstract

Dual-core AMD Opteron™ processors represent the latest significant development in microprocessor technology. In this paper, using the IBM® eServer™ 326, we examine the performance of dual-core Opteron processors. We measure the core performance under ideal workloads using the Linpack HPL benchmark and show it to be 60% faster than the fastest single-core Opteron. We measure unloaded memory latency and show that the dual-core processor's slower clock frequency makes the latency longer. However, we also show that its memory throughput is 10% greater than that of single-core processors. Finally, we show that dual-core Opteron processors offer a significant performance advantage even on realistic applications, such as those represented by the SPEC® CPU2000 benchmark suite.
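
For readers unfamiliar with this style of measurement, the following is a minimal, hedged sketch (in C) of a STREAM-style triad loop for estimating sustained memory throughput of the kind compared above; the array size, timing approach, and reported figure are illustrative assumptions, not the benchmark code used in the paper.

/* Minimal STREAM-style triad sketch for estimating sustained memory
 * bandwidth (illustrative only; not the paper's benchmark code). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 23)      /* 8M doubles per array, well beyond any cache */
#define SCALAR 3.0

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)          /* triad: a = b + s*c */
        a[i] = b[i] + SCALAR * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);   /* two reads and one write per element */
    printf("triad bandwidth: %.2f GB/s (a[1] = %.1f)\n", bytes / sec / 1e9, a[1]);

    free(a); free(b); free(c);
    return 0;
}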



Title

Performance Analysis of AERMOD on Commodity Platforms

Author(s)

George Delic

Author Inst.

HiPERiSM Consulting

Presenter

George Delic

Abstract

This report examines the performance of the AERMOD Air Quality Model, in two versions and with two data sets, using three compilers on Intel and AMD processors, with the PAPI performance event library used to collect hardware performance counter values where possible. The intent is to identify performance metrics that indicate where performance-inhibiting factors occur when the codes execute on commodity hardware. Results for operations, instructions, cycles, cache and translation look-aside buffer (TLB) misses, and branching instructions are discussed in detail. An execution profile and source code analysis uncover the causes of performance-inhibiting behavior and how they lead to bottlenecks on commodity hardware. Based on this analysis, source code modifications are proposed with a view to potential performance enhancement.
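
As an illustration of the kind of counter collection the abstract refers to, here is a minimal, hedged sketch in C using PAPI's low-level API around a trivial kernel; the chosen preset events (PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L2_DCM) are assumptions about what the platform exposes, not the exact event set used in the study.

/* Minimal sketch of collecting hardware counters with PAPI around a
 * code region. Event availability varies by processor; the events
 * below are illustrative, not the paper's exact set. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

static void compute_kernel(double *x, long n)
{
    for (long i = 0; i < n; i++)
        x[i] = x[i] * 1.000001 + 0.5;
}

int main(void)
{
    long n = 1L << 22;
    double *x = calloc(n, sizeof(double));
    if (!x) return 1;

    int events[3] = { PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L2_DCM };
    long long counts[3];
    int eventset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&eventset) != PAPI_OK) exit(1);
    if (PAPI_add_events(eventset, events, 3) != PAPI_OK) exit(1);

    PAPI_start(eventset);
    compute_kernel(x, n);
    PAPI_stop(eventset, counts);

    printf("instructions: %lld  cycles: %lld  L2 data misses: %lld\n",
           counts[0], counts[1], counts[2]);
    printf("IPC: %.2f\n", (double)counts[0] / (double)counts[1]);

    free(x);
    return 0;
}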



Title

Performance Lessons from the Cray XT3

Author(s)

Jeff Larkin

Author Inst.

Cray Inc.

Presenter

Jeff Larkin

Abstract

Whether designing a cluster or a massively parallel processor machine, many of the most important design decisions must be made early in the process and are difficult to change. In this talk I will discuss some of the design choices Cray made when designing the XT3, Cray's third generation of massively parallel processor machines. I will specifically discuss the choices we made for the processor, network, and operating system, and what we have learned from these choices. Finally, I will show real-world performance results from the Cray XT3.


Applications Papers II:


Title

Experiences in Optimizing a Numerical Weather Prediction Model: An Exercise in Futility?

Author(s)

Dan Weber and Henry Neeman

Author Inst

University of Oklahoma

Presenter

Dan Weber

Abstract

This paper describes the basic optimization methods, including cache memory optimization and message hiding, applied to a grid-point CFD numerical weather prediction model to reduce the wall time for time-critical weather forecasting and compute-intensive research. The problem is put into perspective via a brief history of the type of computing hardware used for this application during the past 15 years and the trend in computing hardware that has brought us to the current state of code efficiency. A detailed performance analysis of the code identifies the parts of the numerical solver most likely to benefit from optimization and the most promising methods for improving single- and parallel-processor performance. Results from each performance-enhancing technique are presented and help define the envelope of potential performance improvement using standard commodity-based cluster technology.
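
To make the "message hiding" idea concrete, the sketch below shows the general pattern in C with MPI: post nonblocking halo exchanges, update interior grid points while messages are in flight, then finish the boundary. The array layout, neighbor ranks, and stencil are illustrative assumptions and do not come from the model described in the paper.

/* Sketch of the general "message hiding" pattern: overlap halo exchange
 * with interior computation using nonblocking MPI. Illustrative only. */
#include <mpi.h>

/* u and unew are (nx+2) x (ny+2) arrays stored column-major in 1-D;
 * columns 0 and nx+1 are ghost cells. left/right are neighbor ranks
 * (MPI_PROC_NULL at the domain edges). */
void exchange_and_update(double *u, double *unew, int nx, int ny,
                         int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    /* post receives into the ghost columns and sends of the edge columns */
    MPI_Irecv(&u[0 * (ny + 2)],        ny + 2, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[(nx + 1) * (ny + 2)], ny + 2, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[1 * (ny + 2)],        ny + 2, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[nx * (ny + 2)],       ny + 2, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* interior update proceeds while halo messages are in flight */
    for (int i = 2; i <= nx - 1; i++)
        for (int j = 1; j <= ny; j++)
            unew[i * (ny + 2) + j] = 0.25 * (u[(i - 1) * (ny + 2) + j] +
                                             u[(i + 1) * (ny + 2) + j] +
                                             u[i * (ny + 2) + j - 1] +
                                             u[i * (ny + 2) + j + 1]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* boundary columns need the freshly received halo data */
    for (int j = 1; j <= ny; j++) {
        unew[1 * (ny + 2) + j]  = 0.25 * (u[0 * (ny + 2) + j] + u[2 * (ny + 2) + j] +
                                          u[1 * (ny + 2) + j - 1] + u[1 * (ny + 2) + j + 1]);
        unew[nx * (ny + 2) + j] = 0.25 * (u[(nx - 1) * (ny + 2) + j] + u[(nx + 1) * (ny + 2) + j] +
                                          u[nx * (ny + 2) + j - 1] + u[nx * (ny + 2) + j + 1]);
    }
}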



Title

Benchmark Analysis of 64-bit Servers for Linux Clusters for Application in Molecular Modeling and Atomistic Simulations

Author

Stefano Cozzini1, Roger Rousseau2, Axel Kohlmeyer3

Author Inst

1CNR-INFM Democritos National Simulation Center, 2SISSA, 3University of Pennsylvania

Presenter

TBD

Abstract

We present a detailed comparison of the performance of synthetic test programs, such as DGEMM and STREAM, as well as of typical atomistic simulation codes (DIPROTEIN and CPMD), which are extensively used in computational physics, chemistry, and biology, on a large number of high-performance computing platforms. With an eye toward maximizing the price/performance ratio for applications in atomistic simulations, we examine a wide class of commonly used 64-bit machines and discuss our results in terms of various aspects of the machine architecture, such as CPU speed, SMP memory performance, and network. We find that although the Intel EM64T machines show superior performance for applications that extensively exploit the MKL library, the Opteron-based machines show superior performance with less optimized codes. Moreover, for large-memory applications such as electronic structure codes, the SMP performance of the Opteron is superior. An overview of which architecture is suitable for which applications and a comparison of AMD dual-core CPU technology to Intel Hyper-Threading are also discussed.
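
For context, the following is a minimal, hedged sketch in C of a DGEMM-style synthetic test of the kind mentioned above, timing C = alpha*A*B + beta*C through whichever CBLAS implementation is linked (MKL, ACML, or a reference BLAS); the matrix size, header name, and timing are illustrative assumptions, not the authors' test harness.

/* DGEMM-style synthetic test sketch; header may be <mkl_cblas.h> or
 * similar depending on the BLAS used. Sizes are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    const int n = 2000;                      /* square matrices, n x n */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));
    if (!A || !B || !C) return 1;

    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 0.5; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 2.0 * n * (double)n * n;  /* ~2*n^3 for matrix multiply */
    printf("DGEMM %d x %d: %.2f GFLOP/s (check C[0] = %.1f)\n",
           n, n, flops / sec / 1e9, C[0]);

    free(A); free(B); free(C);
    return 0;
}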



Applications Papers III: Tools


Title

Intel Cluster Tools

Author(s)

Ullrich Becker-Lemgau

Author Inst

Intel

Presenter

Ullrich Becker-Lemgau

Abstract

Parallel systems like clusters and shared-memory systems require optimized parallel and scalable applications to benefit from the computing power these systems provide. But the standard for distributed parallel programming, the Message Passing Interface (MPI), does not by itself provide any features or paradigms for reaching high parallel efficiency and scalability at large process counts. Tools are required to help software developers understand the communication pattern of their applications and to indicate where optimization is needed.

The Intel® Trace Analyzer & Collector 6.0 is a tool for obtaining detailed information on the parallelism of distributed applications. The application is linked with the Intel® Trace Collector library, which writes a trace file during execution. All MPI calls and events are recorded in this trace file. Information about user code can also be recorded if the code is instrumented with the Intel® Trace Collector API. After the application finishes, the Intel® Trace Analyzer is used to visualize the information contained in the trace file. The Intel® Trace Analyzer provides several charts, such as event timelines that show in detail what each process is executing at a given time and the messages exchanged between processes, with direct access to the source code of the originating MPI calls. Additional charts present statistical information in timeline or matrix form. All information can be aggregated, tagged, or filtered in the time and process dimensions to enable effective analysis of parallel applications with thousands of processes. The new release 6.0 of the Intel® Trace Analyzer & Collector features a redesigned and faster GUI that scales better over time and processes, support for larger trace files, and a choice of execution on Linux or Windows® XP.

The Intel Trace Analyzer & Collector tool set is included in the Intel Cluster Toolkit, which combines a full set of cluster tools provided by Intel. Besides performance tools, programming libraries such as the Intel MPI Library are part of the Intel Cluster Toolkit. The Intel MPI Library provides many MPI-2 functions and multi-fabric support: for an application built with the Intel MPI Library, the user can choose the network fabric when launching the application. The Intel Cluster Toolkit is a one-stop solution for developing and optimizing Linux cluster applications and systems.



Title

A Test Harness TH for Evaluating Code Changes in Scientific Software

Author(s)

Brian T. Smith

Author Inst.

Numerica 21 Inc.

Presenter

Brian T. Smith

Abstract

Not available.



Title

An Integrated Performance Tools Environment

Author(s)

Luiz DeRose

Author Inst.

Cray Inc.

Presenter

Luiz DeRose

Abstract

Not available.



Applications Papers IV: Applications, Visualization


Title

Evaluation of RDMA over Ethernet Technology for Building Cost-Effective Linux Clusters

Author(s)

Michael Oberg1, Henry M. Tufo2, Theron Voran1, Matthew Woitaszek1

Author Inst

1University of Colorado, Boulder; 2U. of Colorado, Boulder/National Center for Atmospheric Research

Presenter

Matthew Woitaszek

Abstract

Remote Direct Memory Access (RDMA) is an effective technology for reducing system load and improving performance. Recently, Ethernet offerings that exploit RDMA technology have become available and can potentially provide a high-performance fabric for MPI communications at lower cost than other competing technologies. The goal of this paper is to evaluate RDMA over gigabit Ethernet (ROE) as a potential Linux cluster interconnect. We present an overview of current RDMA technology from Ammasso, describe our performance measurements and experiences, and discuss the viability of using ROE in HPC applications. In a series of point-to-point tests, we find that the RDMA interface provides higher throughput and lower latency than legacy gigabit Ethernet. In addition, even when functioning in non-RDMA mode, the ROE cards demonstrate better performance than the motherboard network interfaces. For application benchmarks, including LINPACK and a climate model, the Ammasso cards provide a speedup over standard gigabit Ethernet even in small node configurations.
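
As an illustration of the kind of point-to-point test described above, here is a minimal, hedged MPI ping-pong sketch in C that estimates latency and throughput between two ranks; the message size and repetition count are arbitrary assumptions, not the authors' measurement harness.

/* Minimal MPI ping-pong sketch for point-to-point latency/throughput
 * estimates (illustrative only). Run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    const int reps   = 1000;
    const int nbytes = 1 << 20;               /* 1 MiB payload */
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / reps;         /* round-trip time per repetition */
        printf("half RTT: %.1f us, throughput: %.2f MB/s\n",
               rtt / 2.0 * 1e6, 2.0 * nbytes / rtt / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}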



Title

Performance of Voltaire InfiniBand in IBM 64-Bit Commodity HPC Clusters

Author(s)

Douglas M. Pase

Author Inst

IBM

Presenter

Douglas M. Pase

Abstract

Not available.



Systems Track Abstracts:

Systems Papers I: Filesystems and Cluster Management


Title

Building of a GNU/Linux-Based Bootable Cluster CD

Author

Paul Gray1, Jeff Chapin1, Tobias McNulty2

Author Inst

1University of Northern Iowa and 2Earlham College

Presenter

TBD

Abstract

The Bootable Cluster CD (BCCD) is an established, well-maintained cluster toolkit used nationally and internationally at several levels of the academic system. During the Education Programs of the Supercomputing conferences in 2002, 2003, and 2004, the BCCD image was used to support instruction on issues related to parallel computing education. It has been used in the undergraduate curriculum to illustrate principles of parallelism and distributed computing and has been widely used to facilitate graduate research in parallel environments. The standard BCCD image is packaged in the 3-inch mini-CD format, easily fitting inside most wallets and purses. Variations include PXE-bootable (network-bootable) and USB-stick-bootable images. All software components are pre-configured to work together, making the time required to go from boot-up to a functional cluster less than five minutes. A typical Windows or Macintosh lab can be temporarily converted into a working GNU/Linux-based computational cluster without modification to the original disk or operating system. Students can immediately use this computational cluster framework to run a variety of real scientific models conveniently located on the BCCD and downloadable into any running BCCD environment. This paper discusses aspects of building, configuring, modifying, and deploying the Bootable Cluster CD.



Title

Improving Cluster Management with Scalable Filesystems

Author

Adam Boggs1, Jason Cope1, Sean McCreary2, Michael Oberg2, Henry M. Tufo2, Theron Voran2, Matthew Woitaszek2

Author Inst

1University of Colorado, Boulder; 2U. of Colorado, Boulder/National Center for Atmospheric Research

Presenter

TBD

Abstract

Reducing the complexity of the hardware and software components of Linux cluster systems can significantly improve management infrastructure scalability. Moving parts, in particular hard drives, generate excess heat and have the highest failure rates among cluster node components. The use of diskless nodes simplifies deployment and management, improves overall system reliability, and reduces operational costs. Previous diskless node implementations have relied on a central server exporting node images using a high-level protocol such as NFS or have employed virtual disks and a block protocol such as iSCSI to remotely store the root filesystem. We present a mechanism to provide the root filesystems of diskless computation nodes using the Lustre high-performance cluster file system. In addition to eliminating the downtime caused by disk failures, this architecture allows for highly scalable I/O performance that can be free from the single point of failure of a central fileserver. We evaluate our management architecture using a small cluster of diskless computation nodes and extrapolate from our results the ability to provide the manageability, scalability, performance and reliability required by current and future cluster designs.



Title

The Hydra Filesystem: A Distributed Storage Framework

Author

Benjamin Gonzalez and George K. Thiruvathukal

Author Inst

Loyola University, Chicago

Presenter

TBD

Abstract

Hydra File System (HFS) is an experimental framework for constructing parallel and distributed filesystems. While parallel and distributed applications requiring scalable and flexible storage and retrieval are becoming more commonplace, parallel and distributed filesystems remain difficult to deploy and to configure for different needs. HFS aims to be different by being true to the tradition of high-performance computing while employing modern design patterns to allow various policies to be configured on a per-instance basis (e.g., storage, communication, security, and indexing schemes). We describe a working prototype (available for public download) that has been implemented in the Python programming language.



Systems Papers II: Cluster Efficiencies


Title

ClearSpeed Accelerators in Linux Clusters

Author

John L. Gustafson

Author Inst

ClearSpeed Technology

Presenter

John L. Gustafson

Abstract

While the use of commodity processors, commodity interconnects, and the Linux operating system has permitted great advances in the cost-effectiveness of HPC systems, it has exposed a new limitation: power and space requirements. It is common for a high-density Linux cluster to require more watts per cabinet than earlier proprietary HPC designs. The cost of the electrical power over the life of the system can exceed the cost of the system. The ClearSpeed coprocessor accelerates kernels specific to 64-bit scientific computing such as matrix multiplication and Fourier transforms. For technical applications, it raises both the performance and the performance-per-watt of those operations relative to more general-purpose processor designs. Commodity processors like those used in Linux clusters cannot incorporate HPC-specific features without reducing competitiveness in the broader application space, so the ClearSpeed accelerator option restores an HPC feature emphasis to clusters that require it.

We present the architecture of the ClearSpeed CSX600 chip and its use in the Advance™ plug-in board, issues of algorithm-architecture fit, approaches for ease-of-use, and applications either available now or under development. We also show the impact on performance and facilities requirements of equipping typical Linux clusters with ClearSpeed accelerators.



Title

Maestro-VC: On-Demand Secure Cluster Computing Using Virtualization

Author

Nadir Kiyanclar, Gregory A. Koenig, William Yurcik

Author Inst

National Center for Supercomputing Applications/UIUC

Presenter

TBD

Abstract

On-demand computing is the name given to technology that enables an infrastructure in which computing cycles are treated as a commodity and can be accessed upon request. In this way the goals of on-demand computing overlap with and are similar to those of Grid computing: both enable the pooling of global computing resources to solve complex computational problems.

Recently, virtualization has emerged as a viable mechanism for improving the utilization of commodity computing hardware. This field has seen much research into potential applications in distributed and Grid computing. In this paper, we present an architecture and prototype implementation for Maestro-VC, a system that takes advantage of virtualization to provide a sandboxed environment in which administrators of cluster hardware can execute untrusted user code. User code can run unmodified, or can optionally take advantage of the special features of our system to improve performance and adapt to changes in the environment.



Title

Architectural Tradeoffs for Unifying Campus Grid Resources

Author

Bart Taylor1 and Amy Apon2

Author Inst

1Acxiom Corporation and 2University of Arkansas

Presenter

TBD

Abstract

Most universities have a powerful collection of computing resources on campus for use in areas from high-performance computing to general-access student labs. However, these resources are rarely used to their full potential. Grid computing offers a way to unify these resources and to better utilize the capability they provide. The complexity of some grid tools makes learning to use them a daunting task for users not familiar with using the command line. Combining these tools into a single web-portal interface provides campus faculty and students with an easy way to access the campus resources. This paper presents some of the grid and portal tools that are currently available and the tradeoffs in their selection and use. The successful implementation of a subset of these tools at the University of Arkansas and the functionality they provide are discussed in detail.



Systems Papers III:


Title

LEA: A Cluster Intensive Simulation Software for Unit Commitment

Author

Riadh Zorgati1, Wim Van Ackooij1, Jean-Marc Luel1, Pierre Thomas1, Michael Uchanski2, Kevin Shea2

Author Inst

1EDF, 2The MathWorks, Inc.

Presenter

TBD

Abstract

The Unit Commitment Problem (UCP) consists of defining the minimal-cost generation schedule for a given set of power units. Due to complex technical constraints, the UCP is a challenging large-scale, non-convex, non-linear optimization problem. The deterministic UCP has been solved satisfactorily in an industrial setting; solving the basic deterministic UCP takes about fifteen minutes. At a bi-weekly horizon, uncertainty and hazards cannot be neglected, but solving the UCP as a stochastic problem at such a short-term horizon is a challenging task.

In this paper, we report on the LEA experience, software for cluster-based intensive simulation in unit commitment that implements a multi-scenario technique, which takes uncertainty and hazards into account in a simplified way. This technique naturally requires substantial computing resources but becomes industrially tractable when a cluster is used. Judging from the first results obtained, the underlying concept of combining intensive simulation with carefully chosen uncertainty scenarios appears to be efficient.



Title

Lessons for the Cluster Community from an Experiment in Model Coupling with Python

Author

Michael Tobis1, Mike Steder1, Ray Pierrehumbert1, Robert Jacob2

Author Inst.

1University of Chicago, 2Argonne National Laboratory

Presenter

TBD

Abstract

Available soon.



 

Title

An Equation-by-Equation Method for Large Problems in a Distributed Computing Environment

Author

Ganesh Thiagarajan and Anoop G. Varghese

Author Inst.

University of Missouri, Kansas City

Presenter

TBD

Abstract

For finite element problems involving millions of unknowns, iterative methods set up on a parallel computing platform are more efficient than direct solver techniques. Considerations that are important in the solution of such problems include the computation time, the memory required, and the type of platform used to solve the problem. Traditionally, shared-memory machines were popular; however, distributed-memory machines are now gaining wider acceptance due to their relatively low cost and ease of setup. The conventional approach to setting up an iterative finite element solver is the Element-by-Element (EBE) method with a preconditioned conjugate gradient (PCG) solver. The EBE method is reported by Horie and Kuramae (1997, Microcomputers in Civil Engineering, 2, 12) to be suitable for shared-memory parallel architectures, but it has certain conflicts on distributed-memory machines. This paper proposes a new algorithm for parallelizing the finite element solution process in a distributed-memory environment. The new method, called the Equation-by-Equation (EQBYEQ) method, is based on generating and storing the stiffness matrix on an equation-by-equation basis, in contrast to the element-by-element basis. The paper discusses the algorithm and implementation details, and the advantages of the EQBYEQ scheme over the EBE scheme in a distributed environment are discussed.
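
To make the PCG building block referenced above concrete, the following is a minimal, hedged serial sketch in C of a Jacobi-preconditioned conjugate gradient iteration on a tiny symmetric positive definite system; the EBE and EQBYEQ schemes differ in how the stiffness matrix and the matrix-vector product are generated, stored, and distributed across processes, which this serial sketch deliberately does not show.

/* Serial sketch of preconditioned conjugate gradient (PCG) with a
 * Jacobi (diagonal) preconditioner on a tiny SPD test system. */
#include <math.h>
#include <stdio.h>

#define N 4   /* tiny illustrative system */

static double dot(const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i] * y[i];
    return s;
}

static void matvec(const double A[N][N], const double *x, double *y) {
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
    }
}

int main(void)
{
    /* symmetric positive definite test matrix and right-hand side */
    double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
    double b[N] = {1, 2, 3, 4}, x[N] = {0}, r[N], z[N], p[N], Ap[N];

    matvec(A, x, r);
    for (int i = 0; i < N; i++) r[i] = b[i] - r[i];      /* r = b - A*x */
    for (int i = 0; i < N; i++) z[i] = r[i] / A[i][i];   /* z = M^-1 r (Jacobi) */
    for (int i = 0; i < N; i++) p[i] = z[i];
    double rz = dot(r, z);

    for (int it = 0; it < 100 && sqrt(dot(r, r)) > 1e-10; it++) {
        matvec(A, p, Ap);
        double alpha = rz / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (int i = 0; i < N; i++) z[i] = r[i] / A[i][i];
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        for (int i = 0; i < N; i++) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %f\n", i, x[i]);
    return 0;
}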



 

Systems Papers IV: High Availability

 

Title

On the Survivability of Standard MPI Applications

Author

Anand Tikotekar1, Chokchai Leangsuksun1, Stephen L. Scott2

Author Inst.

1Louisiana Tech University, 2Oak Ridge National Laboratory

Presenter

TBD

Abstract

Job loss due to failure represents a common vulnerability in high-performance computing (HPC), especially in the Message Passing Interface (MPI) environment. Rollback-recovery has been used to mitigate faults for long-running applications. However, to date, rollback-recovery mechanisms such as checkpointing alone may not be sufficient to ensure fault tolerance for MPI applications, due to a static view of the cooperating MPI machines and the lack of resilience to endure outages. In fact, MPI applications are prone to cascading failures, where the failure of one participating node causes total job failure. In this paper we address fault issues in the MPI environment by improving runtime availability with self-healing and self-cloning mechanisms that tolerate outages of cluster computing systems. We develop a framework that augments a standard HPC cluster with job-level fault tolerance that preserves the job queue and enables nonstop execution of a parallel MPI job submitted through a resource manager, even after encountering a failure.



 

Title

Cluster Survivability with ByzwATCh: A Byzantine Hardware Fault Detector for Parallel Machines with Charm++

Author

D. Mogilevsky, G. Koenig, W. Yurcik

Author Inst.

National Center for Supercomputing Applications/UIUC

Presenter

TBD

Abstract

As clusters grow larger, the sheer number of participating components has created an environment in which failures affecting long-running jobs can be expected to increase in frequency for the foreseeable future. The cluster research community is aware of this problem and has proposed many error-recovery protocols that offer support for fault-tolerant computing; however, each of these recovery protocols depends on the underlying detection of faults. In this paper we present a prototype system to detect the most difficult faults, specifically Byzantine faults. We describe the operation of ByzwATCh, a system for the run-time detection of Byzantine faults as part of the Charm++ parallel programming framework. Results show that ByzwATCh is both accurate in detection and lightweight enough for high-performance computing environments. While we demonstrate this work in a Linux cluster environment, it is extensible to other environments requiring high reliability (e.g., carrier-class server environments), with error-recovery protocols using this fault detection system as their foundation.



 

Title

RASS Framework for a Cluster-Aware SELinux

Author

Arpan Darivemula1, Anand Tikotekar1, Chokchai Leangsuksun1, Makan Pourzandi2

Author Inst.

1Louisiana Tech University, 2Ericsson Research Canada

Presenter

TBD

Abstract

The growing deployment of clusters to solve critical and computationally intensive problems implies that survivability is a key requirement: systems must possess Reliability, Availability, Serviceability, and Security (RASS) together. In this paper, we conduct a feasibility study of SELinux and our existing cluster-aware RASS framework. We start by establishing a semantic mapping from a cluster-wide security policy to individual nodes' Mandatory Access Control (MAC). Through our existing RASS framework, we then construct an experimental cluster-aware SELinux system. Finally, we demonstrate the feasibility of mapping a distributed security policy (DSP) to SELinux equivalents and the cohesiveness of cluster-wide enforcement, which, we believe, leads to a layered technique and thus a highly survivable system.



 

Vendor Presentations

Title

HPC Into The Mainstream

Author(s)

Stephen Wheat

Author Inst

Intel

Presenter

Stephen Wheat

Abstract

Processor performance continues its historical climb, bringing an ever-increasing performance-per-cost ratio to computing platforms. Technology and capabilities that were once exclusively in the hands of the high-performance computing community will now become available to a broader set of the world's population. The computing possibilities this will release are broad, from bringing HPC to previously computing-challenged activities to enabling each college student to have their own HPC system. Technologies supporting this HPC proliferation will be discussed, as will a view into some of the future usage models.

   

 

Title

Real Application Scaling

Author(s)

Greg Lindahl

Author Inst

PathScale

Presenter

Greg Lindahl

Abstract

Available soon.



Title

Myri-10G: Overview and a Report on Early Deployments

Author(s)

Tom Leinberger

Author Inst

Myricom

Presenter

Tom Leinberger

Abstract

Not available.



Title

PGI Compilers and Tools for Scientists and Engineers

Author(s)

Douglas Miles

Author Inst

PGI

Presenter

Douglas Miles

Abstract

The differences between AMD64 and EM64T, and what they mean for compiler users.