This is an archived page of the 2002 conference

abstracts

Plenary Presentations, Cluster Busters, Programming Models, Applications and Tools Track, Systems Administration Track, and Vendor Presentations Abstracts

Last updated: 11 October 2002


Plenary Presentations


Plenary Session I
Title   Seismic & Linux: From Curiosity to Commodity
Author(s)   Dr. Mike Turff
Author Inst   PGS Data Processing
Presenter   Dr. Mike Turff
Abstract
The seismic processing business has always required massive amounts of compute capacity and has historically been an early adopter of anything providing price/performance improvements in this field. This has been exacerbated by the competitive nature of this business, the downward pressure on prices from the oil majors and the tender & bid process that decides the allocation of the available work. In this talk, I review the history of computer systems in PGS, the move to Linux clusters and the new applications that this price/performance step has enabled our industry to bring to the seismic processing market. I will talk from a business perspective about the acceptance of Linux clusters and show that, for the right applications, Linux clusters can be regarded as off-the-shelf commodities.
 
 
Plenary Session II
Title   Scaling Up and Scaling Down
Author(s)   Dr. Daniel A. Reed
Author Inst   National Center for Supercomputing Applications (NCSA)
Presenter   Dr. Daniel A. Reed
Abstract
The continuum of computing continues to expand, ranging from mobile, embedded (or soon implantable) devices to terascale and petascale clusters. New applications of distributed sensors (e.g., environmental or structural monitoring, health and safety) pose new software challenges. How do we manage millions or even billions of mobile devices, each with power, weight, processing, and memory constraints? At the other end of the continuum, large-scale science and engineering applications demand ever faster and larger clustered systems. How do we design, package, and support systems with hundreds of thousands of processors in a reliable way? This talk sketches some of the technical challenges and opportunities in realizing the ubiquitous infosphere of interconnected systems, highlighted by recent developments at NCSA and the U.S. TeraGrid.
 
 
Plenary Session III
Title   Challenges of Terascale Computing
Author(s)   Dr. William Pulleyblank
Author Inst   IBM Research
Presenter   Dr. William Pulleyblank
Abstract
We are rapidly moving into the domain of terascale computing, which includes the ASCI program and the Blue Gene project. In addition to the significant hardware problems to be solved, we will also face software issues that are at least as large. I will discuss these in the context of BlueGene, a 180 teraflop/s supercomputer being built at IBM Research that will run Linux as its operating system. Enabling Linux for this platform has necessitated a number of innovations in order to achieve the targeted levels of performance. We also discuss issues of reliability and availability, the so-called autonomic issues. We relate this to current activity on grid computing.
 
 

Cluster Busters


Title Scalability and Performance of Salinas on the Computational Plant
Author(s) Manoj Bhardwaj,  Ron Brightwell, Garth Reese
Author Inst University of New Mexico, USA
Presenter Ron Brightwell
Abstract Parallel computing platforms composed of commodity personal computers (PCs) interconnected by gigabit network technology are a viable alternative to traditional proprietary supercomputing platforms. Small- and medium-sized clusters are now ubiquitous, and larger-scale procurements, such as those made recently by NSF for the Distributed Terascale Facility and by Pacific Northwest National Lab, are becoming more prevalent. The cost effectiveness of these platforms has allowed for larger numbers of processors to be purchased.

In spite of the continued increase in the number of processors, few real-world application results on large-scale clusters have been published. Traditional large parallel computing platforms have benefited from many years of research, development, and experience dedicated to improving their scalability and performance. PC clusters have only recently started to receive this kind of attention. In order for clusters of PCs to compete with traditional proprietary large-scale platforms, this type of knowledge and experience may be crucial as larger clusters are procured and constructed.

The goal of the Computational Plant (Cplant) project at Sandia National Laboratories is to provide a large-scale, massively parallel computing resource composed of commodity-based PCs that not only meets the level of compute performance required by Sandia's key applications, but also meets the levels of usability and reliability of past traditional large-scale parallel machines. Cplant is a continuation of research into system software for massively parallel computing on distributed-memory message-passing machines. We have transitioned our scalable system software architecture developed for large-scale massively parallel processing machines to commodity clusters of PCs.

In this paper, we examine the performance and scalability of one of Sandia's key applications on Cplant. We present performance and scalability results from Salinas, a general-purpose, finite element structural dynamics code designed to be scalable on massively parallel processing machines. Salinas offers static analysis, direct implicit transient analysis, eigenvalue analysis for computing modal response, and modal superposition-based frequency response and transient response. In addition, semi-analytical derivatives of many response quantities with respect to user-selected design parameters are calculated. It also includes an extensive library of standard one-, two-, and three-dimensional elements, nodal and element loading, and multi-point constraints. Salinas solves systems of equations using an iterative, multilevel solver, which is specifically designed to exploit massively parallel machines. Salinas uses a linear solver that was selected based on the criteria of robustness, accuracy, scalability and efficiency.

Salinas employs a multilevel domain decomposition method, Finite Element Tearing and Interconnect (FETI), which has been characterized as the most successful parallel solver for the linear systems applicable to structural mechanics. FETI is a mature solver, with some versions used in commercial finite element packages. For plates and shells, the singularity in the linear systems has been traced to the subdomain corners. To solve such linear systems, an additional coarse problem is automatically introduced that removes the corner point singularity. FETI is scalable in the sense that as the number of unknowns increases and the number of unknowns per processor remains constant, the time to solution does not increase. Further, FETI is accurate in the sense that the convergence rate does not deteriorate as iterates converge. Finally the computational bottleneck in FETI, a sparse direct subdomain solve, is amenable to high performance solution methods. An eigensolver was selected for Salinas based on robustness, accuracy, scalability and efficiency.

In this paper, we will describe in detail the hardware architecture of our 1792-node Cplant cluster. This machine is composed of single-processor Compaq Alpha workstations connected by a Myrinet gigabit network in a three-dimensional mesh topology. This machine has many unique characteristics, such as the ability to switch a large portion of the machine from any one of four network "heads" to another, allowing for greater flexibility in responding to user demand for classified versus unclassified computing.

In addition to hardware, we describe the system software environment of our cluster. Linux is used as the base operating system for all nodes in the cluster, and we have designed and developed software for the parallel runtime system, high-performance message passing, and parallel I/O. We will discuss the details of this software environment that influence the scalability and performance of applications.

Finally, we describe the performance and scalability of Salinas on up to 1000 nodes of Cplant. We will also discuss some of the challenges related to scaling Salinas beyond several hundred processors and how these challenges were addressed and overcome. We will show that the performance and scalability of Salinas is comparable to proprietary large-scale parallel computing platforms.

 
Title Real Time Response to Streaming Data on Linux Clusters
Author(s) Beth Plale, George Turner, and Akshay Sharma
Author Inst. Indiana University
Presenter George Turner
Abstract Beowulf-style cluster computers, which have traditionally served batch jobs via resource managers such as the Portable Batch System (PBS), are now under pressure from the scientific community to support real time response to data stream applications, that is, timely response to events streamed from remote sensors or from humans interactively exploring a parameter space. Cluster management must be agile and timely in its response to new resource demands imposed by the occurrence of real-world events. The work described in this paper quantitatively evaluates two approaches to integrating support for streaming applications into the framework of a PBS-managed cluster. One approach works cooperatively with PBS to reallocate computational resources to meet the needs of real time data analysis. The second approach works outside of the PBS context, utilizing the real time scheduling capabilities of the Linux kernel to acquire the resources necessary for its real time analysis.

Introduction
Beowulf-style clusters have traditionally used resource managers such as PBS Pro [6] to optimize the utilization of a cluster’s resources. Today there is a growing need for these clusters to be responsive to non-deterministic, event-driven data streams. Upon recognition of an application-specific event occurring outside the realm of PBS and the cluster, the necessary CPU, memory, network, and file system resources must be directed toward a timely response.

Most production Beowulf clusters use batch scheduling to ensure fairness among users and to maximize a cluster’s throughput. But these objectives are at odds with the dynamic nature of data stream processing [5, 4, 1, 2], where resource needs cannot be predicted in advance, and any projection of duration of usage can only be estimated because of the variability of real-world events such as storms, earthquakes, etc.

This paper addresses the need for integrated support for streaming applications in a batch job scheduled system. Through a quantitative evaluation of startup time in response to an environmental event, we are exploring the solution space for real-time responsiveness to data streaming.

Motivation
An atmospheric scientist at IU studies atmosphere-surface exchange of trace chemicals such as CO2, water, and energy from a variety of ecosystems. These atmospheric flux measurements generate large streams of data; sensors operating at rates up to 10 Hz can generate data at rates of hundreds of MB/day/sensor. The remote sensors use wireless 802.11b to communicate with a central computer located within the forest. The remote computer is connected via a T1 line to the IUB campus network. Under normal operation the data from the remote sensors are collected at the remote site and streamed to IU at a nominal rate of 1-3 Hz, where the event data are analyzed for particular meteorological conditions. The response to the detection of a meteorological condition is an increase in the event rate and the attendant enlistment of additional CPU cycles to handle the high-fidelity analysis.

This work is being carried out on the proto-AVIDD facility on the Bloomington campus of Indiana University. The cluster consists of five (5) dual-processor Pentium-IV Xeon nodes, two (2) dual-processor Itanium IA-64 nodes, and a RAID-5 filesystem served internally to the cluster via NFS. Proto-AVIDD runs Red Hat 7.2 with cluster management tools provided by the OSCAR 1.2 project. Proto-AVIDD is a testbed cluster used to study the issues of interactive and stream data analysis in a production-level Linux cluster.

AVIDD (Analyzing and Visualizing Instrument Driven Data) will be a geographically distributed data analysis cluster sited at three of the IU campuses: Bloomington (IUB), Indianapolis (IUPUI), and Gary (IUN). Due to ongoing vendor negotiations, an explicit description of the AVIDD cluster is not possible at this time; however, it is expected to be in the range of 0.5-1 TFLOPS aggregate, with 6 TBytes of distributed cluster storage, and will leverage IU’s existing HPSS-based massive data storage system. Local intra-cluster networking will be provided by a Myrinet fabric, with the inter-cluster networking utilizing Indiana’s 1 Tb/s I-Light infrastructure. By the time of the conference, it is anticipated that initial experiences and early results will be available from the production AVIDD cluster.

Approach
Application events streaming from remote sources must be analyzed in real time in order to achieve timely recognition of scientifically interesting meteorological conditions. Our approach is to employ a user space daemon resident on a dedicated communications node within the AVIDD cluster to receive the data stream and monitor it for interesting events. Under normal conditions, the daemon simply drops the events into an event cache implemented as a circular buffer. When a meteorological event occurs, the daemon responds first by sending a message to the remote computer commanding it to reconfigure the sensor system for higher fidelity data acquisition, and second by enlisting additional resources within the cluster to process the higher fidelity data streams. This startup scenario is depicted in Figure 1.
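For illustration, a cache of this kind might be organized as a fixed-size circular buffer of event records; the sketch below is a generic example, with hypothetical record sizes and function names rather than the daemon's actual code.

/* Hypothetical sketch of the daemon's circular-buffer event cache.
 * Event layout, buffer size, and function names are illustrative only. */
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 4096
#define EVENT_BYTES 256

struct event_cache {
    unsigned char slots[CACHE_SLOTS][EVENT_BYTES];
    size_t head;      /* next slot to overwrite */
    size_t count;     /* number of valid events, capped at CACHE_SLOTS */
};

/* Append one event record; once the buffer is full the oldest record is
 * silently overwritten, which matches a "keep the recent past" cache. */
static void cache_put(struct event_cache *c, const void *event, size_t len)
{
    if (len > EVENT_BYTES)
        len = EVENT_BYTES;
    memcpy(c->slots[c->head], event, len);
    c->head = (c->head + 1) % CACHE_SLOTS;
    if (c->count < CACHE_SLOTS)
        c->count++;
}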

[Figure 1]

Figure 1.  Data stream processing response to application event of interest.


We are exploring the solution space for real-time responsiveness to data streaming through a quantitative evaluation of startup time in response to an environmental event. Our first approach is to work within the PBS domain to allocate resources and provide job startup services in response to the detection of developing real-world phenomena. The second approach explores a “bullying” relationship with PBS in which the real time scheduling features of the Linux kernel are utilized to preempt resources already allocated by PBS. The full paper will quantify response times under both approaches and explore interesting issues involved in each approach.

Startup Within Context of PBS Resource managers such as PBS have a coherent view of a cluster’s state. They also have the tools necessary to stop, start, and signal jobs running under their control. It would appear to be a natural choice to use a resource manager’s preexisting infrastructure to implement cluster resource reallocation. In the final paper we will evaluate the steps involved in the startup of the high fidelity processes as well as the benefits and shortcomings of this approach.

Linux Real-time Scheduling Startup High fidelity processes are scheduled and started up outside the PBS domain using normal Linux process control mechanisms. If these data analysis tasks are scheduled by the kernel as real time processes, they are assured of being scheduled on the CPUs whenever they are runnable. Resources already used by a previous PBS job would only be consumed as needed; for example, memory previously used by the PBS job would be swapped out only as needed. In the final paper we will evaluate the steps involved in the startup of the high fidelity processes as well as the benefits and shortcomings of this approach.
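As a concrete illustration of this second approach, a process can ask the Linux kernel for a real-time scheduling class with the standard sched_setscheduler() interface. The sketch below is generic code, not taken from the paper, and normally requires root privilege; the priority value is arbitrary.

/* Sketch: promote the current process to the SCHED_FIFO real-time class
 * so the kernel runs it ahead of ordinary time-shared (batch) processes. */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct sched_param sp;
    sp.sched_priority = 50;                 /* illustrative mid-range real-time priority */

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return EXIT_FAILURE;
    }

    /* ... high-fidelity stream analysis would run here ... */
    return EXIT_SUCCESS;
}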

Other Issues Issues which will be touched upon but not discussed in detail include ssh, the Globus Toolkit [3], MPI, process preemption, and process checkpointing. A general discussion of the various steps involved in data acquisition, data transmission, and data disposition will be provided as an introductory foundation.

[1] Shivnath Babu and Jennifer Widom. Continuous queries over data streams. In International Conference on Management of Data (SIGMOD), 2001.
[2] Fabián E. Bustamante and Karsten Schwan. Active I/O streams for heterogeneous high performance computing. In Parallel Computing (ParCo) 99, Delft, The Netherlands, August 1999.
[3] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11(2):115–128, 1997.
[4] Sam Madden and Michael J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In International Conference on Data Engineering ICDE, 2002.
[5] Beth Plale and Karsten Schwan. dQUOB: Managing large data flows using dynamic embedded queries. In Proc. 9th IEEE Intl. High Performance Distributed Computing (HPDC), Los Alamitos, CA, August 2000. IEEE Computer Society.
[6] PBS Pro. Portable Batch System. http://www.pbspro.com/, 2002.
 

Programming Models


Title Remote Memory Operations on Linux Clusters Using the Global Arrays Toolkit, GPSHMEM and ARMCI
Author(s) Chona S. Guiang, A. Purkayastha and K. F. Milfeld
Author Inst Texas Advanced Computing Center, University of Texas - Austin, USA
Presenter Chona S. Guiang
Abstract Remote memory access (RMA) is a communication model that offers different functionalities from those provided in shared memory programming and message passing. The use of one-sided communication exhibits some of the advantages of shared memory programming, such as direct access to global data, without suffering from the limited scalability that characterizes SMP applications. Moreover, RMA has advantages over MPI in two important respects:

1.  Communications overhead: MPI send/receive routines generate two-way communications traffic (“handshaking”) to ensure accuracy and coherency in data transfers that is often unnecessary. In some scientific applications (e.g., some finite difference and molecular dynamics codes), the communication workload involves accessing data that does not require explicit cooperation between the communicating processes (e.g., writes to non-overlapping locations in memory). Remote memory operations in ARMCI [1], Global Arrays (GA) Toolkit [2], and GPSHMEM [3] decouple the data transfer operation from handshaking and synchronization. There is no extra coordination required between processes in a one-sided put/get operation, so there is less of the enforced synchronization that is sometimes unnecessary. Additional calls to synchronization constructs may be included in the application as needed.

2.  API: The distributed data view in MPI is less intuitive to use in scientific applications that require access to a global array. In contrast, the GA API provides a global data view of the underlying distributed data that does not distinguish between accessing local versus remote array data.

The Global Arrays (GA) Toolkit provides an SMP-like development environment for applications on distributed computing systems. GA implements a global address space, which makes it possible for elements in distributed arrays to be accessed as if they were in shared memory. The details of actual data distribution are hidden, but are still within user control through utility functions included in the library. MPI and GA library calls can be combined in a single program, as GA is designed to complement rather than replace existing implementations of message passing libraries. The GA library consists of routines that implement one-sided operations, interprocess collective array operations, and utility and diagnostic operations (e.g., for querying available memory and data locality). As the name implies, GA implements a global address space for arrays of data, not scalars. Therefore, GA one-sided communications are designed to access arrays of data (double, integer, and double complex), not arbitrary memory addresses.
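As an illustration of this global view, a distributed two-dimensional array might be created and updated with the GA C interface roughly as in the following sketch; the routine names follow the GA documentation, but the example is ours and details may differ between GA releases.

/* Sketch of one-sided access to a distributed array with Global Arrays.
 * Assumes GA built on top of MPI; argument details may vary by GA release. */
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    int dims[2] = {1000, 1000};
    int chunk[2] = {-1, -1};           /* let GA choose the distribution */
    int lo[2], hi[2], ld[1];
    double block[100][100];

    MPI_Init(&argc, &argv);
    GA_Initialize();

    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);   /* global 1000x1000 array */
    GA_Zero(g_a);

    for (int i = 0; i < 100; i++)      /* locally computed patch of data */
        for (int j = 0; j < 100; j++)
            block[i][j] = 1.0;

    /* One-sided put into a 100x100 patch of the global array; no cooperation
     * from the process that owns the patch is required, wherever it is stored. */
    lo[0] = 0;  lo[1] = 0;
    hi[0] = 99; hi[1] = 99;
    ld[0] = 100;
    NGA_Put(g_a, lo, hi, block, ld);

    GA_Sync();                         /* synchronize before others read the patch */
    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}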

GPSHMEM is a portable implementation of the Cray Research Inc. SHMEM [4], which is a one-sided communication library for Cray and SGI systems. Like GA, GPSHMEM implements a global address space for accessing distributed data, but GPSHMEM contains support for more data types. GPSHMEM operations are limited to symmetric data objects, objects for which local and remote addresses have a known relationship (e.g., arrays in common blocks). The library routines provide the following functionalities: 1) noncollective RMA operations, including those for strided access, and atomic operations; 2) collective operations; 3) parallel environment query functions. In addition to the supported Cray SHMEM routines, GPSHMEM has added block strided get/put operations for 2D arrays.

All RMA operations in GA and GPSHMEM use the Aggregate Remote Memory Copy Interface (ARMCI) library.  ARMCI is a portable, one-sided communication library. On Linux clusters, ARMCI relies on low-level network communication layers (e.g., Elan for Quadrics, GM for Myrinet) to implement remote memory copies, while utilizing the native MPI implementation for collective communication operations. Noncontiguous data transfers are supported in ARMCI, which facilitates access of sections of multidimensional arrays in GA and 2D arrays in GPSHMEM.

In general, MPI-2 RMA is more difficult to use than GA and GPSHMEM because of the greater number of steps involved. A single one-sided operation includes setting up a ‘window’ for the group of communicating processes, moving the data, and ensuring completion of the data transfer (among processes in the group) using different synchronization constructs for “active” or “passive” targets. RMA operations with passive targets are serialized, which degrades performance. Notwithstanding these caveats, MPI one-sided communication is part of the MPI-2 standard and consequently is more likely to gain support on a greater number of platforms than GA, GPSHMEM, and ARMCI.
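The extra steps are visible even in a minimal, generic MPI-2 sketch (ours, not from the paper): a window must be created collectively, and the put must be bracketed by synchronization calls before the data is guaranteed to be visible at the target.

/* Sketch of an MPI-2 one-sided put with active-target (fence) synchronization. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int N = 1024;
    double *winbuf, sendbuf[1024];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < N; i++)
        sendbuf[i] = (double)rank;

    /* Step 1: every process exposes a buffer through a window object. */
    MPI_Alloc_mem(N * sizeof(double), MPI_INFO_NULL, &winbuf);
    MPI_Win_create(winbuf, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Step 2: move data inside an access epoch (active target). */
    MPI_Win_fence(0, win);
    if (nprocs > 1 && rank == 0)
        MPI_Put(sendbuf, N, MPI_DOUBLE, 1 /* target */, 0 /* displacement */,
                N, MPI_DOUBLE, win);

    /* Step 3: close the epoch; only now is the data visible on the target. */
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Free_mem(winbuf);
    MPI_Finalize();
    return 0;
}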

Previous work by Nieplocha et al. [5, 6] on implementing ARMCI on Linux clusters has demonstrated that the ARMCI put/get operations outperform the MPICH-GM implementation of MPI 1.1 send and receive for both strided and contiguous data access. We have conducted additional performance measurements of ARMCI put/get versus MPI 1.1 send/receive on the TACC IA32 and IA64 clusters, using a simple “halo exchange” prototype where each process exchanges contiguous and strided data with its nearest neighbors on a 1-D process grid (a minimal MPI 1.1 sketch of such an exchange appears after the list below). Results, as well as a further description of the experiments, are shown in Figures 1 and 2. For contiguous data transfers, the ARMCI put/get operations exhibit better performance than the equivalent MPI 1.1 send/receive over the entire range of message sizes considered on the IA64 system. On the IA32 cluster, the ARMCI put and get operations outperform MPI send and receive for message lengths exceeding 8KB. On both systems, ARMCI put/get perform better than MPI send/receive for transfers of strided data over the entire range of message sizes. Future work will perform the following experiments to address important issues of RMA operations:

1.  Latency and bandwidth measurements over different message sizes will be conducted for put/get operations as implemented in LAM-MPI versus ARMCI. The poor bandwidth obtained with the LAM-MPI implementation of the MPI-2 put/get operations, as shown in Figure 1, could be attributed to the fact that the most recent stable version does not take full advantage of the underlying GM layer. Our future study will use the most recent version of LAM-MPI, which has a completely rewritten version of its GM RPI (request progression interface) [7].

2.  The transfer of contiguous array data in GPSHMEM and GA will be compared against the corresponding point-to-point and/or collective communication calls in MPI. These timing results will be compared with ARMCI measurements to obtain the overhead involved in the GA and GPSHMEM routines.

3.  Applications characterized by dynamic communication patterns, irregular data structures, and noncontiguous data transfers are likely to benefit from the remote memory copy functionality that GA and GPSHMEM provide. Molecular dynamics and finite difference applications fit this description well, since both involve the exchange of sections of multidimensional arrays at every time step of the simulation. To test the efficiency and applicability of GA and GPSHMEM  for these types of applications, we will evaluate the performance of  a simple MD and finite difference code using three different implementations for the transfer of array sections --- corresponding to the use of GA, GPSHMEM, and MPI point-to-point as well as collective communication functions. The experience on these applications will also provide insight on the general feasibility of using SMP-like programming interfaces on distributed-memory systems.
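For reference, the contiguous MPI 1.1 variant of such a halo exchange can be sketched as follows; this is a generic nearest-neighbor exchange on a 1-D process grid with an illustrative message size, not the actual benchmark code.

/* Sketch of a contiguous halo exchange on a 1-D process grid using MPI 1.1
 * point-to-point calls; the actual benchmark code may differ. */
#include <mpi.h>

#define HALO 1024                      /* illustrative halo size in doubles */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double send_left[HALO], send_right[HALO];
    double recv_left[HALO], recv_right[HALO];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    for (int i = 0; i < HALO; i++)
        send_left[i] = send_right[i] = (double)rank;

    /* Exchange with the right neighbor, then with the left neighbor.
     * MPI_Sendrecv avoids the deadlock a naive blocking ordering could cause. */
    MPI_Sendrecv(send_right, HALO, MPI_DOUBLE, right, 0,
                 recv_left,  HALO, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);
    MPI_Sendrecv(send_left,  HALO, MPI_DOUBLE, left,  1,
                 recv_right, HALO, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}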

All computational work will be performed on TACC’s Linux clusters: a 32-node cluster of Pentium III-based dual-processor SMPs and an Itanium-based cluster of 20 dual-processor SMP nodes. Each cluster uses a Myrinet 2000 switch fabric.

[Figure 1]
Figure 1

[Figure 2]  
Figure 2

The communication throughput of a “halo exchange” prototype code is measured on four processors of the TACC IA32 (Figure 1) and IA64 clusters (Figure 2) using the MPICH-GM 1.2.1.7 implementation of MPI 1.1 send/receive; the latest version of ARMCI; and the LAM-MPI 6.5.6 implementation of the MPI-2 put/get operations.

References:
[1] http://www.emsl.pnl.gov:2080/docs/parsoft/armci/
[2] http://www.emsl.pnl.gov:2080/docs/global/
[3] K. Parzyszek, J. Nieplocha, and R. Kendall, “A Generalized Portable SHMEM Library for High Performance Computing,” in Proceedings of the Twelfth IASTED International Conference on Parallel and Distributed Computing and Systems.
[4] http://dynaweb.tacc.utexas.edu:8080/dynaweb/app_prog/004-2178-002/@Generic__BookTextView/2819;hf=0?DwebQuery=SHMEM#1
[5] J. Nieplocha, V. Tipparaju, A. Saify, D.K. Panda, "Protocols and Strategies for Optimizing Performance of Remote Memory Operations on Clusters," in Proc. Workshop Communication Architecture for Clusters (CAC02) of  IPDPS'02
[6] J. Nieplocha, J. Ju, E. Apra, “One-sided Communication on  Myrinet-based SMP Clusters  using the GM Message-Passing Library,” in Proc. CAC01 Workshop of  IPDPS'01, San Francisco, 2001
[7]  http://www.lam-mpi.org/MailArchives/lam/msg04345.php
 
Title Supercomputing in Plain English: Teaching High Performance Computing to Inexperienced Programmers
Author(s) H. Neeman, J. Mullen, L. Lee, and G.K. Newman
Author Inst University of Oklahoma, USA, and Worcester Polytechnic Institute
Presenter Henry Neeman
Abstract Although the field of High Performance Computing (HPC) has been evolving rapidly, the development of standardized software systems -- MPI, OpenMP, LAPACK, PETSc and others -- has in principle made HPC accessible to a wider community. However, relatively few physical scientists and engineers are taking advantage of computational science in their research, in part because of the considerable degree of sophistication about computing that HPC appears to require. Yet the fundamental concepts of HPC are fairly straightforward and have remained relatively stable over time. These facts raise an important question: can scientists and engineers with relatively modest computing experience learn HPC concepts well enough to take advantage of them in their research?

To answer this question, HPC educators must address several issues.  First, what are the fundamental issues and concepts in computational science? Second, what are the fundamental issues and concepts in HPC?  Third, how can these ideas be expressed in a manner that is clear to a person with relatively modest computing experience? Finally, is classroom exposure sufficient, or is guidance required to assist investigators in incorporating HPC in their research codes?

A helpful way to describe the fundamental issues of computational science is as a chain of abstractions: phenomenon, physics, mathematics (continuous), numerics (discrete), algorithm, implementation, port, solution, analysis and verification. In general, physics, mathematics and numerics are addressed well by existing science and engineering curricula -- though often in isolation from one another -- and therefore instruction should be provided on issues relating primarily to the later items, and on the interrelationships between all of them. For example, algorithm choice is a fundamental issue whose gravity many researchers with less computing experience do not appreciate; a common mistake is to solve a linear system by inverting the matrix of coefficients, without regard for performance, conditioning, or exploitation of the properties of the matrix.

As for HPC, the literature tends to agree on the following fundamental issues: the storage hierarchy; instruction-level parallelism; high performance compilers; shared memory parallelism (e.g., OpenMP); distributed parallelism (e.g., MPI); hybrid parallelism. The pedagogical challenge is to find ways to express the basic concepts with minimal jargon and maximal intuitiveness.

Although expressing these concepts in a manner appropriate for the target audience can be challenging, it does not need to be daunting.  For example, the use of analogies and narratives to explain these concepts can capture the fundamental underlying principles without distracting the students with technical details. Once the students understand the basic principles, the details of how to implement HPC solutions are much easier to digest.

However, it is probably not reasonable, in many cases, to expect novice programmers to immediately understand how to apply HPC concepts to their science. As a follow-on, regular interactions with experienced HPC users can provide the needed insight and practical advice to move research forward.

We will discuss an ongoing effort to develop materials that express sophisticated scientific computing concepts in a manner accessible to a broad audience. In addition, we will examine a programmatic approach that incorporates both the use of these materials and the crucial contribution of follow-up.

 

Applications and Tools Track Abstracts:


Applications Session I: Applications Performance Analysis
Title Scaling Behavior of Linear Solvers on Large Linux Cluster
Author(s) John Fettig, Wai-Yip Kwok, Faisal Saied
Author Inst NCSA/University of Illinois, USA
Presenter John Fettig
Abstract Large sparse linear systems of equations arise in many computational science and engineering disciplines. One common scenario occurs when implicit methods are used for time-dependent partial differential equations, and a nonlinear system of equations has to be solved at each time step. With a Newton-type approach, this in turn leads to a linear system at each Newton iteration. In such situations, the linear solver can account for a large part of the overall simulation time. Demonstrating the effectiveness of linear solvers on large Linux clusters will help establish the usefulness of these platforms for applications in such diverse areas as CFD, astrophysics, biophysics, earthquake engineering, and structural mechanics.

In this paper, we investigate the effectiveness of Linux clusters for state-of-the-art parallel linear solver software such as PETSc, Hypre, and other packages drawn from the research community. In particular, we study scaling to large processor counts and large system sizes (tens of millions of unknowns). The scalability of the algorithms, of the implementation, and of the cluster architecture will be investigated. We will focus on iterative Krylov subspace solvers such as Conjugate Gradient, GMRES(k) and BiCGSTAB, coupled with parallel preconditioners such as block Jacobi, domain decomposition, and multigrid methods.
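As a point of reference, such solver configurations can be expressed in a handful of PETSc calls. The sketch below is our illustration, not code from the paper; it uses a 1-D Laplacian as a stand-in test matrix, and exact call signatures vary between PETSc releases.

/* Sketch: Krylov solver (CG) with block Jacobi preconditioning in PETSc.
 * Illustrative only; call signatures differ somewhat across PETSc versions. */
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat A;
    Vec x, b;
    KSP ksp;
    PC  pc;
    PetscInt i, n = 1000, Istart, Iend;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a 1-D Laplacian as a simple distributed test matrix. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    /* Krylov solver plus preconditioner; both can be switched at run time. */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPCG);          /* or KSPGMRES, KSPBCGS */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCBJACOBI);        /* or PCASM (domain decomposition), PCMG */
    KSPSetFromOptions(ksp);          /* e.g. -ksp_type gmres -pc_type asm */
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp);
    VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
    PetscFinalize();
    return 0;
}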

In our analysis of the parallel performance of these solvers on Linux clusters, we will identify which components of clusters (CPU, memory bandwidth, network latency/bandwidth) are the critical factors in the performance of the different software components of the parallel solvers, such as matrix vector products, inner products, preconditioners, grid transfer operators, coarse grid solvers. We will present results for Linux clusters based on first and second-generation 64-bit Itanium processors from Intel, connected by a Myrinet switch, used primarily in message-passing mode.

The number of performance tools available on Linux clusters is increasing. In our performance analysis of parallel linear solvers, we will use these tools to give a more detailed performance profile of the solvers. In particular, tools that access hardware performance counters will be used to study the impact of the memory hierarchy on performance, and MPI analysis tools will be used to quantify the message-passing performance of the network for sparse linear solvers.

 
Title Benchmarking of a Commercial Linux Cluster
Author(s) Daniele Tessera
Author Inst. Università di Pavia, Italy
Presenter Daniele Tessera
Abstract Exploiting the cumulative computing capabilities of workstations and PCs has long been pursued as an inexpensive way of delivering high performance to many scientific and industrial applications. Various solutions have been proposed to allow parallel applications to be executed over a set of independent systems. Cluster machines, built on top of commodity off-the-shelf components and initially developed as prototypes by research centers, are becoming very popular. The increasing performance of commodity components and the availability of interconnection networks able to link large numbers of such components have fueled the development of these cluster machines. Moreover, the development of specialized hardware and software components tuned to the demands of HPC applications has resulted in cluster machines approaching supercomputer performance and being ranked among the 500 most powerful supercomputers. Hardware vendors offering scalable cluster machines also testify to the maturity of cluster computing as a cost-effective solution for the high performance needs of scientists and numerical engineers.

This paper is focused on a detailed study of the performance achieved by an IBM NetFinity supercluster when performing various scientific benchmarks. Note that, although we have analyzed the performance of a specific machine, our results have a broader applicability. Indeed, the IBM NetFinity adopts a typical cluster architecture based on Intel Pentium processors interconnected via both Myrinet and Ethernet networks, and the Linux operating system. The cluster is composed of 260 nodes, each of them housing two Intel Pentium processors. As a preliminary overview of the cluster performance, we have investigated the run time behavior of a few kernels performing widely used numerical algorithms, such as matrix LU factorization, multigrid solvers, and distributed rankings.

The aim of this study is to evaluate the performance benefits that numerical applications gain from specialized high-bandwidth communication components, i.e., Myrinet switches, over a typical Beowulf cluster based on Ethernet networking. For this purpose, we have analyzed the speedup achieved by the various kernels executed over either the Myrinet or the Fast Ethernet network. Our results are based on post mortem analysis of low-level timing measurements collected by monitoring the executions of the various kernels. Performance models of simple distributed computations, aimed at assessing the relative weight of communication versus computation, have been derived.

The analysis of communication times has then focused on the wall clock times that processors spend exchanging data. Note that these times include both the wall clock time required to deliver the data over the network links and the time spent executing the adopted communication protocols. As a rule of thumb, Myrinet allows greater scalability than Ethernet, even when the amount of workload to be processed by each allocated processor is quite small. The speedup of a few kernels executed over Ethernet saturates when more than 32 processors are allocated. On the other hand, a surprising result is that Ethernet performance is superior when parallel kernels issue a large number of blocking send/receive operations. For each communication protocol, we have also analyzed the computation overheads, that is, the increases in computation wall clock time due to the concurrent execution of system processes that manage the communication activities.

Our performance characterization study has then focused on the behavior of a complex climate model benchmark, namely the PSTSWM kernel from Oak Ridge National Laboratory. Performance figures from various testbed simulations, varying the size of the physical grid, the number of simulated steps, and the number of allocated processors, have been analyzed. Statistical clustering techniques have then been applied to these performance figures to derive synthetic characterizations of the behavior of the allocated processors. The identified groups of homogeneous behavior have then been related to the characteristics of the corresponding simulations, that is, to the processed workload.

As a further step towards characterizing the performance achieved by the IBM NetFinity cluster, we have studied the impact of the processor allocation policy on the various kernel execution times. Parallel applications can be thought of as composed of tasks to be allocated to the various processors. Since each IBM NetFinity node houses two processors, two allocation policies are available: we can allocate either one or two application tasks on each node. Note that when only one task is allocated on a node, both processors may still contribute to the execution of that task. Numerical kernels experience performance degradation when two tasks per node are allocated on a large number of nodes (i.e., more than 32 nodes). This degradation is due to increases in both computation and communication wall clock time. Indeed, an in-depth analysis of the timing measurements shows that communications between tasks allocated on the same node take longer than communications between tasks allocated on distinct nodes. Moreover, since the computation phases are somewhat synchronized across all application tasks, contention due to synchronized memory accesses and the concurrent execution of any spawned communication processes also increases the computation wall clock times. Experimental evidence presented in the paper will substantiate that, for some types of applications and a fixed number of nodes, allocating both processors on each node results in longer execution times than allocating only one processor per node.

 
Applications Session II: Applications Performance on Clusters

Title A Comparative Study of the Performance of a CFD program across different Linux Cluster Architectures
Author(s) Th. Hauser, R.P. LeBeau, T. Mattox, G.P. Huang, H. Dietz
Author Inst Utah State University
Presenter Thomas Hauser
Abstract [Paper #26]

 
Title Scalability of Tera-scale Linux-based Clusters from Parallel Ab initio Molecular Dynamics
Author(s) Hongsuk Yi, Jungwoo Hong, Hyoungwoo Park, and Sangsan Lee
Author Inst KISTI, Korea
Presenter Jungwoo Hong
Abstract At the beginning of 2000, KISTI initiated the TeraCluster project to design a Linux-based cluster with tera-scale performance. The main goal of this project is to provide resources, composed of PC clusters, that meet the level of compute performance required by grand challenge applications in Korea. The UP2000 with Myrinet and the DS10 with Fast Ethernet serve as test platforms of the TeraCluster project and have benefited from many years of research dedicated to improving their scalability and performance. For the actual experiments, we have investigated the performance of scalable ab initio molecular dynamics and quantum chromodynamics (QCD) applications on the test platforms and compared them with the Cray T3E system. The parallelization is based on the Message Passing Interface standard. Results show that the QCD application, which requires only nearest-neighbor communications between computing nodes, parallelizes excellently. For the ab initio molecular dynamics, in contrast, the operations associated with the three-dimensional fast Fourier transform dominate, representing the most time-consuming part of the calculation of the C36 fullerene. Combining these results with performance evaluations of additional grand challenge applications in computational fluid dynamics, structural analysis, and computational chemistry, we have designed a prototype of the TeraCluster with 516 computing nodes and built the first phase of the TeraCluster in 2002.


TABLE I: The hardware and software components of the three different platforms and the basic results of NPB and PMB benchmark suite


System               DS10            UP2000     Cray T3E   PhaseI
CPU architecture     EV6             EV6        EV5        Pentium-4
Network              Fast Ethernet   Myrinet    FDDI       Myrinet
CPU clock (MHz)      466             667        450        1700
Number of nodes      64              64         128        128
Rpeak (Gflop/s)      59.6            85.3       108        435.2
Rmax (Gflop/s)       21.0            43.6       82.0       208.3
Latency (microsec)   140.0           13.0       1.3        12.5
Bandwidth (MB/s)     8.9             139.7      157.0      65.1



We have built two different Linux-based PC clusters of 64 computing nodes plus one server node as pilot systems [1]. The DS10 system with 466 MHz EV6 and the UP2000 with 667 MHz EV6 Alpha processors are connected with Fast Ethernet and Myrinet, respectively. Each computing node has 250 MB of memory and a 20 GB local IDE disk, while the server node has 128 MB of memory and a 30 GB disk. The systems run Red Hat Linux 6.0 with a 2.2.0 kernel. MPICH is used as the MPI communication library, and the Portable Batch System (PBS) is used as the queuing system. The pilot project has been a success story, carried out in collaboration with a few industrial partners in Korea. Based on performance evaluations of the major KISTI [2] applications, we built the first phase of the TeraCluster, denoted PhaseI. The details are summarized in Table I.

[Figure 1]
FIG. 1: (a) The sustained performance and (b) the corresponding speed-up of the full QCD with the L = 12 × 12³ lattice size for three different machines.

In Fig. 1(a), we illustrate the execution times of the quenched QCD program on the pilot systems and the first phase of the TeraCluster for large lattice sizes. The execution time on a single processor shows that the UP2000 and PhaseI systems are 2-3 times faster than the Cray T3E system. Owing to the difference in CPU architecture, the PhaseI and UP2000 surpass the DS10 system in speed even at 64 working nodes. These results reflect the high compute-to-communication ratio of the QCD application, which has only nearest-neighbor communications.

[Figure 2]

FIG. 2: (a) Speedup of the ab initio molecular dynamics for the UP2000, DS10, and Cray T3E; the dashed line indicates the ideal speed-up. (b) Speedup of the ab initio application using VASP on the PhaseI.

To elucidate the scalability of the pilot systems we present the speed-up S = T1/TP, where T1 and TP are the execution times on a single processor and on P processors, respectively. Fig. 1(b) shows the obtained S on the three different PC clusters. The UP2000 shows nearly linear scalability, while the PhaseI system shows no linear scalability at all. Moreover, the scalability of the DS10 is rapidly suppressed owing to the limited bandwidth and high latency of Fast Ethernet. These results imply that network bandwidth is crucial for a large-scale computing system intended as a scalable scientific platform.

We also calculate a new solid phase of carbon made from the C36 fullerene. The supercell approximation is used to simulate the periodic boundary condition, and the cell size is 11 x 11 x 11 Å³ for C36. The Kohn-Sham single-electron wavefunction is expanded in plane waves up to a cut-off energy of 36 Ry. Since C36 is a molecule, we use a single k-point in the calculation. In Fig. 2(a), we illustrate the speedup of the C36 fullerene on the pilot systems and compare it with the Cray T3E. In contrast to the results above, each system shows incomplete scalability. In particular, the Cray T3E retains linear scalability up to 16 nodes, but then S increases only monotonically to reach nearly 16 when 64 computing nodes are used, while the DS10 exhibits even poorer scalability. These results underscore the importance of network capacity for the collective communication between the executing nodes. Indeed, the overhead of data communication increases rapidly as the number of working nodes increases, reflecting the significant data exchange during the 3D FFT. In Fig. 2(b), we also present performance evaluations of ab initio molecular dynamics using the VASP package on the PhaseI. Here, however, no scalability is found at all, even for 32 working nodes. In the remainder of this paper, we will discuss the scalability of these ab initio grand challenge problems in more detail.

[1] TeraCluster Project (http://cluster.or.kr).
[2] KISTI (http://www.hpcnet.ne.kr).

 
Title openMosix vs Beowulf: a case study
Author(s) Moshe Bar, Stefano Cozzini, Maurizio Davini and Alberto Marmodoro
Author Inst. INFN and the University of Pisa, Italy
Presenter Stefano Cozzini
Abstract Today two main clustering paradigms exist for the Linux environment: openMosix and Beowulf (MPI-based). Whereas in MPI-based solutions the parallelization needs to be coded explicitly through special library calls, in openMosix the usual Unix-style programming concepts apply. Also, for clustering applications it is becoming increasingly evident that the latency of the network interconnect carries a very significant weight in overall system performance and throughput. In this paper we will describe our evaluation of these two different clustering paradigms applied to a suite of production codes currently used in a specific scientific environment recently installed at Trieste (Italy): the INFM center of excellence DEMOCRITOS. The codes represent the standard computational tools used in computational research in atomistic simulations. The suite can be roughly divided into three main categories of codes:


  1. First-principles or ab-initio packages, where the quantum description of the electronic properties of matter is taken into account; the size of the complex systems simulated is limited to, at most, hundreds of atoms. Examples include metal surfaces and active sites of biological systems.
  2. Classical Molecular Dynamics (MD) packages, needed to simulate biological systems composed of several thousand atoms, such as proteins.
  3. Quantum Monte Carlo (QMC) programs which are used to study strongly correlated systems.
The above classification applies also to the computational characteristics of the codes:  

  1. Ab-initio codes are computationally demanding in terms of memory and CPU, and highly scalable parallel versions of electronic structure codes have been developed over the past years using the Message Passing (MP) paradigm.
  2. Classical Molecular Dynamics codes have different characteristics: lower memory requirements allow parallel codes to be implemented easily using a replicated-data algorithm; this, however, limits scalability.
  3. The codes implementing Quantum Monte Carlo (QMC) techniques use a so-called embarrassingly parallel strategy: each processor works on its own task. In this class we have both serial and parallel codes.
The heterogeneity of the codes and the different computational requirements suggest that different cluster solutions should be considered for different classes of codes.

We therefore decided to provide DEMOCRITOS with two small clusters, the first implementing the Beowulf paradigm and the second the openMosix approach. Both clusters are IBM 1300 cluster solutions equipped with 1.4 GHz Pentium III processors and the same number of CPUs (16 dual-processor nodes). One cluster is equipped with a Myrinet high-speed network, which allows us to test the role of a high-speed network in both paradigms.

Our benchmarking work aims to determine which solution meets specific requirements for scalability, performance, and throughput. Scalability and performance are measured in a standard way and compared against other (proprietary) parallel platforms; we judge them satisfactory by computing a "performance/price" ratio. These requirements are easily met by both paradigms for all classes of codes. The aspect that makes the difference is throughput, which is fundamental for guaranteeing efficient resource sharing among users and a high utilization of the computational resources.

Finally, a key role is played by the "time to production" variable. It is in fact fundamental for our research to reduce to a minimum the time needed to implement a scalable and efficient parallel application starting from a serial one.

Our preliminary results indicate that for highly parallel codes such as the ab initio ones, the standard Beowulf paradigm is the preferred solution in terms of efficiency, scalability, and throughput. We note, however, that the superiority of this solution is mainly due to the fact that highly parallel and scalable codes are already available, and therefore the time to production for this class of codes is practically zero.

On the other hand, for QMC codes the openMosix approach is certainly to be preferred. Efficiency and scalability are, as already said, the same as in the Beowulf case; the two main advantages concern the other two aspects. Throughput is easily obtained through the load-balancing technology within openMosix, and the "time to production" for serial programs is practically negligible: just run many instances of the program and that is all.

It is interesting to note that the second class of codes (classical MD), despite already being parallel, generally runs better on the openMosix cluster. These codes have a low level of scalability (1-8 CPUs), which makes sharing resources between this class of codes and the ab-initio ones (which require many processors) difficult to schedule efficiently with standard queue systems on a small cluster like ours. The openMosix solution, on the contrary, deals best with the requirements of QMC and moderately scalable codes like these.

In the final paper we will report the results obtained on the computational suite under various lab environments, considering the following aspects: topology of nodes, typology of nodes, and network interconnects. Using dedicated hardware for the various interconnects (Myrinet, Dolphin, Fast Ethernet, Gigabit Ethernet), we will show the resulting (reproducible) scalability, performance, and throughput for our computational problem.

 
Applications Session III: Experiences with Portability and I/O

Title Adventures with Portability
Author Elizabeth Post
Author Inst Lincoln University, New Zealand
Presenter Elizabeth Post
Abstract As computers become more powerful it becomes possible to develop computer models that more accurately approximate the real-world systems they are simulating.  However these models are increasingly complex, often taking years to develop, so it is important to re-use existing work as far as possible rather than continually “re-inventing the wheel” and duplicating effort which could be better spent elsewhere. 

With this aim of re-use, scientists at Dexcel (a New Zealand dairy research institute) developed a dairy farm simulation framework with some capability to incorporate existing computer models of distinct entities in a pastoral dairy farm, e.g. pasture and crop growth models and animal metabolism models. This framework also allows run-time selection from several alternative models for each component (where they exist). With some being simpler and others more complex, this permits the same basic farm configuration to be modeled at different levels of detail, as appropriate for different research exercises.

A major practical difficulty in implementing a system with this flexibility is that the various component sub-models have typically been written in a range of different computer languages. The dairy farm simulation framework itself is written in Smalltalk, and so far makes use of sub-models for the climate, pasture and animal components written in various languages such as Smalltalk, C, Fortran and Pascal.  Initially the framework was developed as a research tool for agricultural scientists, and run on a Windows operating system using the Microsoft COM/DCOM protocol for interprocess communication.  However there is now a requirement to investigate automated optimization of farm specification and operating parameters, and also visualisation of the associated high-dimensional objective functions. This work requires very large numbers of simulations to be performed so we are attempting to port the framework to run in parallel on a Linux cluster.

While porting the existing framework and component models to run on the Linux cluster we also need to maintain the same dairy farm simulation model running on a Windows operating system. Ideally, as far as possible, we would like to maintain one single version that runs on both operating systems and also on a uniprocessor or in a parallel processing environment to avoid the difficulties of keeping the development of multiple versions for different environments synchronised. In practice this may not be completely achievable, but as far as possible we want to minimize the differences between these implementations. Thus we have had to use techniques that are portable between these environments, to ensure that all environments can use the same version. An added complexity is that researchers using the Windows version use a Windows interface but the parallel version must run without interacting with a Windows interface, so the code for the model itself must be independent of the code for the interface. Also, as we are using component models developed by other researchers we cannot change the component models themselves, and have to ensure that the interfaces between the framework and the component models are such that new versions of the component models may be substituted as they become available. It may also happen that alternative models for a particular component, such as the pasture or animal component, may require quite different inputs and produce outputs in different formats, or even completely different types of output. It is therefore something of a challenge to devise generic portable interfaces between the framework and the component models that will provide the necessary input parameters for any component model, and also be able to make use of the output from these component models, if necessary converting the output to a different form before it can be used in the dairy farm simulation.

This paper identifies and discusses the problems to be solved in this porting process, and describes the techniques that have been used successfully to date. These include porting the framework and some of its component models from a Windows uniprocessor environment to a Linux parallel processing environment, the initial use of files to transfer data between processes, and then replacing the file transfers by socket communication. This has led to some interesting exercises in portability. As we use MPI to manage the parallel processing, and have not yet persuaded Smalltalk to communicate by using MPI, we have a C program, using MPI, which manages the parallelism. This C program then starts the Smalltalk framework, which in turn communicates with its component models, written in Smalltalk, C, Fortran or Pascal, and returns a final result for each simulation to the controlling C program. The problem has been in getting different processes, written in different languages, and perhaps started and running independently of one another, to communicate and pass data between themselves. We hope to include some preliminary results of extending MPI outside its normal C/C++/Fortran domain and making it the basic inter-process communication mechanism in the framework, one that will also be portable across operating systems.
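A stripped-down sketch of this controller pattern is shown below. It is our illustration rather than the actual Dexcel code: it assumes a hypothetical run_simulation executable that wraps the Smalltalk framework and prints a single numeric result, and it uses a pipe instead of the sockets described above purely to keep the control flow visible.

/* Sketch: MPI controller that launches an external simulation process per rank,
 * reads one numeric result from it, and gathers the results on rank 0.
 * "./run_simulation" is a hypothetical wrapper around the Smalltalk framework. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double result = 0.0, *all = NULL;
    char cmd[256];
    FILE *pipe;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank runs one farm-configuration scenario, identified by its rank. */
    snprintf(cmd, sizeof(cmd), "./run_simulation --scenario %d", rank);
    pipe = popen(cmd, "r");
    if (pipe == NULL || fscanf(pipe, "%lf", &result) != 1) {
        fprintf(stderr, "rank %d: simulation failed\n", rank);
        result = -1.0;                 /* sentinel for a failed run */
    }
    if (pipe != NULL)
        pclose(pipe);

    /* Collect one result per simulation on the controlling rank. */
    if (rank == 0)
        all = malloc(nprocs * sizeof(double));
    MPI_Gather(&result, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            printf("scenario %d -> %g\n", i, all[i]);
        free(all);
    }

    MPI_Finalize();
    return 0;
}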

 
Title Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters
Author Kent Milfeld, Avijit Purkayastha, Chona Guiang
Author Inst TACC/University of Texas - Austin, USA
Presenter Kent Milfeld
Abstract One of the key elements to the Linux Cluster HPC Revolution is the high speed networking technology used to interconnect nodes within a cluster.  It is the backbone for “sharing” data in a distributed memory environment.  This backbone, when combined with local high-speed disks, also provides the framework for developing parallel I/O paradigms and implementing parallel I/O systems—an HPC environment for “sharing” disks in a parallel file system.

While much of the emphasis in cluster computing has focused on optimizing processor performance (e.g., compilers, hardware monitors, numerical libraries, optimization tools) and increasing message passing efficiency (e.g., MPICH-GM, MPI-Pro, LAM, M-VIA), much less effort has been directed towards harnessing the parallel capacity of multiple processors and disks to obtain “HPC I/O” in clusters.  This is partly due to the fact that the MPI-I/O standard was not formulated until version 2 of MPI.  It is also due to the cost of dedicating processors/nodes to I/O services.  (For workloads that include only a few I/O intensive applications, dedicated nodes for parallel I/O may be an underused resource, to say the least.)  In addition, dedicated I/O nodes introduce heterogeneity into the system configuration, and therefore require additional resources, planning, management, and user education.

The Parallel Virtual File System (PVFS) [1] is a shared file system designed to take advantage of parallel I/O service on multiple nodes within a Linux cluster.  In PVFS, parallel I/O requests are initiated on clients (compute nodes), and the parallel I/O service occurs on I/O servers (I/O nodes).  The PVFS API includes a partitioning mechanism to handle common, simple-strided access [2] with a single call.  The file system is virtual in the sense that it can run without kernel support (that is, in user space) in its native form (pvfs), although a UNIX I/O interface is supported through a loadable kernel module.  Also, PVFS is built on top of the local file system on each I/O node, so there is a rich parameter space for obtaining optimal performance through the local file system.  The authors of PVFS (Cettei, Ross & Ligon) [3] have championed this file system and have performed many different benchmarks to illustrate its capabilities [4].
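As an illustration of the kind of access pattern involved (not code from the paper), the following sketch uses the MPI-IO interface, which ROMIO can direct at a PVFS file system, to perform a simple strided collective write; the path /pvfs/testfile and the block sizes are arbitrary assumptions.

/*
 * Minimal sketch (assumptions noted) of strided parallel I/O through MPI-IO,
 * which ROMIO can direct at a PVFS file system.  The path "/pvfs/testfile"
 * and the block sizes are hypothetical; real runs would tune striping via
 * the PVFS configuration and MPI_Info hints.
 */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK  1024   /* ints per block written by each process */
#define BLOCKS 16     /* blocks per process                     */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Local buffer: BLOCKS blocks of BLOCK ints, tagged with the rank. */
    int *buf = malloc(BLOCKS * BLOCK * sizeof(int));
    for (int i = 0; i < BLOCKS * BLOCK; i++)
        buf[i] = rank;

    /* File type: this rank owns every nprocs-th block of the file. */
    MPI_Datatype filetype;
    MPI_Type_vector(BLOCKS, BLOCK, BLOCK * nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/pvfs/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Interleave the ranks: rank r starts at block r. */
    MPI_Offset disp = (MPI_Offset)rank * BLOCK * sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* Collective write lets ROMIO aggregate requests to the I/O servers. */
    MPI_File_write_all(fh, buf, BLOCKS * BLOCK, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}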

The main attributes of PVFS are:

·    a single file name space
·    multiple I/O servers (files are striped across I/O nodes)
·    single metadata manager (mgr): metadata is separated from data
·    independent I/O server, client and manager location
·    data transfer through TCP/IP
·    multiple user interfaces (MPI-I/O, UNIX/POSIX I/O, native pvfs)
·    (PVFS has been integrated into the ADIO layer of ROMIO)
·    free and downloadable

The purpose of this paper is to investigate the PVFS file system on a wide range of parameters, including different local file systems and varied multi-user conditions.  

In many small “workcluster” systems, the nodes are frequently 2-way SMPs (for price-performance reasons).  In these systems, where jobs are often run in dedicated mode (all processors devoted to a single user), there are two basic configurations that we evaluate for I/O-intensive applications:  a compute processor and an I/O processor can coexist on each node, or the nodes can be divided into separate I/O nodes and compute nodes.  More specifically, the cases are:
  1. Some nodes are configured as I/O nodes and are committed to performing only I/O (they are excluded from the compute node pool).
  2. No nodes are configured exclusively for I/O; the I/O services are then resident on the compute nodes, and the parallel I/O service is performed “on node” during the I/O phase of an execution (and a larger pool of processors can be assigned to the computational tasks). 
Another concern is the underlying file system. Since PVFS is built on top of the local file system on each I/O node, alternative file systems, such as ext3fs and tmpfs, are evaluated to identify conditions that provide enhanced performance.  

Also, on a loaded production cluster, the state of the system is often far from the conditions under which benchmarks are performed.  In our evaluation of PVFS we simulate different scenarios of simultaneous I/O access by multiple users to test I/O nodes under the stress that can occur in large systems.  While the cost of “multiplexing” multiple user requests is not important in a dedicated environment, it is an important measurement for large production systems and should not be outweighed by single-user characteristics. 

[1] http://parlweb.parl.clemson.edu/pvfs/
[2] Nils Nieuwejaar, David Kotz, Apratim Purakayastha, Carla Schlatter Ellis, and Michael Best, File Access Characteristics of Parallel Scientific Workloads, IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, October 1996.
[3] Matthew M. Cettei, Walter B. Ligon III, and Robert B. Ross, Clemson University.
[4] http://parlweb.parl.clemson.edu/pvfs/papers.html, http://www-unix.mcs.anl.gov/~rross/resume.htm

 

Systems Administration Track Abstracts:


Systems Session I:  System Performance Issues
Title Checkpointing and Migration of Parallel Processes Based on Message Passing Interface
Author Zhang Youhui, Wang Dongsheng, Zheng Weimin
Author Inst Department of Computer Science, Tsinghua University, China
Presenter Zhang Youhui
Abstract Linux clusters offer a cost-effective platform for high-performance, long-running parallel computations and have been used widely in recent years. However, the main problem with programming on clusters is that the environment is prone to change: idle computers may be available for computation at one moment and gone the next due to load, failure or ownership [1]. The probability that a cluster will fail increases with the number of nodes, and a single abnormal event can cause the entire application to fail. To avoid this waste of computation, it is necessary to achieve high availability for clusters. Checkpointing & Rollback Recovery (CRR) and process migration offer a complete, low-overhead solution to this problem [1][2]. 

CRR, which records the process state to stable storage at regular intervals and restarts the process from the last recorded checkpoint upon system failure, is a method that avoids wasting the computation accomplished prior to the failure. Checkpointing also facilitates process migration, which suspends the execution of a process on one node and subsequently resumes its execution on another. However, checkpointing a parallel application is more complex than having each processor take checkpoints independently: to avoid the domino effect and other problems upon recovery, the state of inter-process communication must be recorded, and the global checkpoint must be consistent. Related CRR software includes MPVM [3], Condor [4] and Hector [5].

This paper presents ChaRM4MPI, a checkpoint-based rollback recovery and migration system for the Message Passing Interface on Linux clusters. The system designs and implements several important fault-tolerance mechanisms, including a coordinated checkpointing protocol, synchronized rollback recovery and synchronized process migration.

Using ChaRM4MPI, transient node faults can be recovered automatically, and permanent node faults can be recovered through checkpoint mirroring and process migration. If any node that is running the computation drops out of the cluster, the computation will not be interrupted. Moreover, users can migrate MPI processes from one node to another manually for load balancing or system maintenance. ChaRM4MPI is a user-transparent implementation: users do not need to modify their source code.

[Figure 1]

ChaRM4MPI (Figure 1) is based on MPICHP4, an MPI implementation developed at Mississippi State University and Argonne National Laboratory that employs the P4 library as its device layer. Our work focuses on the following:
  1. A coordinated CRR and migration protocol for parallel applications is designed.
  2. One management process is implemented as the coordinator of parallel CRR and process migration. It processes users’ checkpoint/migration commands received through a GUI.
  3. We modify the source code of the P4 listener. On receiving commands from the coordinator, the P4 listener interrupts the computing processes with signals. The most difficult task is designing this procedure carefully to avoid deadlock.
  4. Signal handlers are integrated with the P4 master to deal with CRR/migration signals; they employ libckpt [6] to take checkpoints. Moreover, a message-driven mechanism is implemented to record the state of inter-process communication. Potential deadlocks must be avoided here as well.
  5. We modify the start-up procedure of MPICHP4 so that all MPI processes register themselves with the coordinator, which can then maintain a global process-information table to manage the whole system.
In ChaRM4MPI, the computing processes on a node save their checkpoint information and in-transit messages to local disk. Checkpointing to local disk allows recovery from any number of transient node faults. Moreover, a RAID-like checkpoint mirroring technique is used to tolerate one or more permanent node faults: each node uses a background process to mirror the checkpoint file and related information to other nodes in addition to its local disk. When a node fails, the recovery information of the application processes that ran on it is available on other nodes, so those processes can be resumed on the corresponding nodes where the checkpoint information is saved and continue running from the checkpoint.
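The following sketch is illustrative only, not the ChaRM4MPI source; it shows the general shape of this signal-driven checkpointing, in which a coordinator delivers a signal and a handler records the process state. The checkpoint_here() prototype is assumed from the libckpt interface [6], and the choice of SIGUSR1 and the omission of message logging are simplifications.

/*
 * Illustrative sketch only -- not the ChaRM4MPI source.  A coordinator (or
 * the P4 listener) delivers a signal, and the handler takes a checkpoint.
 */
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* User-directed checkpoint call from libckpt [6]; prototype assumed here. */
extern void checkpoint_here(void);

static void on_checkpoint_signal(int sig)
{
    (void)sig;
    /* In ChaRM4MPI, in-transit messages are recorded as well so that the
       global checkpoint stays consistent; that step is omitted here. */
    checkpoint_here();
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_checkpoint_signal;
    sigaction(SIGUSR1, &sa, NULL);   /* coordinator's signal (assumed SIGUSR1) */

    for (;;)                         /* stand-in for the MPI computation */
        sleep(1);
    return 0;
}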

Several MPI parallel applications have been tested on ChaRM4MPI, including the sorting and LU-decomposition programs developed by NASA and an overfall simulation. We studied the extra time overhead introduced by the CRR mechanism, and the results are quite satisfactory: the average extra overhead per checkpoint is less than 2% of the original running time.

In the full text, the coordinated CRR and migration protocol is described first, and then the work involved in modifying MPICHP4 is introduced in detail. Finally, test results are presented and analyzed.

References:
1. Elnozahy E N, Johnson D B, Wang Y M. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. Technical Report CMU-CS-96-181, Carnegie Mellon University, Pittsburgh, PA, Oct. 1996.
2. Elnozahy E N. Fault Tolerance for Clusters of Workstations. In Banatre M and Lee P (editors), chapter 8, Springer-Verlag, Aug. 1994.
3. J. Casas, D. Clark, et al. MPVM: A Migration Transparent Version of PVM. Technical Report, Dept. of Computer Science & Engineering, Oregon Graduate Institute of Science & Technology.
4. M. J. Litzkow, et al. Condor - A Hunter of Idle Workstations. In Proc. of the 8th IEEE Intl. Conf. on Distributed Computing Systems, pp. 104-111, June 1988.
5. S. H. Russ, B. Meyers, C.-H. Tan, and B. Heckel. "User-Transparent Run-Time Performance Optimization". Proceedings of the 2nd International Workshop on Embedded High Performance Computing, associated with the 11th International Parallel Processing Symposium (IPPS 97), Geneva, April 1997.
6. James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent Checkpointing under Unix. In Proceedings of the 1995 USENIX Technical Conference, pages 213-224, January 1995.

 
Title An Analytical Model to Evaluate the Performance of Cluster Architectures
Author Leonardo Brenner, Paulo Fernandes, César A. F. De Rose
Author Inst Faculdade de Informatica, Catholic University of Rio Grande do Sul, Brazil
Presenter César A. F. De Rose
Abstract [Unavailable]

 
Systems Session II: Cluster Management and Monitoring
Title High-Performance Linux Cluster Monitoring Using Java
Author Curtis Smith, David Henry
Author Inst Linux NetworX
Presenter David Henry
Abstract Monitoring is at the heart of cluster management. Instrumentation data is used to schedule tasks, load-balance devices and services, notify administrators of hardware and software failures, and generally monitor the health and usage of a system. The information used to perform these operations must be gathered from the cluster without impacting performance. This paper discusses some of the performance barriers to efficient high-performance monitoring, and presents an optimized technique to gather monitored data using the /proc file system and Java.
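The paper's implementation is in Java; purely as a language-neutral illustration of the /proc approach, the following C sketch reads the one-minute load average and memory counters that such a monitor might collect. Field layouts follow the standard /proc/loadavg and /proc/meminfo formats.

/*
 * Not the paper's Java code -- a minimal C sketch of the underlying idea:
 * cluster metrics can be read cheaply from the /proc pseudo file system
 * without instrumenting the kernel.
 */
#include <stdio.h>

int main(void)
{
    /* One-minute load average: first field of /proc/loadavg. */
    double load1 = 0.0;
    FILE *f = fopen("/proc/loadavg", "r");
    if (f) {
        fscanf(f, "%lf", &load1);
        fclose(f);
    }

    /* Total and free memory: scan /proc/meminfo line by line. */
    unsigned long mem_total = 0, mem_free = 0;
    char line[128];
    f = fopen("/proc/meminfo", "r");
    if (f) {
        while (fgets(line, sizeof(line), f)) {
            sscanf(line, "MemTotal: %lu kB", &mem_total);
            sscanf(line, "MemFree: %lu kB", &mem_free);
        }
        fclose(f);
    }

    printf("load(1m)=%.2f  mem=%lu/%lu kB free\n", load1, mem_free, mem_total);
    return 0;
}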

 
Title CRONO: A Configurable Management System for Linux Clusters
Author Marco Aurélio Stelmar Netto, César A. F. De Rose
Author Inst Research Center in High Performance Computing, Catholic University of Rio Grande do Sul, Brazil
Presenter César A. F. De Rose
Abstract 1. Introduction
Cluster architectures are becoming a very attractive alternative when high performance is needed. With a very good cost/performance ratio and good scalability, large cluster systems with hundreds of nodes are becoming more popular in universities, research labs and industry. Linux is usually the operating system used in these systems because it is free, efficient and very stable.

One big problem in such a system is managing all the nodes efficiently as one machine and dealing with issues like access rights, time and space sharing, reservations and job queuing. When looking for such a manager for our research lab, we were not satisfied with the services and level of configuration provided by the most popular systems, such as the Computing Center Software [1] and the Portable Batch System [2], and decided to write our own management system.

2. Cluster management over Linux
The Crono system was developed to be a highly configurable manager for Linux clusters. It is coded in the C language and relies on several scripts and configuration files so that it can be adapted to specific administration needs and machine configurations. The Linux operating system has several advantages for the development of a cluster management system (CMS): Linux is an open source system and its configuration files are very simple to read and modify, so the CMS can define environment variables and access restrictions easily. Another main advantage is the possibility of modifying the kernel to better support the CMS.

3. Crono main functionalities
Crono provides two allocation modes, space-sharing and time-sharing. The first is used when the user needs exclusive access to the allocated nodes, for example when application performance is being measured. The second is used in situations where users are only testing their programs and therefore do not care about performance. Time-sharing is a very interesting alternative in teaching environments, allowing large groups of students to use the cluster at the same time.

Another main feature of Crono is the flexibility of its access-rights definition. Through configuration files the system administrator can create user categories and associate access restrictions with these categories or with individual users. These restrictions are defined by the maximum time and the maximum number of nodes used in allocations and reservations. Restrictions can also be defined based on the period of the day, the day of the week and the target machine.

To configure the execution environment for programs, Crono supplies scripts for pre- and post-processing of requisitions. When a user's time begins, Crono runs two scripts: one controlled by the administrator and the other by the user. This mechanism is very useful, for example, for automatically generating MPI [3] machine files. When the user's time is over, two post-processing scripts are used in the same way.

Users can interact with the system through a graphical interface or through shell commands (bash, tcsh, etc.).  Several services are provided for users and the system administrator, such as showing information about the allocation queue, submitting jobs for execution, releasing resources, configuring the execution environment and obtaining information on access rights.

Figure 1 presents the system architecture with its four main modules: the User Interface (UI), the Access Manager (AM), the Requisition Manager (RM) and the Node Manager (NM). Communication between the modules is done through the sockets interface, allowing modules to reside on different machines; a minimal sketch of such an exchange follows Figure 1. The full paper will present the Crono architecture in more detail.

[Figure 1]
Figure 1: Crono’s architecture and main modules
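As an illustration only (not Crono source code), the sketch below shows how one module might send a request to another over TCP, which is what allows the modules to be distributed across machines; the port number and the one-line request format are hypothetical.

/*
 * Illustrative sketch only (not Crono source): one way a module such as the
 * User Interface might send a request to the Access Manager over TCP.
 * The port number and the text protocol are hypothetical.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in am;                        /* Access Manager address */
    memset(&am, 0, sizeof(am));
    am.sin_family = AF_INET;
    am.sin_port   = htons(7000);                  /* hypothetical port */
    inet_pton(AF_INET, "127.0.0.1", &am.sin_addr);

    if (connect(fd, (struct sockaddr *)&am, sizeof(am)) < 0) {
        perror("connect");
        return 1;
    }

    /* Hypothetical request: allocate 4 nodes for 30 minutes. */
    const char *req = "ALLOC nodes=4 time=30\n";
    write(fd, req, strlen(req));

    char reply[256];
    ssize_t n = read(fd, reply, sizeof(reply) - 1);
    if (n > 0) {
        reply[n] = '\0';
        printf("AM replied: %s", reply);
    }
    close(fd);
    return 0;
}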

4. Using Crono in production mode
For the last six months Crono has been used to manage the three Linux clusters in our research lab [*]. Access to the clusters is centralized through one host machine that runs the Crono Access Manager and three Requisition Managers, one for each of the clusters. Each node of the clusters has its own Node Manager. Each cluster has its own peculiarities, having a different number of nodes (4, 16, and 32) and different interconnection networks (combinations of SCI, Fast Ethernet and Myrinet). We also support several programming environments, including different MPI implementations and some execution kernels developed in-house. With simple commands like cralloc, crcomp and crrun, users can choose the cluster, the number of nodes, the target interconnection network and the programming environment they want. Crono will automatically generate machine files, choose the right libraries for compilation, load the processes and start the program. If the needed nodes are not available, waiting queues are managed for each cluster. Because we have different types of users, access rights can be defined for each user depending on status, project, target machine and period of the day. A student may, for example, have access only to the small cluster during peak hours and to 16 nodes of the main cluster during the night. The system is running over the Linux Slackware 8 distribution and is already very stable. We developed the system in such a way that it can be easily configured to work with other cluster configurations and a different set of tools. More information about Crono, together with its source code, which is distributed under the GNU license, can be found at www.cpad.pucrs.br/crono.

5. References
[1] KELLER, Axel, REINEFELD, Alexander. Anatomy of a Resource Management System for HPC Clusters, vol. 3, 2001.
[2] OpenPBS web site, www.openpbs.org
[3] W. Gropp, E. Lusk, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, July 1996.

[*] Research supported by HP Brazil.

 
Title Parallel, Distributed Scripting with Python
Author(s) Patrick Miller
Author Inst. Lawrence Livermore National Laboratory, USA
Presenter Patrick Miller
Abstract [Need to convert from PDF to HTML]

 
Systems Session III:  Processor and Network Performance
Title A Study of Hyper-Threading in High Performance Computing Clusters
Author(s) Tau Leng, Rizwan Ali, Jenwei Hsieh, Victor Mashayekhi, Reza Rooholamini
Author Inst Dell Computer Corporation, USA
Presenter Tau Leng
Abstract Intel’s Hyper-Threading technology makes a single physical processor appear as two logical processors. The physical processor resources are shared and the architectural state is duplicated for the two logical processors [1]. The premise is that this duplication allows a single physical processor to execute instructions from two different threads in parallel rather than serially, and can therefore lead to better processor utilization. Processors enabled with Hyper-Threading technology can improve the performance of applications with a high degree of parallelism. Previous studies have shown that Hyper-Threading improves multi-threaded applications’ performance by 10 to 30 percent, depending on the characteristics of the application [2]. These studies also suggest that the potential gain is only obtained if the application is multi-threaded by some means of parallelization.

With the addition of Hyper-Threading support in Linux kernels 2.4.9-13 and above, Linux cluster practitioners have started to assess its performance impact on their applications. In our area of interest, High Performance Computing clusters, applications are commonly implemented using a standard message-passing interface, such as MPI or PVM. Applications implemented in the message-passing programming model are usually not multi-threaded. Rather, they use a mechanism like “mpirun” to spawn multiple processes and map them to the processors in the system. Parallelism is achieved through the message-passing interface among the processes to coordinate the parallel tasks.

Unlike multithreaded processes, in which the values of application variables are shared by all the threads, a message-passing application runs as a collection of autonomous processes, each with its own local memory. This type of application can also benefit from Hyper-Threading technology in that the number of processes spawned on the physical processors can be doubled on the logical processors, potentially completing the parallel tasks faster. Applying Hyper-Threading and doubling the number of processes running on the cluster increases the utilization of the system's execution-unit resources, and the performance can therefore be improved. On the other hand, overheads might also be introduced.


[Figure 1]
Figure 1. The HPL Linpack performance results on a single node. The thick lines are the results of 1-physical-processor runs and the thin lines are the results of 2-physical-processor runs, with HT (Hyper-Threading) disabled and enabled respectively.

Whether these overheads can be amortized by the performance enhancement from better utilization of resources, or by doubling the processes to complete the parallel job faster, depends on the application characteristics. In this paper, we use an experimental approach to demonstrate the impact of Hyper-Threading on a small Linux cluster using various MPI benchmark programs, and discuss the adaptability of this new technology to HPC clusters for improving performance.

Our testing environment is based on a cluster consisting of 8 Dell PowerEdge 4600 servers interconnected with Myrinet. Each PowerEdge 4600 has two Intel Pentium 4 Xeon processors running at 2.2 GHz with 512 KB L2 cache and 2 GB of 400 MHz DDR-RAM memory. The chipset of the PowerEdge 4600 is the ServerWorks GC-HE, which accommodates up to twelve registered DDR (double data rate) 200 (PC1600) DIMMs, each with capacities of 128 MB through 1 GB, in a 4-way interleaved memory architecture. Each of the two PCI-X controllers on the 4600 has its own dedicated 1.6 GB/s full-duplex connection to the North Bridge to accommodate the peak traffic generated by the PCI-X busses it controls.


[Figure 2]
Figure 2. The HPL Linpack performance results of single-CPU cluster runs (left) and dual-CPU cluster runs (right), from 1 node to 8 nodes. The memory used, or problem size, for each run is based on the principle of 1 GB of RAM per physical processor.

The MPI program we used for the experiments is High-Performance Linpack (HPL), commonly used in the High Performance Computing arena. It uses a number of linear algebra routines to measure the time it takes to solve dense linear equations in double-precision (64-bit) arithmetic using the Gaussian elimination method. The measurement obtained from Linpack is the number of floating-point operations per second (FLOPS). Linpack mainly exercises the floating-point calculation capability of the system. However, the communication bandwidth of the system also plays a significant role in the overall performance: when using dual-processor compute nodes interconnected with high-speed networking, such as Myrinet, the actual performance of a cluster may reach almost 60% of its theoretical peak, while the percentage can be less than 30% when using a slower interconnect like Fast Ethernet [3].
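For orientation, the efficiency figure is simply the measured HPL result divided by the cluster's theoretical peak. The short calculation below illustrates this for the test cluster, assuming two double-precision floating-point operations per clock for the Pentium 4 Xeon and using a purely hypothetical measured value.

/*
 * Back-of-the-envelope helper, not from the paper: relate a measured HPL
 * result to the theoretical peak of the test cluster.  The two flops per
 * cycle (SSE2 double precision) and the sample measured value are
 * assumptions for illustration only.
 */
#include <stdio.h>

int main(void)
{
    const int    nodes          = 8;
    const int    cpus_per_node  = 2;
    const double clock_ghz      = 2.2;
    const double flops_per_clk  = 2.0;   /* assumed for SSE2 double precision */

    double peak_gflops = nodes * cpus_per_node * clock_ghz * flops_per_clk;

    double measured_gflops = 42.0;       /* hypothetical HPL result */
    printf("peak = %.1f GFLOPS, measured = %.1f GFLOPS, efficiency = %.0f%%\n",
           peak_gflops, measured_gflops, 100.0 * measured_gflops / peak_gflops);
    return 0;
}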

To understand the impact of Hyper-Threading on a single compute node, we first conducted a series of HPL runs from small problem sizes to large. The results shown in Figure 1 indicate that only when the problem size (the memory used) is sufficiently large, 1600x1600 grid blocks or more, do we see a modest performance improvement (around 5%) in the Hyper-Threading configurations. Also note that, for the runs where the number of processes is greater than the number of physical processors and Hyper-Threading is not enabled, the performance is considerably worse.

We then ran HPL on the cluster with different combinations of compute nodes and CPU counts, and compared the results with Hyper-Threading disabled and enabled. Figure 2 shows the performance results of single-physical-CPU runs on the left and dual-physical-CPU runs on the right. The performance gains from Hyper-Threading are larger on single-physical-CPU configurations than on dual-physical-CPU configurations, approximately 10% vs. 5%. That is because the “overheads” mentioned above are more severe for the configurations with four processes on each node when Hyper-Threading is enabled.

Although our preliminary study indicated that Linpack-type applications can benefit from Hyper-Threading, we have seen mixed results when running the NAS parallel benchmark suite. For example, IS (Integer Sort), a communication-bound benchmark, showed about 10% degradation in performance when applying Hyper-Threading to 8x2 (8 nodes with 2 CPUs on each node) runs, while EP (Embarrassingly Parallel), a CPU-bound benchmark, showed a 40% performance improvement. These two benchmarks are the two extreme cases of the NAS suite. IS is unique in that floating-point arithmetic is not involved; significant data communication, however, is required. With Hyper-Threading enabled, doubling the IS processes running on each node creates much more communication overhead among processes and memory contention inside the nodes, yet the CPU floating-point execution units are still underutilized. Hence, the performance degraded. On the other hand, EP mainly performs floating-point calculations and requires almost no communication during the runs. With Hyper-Threading enabled, running EP can fully utilize the CPU resources of the cluster without concern for communication overhead, which increases performance significantly.

In summary, Hyper-Threading can improve the performance of some MPI applications, but not all, on Linux clusters. Depending on the cluster configuration and, most importantly, the nature of the application running on the cluster, the performance gain can vary or even be negative. Our next step is to use performance tools to understand which areas contribute to the performance gains and which contribute to the overheads that lead to performance degradation.


Reference
[1] D. T. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 01, February 2002.
[2] W. Magro, P. Petersen, S. Shah, “Hyper-Threading Technology: Impact on Compute-Intensive Workloads”, Intel Technology Journal, Vol. 6, Issue 01, February 2002.
[3] “HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers”, http://www.netlib.org/benchmark/hpl/

 
Title A Performance Comparison of Myrinet Protocol Stacks
Author(s) Ron Brightwell, Bill Lawry, Michael Levenhagen, Arthur B. Maccabe
Author Inst University of New Mexico, USA
Presenter Ron Brightwell
Abstract Recent advances in user-level networking technology for cluster computing have led to significant improvements in achievable latency and bandwidth performance. Many of these improvements are based on an implementation strategy called Operating System Bypass, or OS-bypass, which attempts to increase network performance and reduce host CPU overhead by offloading communication operations to intelligent network interfaces. These interfaces, such as Myrinet, are capable of moving data directly from an application's address space without any involvement of the operating system in the data transfer.

Unfortunately, the reduction in host CPU overhead, which has been shown to be the most significant factor affecting application performance, has not been realized in most implementations of MPI for user-level networking technology. While most MPI microbenchmarks can measure latency, bandwidth, and host CPU overhead, they fail to accurately characterize the actual performance that applications can expect. Communication microbenchmarks typically focus on message-passing performance relative to the peak performance of the network, and do not characterize the performance impact of message passing relative to both the peak performance of the network and the peak performance of the host CPU(s).

We have designed and implemented a portable benchmark suite called COMB, the Communication Offload MPI-based Benchmark, that measures the ability of an MPI implementation to overlap computation and MPI communication. The ability to overlap is influenced by several system characteristics, such as the quality of the MPI implementation and the capabilities of the underlying network transport layer. For example, some message passing systems interrupt the host CPU to obtain resources from the operating system in order to receive packets from the network. This strategy is likely to adversely impact the utilization of the host CPU, but may allow for an increase in MPI bandwidth. We believe our benchmark suite can provide insight into the relationship between network performance and host CPU performance in order to better understand the actual performance delivered to applications.
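The following is not the COMB suite itself, but a minimal sketch of the measurement idea it is built around: compare the amount of computation a process completes in a fixed window with and without a large message transfer in progress. The message size, window length and two-rank setup are arbitrary choices.

/*
 * Minimal sketch (not the COMB code) of measuring CPU availability during
 * communication: count work units completed in a fixed window, first with
 * no communication, then while a large message is being received.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (8 * 1024 * 1024)

/* A small unit of busy work; the result is returned so it is not optimised away. */
static double work_unit(void)
{
    double s = 0.0;
    for (int i = 1; i <= 10000; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {                      /* the sketch needs a sender and a receiver */
        MPI_Finalize();
        return 0;
    }

    char *buf = malloc(MSG_BYTES);
    double sink = 0.0;

    if (rank == 0) {
        /* Baseline: work units completed in a fixed window with no communication. */
        double t0 = MPI_Wtime();
        long base = 0;
        while (MPI_Wtime() - t0 < 1.0) { sink += work_unit(); base++; }

        /* Same window while a large message is being received. */
        MPI_Request req;
        MPI_Irecv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        int done = 0;
        long overlapped = 0;
        t0 = MPI_Wtime();
        while (MPI_Wtime() - t0 < 1.0) {
            sink += work_unit();
            overlapped++;
            if (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("CPU availability during receive: about %.0f%% (checksum %g)\n",
               100.0 * (double)overlapped / (double)base, sink);
    } else if (rank == 1) {
        MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}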

In this paper, we will compare four different network protocol stacks on Myrinet using identical computing and networking hardware. In addition to MPI microbenchmark results for latency and bandwidth, we will gather results from the COMB suite and characterize CPU availability and the ability to overlap computation with MPI communication. We expect this direct comparison to provide insight into how performance is impacted at each layer of the protocol stack. The following describes the four different protocol stacks that we will compare.

At the highest level, all of our benchmark tests are coded using MPI. This gives us the maximum amount of portability between systems and establishes a baseline for comparison. Below MPI, we will be comparing systems that use the GM message-passing API from Myricom and our Portals 3.0 API.

The Portals 3.0 API is composed of elementary building blocks that can be combined to implement a wide variety of higher-level data movement layers. We have tried to define these building blocks and their operations so that they are flexible enough to support other layers as well as MPI. Data movement in Portals is based on one-sided operations. Other processes can use a Portal index to read (get) or write (put) the memory associated with the portal. Each data movement operation involves two processes, the initiator and the target. In a put operation, the initiator sends a put request containing the data to the target. The target translates the Portal addressing information using its local Portal structures. When the request has been processed, the target may send an acknowledgement. In a get operation, the initiator sends a get request to the target. The target translates the portal addressing information using its local Portal structures and sends a reply with the requested data.
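The sketch below is purely conceptual and does not use the real Portals 3.0 API; the names are hypothetical stand-ins. It simulates, inside a single program, the two flows just described, with the portal index selecting a target-side memory region for a one-sided put and get.

/*
 * Conceptual sketch only -- hypothetical names, not the Portals 3.0 API.
 * A put pushes data to the target, which acknowledges; a get asks the
 * target, which replies with the contents of the addressed region.
 */
#include <stdio.h>
#include <string.h>

#define NPORTALS 4
#define REGION   64

/* Each simulated process exposes one memory region per portal index. */
typedef struct {
    const char *name;
    char regions[NPORTALS][REGION];
} process;

/* Put: the request carries the data; the target translates the portal index
 * to a local region, deposits the data, and returns an acknowledgement. */
static int portal_put(process *initiator, process *target,
                      int portal_index, const char *data)
{
    memcpy(target->regions[portal_index], data, strlen(data) + 1);
    printf("%s -> %s : put on portal %d, ack returned\n",
           initiator->name, target->name, portal_index);
    return 0;                            /* acknowledgement */
}

/* Get: the request carries no data; the target translates the portal index
 * and replies with the contents of the addressed region. */
static const char *portal_get(process *initiator, process *target,
                              int portal_index)
{
    printf("%s -> %s : get on portal %d, reply returned\n",
           initiator->name, target->name, portal_index);
    return target->regions[portal_index];
}

int main(void)
{
    process a = { "initiator", {{0}} };
    process b = { "target",    {{0}} };

    portal_put(&a, &b, 2, "hello");                      /* one-sided write */
    printf("read back: %s\n", portal_get(&a, &b, 2));    /* one-sided read  */
    return 0;
}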

For our initial implementation of Portals 3.0 on Myrinet, we chose a simple RTS/CTS protocol called RMPP (Reliable Message Passing Protocol). RMPP supports end-to-end flow control and message-level retransmission in the event of a dropped or corrupt data packet. In this protocol, all management decisions are made by the receiver when handling an RTS packet. By the time the receiver has generated a CTS packet, it has made all of the significant management decisions, including where the cleared data packets will be placed in the application's address space and the buffer space for incoming packets that will be needed by the network interface.

Portions of the RMPP can be offloaded onto intelligent or programmable network interfaces. Our current implementation is a Linux kernel module that interfaces into the Linux networking stack, making it possible to use any network device that Linux supports. We are currently developing an enhanced version of this protocol that offloads parts of the protocol to the Myrinet network interface. We expect this offloading to offer significant performance improvements.

In addition to these two RMPP-based implementations of Portals, we also have a Myrinet Control Program that implements Portals on top of a reliability protocol modeled after the one used for the wormhole network on the ASCI/Red machine. In this implementation, all of the reliability and flow control protocols are implemented in the Myrinet NIC, while Portals processing can occur either on the NIC or in the kernel. We will discuss the design of this protocol and contrast it with the approaches taken in RMPP.

Finally, we compare results from these protocols with the standard Myrinet GM and MPICH/GM software stack available from Myricom.

 

Vendor Presentations


Vendor Session I
Title Delivering Linux Cluster Solutions to the High-Performance Marketplace
Author(s) Joseph Banas
Author Inst IBM
Presenter Joseph Banas
Abstract IBM continues to deliver many of the most powerful Linux clusters in the world, and has an established track record in building several "top 500" clusters. In this presentation, we will survey the current set of offerings around Linux clusters from IBM, with a particular focus on the IBM eServer Linux Cluster 1350 that was unveiled at LinuxWorld Expo in San Francisco earlier this summer. This presentation will also highlight how IBM is working with key Partners and ISVs in the High Performance Computing arena to deliver solutions that meet the demanding needs of these customers.

 
Vendor Session II
Title Designing Supercomputers that Fit Your Needs
Author(s) Tim Mattson
Author Inst Intel
Presenter Tim Mattson
Abstract The architecture wars in High Performance Computing are nearing their logical conclusion. You can find plenty of confusing data to muddle the issue, but the overall direction of HPC is clear. Clusters are where people are investing their time and money for supercomputing. The reason for this state of affairs is crystal clear: the economics of the semiconductor industry combined with the technology treadmill called Moore's Law makes it all but impossible for any other supercomputer architecture to compete with clusters.

In this talk, I will explore the state of cluster computing today and where it is going tomorrow. We will take a look at the next generation building-blocks available to the cluster designer and how these building-blocks let us build supercomputers that are carefully tuned to our specific needs (and price-points).

This is not going to be a pure "cheerleading" talk, however. In addition to "singing the praises of clusters", I will also discuss the challenges presented by clusters. Clusters are not as easy to use as vector supercomputers. Clusters are poorly balanced with respect to moving data through the memory/storage hierarchy. We can use these challenges to make excuses and justify the need to prop up failed supercomputing business models. Or we can use these challenges to motivate the algorithm work required to make clusters work across the full spectrum of HPC problems. It all comes down to whether you want to embrace the future, or hang on to the past.