Last updated: 11 October 2002
|Title||Seismic & Linux: From Curiosity to Commodity|
|Author(s)||Dr. Mike Turff|
|Author Inst||PGS Data Processing|
|Presenter||Dr. Mike Turff|
|Abstract||The seismic processing business has always required massive amounts of compute capacity and has historically been an early adopter of anything providing price/performance improvements in this field. This has been exacerbated by the competitive nature of this business, the downward pressure on prices from the oil majors and the tender & bid process that decides the allocation of the available work. In this talk, I review the history of computer systems in PGS, the move to Linux clusters and the new applications that this price/performance step has enabled our industry to bring to the seismic processing market. I will talk from a business perspective about the acceptance of Linux clusters and show that, for the right applications, Linux clusters can be regarded as off-the-shelf commodities.|
|Title||Scaling Up and Scaling Down|
|Author(s)||Dr. Daniel A. Reed|
|Author Inst||National Center for Supercomputing Applications (NCSA)|
|Presenter||Dr. Daniel A. Reed|
|Abstract||The continuum of computing continues to expand, ranging from mobile, embedded (or soon implantable) devices to terascale and petascale clusters. New applications of distributed sensors (e.g., environmental or structural monitoring, health and safety) pose new software challenges. How do we manage millions or even billions of mobile devices, each with power, weight, processing, and memory constraints? At the other end of the continuum, large-scale science and engineering applications demand ever faster and larger clustered systems. How do we design, package, and support systems with hundreds of thousands of processors in a reliable way? This talk sketches some of the technical challenges and opportunities in realizing the ubiquitous infosphere of interconnected systems, highlighted by recent developments at NCSA and the U.S. TeraGrid.|
|Title||Challenges of Terascale Computing|
|Author(s)||Dr. William Pulleyblank|
|Author Inst||IBM Research|
|Presenter||Dr. William Pulleyblank|
|Abstract||We are rapidly moving into the domain of terascale computing, which includes the ASCI program and the Blue Gene project. In addition to the significant hardware problems to be solved, we will also face software issues that are at least as large. I will discuss these in the context of Blue Gene, a 180 teraflop/s supercomputer being built at IBM Research that will run Linux as its operating system. Enabling Linux for this platform has necessitated a number of innovations in order to achieve the targeted levels of performance. We also discuss issues of reliability and availability, the so-called autonomic issues. We relate this to current activity on grid computing.|
|Title||Scalability and Performance of Salinas on the Computational Plant|
|Author(s)||Manoj Bhardwaj, Ron Brightwell, Garth Reese|
|Author Inst||University of New Mexico, USA|
|Abstract||Parallel computing platforms composed
of commodity personal computers (PCs) interconnected
by gigabit network technology are a viable alternative
to traditional proprietary supercomputing platforms.
Small- and medium-sized clusters are now ubiquitous,
and larger-scale procurements, such as those made
recently by NSF for the Distributed Terascale Facility
and by Pacific Northwest National Lab, are becoming
more prevalent. The cost effectiveness of these
platforms has allowed for larger numbers of processors
to be purchased.
In spite of the continued increase in the number of processors, few real-world application results on large-scale clusters have been published. Traditional large parallel computing platforms have benefited from many years of research, development, and experience dedicated to improving their scalability and performance. PC clusters have only recently started to receive this kind of attention. In order for clusters of PCs to compete with traditional proprietary large-scale platforms, this type of knowledge and experience may be crucial as larger clusters are procured and constructed.
The goal of the Computational Plant (Cplant) project at Sandia National Laboratories is to provide a large-scale, massively parallel computing resource composed of commodity PCs that not only meets the level of compute performance required by Sandia's key applications, but that also meets the levels of usability and reliability of past traditional large-scale parallel machines. Cplant is a continuation of research into system software for massively parallel computing on distributed-memory message-passing machines. We have transitioned our scalable system software architecture developed for large-scale massively parallel processing machines to commodity clusters of PCs.
In this paper, we examine the performance and scalability of one of Sandia's key applications on Cplant. We present performance and scalability results from Salinas, a general-purpose, finite element structural dynamics code designed to be scalable on massively parallel processing machines. Salinas offers static analysis, direct implicit transient analysis, eigenvalue analysis for computing modal response, and modal superposition-based frequency response and transient response. In addition, semi-analytical derivatives of many response quantities with respect to user-selected design parameters are calculated. It also includes an extensive library of standard one-, two-, and three-dimensional elements, nodal and element loading, and multi-point constraints. Salinas solves systems of equations using an iterative, multilevel solver, which is specifically designed to exploit massively parallel machines. Salinas uses a linear solver that was selected based on the criteria of robustness, accuracy, scalability and efficiency.
Salinas employs a multilevel domain decomposition method, Finite Element Tearing and Interconnect (FETI), which has been characterized as the most successful parallel solver for the linear systems applicable to structural mechanics. FETI is a mature solver, with some versions used in commercial finite element packages. For plates and shells, the singularity in the linear systems has been traced to the subdomain corners. To solve such linear systems, an additional coarse problem is automatically introduced that removes the corner point singularity. FETI is scalable in the sense that as the number of unknowns increases and the number of unknowns per processor remains constant, the time to solution does not increase. Further, FETI is accurate in the sense that the convergence rate does not deteriorate as iterates converge. Finally the computational bottleneck in FETI, a sparse direct subdomain solve, is amenable to high performance solution methods. An eigensolver was selected for Salinas based on robustness, accuracy, scalability and efficiency.
In this paper, we will describe in detail the hardware architecture of our 1792-node Cplant cluster. This machine is composed of single-processor Compaq Alpha workstations connected by a Myrinet gigabit network in a three-dimensional mesh topology. This machine has many unique characteristics, such as the ability to switch a large portion of the machine from any one of four network "heads" to another, allowing for greater flexibility in responding to user demand for classified versus unclassified computing.
In addition to hardware, we describe the system software environment of our cluster. Linux is used as the base operating system for all nodes in the cluster, and we have designed and developed software for the parallel runtime system, high-performance message passing, and parallel I/O. We will discuss the details of this software environment that influence the scalability and performance of applications.
Finally, we describe the performance and scalability of Salinas on up to 1000 nodes of Cplant. We will also discuss some of the challenges related to scaling Salinas beyond several hundred processors and how these challenges were addressed and overcome. We will show that the performance and scalability of Salinas is comparable to proprietary large-scale parallel computing platforms.
|Title||Real Time Response to Streaming Data on Linux Clusters|
|Author(s)||Beth Plale, George Turner, and Akshay Sharma|
|Author Inst.||Indiana University|
|Abstract||Beowulf-style cluster computers
which have traditionally served batch jobs via
resource managers such as the Portable Batch System
(PBS) are now under pressure from the scientific
community to support real time response to data
stream applications. That is, timely response to
events streamed from remote sensors or humans interactively
exploring the parameter space. Cluster management
must be agile and timely in its response to new
resource demands imposed by the occurrence of real-world
events. The work described in this paper
quantitatively evaluates two approaches to integrating
support for streaming applications into the framework
of a PBS managed cluster. One approach works cooperatively
with PBS to reallocate computational resources
to meet the needs of real time data analysis. The
second approach works outside of the PBS context
utilizing the real time scheduling capabilities
of the Linux kernel to acquire the resources necessary for its real-time response.
Beowulf-style clusters have traditionally used resource managers such as PBS Pro [6] to optimize the utilization of a cluster’s resources. Today there is a growing need for these clusters to be responsive to non-deterministic event-driven data streams. Upon recognition of an application-specific event occurring outside the realm of PBS and the cluster, the necessary CPU, memory, network, and file system resources must be directed toward a timely response.
Most production Beowulf clusters use batch scheduling to ensure fairness among users and to maximize a cluster’s throughput. But these objectives are at odds with the dynamic nature of data stream processing [5, 4, 1, 2], where resource needs cannot be predicted in advance, and any projection of duration of usage can only be estimated because of the variability of real-world events such as storms, earthquakes, etc.
This paper addresses the need for integrated support for streaming applications in a batch job scheduled system. Through a quantitative evaluation of startup time in response to an environmental event, we are exploring the solution space for real-time responsiveness to data streaming.
An atmospheric scientist at IU studies atmosphere-surface exchange of trace chemicals such as CO2, water, and energy from a variety of ecosystems. These atmospheric flux measurements generate large streams of data; sensors operating at rates up to 10 Hz can generate data at rates of hundreds of MB/day/sensor. The remote sensors use wireless 802.11b to communicate to a central computer remotely located within the forest. The remote computer is connected via a T1 line to the IUB campus network. Under normal operation the data from the remote sensors are collected at the remote site and streamed to IU at a nominal rate of 1-3 Hz, where the event data are analyzed for particular meteorological conditions. The response to the detection of a meteorological condition is an increase in the event rate and the attendant enlistment of additional CPU cycles to handle the high-fidelity analysis.
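As a rough sanity check on the quoted rates, the daily volume per sensor follows directly from the sample rate and record size (the 250-byte record here is a hypothetical figure for illustration, not one from the paper):

```python
def daily_volume_mb(sample_rate_hz, bytes_per_sample):
    """Approximate per-sensor data volume for one day of acquisition."""
    seconds_per_day = 24 * 60 * 60           # 86,400 s
    samples = sample_rate_hz * seconds_per_day
    return samples * bytes_per_sample / 1e6  # decimal megabytes

# A 10 Hz sensor emitting ~250-byte records (assumed record size)
# already lands in the "hundreds of MB/day/sensor" range quoted above.
volume = daily_volume_mb(10, 250)            # ~216 MB/day
```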
This work is being carried out on the proto-AVIDD facility on the Bloomington campus of Indiana University. The cluster consists of five (5) dual-processor Pentium-IV Xeon nodes, two (2) dual-processor Itanium IA64 nodes, and a RAID-5 filesystem NFS-served internally to the cluster. Proto-AVIDD runs RedHat 7.2 with cluster management tools provided by the OSCAR 1.2 project. Proto-AVIDD is a testbed cluster used to study the issues of interactive and stream data analysis in a production-level Linux cluster.
AVIDD (Analyzing and Visualizing Instrument Driven Data) will be a geographically distributed data analysis cluster sited at three of the IU campuses: Bloomington (IUB), Indianapolis (IUPUI), and Gary (IUN). Due to ongoing vendor negotiations, an explicit description of the AVIDD cluster is not possible at this time; however, it is expected to be in the range of 0.5–1 TFLOPS aggregate, with 6 TBytes of distributed cluster storage, and will leverage IU’s existing HPSS-based massive data storage system. Local intra-cluster networking will be provided by a Myrinet fabric, with the inter-cluster networking utilizing Indiana’s 1 Tb/s I-Light infrastructure. By the time of the conference, it is anticipated that initial experiences and early results will be available from the production AVIDD cluster.
Application events streaming from remote sources must be analyzed in real time in order to achieve timely recognition of scientifically interesting meteorological conditions. Our approach is to employ a user space daemon resident on a dedicated communications node within the AVIDD cluster to receive the data stream and monitor it for interesting events. Under normal conditions, the daemon simply drops the events into an event cache implemented as a circular buffer. When a meteorological event occurs, the daemon responds first by sending a message to the remote computer commanding it to reconfigure the sensor system for higher fidelity data acquisition, and second by enlisting additional resources within the cluster to process the higher fidelity data streams. This startup scenario is depicted in Figure 1.
Figure 1. Data stream processing response to application event of interest.
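The daemon's event-cache behaviour described above can be sketched with a bounded ring buffer; the event format and the "interesting" test below are hypothetical stand-ins, not the authors' actual daemon:

```python
from collections import deque

class EventCache:
    """Core of a user-space stream daemon: buffer incoming sensor
    events in a fixed-size circular buffer and flag interesting ones."""

    def __init__(self, capacity, is_interesting):
        self.buffer = deque(maxlen=capacity)   # oldest events overwritten
        self.is_interesting = is_interesting

    def receive(self, event):
        self.buffer.append(event)
        if self.is_interesting(event):
            # In the real daemon this would trigger sensor
            # reconfiguration and cluster resource enlistment.
            return "escalate"
        return "cached"

# Hypothetical condition: a wind-speed reading above a threshold.
cache = EventCache(capacity=4, is_interesting=lambda e: e["wind"] > 20.0)
for reading in (5.0, 7.5, 25.0):
    status = cache.receive({"wind": reading})
```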
We are exploring the solution space for real-time responsiveness to data streaming through a quantitative evaluation of startup time in response to an environmental event. Our first approach is to work within the PBS domain to allocate resources and provide job startup services in response to the detection of a developing real-world phenomenon. The second approach explores a 'bullying' relationship with PBS in which the real-time scheduling features of the Linux kernel are utilized to preempt resources already allocated by PBS. The full paper will quantify response times under both approaches and explore interesting issues involved in each approach.
Startup Within Context of PBS Resource managers such as PBS have a coherent view of a cluster’s state. They also have the tools necessary to stop, start, and signal jobs running under their control. It would appear to be a natural choice to use a resource manager’s preexisting infrastructure to implement cluster resource reallocation. In the final paper we will evaluate the steps involved in the startup of the high-fidelity processes as well as the benefits and shortcomings of this approach.
Linux Real-time Scheduling Startup High-fidelity processes are scheduled and started up outside the PBS domain using normal Linux process control mechanisms. If these data analysis tasks are scheduled by the kernel scheduler as real-time processes, they will be assured of being scheduled on the CPUs whenever they are runnable. Resources already used by a previous PBS job would only be consumed as needed. For example, memory previously used by the PBS job would be swapped out only as needed. In the final paper we will evaluate the steps involved in the startup of the high-fidelity processes as well as the benefits and shortcomings of this approach.
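On Linux, this preemption mechanism maps onto the `sched_setscheduler(2)` system call; a minimal sketch using Python's `os` wrapper (promotion to SCHED_FIFO requires root privileges, so the sketch falls back gracefully when unprivileged):

```python
import os

def promote_to_realtime(priority=50):
    """Try to place the calling process in the SCHED_FIFO real-time
    class, as the 'bullying' startup does for high-fidelity analysis
    tasks; fall back to the normal time-sharing class if unprivileged."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except PermissionError:
        pass  # needs root (or, on modern kernels, CAP_SYS_NICE)
    return os.sched_getscheduler(0)

policy = promote_to_realtime()
```

Once in SCHED_FIFO, the process runs ahead of every normal time-sharing task whenever it is runnable, which is exactly how it "bullies" CPU time away from PBS-scheduled jobs.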
Other Issues Issues which will be touched upon but not discussed in detail include SSH, the Globus Toolkit [3], MPI, process preemption, and process checkpointing. A general discussion of the various steps involved in the data acquisition, data transmission, and data disposition will be provided as an introductory foundation.
[1] Shivnath Babu and Jennifer Widom. Continuous queries over data streams. In International Conference on Management of Data (SIGMOD), 2001.
[2] Fabián E. Bustamante and Karsten Schwan. Active I/O streams for heterogeneous high performance computing. In Parallel Computing (ParCo) 99, Delft, The Netherlands, August 1999.
[3] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11(2):115–128, 1997.
[4] Sam Madden and Michael J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In International Conference on Data Engineering (ICDE), 2002.
[5] Beth Plale and Karsten Schwan. dQUOB: Managing large data flows using dynamic embedded queries. In Proc. 9th IEEE Intl. High Performance Distributed Computing (HPDC), Los Alamitos, CA, August 2000. IEEE Computer Society.
[6] PBS Pro: Portable Batch System. http://www.pbspro.com/, 2002.
|Title||Remote Memory Operations on Linux Clusters Using the Global Arrays Toolkit, GPSHMEM and ARMCI|
|Author(s)||Chona S. Guiang, A. Purkayastha and K. F. Milfeld|
|Author Inst||Texas Advanced Computing Center,University of Texas - Austin, USA|
|Presenter||Chona S. Guiang|
|Abstract||Remote memory access (RMA) is a
communication model that offers different functionalities
from those provided in shared memory programming
and message passing. The use of one-sided communication
exhibits some of the advantages of shared memory
programming such as direct access to global data,
without suffering from the limited scalability
that characterizes SMP applications. Moreover,
RMA has advantages over MPI in two important respects:
1. Communication overhead: MPI send/receive routines generate two-way communication traffic (“handshaking”) to ensure accuracy and coherency in data transfers, which is often unnecessary. In some scientific applications (e.g., some finite difference and molecular dynamics codes), the communication workload involves accessing data that does not require explicit cooperation between the communicating processes (e.g., writes to non-overlapping locations in memory). Remote memory operations in ARMCI, the Global Arrays (GA) Toolkit, and GPSHMEM decouple the data transfer operation from handshaking and synchronization. No extra coordination is required between processes in a one-sided put/get operation, so the often-unnecessary enforced synchronization is avoided. Additional calls to synchronization constructs may be included in the application as needed.
2. API: The distributed data view in MPI is less intuitive to use in scientific applications that require access to a global array. In contrast, the GA API provides a global data view of the underlying distributed data that does not distinguish between accessing local versus remote array data.
The Global Arrays (GA) Toolkit provides an SMP-like development environment for applications on distributed computing systems. GA implements a global address space, which makes it possible for elements in distributed arrays to be accessed as if they were in shared memory. The details of actual data distribution are hidden, but are still within user control through utility functions included in the library. MPI and GA library calls can be combined in a single program, as GA is designed to complement rather than replace existing implementations of message passing libraries. The GA library consists of routines that implement one-sided operations, interprocess synchronization, collective array operations, and utility and diagnostic operations (e.g., for querying available memory, data locality). As the name implies, GA implements a global address space for arrays of data, not scalars. Therefore, GA one-sided communications are designed to access arrays of data (double, integer, and double complex), not arbitrary memory addresses.
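The global-versus-distributed data view can be pictured with a toy array class; this is only a serial sketch of the idea (the real GA API is a C/Fortran library doing genuine remote memory access):

```python
class ToyGlobalArray:
    """Mimic GA's global index space over block-distributed data.
    Each simulated 'process' owns one contiguous block of a 1-D
    array; callers use global indices and never see the distribution."""

    def __init__(self, length, nprocs):
        self.block = (length + nprocs - 1) // nprocs
        # One list per simulated process, standing in for its local memory.
        self.local = [[0.0] * self.block for _ in range(nprocs)]

    def _locate(self, i):
        return i // self.block, i % self.block   # (owner, offset)

    def put(self, i, value):                     # one-sided write
        owner, off = self._locate(i)
        self.local[owner][off] = value

    def get(self, i):                            # one-sided read
        owner, off = self._locate(i)
        return self.local[owner][off]

ga = ToyGlobalArray(length=8, nprocs=4)
ga.put(5, 3.14)      # the caller neither knows nor cares that
value = ga.get(5)    # element 5 lives on simulated process 2
```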
GPSHMEM is a portable implementation of the Cray Research Inc. SHMEM library, which is a one-sided communication library for Cray and SGI systems. Like GA, GPSHMEM implements a global address space for accessing distributed data, but GPSHMEM contains support for more data types. GPSHMEM operations are limited to symmetric data objects, objects for which local and remote addresses have a known relationship (e.g., arrays in common blocks). The library routines provide the following functionalities: 1) noncollective RMA operations, including those for strided access, and atomic operations; 2) collective operations; 3) parallel environment query functions. In addition to the supported Cray SHMEM routines, GPSHMEM has added block-strided get/put operations for 2D arrays.
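The strided transfers mentioned above can be pictured as a gather over a flat "remote" buffer; the function below is a pure-Python stand-in for SHMEM-style strided get (names and layout are illustrative, not the GPSHMEM API):

```python
def strided_get(remote, start, stride, count):
    """Fetch `count` elements from a (simulated) remote buffer,
    taking every `stride`-th element beginning at `start` --
    e.g. one column of a row-major 2-D array."""
    return [remote[start + k * stride] for k in range(count)]

# Column 1 of a 3x4 row-major matrix stored as a flat remote buffer.
flat = [10, 11, 12, 13,
        20, 21, 22, 23,
        30, 31, 32, 33]
column = strided_get(flat, start=1, stride=4, count=3)   # [11, 21, 31]
```

A real strided get issues many small, non-adjacent accesses, which is why the noncontiguous-transfer support in ARMCI matters for performance.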
All RMA operations in GA and GPSHMEM use the Aggregate Remote Memory Copy Interface (ARMCI) library. ARMCI is a portable, one-sided communication library. On Linux clusters, ARMCI relies on low-level network communication layers (e.g., Elan for Quadrics, GM for Myrinet) to implement remote memory copies, while utilizing the native MPI implementation for collective communication operations. Noncontiguous data transfers are supported in ARMCI, which facilitates access of sections of multidimensional arrays in GA and 2D arrays in GPSHMEM.
In general, MPI-2 RMA is more difficult to use than GA and GPSHMEM because of the greater number of steps involved. A single one-sided operation includes setting up a ‘window’ for the group of communicating processes, moving the data, and ensuring completion of data transfer (among processes in the group) using different synchronization constructs for “active” or “passive” targets. RMA operations with passive targets are serialized, which degrades performance. Notwithstanding these caveats, MPI one-sided communication is part of the MPI-2 standard and consequently is more likely to gain support on a greater number of platforms than GA, GPSHMEM, and ARMCI.
Previous work by Nieplocha et al. [5, 6] on implementing ARMCI on Linux clusters has demonstrated that the ARMCI put/get operations outperform the MPICH-GM implementation of MPI 1.1 send and receive for both strided and contiguous data access. We have conducted additional performance measurements of ARMCI put/get versus MPI 1.1 send/receive on the TACC IA32 and IA64 clusters, using a simple “halo exchange” prototype where each process exchanges contiguous and strided data with its nearest neighbors on a 1-D process grid. Results, as well as a further description of the experiments, are shown in Figures 1 and 2. For contiguous data transfers, the ARMCI put/get operations exhibit better performance than the equivalent MPI 1.1 send/receive over the entire range of message sizes considered on the IA64 system. On the IA32 cluster, the ARMCI put and get operations outperform MPI send and receive for message lengths exceeding 8KB. On both systems, ARMCI put/get perform better than MPI send/receive for transfer of strided data over the entire range of message sizes. Future work will perform the following experiments to address important issues of RMA operations:
1. Latency and bandwidth measurements over different message sizes will be conducted for put/get operations as implemented in LAM-MPI versus ARMCI. The poor bandwidth obtained with the LAM-MPI implementation of the MPI-2 put/get operations, as shown in Figure 1, could be attributed to the fact that the most recent stable version does not take full advantage of the underlying GM layer. Our future study will use the most recent version of LAM-MPI, which has a completely rewritten version of its GM RPI (request progression interface).
2. The transfer of contiguous array data in GPSHMEM and GA will be compared against the corresponding point-to-point and/or collective communication calls in MPI. These timing results will be compared with ARMCI measurements to obtain the overhead involved in the GA and GPSHMEM routines.
3. Applications characterized by dynamic communication patterns, irregular data structures, and noncontiguous data transfers are likely to benefit from the remote memory copy functionality that GA and GPSHMEM provide. Molecular dynamics and finite difference applications fit this description well, since both involve the exchange of sections of multidimensional arrays at every time step of the simulation. To test the efficiency and applicability of GA and GPSHMEM for these types of applications, we will evaluate the performance of a simple MD and finite difference code using three different implementations for the transfer of array sections --- corresponding to the use of GA, GPSHMEM, and MPI point-to-point as well as collective communication functions. The experience on these applications will also provide insight on the general feasibility of using SMP-like programming interfaces on distributed-memory systems.
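The "halo exchange" pattern used in the measurements above can be pictured without any messaging library; this is a serial simulation of the communication pattern only (real timings of course require the actual put/get or send/receive calls):

```python
def halo_exchange_1d(blocks):
    """Simulate one halo exchange on a 1-D process grid: each rank
    obtains the last element of its left neighbour and the first
    element of its right neighbour (non-periodic boundaries)."""
    halos = []
    for rank, block in enumerate(blocks):
        left = blocks[rank - 1][-1] if rank > 0 else None
        right = blocks[rank + 1][0] if rank < len(blocks) - 1 else None
        halos.append((left, right))
    return halos

# Three ranks, each owning a small contiguous block of a 1-D domain.
blocks = [[1, 2], [3, 4], [5, 6]]
halos = halo_exchange_1d(blocks)   # [(None, 3), (2, 5), (4, None)]
```

With one-sided put/get, each rank writes its boundary values directly into its neighbours' halo regions, with no matching receive call on the target side.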
All computational work will be performed on TACC’s Linux clusters: a 32-node cluster of Pentium III-based dual-processor SMPs and an Itanium-based cluster of 20 dual-processor SMP nodes. Each cluster uses a Myrinet 2000 switch fabric.
K. Parzyszek, J. Nieplocha, and R. Kendall, “A Generalized Portable SHMEM Library for High Performance Computing,” in Proceedings of the Twelfth IASTED International Conference on Parallel and Distributed Computing and Systems.
J. Nieplocha, V. Tipparaju, A. Saify, and D. K. Panda, “Protocols and Strategies for Optimizing Performance of Remote Memory Operations on Clusters,” in Proc. Workshop on Communication Architecture for Clusters (CAC02) of IPDPS'02.
J. Nieplocha, J. Ju, and E. Apra, “One-sided Communication on Myrinet-based SMP Clusters using the GM Message-Passing Library,” in Proc. CAC01 Workshop of IPDPS'01, San Francisco, 2001.
|Title||Supercomputing in Plain English: Teaching High Performance Computing to Inexperienced Programmers|
|Author||H. Neeman, J. Mullen, L. Lee & G.K. Newman|
|Author Inst||University of Oklahoma, USA and Worcester Polytechnic Institute|
|Abstract||Although the field of High Performance
Computing (HPC) has been evolving rapidly, the
development of standardized software systems --
MPI, OpenMP, LAPACK, PETSc and others -- has in
principle made HPC accessible to a wider community.
However, relatively few physical scientists and
engineers are taking advantage of computational
science in their research, in part because of the
considerable degree of sophistication about computing
that HPC appears to require. Yet the fundamental
concepts of HPC are fairly straightforward and
have remained relatively stable over time. These
facts raise an important question: can scientists
and engineers with relatively modest computing
experience learn HPC concepts well enough to take
advantage of them in their research?
To answer this question, HPC educators must address several issues. First, what are the fundamental issues and concepts in computational science? Second, what are the fundamental issues and concepts in HPC? Third, how can these ideas be expressed in a manner that is clear to a person with relatively modest computing experience? Finally, is classroom exposure sufficient, or is guidance required to assist investigators in incorporating HPC in their research codes?
A helpful way to describe the fundamental issues of computational science is as a chain of abstractions: phenomenon, physics, mathematics (continuous), numerics (discrete), algorithm, implementation, port, solution, analysis and verification. In general, physics, mathematics and numerics are addressed well by existing science and engineering curricula -- though often in isolation from one another -- and therefore instruction should be provided on issues relating primarily to the later items, and on the interrelationships between all of them. For example, algorithm choice is a fundamental issue whose gravity many researchers with less computing experience do not appreciate; a common mistake is to solve a linear system by inverting the matrix of coefficients, without regard for performance, conditioning, or exploitation of the properties of the matrix.
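The inversion pitfall noted above can be made concrete with the preferred alternative, direct elimination; this is a minimal pure-Python sketch for teaching purposes, not production solver code:

```python
def solve_gauss(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting --
    roughly n^3/3 multiply-adds, versus about three times that (plus
    extra rounding error) for forming the explicit inverse of A and
    multiplying it by b."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        # Partial pivoting: swap in the row with the largest pivot.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

x = solve_gauss([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])   # [0.8, 1.4]
```

Exploiting structure (symmetry, sparsity, bandedness) reduces the cost further still, which is exactly the algorithm-choice judgment the curriculum aims to teach.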
As for HPC, the literature tends to agree on the following fundamental issues: the storage hierarchy; instruction-level parallelism; high performance compilers; shared memory parallelism (e.g., OpenMP); distributed parallelism (e.g., MPI); hybrid parallelism. The pedagogical challenge is to find ways to express the basic concepts with minimal jargon and maximal intuitiveness.
Although expressing these concepts in a manner appropriate for the target audience can be challenging, it does not need to be daunting. For example, the use of analogies and narratives to explain these concepts can capture the fundamental underlying principles without distracting the students with technical details. Once the students understand the basic principles, the details of how to implement HPC solutions are much easier to digest.
However, it is probably not reasonable, in many cases, to expect novice programmers to immediately understand how to apply HPC concepts to their science. As a follow on, regular interactions with experienced HPC users can provide the needed insight and practical advice to move research forward.
We will discuss an ongoing effort to develop materials that express sophisticated scientific computing concepts in a manner accessible to a broad audience. In addition, we will examine a programmatic approach that incorporates both the use of these materials and the crucial contribution of followup.
Applications and Tools Track Abstracts:
Session I: Applications Performance Analysis
|Title||Scaling Behavior of Linear Solvers on Large Linux Cluster|
|Author(s)||John Fettig, Wai-Yip Kwok, Faisal Saied|
|Author Inst||NCSA/University of Illinois, USA|
|Abstract||Large sparse linear systems of
equations arise in many computational science and
engineering disciplines. One common scenario occurs
when implicit methods are used for time-dependent
partial differential equations, and a nonlinear
system of equations has to be solved at each time
step. With a Newton-type approach, this in turn
leads to a linear system at each Newton iteration.
In such situations, the linear solver can account
for a large part of the overall simulation time.
Demonstrating the effectiveness of linear solvers
on large Linux clusters will help establish the
usefulness of these platforms for applications
in such diverse areas as CFD, astrophysics, biophysics,
earthquake engineering, and structural mechanics.
In this paper, we investigate the effectiveness of Linux clusters for state-of-the-art parallel linear solver software such as PETSc, Hypre, and other packages drawn from the research community. In particular, we study scaling to large processor counts, and large system sizes (tens of millions of unknowns). The scalability of the algorithms, of the implementation, and of the cluster architecture will be investigated. We will focus on iterative Krylov subspace solvers like Conjugate Gradient, GMRES(k) and BiCGSTAB, coupled with parallel preconditioners, such as block Jacobi, domain decomposition, and multigrid methods.
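A minimal serial, unpreconditioned conjugate gradient sketch of the kind of Krylov iteration under study; the plain-Python matrix and vectors are for illustration only and are not drawn from any of the packages named above:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Unpreconditioned CG for symmetric positive-definite A.
    Each iteration costs one matrix-vector product plus a few
    inner products and vector updates -- the same kernels whose
    parallel performance the paper profiles."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = b[:]                      # residual of the zero initial guess
    p = r[:]
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:          # squared residual norm below tolerance
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In the parallel setting, the matrix-vector product requires halo communication while each inner product requires a global reduction, which is why network latency and bandwidth enter the scaling analysis separately.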
In our analysis of the parallel performance of these solvers on Linux clusters, we will identify which components of clusters (CPU, memory bandwidth, network latency/bandwidth) are the critical factors in the performance of the different software components of the parallel solvers, such as matrix vector products, inner products, preconditioners, grid transfer operators, coarse grid solvers. We will present results for Linux clusters based on first and second-generation 64-bit Itanium processors from Intel, connected by a Myrinet switch, used primarily in message-passing mode.
The number of performance tools available on Linux clusters is increasing. In our performance analysis of parallel linear solvers, we will use these tools to give a more detailed performance profile of the solvers. In particular, tools that access hardware performance counters will be used to study the impact of the memory hierarchy on performance, and MPI analysis tools will be used to quantify the message-passing performance of the network for sparse linear solvers.
|Title||Benchmarking of a Commercial Linux Cluster|
|Author Inst.||Università di Pavia, Italy|
|Abstract||Exploiting the cumulative computing capabilities of workstations and PCs has long been pursued as an inexpensive way of delivering high performance to many scientific and industrial applications. Various solutions have been proposed to allow parallel applications to be executed over a set of independent systems. Cluster machines, built on top of commodity off-the-shelf components and initially developed as prototype machines by research centers, are becoming very popular. The increasing performance of commodity components and the availability of interconnection networks able to link large numbers of such components have fueled the development of these cluster machines. Moreover, the development of specialized hardware and software components tuned to the demands of HPC applications has resulted in cluster machines that approach supercomputer performance and rank among the top 500 most powerful supercomputers. Hardware vendors offering scalable cluster machines also testify to the maturity of cluster computing as a cost-effective solution for the high-performance needs of scientists and numerical engineers.
This paper focuses on a detailed study of the performance achieved by an IBM NetFinity supercluster on various scientific benchmarks. Although we have analyzed the performance of a specific machine, our results have broader applicability: the IBM NetFinity adopts a typical cluster architecture based on Intel Pentium processors interconnected via both Myrinet and Ethernet networks, running the Linux operating system. The cluster is composed of 260 nodes, each housing two Intel Pentium processors. As a preliminary overview of the cluster performance, we have investigated the run-time behavior of a few kernels performing widely used numerical algorithms, such as matrix LU factorization, multigrid solvers, and distributed rankings.
The aim of this study is to evaluate the performance benefits that numerical applications gain from specialized high-bandwidth communication components, i.e., Myrinet switches, over a typical Beowulf cluster based on Ethernet networking. For this purpose, we have analyzed the speedup achieved by the various kernels executed over either the Myrinet or the Fast Ethernet network. Our results are based on post-mortem analysis of low-level timing measurements collected by monitoring the executions of the various kernels. Performance models of simple distributed computations, aimed at assessing the severity of communication versus computation costs, have been derived.
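The simplest such performance model is the classic latency/bandwidth form, t(m) = latency + m / bandwidth, for a point-to-point message of m bytes. The sketch below contrasts a Myrinet-like interconnect with Fast Ethernet; the parameter values are illustrative order-of-magnitude assumptions, not measurements from this paper.

```python
# Toy latency/bandwidth model of point-to-point message time:
# t(m) = latency + m / bandwidth. Parameter values are illustrative
# assumptions only, not measured figures.

def message_time(nbytes, latency_s, bandwidth_Bps):
    return latency_s + nbytes / bandwidth_Bps

# Order-of-magnitude parameters (assumed):
myrinet = dict(latency_s=10e-6, bandwidth_Bps=150e6)   # ~10 us, ~150 MB/s
ethernet = dict(latency_s=70e-6, bandwidth_Bps=11e6)   # ~70 us, ~11 MB/s

# For short messages the latency term dominates; for long messages the
# bandwidth term does -- which is why kernels issuing many small blocking
# send/receives can behave quite differently from bulk transfers.
short = (message_time(64, **myrinet), message_time(64, **ethernet))
bulk = (message_time(1_000_000, **myrinet), message_time(1_000_000, **ethernet))
```

Fitting the two parameters per network to measured timings is one way to quantify how much of an observed slowdown is latency-bound versus bandwidth-bound.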
The analysis of communication times has then focused on the wall clock time spent by processors in data exchange. Note that these times account both for the time required to deliver the data over the network links and for the time spent executing the adopted communication protocols. As a rule of thumb, Myrinet allows greater scalability than Ethernet, even when the amount of workload to be processed by each allocated processor is quite small. The speedup of a few kernels executed over Ethernet saturates when more than 32 processors are allocated. On the other hand, a surprising result is that Ethernet performance is superior when parallel kernels issue a large number of blocking send/receive operations. For each communication protocol, we have also analyzed the computation overheads, that is, the increases in computation wall clock time due to the concurrent execution of system processes managing the communication activities.
Our performance characterization study has then focused on the behavior of a complex climate model benchmark, the PSTSWM kernel from Oak Ridge National Laboratory. Performance figures from various testbed simulations, varying the size of the physical grid, the number of simulated steps, and the number of allocated processors, have been analyzed. Statistical clustering techniques have then been applied to these performance figures to derive synthetic characterizations of the behaviors of the allocated processors. The identified groups of homogeneous behaviors have then been related to the characteristics of the corresponding simulations, that is, to the processed workload.
As a further step towards characterizing the performance achieved by the IBM NetFinity cluster, we have studied the impact of the processor allocation policy on the various kernel execution times. A parallel application may be thought of as composed of tasks to be allocated to the various processors. Since each IBM NetFinity node houses two processors, two allocation policies are available: we can allocate either one or two application tasks on each node. Note that when only one task is allocated on a node, both processors may still contribute to the execution of that task. Numerical kernels experience performance degradation when two tasks per node are allocated on a large number of nodes (i.e., more than 32 nodes). This degradation is due to increases in both computation and communication wall clock time. Indeed, an in-depth analysis of the timing measurements shows that communications between tasks allocated on the same node take longer than communications between tasks allocated on distinct nodes. Moreover, since the computation phases are somewhat synchronized across all application tasks, contention due to synchronous memory accesses and the concurrent execution of any spawned communication processes also increases the computation wall clock times. Experimental evidence, presented in the paper, will substantiate that for some types of applications, for a fixed number of nodes, allocating both processors of each node results in longer execution times than allocating only one processor per node.
Session II: Applications Performance on Clusters
|Title||A Comparative Study of the Performance of a CFD program across different Linux Cluster Architectures|
|Author(s)||Th. Hauser, R.P. LeBeau, T. Mattox, G.P. Huang, H. Dietz|
|Author Inst||Utah State University|
|Title||Scalability of Tera-scale Linux-based Clusters for Parallel Ab initio Molecular Dynamics|
|Author(s)||Hongsuk Yi, Jungwoo Hong, Hyoungwoo Park, and Sangsan Lee|
|Author Inst||KISTI, Korea|
|Abstract||In early 2000, KISTI initiated the TeraCluster project to design a Linux-based cluster with tera-scale performance. The main goal of this project is to provide PC-cluster resources that meet the level of compute performance required by the grand challenge applications in Korea. The UP2000 with Myrinet and the DS10 with Fast Ethernet, the test platforms of the TeraCluster project, have benefited from many years of research dedicated to improving their scalability and performance. In our experiments, we investigated the performance of scalable ab initio molecular dynamics and quantum chromodynamics (QCD) applications on the test platforms and compared them with the Cray T3E system. The parallelization is based on the Message Passing Interface standard. Results show that the QCD application, which requires only nearest-neighbor communication between computing nodes, exhibits excellent parallel scaling, while for the ab initio molecular dynamics the operations associated with the three-dimensional fast Fourier transform dominate, representing the most time-consuming part in the calculation of the C36 fullerene. Combining these results with performance evaluations of additional grand challenge applications in computational fluid dynamics, structural analysis, and computational chemistry, we designed a prototype TeraCluster with 516 computing nodes and built the first phase of the TeraCluster in 2002.
TABLE I: The hardware and software components of the three different platforms and the basic results of NPB and PMB benchmark suite
We have built two different Linux-based PC clusters of 64 computing nodes each, with one server node, as pilot systems. The DS10 system with 466 MHz EV6 and the UP2000 with 667 MHz EV6 Alpha processors are connected with Fast Ethernet and Myrinet, respectively. Each computing node has 250 MB of memory and a 20 GB local IDE disk, while the server node has 128 MB of memory and a 30 GB disk. The systems run Red Hat Linux 6.0 with kernel 2.2.0. MPICH is used as the MPI communication library, and the Portable Batch System (PBS) is the queuing system. The pilot project has been a success story through the collaboration of a few industrial partners in Korea. Based on the performance evaluations of the major KISTI applications, we built the first phase of the TeraCluster, denoted PhaseI. The details are summarized in Table I.
FIG. 1: (a) The sustained performance and (b) the corresponding speedup of full QCD with a 12 × 12³ lattice for the three different machines.
FIG. 2: (a) Speedup of the ab initio molecular dynamics for the UP2000, DS10, and Cray T3E; the dashed line indicates the ideal speedup. (b) Speedup of the ab initio application using VASP on PhaseI.
We also calculate a new solid phase of carbon made from the C36 fullerene. The supercell approximation is used to simulate the periodic boundary condition, and the cell size is 11 × 11 × 11 Å³ for C36. The Kohn-Sham single-electron wavefunction is expanded in plane waves up to a cut-off energy of 36 Ry. Since C36 is a molecule, we use a single k-point in the calculation. In Fig. 2(a), we illustrate the speedup of the C36 fullerene calculation on the pilot systems and compare it with the Cray T3E. In contrast to the above results, each machine shows incomplete scalability. In particular, the Cray T3E retains linear scalability up to 16 nodes, but the speedup then increases only slowly, reaching nearly 16 on 64 computing nodes, while the DS10 exhibits even poorer scalability. These results underline the importance of the capacity of the network interface for the collective communication between the executing nodes. Indeed, the communication overhead increases rapidly as the number of working nodes increases, reflecting the significant data exchange during the 3D FFT. In Fig. 2(b), we also present performance evaluations of ab initio molecular dynamics using the VASP package on PhaseI. Here, however, no scalability was found at all, even for 32 working nodes. In the remainder of this paper, we discuss the scalability of the ab initio grand challenge problems in more detail.
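The saturation behavior described here can be reproduced with a very simple toy model: compute time shrinks as 1/p, while the transpose (all-to-all) step of a parallel 3D FFT keeps total data on the wire roughly constant and adds a per-node overhead that grows with p. The constants below are illustrative assumptions, not values fitted to the measurements in this paper.

```python
# Toy model of speedup saturation for an FFT-dominated code.
# Times are in arbitrary units; the constants are assumptions.

def parallel_time(p, t_comp=100.0, t_alltoall=1.0, overhead_per_node=0.05):
    if p == 1:
        return t_comp                     # serial run: no communication
    # compute shrinks as 1/p; all-to-all cost stays, overhead grows with p
    return t_comp / p + t_alltoall + overhead_per_node * p

def speedup(p):
    return parallel_time(1) / parallel_time(p)

# Parallel efficiency S(p)/p decays as the communication terms take over.
efficiency = {p: speedup(p) / p for p in (4, 16, 64)}
```

Under these assumed constants the speedup at 64 nodes stalls well below linear, qualitatively matching the "nearly 16 on 64 nodes" behavior reported for the T3E.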
 TeraCluster Project (http://cluster.or.kr).
 KISTI (http://www.hpcnet.ne.kr).
|Title||openMosix vs Beowulf: a case study|
|Author(s)||Moshe Bar, Stefano Cozzini, Maurizio Davini and Alberto Marmodoro|
|Author Inst.||INFN and the University of Pisa, Italy|
|Abstract||Today two main clustering paradigms exist for the Linux environment: openMosix and Beowulf (MPI-based). Whereas in MPI-based solutions the parallelization must be coded explicitly through special library calls, in openMosix the usual Unix-style programming concepts apply. Moreover, for clustering applications it has become increasingly evident that the latency of the network interconnect quickly takes on a very significant weight in overall system performance and throughput. In this paper we describe our evaluation of these two clustering paradigms, applied to a suite of production codes currently used in a specific scientific environment recently installed at Trieste (Italy): the INFM center of excellence DEMOCRITOS. The codes represent the standard computational tools used in research on atomistic simulations. The suite can be roughly divided into three main categories of codes:
We therefore decided to provide DEMOCRITOS with two small clusters, the first implementing the Beowulf paradigm and the second the openMosix approach. Both clusters are IBM 1300 cluster solutions equipped with PIII 1.4 GHz processors and the same number of CPUs (16 bi-processor nodes). One cluster is equipped with the Myrinet high-speed network, in order to let us test the role of a high-speed network in both paradigms.
Our benchmarking work aims to determine which solution meets certain precise requirements regarding scalability, performance, and throughput. Scalability and performance are measured in a standard way and compared against other (proprietary) parallel platforms; we deem them satisfactory simply by computing a performance/price ratio. These requirements are easily met by both paradigms for all classes of codes. The aspect that makes the difference is throughput: this is fundamental in order to guarantee efficient resource sharing among users and high utilization of the computational resources.
Finally, a key role is played by the "time to production" variable. It is in fact fundamental for our research to reduce to a minimum the time needed to implement a scalable and efficient parallel application starting from a serial one.
Our preliminary results indicate that for highly parallel codes, like the ab initio ones, the standard Beowulf paradigm should be the preferred solution in terms of efficiency, scalability, and throughput. We note, however, that the superiority of this solution is mainly due to the fact that highly parallel and scalable codes are ready to be deployed, and therefore the time to production for this class of codes is practically zero.
On the other hand, for QMC codes the openMosix approach is certainly to be preferred. Efficiency and scalability are, as already said, the same as in the Beowulf case; the two main advantages relate to the other two aspects. Throughput is easily obtained through the load-balancing technology within openMosix, and the "time to production" for serial programs is practically negligible: just run many instances of the program and you are done.
It is interesting to note that the second class of codes (classical MD), despite already being parallel, generally runs better on the openMosix cluster. These codes in fact scale only to a low degree (1-8 CPUs), which makes sharing resources between this class of codes and the ab initio ones (which require many processors) difficult to schedule efficiently with standard queue systems on a small cluster like ours. The openMosix solution, on the contrary, deals best with the requirements of QMC and moderately scalable codes like these.
In the final paper we will report the results obtained with the computational suite under various lab environments, considering the following aspects: topology of nodes, typology of nodes, and network interconnects. With special hardware for the various interconnects (Myrinet, Dolphin, Fast Ethernet, Gigabit) at hand, we will show the resulting (reproducible) scalability, performance, and throughput figures for our computational problems.
Session III: Experiences with Portability and I/O
|Title||Adventures with Portability|
|Author Inst||Lincoln University, New Zealand|
|Abstract||As computers become more powerful it becomes possible to develop computer models that more accurately approximate the real-world systems they simulate. However, these models are increasingly complex, often taking years to develop, so it is important to re-use existing work as far as possible rather than continually “re-inventing the wheel” and duplicating effort that could be better spent elsewhere.
With this aim of re-use scientists at Dexcel (a New Zealand dairy research institute) developed a dairy farm simulation framework with some capability to incorporate existing computer models of distinct entities in a pastoral dairy farm, e.g. pasture and crop growth models, animal metabolism models. This framework also allows run-time selection from several alternative models for each component (where they exist). With some being simpler and others more complex, this permits the same basic farm configuration to be modeled at different levels of detail, as appropriate for different research exercises.
A major practical difficulty in implementing a system with this flexibility is that the various component sub-models have typically been written in a range of different computer languages. The dairy farm simulation framework itself is written in Smalltalk, and so far makes use of sub-models for the climate, pasture and animal components written in various languages such as Smalltalk, C, Fortran and Pascal. Initially the framework was developed as a research tool for agricultural scientists, and run on a Windows operating system using the Microsoft COM/DCOM protocol for interprocess communication. However there is now a requirement to investigate automated optimization of farm specification and operating parameters, and also visualisation of the associated high-dimensional objective functions. This work requires very large numbers of simulations to be performed so we are attempting to port the framework to run in parallel on a Linux cluster.
While porting the existing framework and component models to run on the Linux cluster we also need to maintain the same dairy farm simulation model running on a Windows operating system. Ideally, as far as possible, we would like to maintain one single version that runs on both operating systems and also on a uniprocessor or in a parallel processing environment to avoid the difficulties of keeping the development of multiple versions for different environments synchronised. In practice this may not be completely achievable, but as far as possible we want to minimize the differences between these implementations. Thus we have had to use techniques that are portable between these environments, to ensure that all environments can use the same version. An added complexity is that researchers using the Windows version use a Windows interface but the parallel version must run without interacting with a Windows interface, so the code for the model itself must be independent of the code for the interface. Also, as we are using component models developed by other researchers we cannot change the component models themselves, and have to ensure that the interfaces between the framework and the component models are such that new versions of the component models may be substituted as they become available. It may also happen that alternative models for a particular component, such as the pasture or animal component, may require quite different inputs and produce outputs in different formats, or even completely different types of output. It is therefore something of a challenge to devise generic portable interfaces between the framework and the component models that will provide the necessary input parameters for any component model, and also be able to make use of the output from these component models, if necessary converting the output to a different form before it can be used in the dairy farm simulation.
This paper identifies and discusses the problems to be solved in this porting process, and describes the techniques that have been used successfully to date. These include porting the framework and some of its component models from a Windows uniprocessor environment to a Linux parallel processing environment, the initial use of files to transfer data between processes, and then the replacement of the file transfers by socket communication. This has led to some interesting exercises in portability. As we use MPI to manage the parallel processing, and have not yet persuaded Smalltalk to communicate by using MPI, we have a C program, using MPI, which manages the parallelism. This C program then starts the Smalltalk framework, which in turn communicates with its component models, written in Smalltalk, C, Fortran or Pascal, and returns a final result for each simulation to the controlling C program. The problem has been in getting different processes, written in different languages, and perhaps started and running independently of one another, to communicate and pass data between themselves. We hope to include some preliminary results of extending MPI outside its normal C/C++/Fortran domain, making it the basic inter-process communication mechanism in the framework in a way that is also portable across operating systems.
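The file-to-socket transition described above hinges on a language-neutral wire format that any of the participating processes can speak. The sketch below (Python standing in for both the C/MPI controller and the Smalltalk framework, using two threads instead of separate processes) shows the core idea: length-prefixed messages over a local socket, so neither side needs to know the other's language or runtime.

```python
# Sketch of replacing file-based data transfer with socket communication
# between independently started processes. Two threads stand in for the
# controller and the simulation framework; the length-prefixed message
# format is the language-neutral part the port relies on.

import socket
import struct
import threading

def send_msg(conn, payload):
    # 4-byte big-endian length header, then the payload bytes.
    conn.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(conn, n):
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return buf

def recv_msg(conn):
    (length,) = struct.unpack("!I", recv_exact(conn, 4))
    return recv_exact(conn, length)

def framework_stub(port):
    # Stands in for the simulation framework: receives parameters,
    # returns a "simulation result" to the controller.
    with socket.create_connection(("127.0.0.1", port)) as conn:
        params = recv_msg(conn).decode()
        send_msg(conn, ("result-for:" + params).encode())

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # pick any free port
server.listen(1)
port = server.getsockname()[1]

worker = threading.Thread(target=framework_stub, args=(port,))
worker.start()
conn, _ = server.accept()
send_msg(conn, b"pasture-model,day=42")     # hypothetical parameter string
results = [recv_msg(conn).decode()]
conn.close()
worker.join()
server.close()
```

The parameter string and result format here are made up for illustration; the point is only that a length-prefixed byte protocol is equally implementable from Smalltalk, C, Fortran, or Pascal.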
|Title||Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters|
|Author||Kent Milfeld, Avijit Purkayastha, Chona Guiang|
|Author Inst||TACC/University of Texas - Austin, USA|
|Abstract||One of the key elements to the
Linux Cluster HPC Revolution is the high speed
networking technology used to interconnect nodes
within a cluster. It is the backbone for “sharing” data
in a distributed memory environment. This
backbone, when combined with local high-speed disks,
also provides the framework for developing parallel
I/O paradigms and implementing parallel I/O systems—an
HPC environment for “sharing” disks
in a parallel file system.
While much of the emphasis in cluster computing has focused on optimizing processor performance (e.g., compilers, hardware monitors, numerical libraries, optimization tools) and increasing message passing efficiency (e.g., MPICH-GM, MPI-Pro, LAM, M-VIA), much less effort has been directed towards harnessing the parallel capacity of multiple processors and disks to obtain “HPC I/O” in clusters. This is partly due to the fact that the MPI-I/O standard was not formulated until version 2 of MPI. It is also due to the cost of dedicating processors/nodes to I/O services. (For workloads that include only a few I/O intensive applications, dedicated nodes for parallel I/O may be an underused resource, to say the least.) In addition, dedicated I/O nodes introduce heterogeneity into the system configuration, and therefore require additional resources, planning, management, and user education.
The Parallel Virtual File System (PVFS) is a shared file system designed to take advantage of parallel I/O service on multiple nodes within a Linux cluster. In PVFS, parallel I/O requests are initiated on clients (compute nodes), and the parallel I/O service occurs on I/O servers (I/O nodes). The PVFS API includes a partitioning mechanism to handle common, simple-strided access with a single call. The file system is virtual in the sense that it can run without kernel support (that is, in user space) in its native form (pvfs), although a UNIX I/O interface is supported through a loadable kernel module. Also, PVFS is built on top of the local file system on each I/O node. Hence, there is a rich parameter space for obtaining optimal performance through a local file system. The authors of PVFS (Cettei, Ross & Ligon) have championed this file system, and have performed many different benchmarks to illustrate its capabilities.
The main attributes of PVFS are:
· a single file name space
· multiple I/O servers (files are striped across I/O nodes)
· single metadata manager (mgr): metadata is separated from data
· independent I/O server, client and manager location
· data transfer through TCP/IP
· multiple user interfaces (MPI-I/O, UNIX/POSIX I/O, native pvfs)
· (PVFS has been integrated into the ADIO layer of ROMIO)
· free and downloadable
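The striping attribute above is the heart of the design: a file is split into fixed-size stripe units dealt round-robin across the I/O servers, so any byte offset maps deterministically to one server and a local offset there. The following sketch illustrates the arithmetic; the stripe size and server count are illustrative (PVFS makes these per-file parameters), and the function name is ours.

```python
# Sketch of round-robin file striping across I/O servers, as in PVFS.
# Stripe size and server count are illustrative per-file parameters.

def locate(offset, n_servers, stripe_size=65536):
    """Map a file byte offset to (I/O server index, offset in that
    server's local file)."""
    unit = offset // stripe_size          # which stripe unit overall
    server = unit % n_servers             # round-robin across servers
    local_unit = unit // n_servers        # units this server already holds
    return server, local_unit * stripe_size + offset % stripe_size

# The first n_servers stripe units land on servers 0..n-1 in turn:
layout = [locate(i * 65536, 4)[0] for i in range(8)]
```

A large contiguous read therefore fans out into parallel requests to all servers at once, which is where the aggregate bandwidth of a parallel file system comes from.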
The purpose of this paper is to investigate the PVFS file system on a wide range of parameters, including different local file systems and varied multi-user conditions.
In many small “workcluster” systems the nodes are frequently 2-way SMPs (for price/performance reasons). In these systems, where jobs are often run in dedicated mode (all processors devoted to a single user), there are two basic configurations that we evaluate for I/O-intensive applications: a compute processor and an I/O processor can coexist on each node, or the nodes can be divided into separate I/O nodes and compute nodes. More specifically, the cases are:
Also, on a loaded production cluster, the state of the system is often far from the conditions under which benchmarks are performed. In our evaluation of PVFS we simulate different scenarios of simultaneous I/O access by multiple users, to test I/O nodes under the stress that can occur in large systems. While the cost of “multiplexing” multiple user requests is not important in a dedicated environment, it is an important measurement for large production systems, and should not be outweighed by single-user characteristics.
 Nils Nieuwejaar, David Kotz, Apratim Purakayastha, Carla Schlatter Ellis, and Michael Best, File Access Characteristics of Parallel Scientific Workloads, IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, October 1996.
 Matthew M. Cettei, Walter B. Ligon III, and Robert B. Ross, Clemson University.
 http://parlweb.parl.clemson.edu/pvfs/papers.html, http://www-unix.mcs.anl.gov/~rross/resume.htm
Systems Administration Track Abstracts:
I: System Performance
|Title||Checkpointing and Migration of Parallel Processes Based on Message Passing Interface|
|Author||Zhang Youhui, Wang Dongsheng, Zheng Weimin|
|Author Inst||Department of Computer Science, Tsinghua University, China|
|Abstract||Linux clusters offer a cost-effective platform for high-performance, long-running parallel computations and have been widely used in recent years. However, a main problem with programming on clusters is that the environment is prone to change: idle computers may be available for computation at one moment and gone the next due to load, failure, or ownership. The probability that a cluster will fail increases with the number of nodes, and during normal computing a single abnormal event is likely to cause the entire application to fail. To avoid this waste of computation, it is necessary to achieve high availability for clusters. Checkpointing & Rollback Recovery (CRR) and process migration offer a low-overhead and complete solution to this problem.
CRR, which records the process state to stable storage at regular intervals and restarts the process from the last recorded checkpoint upon system failure, is a method that avoids wasting the computation accomplished prior to the occurrence of the failure. Checkpointing also facilitates process migration, which suspends the execution of a process on one node and subsequently resumes its execution on another. However, checkpointing a parallel application is more complex than just having each processor take checkpoints independently: to avoid the domino effect and other problems upon recovery, the state of inter-process communication must be recorded, and the global checkpoint must be consistent. Related CRR software includes MPVM, Condor, and Hector.
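The coordinated scheme described here can be illustrated schematically: all (simulated) processes are saved together at one consistent cut, and after a failure every process rolls back to that same cut, which is what avoids the domino effect of independent checkpoints. This toy sketch omits the hard part a real system such as ChaRM4MPI must handle, namely capturing in-flight MPI messages.

```python
# Schematic illustration of coordinated checkpointing and rollback.
# Real systems must also record in-flight inter-process messages.

import copy

class Process:
    def __init__(self, rank):
        self.rank = rank
        self.step = 0       # stands in for the full process state
    def compute(self):
        self.step += 1

def coordinated_checkpoint(procs):
    # A consistent global state: every process saved at the same cut.
    return [copy.deepcopy(p) for p in procs]

def rollback(checkpoint):
    # All processes resume from the same global checkpoint together.
    return [copy.deepcopy(p) for p in checkpoint]

procs = [Process(r) for r in range(4)]
for _ in range(10):
    for p in procs:
        p.compute()
ckpt = coordinated_checkpoint(procs)    # global checkpoint at step 10

for _ in range(5):                      # five more steps of work...
    for p in procs:
        p.compute()
# ...then a node fails: recover every process from the last checkpoint,
# losing only the work done since that checkpoint.
procs = rollback(ckpt)
steps_after_recovery = [p.step for p in procs]
```

Only the interval between the last checkpoint and the failure is lost, which is why the checkpointing overhead (reported below as under 2%) is the quantity worth measuring.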
This paper presents ChaRM4MPI, a checkpoint-based rollback recovery and migration system for the Message Passing Interface on Linux clusters. In this system several important fault tolerance mechanisms are designed and implemented, including a coordinated checkpointing protocol, synchronized rollback recovery, and synchronized process migration.
Using ChaRM4MPI, transient node faults can be recovered automatically, and permanent faults can also be recovered through checkpoint mirroring and process migration techniques. If any node running the computation drops out of the cluster, the computation will not be interrupted. Moreover, users can migrate MPI processes from one node to another manually for load balancing or system maintenance. ChaRM4MPI is a user-transparent implementation, which means users need not modify their source code.
Some MPI parallel applications have been tested on ChaRM4MPI, including the sorting programs and LU decomposition developed by NASA, an overfall simulation, and so on. We study the extra time overhead introduced by the CRR mechanism, and the result is quite satisfactory: the average extra overhead per checkpoint is less than 2% of the original running time.
In the full text, the coordinated CRR and migration protocol is described first, and then we introduce in detail the work involved in modifying MPICHP4. Finally, test results will be presented and analyzed.
1. Elnozahy E N, Johnson D B, Wang Y M. A Survey of Rollback Recovery Protocols in Message-Passing System. Technical Report. Pittsburgh, PA: CMU-CS-96-181. Carnegie Mellon University, Oct 1996
2. Elnozahy E N. Fault tolerance for clusters of workstations. Banatre M and Lee P (Editors), chapter 8, Springer-Verlag, Aug. 1994.
3. J. Casas, Dan Clark, et. al. MPVM: A Migration Transparent Version of PVM, Technical Report, Dept. of Comp. Sci. & Eng., Oregon State Institute of Sci. & Tech.
4. M. J. Litzkow, et. al. Condor— A Hunter of Idle Workstations, In Proc. of the 8th IEEE Intl. Conf. on Distributed Computing Systems, pp104-111, June 1988
5. S. H. Russ, B. Meyers, C.--H. Tan, and B. Heckel, ``User--Transparent Run--Time Performance Optimization'', Proceedings of the 2nd International Workshop on Embedded High Performance Computing, Associated with the 11th International Parallel Processing Symposium (IPPS 97), Geneva, April 1997.
6. James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under Unix. In Proceedings of the 1995 USENIX Technical Conference, pages 213-224, January 1995.
|Title||Analytical Model to Evaluate the Performance|
|Author(s)||Leonardo Brenner, Paulo Fernandes, César A. F. De Rose|
|Author Inst||Faculdade de Informatica, Catholic University of Rio Grande do Sul, Brazil|
|Presenter||César A. F. De Rose|
II: Cluster Management and Monitoring
|Title||High-Performance Linux Cluster Monitoring Using Java|
|Author||Curtis Smith, David Henry|
|Author Inst||Linux NetworX|
|Abstract||Monitoring is at the heart of cluster management. Instrumentation data is used to schedule tasks, load-balance devices and services, notify administrators of hardware and software failures, and generally monitor the health and usage of a system. The information used to perform these operations must be gathered from the cluster without impacting performance. This paper discusses some of the performance barriers to efficient high-performance monitoring, and presents an optimized technique to gather monitored data using the /proc file system and Java.|
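The low-overhead gathering step the abstract refers to amounts to reading and parsing plain-text /proc files. The paper does this from Java; the sketch below shows the same parsing idea in Python on a canned /proc/meminfo-style sample (so it does not depend on actually running on Linux). The sample values are made up.

```python
# Sketch of the parsing step behind /proc-based monitoring: /proc files
# are plain text, so one read plus simple string splitting yields the
# metrics without heavyweight system calls. Sample data is made up.

SAMPLE_MEMINFO = """\
MemTotal:       515560 kB
MemFree:         12344 kB
Buffers:         45892 kB
Cached:         210408 kB
"""

def parse_meminfo(text):
    """Return {field: value in kB} from /proc/meminfo-style text."""
    stats = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        stats[key] = int(rest.split()[0])
    return stats

# In production this would be open("/proc/meminfo").read() on each node,
# with values shipped to the collector only when they change, to keep
# the monitoring overhead off the compute path.
mem = parse_meminfo(SAMPLE_MEMINFO)
used_kb = mem["MemTotal"] - mem["MemFree"]
```

Because /proc is an in-memory pseudo-filesystem, such reads avoid disk I/O entirely, which is what makes per-interval polling cheap enough for cluster-wide monitoring.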
|Title||CRONO: A Configurable Management System for Linux Clusters|
|Author||Marco Aurélio Stelmar Netto, César A. F. De Rose|
|Author Inst||Research Center in High Performance Computing, Catholic University of Rio Grande do Sul, Brazil|
|Presenter||César A. F. De Rose|
Cluster architectures are becoming a very attractive alternative when high performance is needed. With a very good cost/performance ratio and good scalability, big cluster systems with hundreds of nodes are becoming more popular in universities, research labs, and industry. Linux is usually the operating system used in these systems because it is free, efficient, and very stable.
One big problem in such a system is managing all the nodes efficiently as one machine and dealing with issues like access rights, time and space sharing, reservation, and job queuing. When looking for such a manager for our research lab we were not satisfied with the services and level of configuration provided by the most popular systems, like the Computing Center Software and the Portable Batch System, and decided to write our own management system.
2. Cluster management over Linux
The Crono system was developed to be a highly configurable manager for Linux clusters. It is coded in the C language and relies on several scripts and configuration files so that it can be adapted to specific administration needs and machine configurations. The Linux operating system has several advantages for the development of a cluster management system (CMS). Linux is an open-source system and its configuration files are very simple to read and modify, so the CMS can define environment variables and access restrictions easily. Another main advantage is the possibility of modifying the kernel to better support the CMS.
3. Crono main functionalities
Crono provides two allocation modes, space-sharing and time-sharing. The first is used when the user needs exclusive access to the allocated nodes, for example when application performance is being measured. The second is used in situations where users are only testing their programs and therefore do not care about performance. Time-sharing is a very interesting alternative in teaching environments, allowing large groups of students to use the cluster at the same time.
Another main feature of Crono is the flexibility of its access-rights definitions. Through configuration files, the system administrator can create user categories and associate access restrictions with these categories or with individual users. These restrictions limit the maximum time and maximum number of nodes used in allocations and reservations. Restrictions can also be defined based on the period of the day, the day of the week, and the target machine.
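As a rough illustration of the kind of policy this enables, an access check can combine per-category limits with day-period and machine rules. The category names, limits, and function below are hypothetical, not Crono's actual configuration syntax:

```python
# Hypothetical sketch of Crono-style access rules: user categories with
# limits on node count, allocation time, day period, and target machine.
# All names and values here are illustrative, not Crono's real format.
CATEGORIES = {
    "student": {"max_nodes": 4, "max_minutes": 60,
                "hours": range(8, 18), "machines": {"small"}},
    "researcher": {"max_nodes": 32, "max_minutes": 480,
                   "hours": range(0, 24), "machines": {"small", "main"}},
}

def can_allocate(category, nodes, minutes, hour, machine):
    """Return True if the request fits the category's restrictions."""
    rules = CATEGORIES[category]
    return (nodes <= rules["max_nodes"]
            and minutes <= rules["max_minutes"]
            and hour in rules["hours"]
            and machine in rules["machines"])

# A student may use up to 4 nodes of the small cluster during the day.
print(can_allocate("student", 4, 30, 10, "small"))    # True
print(can_allocate("student", 2, 30, 22, "small"))    # False: outside hours
```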
To configure the execution environment for programs, Crono supplies scripts for the pre- and post-processing of requisitions. When a user's allocation begins, Crono runs two scripts: one controlled by the administrator and the other by the user. This mechanism is very useful, for example, to automatically generate MPI machine files. When the user's allocation ends, two post-processing scripts are run in the same way.
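For instance, a pre-processing hook might turn the list of allocated nodes into an MPI machine file. This is a minimal sketch; the function name is invented, and the one-`host:slots`-entry-per-line format follows a common MPICH convention rather than Crono's exact scripts:

```python
def make_machinefile(nodes, slots_per_node=1):
    """Build MPI machine-file text: one 'host:slots' line per allocated node.
    (Hypothetical helper; format follows the common MPICH convention.)"""
    return "".join(f"{node}:{slots_per_node}\n" for node in nodes)

# Example: a 3-node allocation on dual-CPU machines
print(make_machinefile(["node01", "node02", "node03"], slots_per_node=2))
```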
Users can interact with the system through a graphical interface or through shell commands (bash, tcsh, etc.). Several services are available to users and the system administrator: showing information about the allocation queue, submitting jobs for execution, releasing resources, configuring the execution environment, and querying access rights.
Figure 1 presents the system architecture with its four main modules: the User Interface (UI), the Access Manager (AM), the Requisition Manager (RM), and the Node Manager (NM). Communication between the modules is done through the sockets interface, allowing modules to reside on different machines. The full paper will present the Crono architecture in more detail.
Figure 1: Crono’s architecture and main modules
For the last six months Crono has been used to manage the three Linux clusters in our research lab[*]. Access to the clusters is centralized through one host machine that runs the Crono Access Manager and three Requisition Managers, one for each cluster. Each node of the clusters runs its own Node Manager. Each cluster has its own peculiarities, with a different number of nodes (4, 16, and 32) and different interconnection networks (combinations of SCI, Fast Ethernet, and Myrinet). We also support several programming environments, including different MPI implementations and some execution kernels developed in-house. With simple commands like cralloc, crcomp, and crrun, users can choose the cluster, the number of nodes, the target interconnection network, and the programming environment they want. Crono automatically generates machine files, chooses the right libraries for compilation, loads the processes, and starts the program. If the needed nodes are not available, waiting queues are managed for each cluster. Because we have different types of users, access rights can be defined for each user depending on status, project, target machine, and day period separately. A student may, for example, have access only to the small cluster during peak hours and to 16 nodes of the main cluster during the night. The system runs on the Slackware 8 Linux distribution and is already very stable. We developed the system so that it can easily be configured to work with other cluster configurations and a different set of tools. More information about Crono, together with its source code, which is distributed under the GNU license, can be found at www.cpad.pucrs.br/crono.
A. Keller and A. Reinefeld. Anatomy of a Resource Management System for HPC Clusters, vol.
OpenPBS web site, www.openpbs.org
W. Gropp, E. Lusk, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, July 1996.
[*] Research supported by HP Brazil.
|Title||Parallel, Distributed Scripting with Python|
|Author Inst.||Lawrence Livermore National Laboratory, USA|
|Abstract||[Need to convert from PDF to HTML]|
III: Processor and
|Title||A Study of Hyper-Threading in High Performance Computing Clusters|
|Author(s)||Tau Leng, Rizwan Ali, Jenwei Hsieh, Victor Mashayekhi, Reza Rooholamini|
|Author Inst||Dell Computer Corporation, USA|
|Abstract||Intel's Hyper-Threading technology makes a single physical processor appear as two logical processors. The physical processor's resources are shared, and the architectural state is duplicated for the two logical processors. The premise is that this duplication allows a single physical processor to execute instructions from two different threads in parallel rather than serially, and can therefore lead to better processor utilization. Processors with Hyper-Threading enabled can improve the performance of applications with a high degree of parallelism. Previous studies have shown that Hyper-Threading improves the performance of multi-threaded applications by 10 to 30 percent, depending on the characteristics of the application. These studies also suggest that the potential gain is obtained only if the application is multi-threaded, by whatever parallelization technique.
With the addition of Hyper-Threading support in Linux kernels 2.4.9-13 and above, Linux cluster practitioners have started to assess its performance impact on their applications. In our area of interest, High Performance Computing clusters, applications are commonly implemented using a standard message-passing interface such as MPI or PVM. Applications implemented in the message-passing programming model are usually not multi-threaded. Rather, they use a mechanism like "mpirun" to spawn multiple processes and map them to the processors in the system. Parallelism is achieved through the message-passing interface, which the processes use to coordinate their parallel tasks.
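The distinction matters because each MPI rank is an ordinary process with private memory. The toy below uses Python's multiprocessing module (not MPI) purely to illustrate that model: a launcher spawns autonomous processes that coordinate only by exchanging messages, much as mpirun does for an MPI program.

```python
# Illustration of the message-passing process model using Python's
# multiprocessing (NOT MPI): independent processes, private memory,
# coordination only via explicit messages.
from multiprocessing import Process, Pipe

def worker(rank, conn):
    # Each "rank" has its own address space; nothing is shared implicitly.
    part = sum(range(rank * 100, (rank + 1) * 100))  # local partial sum
    conn.send(part)   # analogous to sending a message to rank 0
    conn.close()

if __name__ == "__main__":
    conns, procs = [], []
    for rank in range(4):            # rough analogue of "mpirun -np 4"
        parent, child = Pipe()
        p = Process(target=worker, args=(rank, child))
        p.start()
        conns.append(parent)
        procs.append(p)
    total = sum(c.recv() for c in conns)   # rank-0-style reduction
    for p in procs:
        p.join()
    print(total)  # prints 79800, i.e. sum of 0..399
```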
Unlike multi-threaded processes, in which the values of application variables are shared by all the threads, a message-passing application runs as a collective of autonomous processes, each with its own local memory. This type of application can also benefit from Hyper-Threading in that the number of processes spawned on the physical processors can be doubled on the logical processors, potentially completing the parallel tasks faster. Applying Hyper-Threading and doubling the number of processes running on the cluster simultaneously increases the utilization of the system's execution-unit resources, so performance can improve. On the other hand, overhead can also be introduced in several ways.
Figure 1. The HPL Linpack performance results on a single node. The thick lines are the results of 1-physical-processor runs, and the thin lines are the results of 2-physical-processor runs, with HT (Hyper-Threading) disabled and enabled respectively.
Our testing environment is a cluster consisting of 8 Dell PowerEdge 4600 servers interconnected with Myrinet. Each PowerEdge 4600 has two Intel Pentium 4 Xeon processors running at 2.2 GHz with 512 KB L2 cache and 2 GB of 400 MHz DDR RAM. The chipset of the PowerEdge 4600 is the ServerWorks GC-HE, which accommodates up to twelve registered DDR (double data rate) 200 (PC1600) DIMMs, each with a capacity of 128 MB to 1 GB, in a 4-way interleaved memory architecture. Each of the two PCI-X controllers on the 4600 has its own dedicated 1.6 GB/s full-duplex connection to the North Bridge to accommodate the peak traffic generated by the PCI-X buses it controls.
Figure 2. The HPL Linpack performance results of single-CPU cluster runs (left) and dual-CPU cluster runs (right), from 1 node to 8 nodes. The memory used (the problem size) for each run is based on the principle of 1 GB of RAM per physical processor.
To understand the impact of Hyper-Threading on a single compute node, we first conducted a series of HPL runs from small problem sizes to large ones. The results shown in Figure 1 indicate that only when the problem size is sufficiently large, a 1600x1600 grid or more, do we see a modest performance improvement (around 5%) in the Hyper-Threading configurations. Also note that, for runs where the number of processes exceeds the number of physical processors and Hyper-Threading is disabled, performance is considerably worse.
We then ran HPL on the cluster with different combinations of compute nodes and CPU counts, and compared the results with Hyper-Threading disabled and enabled. Figure 2 shows the performance results of single-physical-CPU runs on the left and dual-physical-CPU runs on the right. The performance gains from Hyper-Threading are larger on the single-physical-CPU configurations than on the dual-physical-CPU configurations, approximately 10% vs. 5%. This is because the overheads mentioned above are more severe for configurations with four processes on each node when Hyper-Threading is enabled.
Although our preliminary study indicated that Linpack-type applications can benefit from Hyper-Threading, we have seen mixed results when running the NAS Parallel Benchmarks suite. For example, IS (Integer Sort), a communication-bound benchmark, showed about a 10% degradation in performance when Hyper-Threading was applied to 8x2 runs (8 nodes with 2 CPUs on each node), while EP (Embarrassingly Parallel), a CPU-bound benchmark, showed a 40% performance improvement. These two benchmarks are the two extreme cases of the NAS suite. IS is unique in that it involves no floating-point arithmetic but requires significant data communication. With Hyper-Threading enabled, doubling the number of IS processes running on each node creates much more communication overhead among processes and memory contention inside the nodes, while the CPUs' floating-point execution units remain underutilized. Hence, performance degrades. EP, on the other hand, mainly performs floating-point calculations and requires almost no communication during the runs. With Hyper-Threading enabled, running EP can fully utilize the CPU resources of the cluster without concern for communication overhead, which increases performance significantly.
In summary, Hyper-Threading can improve the performance of some MPI applications on Linux clusters, but not all. Depending on the cluster configuration and, most importantly, the nature of the application running on the cluster, the performance gain can vary or even be negative. Our next step is to use performance tools to understand which areas contribute to the performance gains and which contribute to the overheads that lead to performance degradation.
 D. T. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal, Vol. 6, Issue 01, February 2002.
 W. Magro, P. Petersen, S. Shah, “Hyper-Threading Technology: Impact on Compute-Intensive Workloads”, Intel Technology Journal, Vol. 6, Issue 01, February, 2002.
 “HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers”, http://www.netlib.org/benchmark/hpl/
|Title||A Performance Comparison of Myrinet Protocol Stacks|
|Author(s)||Ron Brightwell, Bill Lawry, Michael Levenhagen, Arthur B. Maccabe|
|Author Inst||University of New Mexico, USA|
|Abstract||Recent advances in user-level networking technology for cluster computing have led to significant improvements in achievable latency and bandwidth performance. Many of these improvements are based on an implementation strategy called Operating System Bypass, or OS-bypass, which attempts to increase network performance and reduce host CPU overhead by offloading communication operations to intelligent network interfaces. These interfaces, such as Myrinet, are capable of moving data directly from an application's address space without any involvement of the operating system in the data transfer.
Unfortunately, the reduction in host CPU overhead, which has been shown to be the most significant factor affecting application performance, has not been realized in most implementations of MPI for user-level networking technology. While most MPI microbenchmarks can measure latency, bandwidth, and host CPU overhead, they fail to accurately characterize the actual performance that applications can expect. Communication microbenchmarks typically focus on message-passing performance relative to the peak performance of the network and do not characterize the performance impact of message passing relative to both the peak performance of the network and the peak performance of the host CPU(s).
We have designed and implemented a portable benchmark suite called COMB, the Communication Offload MPI-based Benchmark, that measures the ability of an MPI implementation to overlap computation and MPI communication. The ability to overlap is influenced by several system characteristics, such as the quality of the MPI implementation and the capabilities of the underlying network transport layer. For example, some message passing systems interrupt the host CPU to obtain resources from the operating system in order to receive packets from the network. This strategy is likely to adversely impact the utilization of the host CPU, but may allow for an increase in MPI bandwidth. We believe our benchmark suite can provide insight into the relationship between network performance and host CPU performance in order to better understand the actual performance delivered to applications.
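The core metric such a suite reports can be phrased simply. The following is a schematic calculation, not COMB's actual definition: compare how much host-CPU work completes while communication is in flight against how much completes with the network quiet.

```python
def cpu_availability(work_done_during_comm, work_done_idle_network):
    """Fraction of the host CPU left to the application while messages move.
    1.0 means communication is fully offloaded to the NIC; values near 0
    mean the host CPU is consumed by the protocol stack (for example, by
    interrupt-driven receives). Schematic metric, not COMB's definition."""
    return work_done_during_comm / work_done_idle_network

# Example (invented numbers): 7.2M loop iterations completed during a
# message exchange vs. 8.0M with the network quiet -> 90% availability.
print(cpu_availability(7.2e6, 8.0e6))  # 0.9
```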
In this paper, we will compare four different network protocol stacks on Myrinet using identical computing and networking hardware. In addition to MPI microbenchmark results for latency and bandwidth, we will gather results from the COMB suite and characterize CPU availability and the ability to overlap computation with MPI communication. We expect this direct comparison to provide insight into how performance is impacted at each layer of the protocol stack. The following describes the four different protocol stacks that we will compare.
At the highest level, all of our benchmark tests are coded using MPI. This gives us the maximum amount of portability between systems and establishes a baseline for comparison. Below MPI, we will compare systems that use the GM message-passing API from Myricom and our Portals 3.0 API.
The Portals 3.0 API is composed of elementary building blocks that can be combined to implement a wide variety of higher-level data movement layers. We have tried to define these building blocks and their operations so that they are flexible enough to support other layers as well as MPI. Data movement in Portals is based on one-sided operations. Other processes can use a Portal index to read (get) or write (put) the memory associated with the portal. Each data movement operation involves two processes, the initiator and the target. In a put operation, the initiator sends a put request containing the data to the target. The target translates the Portal addressing information using its local Portal structures. When the request has been processed, the target may send an acknowledgement. In a get operation, the initiator sends a get request to the target. The target translates the portal addressing information using its local Portal structures and sends a reply with the requested data.
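To make the one-sided semantics concrete, here is a toy model of the put and get flows just described. It is plain Python, not the Portals 3.0 C API; memory regions are modeled as a dict keyed by Portal index, and all names are invented:

```python
class Node:
    """Toy model of a Portals-style one-sided endpoint: memory regions are
    addressed by a Portal index, and the target never posts a receive call."""
    def __init__(self):
        self.portals = {}          # Portal index -> stored data

    def handle_put(self, index, data, want_ack):
        # Target translates the Portal index via its local structures,
        # stores the data, and may send an acknowledgement.
        self.portals[index] = data
        return "ack" if want_ack else None

    def handle_get(self, index):
        # Target replies with the requested data.
        return self.portals[index]

def put(initiator, target, index, data, want_ack=True):
    """Initiator sends a put request carrying the data to the target."""
    return target.handle_put(index, data, want_ack)

def get(initiator, target, index):
    """Initiator sends a get request; the target replies with the data."""
    return target.handle_get(index)

a, b = Node(), Node()
assert put(a, b, index=3, data=b"hello") == "ack"   # write b's memory
assert get(a, b, index=3) == b"hello"               # read it back
```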
For our initial implementation of Portals 3.0 on Myrinet, we chose a simple RTS/CTS protocol called RMPP (Reliable Message Passing Protocol). RMPP supports end-to-end flow control and message-level retransmission in the event of a dropped or corrupt data packet. In this protocol, all management decisions are made by the receiver when handling an RTS packet. By the time the receiver has generated a CTS packet, it has made all of the significant management decisions, including where the cleared data packets will be placed in the application's address space and how much buffer space the network interface will need for incoming packets.
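A minimal sketch of this receiver-driven RTS/CTS exchange with message-level retransmission is shown below. It is illustrative only; the packet size, loss model, and function names are invented, not RMPP's wire protocol:

```python
def send_message(data, packet_size, dropped_first_try):
    """Toy receiver-managed RTS/CTS transfer with retransmit-on-loss.
    `dropped_first_try` is a set of packet numbers lost on first send.
    Illustrative sketch only, not RMPP's actual wire protocol."""
    # Sender: RTS announces the message length.
    rts = {"length": len(data)}
    # Receiver: on RTS it makes all placement decisions, then returns CTS.
    dest = bytearray(rts["length"])      # where cleared data will land
    cts = {"granted": True}
    assert cts["granted"]
    # Sender streams data packets; lost packets are retransmitted.
    packets = [data[i:i + packet_size] for i in range(0, len(data), packet_size)]
    attempts = 0
    for num, pkt in enumerate(packets):
        attempts += 1
        if num in dropped_first_try:     # loss detected: one retransmission
            attempts += 1
        dest[num * packet_size : num * packet_size + len(pkt)] = pkt
    return bytes(dest), attempts

msg, tries = send_message(b"reliable message passing", 8, dropped_first_try={1})
assert msg == b"reliable message passing"
assert tries == 4   # 3 packets, one of them retransmitted
```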
Portions of the RMPP can be offloaded onto intelligent or programmable network interfaces. Our current implementation is a Linux kernel module that interfaces into the Linux networking stack, making it possible to use any network device that Linux supports. We are currently developing an enhanced version of this protocol that offloads parts of the protocol to the Myrinet network interface. We expect this offloading to offer significant performance improvements.
In addition to these two RMPP-based implementations of Portals, we also have a Myrinet Control Program that implements Portals on top of a reliability protocol modeled after the one used for the wormhole network on the ASCI/Red machine. In this implementation, all of the reliability and flow control protocols are implemented in the Myrinet NIC, while Portals processing can occur either on the NIC or in the kernel. We will discuss the design of this protocol and contrast it with the approaches taken in RMPP.
Finally, we compare results from these protocols with the standard Myrinet GM and MPICH/GM software stack available from Myricom.
|Title||Delivering Linux Cluster Solutions to the High-Performance Marketplace|
|Abstract||IBM continues to deliver many of the most powerful Linux clusters in the world, and has an established track record in building several "top 500" clusters. In this presentation, we will survey the current set of offerings around Linux clusters from IBM, with a particular focus on the IBM eServer Linux Cluster 1350 that was unveiled at LinuxWorld Expo in San Francisco earlier this summer. This presentation will also highlight how IBM is working with key Partners and ISVs in the High Performance Computing arena to deliver solutions that meet the demanding needs of these customers.|
|Title||Designing Supercomputers that Fit Your Needs|
|Abstract||The architecture wars in High Performance Computing are nearing their logical conclusion. You can find plenty of confusing data to muddle the issue, but the overall direction of HPC is clear. Clusters are where people are investing their time and money for supercomputing. The reason for this state of affairs is crystal clear: the economics of the semiconductor industry, combined with the technology treadmill called Moore's Law, make it all but impossible for any other supercomputer architecture to compete with clusters.
In this talk, I will explore the state of cluster computing today and where it is going tomorrow. We will take a look at the next generation building-blocks available to the cluster designer and how these building-blocks let us build supercomputers that are carefully tuned to our specific needs (and price-points).
This is not going to be a pure "cheerleading" talk, however. In addition to "singing the praises of clusters", I will also discuss the challenges presented by clusters. Clusters are not as easy to use as vector supercomputers. Clusters are poorly balanced with respect to moving data through the memory/storage hierarchy. We can use these challenges to make excuses and justify the need to prop up failed supercomputing business models. Or we can use these challenges to motivate the algorithm work required to make clusters work across the full spectrum of HPC problems. It all comes down to whether you want to embrace the future, or hang on to the past.