 |
2002 Abstracts
Plenary Presentations, Cluster
Busters, Programming Models, Applications
and Tools Track, Systems Administration Track,
and Vendor Presentations Abstracts
Last updated: 11 October 2002
| Plenary Presentations
|
| Plenary Session I
|
Title |
|
Seismic & Linux: From
Curiosity to Commodity |
Author(s) |
|
Dr. Mike Turff |
Author Inst |
|
PGS Data Processing |
Presenter |
|
Dr. Mike Turff |
| Abstract |
|
The seismic processing business has always required
massive amounts of compute capacity and has historically been
an early adopter of anything providing price/performance improvements
in this field. This has been exacerbated by the competitive nature
of this business, the downward pressure on prices from the oil
majors and the tender & bid process that decides the allocation
of the available work. In this talk, I review the history of computer
systems in PGS, the move to Linux clusters and the new applications
that this price/performance step has enabled our industry to bring
to the seismic processing market. I will talk from a business
perspective about the acceptance of Linux clusters and show that,
for the right applications, Linux clusters can be regarded as
off-the-shelf commodities. |
|
|
| Plenary Session II
|
Title |
|
Scaling Up and Scaling
Down |
Author(s) |
|
Dr. Daniel A. Reed |
Author Inst |
|
National Center for Supercomputing Applications
(NCSA) |
Presenter |
|
Dr. Daniel A. Reed |
| Abstract |
|
The continuum of computing continues to expand,
ranging from mobile, embedded (or soon implantable) devices to
terascale and petascale clusters. New applications of distributed
sensors (e.g., environmental or structural monitoring, health
and safety) pose new software challenges. How do we manage millions
or even billions of mobile devices, each with power, weight, processing,
and memory constraints? At the other end of the continuum, large-scale
science and engineering applications demand ever faster and larger
clustered systems. How do we design, package, and support systems
with hundreds of thousands of processors in a reliable way? This
talk sketches some of the technical challenges and opportunities
in realizing the ubiquitous infosphere of interconnected systems,
highlighted by recent developments at NCSA and the U.S. TeraGrid.
|
|
|
| Plenary Session III
|
Title |
|
Challenges of Terascale
Computing |
Author(s) |
|
Dr. William Pulleyblank |
Author Inst |
|
IBM Research |
Presenter |
|
Dr. William
Pulleyblank |
| Abstract |
|
We are rapidly moving into the domain of terascale
computing, which includes the ASCI program, and the Blue Gene
project. In addition to the significant hardware problems to be
solved, we will also face software issues that are at least as
large. I will discuss these in the context of Blue Gene, a 180
teraflop/s supercomputer being built at IBM Research which will
run Linux as its operating system. Enabling Linux for this
platform has necessitated a number of innovations in order to
achieve the targeted levels of performance. We also discuss issues
of reliability and availability, the so-called autonomic issues.
We relate this to current activity on grid computing. |
|
|
| Cluster
Busters
|
| Title |
|
Scalability
and Performance of Salinas on the Computational Plant |
| Author(s) |
Manoj Bhardwaj, Ron Brightwell, Garth
Reese |
| Author Inst |
University of New Mexico, USA |
| Presenter |
Ron Brightwell
|
| Abstract |
Parallel computing platforms composed of commodity
personal computers
(PCs) interconnected by gigabit network technology are a viable
alternative to traditional proprietary supercomputing platforms.
Small- and medium-sized clusters are now ubiquitous, and larger-scale
procurements, such as those made recently by NSF for the Distributed
Teraflops Facility and by Pacific Northwest National Lab, are
becoming more prevalent. The cost effectiveness of these platforms has
allowed for larger numbers of processors to be purchased.
In spite of the continued increase in the number of processors, few
real-world application results on large-scale clusters have been
published. Traditional large parallel computing platforms have
benefited from many years of research, development, and experience
dedicated to improving their scalability and performance. PC clusters
have only recently started to receive this kind of attention.
In order for clusters of PCs to compete with traditional proprietary
large-scale platforms, this type of knowledge and experience may
be crucial as larger clusters are procured and constructed.
The goal of the Computational Plant (Cplant) project at Sandia
National Laboratories is to provide a large-scale, massively parallel
computing resource composed of commodity-based PCs that not only
meets the level of compute performance required by Sandia's key
applications, but that also meets the levels of usability and
reliability of past traditional large-scale parallel machines.
Cplant
is a continuation of research into system software for massively
parallel computing on distributed-memory message-passing machines.
We
have transitioned our scalable system software architecture developed
for large-scale massively parallel processing machines to commodity
clusters of PCs.
In this paper, we examine the performance and scalability of
one of
Sandia's key applications on Cplant. We present performance and
scalability results from Salinas, a general-purpose, finite element
structural dynamics code designed to be scalable on massively
parallel
processing machines. Salinas offers static analysis, direct implicit
transient analysis, eigenvalue analysis for computing modal response,
and modal superposition-based frequency response and transient
response. In addition, semi-analytical derivatives of many response
quantities with respect to user-selected design parameters are
calculated. It also includes an extensive library of standard
one-,
two-, and three-dimensional elements, nodal and element loading,
and
multi-point constraints. Salinas solves systems of equations using
an
iterative, multilevel solver, which is specifically designed to
exploit massively parallel machines. Salinas uses a linear solver
that
was selected based on the criteria of robustness, accuracy,
scalability and efficiency.
Salinas employs a multilevel domain decomposition method, Finite
Element Tearing and Interconnect (FETI), which has been characterized
as the most successful parallel solver for the linear systems
applicable to structural mechanics. FETI is a mature solver, with
some
versions used in commercial finite element packages. For plates
and
shells, the singularity in the linear systems has been traced
to the
subdomain corners. To solve such linear systems, an additional
coarse
problem is automatically introduced that removes the corner point
singularity. FETI is scalable in the sense that as the number
of
unknowns increases and the number of unknowns per processor remains
constant, the time to solution does not increase. Further, FETI
is
accurate in the sense that the convergence rate does not deteriorate
as iterates converge. Finally the computational bottleneck in
FETI, a
sparse direct subdomain solve, is amenable to high performance
solution methods. An eigensolver was selected for Salinas based
on
robustness, accuracy, scalability and efficiency.
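The scalability criterion stated above (time to solution stays flat as the number of unknowns grows with the processor count, at fixed unknowns per processor) is what is usually called weak scaling. A minimal sketch of how one might check it from timing data follows. The function name is hypothetical and the timings are made-up placeholders, not Salinas or Cplant measurements.

```python
def weak_scaling_efficiency(timings):
    """Return {procs: efficiency} for runs at fixed unknowns per processor.

    timings maps processor count -> wall-clock time to solution.
    Efficiency is t(smallest run) / t(p); 1.0 means the time did not grow.
    """
    if not timings:
        raise ValueError("need at least one timing")
    base_procs = min(timings)
    base_time = timings[base_procs]
    return {p: base_time / t for p, t in sorted(timings.items())}


# Placeholder numbers for illustration only (not measured results).
example = {64: 100.0, 128: 100.0, 256: 125.0}
efficiency = weak_scaling_efficiency(example)
```

An efficiency that stays near 1.0 as the processor count grows is the behavior the abstract attributes to FETI.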
In this paper, we will describe in detail the hardware architecture
of
our 1792-node Cplant cluster. This machine is composed of
single-processor Compaq Alpha workstations connected by a Myrinet
gigabit network in a three-dimensional mesh topology. This machine
has many unique characteristics, such as the ability to switch
a large
portion of the machine from any one of four network "heads"
to
another, allowing for greater flexibility in responding to user
demand
for classified versus unclassified computing.
In addition to hardware, we describe the system software environment
of our cluster. Linux is used as the base operating system for
all
nodes in the cluster, and we have designed and developed software
for
the parallel runtime system, high-performance message passing,
and
parallel I/O. We will discuss the details of this software
environment that influence the scalability and performance of
applications.
Finally, we describe the performance and scalability of Salinas
on up
to 1000 nodes of Cplant. We will also discuss some of the challenges
related to scaling Salinas beyond several hundred processors and
how
these challenges were addressed and overcome. We will show that the
performance and scalability of Salinas on Cplant are comparable to
those on proprietary large-scale parallel computing platforms. |
|
|
Title |
Real Time Response to
Streaming Data on Linux Clusters |
Author(s) |
Beth Plale, George Turner, and Akshay Sharma |
Author Inst. |
Indiana University |
Presenter |
George Turner
|
Abstract |
Beowulf-style cluster computers, which have traditionally
served batch jobs via resource managers such as the Portable Batch
System (PBS), are now under pressure from the scientific community
to support real time response to data stream applications, that
is, timely response to events streamed from remote sensors or
from humans interactively exploring the parameter space. Cluster management
must be agile and timely in its response to new resource demands
imposed by the occurrence of real-world events. The work
described in this paper quantitatively evaluates two approaches
to integrating support for streaming applications into the framework
of a PBS managed cluster. One approach works
cooperatively with PBS to reallocate computational resources to
meet the needs of real time data analysis. The second approach
works outside of the PBS context utilizing the real time scheduling
capabilities of the Linux kernel to acquire the resources necessary
for its real time analysis.
Introduction
Beowulf-style clusters have traditionally used resource managers
such as PBS Pro [6] to optimize the utilization of a cluster’s
resources. Today there is a growing need for these clusters to
be responsive to non-deterministic event driven data streams.
Upon recognition of an application specific event occurring outside
the realm of PBS and the cluster, the necessary CPU, memory, network,
and file system resources must be directed toward a timely response.
Most production Beowulf clusters use batch scheduling to ensure
fairness among users and to maximize a cluster’s throughput.
But these objectives are at odds with the dynamic nature of data
stream processing [5, 4, 1, 2], where resource needs cannot be
predicted in advance, and any projection of duration of usage
can only be estimated because of the variability of real-world
events such as storms, earthquakes, etc.
This paper addresses the need for integrated support for streaming
applications in a batch job scheduled system. Through a quantitative
evaluation of startup time in response to an environmental event,
we are exploring the solution space for real-time responsiveness
to data streaming.
Motivation
An atmospheric scientist at IU studies atmosphere-surface exchange
of trace chemicals such as CO2, water, and energy from a variety
of ecosystems. These atmospheric flux measurements generate large
streams of data; sensors operating at rates up to 10 Hz can generate
data at rates of hundreds of MB/day/sensor. The remote sensors
use wireless 802.11b to communicate to a central computer remotely
located within the forest. The remote computer is connected via
a T1 line to the IUB campus network. Under normal operation the
data from the remote sensors are collected at the remote site
and streamed to IU at a nominal rate of 1-3 Hz where the event
data are analyzed for particular meteorological conditions. When
such a condition is detected, the event rate is increased and
additional CPU cycles are enlisted to handle the high fidelity
analysis.
This work is being carried out on the proto-AVIDD facility on
the Bloomington campus of Indiana University. The cluster consists
of five (5) dual Pentium-IV Xeon nodes, two (2) Itanium dual
processor IA64 nodes, and a RAID-5 filesystem NFS served internally
to the cluster. Proto-AVIDD runs RedHat 7.2 with cluster management
tools provided by the Oscar 1.2 project. Proto-AVIDD is a testbed
cluster used to study the issues of interactive and stream data
analysis in a production-level Linux cluster.
AVIDD (Analyzing and Visualizing Instrument Driven Data) will
be a geographically distributed data analysis cluster sited at
three of the IU campuses: Bloomington (IUB), Indianapolis (IUPUI),
and Gary (IUN). Due to ongoing vendor negotiations, an explicit
description of the AVIDD cluster is not possible at this time;
however, it is expected to be in the range of 0.5 to 1 TFLOPS aggregate,
with 6 TBytes of distributed cluster storage, and will leverage IU’s
existing HPSS based massive data storage system. Local intra-cluster
networking will be provided by a Myrinet fabric, with the inter-cluster
networking utilizing Indiana’s 1 Tb/s I-Light infrastructure.
By the time of the conference, it is anticipated that initial
experiences and early results will be available from the production
AVIDD cluster.
Approach
Application events streaming from remote sources must be analyzed
in real time in order to achieve timely recognition of scientifically
interesting meteorological conditions. Our approach is to employ
a user space daemon resident on a dedicated communications node
within the AVIDD cluster to receive the data stream and monitor
it for interesting events. Under normal conditions, the daemon
simply drops the events into an event cache implemented as a circular
buffer. When a meteorological event occurs, the daemon responds
first by sending a message to the remote computer commanding it
to reconfigure the sensor system for higher fidelity data acquisition,
and second by enlisting additional resources within the cluster
to process the higher fidelity data streams. This startup scenario
is depicted in Figure 1.
Figure 1. Data stream processing response to application
event of interest.
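The daemon behavior described above (cache every event in a circular buffer, then react when an interesting event arrives) can be sketched as follows. This is an illustration only, not the authors' daemon: the class and function names are hypothetical, events are plain numbers, and a threshold test stands in for the meteorological condition. The `on_trigger` callback is a stand-in for both responses named in the text (reconfiguring the sensors and enlisting cluster resources).

```python
from collections import deque


class EventCache:
    """Fixed-size circular buffer: the oldest event is dropped when full."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)

    def add(self, event):
        self._events.append(event)

    def snapshot(self):
        return list(self._events)


def monitor(stream, cache, is_interesting, on_trigger):
    """Cache every event; call on_trigger once at the first interesting one.

    Returns the index of the triggering event, or None if none occurred.
    """
    for index, event in enumerate(stream):
        cache.add(event)
        if is_interesting(event):
            on_trigger(cache.snapshot())
            return index
    return None


# Hypothetical usage: the value 9 plays the role of the event of interest.
cache = EventCache(capacity=3)
triggered = []
hit = monitor([1, 2, 3, 4, 9, 5], cache, lambda e: e > 8, triggered.append)
```

Passing the cache snapshot to the callback means the high fidelity analysis can start from the most recent history rather than from an empty buffer.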
We are exploring the solution space for real-time responsiveness
to data streaming through a quantitative evaluation of startup
time in response to an environmental event. Our first approach
is to work within the PBS domain to allocate resources and provide
job startup services in response to the detection of developing
real-world phenomena. The second approach explores a ‘bullying’
relationship with PBS in which the real time scheduling features
of the Linux kernel are utilized to preempt resources already
allocated by PBS. The full paper will quantify response times
under both approaches and explore interesting issues involved
in each approach.
Startup Within Context of PBS
Resource managers such as PBS have a coherent view of a cluster’s
state. They also have the tools necessary to stop, start, and signal
jobs running under their control. It would appear to be a natural
choice to use
a resource manager’s preexisting infrastructure to implement
cluster resource reallocation. In the final paper we will evaluate
the steps involved in the startup of the high fidelity processes
as well as the benefits and shortcomings of this approach.
Linux Real-time Scheduling Startup
High fidelity processes are scheduled and started up outside the
PBS domain using normal Linux process control mechanisms. If these
data analysis tasks are scheduled by the kernel scheduler as real
time processes, they will be assured of being scheduled on the CPUs
any time they are runnable. Resources already used by a previous PBS job would
only be consumed as needed. For example, memory previously used
by the PBS job would be swapped out only as needed. In the final
paper we will evaluate the steps involved in the startup of the
high fidelity processes as well as the benefits and shortcomings
of this approach.
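As a rough illustration of the second approach, a process on Linux can ask the kernel to place it in a real time scheduling class. The sketch below uses Python's `os.sched_setscheduler` wrapper with the `SCHED_FIFO` policy; the abstract does not say which interface or policy the authors use, so both are assumptions, as are the function name and the priority value of 10.

```python
import os


def request_realtime(priority=10):
    """Try to move the calling process to the SCHED_FIFO real time class.

    Returns True on success and False if the platform lacks the call or
    the process lacks the privilege (typically root or CAP_SYS_NICE).
    """
    if not hasattr(os, "sched_setscheduler"):
        return False
    try:
        # pid 0 means "the calling process".
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    return True


became_realtime = request_realtime()
```

A process scheduled this way runs ahead of ordinary time-shared processes whenever it is runnable, which is the preemption effect the text relies on.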
Other Issues
Issues which will be touched upon but not
discussed in detail include SSH, Globus Toolkit [3], MPI, process
preemption, and process checkpointing. A general discussion of the
various steps involved in the data acquisition, data transmission,
and data disposition will be provided as an introductory foundation.
[1] Shivnath Babu and Jennifer Widom. Continuous queries over
data streams. In International Conference on Management of Data
(SIGMOD), 2001.
[2] Fabián E. Bustamante and Karsten Schwan. Active I/O
streams for heterogeneous high performance computing. In Parallel
Computing (ParCo) 99, Delft, The Netherlands, August 1999.
[3] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure
toolkit. International Journal of Supercomputer Applications,
11(2):115–128, 1997.
[4] Sam Madden and Michael J. Franklin. Fjording the stream: An
architecture for queries over streaming sensor data. In International
Conference on Data Engineering ICDE, 2002.
[5] Beth Plale and Karsten Schwan. dQUOB: Managing large data
flows using dynamic embedded queries. In Proc. 9th IEEE Intl.
High Performance Distributed Computing (HPDC), Los Alamitos, CA,
August 2000. IEEE Computer Society.
[6] PBS Pro. Portable Batch System. http://www.pbspro.com/, 2002. |
|
| Programming
Models
|
| Title |
Remote Memory Operations
on Linux Clusters Using the Global Arrays Toolkit, GPSHMEM and
ARMCI |
| Author(s) |
Chona S. Guiang, A. Purkayastha and K. F. Milfeld |
| Author Inst |
Texas Advanced Computing Center, University of
Texas - Austin, USA |
| Presenter |
Chona S. Guiang
|
| Abstract |
Remote memory access (RMA) is a communication
model that offers different functionalities from those provided
in shared memory programming and message passing. The use of one-sided
communication exhibits some of the advantages of shared memory
programming such as direct access to global data, without suffering
from the limited scalability that characterizes SMP applications.
Moreover, RMA has advantages over MPI in two important respects:
1. Communications overhead: MPI send/receive
routines generate two-way communications traffic (“handshaking”)
to ensure accuracy and coherency in data transfers, a step that is often
unnecessary. In some scientific applications (e.g., some finite
difference and molecular dynamics codes), the communication workload
involves accessing data that does not require explicit cooperation
between the communicating processes (e.g., writes to non-overlapping
locations in memory). Remote memory operations in ARMCI [1], Global
Arrays (GA) Toolkit [2], and GPSHMEM [3] decouple the data transfer
operation from handshaking and synchronization. There is no extra
coordination required between processes in a one-sided put/get
operation, so there is less of the enforced synchronization that
is sometimes unnecessary. Additional calls to synchronization
constructs may be included in the application as needed.
2. API: The distributed data view in MPI is
less intuitive to use in scientific applications that require
access to a global array. In contrast, the GA API provides a global
data view of the underlying distributed data that does not distinguish
between accessing local versus remote array data.
The Global Arrays (GA) Toolkit provides an SMP-like development
environment for applications on distributed computing systems.
GA implements a global address space, which makes it possible
for elements in distributed arrays to be accessed as if they were
in shared memory. The details of actual data distribution are
hidden, but are still within user control through utility functions
included in the library. MPI and GA library calls can be combined
in a single program, as GA is designed to complement rather than
replace existing implementations of message passing libraries.
The GA library consists of routines that implement one-sided operations,
interprocess, collective array operations, and utility and diagnostic
operations (e.g., for querying available memory, data locality).
As the name implies, GA implements a global address space for
arrays of data, not scalars. Therefore, GA one-sided communications
are designed to access arrays of data (double, integer, and double
complex), not arbitrary memory addresses.
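To make the idea of a global data view concrete, here is a toy, single-process sketch in which a block-distributed array is read and written through global indices. It does not use the GA API; the class and method names are hypothetical, and each "process" is simply a Python list rather than remote memory.

```python
class ToyGlobalArray:
    """Block-distributed 1-D array with a global index space.

    Each "process" is just a Python list here; a real one-sided library
    would move the data over the network instead.
    """

    def __init__(self, length, nprocs):
        self.length = length
        self.block = -(-length // nprocs)  # ceiling division
        self.local = [
            [0.0] * max(0, min(self.block, length - p * self.block))
            for p in range(nprocs)
        ]

    def locate(self, i):
        """Map a global index to (owner process, local offset)."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        return divmod(i, self.block)

    def put(self, lo, values):
        """Write values to global indices lo, lo+1, ... wherever they live."""
        for k, v in enumerate(values):
            owner, offset = self.locate(lo + k)
            self.local[owner][offset] = v

    def get(self, lo, hi):
        """Read global indices lo..hi-1 without regard to ownership."""
        result = []
        for i in range(lo, hi):
            owner, offset = self.locate(i)
            result.append(self.local[owner][offset])
        return result


ga = ToyGlobalArray(length=10, nprocs=4)   # blocks of 3, 3, 3, 1
ga.put(2, [1.0, 2.0, 3.0])                 # spans processes 0 and 1
```

The caller of `put` and `get` never names an owning process, while `locate` plays the role of the utility functions that expose the actual distribution when it is needed.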
GPSHMEM is a portable implementation of the Cray Research Inc.
SHMEM [4], which is a one-sided communication library for Cray
and SGI systems. Like GA, GPSHMEM implements a global address
space for accessing distributed data, but GPSHMEM contains support
for more data types. GPSHMEM operations are limited to symmetric
data objects, objects for which local and remote addresses have
a known relationship (e.g., arrays in common blocks). The library
routines provide the following functionalities: 1) noncollective
RMA operations including those for strided access, and atomic
operations; 2) collective operations; 3) parallel environment
query functions. In addition to the supported Cray SHMEM routines,
GPSHMEM has added block strided get/put operations for 2D arrays.
All RMA operations in GA and GPSHMEM use the Aggregate Remote
Memory Copy Interface (ARMCI) library. ARMCI is a portable,
one-sided communication library. On Linux clusters, ARMCI relies
on low-level network communication layers (e.g., Elan for Quadrics,
GM for Myrinet) to implement remote memory copies, while utilizing
the native MPI implementation for collective communication operations.
Noncontiguous data transfers are supported in ARMCI, which facilitates
access of sections of multidimensional arrays in GA and 2D arrays
in GPSHMEM.
In general, MPI-2 RMA is more difficult to use than GA and GPSHMEM
because of the greater number of steps involved. A single
one-sided operation includes setting up a ‘window’
for the group of communicating processes, moving the data, and
ensuring completion of data transfer (among processes in the group)
using different synchronization constructs for “active”
or “passive” targets. RMA operations with passive
targets are serialized, which degrades performance. Notwithstanding
these caveats, MPI one-sided communication is part of the MPI-2
standard and consequently is more likely to gain support on a
greater number of platforms than GA, GPSHMEM, and ARMCI.
Previous work by Nieplocha et al. [5, 6] on implementing ARMCI
on Linux clusters has demonstrated that the ARMCI put/get operations
outperform the MPICH-GM implementation of MPI 1.1 send and receive
for both strided and contiguous data access. We have conducted
additional performance measurements of ARMCI put/get versus MPI
1.1 send/receive on the TACC IA32 and IA64 clusters, using a simple
“halo exchange” prototype where each process exchanges
contiguous and strided data with its nearest neighbors on a 1-D
process grid. Results, as well as a further description of
the experiments, are shown in Figures 1 and 2. For contiguous data
transfers, the ARMCI put/get operations exhibit better performance
than the equivalent MPI 1.1 send/receive over the entire range
of message sizes considered on the IA64 system. On the IA32 cluster,
the ARMCI put and get operations outperform MPI send and receive
for message lengths exceeding 8KB. On both systems, ARMCI put/get
perform better than MPI send/receive for transfer of strided data
over the entire range of message sizes. Future work will perform
the following experiments to address important issues of RMA operations:
1. Latency and bandwidth measurements over
different message sizes will be conducted for put/get operations
as implemented in LAM-MPI versus ARMCI. The poor bandwidth obtained
with the LAM-MPI implementation of the MPI-2 put/get operations,
as shown in Figure 1, could be attributed to the fact that the
most recent stable version does not take full advantage of the
underlying GM layer. Our future study will use the most recent
version of LAM-MPI, which has a completely rewritten version of
its GM RPI (request progression interface) [7].
2. The transfer of contiguous array data in
GPSHMEM and GA will be compared against the corresponding point-to-point
and/or collective communication calls in MPI. These timing results
will be compared with ARMCI measurements to obtain the overhead
involved in the GA and GPSHMEM routines.
3. Applications characterized by dynamic communication
patterns, irregular data structures, and noncontiguous data transfers
are likely to benefit from the remote memory copy functionality
that GA and GPSHMEM provide. Molecular dynamics and finite difference
applications fit this description well, since both involve the
exchange of sections of multidimensional arrays at every time
step of the simulation. To test the efficiency and applicability
of GA and GPSHMEM for these types of applications, we will
evaluate the performance of a simple MD and finite difference
code using three different implementations for the transfer of
array sections --- corresponding to the use of GA, GPSHMEM, and
MPI point-to-point as well as collective communication functions.
The experience on these applications will also provide insight
on the general feasibility of using SMP-like programming interfaces
on distributed-memory systems.
All computational work will be performed on TACC’s Linux
clusters: a 32-node cluster of Pentium III-based dual-processor
SMPs and an Itanium-based cluster of 20 dual-processor SMP nodes.
Each cluster uses a Myrinet 2000 switch fabric.
Figures 1 and 2. The communication throughput of a “halo exchange”
prototype code is measured on four processors of the TACC IA32
(Figure 1) and IA64 clusters (Figure 2) using the MPICH-GM 1.2.1.7
implementation of MPI 1.1 send/receive; the latest version of
ARMCI; and the LAM-MPI 6.5.6 implementation of the MPI-2 put/get
operations.
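For readers unfamiliar with the term, the communication pattern of a halo exchange on a 1-D process grid can be sketched in a few lines. The version below is a single-process simulation, not the TACC prototype: the function name and data layout are assumptions, and the list assignments stand in for one-sided put operations into a neighbor's memory.

```python
def halo_exchange(subdomains):
    """One-sided style exchange of boundary cells on a 1-D process grid.

    Each subdomain is a list laid out as [left ghost, interior..., right ghost].
    Every process "puts" its edge interior values into its neighbours' ghost
    cells; the two ends of the grid keep their outer ghost cells unchanged.
    """
    nprocs = len(subdomains)
    for rank, cells in enumerate(subdomains):
        if rank > 0:                       # put first interior cell leftwards
            subdomains[rank - 1][-1] = cells[1]
        if rank < nprocs - 1:              # put last interior cell rightwards
            subdomains[rank + 1][0] = cells[-2]
    return subdomains


# Four processes, two interior cells each; None marks an unfilled ghost cell.
grid = [[None, 10 * r + 1, 10 * r + 2, None] for r in range(4)]
halo_exchange(grid)
```

Because each process writes only into its neighbors' ghost cells and reads only its own interior cells, no process needs the target's cooperation, which is the property that makes the pattern a natural fit for put/get.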
References:
[1] http://www.emsl.pnl.gov:2080/docs/parsoft/armci/
[2] http://www.emsl.pnl.gov:2080/docs/global/
[3] K. Parzyszek, J. Nieplocha, and R. Kendall, “A Generalized
Portable SHMEM Library for High Performance Computing,” in Proceedings
of the Twelfth IASTED International Conference Parallel and Distributed
Computing and Systems
[4] http://dynaweb.tacc.utexas.edu:8080/dynaweb/app_prog/004-2178-002/@Generic__BookTextView/2819;hf=0?DwebQuery=SHMEM#1
[5] J. Nieplocha, V. Tipparaju, A. Saify, D.K. Panda, "Protocols
and Strategies for Optimizing Performance of Remote Memory Operations
on Clusters," in Proc. Workshop Communication Architecture for
Clusters (CAC02) of IPDPS'02
[6] J. Nieplocha, J. Ju, E. Apra, “One-sided Communication
on Myrinet-based SMP Clusters using the GM Message-Passing
Library,” in Proc. CAC01 Workshop of IPDPS'01, San
Francisco, 2001
[7] http://www.lam-mpi.org/MailArchives/lam/msg04345.php |
| Title |
Supercomputing in Plain
English: Teaching High Performance Computing to Inexperienced
Programmers |
| Author |
H. Neeman, J. Mullen, L. Lee & G.K. Newman |
| Author Inst |
Oklahoma University, USA and Worcester Polytechnic
Institute |
| Presenter |
Henry Neeman
|
| Abstract |
Although the field of High Performance Computing
(HPC) has been evolving rapidly, the development of standardized
software systems -- MPI, OpenMP, LAPACK, PETSc and others -- has
in principle made HPC accessible to a wider community. However,
relatively few physical scientists and engineers are taking advantage
of computational science in their research, in part because of
the considerable degree of sophistication about computing that
HPC appears to require. Yet the fundamental concepts of HPC are
fairly straightforward and have remained relatively stable over
time. These facts raise an important question: can scientists
and engineers with relatively modest computing experience learn
HPC concepts well enough to take advantage of them in their research?
To answer this question, HPC educators must address several issues.
First, what are the fundamental issues and concepts in computational
science? Second, what are the fundamental issues and concepts
in HPC? Third, how can these ideas be expressed in a manner
that is clear to a person with relatively modest computing experience?
Finally, is classroom exposure sufficient, or is guidance required
to assist investigators in incorporating HPC in their research
codes?
A helpful way to describe the fundamental issues of computational
science is as a chain of abstractions: phenomenon, physics, mathematics
(continuous), numerics (discrete), algorithm, implementation,
port, solution, analysis and verification. In general, physics,
mathematics and numerics are addressed well by existing science
and engineering curricula -- though often in isolation from one
another -- and therefore instruction should be provided on issues
relating primarily to the later items, and on the interrelationships
between all of them. For example, algorithm choice is a fundamental
issue whose gravity many researchers with less computing experience
do not appreciate; a common mistake is to solve a linear system
by inverting the matrix of coefficients, without regard for performance,
conditioning, or exploitation of the properties of the matrix.
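As one concrete instance of exploiting the properties of the matrix, a tridiagonal system can be solved directly in O(n) operations without ever forming an inverse. The sketch below uses the Thomas algorithm; it is an illustration chosen here, not material from the course, and it assumes a matrix for which elimination without pivoting is safe (for example, a diagonally dominant one).

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n) with the Thomas algorithm.

    lower[i] is the entry left of diag[i] (lower[0] is unused) and
    upper[i] is the entry right of diag[i] (upper[-1] is unused).
    No pivoting is done, so this assumes a well-behaved matrix such as
    a diagonally dominant one.
    """
    n = len(diag)
    c = [0.0] * n          # modified super-diagonal
    d = [0.0] * n          # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):   # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# 1-D Poisson-like system: 2 on the diagonal, -1 beside it.
n = 5
x = solve_tridiagonal([-1.0] * n, [2.0] * n, [-1.0] * n, [1.0] * n)
```

Inverting the same matrix would produce a dense n-by-n result and cost O(n^3) work with a general-purpose routine, whereas this solve touches only the three stored diagonals.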
As for HPC, the literature tends to agree on the following fundamental
issues: the storage hierarchy; instruction-level parallelism;
high performance compilers; shared memory parallelism (e.g., OpenMP);
distributed parallelism (e.g., MPI); hybrid parallelism. The pedagogical
challenge is to find ways to express the basic concepts with minimal
jargon and maximal intuitiveness.
Although expressing these concepts in a manner appropriate for
the target audience can be challenging, it does not need to be
daunting. For example, the use of analogies and narratives
to explain these concepts can capture the fundamental underlying
principles without distracting the students with technical details.
Once the students understand the basic principles, the details
of how to implement HPC
solutions are much easier to digest.
However, it is probably not reasonable, in many cases, to expect
novice programmers to immediately understand how to apply
HPC concepts to their science. As a follow-up, regular interactions
with experienced HPC users can provide the needed insight and
practical advice to move research forward.
We will discuss an ongoing effort to develop materials that express
sophisticated scientific computing concepts in a manner accessible
to a broad audience. In addition, we will examine a programmatic
approach that incorporates both the use of these materials and
the crucial contribution of follow-up. |
|
|
| Applications
and Tools Track Abstracts:
|
| Applications Session I: Applications
Performance Analysis
|
| Title |
Scaling Behavior of
Linear Solvers on Large Linux Cluster |
| Author(s) |
John Fettig, Wai-Yip Kwok, Faisal Saied |
| Author Inst |
NCSA/University of Illinois, USA |
| Presenter |
John Fettig |
| Abstract |
Large sparse linear systems of equations arise
in many computational science and engineering disciplines. One
common scenario occurs when implicit methods are used for time-dependent
partial differential equations, and a nonlinear system of equations
has to be solved at each time step. With a Newton-type approach,
this in turn leads to a linear system at each Newton iteration.
In such situations, the linear solver can account for a large
part of the overall simulation time. Demonstrating the effectiveness
of linear solvers on large Linux clusters will help establish
the usefulness of these platforms for applications in such diverse
areas as CFD, astrophysics, biophysics, earthquake engineering,
and structural mechanics.
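The chain from implicit time stepping to a linear solve can be written out briefly. The derivation below assumes backward Euler as the implicit method and a semi-discretized system $u' = f(u)$; the abstract does not name a specific scheme, so both are illustrative choices.

```latex
% Backward Euler step for u' = f(u) with time step \Delta t:
u^{n+1} = u^{n} + \Delta t \, f\!\left(u^{n+1}\right)

% Nonlinear system to be solved for u = u^{n+1} at each time step:
F(u) \equiv u - u^{n} - \Delta t \, f(u) = 0

% Newton iteration k: one linear system with matrix I - \Delta t J_f
\left( I - \Delta t \, J_f(u_k) \right) \delta_k = -F(u_k),
\qquad u_{k+1} = u_k + \delta_k
```

Here $J_f$ is the Jacobian of $f$. When $f$ comes from a spatial discretization, $I - \Delta t\, J_f$ is large and sparse, which is why the linear solver can dominate the overall simulation time.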
In this paper, we investigate the effectiveness of Linux clusters
for state-of-the-art software for parallel linear solvers such
as PETSc, Hypre, and other packages drawn from the research community.
In particular, we study scaling to large processor counts, and
large system sizes (tens of millions of unknowns). The scalability
of the algorithms, of the implementation, and of the cluster architecture
will be investigated. We will focus on iterative Krylov subspace
solvers like Conjugate Gradient, GMRES(k) and BiCGSTAB, coupled
with parallel preconditioners, such as block Jacobi, domain decomposition,
and multigrid methods.
In our analysis of the parallel performance of these solvers
on Linux clusters, we will identify which components of clusters
(CPU, memory bandwidth, network latency/bandwidth) are the critical
factors in the performance of the different software components
of the parallel solvers, such as matrix vector products, inner
products, preconditioners, grid transfer operators, and coarse grid
solvers. We will present results for Linux clusters based on first
and second-generation 64-bit Itanium processors from Intel, connected
by a Myrinet switch, used primarily in message-passing mode.
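The solver components named above (matrix vector products, inner products, and preconditioner application) can be seen in a minimal serial sketch of a preconditioned Conjugate Gradient iteration. This is an illustration only, not code from PETSc or Hypre: the 1D Poisson test matrix, the point Jacobi (diagonal) preconditioner used in place of block Jacobi, and all function names are assumptions made for the example.

```python
def poisson_matvec(x):
    """Matrix vector product y = A x for the 1D Poisson matrix tridiag(-1, 2, -1)."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y


def dot(u, v):
    """Inner product; in a parallel solver this requires a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def pcg(matvec, diag, b, tol=1e-10, max_iter=1000):
    """Conjugate Gradient with a point Jacobi preconditioner M = diag(A).

    Returns the approximate solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)                                # residual for the zero initial guess
    z = [ri / di for ri, di in zip(r, diag)]   # preconditioner application
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(1, max_iter + 1):
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Each iteration performs one matrix vector product, two inner products, and one preconditioner application. In a distributed setting the first typically needs neighbor communication and the second a global reduction, which is why these kernels stress different parts of a cluster.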
The number of performance tools available on Linux clusters
is increasing. In our performance analysis of parallel linear
solvers, we will use these tools to give a more detailed performance
profile of the solvers. In particular, tools that access hardware
performance counters will be used to study the impact of the memory
hierarchy on performance, and MPI analysis tools will be used
to quantify the message-passing performance of the network for
sparse linear solvers. |
|
|
Title |
Benchmarking of a Commercial
Linux Cluster |
Author(s) |
Daniele Tessera |
Author Inst. |
Università di Pavia, Italy |
Presenter |
Daniele Tessera
|
Abstract |
Exploiting the cumulative computing capabilities
of workstations and PCs has been pursued, for a long time, as
an inexpensive solution for delivering high performance to many
scientific and industrial applications. Various solutions have
been proposed to allow parallel applications to be executed over
a set of independent systems. Cluster machines built from
commodity off-the-shelf components, initially developed as prototype
machines by research centers, are becoming very popular. The increasing
performance of commodity components and the availability of interconnection
networks able to link large numbers of such components have fueled
the development of these cluster machines. Moreover, the development
of specialized hardware and software components tuned to match
the demands of HPC applications has resulted in cluster machines
approaching the performance of supercomputers and ranking among the
top 500 most powerful supercomputers. Hardware vendors offering
scalable cluster machines also testify to the maturity of cluster
computing as a cost-effective solution for the high performance
needs of scientists and numerical engineers.
This paper is focused on a detailed study of the performance
achieved by an IBM NetFinity supercluster when performing various
scientific benchmarks. Note that, although we have analyzed the
performance of a specific machine, our results have a broader
applicability. Indeed, the IBM NetFinity adopts a typical cluster
architecture based on Intel Pentium processors interconnected
via both Myrinet and Ethernet networks, and the Linux operating
system. The cluster is composed of 260 nodes, each housing
two Intel Pentium processors. As a preliminary overview of the
cluster performance, we have investigated the run time behavior
of a few kernels performing widely used numerical algorithms,
such as matrix LU factorization, multigrid solvers, and distributed
rankings.
The aim of this study is to evaluate the performance benefits
experienced by numerical applications due to specialized high
bandwidth communication components, i.e., Myrinet switches, over
a typical Beowulf cluster based on Ethernet networking. For this
purpose, we have analyzed the speedup achieved by the various
kernels executed over either the Myrinet or the Fast Ethernet network.
Our results are based on the post-mortem analysis of the low-level
timing measurements collected by monitoring the executions of
the various kernels. Performance models of simple distributed
computations, aimed at assessing the cost of communication
relative to computation, have been derived.
The analysis of communication times has then focused on the
wall clock times spent by processors exchanging data. Note that these
times take into account both the wall clock time required to
deliver the data over the network links and the time spent
executing the adopted communication protocols. As a rule of thumb,
Myrinet allows greater scalability than Ethernet,
even when the amount of workload to be processed by each allocated
processor is quite small. The speedup of a few kernels executed
over Ethernet saturates when more than 32 processors are allocated.
On the other hand, a surprising result is that Ethernet performance
is superior when parallel kernels issue a large number of blocking
send/receive operations. For each communication protocol, we have also analyzed
the computation overheads, that is, the increases in computation
wall clock times due to the concurrent execution of system processes
for managing the communication activities.
Our performance characterization study has then focused on the
behavior of a complex climate model benchmark, that is, the PSTSWM
kernel from the Oak Ridge National Laboratory. Performance figures
related to various testbed simulations, varying the size of the
physical grid, the number of simulated steps, as well as the number
of allocated processors, have been analyzed. Statistical clustering
techniques have then been applied to these performance figures
to derive synthetic characterizations of the behaviors of the
allocated processors. The identified groups of homogeneous behaviors
have then been related to the characteristics of the corresponding
simulations, that is, to the processed workload.
As a further step towards the characterization of the performance
achieved by the IBM NetFinity cluster, we have studied the impact
of the processor allocation policy on the various kernel execution
times. Indeed, parallel applications can be thought of as composed
of tasks to be allocated to the various processors. Since each
IBM NetFinity node houses two processors, two allocation policies
are available: we can allocate either one or two application
tasks on each node. Note that when only one task is allocated
on each node, both processors might contribute to the execution
of that application task. Numerical kernels experience performance
degradations when two tasks per node are allocated on a large
number of nodes (i.e., more than 32 nodes). These performance
degradations are due to increases in both computation and communication
wall clock time. Indeed, an in-depth analysis of the timing measurements
shows that communications between tasks allocated
on the same node take longer than communications
between tasks allocated on distinct nodes. Moreover, since the
computation phases are somewhat synchronized across all application
tasks, contention due to simultaneous memory accesses and the concurrent
execution of any spawned communication processes also results
in increases in the computation wall clock times. Experimental
evidence presented in the paper will substantiate that, for
some types of applications and a fixed number of nodes, allocating
both processors of each node results in longer execution times than
allocating only one processor per node. |
|
|
| Applications Session II:
Applications Performance on Clusters
|
Title |
A Comparative Study
of the Performance of a CFD program across different Linux Cluster
Architectures |
Author(s) |
Th. Hauser, R.P. LeBeau, T. Mattox, G.P. Huang,
H. Dietz |
Author Inst |
Utah State University |
Presenter |
Thomas Hauser
|
| Abstract |
[Paper #26] |
|
|
| Title |
Scalability of a
Tera-scale Linux-based Clusters from Parallel Ab initio Molecular
Dynamics |
| Author(s) |
Hongsuk Yi, Jungwoo Hong, Hyoungwoo Park, and
Sangsan Lee |
| Author Inst |
KISTI, Korea |
| Presenter |
Jungwoo Hong |
| Abstract |
At the beginning of 2000, KISTI initiated
a TeraCluster project to design a Linux-based cluster with tera-scale
performance. The main goal of this project is to provide resources
composed of PC clusters that meet the level of compute performance
required by the grand challenge applications in Korea. The UP2000
with Myrinet and the DS10 with Fast Ethernet, the test platforms
of the TeraCluster project, have benefited from many years of research
dedicated to improving their scalability and performance. For
actual experiments, we have investigated the performance of scalable
ab initio molecular dynamics and quantum chromodynamics (QCD)
applications on the test platforms and compared it with the Cray T3E
system. The parallelization is based on the Message Passing Interface
standard. Results show that the QCD application, which requires
only nearest neighbor communications between computing nodes, parallelizes
very well. For the ab initio molecular
dynamics, by contrast, the operations associated with the three-dimensional
fast Fourier transform dominate, representing the most time-consuming
part of the calculation of C36 fullerene. Combining
these results with the performance evaluations of additional
grand challenge applications involving computational fluid
dynamics, structural analysis, and computational chemistry, we
have designed a prototype of the TeraCluster with 516 computing
nodes and built the first phase of the TeraCluster in 2002.
TABLE I: The hardware and software components of the three different
platforms and the basic results of the NPB and PMB benchmark suites
| System | DS10 | UP2000 | Cray | T3E |
| CPU architecture | | | | |
| Network | | | | |
| CPU clock (MHz) | | | | |
| Number of nodes | | | | |
| Rpeak (Gflop/s) | | | | |
| Rmax (Gflop/s) | | | | |
| Latency (microsec) | | | | |
| Bandwidth (MB/s) | | | | |
We have built two different Linux-based PC clusters of 64 computing
nodes with one server node as pilot systems [1]. The DS10 system
with 466 MHz EV6 and the UP2000 with 667 MHz EV6 Alpha processors
are connected with Fast Ethernet and Myrinet, respectively. Each
computing node has 250 MB memory and a 20 GB local IDE disk,
while the server node has 128 MB memory and a 30 GB disk. The systems
run Red Hat Linux 6.0 with kernel version 2.2.0.
MPICH is used as the MPI communication library, and the Portable Batch
System (PBS) is selected as the queuing system. The pilot project
has been a success story with the collaboration of a few industrial
partners in Korea. Based on the performance evaluations of the major
KISTI [2] applications, we built the first phase of the TeraCluster,
denoted PhaseI. The details are summarized in Table I.
FIG. 1: (a) The sustained performance and (b) the corresponding
speed-up of the full QCD with the $L = 12 \times 12^3$ lattice size
for three different machines.
In Fig. 1(a), we illustrate the execution times of the quenched
QCD program on the pilot systems and the first phase of the TeraCluster
for large lattices. The execution time on a single
processor shows that the UP2000 and PhaseI systems are 2 to 3 times
faster than the Cray T3E system. Due to the difference in CPU
architecture, the PhaseI and UP2000 surpass the DS10 system in
speed even for 64 working nodes. These results reflect the high
computation-to-communication ratio of the QCD application, which has
only nearest neighbor communications.
FIG. 2: (a) Speedup of the ab initio molecular dynamics for the
UP2000, DS10, and Cray T3E. The dashed line indicates the ideal speed-up.
(b) Speedup of the ab initio application using VASP on the PhaseI.
To elucidate the scalability of the pilot system we present the
speed-up $S = T_1 / T_P$, where $T_1$ and $T_P$ are the execution times
on a single processor and on $P$ processors, respectively. Fig.
1(b) shows the obtained S on the three different PC clusters.
The UP2000 shows nearly linear scalability, while for the PhaseI
system linear scalability is not found at all. Moreover,
the scalability of the DS10 is rapidly suppressed by the limited
bandwidth and the latency of Fast Ethernet. This result
implies that network bandwidth is crucial for a large-scale
computation system to serve as a scalable scientific platform.
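The speed-up defined above, together with the derived parallel efficiency E = S / P, can be tabulated from measured run times with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and any timings passed in are the reader's own measurements, not results from this paper.

```python
def speedup(t1, tp):
    """Speed-up S = T_1 / T_P from single-processor and P-processor run times."""
    if t1 <= 0 or tp <= 0:
        raise ValueError("execution times must be positive")
    return t1 / tp


def efficiency(t1, tp, p):
    """Parallel efficiency E = S / P; 1.0 corresponds to ideal linear scaling."""
    return speedup(t1, tp) / p


def scaling_table(timings):
    """Turn {P: T_P} (which must include P = 1) into a sorted list of (P, S, E)."""
    t1 = timings[1]
    return [(p, speedup(t1, tp), efficiency(t1, tp, p))
            for p, tp in sorted(timings.items())]
```

A system with linear scalability keeps E close to 1 as P grows, while a network-bound system shows E falling steadily with P.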
We also calculate a new solid phase of carbon made from C36
fullerene. The supercell approximation is used to simulate the
periodic boundary condition,
and the cell size is $11 \times 11 \times 11$ Å$^3$ for C36.
The Kohn-Sham single-electron wavefunction is expanded
in plane waves up to a cut-off energy of 36 Ry. Since C36
is a molecule, we use a single k-point in the calculation.
In Fig. 2(a), we illustrate the speedup of the C36
fullerene calculation on the pilot system and compare it with the Cray T3E.
In contrast to the above results, each system shows imperfect
scalability. In particular, the Cray T3E system retains linear
scalability up to 16 nodes, followed by a slow monotonic increase of
S, which reaches nearly 16 with 64 computing nodes,
while the DS10 exhibits even poorer scalability. These results
show the importance of the capacity of the network interface
for the collective communication between the executing nodes.
Indeed, the overhead of data communication increases rapidly as
the number of working nodes increases, reflecting the significant data exchange
during the 3D FFT. In Fig. 2(b), we also present the performance
evaluation of ab initio molecular dynamics using the VASP package
on the PhaseI. However, no scalability is found, even
for 32 working nodes. In the remainder of this paper, we will
discuss the scalability of the ab initio grand challenge
problems in more detail.
[1] TeraCluster Project (http://cluster.or.kr).
[2] KISTI (http://www.hpcnet.ne.kr). |
|
|
Title |
openMosix vs Beowulf:
a case study |
Author(s) |
Moshe Bar, Stefano Cozzini, Maurizio Davini and
Alberto Marmodoro |
Author Inst. |
INFN and the University of Pisa, Italy |
Presenter |
Stefano Cozzini |
Abstract |
Today two main clustering paradigms exist for
the Linux environment: openMosix and Beowulf (MPI-based). Whereas
in MPI-based solutions the parallelization needs to be coded
explicitly through special library calls, in openMosix the
usual Unix-style programming concepts apply. Also, for clustering
applications it is becoming increasingly
evident that the latency of the network interconnect
quickly gains a very significant weight in the overall system
performance and throughput. In this paper we describe our
evaluation of these two clustering paradigms applied
to a suite of production codes currently used in a specific scientific
environment recently installed at Trieste (Italy): the INFM center
of excellence DEMOCRITOS. The codes represent the standard computational
tools used in research on atomistic simulations.
The suite can be roughly divided into three main categories of codes:
- First-principles or ab initio packages, where the quantum description
of the electronic properties of matter is taken into account;
the size of the complex systems simulated is limited to, at
most, hundreds of atoms: examples include metal surfaces or
active sites of biological systems.
- Classical Molecular Dynamics (MD) packages, needed to
simulate biological systems composed of several thousand
atoms, such as proteins.
- Quantum Monte Carlo (QMC) programs, which are used to study
strongly correlated systems.
The above classification also applies to the computational characteristics
of the codes:
- Ab initio codes are computationally demanding in terms of memory
and CPU, and specific highly scalable parallel versions of electronic
structure codes have been developed in the past years using
the Message Passing (MP) paradigm.
- Classical Molecular Dynamics codes have different characteristics:
lower memory requirements make it easy to implement parallel
codes using replicated data algorithms. This, however, limits scalability.
- The codes implementing Quantum Monte Carlo (QMC) techniques
use the so-called embarrassingly parallel strategy: each
processor works on its own task. In this class we have
both serial and parallel codes.
The heterogeneity of the codes and the different computational
requirements suggest that different cluster solutions should be
considered for different classes of codes.
We therefore decided to provide Democritos with two small clusters,
the first implementing the Beowulf paradigm and the second the openMosix
approach. Both clusters are IBM 1300 cluster solutions equipped
with PIII 1.4 GHz processors and have the same number of CPUs (16 dual-processor
nodes). One cluster is equipped with the Myrinet high speed
network in order to allow us to test the role of a high speed
network in both paradigms.
Our benchmarking work aims to determine which solution meets some precise
requirements regarding scalability, performance and throughput.
Scalability and performance values are measured in a standard
way and can therefore be compared against other (proprietary) parallel
platforms. We judge them satisfactory simply by computing a
"performance/price" ratio. These requirements are easily met
by both paradigms for all classes of codes. The aspect that
makes the difference is throughput: this is fundamental in
order to guarantee efficient resource sharing among users and
a high percentage of utilization of the computational resources.
Finally, a key role is played by the "time to production" variable.
It is in fact fundamental for our research to reduce to a
minimum the time needed to implement a scalable and efficient
parallel application starting from a serial one.
Our preliminary results indicate that for highly parallel codes
like the ab initio ones the standard Beowulf paradigm should be the
preferred solution in terms of efficiency, scalability and throughput.
We note however that the superiority of this solution is mainly
due to the fact that highly parallel and scalable codes are already
available, and therefore the time to production for this
class of codes is practically zero.
On the other hand, for QMC codes the openMosix approach is certainly
to be preferred. Efficiency and scalability are, as already said,
the same as in the Beowulf case; the two main advantages are related
to the other two aspects. Throughput is easily obtained
through the load balancing technology within openMosix, and the
"time to production" for serial programs is practically negligible:
just run many instances of the program and it is all done.
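The "run many instances" strategy can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: a Monte Carlo estimate of pi stands in for a real QMC code, and the function names are hypothetical. The point is only that each instance needs nothing from the others except a distinct random seed, so the instances can be spread over processors by whatever mechanism is at hand.

```python
import random


def run_instance(seed, n_samples):
    """One independent serial run (stand-in workload: Monte Carlo estimate of pi)."""
    rng = random.Random(seed)          # private generator, so runs do not interfere
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / n_samples


def run_many(seeds, n_samples, map_fn=map):
    """Run one instance per seed; map_fn decides where the instances execute."""
    seeds = list(seeds)
    return list(map_fn(run_instance, seeds, [n_samples] * len(seeds)))


def combine(estimates):
    """Average the independent estimates into a single result."""
    return sum(estimates) / len(estimates)
```

With the default `map_fn` the instances run one after another. Passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` runs them as separate operating system processes, which is the kind of workload a process-migrating kernel can balance across nodes without changes to the program.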
It is interesting to note that the second class of codes (classical
MD), despite the fact that they are already parallel, generally runs
better on the openMosix cluster. These codes have in fact a low
level of scalability (1-8 CPUs), making the sharing of resources
between this class of codes and the ab initio ones (which require
many processors) difficult to schedule efficiently with
standard queue systems on a small cluster like ours. In contrast,
the openMosix solution copes best with the requirements of
QMC and moderately scalable codes like these.
In the final paper we will report the results obtained
with the computational suite in various lab environments, considering
the following aspects: topology of nodes, type of node, and
network interconnect. Using special hardware for the various
interconnects (Myrinet, Dolphin, Fast Ethernet, Gigabit), we will
show the resulting (reproducible) scalability, performance and
throughput for our computational problem. |
|
|
|
| Applications Session III:
Experiences with Portability and I/O
|
| Title |
Adventures with Portability
|
| Author |
Elizabeth Post |
| Author Inst |
Lincoln University, New Zealand |
| Presenter |
Elizabeth Post
|
| Abstract |
As computers become more powerful it becomes
possible to develop computer models that more accurately approximate
the real-world systems they are simulating. However these
models are increasingly complex, often taking years to develop,
so it is important to re-use existing work as far as possible
rather than continually “re-inventing the wheel” and
duplicating effort which could be better spent elsewhere.
With this aim of re-use scientists at Dexcel (a New Zealand
dairy research institute) developed a dairy farm simulation framework
with some capability to incorporate existing computer models of
distinct entities in a pastoral dairy farm, e.g. pasture and crop
growth models, animal metabolism models. This framework also allows
run-time selection from several alternative models for each component
(where they exist). With some being simpler and others more
complex, this permits the same basic farm configuration to be
modeled at different levels of detail, as appropriate for different
research exercises.
A major practical difficulty in implementing a system with this
flexibility is that the various component sub-models have typically
been written in a range of different computer languages. The dairy
farm simulation framework itself is written in Smalltalk, and
so far makes use of sub-models for the climate, pasture and animal
components written in various languages such as Smalltalk, C,
Fortran and Pascal. Initially the framework was developed
as a research tool for agricultural scientists, and run on a Windows
operating system using the Microsoft COM/DCOM protocol for interprocess
communication. However there is now a requirement to investigate
automated optimization of farm specification and operating parameters,
and also visualisation of the associated high-dimensional objective
functions. This work requires very large numbers of simulations
to be performed so we are attempting to port the framework to
run in parallel on a Linux cluster.
While porting the existing framework and component models to
run on the Linux cluster we also need to maintain the same dairy
farm simulation model running on a Windows operating system. Ideally,
as far as possible, we would like to maintain one single version
that runs on both operating systems and also on a uniprocessor
or in a parallel processing environment to avoid the difficulties
of keeping the development of multiple versions for different
environments synchronised. In practice this may not be completely
achievable, but as far as possible we want to minimize the differences
between these implementations. Thus we have had to use techniques
that are portable between these environments, to ensure that all
environments can use the same version. An added complexity is
that researchers using the Windows version use a Windows interface
but the parallel version must run without interacting with a Windows
interface, so the code for the model itself must be independent
of the code for the interface. Also, as we are using component
models developed by other researchers we cannot change the component
models themselves, and have to ensure that the interfaces between
the framework and the component models are such that new versions
of the component models may be substituted as they become available.
It may also happen that alternative models for a particular component,
such as the pasture or animal component, may require quite different
inputs and produce outputs in different formats, or even completely
different types of output. It is therefore something of a challenge
to devise generic portable interfaces between the framework and
the component models that will provide the necessary input parameters
for any component model, and also be able to make use of the output
from these component models, if necessary converting the output
to a different form before it can be used in the dairy farm simulation.
This paper identifies and discusses the problems to be solved
in this porting process, and describes the techniques that have
been used successfully to date. These include porting the
framework and some of its component models from a Windows uniprocessor
environment to a Linux parallel processing environment, the initial
use of files to transfer data between processes, and then replacing
the file transfers by socket communication. This has led
to some interesting exercises in portability. As we use MPI to
manage the parallel processing, and have not yet persuaded Smalltalk
to communicate by using MPI, we have a C program, using MPI, which
manages the parallelism. This C program then starts the Smalltalk
framework, which in turn communicates with its component models,
written in Smalltalk, C, Fortran or Pascal, and returns a final
result for each simulation to the controlling C program. The problem
has been in getting different processes, written in different
languages, and perhaps started and running independently of one
another, to communicate and pass data between themselves. We hope
to include some preliminary results of extending MPI outside its
normal C/C++/Fortran domain and making it the basic inter-process
communication mechanism in the framework, one that is also portable
across operating systems. |
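The move from file transfers to socket communication described above can be sketched in Python. This is an illustration under stated assumptions, not the framework's protocol: the abstract does not give a wire format, so the length-prefixed JSON framing, the function names, and the placeholder "pasture" formula are all hypothetical. A text-based message with an explicit length is one way to let processes written in different languages exchange data without sharing a binary layout.

```python
import json
import socket
import struct
import threading


def send_msg(sock, obj):
    """Send one message: 4-byte big-endian length, then a UTF-8 JSON payload."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, since recv may return fewer than requested."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def component_model(sock):
    """Hypothetical sub-model: answers one request, then closes its end."""
    request = recv_msg(sock)
    growth = 0.5 * request["rainfall_mm"]   # placeholder formula, illustration only
    send_msg(sock, {"pasture_growth": growth})
    sock.close()


def simulate(rainfall_mm):
    """Framework side: send inputs to the sub-model and wait for its output."""
    framework_end, model_end = socket.socketpair()
    worker = threading.Thread(target=component_model, args=(model_end,))
    worker.start()
    send_msg(framework_end, {"rainfall_mm": rainfall_mm})
    reply = recv_msg(framework_end)
    worker.join()
    framework_end.close()
    return reply
```

Here the sub-model runs in a thread of the same process purely to keep the example self-contained; with a TCP socket in place of `socketpair` the two ends could be separate programs.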
|
|
| Title |
Experiences with the Parallel
Virtual File System (PVFS) in Linux Clusters |
| Author |
Kent Milfeld, Avijit Purkayastha, Chona Guiang |
| Author Inst |
TACC/University of Texas - Austin, USA |
| Presenter |
Kent Milfeld
|
| Abstract |
One of the key elements to the Linux Cluster
HPC Revolution is the high speed networking technology used to
interconnect nodes within a cluster. It is the backbone
for “sharing” data in a distributed memory environment.
This backbone, when combined with local high-speed disks, also
provides the framework for developing parallel I/O paradigms and
implementing parallel I/O systems—an HPC environment for
“sharing” disks in a parallel file system.
While much of the emphasis in cluster computing has focused on
optimizing processor performance (e.g., compilers, hardware monitors,
numerical libraries, optimization tools) and increasing message
passing efficiency (e.g., MPICH-GM, MPI-Pro, LAM, M-VIA), much
less effort has been directed towards harnessing the parallel
capacity of multiple processors and disks to obtain “HPC
I/O” in clusters. This is partly due to the fact that
the MPI-I/O standard was not formulated until version 2 of MPI.
It is also due to the cost of dedicating processors/nodes to I/O
services. (For workloads that include only a few I/O intensive
applications, dedicated nodes for parallel I/O may be an underused
resource, to say the least.) In addition, dedicated I/O
nodes introduce heterogeneity into the system configuration, and
therefore require additional resources, planning, management,
and user education.
The Parallel Virtual File System (PVFS) [1] is a shared file
system designed to take advantage of parallel I/O service on multiple
nodes within a Linux cluster. In PVFS, parallel I/O requests
are initiated on clients (compute nodes), and the parallel I/O
service occurs on I/O servers (I/O nodes). The PVFS API
includes a partitioning mechanism to handle common, simple-strided
access [2] with a single call. The file system is virtual
in the sense that it can run without kernel support (that is,
in user space) in its native form (pvfs), although a UNIX I/O
interface is supported through a loadable kernel module.
Also, PVFS is built on top of the local file system on each I/O
node. Hence, there is a rich parameter space for obtaining optimal
performance through a local file system. The authors of
PVFS (Cettei, Ross & Ligon) [3] have championed this file
system, and have performed many different benchmarks to illustrate
its capabilities [4].
The main attributes of PVFS are:
· a single file name space
· multiple I/O servers (files are striped across I/O nodes)
· single metadata manager (mgr): metadata is separated from data
· independent I/O server, client and manager location
· data transfer through TCP/IP
· multiple user interfaces (MPI-I/O, UNIX/POSIX I/O, native pvfs);
PVFS has been integrated into the ADIO layer of ROMIO
· free and downloadable
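Striping a file across I/O nodes, as listed above, amounts to a mapping from a logical file offset to an I/O server and an offset within that server's local file. The sketch below shows generic round-robin striping and is not taken from the PVFS source: the function names are hypothetical, and the stripe size and server count are parameters the caller supplies.

```python
def locate(offset, stripe_size, n_servers, first_server=0):
    """Map a logical file offset to (server index, offset in that server's local file)."""
    if offset < 0 or stripe_size <= 0 or n_servers <= 0:
        raise ValueError("offset must be >= 0; stripe_size and n_servers must be > 0")
    stripe_index, within = divmod(offset, stripe_size)
    server = (first_server + stripe_index) % n_servers
    # Each server stores every n_servers-th stripe back to back in its local file.
    local_offset = (stripe_index // n_servers) * stripe_size + within
    return server, local_offset


def split_request(offset, length, stripe_size, n_servers):
    """Break one contiguous request into (server, local_offset, nbytes) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        server, local = locate(offset, stripe_size, n_servers)
        nbytes = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((server, local, nbytes))
        offset += nbytes
    return pieces
```

A request that spans several stripes is served by several I/O nodes at once, which is the source of the aggregate bandwidth, and also why stripe size and the number of I/O nodes are natural tuning parameters.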
The purpose of this paper is to investigate the PVFS file system
on a wide range of parameters, including different local file
systems and varied multi-user conditions.
In many small “workcluster” systems the nodes are
frequently 2-way SMPs (for price-performance reasons). In
these systems, where jobs are often run in dedicated mode (all
processors devoted to a single user), there are two basic configurations
that we evaluate for I/O-intensive applications: a compute
processor and an I/O processor can coexist on each node, or the
nodes can be divided into separate I/O nodes and compute nodes.
More specifically, the cases are:
- Some nodes are configured as I/O nodes and are committed
to performing only I/O (and are excluded from performing computations
in the compute node pool).
- No nodes are configured exclusively for I/O; the I/O services
are then resident on the compute nodes, and the parallel I/O
service is performed “on node” during the I/O phase
of an execution (and a larger pool of processors can be assigned
to the computational tasks).
Another concern is the underlying file system. Since PVFS is
built on top of the local file system on each I/O node, alternative
file systems, such as ext3fs and tmpfs, are evaluated for conditions
that provide enhanced performance.
Also, on a loaded production cluster, the state of the system
is often far from the conditions under which benchmarks are performed.
In our evaluation of PVFS we simulate different scenarios of simultaneous
I/O access by multiple users to test I/O nodes under the stress
that can occur in large systems. While the cost of “multiplexing”
multiple user requests is not important in a dedicated environment,
it is an important measurement for large production systems, and
should not be outweighed by single user characteristics.
[1] http://parlweb.parl.clemson.edu/pvfs/
[2] Nils Nieuwejaar, David Kotz, Apratim Purakayastha, Carla Schlatter
Ellis, and Michael Best, File Access Characteristics of Parallel
Scientific Workloads, IEEE Transactions on Parallel and Distributed
Systems, 7(10):1075-1089, October 1996.
[3] Matthew M. Cettei, Walter B. Ligon III, and Robert B. Ross,
Clemson University.
[4] http://parlweb.parl.clemson.edu/pvfs/papers.html, http://www-unix.mcs.anl.gov/~rross/resume.htm
|
|
|
| Systems
Administration Track Abstracts:
|
| Systems Session I:
System Performance Issues
|
| Title |
Checkpointing and
Migration of Parallel Processes Based on Message Passing Interface
|
| Author |
Zhang Youhui, Wang Dongsheng, Zheng Weimin |
| Author Inst |
Department of Computer Science, Tsinghua University,
China |
| Presenter |
Zhang Youhui
|
| Abstract |
Linux Clusters offer a cost-effective platform
for high-performance, long-running parallel computations and have
been used widely in recent years. However, the main problem with
programming on clusters is that the environment is prone to change.
Idle computers may be available for computation at one moment,
and gone the next due to load, failure or ownership [1]. The probability
that a cluster will fail increases with the number of nodes. During
normal computing, a single abnormal event is likely to cause the entire
application to fail. To avoid this waste of time, it is necessary
to achieve high availability for clusters. Checkpointing &
Rollback Recovery (CRR) and Process Migration offer a low-overhead
and complete solution to this problem [1][2].
CRR, which records the process state to stable storage at regular
intervals and restarts the process from the last recorded checkpoint
upon system failure, is a method that avoids the waste of computations
accomplished prior to the occurrence of the failure. Checkpointing
also facilitates process migration, which suspends the execution
of a process on one node and subsequently resumes its execution
on another. However, checkpointing a parallel application is more
complex than just having each processor take checkpoints independently.
To avoid the domino effect and other problems upon recovery, the state
of inter-process communication must be recorded, and the global
checkpoint must be consistent. Related CRR software
includes MPVM [3], Condor [4], and Hector [5].
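The basic CRR cycle for a single process (record state at regular intervals, restart from the last recorded checkpoint after a failure) can be sketched in Python. This is an application-level illustration only, with hypothetical names and a toy computation; it does not show the transparent, coordinated checkpointing of a parallel MPI job that the paper addresses, and it ignores inter-process communication state entirely.

```python
import os
import pickle


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # push the data to stable storage
    os.replace(tmp, path)          # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the last recorded state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` simulates a node failure at the given step (testing aid only).
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

After a failure, calling `run` again with the same checkpoint path resumes from the last saved step, so only the work done since that checkpoint is repeated.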
This paper presents a Checkpoint-based Rollback Recovery and
Migration System for Message Passing Interface, ChaRM4MPI, for
Linux clusters. The system designs and implements several important
fault-tolerance mechanisms, including a coordinated checkpointing
protocol, synchronized rollback recovery, and synchronized process
migration.
Using ChaRM4MPI, transient node faults can be recovered automatically,
and permanent faults can also be recovered through checkpoint
mirroring and process migration techniques. If any node that is
running the computation drops out of the cluster, the computation
will not be interrupted. Moreover, users can migrate MPI processes
from one node to another manually for load balancing or system maintenance.
ChaRM4MPI is a user-transparent implementation, which means
users do not need to modify their source code.
ChaRM4MPI (Figure 1) is based on MPICHP4, an MPI implementation
developed at Mississippi State University and Argonne National
Laboratory that employs the P4 library as the device layer. Our
work focuses on the following:
- A coordinated CRR and migration protocol for parallel applications
is designed.
- One management process is implemented as the coordinator of
parallel CRR and process migration. It handles users’
checkpoint/migration commands received through a GUI.
- We modify the source code of the P4 listener. On receiving commands
from the coordinator, the P4 listener interrupts computing processes
with signals. The most difficult task is designing this procedure
carefully to avoid any deadlock.
- Signal handlers are integrated with the P4 master to deal
with CRR/migration signals; they employ libckpt[6] to take
checkpoints. Moreover, a message-driven mechanism is implemented
to record the state of inter-process communication. Potential
deadlock must be avoided here, too.
- We modify the start-up procedure of MPICHP4 so that all MPI processes
register themselves with the coordinator. In this way, the
coordinator can maintain a global process-info table to manage the
whole system.
In ChaRM4MPI, computing processes on one node save their checkpoint
information and in-transit messages to local disk. Checkpointing
to local disk can recover from any number of transient node faults.
Moreover, ChaRM4MPI uses a RAID-like checkpoint mirroring technique to
tolerate one or more permanent node faults. Each node uses a background
process to mirror the checkpoint file and related information
to other nodes in addition to its local disk. When a node fails, the
recovery information of the application processes running on that node
is available on other nodes. Therefore, the application processes
that ran on the faulty node can be resumed on a corresponding
node where the checkpoint information is saved, and continue running
from the checkpoint.
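The mirroring idea can be sketched as follows. The abstract does not say how mirror nodes are chosen, so the ring placement below is an assumption made only for illustration, and mirror_targets, recovery_node, and the node names are hypothetical.

```python
def mirror_targets(node, nodes, copies=1):
    """Return the nodes that hold mirrors of `node`'s checkpoint (ring order)."""
    if copies >= len(nodes):
        raise ValueError("need more nodes than mirror copies")
    i = nodes.index(node)
    return [nodes[(i + k) % len(nodes)] for k in range(1, copies + 1)]


def recovery_node(failed, nodes, down, copies=1):
    """Pick the first surviving node that holds a mirror of `failed`'s checkpoint."""
    for candidate in mirror_targets(failed, nodes, copies):
        if candidate not in down:
            return candidate
    return None  # every mirror was lost; this fault pattern is not tolerated


NODES = ["n0", "n1", "n2", "n3"]
```

With one mirror copy, the permanent loss of n1 is recovered on n2; losing n1 and n2 together is survivable only if at least two copies are kept, which is the trade-off between mirroring traffic and the number of simultaneous permanent faults tolerated.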
Several MPI parallel applications are tested on ChaRM4MPI, including
sorting programs and LU decomposition developed by
NASA, an overfall simulation, and so on. We study the extra time
overhead introduced by the CRR mechanism, and the results are fairly
satisfactory. The average extra overhead per checkpoint is less
than 2% of the original running time.
In the full text, the coordinated CRR and migration protocol
is described first, and then we describe in detail the work involved
in modifying MPICHP4. Finally, testing results are
presented and analyzed.
References:
1. Elnozahy E N, Johnson D B, Wang Y M. A Survey of Rollback Recovery
Protocols in Message-Passing Systems. Technical Report CMU-CS-96-181,
Carnegie Mellon University, Pittsburgh, PA, Oct. 1996.
2. Elnozahy E N. Fault tolerance for clusters of workstations.
Banatre M and Lee P (Editors), chapter 8, Springer Verlag, Aug.
1994.
3. J. Casas, Dan Clark, et al. MPVM: A Migration Transparent
Version of PVM. Technical Report, Dept. of Comp. Sci. & Eng.,
Oregon State Institute of Sci. & Tech.
4. M. J. Litzkow, et al. Condor - A Hunter of Idle Workstations.
In Proc. of the 8th IEEE Intl. Conf. on Distributed Computing
Systems, pp. 104-111, June 1988.
5. S. H. Russ, B. Meyers, C.-H. Tan, and B. Heckel. "User-Transparent
Run-Time Performance Optimization". Proceedings of the 2nd International
Workshop on Embedded High Performance Computing, associated with
the 11th International Parallel Processing Symposium (IPPS 97),
Geneva, April 1997.
6. James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt:
Transparent checkpointing under Unix. In Proceedings of the 1995
USENIX Technical Conference, pages 213-224, January 1995.
|
|
|
| Title |
An Analytical Model
to Evaluate the Performance of Cluster
Architectures |
| Author |
Leonardo Brenner, Paulo Fernandes, César
A. F. De Rose |
| Author Inst |
Faculdade de Informatica, Catholic University
of Rio Grande do Sul, Brazil |
| Presenter |
César
A. F. De Rose |
| Abstract |
[Need PDF converted to HTML] |
|
|
| Systems Session II: Cluster
Management and Monitoring
|
| Title |
High-Performance Linux Cluster
Monitoring Using Java |
| Author |
Curtis Smith, David Henry |
| Author Inst |
Linux NetworX |
| Presenter |
David Henry
|
| Abstract |
Monitoring is at the heart of cluster management.
Instrumentation data is used to schedule tasks, load-balance devices
and services, notify administrators of hardware and software failures,
and generally monitor the health and usage of a system. The information
used to perform these operations must be gathered from the cluster
without impacting performance. This paper discusses some of the
performance barriers to efficient high-performance monitoring,
and presents an optimized technique to gather monitored data using
the /proc file system and Java. |
|
|
| Title |
CRONO: A Configurable Management
System for Linux Clusters |
| Author |
Marco Aurélio Stelmar Netto, César
A. F. De Rose |
| Author Inst |
Research Center in High Performance Computing,
Catholic University of Rio Grande do Sul, Brazil |
| Presenter |
César
A. F. De Rose |
| Abstract |
1. Introduction
Cluster architectures are becoming a very attractive alternative
when high performance is needed. With a very good cost/performance
ratio and good scalability, large cluster systems with hundreds
of nodes are becoming more popular in universities, research labs,
and industry. Linux is usually the operating system used in
these systems because it’s free, efficient, and very stable.
One big problem in such a system is managing all the nodes efficiently
as one machine and dealing with issues like access rights, time and
space sharing, reservation, and job queuing. When looking for
such a manager for our research lab, we were not satisfied with
the services and level of configuration provided by the most popular
systems, such as Computing Center Software [1] and Portable Batch
System [2], and decided to write our own management system.
2. Cluster management over Linux
The Crono system was developed to be a highly configurable manager
for Linux clusters. It is written in C and
relies on several scripts and configuration files to adapt
to specific administration needs and machine configurations. The
Linux operating system has several advantages for the development
of a cluster management system (CMS). Linux is an open source
system and its configuration files are very simple to read and
modify, so the CMS can define environment variables and
access restrictions easily. Another main advantage is the possibility
of modifying the kernel to better support the CMS.
3. Crono main functionalities
Crono provides two allocation modes, space-sharing and time-sharing.
The first is used when the user needs exclusive access to the
allocated nodes, for example when application performance is being
measured. The second is used in situations where users
are only testing their programs and therefore do not care
about performance. Time-sharing is a very interesting alternative
in teaching environments, allowing large groups of students to
use the cluster at the same time.
Another main feature of Crono is its flexibility in the definition
of access rights. Using configuration files, the system administrator
can create user categories and associate access restrictions with
these categories or with individual users. These restrictions are
defined by the maximum time and maximum number of nodes used in
allocations and reservations. It is also possible to
define restrictions based on period of the day, day of the week,
and target machine.
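A rule check of this kind might look like the sketch below. It does not reproduce Crono's configuration file format, which the abstract does not show; the Rule fields, the RULES table, the time limits, and is_allowed are all hypothetical, with values loosely modeled on the student example in Section 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical access restriction for a user category."""
    cluster: str          # target machine the rule applies to
    max_nodes: int        # largest allocation allowed
    max_minutes: int      # longest allocation allowed
    hours: frozenset      # hours of the day (0-23) when the rule applies
    weekdays: frozenset   # 0 = Monday ... 6 = Sunday


ALL_DAYS = frozenset(range(7))
PEAK_HOURS = frozenset(range(8, 20))
NIGHT_HOURS = frozenset(range(24)) - PEAK_HOURS

# Illustrative values only: a small cluster during peak hours,
# up to 16 nodes of the main cluster at night.
RULES = {
    "student": [
        Rule("small", 4, 60, PEAK_HOURS, ALL_DAYS),
        Rule("main", 16, 240, NIGHT_HOURS, ALL_DAYS),
    ],
}


def is_allowed(category, cluster, nodes, minutes, hour, weekday, rules=RULES):
    """Return True if any rule for the category permits the request."""
    for rule in rules.get(category, []):
        if (rule.cluster == cluster
                and nodes <= rule.max_nodes
                and minutes <= rule.max_minutes
                and hour in rule.hours
                and weekday in rule.weekdays):
            return True
    return False
```

Requests are denied by default: a category with no matching rule, such as an unknown user class, gets no access.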
To configure the execution environment for programs, Crono supplies
scripts for pre- and post-processing of requisitions. When
a user’s allocated time begins, Crono runs two scripts: one
controlled by the administrator, and the other by the
user. This mechanism is very useful, for example, to automatically
generate MPI [3] machine files. When the user’s time is over,
two post-processing scripts are run in the same way.
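As an illustration of what such a pre-processing step could do, the sketch below builds a machine file with one allocated host per line, optionally followed by a process count, in the style commonly used by MPICH. Crono's actual scripts are not shown in the abstract; the function names and host names here are hypothetical.

```python
def machine_file_text(nodes, slots_per_node=1):
    """Build machine-file contents: one allocated host per line."""
    if not nodes:
        raise ValueError("no nodes allocated")
    lines = []
    for host in nodes:
        # Append ":N" only when more than one process per host is wanted.
        lines.append(host if slots_per_node == 1 else f"{host}:{slots_per_node}")
    return "\n".join(lines) + "\n"


def write_machine_file(path, nodes, slots_per_node=1):
    """Write the machine file for an allocation to `path`."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(machine_file_text(nodes, slots_per_node))
```

A pre-processing script could call write_machine_file with the hosts granted to the user, and the matching post-processing script could delete the file when the allocation ends.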
Users can interact with the system through a graphical interface
or using shell commands (bash, tcsh, etc.). Several
services are supported for users and the system administrator,
such as showing information about the allocation queue, submitting
jobs for execution, releasing resources, configuring the
execution environment, and getting information on access rights.
Figure 1 presents the system architecture with its four main modules:
the User Interface (UI), the Access Manager (AM), the Requisition
Manager (RM) and the Node Manager (NM). The modules communicate
through the sockets interface, which allows
modules to run on different machines. The full paper will present
the Crono architecture in more detail.
Figure 1: Crono’s architecture and main modules
4. Using Crono in production mode
For the last 6 months Crono has been used to manage the three
Linux clusters in our research lab[*]. Access to the clusters
is centralized through one host machine that runs the Crono Access
Manager and three Requisition Managers, one for each of the clusters.
Each node of the clusters has its own Node Manager. Each cluster
has its own peculiarities, having a different number of nodes
(4, 16, and 32) and different interconnection networks (combinations
of SCI, Fast-Ethernet and Myrinet). We also support several programming
environments, including different MPI implementations and some
execution kernels developed in-house. With simple commands like cralloc,
crcomp and crrun, users can choose the cluster, the number of nodes,
the target interconnection network and the programming environment
they want. Crono will automatically generate machine files, choose
the right libraries for the compilation, load the processes and
start the program. If the needed nodes are not available, waiting
queues are managed for each cluster. Because we have different
types of users, access rights can be defined separately for each user
depending on status, project, target machine and period of the day.
For example, a student may have access only to the small cluster
during peak hours and to 16 nodes of the main cluster during
the night. The system runs on the Slackware Linux 8 distribution
and is already very stable. We developed the system so that
it can be easily configured to work with other cluster configurations
and a different set of tools. More information about Crono can
be found at www.cpad.pucrs.br/crono together with its source code,
which is distributed under the GNU license.
5. References
[1] KELLER, Axel, REINEFELD, Alexander. Anatomy of a Resource Management
System for HPC Clusters, vol. 3, 2001.
[2] OpenPBS web site, "www.openpbs.org"
[3] W. Gropp, E. Lusk, and A. Skjellum. A High-Performance, Portable
Implementation of the MPI Message Passing Interface Standard, July 1996.
[*] Research supported by HP Brazil. |
|
|
Title |
Parallel, Distributed
Scripting with Python |
Author(s) |
Patrick Miller |
Author Inst. |
Lawrence Livermore National Laboratory, USA |
Presenter |
Patrick Miller
|
Abstract |
[Need to convert from PDF to HTML] |
|
|
| Systems Session III:
Processor and Network Performance
|
| Title |
| |