 |
2005 Tutorials
I Towards Highly Available, Scalable, and Secure HPC
Clusters with HA-OSCAR
II Architectures and Configuration of Linux Cluster IO
Solutions
III Programming Linux Clusters Using the Message Passing
Interface (MPI)
IV Performance Analysis Using SvPablo and KOJAK
V Object-Based Cluster Storage Systems
VI Torque/Maui: Getting the Most Out of Your Resource
Manager
VII Productive Application Development with the Intel
Cluster Toolkit
Last updated: 24 February 2005
I
|
Title |
|
Towards
Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR |
Presenters |
|
Ibrahim Haddad, Box Leangsuksun,
Stephen L. Scott |
Overview |
|
March 2004 was a major milestone for the
HA-OSCAR Working Group. It marked the announcement of the first
public release of the HA-OSCAR software package. HA-OSCAR is
an Open Source project that aims to provide a combined power
of high availability and performance computing. HA-OSCAR enhances
a Beowulf cluster system for mission critical grade applications
with various high availability mechanisms such as component redundancy
to eliminate this single point of failure, self-healing mechanism,
failure detection and recovery mechanisms, in addition to supporting
automatic failover and fail-back. The first release (version
1.0) supports new high availability capabilities for Linux Beowulf
clusters based on the OSCAR 3.0 release from the Open Cluster
Group. In this release of HA-OSCAR, we provide an installation
wizard graphical user interface and a web-based administration
tool, which allows intuitive creation and configuration of a
multi-head Beowulf cluster. In addition, we have included a default
set of monitoring services to ensure that critical services,
hardware components, and important cluster resources are always
available at the control node. HA-OSCAR also supports new tailored
services that can be configured and added via a WebMin-based
HA-OSCAR administration tool.
This tutorial will address in detail
all the design and implementation issues related to building
HA Linux Beowulf clusters and using Linux and Open Source Software
as the base technology. In addition, the focus of the tutorial
is HA-OSCAR. We will present the architecture of HA-OSCAR, review
of new features of the latest release, discuss how we implemented
the HA and security features, and discuss our experiments covering
modeling, and testing performance and availability on real systems.
Background
on HA-OSCAR: The HA-OSCAR project’s primary goal is to
improve the existing Beowulf architecture and cluster management
systems (e.g. OSCAR, ROCKS, Scyld etc) while providing high-availability
and scalability capabilities for Linux clusters. HA-OSCAR introduces
several enhancements and new features to OSCAR, mainly in the
areas of availability, scalability, and security. The new features
in the initial release are head node redundancy, self-recovery
for hardware, service, and application outages. HA-OSCAR has
been tested to work with several OSCAR distributions. HA-OSCAR
should work with OSCAR 2.3, 2.3.1, 3.0 based on Red Hat 9.0 and
OSCAR 4.0 based on Fedora core 2. The first version (1.0) was
released on March 22, 2004, brining over than 5000 hits to HA-OSCAR
site within 48 hours. The announcement was featured in the LinuxWorld
magazine, on O’Reilly, HPC
Wire, ClusterWorld, and Slashdot.net.
|
Outline |
|
Introduction
- Introduction to Beowulf
and HPC clusters
- Introduction to HA clusters
- Various levels of HA
- Linux: the commodity component of the
cluster stack
- Software and hardware system architecture
Challenges in Designing
and Prototyping HA/HPC Clusters
- Booting the cluster
- Storage
- Building the disks
- Installing application servers
- Traffic distribution mechanisms
- Load balancing mechanisms
- Building redundancy at various
levels in the cluster:
- Ethernet redundancy
- DHCP/TFTP/NTP/NFS servers’ redundancy
- Data redundancy
using software RAID.• Automating
network installations
- Automatic network RAID setups
- File systems for HA Linux
clusters
OSCAR
- Introduction
- Cluster Computing Overview
- OSCAR - "The Beginning" - Overview
/ Strategy
- OSCAR Components (Functional areas)
- Core, Admin/Config,
HPC Services
- Core Components: SIS, C3, Switcher, ODA, OPD
- "The OSCAR
Trail"
- OSCAR Wizard (v3.0)
HA-OSCAR
- HA-OSCAR overview
- HA-OSCAR architecture and components
- HA-OSCAR comparison
with Beowulf architecture
- HA features
- Multi-head builder and Self-configuration
- Monitoring
- Service monitoring
- Hardware monitoring
- Resource monitoring
- Self-healing and recovery mechanism
- Test environment
- Installation Steps
- Experiments
- Availability moldering, analysis and uptime
improvement study between Beowulf and HA-OSCAR
- Test results
- Applications and feasibility studies
- Grid-enable HA cluster
- HA-OSCAR and Distributed Security
Infrastructure integration
- HA-OSCAR and OpenMosix/LVS feasibility
study
Demonstration (with 4 laptops running the latest research
release of HA-OSCAR)
Conclusion
- HA-OSCAR Roadmap
- Advanced research
- Questions and answers
|
| Schedule |
|
All Day
|
|
|
| II
|
Title |
|
Architectures and
Configuration of Linux Cluster I/O Solutions |
Presenter(s) |
|
Lee Ward |
Overview |
|
|
Outline |
|
General cluster/parallel FS requirements
- global
namespace
- data organization
- single system view for management
- security
- parallel access methods (n x m and n x 1)
Architectures
- clustered solutions
- distributed file systems
- GFS, CVFS
- parallel file systems
- parameters; stripe, stride
- PanFS, Lustre, GPFS
- central vs. distributed metadata pros
and cons
- configuration and balance
Hardware
- Fundamental disk concepts and implications
- Tracks, cylinders,
rotational and seek latencies
- Disk access methods
- SCSI
- iSCSI
- ATA
- PATA vs. SATA
- ATAPI
- Device aggregation
- LVM
- RAID
- Software
- Hardware
- Virtualized Storage Devices
- Emerging devices
- Object-based disk
- Future
- Nanotech
- Active disk
- SANs
- Machine Configuration
- CPU requirements
- Memory requirements
- Adapter organization and the peripheral
data bus
General Q&A
|
| Schedule |
|
Half-day morning. |
|
|
| III
|
Title |
|
Programming Linux Clusters
Using the Message Passing Interface (MPI) |
Presenter(s) |
|
Purushotham V. Bangalore, Vijay Velusamy |
Overview |
|
Linux Clusters continue to dominate
the Top500 list of the world’s fastest supercomputers.
Most of the programs that run on these clusters are developed
using the Message Passing Interface (MPI) standard Application
Programming Interface (API). The proposed half-day tutorial covers
the basic knowledge required to write parallel programs using
the message passing programming model and is directly applicable
to almost every parallel computer architecture.
Efficient Parallel
programming involves solving a common task by co-operation
between processes. The programmer is required to not only define
the tasks that will be executed by the processors, but also how
these tasks are to synchronize and exchange data with one another.
In the message-passing model the tasks are separate processes
that send each other explicit messages to communicate and synchronize.
All these parallel operations are performed via calls to message-passing
interfaces that are directly responsible for interfacing with
the system infrastructure connecting the actual processors together.
This tutorial covers the following features supported by the
MPI-1.2 standard: point-to-point communication (blocking and
non-blocking operations), collective communication, primitive
datatypes, and language bindings for C/C++ and Fortran.
Although most of the functionality required for
parallel programming in clusters may be achieved using the basic
features of the MPI-1.2 standard, much more flexibility and performance
could be achieved using some of the advanced features of the
MPI-2 standard. These interfaces are designed not only to benefit
from performance enhancing features that may be available in
the parallel programming environment, but also simplify the programming
involved. This half-day tutorial focuses on the advanced MPI-1.2
functionality and features of the MPI-2 standard. While the MPI-1.2
standard relied on a static process model, many applications
require dynamic process management support and these features
were added to the MPI-2 standard. In this tutorial we present
different approaches to using dynamic process management capabilities
provided by the MPI-2 standard. Also, simple MPI applications
rely on a single process reading and writing data through files.
In order to take advantage of emerging parallel file systems
(e.g., Lustre, PVFS, Panasas) higher level abstractions are provided
by the MPI-2 standard. This tutorial presents the various APIs
and illustrative examples to describe the various features available
to access files in parallel using derived datatypes.
|
Outline |
|
The tutorial includes formal lectures, programming
examples from real-world applications, and informal discussions.
This tutorial is intended for any application developer or designer
involved in developing cluster applications.
- Introduction
to Cluster Computing and MPI
- Point-to-point Communication
- Collective Communication
- Dynamic Process Management
- Advanced Derived
Datatypes
- MPI-I/O
|
| Schedule |
|
Half-day morning. |
|
|
| IV
|
Title |
|
Performance
Analysis Using SvPablo and KOJAK |
Presenter(s) |
|
Ying Zhang, Shirley Moore |
Overview |
|
SvPablo is a performance analysis and visualization
system for both sequential and parallel platforms that provides
a graphical environment for instrumenting scientific applications
written in C and Fortran and for analyzing their performance
at the source code level. SvPablo supports both interactive and
automatic instrumentation for C, Fortran 77, and Fortran 90 programs.
During execution of the instrumented code, the SvPablo library
captures and computes statistical performance data for each instrumented
construct on each processor. Because it only maintains the statistical
data, rather than detailed event traces, the SvPablo library
can characterize the performance of programs that run for days
and on hundreds of processors. SvPablo captures both software
performance data (such as duration and number of calls) and hardware
metrics (such as floating point operations, cycles, cache misses)
at run time. At the end of execution, the SvPablo library generates
summary files for each processor or MPI task. The per-task performance
files can be merged into one performance file post- mortem via
a utility program. This file then can be loaded and displayed
in the SvPablo GUI. Performance data are presented in the SvPablo
GUI via a hierarchy of color-coded performance displays, including
a high- level procedure profile and detailed line-level statistics.
The GUI provides a convenient means for the user to identify
high-level bottlenecks (e.g. procedures) and then explore in
increasing levels of detail (e.g. identifying a specific cause
of poor performance at a source code line executed on one of
many processors). In addition, when the program is executed on
different numbers of processors, SvPablo’s graphical scalability
analysis tool calculates scalability efficiencies for each instrumented
construct based on the performance files for each execution.
Via a colored display and line graph, it presents to the users
the overall scaling trends and limits for their applications.
It also helps them to identify the code sections that have best
or worst scalability and discover the key areas in the application
for optimization, which is extremely important when porting applications
to large scale clusters.
KOJAK is a suite of performance tools
that collect and analyze performance data from high performance
applications written in Fortran 77/90/95 or C/C++. Performance
data are collected automatically using a combination of source
code annotations or binary instrumentation and hardware counters.
The analysis tools use pattern recognition to convert the raw
performance data into information about performance bottlenecks
relevant to developers. Once profiling and scalability analysis
have been carried out using a profiling tool such as SvPablo,
KOJAK can be configured to automatically instrument problem areas
of the application code to generate a detailed trace file using
the EPILOG library. The event traces generated by EPILOG capture
MPI point-to-point and collective communication as well as OpenMP
parallelism change, parallel constructs, and synchronization.
In addition, data from hardware counters accessed using the PAPI
library can be recorded in the event traces. KOJAK’s EXPERT
tool is an automatic trace analyzer that attempts to identify
specific performance problems. Internally, EXPERT represents
performance problems in the form of execution patterns that model
inefficient behavior. These patterns are used during the analysis
process to recognize and quantify inefficient behavior in the
application. Each pattern calculates a (call path, location)
matrix containing the time spent on a specific behavior in a
particular (call path, location) pair, where a location is a
process or thread. Thus, EXPERT maps.the (performance problem,
call path, location) space onto the time spent on a particular
performance problem while the program was executing in a particular
r call path at a particular location. After the analysis has
been finished, the mapping is written to a file and can be viewed
using the CUBE display tool. The CUBE display consists of three
coupled tree browsers, representing the metric, the program,
and the location dimensions. At any given time, there are two
nodes selected, one in the metric tree and one in the call tree.
Each node is labeled with a severity value. A value shown in
the metric tree represents the sum of a particular metric for
the entire program, that is, across all call paths and all locations.
A value shown in the call tree represents the sum of the selected
metric across all locations for a particular call path. A value
shown in the location tree represents the selected metric for
the selected call path and a particular location. To help identify
metric/resource combinations with a high severity, values are
ranked using colors. Thus, the CUBE display visually identifies
performance bottlenecks that have been detected by EXPERT and
relates them to specific source code and process/thread locations.
Recent work on KOJAK has used a rapid prototyping capability
written in Python to define new types of patterns, such as a
specialized pattern for detecting communication bottlenecks in
wavefront algorithms.
SvPablo and KOJAK have been used in the
performance analysis of large scientific applications on major
parallel systems including Linux clusters, IBM SP, NEC SX, HP/Compaq
Alpha, etc. The applications studied include DOE SciDAC applications
being analyzed by the Performance Evaluation Research Center
(PERC). This tutorial will provide an overview of SvPablo, introduce
tool usage with an emphasis on scalability analysis, and presents
results obtained using SvPablo in scalability analysis of a scientific
application on Linux clusters (up to 1024 processors.) It will
also introduce a new feature in SvPablo temperature measurement
and profiling -- which helps users to identify the “hot
spots” in
their applications that may present problems for the system and
affect the system’s state of health. Following the SvPablo
presentation, the tutorial will provide an overview of KOJAK,
demonstrate how profile data obtained using SvPablo can be used
to select portions of source code for EPILOG instrumentation,
and explain and demonstrate the use of EXPERT and CUBE. Finally,
instruction will be provided on how to prototype new EXPERT analysis
patterns.
|
Outline |
|
Outline of tutorial:
- Overview of SvPablo
- SvPablo interactive and automated source code instrumentation
- Use of the SvPablo GUI
- SvPablo scalability
analysis with example
- Temperature measurement and profiling
- Overview
of KOJAK
- Use of SvPablo profile data to selectively
insert EPILOG instrumentation
- Use of EXPERT
- Use of the CUBE GUI
- Prototyping new EXPERT
patterns using Python
|
| Schedule |
|
Half-day morning. |
|
|
| V
|
Title |
|
Object-Based
Cluster Storage Systems |
Presenter(s) |
|
David Nagle |
Overview |
|
The last few years have seen significant advanced
in cluster-based storage with many new systems embracing object-based
storage to provide the scalability, performance and fault-tolerance
necessary to meet the demands of cluster applications. Products
adopting the object-model, include those from Panasas, Lustre,
Ibrix, Centera, and Isilon. This tutorial will present the fundamentals
of object-based storage including both the underlying architectural
principals and how various products have adapted those principals
into their product designs. The tutorial will begin with an overview
of object-based storage device (OSD) interface as defined by
the ANSI/T10 standard. Topics will include the object-model,
the OSD command set, and OSD security. Will we then describe
the decoupled data/metadata storage architecture commonly found
in cluster storage systems and how the OSD interface, security
model, networking and RAID play critical roles in the performance
and fault-tolerance of these systems. Finally we will perform
an in-depth comparison of the various object-based storage systems
available today.
|
Outline |
|
- Intro to object-based cluster storage
- Comparison between
block and object-based storage
- Details of object-based storage (OSD)
- Basic object-based cluster storage architecture
- Scaling via
decoupling of data path from metadata
- Security
- Networking and communications
- RAID and object-based
storage
- Fault tolerance models for cluster storage systems
- Comparison
of current solutions
- EMC Centera
- Lustre
- Panasas
- Isilon
- Ibrix
- For each of the above systems
|
| Schedule |
|
Half-day afternoon. |
|
|
| VI
|
Title |
|
Torque/Maui: Getting the
Most Out of Your Resource Manager |
Presenter(s) |
|
Eric Kretzer, Dan Lapine, Peter Enstrom |
Overview |
|
Torque is a resource manager providing control
over batch jobs and distributed nodes. Maui is an advanced scheduler
commonly used in conjunction with Torque and other resource managers.
Torque continues to offer features that rival commercially available
packages including support for scaling to multi-teraflop clusters.
Maui offers a variety of scheduling policies that include priority
escalation, advance reservations, fairshare, and backfill capabilities.
The Torque/Maui tandem can maximize the cluster's overall utilization
at no cost, and therefore making Torque/Maui appealing to a majority
of cluster administrators.
This tutorial aims to introduce participants
to the resource manager, Torque, and its accompanying advanced
scheduler, Maui. The tutorial will provide a foundation on the
basics of configuring the server and client portions of Torque.
Then building on that foundation, we will discuss detailed information
on leveraging the advance capabilities of Maui to enforce specific
resource policies such as ensuring a particular set of users
are guaranteed a specific set of resources.
|
Outline |
|
- Introduction to Resource Management
- Short overview
- Benefits of using Torque/Maui
- Key deliverables of the tutorial
- Torque Resource Manager
- Resource Server
- overview
- configuring the server
- defining nodes & node features
- defining queues
- associated commands
- Resource Client
- overview
- configuring the mom
- setting up a job prologue & epilogue
- pre-job execution
machine check
- associated commands
- Resource Scheduler
- overview
- configuring the scheduler
- associated commands
- substituting a third party scheduler
- HANDS-ON LAB will
exercise bullet points above.
- Maui Cluster Scheduler
- overview
- reservations
- resource partitioning
- quality of service
- backfill
- fairshare
- statistics
- associated commands
- HANDS-ON LAB will couple advance scheduling
to the pre-existing resource manager
- Policies
- job submission filters
- locking jobs/users to specific queues
- different charging
algorithms based on resources & features
- prioritizing
job types
|
| Schedule |
|
Half-day afternoon.
|
|
|
| VII
|
Title |
|
Productive Application
Development with the Intel Cluster Toolkit |
Presenters |
|
Mario Deilmann |
Overview |
|
In the introduction, we provide an overview of
prerequisites, terminology, performance issues and the nature
of a parallel development approach. The main part of the tutorial
will be an intermediate example of how the performance of an
MPI application can be optimized using the Intel MPI and
Intel® Trace
Collector. Advanced topics are advanced MPI tracing and some
sophisticated functionality like binary instrumentation. A short
introduction using the Intel® Cluster
Math Kernel Library and The Intel® MPI benchmark follows. Finally we
throw a glance at the upcoming new version of the Intel® Trace Analyzer,
which is the first version with a complete new software design
and a new graphical user interface. This tutorial aims to introduce
detailed technical information along with a basic understanding
on how different factors contribute to the overall application
performance and how to address them with the different tools
of the Intel® Cluster Toolkit.
|
Outline |
|
This tutorial introduces the Intel Cluster Toolkit (ICT)
1.0, a toolkit for cluster programmers. This toolkit addresses
the main stages of the parallel development process, enabling developers
to achieve optimized performance for Intel processor-based cluster
systems.
Features:
- Efficiently build scalable parallel MPI
applications that can execute across multiple network architectures.
- Tune applications with powerful mathematical library
functions from Intel® Cluster Math Kernel Library.
- Easily
collect trace information and analyze runtime behavior in
detail, using the leading tools for performance analysis on
clusters.
- Gather performance information about a cluster system
easily.
- Linux functionality and encompassing MPI-2 extensions
with standard compliance and portability.
- Interoperable with
resource management, performance measurement and debugging
tools.
- A single installation and license for five cluster
development solutions
Performance (This tutorial will introduce the following tools
that address the above issues.):
- Intel MPI Library 1.0 provides
a flexible implementation of MPI for easier message-passing
interface development on multiple network architectures.
- Intel Cluster
Math Kernel Library 7.2 enables parallel computing programmers
to develop Linux applications with numerical stability. Multiple
library functionality includes ScaLAPACK, Vector Math and Statistical
libraries, PARDISO Sparse Solver, and Discrete Fourier Transforms.
- Intel Trace Collector1 1 5.0 applies event-based
tracing in cluster applications with a low-overhead library.
ITC offers performance data, recording of statistics, multi-threaded
traces, and automatic instrumentation of binaries on IA-32.
- Intel Trace Analyzer2 2 4.0 provides visual analysis
of application activities gathered by the Intel® Trace
Collector.
- Intel MPI Benchmarks3 3 2.3 is a comprehensive
set of MPI benchmarks.
Figure 1 illustrates how the different
components interact with a user’s application.

1 Intel
Trace Collector was formerly known as Vampirtrace.
2 Intel Trace
Analyzer was formerly known as Vampir.
3 Intel MPI
Benchmarks were formerly known as PMB.
|
| Schedule |
|
Half-day afternoon.
|
|
|
|
|
|
|
|