This is an archived page of the 2005 conference

Tutorials

I Towards Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR
II Architectures and Configuration of Linux Cluster I/O Solutions
III Programming Linux Clusters Using the Message Passing Interface (MPI)
IV Performance Analysis Using SvPablo and KOJAK
V Object-Based Cluster Storage Systems
VI Torque/Maui: Getting the Most Out of Your Resource Manager
VII Productive Application Development with the Intel Cluster Toolkit

Last updated: 24 February 2005

I

Title

 

Towards Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR

Presenters

 

Ibrahim Haddad, Box Leangsuksun, Stephen L. Scott

Overview

 

March 2004 was a major milestone for the HA-OSCAR Working Group: it marked the first public release of the HA-OSCAR software package. HA-OSCAR is an open source project that aims to combine high availability with high performance computing. HA-OSCAR enhances a Beowulf cluster system for mission-critical applications with high-availability mechanisms such as component redundancy (to eliminate the head node as a single point of failure), self-healing, and failure detection and recovery, in addition to supporting automatic failover and fail-back. The first release (version 1.0) adds these high-availability capabilities to Linux Beowulf clusters based on the OSCAR 3.0 release from the Open Cluster Group. This release of HA-OSCAR provides an installation wizard with a graphical user interface and a web-based administration tool, which allow intuitive creation and configuration of a multi-head Beowulf cluster. In addition, we have included a default set of monitoring services to ensure that critical services, hardware components, and important cluster resources are always available at the control node. HA-OSCAR also supports tailored services that can be configured and added via a Webmin-based HA-OSCAR administration tool.
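To make the monitoring and failover idea concrete, here is a generic C sketch of a service-monitor loop of the kind described above. It is not HA-OSCAR code: the address, probe port, thresholds, and the failover action are all made-up placeholders.

/*
 * Generic illustration of a head-node health probe with failover trigger.
 * NOT HA-OSCAR's implementation; host, port, and thresholds are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define PRIMARY_HEAD  "192.168.1.1"  /* hypothetical primary head node address */
#define SERVICE_PORT  22             /* e.g. sshd used as a health probe       */
#define MAX_FAILURES  3              /* consecutive misses before failover     */
#define POLL_INTERVAL 5              /* seconds between probes                 */

/* Return 0 if a TCP connection to host:port succeeds, -1 otherwise. */
static int probe_service(const char *host, int port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, host, &addr.sin_addr);

    int rc = connect(sock, (struct sockaddr *)&addr, sizeof(addr));
    close(sock);
    return rc;
}

int main(void)
{
    int failures = 0;

    for (;;) {
        if (probe_service(PRIMARY_HEAD, SERVICE_PORT) == 0) {
            failures = 0;                 /* primary head is healthy */
        } else if (++failures >= MAX_FAILURES) {
            /* A real HA system would take over the service address, restart
             * services on the standby head, and later fail back. */
            fprintf(stderr, "primary head unreachable: initiating failover\n");
            failures = 0;
        }
        sleep(POLL_INTERVAL);
    }
    return EXIT_SUCCESS;
}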

This tutorial will address in detail the design and implementation issues involved in building HA Linux Beowulf clusters using Linux and open source software as the base technology, with HA-OSCAR as its focus. We will present the HA-OSCAR architecture, review the new features of the latest release, explain how the HA and security features are implemented, and discuss our experiments, which cover modeling as well as performance and availability testing on real systems.

Background on HA-OSCAR: The HA-OSCAR project’s primary goal is to improve the existing Beowulf architecture and cluster management systems (e.g., OSCAR, ROCKS, Scyld) by providing high-availability and scalability capabilities for Linux clusters. HA-OSCAR introduces several enhancements and new features to OSCAR, mainly in the areas of availability, scalability, and security. The new features in the initial release are head node redundancy and self-recovery from hardware, service, and application outages. HA-OSCAR has been tested with several OSCAR distributions and should work with OSCAR 2.3, 2.3.1, and 3.0 based on Red Hat 9 and with OSCAR 4.0 based on Fedora Core 2. The first version (1.0) was released on March 22, 2004, bringing more than 5,000 hits to the HA-OSCAR site within 48 hours. The announcement was featured in LinuxWorld magazine and on O’Reilly, HPC Wire, ClusterWorld, and Slashdot.

Outline

 

Introduction

  • Introduction to Beowulf and HPC clusters
  • Introduction to HA clusters
  • Various levels of HA
  • Linux: the commodity component of the cluster stack
  • Software and hardware system architecture

Challenges in Designing and Prototyping HA/HPC Clusters

  • Booting the cluster
  • Storage
  • Building the disks
  • Installing application servers
  • Traffic distribution mechanisms
  • Load balancing mechanisms
  • Building redundancy at various levels in the cluster:
  • Ethernet redundancy
  • DHCP/TFTP/NTP/NFS servers’ redundancy
  • Data redundancy using software RAID
  • Automating network installations
  • Automatic network RAID setups
  • File systems for HA Linux clusters

OSCAR

  • Introduction
  • Cluster Computing Overview
  • OSCAR - "The Beginning" - Overview / Strategy
  • OSCAR Components (Functional areas)
    • Core, Admin/Config, HPC Services
    • Core Components: SIS, C3, Switcher, ODA, OPD
  • "The OSCAR Trail"
  • OSCAR Wizard (v3.0)

HA-OSCAR

  • HA-OSCAR overview
  • HA-OSCAR architecture and components
  • HA-OSCAR comparison with Beowulf architecture
  • HA features
  • Multi-head builder and Self-configuration
  • Monitoring
    • Service monitoring
    • Hardware monitoring
    • Resource monitoring
  • Self-healing and recovery mechanism
  • Test environment
  • Installation Steps
  • Experiments
  • Availability modeling, analysis, and uptime improvement study comparing Beowulf and HA-OSCAR
  • Test results
  • Applications and feasibility studies
  • Grid-enabled HA cluster
  • HA-OSCAR and Distributed Security Infrastructure integration
  • HA-OSCAR and OpenMosix/LVS feasibility study

Demonstration (with 4 laptops running the latest research release of HA-OSCAR)

Conclusion
  • HA-OSCAR Roadmap
  • Advanced research
  • Questions and answers

Schedule
All Day

 

 

II

Title

 

Architectures and Configuration of Linux Cluster I/O Solutions

Presenter(s)

 

Lee Ward

Overview

 

 

Outline

 

General cluster/parallel FS requirements

  • global namespace
  • data organization
  • single system view for management
  • security
  • parallel access methods (n x m and n x 1)

Architectures

  • clustered solutions
  • distributed file systems
  • GFS, CVFS
  • parallel file systems
  • parameters: stripe and stride (see the sketch after this list)
  • PanFS, Lustre, GPFS
  • central vs. distributed metadata pros and cons
  • configuration and balance
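The stripe and stride parameters above determine how a file's bytes are laid out across storage targets. The following small C sketch shows a simple round-robin striping calculation; it is a generic illustration with made-up parameter values, not tied to any particular file system in this outline.

#include <stdio.h>
#include <stdint.h>

/* Round-robin striping: which target holds a given file offset, and where. */
struct stripe_loc {
    uint64_t target;        /* index of the storage target                */
    uint64_t target_offset; /* byte offset within that target's object    */
};

static struct stripe_loc map_offset(uint64_t file_offset,
                                    uint64_t stripe_size,   /* bytes per stripe unit */
                                    uint64_t stripe_count)  /* number of targets     */
{
    uint64_t unit = file_offset / stripe_size;   /* which stripe unit */
    struct stripe_loc loc;
    loc.target        = unit % stripe_count;
    loc.target_offset = (unit / stripe_count) * stripe_size
                        + file_offset % stripe_size;
    return loc;
}

int main(void)
{
    /* Example: 64 KiB stripe units spread across 4 targets. */
    struct stripe_loc loc = map_offset(300000, 65536, 4);
    printf("target %llu, offset %llu\n",
           (unsigned long long)loc.target,
           (unsigned long long)loc.target_offset);
    return 0;
}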

Hardware

  • Fundamental disk concepts and implications
  • Tracks, cylinders, rotational and seek latencies
  • Disk access methods
  • SCSI
  • iSCSI
  • ATA
  • PATA vs. SATA
  • ATAPI
  • Device aggregation
  • LVM
  • RAID
  • Software
  • Hardware
  • Virtualized Storage Devices
  • Emerging devices
  • Object-based disk
  • Future
  • Nanotech
  • Active disk
  • SANs
  • Machine Configuration
  • CPU requirements
  • Memory requirements
  • Adapter organization and the peripheral data bus

General Q&A

 

Schedule


Half-day morning.

 

 

III

Title

 

Programming Linux Clusters Using the Message Passing Interface (MPI)

Presenter(s)

 

Purushotham V. Bangalore, Vijay Velusamy

Overview

 

Linux clusters continue to dominate the Top500 list of the world’s fastest supercomputers. Most of the programs that run on these clusters are developed using the Message Passing Interface (MPI) standard Application Programming Interface (API). This half-day tutorial covers the basic knowledge required to write parallel programs using the message-passing programming model, which is directly applicable to almost every parallel computer architecture.

Efficient parallel programming involves solving a common task through cooperation between processes. The programmer must define not only the tasks that the processors will execute, but also how these tasks synchronize and exchange data with one another. In the message-passing model, the tasks are separate processes that send each other explicit messages to communicate and synchronize. All of these parallel operations are performed via calls to a message-passing interface that is responsible for interfacing with the system infrastructure connecting the processors. This tutorial covers the following features of the MPI-1.2 standard: point-to-point communication (blocking and non-blocking operations), collective communication, primitive datatypes, and language bindings for C/C++ and Fortran.
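As a concrete illustration of the kind of program this material targets, here is a minimal C sketch (not taken from the tutorial) that combines blocking point-to-point communication with a collective reduction; the payload and message tag are arbitrary.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Blocking point-to-point: every non-zero rank sends its rank to rank 0. */
    if (rank == 0) {
        int value;
        MPI_Status status;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 received %d from rank %d\n", value, src);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    /* Collective communication: sum of all ranks, delivered to every process. */
    int sum;
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}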

Although most of the functionality required for parallel programming on clusters can be achieved using the basic features of the MPI-1.2 standard, considerably more flexibility and performance can be obtained with some of the advanced features of the MPI-2 standard. These interfaces are designed not only to exploit performance-enhancing features that may be available in the parallel programming environment, but also to simplify the programming involved. This half-day tutorial focuses on advanced MPI-1.2 functionality and on features of the MPI-2 standard. While MPI-1.2 relied on a static process model, many applications require dynamic process management, and this support was added in MPI-2. In this tutorial we present different approaches to using the dynamic process management capabilities provided by MPI-2. In addition, simple MPI applications rely on a single process to read and write data through files; to take advantage of emerging parallel file systems (e.g., Lustre, PVFS, Panasas), higher-level abstractions are provided by the MPI-2 standard. The tutorial presents the corresponding APIs and illustrative examples describing the features available for accessing files in parallel using derived datatypes.
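As a small illustration of MPI-2 parallel I/O, the following C sketch has every process write its own block of a shared file through a collective call; the file name and block size are arbitrary, and derived-datatype file views are omitted for brevity.

#include <stdio.h>
#include <mpi.h>

#define N 1024   /* elements written by each process (illustrative size) */

int main(int argc, char *argv[])
{
    int rank;
    int buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++)
        buf[i] = rank;   /* arbitrary payload: each rank writes its own id */

    /* All processes open the same file collectively ... */
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* ... and each writes its contiguous block at a rank-dependent offset.
     * The collective call lets the MPI-I/O layer and the parallel file
     * system beneath it coordinate and optimize the access. */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}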

Outline

 

The tutorial includes formal lectures, programming examples from real-world applications, and informal discussions. This tutorial is intended for any application developer or designer involved in developing cluster applications.

  • Introduction to Cluster Computing and MPI
  • Point-to-point Communication
  • Collective Communication
  • Dynamic Process Management
  • Advanced Derived Datatypes
  • MPI-I/O

Schedule


Half-day morning.

 

 

IV

Title

 

Performance Analysis Using SvPablo and KOJAK

Presenter(s)

 

Ying Zhang, Shirley Moore

Overview

 

SvPablo is a performance analysis and visualization system for both sequential and parallel platforms that provides a graphical environment for instrumenting scientific applications written in C and Fortran and for analyzing their performance at the source code level. SvPablo supports both interactive and automatic instrumentation for C, Fortran 77, and Fortran 90 programs. During execution of the instrumented code, the SvPablo library captures and computes statistical performance data for each instrumented construct on each processor. Because it maintains only statistical data, rather than detailed event traces, the SvPablo library can characterize the performance of programs that run for days and on hundreds of processors. SvPablo captures both software performance data (such as duration and number of calls) and hardware metrics (such as floating point operations, cycles, and cache misses) at run time. At the end of execution, the SvPablo library generates summary files for each processor or MPI task. The per-task performance files can be merged into one performance file post-mortem via a utility program. This file can then be loaded and displayed in the SvPablo GUI. Performance data are presented in the SvPablo GUI via a hierarchy of color-coded performance displays, including a high-level procedure profile and detailed line-level statistics. The GUI provides a convenient means for the user to identify high-level bottlenecks (e.g., procedures) and then explore in increasing levels of detail (e.g., identifying a specific cause of poor performance at a source code line executed on one of many processors). In addition, when the program is executed on different numbers of processors, SvPablo’s graphical scalability analysis tool calculates scalability efficiencies for each instrumented construct based on the performance files for each execution. Via a colored display and line graph, it presents to users the overall scaling trends and limits for their applications. It also helps them identify the code sections that have the best or worst scalability and discover the key areas in the application for optimization, which is extremely important when porting applications to large-scale clusters.

KOJAK is a suite of performance tools that collect and analyze performance data from high performance applications written in Fortran 77/90/95 or C/C++. Performance data are collected automatically using a combination of source code annotations or binary instrumentation and hardware counters. The analysis tools use pattern recognition to convert the raw performance data into information about performance bottlenecks relevant to developers. Once profiling and scalability analysis have been carried out using a profiling tool such as SvPablo, KOJAK can be configured to automatically instrument problem areas of the application code to generate a detailed trace file using the EPILOG library. The event traces generated by EPILOG capture MPI point-to-point and collective communication as well as OpenMP parallelism change, parallel constructs, and synchronization. In addition, data from hardware counters accessed using the PAPI library can be recorded in the event traces. KOJAK’s EXPERT tool is an automatic trace analyzer that attempts to identify specific performance problems. Internally, EXPERT represents performance problems in the form of execution patterns that model inefficient behavior. These patterns are used during the analysis process to recognize and quantify inefficient behavior in the application. Each pattern calculates a (call path, location) matrix containing the time spent on a specific behavior in a particular (call path, location) pair, where a location is a process or thread. Thus, EXPERT maps the (performance problem, call path, location) space onto the time spent on a particular performance problem while the program was executing in a particular call path at a particular location. After the analysis has finished, the mapping is written to a file and can be viewed using the CUBE display tool. The CUBE display consists of three coupled tree browsers, representing the metric, the program, and the location dimensions. At any given time, two nodes are selected, one in the metric tree and one in the call tree. Each node is labeled with a severity value. A value shown in the metric tree represents the sum of a particular metric for the entire program, that is, across all call paths and all locations. A value shown in the call tree represents the sum of the selected metric across all locations for a particular call path. A value shown in the location tree represents the selected metric for the selected call path and a particular location. To help identify metric/resource combinations with a high severity, values are ranked using colors. Thus, the CUBE display visually identifies performance bottlenecks that have been detected by EXPERT and relates them to specific source code and process/thread locations. Recent work on KOJAK has used a rapid prototyping capability written in Python to define new types of patterns, such as a specialized pattern for detecting communication bottlenecks in wavefront algorithms.
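For a sense of the hardware-counter data mentioned above, here is a small, hedged C sketch that measures a code region with PAPI's high-level counter interface. The chosen events are assumptions and their availability varies by platform; tools such as KOJAK and SvPablo access PAPI through their own instrumentation rather than hand-written code like this.

#include <stdio.h>
#include <papi.h>

#define NUM_EVENTS 2

int main(void)
{
    /* Hardware events to sample; availability depends on the processor. */
    int events[NUM_EVENTS] = { PAPI_FP_OPS, PAPI_L1_DCM };
    long_long values[NUM_EVENTS];
    double sum = 0.0;

    if (PAPI_start_counters(events, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        return 1;
    }

    /* The instrumented region: any computation of interest goes here. */
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;

    if (PAPI_stop_counters(values, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop_counters failed\n");
        return 1;
    }

    printf("sum=%f  FP ops=%lld  L1 data misses=%lld\n",
           sum, values[0], values[1]);
    return 0;
}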

SvPablo and KOJAK have been used in the performance analysis of large scientific applications on major parallel systems including Linux clusters, IBM SP, NEC SX, HP/Compaq Alpha, etc. The applications studied include DOE SciDAC applications being analyzed by the Performance Evaluation Research Center (PERC). This tutorial will provide an overview of SvPablo, introduce tool usage with an emphasis on scalability analysis, and present results obtained using SvPablo in the scalability analysis of a scientific application on Linux clusters (up to 1,024 processors). It will also introduce a new feature in SvPablo, temperature measurement and profiling, which helps users identify the “hot spots” in their applications that may present problems for the system and affect its state of health. Following the SvPablo presentation, the tutorial will provide an overview of KOJAK, demonstrate how profile data obtained using SvPablo can be used to select portions of source code for EPILOG instrumentation, and explain and demonstrate the use of EXPERT and CUBE. Finally, instruction will be provided on how to prototype new EXPERT analysis patterns.

Outline

 

Outline of tutorial:

  • Overview of SvPablo
  • SvPablo interactive and automated source code instrumentation
  • Use of the SvPablo GUI
  • SvPablo scalability analysis with example
  • Temperature measurement and profiling
  • Overview of KOJAK
  • Use of SvPablo profile data to selectively insert EPILOG instrumentation
  • Use of EXPERT
  • Use of the CUBE GUI
  • Prototyping new EXPERT patterns using Python

Schedule


Half-day morning.

 

 

V

Title

 

Object-Based Cluster Storage Systems

Presenter(s)

 

David Nagle

Overview

 

The last few years have seen significant advances in cluster-based storage, with many new systems embracing object-based storage to provide the scalability, performance, and fault tolerance necessary to meet the demands of cluster applications. Products adopting the object model include those from Panasas, Lustre, Ibrix, Centera, and Isilon. This tutorial will present the fundamentals of object-based storage, including both the underlying architectural principles and how various products have adapted those principles in their designs. The tutorial will begin with an overview of the object-based storage device (OSD) interface as defined by the ANSI/T10 standard. Topics will include the object model, the OSD command set, and OSD security. We will then describe the decoupled data/metadata storage architecture commonly found in cluster storage systems and how the OSD interface, security model, networking, and RAID play critical roles in the performance and fault tolerance of these systems. Finally, we will perform an in-depth comparison of the various object-based storage systems available today.
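To make the decoupled data/metadata path concrete, here is a purely illustrative C sketch of the client-side flow. The function names, the "capability" structure, and the layout fields are hypothetical stand-ins; they are not the T10 OSD command set or any vendor's API.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types standing in for real protocol structures. */
struct capability { uint8_t token[20]; };  /* signed by the metadata server */
struct layout {
    uint64_t object_id;
    uint32_t stripe_size;
    uint32_t osd_count;
    struct capability cap;
};

static struct layout metadata_lookup(const char *path)
{
    /* In a real system this is an RPC to the metadata server, which returns
     * the file's object layout plus a capability authorizing the client to
     * access those objects directly. */
    struct layout l = { 42, 65536, 4, {{0}} };
    (void)path;
    return l;
}

static void osd_read(uint32_t osd, uint64_t object_id, uint64_t offset,
                     void *buf, size_t len, const struct capability *cap)
{
    /* In a real system this is an OSD READ command sent straight to the
     * storage device, which checks the capability before serving data. */
    (void)object_id; (void)cap;
    memset(buf, 0, len);
    printf("read %zu bytes at offset %llu from OSD %u\n",
           len, (unsigned long long)offset, (unsigned)osd);
}

int main(void)
{
    char buf[4096];

    /* 1. Metadata path: one lookup, then out of the data path. */
    struct layout l = metadata_lookup("/scratch/file");

    /* 2. Data path: the client reads stripes directly from the OSDs,
     *    presenting the capability obtained in step 1. */
    for (uint32_t s = 0; s < l.osd_count; s++)
        osd_read(s, l.object_id, (uint64_t)s * l.stripe_size,
                 buf, sizeof(buf), &l.cap);

    return 0;
}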

Outline

 
  1. Intro to object-based cluster storage
    • Comparison between block and object-based storage
    • Details of object-based storage (OSD)
  2. Basic object-based cluster storage architecture
    • Scaling via decoupling of data path from metadata
    • Security
    • Networking and communications
    • RAID and object-based storage
    • Fault tolerance models for cluster storage systems
  3. Comparison of current solutions
    • EMC Centera
    • Lustre
    • Panasas
    • Isilon
    • Ibrix
    • For each of the above systems

Schedule


Half-day afternoon.

 

 

VI

Title

 

Torque/Maui: Getting the Most Out of Your Resource Manager

Presenter(s)

 

Eric Kretzer, Dan Lapine, Peter Enstrom

Overview

 

Torque is a resource manager providing control over batch jobs and distributed nodes. Maui is an advanced scheduler commonly used in conjunction with Torque and other resource managers. Torque continues to offer features that rival commercially available packages, including support for scaling to multi-teraflop clusters. Maui offers a variety of scheduling policies, including priority escalation, advance reservations, fairshare, and backfill. The Torque/Maui tandem can maximize a cluster's overall utilization at no cost, making it appealing to the majority of cluster administrators.

This tutorial aims to introduce participants to the Torque resource manager and its accompanying advanced scheduler, Maui. The tutorial will provide a foundation in the basics of configuring the server and client portions of Torque. Building on that foundation, we will then discuss in detail how to leverage the advanced capabilities of Maui to enforce specific resource policies, such as ensuring that a particular set of users is guaranteed a specific set of resources.
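Torque is normally driven from the command line (qsub, qmgr, pbsnodes), but its functionality is also exposed through the classic PBS batch-interface C library inherited from OpenPBS. The sketch below is an assumption-laden illustration of that interface (pbs_ifl.h); the script path is a placeholder, and the exact calls should be checked against your Torque installation.

#include <stdio.h>
#include <pbs_ifl.h>

int main(void)
{
    /* Connect to the default Torque server (as named by pbs_default()). */
    int conn = pbs_connect(pbs_default());
    if (conn < 0) {
        fprintf(stderr, "could not connect to the Torque server\n");
        return 1;
    }

    /* Submit a job script with default attributes to the default queue.
     * "/home/user/job.sh" is a placeholder path. */
    char *jobid = pbs_submit(conn, NULL, "/home/user/job.sh", NULL, NULL);
    if (jobid != NULL)
        printf("submitted job %s\n", jobid);
    else
        fprintf(stderr, "pbs_submit failed\n");

    pbs_disconnect(conn);
    return 0;
}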

Outline

 
  1. Introduction to Resource Management
    • Short overview
    • Benefits of using Torque/Maui
    • Key deliverables of the tutorial
  2. Torque Resource Manager
    • Resource Server
      • overview
      • configuring the server
      • defining nodes & node features
      • defining queues
      • associated commands
    • Resource Client
      • overview
      • configuring the mom
      • setting up a job prologue & epilogue
      • pre-job execution machine check
      • associated commands
    • Resource Scheduler
      • overview
      • configuring the scheduler
      • associated commands
      • substituting a third party scheduler
    • HANDS-ON LAB will exercise the bullet points above.
  3. Maui Cluster Scheduler
    • overview
    • reservations
    • resource partitioning
    • quality of service
    • backfill
    • fairshare
    • statistics
    • associated commands
    • HANDS-ON LAB will couple advanced scheduling to the pre-existing resource manager
  4. Policies
    • job submission filters
    • locking jobs/users to specific queues
    • different charging algorithms based on resources & features
    • prioritizing job types

Schedule


Half-day afternoon.

 

 

VII

Title

 

Productive Application Development with the Intel Cluster Toolkit

Presenters

 

Mario Deilmann

Overview

 

In the introduction, we provide an overview of prerequisites, terminology, performance issues, and the nature of a parallel development approach. The main part of the tutorial is an intermediate example of how the performance of an MPI application can be optimized using Intel MPI and the Intel® Trace Collector. Advanced topics include advanced MPI tracing and sophisticated functionality such as binary instrumentation. A short introduction to the Intel® Cluster Math Kernel Library and the Intel® MPI Benchmarks follows. Finally, we take a look at the upcoming version of the Intel® Trace Analyzer, the first version with a completely new software design and a new graphical user interface. This tutorial aims to convey detailed technical information along with a basic understanding of how different factors contribute to overall application performance and how to address them with the different tools of the Intel® Cluster Toolkit.

Outline

 

This tutorial introduces the Intel Cluster Toolkit (ICT) 1.0, a toolkit for cluster programmers. This toolkit addresses the main stages of the parallel development process, enabling developers to achieve optimized performance for Intel processor-based cluster systems.

Features:

  • Efficiently build scalable parallel MPI applications that can execute across multiple network architectures.
  • Tune applications with powerful mathematical library functions from Intel® Cluster Math Kernel Library.
  • Easily collect trace information and analyze runtime behavior in detail, using the leading tools for performance analysis on clusters.
  • Gather performance information about a cluster system easily.
  • Linux functionality encompassing MPI-2 extensions, with standards compliance and portability.
  • Interoperable with resource management, performance measurement, and debugging tools.
  • A single installation and license for five cluster development solutions.

Performance (the tutorial will introduce the following tools, which address the issues above):

  • Intel MPI Library 1.0 provides a flexible implementation of MPI for easier message-passing interface development on multiple network architectures.
  • Intel Cluster Math Kernel Library 7.2 enables parallel computing programmers to develop Linux applications with numerical stability. Its functionality includes ScaLAPACK, Vector Math and Statistical libraries, the PARDISO Sparse Solver, and Discrete Fourier Transforms.
  • Intel Trace Collector¹ 5.0 applies event-based tracing in cluster applications with a low-overhead library. ITC offers performance data, recording of statistics, multi-threaded traces, and automatic instrumentation of binaries on IA-32.
  • Intel Trace Analyzer² 4.0 provides visual analysis of application activities gathered by the Intel® Trace Collector.
  • Intel MPI Benchmarks³ 2.3 is a comprehensive set of MPI benchmarks.

Figure 1 illustrates how the different components interact with a user’s application.


  Figure 1: The Software Architecture of the Intel Cluster Toolkit.

¹ Intel Trace Collector was formerly known as Vampirtrace.
² Intel Trace Analyzer was formerly known as Vampir.
³ Intel MPI Benchmarks were formerly known as PMB.

Schedule


Half-day afternoon.