This is an archived page of the 2006 conference

tutorials

I Cancelled
II Advanced Cluster and Grid Management with Moab Cluster Suite
III Towards Highly Available, Scalable, and Secure Computer Clusters with HA-OSCAR
IV Cancelled
V Object-Based Cluster Storage
VI Resource Management Using SLURM
VII Machine Room Design
VIII Intel Cluster Tools
IX HPC and MPI: Open MPI Tuning
X OpenIB

Last updated: April 26, 2006

 

I

Title

 

Cancelled

Presenters

 

NA

Overview

 

NA

Outline

 

NA

Schedule


NA

 

 

 

II

Title

 

Advanced Cluster and Grid Management with Moab Cluster Suite

Presenter(s)

 

Dave Jackson

Overview

 

Organizations are demanding more out of their clusters.  Expectations for efficiency are rising. Management and reporting requirements are becoming more advanced. In every way, clusters are expected to be more professionally managed, more flexible, and better performing from day 1.

In this tutorial, we will give an overview of the high-level architecture of the Moab Workload Manager and discuss why it has found so much success among Top 500 cluster systems. We will discuss in detail many of the more common requirements and vexing problems of the modern cluster and how Moab facilities can help manage these issues effectively. Issues will include general optimization, end-user empowerment, delivering targeted levels of service, improving overall cluster availability and uptime, and supporting dynamic services and automated cluster re-provisioning.

We will also cover full system policy integration, which allows orchestration and management of not just compute resources but also license, storage, and network resources toward the objectives most important to the organization. This includes importing real-time information from other services into high-level scheduling systems and allowing the cluster management system to control and export information back into other local and grid services.

Finally, this tutorial will cover the use of Moab in creating powerful and flexible grids connecting multiple clusters into a more manageable entity. We will describe the creation of management and information grids, in which analysis, control, and reporting are centralized but jobs remain local within each cluster, as well as the more common job grid, in which job and data migration is automatically managed. Lastly, we will discuss creation and management of virtualized and on-demand clusters and how Moab allows sites to either host or utilize these dynamically configured and dynamically allocated resources.

Outline

 

    8:30 Introduction

    • Attributes of a Basic Cluster
    • Evolution of a Cluster Environment over Time
    • Sources of Complexity and Waste within a Matured Cluster Environment

    9:00 Moab Workload Manager Architecture

    • High Level Overview of Job, Node, and User Management Facilities
    • Generalized Resource Management Interfaces

    9:30 General Cluster Management Tasks

    • Optimizing Cluster Performance and Availability
    • Managing Politics and Providing Service Level Agreements
    • Handling Transient Needs with Reservations
    • Off-loading Staff through User Empowerment
    • Orchestrating Compute, Network and Storage Usage
    • Diagnosing Job, Node, and Policy Issues

    10:00 Break

    10:30 Advanced Cluster Management Tasks

    • Integrating Moab Management Facilities with Peer Services
    • Capacity Planning and Reporting with Historical Statistics
    • Customizing Management with Generic Metrics, Properties and Consumable Resources
    • Enabling Charging and Allocation Management Facilities
    • Automatically Responding to Arbitrary Events with Triggers

    11:15 Grid Management

    • Enabling a Grid in 60 Seconds
    • Controlling Information Access and Job and Data Flow Policies
    • Credential and Security Management
    • Enabling Information Services
    • Reducing Staff Overhead with Control/Management Grids
    • Improving Statistics with Information Grids

    11:40 On Demand - Utility Computing

    • Architectural Overview
    • Enabling Moab to Seamlessly Utilize an Internal/External Resource Hosting Center
    • Becoming an On Demand Hosting Center

    11:55 Questions and Answers

Schedule


Half-day morning.

 

 

 

III

Title

 

Towards Highly Available, Scalable, and Secure Computer Clusters with HA-OSCAR

Presenter(s)

 

Ibrahim Haddad, Box Leangsuksun, Stephen L. Scott

Overview

 

March 2004 was a major milestone for the HA-OSCAR Working Group: it marked the first public release of the HA-OSCAR software package. HA-OSCAR is an Open Source project that aims to combine high availability with high-performance computing. It enhances a Beowulf cluster system for mission-critical applications with high availability mechanisms such as component redundancy to eliminate single points of failure, self-healing, and failure detection and recovery, in addition to supporting automatic failover and fail-back.

The first release (version 1.0) supports new high availability capabilities for Linux Beowulf clusters based on the OSCAR 3.0 release from the Open Cluster Group. In this release of HA-OSCAR, we provide an installation wizard graphical user interface and a web-based administration tool, which allows intuitive creation and configuration of a multi-head Beowulf cluster. In addition, we have included a default set of monitoring services to ensure that critical services, hardware components, and important cluster resources are always available at the control node. HA-OSCAR also supports new tailored services that can be configured and added via a WebMin-based HA-OSCAR administration tool.

This tutorial will address in detail the design and implementation issues related to building HA Linux Beowulf clusters using Linux and Open Source Software as the base technology. The focus of the tutorial is HA-OSCAR: we will present its architecture, review the new features of the latest release, discuss how we implemented the HA and security features, and discuss our experiments covering modeling and testing of performance and availability on real systems.
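As a rough illustration of why head-node redundancy matters for availability, the sketch below applies the standard steady-state formula (availability = MTTF / (MTTF + MTTR)) to a single head node and to a redundant pair. The MTTF and MTTR figures are made-up assumptions for illustration, not HA-OSCAR measurements, and the model ignores failover time and correlated failures.

```python
# Illustrative availability arithmetic. All numeric inputs are hypothetical
# assumptions, not results from HA-OSCAR experiments.

HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent redundant components, where the
    service is up as long as at least one copy is up."""
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(avail: float) -> float:
    """Expected unavailable hours per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical head node: fails about every 1000 hours, 4 hours to repair.
    single_head = availability(mttf_hours=1000.0, mttr_hours=4.0)
    dual_head = redundant_availability(single_head, copies=2)
    for label, value in (("single head", single_head), ("dual head", dual_head)):
        print(f"{label}: availability={value:.6f}, "
              f"downtime={downtime_hours_per_year(value):.2f} h/year")
```

Under these assumptions the second head node reduces expected downtime by roughly the unavailability factor of a single node; a real study would also account for detection and failover delays.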

Outline

 

Introduction

  • Introduction to Beowulf and HPC clusters
  • Introduction to HA clusters
  • Various levels of HA
  • Linux: the commodity component of the cluster stack
  • Software and hardware system architecture

Challenges in Designing and Prototyping HA/HPC Clusters

  • Booting the cluster
  • Storage
  • Building the disks
  • Installing application servers
  • Traffic distribution mechanisms
  • Load balancing mechanisms
  • Building redundancy at various levels in the cluster:
  • Ethernet redundancy
  • DHCP/TFTP/NTP/NFS servers’ redundancy
  • Data redundancy using software RAID
  • Automating network installations
  • Automatic network RAID setups
  • File systems for HA Linux clusters

OSCAR

  • Introduction
  • Cluster Computing Overview
  • OSCAR - "The Beginning" - Overview / Strategy
  • OSCAR Components (Functional areas)
    • Core, Admin/Config, HPC Services
    • Core Components: SIS, C3, Switcher, ODA, OPD

HA-OSCAR (40%)

  • HA-OSCAR overview
  • HA-OSCAR architecture and components
  • HA-OSCAR comparison with Beowulf architecture
  • HA features
  • Multi-head builder and Self-configuration
  • Monitoring
    • Service monitoring
    • Hardware monitoring
    • Resource monitoring
  • Self-healing and recovery mechanism
  • Test environment
  • Installation Steps
  • Experiments
  • Availability modeling, analysis and uptime improvement study between Beowulf and HA-OSCAR
  • Test results
  • Applications and feasibility studies
    • Grid-enabled HA cluster
    • HA-OSCAR and Distributed Security Infrastructure integration
    • HA-OSCAR and OpenMosix/LVS feasibility study
    • Transparent Job Queue fault tolerance based on TORQUE

Demonstration

With 4 laptops running the latest research release of HA-OSCAR

Conclusion

    • HA-OSCAR Roadmap
    • Advanced research
    • Questions and answers

Schedule


Half-day morning.

 

 

 

IV

Title

 

Cancelled

Presenter(s)

 

N/A

Overview

 

N/A

Outline

 

N/A

Schedule


N/A

 

 

 

V

Title

 

Object-Based Cluster Storage

Presenter(s)

 

David Nagle and Brent Welch

Overview

 

The last few years have seen significant advances in cluster-based storage, with new systems embracing object-based storage to provide the scalability, performance and fault-tolerance necessary to meet the demands of cluster applications. Products adopting the object model include Panasas, Lustre, and Centera. This tutorial will present the fundamentals of object-based storage, including the underlying architectural principles and how various products have adapted those principles into their product designs.

The tutorial will begin with an overview of the object-based storage device (OSD) interface as defined by the ANSI/T10 standard. Topics will include the object-model, the OSD command set, and OSD security. We will then describe the decoupled data/metadata storage architecture commonly found in cluster storage systems and how the OSD interface, security model, networking and RAID play critical roles in the performance and fault-tolerance of these systems. Finally, we will perform an in-depth comparison of the various object-based storage systems available today.

Outline

 

Available soon.

Schedule


Half-day afternoon.

 

 

 

VI

Title

 

Resource Management Using SLURM

Presenter(s)

 

Morris Jette

Overview

 

SLURM has become a very popular resource manager. Some of its important characteristics include high scalability, portability, security and fault-tolerance. It is also open source and available under the GNU General Public License.

A complete build, installation and configuration of SLURM will be performed. Attendees with a Linux laptop will be able to do their own installation using the supplied CD. SLURM can emulate a sizable Linux cluster, including a BlueGene system, entirely within a laptop computer. This will allow the presenter to demonstrate realistic configuration and usage issues.

More information about SLURM is available at http://www.llnl.gov/linux/slurm.

Outline

 
  • The role of a resource manager
  • Design issues for resource management on large-scale clusters
  • SLURM architecture
  • SLURM commands and their use
  • SLURM configuration
  • Demonstration of SLURM build, installation, configuration and use

Schedule


Half-day afternoon.

 

 

 

VII

Title

 

Machine Room Design

Presenters

 

Timothy Thomas

Overview

 

This tutorial will immerse the attendee in the world of putting together new, modern data centers, and will give some pointers on refurbishing old ones to deal with the new world order. Topics will include the following: background ("theory") of power (demand and supply) and heat transport; advanced design principles; example designs; overview of new component product classes; practical direction in cooling techniques; room layout, networking, power distribution, and cabling; logic of interconnected systems (electrical, fire detection, fire and security alarms, fire suppression); insurance issues; working with facilities planning and physical plant departments; budgeting; the nature and meaning of blueprints; in-the-trenches experience from one or more recent projects.

Outline

 

Available soon.

Schedule


Half-day afternoon.

 

 

 

VIII

Title

 

Intel Cluster Tools

Presenter(s)

 

Werner Krotz-Vogel

Overview

 

This tutorial is about optimizing performance with the Intel Cluster Toolkit on Linux Clusters. Attendees will learn how to use Intel Trace Analyzer & Collector for performance optimization of MPI applications. After an in-depth introduction, attendees will have to solve several tasks, which will guide them through the main features and functionalities of the tool set.

Outline

 

Available soon.

Schedule


Half-day afternoon.

 

 

 

X

Title

 

OpenIB

Presenters

 

Stephen Poole

Overview

 

Available soon.

Outline

 

Available soon.

Schedule


Half-day morning.

 

 

 

IX

Title

 

HPC and MPI: Open MPI Tuning

Presenters

 

Graham E. Fagg

Overview

 

This tutorial aims to familiarize the audience with some of the advanced extensions of Open MPI. All examples and demos will be given using the latest stable release of the Open MPI library, a full-fledged MPI-2 library based on the cumulative work of the LAM/MPI, LA-MPI, and FT-MPI implementations.

We will be focusing on two distinctive topics:

  • Maximizing performance in heterogeneous and multi-network clusters by using the dynamic component architecture of the Open MPI implementation of the MPI standard.
  • Increasing the overall performance of collective communications based on multiple choices of run-time parameters. An overview of the algorithmic choices of different collective communications will be presented, as well as their advantages and/or drawbacks. A quick introduction to communication modeling will be provided in order to allow those unfamiliar with the concept to gain deeper understanding of the challenges we’re presenting.
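To give a flavor of what communication modeling looks like, the sketch below compares three textbook broadcast algorithms under the simple latency/bandwidth (Hockney) model, where sending m bytes costs alpha + m * beta. The function names and the alpha and beta values are assumptions chosen for illustration; this is not Open MPI's actual selection logic, only a minimal model of why the best algorithm depends on message size and process count.

```python
# Textbook cost models for broadcast under the Hockney model.
# alpha = per-message latency (s), beta = per-byte transfer time (s/byte).
# All names and parameter values here are illustrative assumptions.


def p2p_time(msg_bytes: float, alpha: float, beta: float) -> float:
    """Time to send one message of msg_bytes between two processes."""
    return alpha + msg_bytes * beta


def tree_rounds(procs: int) -> int:
    """ceil(log2(procs)) computed with integer arithmetic; 0 when procs == 1."""
    return (procs - 1).bit_length()


def linear_bcast(procs: int, msg_bytes: float, alpha: float, beta: float) -> float:
    """Root sends the whole message to each of the other procs - 1 ranks in turn."""
    return (procs - 1) * p2p_time(msg_bytes, alpha, beta)


def binomial_bcast(procs: int, msg_bytes: float, alpha: float, beta: float) -> float:
    """Ranks holding the data double every round, so ceil(log2 p) full sends."""
    return tree_rounds(procs) * p2p_time(msg_bytes, alpha, beta)


def scatter_allgather_bcast(procs: int, msg_bytes: float,
                            alpha: float, beta: float) -> float:
    """Binomial scatter followed by ring allgather: more messages (latency),
    but each rank moves only about 2 * (p - 1) / p of the data."""
    latency = (tree_rounds(procs) + procs - 1) * alpha
    bandwidth = 2.0 * (procs - 1) / procs * msg_bytes * beta
    return latency + bandwidth


ALGORITHMS = {
    "linear": linear_bcast,
    "binomial": binomial_bcast,
    "scatter_allgather": scatter_allgather_bcast,
}


def best_bcast(procs: int, msg_bytes: float, alpha: float, beta: float) -> str:
    """Name of the algorithm with the lowest modeled time."""
    return min(ALGORITHMS, key=lambda n: ALGORITHMS[n](procs, msg_bytes, alpha, beta))


if __name__ == "__main__":
    alpha, beta = 1e-5, 1e-9  # hypothetical: 10 us latency, about 1 GB/s
    for size in (8, 1024, 10_000_000):
        print(size, "bytes ->", best_bcast(8, size, alpha, beta))
```

Under these assumed parameters the latency-bound small messages favor the binomial tree, while large messages favor the scatter/allgather approach; this is the kind of trade-off that run-time parameters let a user steer.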

Outline

 
  • MPI background
    • Generic nature of MPI interface
      • Point-to-point communications
      • Collective communications
  • Hands-on installation of Open MPI on tutorial hardware
  • The Open MPI component architecture
    • Open MPI capabilities to be explored
      • Dynamic loading of MCA components
      • Run-time component selection
      • Tunable run-time parameters
    • In Open MPI, how to:
      • Add an MCA module to an existing installation
      • Allow automatic discovery of modules
      • Use manual module selection
      • Set run-time tunable parameters
      • Select point-to-point module(s)
      • Select collective module(s)
    • Instructor demos; attendee hands-on demos
  • Heterogeneous Environments
    • MPI background
    • Heterogeneous aspects of Open MPI implementation
      • Multi-network (including striping across multiple interfaces)
      • Multiple end-points (host architectures)
      • Endian-ness
    • Instructor demos; attendee hands-on demos
  • Collectives
    • Relevant MPI background
      • Collectives specification
    • In the context of Open MPI, how to:
      • Configure the point-to-point communications
      • Configure collectives communications
    • Instructor demos; attendee hands-on demos


Schedule


Half-day afternoon.