|Title||Libertie, Egalitie, Fraternatie: Viva Linux Revolution!|
|Abstract||Commodity processors and storage, very high-performance WAN and LAN interconnects, Linux and open source HPC and Grid software are transforming the way we develop and deploy high-performance computing resources. Concurrently, a new model of nationally integrated, tightly coupled scientific data archives and instruments is emerging. This talk outlines a new world view of distributed data, computing, and network resources for 21st century science and engineering.|
|Title||Clusters: What's all the fuss about anyway?|
|Author Inst||Intel Corporation|
|Abstract||Supercomputing people started building clusters 15 to 20 years ago. So why is it that, after 15 years as the supercomputer of choice for economically disadvantaged academics, they've shot up in importance to become the dominant architecture for High Performance Computing? In this talk, we'll explore this transition and put cluster computing technology into its larger context in the universe of supercomputing. This context will give us a platform to jump off from as we look at projections for HPC clusters for the next few years. Finally, I will "rain on my own parade" and talk about challenges we face that could stop the whole cluster revolution dead in its tracks.|
Applications Track Abstracts:
|Applications Session I: Performance Issues - Looking Under the Hood|
|Title||Using PAPI for Hardware Performance Monitoring on Linux Clusters|
|Author||Jack Dongarra, Kevin London, Shirley Moore, and Phil Mucci|
|Author Inst||University of Tennessee-Knoxville|
|Abstract||PAPI is a specification of a cross-platform
interface to hardware performance counters on modern microprocessors.
These counters exist as a small set of registers that count
events, which are occurrences of specific signals related to
a processor's function. Monitoring these events has a variety
of uses in application performance analysis and tuning. The
PAPI specification consists of a standard set of events
deemed most relevant for application performance tuning, as
well as high-level and low-level sets of routines for
accessing the counters. The high level interface simply provides
the ability to start, stop, and read sets of events, and is
intended for the acquisition of simple but accurate measurement
by application engineers. The fully programmable low-level
interface provides sophisticated options for controlling the
counters, such as setting thresholds for interrupt on overflow,
as well as access to all native counting modes and events,
and is intended for third-party tool writers or users with
more sophisticated needs.
PAPI has been implemented on a number of platforms, including Linux/x86 and Linux/IA64. The Linux/x86 implementation requires a kernel patch which provides a driver for the hardware counters. The driver memory maps the counter registers into user space and allows virtualizing the counters on a per-process or per-thread basis. The kernel patch is being proposed for inclusion in the main Linux tree. The PAPI library provides access on Linux platforms not only to the standard set of events mentioned above but also to all the Linux/x86 and Linux/IA64 native events.
PAPI has been installed and is in use, either directly or through incorporation into third-party end-user performance analysis tools, on a number of Linux clusters, including the New Mexico LosLobos cluster and Linux clusters at NCSA and the University of Tennessee being used for the CGrADS project.
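The event counts PAPI returns become useful once combined into derived metrics. A minimal sketch of the kind of arithmetic a performance analysis tool performs: the event names are PAPI presets, but the counter values are invented for illustration.

```python
# Hypothetical raw readings from PAPI preset events; the values
# below are made up for illustration, not real measurements.
counters = {
    "PAPI_TOT_INS": 48_000_000,   # instructions completed
    "PAPI_TOT_CYC": 60_000_000,   # total cycles
    "PAPI_L1_DCM": 1_200_000,     # level-1 data cache misses
    "PAPI_LST_INS": 16_000_000,   # load/store instructions
}

# Instructions per cycle: a first-order indicator of how well the
# code keeps the processor's functional units busy.
ipc = counters["PAPI_TOT_INS"] / counters["PAPI_TOT_CYC"]

# L1 data cache miss rate relative to memory references: a common
# signal that a loop needs blocking or better data layout.
l1_miss_rate = counters["PAPI_L1_DCM"] / counters["PAPI_LST_INS"]

print(f"IPC = {ipc:.2f}")
print(f"L1 miss rate = {l1_miss_rate:.1%}")
```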
|Title||The Web100 Project: The Global Impact of Eliminating the World Wide Wait|
|Abstract||The goal of the web100 project (http://www.web100.org)
is to drive the wide deployment of advanced TCP implementations
that can support high performance applications without intervention
from network wizards. This is accomplished through extensive
internal instrumentation of the stack itself, plus TCP autotuning,
etc. This will permit simple tools to "autotune" TCP as well
as diagnose other components of the overall system: inefficient
applications and sick Internet paths. Web100 will bring better
observability to these last two classes of problems, leading
to long term overall improvements in both applications and
the underlying infrastructure. These changes will ultimately
bring high performance networking to the scientific masses.
My talk will describe the web100 project in detail, followed by an outline of a number of unresolved research, engineering and deployment problems that are likely to become critical as better TCP implementations are deployed.
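The buffer arithmetic behind TCP autotuning is the bandwidth-delay product: a connection can only keep a path full if its socket buffer holds a full round trip's worth of data. A small sketch with assumed path parameters (the 622 Mbit/s and 70 ms figures are illustrative, not from the talk):

```python
def tcp_buffer_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> int:
    """Bandwidth-delay product: the minimum socket buffer needed to
    keep a TCP path full -- what autotuning aims to provide without
    a network wizard setting it by hand."""
    return int(bandwidth_bits_per_s * rtt_s / 8)

# An assumed transcontinental research path: OC-12 (622 Mbit/s), 70 ms RTT.
needed = tcp_buffer_bytes(622e6, 0.070)
print(f"buffer needed: {needed / 2**20:.1f} MiB")

# A classic 64 KiB default window instead caps throughput at window/RTT:
default_window = 64 * 1024
capped = default_window * 8 / 0.070
print(f"throughput with 64 KiB window: {capped / 1e6:.1f} Mbit/s")
```

The gap between those two numbers is the "World Wide Wait" the project targets.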
|Applications Session II: Libraries for Application Performance|
|Title||Performance of the MP_Lite Message-Passing Library on Clusters|
|Author Inst||Ames Laboratory|
|Abstract||MP_Lite is a light-weight message-passing library designed to deliver the maximum performance to applications in a portable and user-friendly manner. It supports a subset of the most common MPI commands. MP_Lite can boost the communication rate by 30% over Fast Ethernet and by up to 60% over Gigabit Ethernet compared to other MPI implementations, all while maintaining a much lower latency. This is accomplished by streamlining the flow of the data at all stages, minimizing the memory-to-memory copying that results from buffering messages. An OS-bypass module based on the M-VIA project is being developed that will provide even lower latencies and higher bandwidths. Channel bonding two Fast Ethernet cards per machine in a PC cluster using MP_Lite provides a nearly ideal doubling of the bandwidth while only adding a small amount to the overall cost of a cluster.|
|Title||Cluster Programming with Shared Memory on Disk|
|Author||Sarah E. Anderson, B. Kevin Edgar, David H. Porter & Paul R. Woodward|
|Author Inst||Laboratory for Computational Science & Engineering (LCSE), University of Minnesota|
|Presenter||Sarah E. Anderson|
|Abstract||We will describe an implementation of our PPM
(Piecewise-parabolic method) hydrodynamics solver adapted to
run efficiently and robustly on a cluster of Linux Itanium
systems. The target problem will be computed on a 64 CPU (32
node) cluster at NCSA, and will be the largest such problem
we have ever attempted on any system, a computation on a mesh
size of 2048 x 2048 x 2048 zones.
The advent of clusters with high performance interconnects offers the opportunity to compute problems in new ways. Most previous cluster implementations have left domain-decomposed independent work contexts in memory, sending only an updated "halo" of boundary information to neighboring nodes. On these new interconnect/CPU balanced clusters, our algorithm has enough separation between data production and re-use to completely overlap the read or write of two complete contexts with the update of a third. Any node may update any context, resulting in a coarse-grained load balancing capability. Leaving the contexts resident on disk gives us robustness, with essentially continuous checkpointing. In addition, we can stop and restart a computation on any number of nodes. This permits any subset of the cluster to run "out of core", solving problems larger than would fit in even the whole cluster's memory. Further, we implement a high-performance remote I/O capability, giving us the ability to fetch, or write back, a context resident on any remote node's local disk.
Our application is based on a small Fortran or C callable library, offering routines to simplify master/worker task queue management & communication. The library also implements asynchronous remote I/O, reading and writing named data objects residing anywhere on the cluster. To the application programmer, it offers the basic interlock-free read/write capabilities of a shared file system without giving up the performance provided by the high-speed interconnect.
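The disk-resident context scheme can be sketched as follows. The file layout, object names, and the stand-in update function are invented for illustration and are not the LCSE library's actual API; the point is that any worker may pick up any context, and every completed write-back is in effect a checkpoint.

```python
import os
import tempfile

def write_context(store, name, zones):
    """Write a named data object to the (local or remote) disk store."""
    with open(os.path.join(store, name), "w") as f:
        f.write(" ".join(map(str, zones)))

def read_context(store, name):
    """Fetch a named data object back from the disk store."""
    with open(os.path.join(store, name)) as f:
        return [float(x) for x in f.read().split()]

def update(zones):
    """Stand-in for the hydrodynamics update of one context."""
    return [z + 1.0 for z in zones]

store = tempfile.mkdtemp()                     # stands in for cluster disk
tasks = [f"context-{i}" for i in range(4)]     # master's task queue
for name in tasks:                             # initial state on disk
    write_context(store, name, [0.0] * 8)

for name in tasks:                             # any worker, any context:
    zones = read_context(store, name)          # fetch from disk,
    write_context(store, name, update(zones))  # write back = checkpoint

print(read_context(store, "context-2")[:3])
```

Because the authoritative state lives on disk rather than in any node's memory, the run can be stopped and restarted on a different number of nodes at any completed write.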
|Applications Session III: Experience with Discipline-Specific Cluster Construction|
|Title||Building a High Performance Linux Cluster for Large-Scale Geophysical Modeling|
|Author||Hans-Peter Bunge, Mark Dalton|
|Author Inst||Bunge - Princeton University (Dept. of Geosciences), Dalton - Cray Research|
|Abstract||Geophysical models of seismic sound wave propagation, solid state convection in the Earth's mantle, or magnetohydrodynamic simulations of the Earth's magnetic field generation promise great advances in our understanding of the complex processes that govern the Earth's interior. They also rank among some of the most demanding calculations computational physicists can perform today, exceeding the limitations of some of the largest high-performance computational systems by a factor of ten to one hundred. In order to progress, a significant reduction in computational cost is required. Off-the-shelf high-performance Linux clusters allow us to achieve this cost reduction by exploiting price advantages in mass-produced PC hardware. Here we review our experience of building a 140-processor Pentium-II fully switched 100BaseT Ethernet Linux cluster targeted specifically at large-scale geophysical modeling. The system was funded in part by the National Science Foundation and has been in operation at Princeton's Geosciences Department for more than two years. During this time we experienced nearly 100 percent system uptime and have been able to apply the system to a large range of geophysical modeling problems. We will present performance results and will explore some of the scientific results that have been accomplished on the system to date.|
|Title||Aeroacoustic and Turbulent Flow Simulations Using a Cluster of Workstations|
|Author||Anirudh Modi and Lyle N. Long|
|Author Inst||Institute for High Performance Computing Applications (IHPCA) Pennsylvania State University|
|Abstract||The COst effective COmputing Array (COCOA) and its successor, COCOA-2, are a Penn State Department of Aerospace Engineering initiative to bring low cost parallel computing to the departmental level. COCOA is a 50-processor Linux cluster of off-the-shelf PCs (dual PII-400s) with 12.5 GB memory connected via fast ethernet (100 Mbit/sec), whereas COCOA-2 is a 40-processor rackmounted cluster (dual PIII-800s) with 20 GB memory connected via two fast-ethernet adaptors (using Linux channel bonding). Each of the clusters has 100 GB of usable disk space. The PCs run RedHat Linux 7.0 with MPI (both LAM MPI and MPICH) and C/C++/FORTRAN for parallel programming, and DQS for queueing jobs. COCOA cost us $100,000 (1998 US dollars), whereas COCOA-2 cost us a meagre $60,000 (2001 US dollars). Owing to the simplicity of using such commonly available open-source software and commodity hardware, these clusters are managed effortlessly by the research students themselves. The choice of hardware suited to our applications and the problems encountered while building the clusters are also summarized. A wide range of large, complex aeroacoustic and turbulent Computational Fluid Dynamics (CFD) problems, such as unsteady separated flow over helicopters and complex ship geometries, are being run successfully on COCOA, making it an ideal platform for debugging and development of our research codes. Several problems using large amounts of memory (2-5 GB) have been found to run very well on these clusters, even compared to traditional supercomputers. MPI performance for these clusters has also been tested thoroughly to enable us to write parallel programs well suited to such high-latency clustering environments. Some comprehensive benchmarks have also been run on the clusters, showing their effectiveness at tackling current computational problems.
Several research members are using these clusters very successfully on a daily basis for developing, testing and running their parallel programs, something that required a privileged access to a National Supercomputing Center not so long ago.|
|Applications Session IV: Application Performance Analyses|
|Title||Benchmarking Production Codes on Beowulf Clusters: the Sissa Case Study|
|Author Inst||INFM, udr Sissa (TS ITALY)|
|Abstract||In the Condensed Matter group at Sissa (Trieste) we routinely use a number of parallel codes to perform scientific research in computational condensed matter physics. The codes are developed in house and implement modern computational techniques. In particular, we can currently classify three main classes of parallel applications: (i) first-principles or ab-initio codes, where the quantum description of the electronic properties of matter is taken into account; (ii) classical Molecular Dynamics (MD) codes; and (iii) Quantum Monte Carlo (QMC) techniques. Codes belonging to different classes also differ from the computational point of view. In this presentation we show the behavior of a representative suite of these three classes of codes on two kinds of Beowulf clusters we have at our disposal. The first machine is an Intel-based cluster using a Myrinet network, built here at Sissa. The second is an Alpha-based cluster connected through a QSNET network, hosted at the supercomputer center at Cineca (Italy). A careful performance analysis was carried out for all the applications in terms of the computational characteristics of the benchmarking codes. Our measurements on the two clusters reveal a rich and interesting scenario: for instance, some of the bottlenecks experienced by certain classes of codes on the Intel cluster are easily overcome on the Alpha cluster, while for other classes the Intel architecture is still to be preferred in terms of performance/price ratio. Performance results on Beowulf clusters are also compared with data obtained on other parallel machines available to us, including the Origin3000, T3E, and SP3. In light of all the results we collected, we are now able to draw some indications for future acquisitions of computational power.|
|Title||Performance of Tightly Coupled Linux Cluster Simulation using PETSc of Reaction and Transport Processes During Corrosion Pit Initiation|
|Author||Eric Webb, Jay Alameda, William Gropp, Joshua Gray, and Richard Alkire|
|Author Inst||UIUC ChemE, NCSA, ANL|
|Abstract||The most dangerous forms of corrosion occur
unexpectedly and often out of sight of ordinary inspection
methods. Therefore, remote monitoring is increasingly installed
on bridges, tunnels, waste repositories, pipelines and marine
structures, for example. Predicting material failure based
on field data, however, depends upon accurate understanding
of corrosion processes, which involve phenomena that span more
than ten orders of magnitude in time and distance scales. Consequently,
the situation represents an extreme challenge.
In this work, we consider a seminal moment: the initiation of a small corrosion pit that possesses the ability to grow into a larger pit, and then to form crevices and cracks that may cause catastrophic failure. Improved understanding of mechanisms of pit initiation is needed. Direct laboratory measurement of local events during pit initiation is surprisingly difficult. The complexity of the system creates a situation where experiments seldom can be compared directly with simulation results, not to mention with each other. Multiple scientific hypotheses therefore emerge. Simulations that accurately emulate experimental observations are essential in order to reduce the number of experiments needed to verify scientific understanding of the system, as well as to design corrosion monitoring technology. The role of sulfur-rich inclusions on pit initiation on a surface of stainless steel was simulated numerically as well as investigated experimentally. Mechanistic hypotheses suggested by the experimental data were simulated in numerical models in order to test their validity.
In this work, we describe our experience in solving the mass balance and species transport relationships for an electrically neutral solution, in the form of a system of coupled, non-linear partial differential and algebraic equations. This system was solved numerically with a finite difference method: second-order centered finite differences in two dimensions transform the partial differential equations into a system of non-linear algebraic equations, and second-order forward- and backward-difference approximations were used on the boundaries. A fully implicit scheme was implemented to step forward in time. To solve the resulting system of equations, we used the Portable, Extensible Toolkit for Scientific Computation (PETSc), which provided a robust parallel computing architecture, numerical software libraries and data structures, and sparse matrix storage techniques. In particular, we describe our experience in moving our PETSc-based code from the NCSA Origin Array to the emerging Linux cluster architecture, and the code and solver performance that resulted from our work on the Linux cluster.
Through this work, we will show that the dual strategies of taking advantage of the object-oriented nature of C++ to create chemistry and geometry classes, which simplified the task of testing hypotheses, and of employing PETSc as our solver allowed a straightforward transition to the Linux cluster architecture while maintaining good code performance, permitting simulation of complex electrochemical phenomena. This approach provides an excellent starting point for more detailed computational investigations of electrochemical phenomena in the future.
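A minimal sketch, not the authors' PETSc code, of the fully implicit time stepping described above, reduced to a single model rate equation dc/dt = -k c^2: each backward-Euler step requires solving a nonlinear equation for the new concentration, here with Newton's method (the role PETSc's nonlinear solvers play, in parallel and for the full coupled system). The rate constant and step size are invented.

```python
def implicit_step(c_old, dt, k, tol=1e-12):
    """One backward-Euler step of dc/dt = -k*c**2: solve
    F(c) = c - c_old + dt*k*c**2 = 0 by Newton's method."""
    c = c_old                              # initial Newton guess
    for _ in range(50):
        f = c - c_old + dt * k * c ** 2    # nonlinear residual
        if abs(f) < tol:
            break
        fprime = 1 + 2 * dt * k * c        # Jacobian dF/dc
        c -= f / fprime                    # Newton update
    return c

c = 1.0
for _ in range(10):                        # march ten implicit steps
    c = implicit_step(c, dt=0.1, k=2.0)
print(f"c after 10 steps: {c:.4f}")
```

The implicit scheme stays stable for step sizes where an explicit update of a stiff reaction term would not, which is why the fully implicit formulation is used in the paper.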
|Applications Session V: Applications on Clusters|
|Title||A Parallel Chemical Reactor Simulation Using Cactus|
|Author||Karen Camarda, Yuan He, and Ken Bishop|
|Author Inst||U. Kansas, Department of Chemical and Petroleum Engineering|
|Abstract||Efficient numerical simulation of chemical
reactors is an active field of research, both in industry and
academia. As members of the Alliance Chemical Engineering Application
Technologies (AT) Team, we have developed a reactor simulation
for the partial oxidation of ortho-xylene to phthalic anhydride,
an important industrial compound. This problem is of interest
to chemical engineers for both theoretical and practical reasons.
Theoretically, one is trying to understand the kinetics of
the reactions involved. Currently, none of the several kinetic
models that have been proposed is able to reproduce both the
temperature profile inside the reactor and the spectrum of
products observed in industry. Practically, an accurate simulation
of this reaction would allow industrial chemical engineers
to optimize plant operations, while ensuring a safe working environment.
We perform our simulation using a pseudo-homogeneous, non-isothermal, packed bed catalytic reactor model. The equations solved are a set of diffusive-advective partial differential equations, each with a generation term introduced by the chemical reaction model used. In general, these generation terms couple the equations in a nonlinear way, as is the case with the model we implement.
The discrete equations to be solved are derived by finite-differencing the partial differential equations to obtain the Crank-Nicolson representation, which is then linearized. Because solving the resulting system of equations is computationally expensive, an efficient parallel implementation is desired.
In order to achieve parallelism and portability with a minimum of development effort, we have ported our existing reactor model to the Cactus computational environment. To solve the linear systems that result, we have also ported a solver that uses a conjugate gradient method.
In this paper, we discuss the nature of the equations solved and the algorithm used to solve them. We then present a performance analysis of running this code on the Linux cluster at the National Center for Supercomputing Applications (NCSA).
|Title||Massively Parallel Visualization on Linux Clusters with Rocketeer Voyager|
|Author||Robert Fiedler & John Norris|
|Author Inst||UIUC, Center for Simulation of Advanced Rockets|
|Abstract||This paper describes Rocketeer Voyager, a 3-D
scientific visualization tool which processes in parallel a
series of HDF output dumps from a large scale simulation. Rocketeer
reads data defined on many types of grids and displays translucent
surfaces and isosurfaces, vectors as glyphs, surface and 3-D
meshes, etc. An interactive version is first used to view a
few snapshots, orient the image, and save a set of graphical
operations that will be applied by the batch tool to the entire
set of snapshots. The snapshots may reside on separate or shared
file systems. Voyager broadcasts the list of operations, reads
the data ranges from all snapshots to determine the color scales,
and then processes the snapshots concurrently to produce hundreds
of frames for animation in little more time than it would take
for the interactive tool to display a single image.
Rocketeer is based on the Visualization Toolkit, which uses OpenGL to exploit hardware graphics acceleration. On a Linux cluster, particularly a heterogeneous one, the nodes may have different graphics hardware or none at all. To circumvent defects in images rendered using Mesa, CSAR contracted with Xi Graphics, Inc. to develop an efficient and flawless X server that runs in a "virtual frame buffer" and installs on all nodes in the cluster with little administrator intervention. Even though the virtual frame buffer does not take advantage of any graphics hardware, individual images are rendered as quickly as they are on all but the fastest graphics workstations.
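The batch tool's two-pass structure (scan every snapshot's data range first, then render frames in parallel against one shared color scale) can be sketched as follows; the snapshot values are invented for illustration.

```python
# Invented stand-ins for the data ranges of three HDF dumps.
snapshots = [
    [0.2, 0.9, 1.4],      # dump 0001
    [0.1, 2.7, 1.1],      # dump 0002
    [0.5, 1.9, 3.2],      # dump 0003
]

# Pass 1: global data range across all snapshots.
lo = min(min(s) for s in snapshots)
hi = max(max(s) for s in snapshots)

def to_color_index(value, levels=256):
    """Map a data value onto a fixed palette shared by every frame,
    so colors stay comparable across the whole animation."""
    t = (value - lo) / (hi - lo)
    return min(levels - 1, int(t * levels))

# Pass 2: each node can now render its share of snapshots
# independently, using the same lo/hi scale.
print(lo, hi, to_color_index(1.9))
```

Without the first pass, each frame would be scaled to its own range and the animation's colors would flicker from frame to frame.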
Systems Administration Track Abstracts:
|Systems Session I: Tools for Building Clusters|
|Title||SCE : A Fully Integrated Software Tool for Beowulf Cluster System|
|Author||Putchong Uthayopas, Sugree Phatanapherom, Thara Angskun, Somsak Sriprayoonsakul|
|Author Inst||Parallel Research Group, CONSYL, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand|
|Abstract||One of the problems with the wide adoption of
clusters for mainstream high performance computing is the difficulty
of building and managing the system. There have been many efforts
to solve this problem by building fully automated, integrated
software distributions from several well-known open
source packages. The problem is that these packages
come from many sources and were never designed to work together
as a truly integrated system.
Drawing on the experience and tools developed building many clusters at our site, we decided to build an integrated software tool that is easy for the cluster user community to use. This tool set, called SCE (Scalable Computing Environment), consists of a cluster builder tool, a cluster system management tool (SCMS), scalable real-time monitoring, web-based monitoring software (KCAP), a parallel Unix command, and a batch scheduler. These tools run on top of our cluster middleware, which provides cluster-wide process control and many other services. MPICH is also included. All tools in SCE are designed to be truly integrated, since all of them except MPI and PVM were built by our group. SCE also provides more than 30 APIs for accessing system resource information, controlling remote process execution, ensemble management, and more. These APIs and the interaction among software components allow users to extend and enhance SCE in many ways. SCE is also designed to be very easy to use: most of the installation and configuration is automated through a complete GUI and Web interface. This paper will discuss the current SCE design, implementation, and experiences. SCE is expected to be available as a developer version in June.
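The "parallel Unix command" idea can be sketched as below. Real SCE dispatches through its own middleware, so the node list here is hypothetical, and each node is simulated with a local subprocess so the sketch runs anywhere.

```python
import subprocess

nodes = ["node01", "node02", "node03"]   # hypothetical host names

def run_everywhere(command):
    """Run one command 'on' every node and collect the output.
    A real tool would dispatch over rsh/ssh or cluster middleware;
    here each node is simulated with a local echo."""
    results = {}
    for node in nodes:
        out = subprocess.run(
            ["echo", f"{node}: {command}"], capture_output=True, text=True
        )
        results[node] = out.stdout.strip()
    return results

for line in run_everywhere("uptime").values():
    print(line)
```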
|Title||Node Abstraction Techniques for Linux Installation|
|Author||Sean Dague & Rich Ferri|
|Abstract||Every Linux distribution is different. This
is both a strength, as different distributions are better suited
to different environments, and a weakness, as someone familiar
with one distribution may be lost when presented with a different
one. The differences hurt even more when writing
an application for Linux, as a developer may make assumptions
that he/she believes are valid for Linux in general, but in
reality are only true for certain distributions.
Lui (Linux Utility for cluster Installation) is an open source project whose goal is to be a universal cluster installer for all distributions and architectures. It has existed in the community since April of 2000. In redesigning for the 2.0 release, we spent time researching where Linux distributions diverge in order to build an abstraction layer over the more contentious areas of difference. This presentation will look at the various ways one could abstract these differences, and the path that the Lui project has taken for the 2.0 release. We will investigate abstractions necessary both for different Linux distributions, as well as different system architectures.
This presentation will discuss the data abstractions one can use to support multiple distributions and architectures, and how we used these abstractions in the design of Lui 2.0.
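One way to picture such an abstraction layer is a table of per-distribution traits behind a single node interface. The entries below are illustrative assumptions, not Lui's actual data model.

```python
# Illustrative per-distribution traits an installer must abstract over.
DISTRO_TABLE = {
    "redhat":    {"pkg_format": "rpm", "init_style": "sysv",
                  "net_config": "/etc/sysconfig/network-scripts"},
    "debian":    {"pkg_format": "deb", "init_style": "sysv",
                  "net_config": "/etc/network/interfaces"},
    "slackware": {"pkg_format": "tgz", "init_style": "bsd",
                  "net_config": "/etc/rc.d/rc.inet1"},
}

class Node:
    """A node to be installed, described independently of its distro:
    differences are resolved once, through the trait table."""
    def __init__(self, name, distro, arch):
        self.name, self.arch = name, arch
        self.traits = DISTRO_TABLE[distro]

    def package_command(self):
        return {"rpm": "rpm -i", "deb": "dpkg -i", "tgz": "installpkg"}[
            self.traits["pkg_format"]]

n = Node("compute-7", "debian", "x86")
print(n.package_command(), n.traits["net_config"])
```

Installer logic written against `Node` then never needs to test which distribution it is handling.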
|Systems Session II: Production Clusters|
|Title||Setting Up and Running a Production Linux Cluster at Pacific Northwest National Laboratory|
|Author||Gary Skouson, Ryan Braby|
|Author Inst||Pacific Northwest National Laboratory|
|Abstract||With the low price and increasing performance of commodity computer hardware, it is interesting to study the viability of using clusters of relatively inexpensive computers to produce a stable system capable of meeting the current demands for a high performance computer. A 192-processor cluster was installed to test and develop methods that would make the PC cluster a workable alternative to other commercial systems. By comparing the PC cluster with the cluster systems sold commercially, it became apparent that the tools to manage the PC cluster as a single system were not as robust or as well integrated as in some commercial systems. This presentation will focus on the problems encountered and the solutions used to stabilize this cluster for both production and development use. These included the use of extra hardware, such as remote power control units and multi-port adapters, to provide remote access to both the system console and system power. A Giganet cLAN fabric was also used to provide a high speed, low latency interconnect. Software solutions were used for resource management, job scheduling and accounting, parallel filesystems, remote network installation, and system monitoring. Although there are still some tools missing for debugging hardware problems, the PC cluster continues to be very stable and useful for our users.|
|Title||High Throughput Linux Clustering at Fermilab|
|Author||Steven C. Timm, L. Giacchetti, D. Skow|
|Presenter||Steven C. Timm|
|Abstract||Computing in experimental High Energy Physics is a natural fit for coarse-grained parallel computing structures. This report will describe the development and management of computing farms at Fermilab. We will describe the hardware and configuration of our 1000-processor cluster, its interface to petabyte-scale network-attached storage, the batch software we have developed, and the cluster management and monitoring tools that we use.|
|Systems Session III: Accounting / Monitoring|
|Title||SNUPI: A Grid Accounting and Performance System Employing Portal Services and RDBMS Back-End|
|Author||Victor Hazlewood, Ray Bean, Ken Yoshimoto|
|Abstract||SNUPI, the System, Network, Usage and Performance Interface, provides a portal interface to system management, monitoring and accounting data for heterogeneous computer systems, including Linux clusters. SNUPI provides data collection tools, recommended RDBMS schema design, and Perl-DBI scripts suitable for portal services to deliver reports at the system, user, job and process level for heterogeneous systems across the enterprise, including Linux clusters. The proposed paper and presentation will describe the background of accounting systems for UNIX, detail process and batch accounting (JOBLOG) capabilities available for Linux, and describe the collection tools, the RDBMS schema and portal scripts that make up the Open Source SNUPI grid accounting and performance system as employed on the Linux cluster and other systems at NPACI/SDSC.|
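The RDBMS side of such a system can be sketched with an in-memory SQLite database; the schema and job records below are invented for illustration and differ from SNUPI's actual schema, but they show the shape of a portal-style per-user usage report.

```python
import sqlite3

# A toy job-accounting table; a production schema would carry many
# more fields (queue, exit status, project, timestamps, ...).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    user TEXT, host TEXT, cpus INTEGER, wall_seconds INTEGER)""")
db.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [("alice", "cluster1", 16, 3600),
     ("alice", "cluster1", 32, 1800),
     ("bob",   "cluster2", 8,  7200)],
)

# CPU-hours charged per user across all systems -- the kind of
# enterprise-wide report a portal script would render as a page.
report = db.execute("""
    SELECT user, SUM(cpus * wall_seconds) / 3600.0 AS cpu_hours
    FROM jobs GROUP BY user ORDER BY user""").fetchall()
for user, hours in report:
    print(f"{user}: {hours:.1f} CPU-hours")
```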
|Title||Cluster Monitoring at NCSA|
|Abstract||System monitoring and performance monitoring at the National Center for Supercomputing Applications have always been treated as separate arenas, due to the nature of their different functions. The function of a system monitor has been that of a virtual system operator, monitoring for events that depict a system fault and also automating the management of such events. The function of a performance monitor has been to collect data from different layers of the system -- hardware, kernel, service, and application layers -- for administrators and users who collect such data for diagnostic purposes in order to finely tune the performance of the different system layers. The collected performance data has never been associated with the system-monitoring framework. However, the legacy of this division of interest dissolves within the new arena of cluster monitoring. This paper explains why this is so, and how the National Center for Supercomputing Applications has merged the arenas of system monitoring and performance monitoring in a cluster environment.|
|Systems Session IV: Lessons Learned|
|Title||Linux Cluster Security|
|Abstract||Modern Linux clusters are under increasing security threats. This paper will discuss various aspects of cluster security, including: firewalls, ssh, kerberos, stateful and stateless packet filtering, honeypots, intrusion detection systems, penetration testing, and newer kernel security features. Example packet filtering rule sets for various configurations will be detailed along with the rationale behind them.|
|Title||Lessons Learned from Proprietary HPC Cluster Software|
|Author||James B. White III|
|Author Inst||Oak Ridge National Laboratory|
|Presenter||James B. White III|
|Abstract||Production high-performance-computing (HPC) centers have common needs for infrastructure software, but different centers can have very different strategies for meeting these needs. Even within a single center, systems from different vendors often present very different strategies. The open-cluster community has the opportunity to define de facto standards for cluster infrastructure software. To this end, we summarize the basic needs of production HPC centers, and we describe the strategies for meeting these needs used by the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory. The CCS has two large, proprietary clusters, an IBM RS/6000 SP and a Compaq AlphaServer SC. The SP and SC lines have dominated recent large HPC procurements, and they present the most direct competition to high-end open clusters. We describe our experience with the infrastructure software for these proprietary systems, in terms of successes the open-cluster community may want to emulate and mistakes they may want to avoid.|
|Systems Session V: Something Different|
|Title||Cheap Cycles from the Desktop to the Dedicated Cluster: Combining Opportunistic and Dedicated Scheduling with Condor|
|Author Inst||U. Wisconsin|
|Abstract||Science and industry are increasingly reliant
on large amounts of computational power. With more CPU cycles,
larger problems can be solved and more accurate results obtained.
However, large quantities of computational resources have not
always been affordable.
For over a decade, the Condor High Throughput Computing System has scavenged otherwise wasted CPU cycles from desktop workstations. These inconsistently available resources have proven to be a significant source of computational power, enabling scientists to solve ever more complex problems. Condor efficiently harnesses existing resources, reducing or eliminating the need to purchase expensive supercomputer time or equivalent hardware.
Dedicated clusters of commodity PC hardware running Linux are becoming widely used as computational resources. The cost to performance ratio for such clusters is unmatched by other platforms. It is now feasible for smaller groups to purchase and maintain their own clusters.
Software for controlling these clusters has, to date, used dedicated scheduling algorithms. These algorithms assume constant availability of resources to compute fixed schedules. Unfortunately, due to hardware or software failures, dedicated resources are not always available over the long-term. Moreover, these dedicated scheduling solutions are only applicable to certain classes of jobs, and can only manage dedicated clusters or large SMP machines.
Condor overcomes these limitations by combining aspects of dedicated and opportunistic scheduling in a single system. This paper will discuss Condor's research into dedicated scheduling, how elements of opportunism are used and integrated, and why this approach provides a better computational resource for the end-user. Our experience managing parallel MPI jobs alongside serial applications will be addressed. Finally, future work in the area of merging dedicated and opportunistic scheduling will be explored.
By using both desktop workstations and dedicated clusters, Condor harnesses all available computational power to enable the best possible science at a low cost.
|Title||An Innovative Inter-connect for Clustering Linux Systems|
|Author||John M. Levesque|
|Author Inst||Times N Systems|
|Presenter||John M. Levesque|
|Abstract||Times N Systems, a systems company in Austin,
Texas, has a unique interconnect strategy. Multiple Intel nodes
are connected via a large shared memory using fiber optics, and
the hardware runs either W2K or Linux. Individual nodes can
access shared memory by directly addressing data beyond their
local memory. This collection of hardware, nodes and interconnect
is referred to as a team.
The current system has a latency of 2.3 microseconds to shared memory, and the next SMN (Shared Memory Node) will see half this latency. Other enhancements to the current system include 1) increasing shared memory from 1 Gbyte to 16 Gbytes, 2) increasing the number of nodes from 16 to 128, and 3) increasing the 65 Mbytes/sec bandwidth to 2 Gbytes/sec. These improvements will be shipped by the end of this year.
The Scientific and Technical offering will include ISV-supplied Fortran and C compilers and debuggers, plus MPI and performance tools from TNS. MPI has been optimized to use the SMN for both message passing and synchronization. The Times N System is an excellent I/O performer as well. Disks can be allocated as SMD disks and striped across all processors in the team. Additionally, TNS has ported GFS, a Linux global file system, to the TNS team.
The talk will concentrate on the overall architecture of the Times N System, illustrating how popular scientific codes will benefit from the fastest processors connected with a low latency interconnect.