HPC Ecosystems β€” OpenHPC 2.x System Administrator 101

A practical guide for managing users, software, and infrastructure in an OpenHPC 2.x cluster environment.


πŸ“‘ Contents

  • smshost Cheatsheet
  • Adding Users to Cluster (OpenHPC 2.x)
  • Installing Software to Compute Nodes (OpenHPC 2.x)
  • Install Software Apps for Users (OpenHPC 2.x)
  • Update Application Module Files (OpenHPC 2.x)
  • IPMI / BMC (Remote Management)
  • IP Address Conventions
  • The Most Common IPMI Commands
  • Scripts
  • Learning Steps
  • Cheatsheet


πŸ“„ smshost Cheatsheet

Quick reference guide for common administrative tasks on the smshost.

HPC Ecosystems Document

The full cheatsheet is maintained as an HPC Ecosystems document on SharePoint. SharePoint's security policy prevents embedding it here, so open the document directly in SharePoint.

πŸ‘€ Adding Users to Cluster (OpenHPC 2.x)

Warewulf manages system files through the wwsh file subcommands (wwsh file *). To view the currently managed files:

Command:
wwsh file list
[root@smshost ~]# wwsh file list
dynamic_hosts    : rw-r--r--  0 root   root    3800 /etc/hosts
group            : rw-r--r--  1 root   root    1086 /etc/group
hosts            : rw-r--r--  1 root   root    3800 /etc/hosts
ifcfg-ib0.ww     : rw-r--r--  1 root   root     280 /etc/sysconfig/network-scripts/ifcfg-ib0
munge.key        : r--------  1 munge  munge   1024 /etc/munge/munge.key
network          : rw-r--r--  1 root   root      16 /etc/sysconfig/network
passwd           : rw-r--r--  1 root   root    2829 /etc/passwd
shadow           : rw-r-----  1 root   root    1556 /etc/shadow

User accounts are created on the smshost and then propagated to compute nodes via Warewulf.

  • Add users using the standard sudo useradd command.
  • Sync account files across the cluster:
wwsh file resync passwd shadow group
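
For example, to create a new user on the smshost and push the updated account files out to the nodes (the username "alice" is purely illustrative):

sudo useradd -m alice                         # create the account on the smshost
sudo passwd alice                             # set an initial password
sudo wwsh file resync passwd shadow group     # re-import the changed files into Warewulf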

⚑ Force Propagation

Compute nodes pull updated Warewulf-managed files periodically on their own. To force an immediate update, run the Warewulf client-side sync script on the compute node(s):

/warewulf/bin/wwgetfiles
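
A minimal sketch of forcing the sync across several nodes at once, assuming pdsh is installed on the smshost and the nodes are named compute00-compute03 (adjust to your cluster):

sudo pdsh -w compute[00-03] /warewulf/bin/wwgetfiles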

πŸ’» Installing Software to Compute Nodes (OpenHPC 2.x)

Summary

Most of the provisioned image's configuration is performed in a chroot filesystem on the smshost. A chroot cannot be provisioned by Warewulf directly: once you are satisfied with the chroot configuration, it is encapsulated and compressed into a Virtual Node File System (VNFS) image, which Warewulf then provisions. Think of the chroot as the β€œsource code” and the VNFS as the β€œcompiled binary.”

[root@smshost ~]# wwvnfs --chroot $CHROOT

Software Installation Steps

  1. Install software into the compute node root filesystem (chroot):
    sudo dnf -y install fail2ban --installroot $CHROOT
  2. Rebuild the VNFS:
    sudo wwvnfs --chroot $CHROOT
  3. Reboot the compute nodes.
  4. Verify that the scheduler is running and the nodes have rejoined the cluster (a worked example of the full cycle follows).
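
A minimal end-to-end sketch, assuming fail2ban as the example package, the default chroot location from input.local, and Slurm as the scheduler (package, paths, node addresses, and scheduler are all site-specific):

export CHROOT=/opt/ohpc/admin/images/rocky8.6
sudo dnf -y install fail2ban --installroot $CHROOT            # 1. install into the chroot
sudo wwvnfs --chroot $CHROOT                                  # 2. rebuild the VNFS
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power reset     # 3. reboot a node (see the IPMI section below)
sinfo                                                         # 4. confirm the node rejoins its Slurm partition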

Install System Software for Compute Nodes

A directory structure on the smshost represents the root filesystem of the compute node image (the chroot). Its default location is defined in input.local and is typically /opt/ohpc/admin/images/rocky8.6:

export CHROOT=/opt/ohpc/admin/images/rocky8.6
sudo dnf -y --installroot $CHROOT install python37

The above command installs Python 3.7 directly into the root filesystem of the compute node image.
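
As with any change made inside the chroot, the package only reaches the compute nodes after the VNFS is rebuilt and the nodes are rebooted:

sudo wwvnfs --chroot $CHROOT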

πŸ› οΈ Install Software Apps for Users (OpenHPC 2.x)

Python 3 compiler installation: download the source code, extract it, and change into the source directory, then configure, build, and install:
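
The download and extraction steps might look like this (a sketch assuming Python 3.10.12, the version used by the module file below, and python.org's standard source tarball URL):

wget https://www.python.org/ftp/python/3.10.12/Python-3.10.12.tgz
tar xzf Python-3.10.12.tgz
cd Python-3.10.12
export PYTHON_VERSION=3.10.12   # referenced by --prefix in the configure line below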

./configure --enable-optimizations --with-ensurepip=install --enable-shared --prefix=/opt/ohpc/pub/compiler/python/${PYTHON_VERSION}
make -j$(nproc)
sudo make install

⚠️ PATH Warnings

WARNING: The scripts pip3 and pip3.10 are installed in '/opt/ohpc/pub/compiler/python/3.10.12/bin' which is not on PATH. Consider adding this directory to PATH or, to suppress this warning, use --no-warn-script-location.

We solve this PATH warning using module files.

πŸ“‚ Update Application Module Files (OpenHPC 2.x)

Copy a template from $MODULEPATH to /opt/ohpc/pub/modulefiles/ and edit as needed:

#%Module1.0#############################################################

proc ModulesHelp { } {
    puts stderr " "
    puts stderr "This module loads Python3.10.12"
    puts stderr " "
    puts stderr "See the man pages for Python3 for detailed information"
    puts stderr "on available compiler options and command-line syntax."
    puts stderr " "
    puts stderr "\nVersion 3.10.12\n"
}

module-whatis "Name: Python3"
module-whatis "Version: 3.10.12"
module-whatis "Category: compiler, runtime support"
module-whatis "Description: Python3"
module-whatis "URL: https://www.python.org/downloads/release/python-31012/"

set     version    3.10.12

prepend-path    PATH               /opt/ohpc/pub/compiler/python/3.10.12/bin
prepend-path    MANPATH            /opt/ohpc/pub/compiler/python/3.10.12/share/man
prepend-path    INCLUDE            /opt/ohpc/pub/compiler/python/3.10.12/include
prepend-path    LD_LIBRARY_PATH    /opt/ohpc/pub/compiler/python/3.10.12/lib

#family "compiler"
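
Once the module file is in place (saved, for example, as /opt/ohpc/pub/modulefiles/python/3.10.12; the file name and subdirectory are your choice), users can pick it up in the usual way:

module avail
module load python/3.10.12
python3 --version
pip3 --version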

πŸ”§ IPMI / BMC (Remote Management)

The IPMI / BMC network allows remote control of hardware (reboot, power up/down). Recommended practice is to separate the BMC management network from the production network. Default node credentials: Username: chpc, Password: bmc123qwe.

🌐 IP Address Conventions

Standard IP assignments for production and management networks:

Network                           Subnet           Node range      compute00       compute12       compute44
Production Network                10.10.10.0/24    10.10.10.1xx    10.10.10.100    10.10.10.112    10.10.10.144
BMC / Management (recommended)    10.10.11.0/24    10.10.11.1xx    10.10.11.100    10.10.11.112    10.10.11.144
BMC / Management (on delivery)    10.10.10.0/24    10.10.10.2xx    10.10.10.200    10.10.10.212    10.10.10.244

⚑ The Most Common IPMI Commands

Check the status of a node (verify if it is powered on or unreachable):

ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power status

Remotely power down a node:

ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power off

Remotely power up a node:

ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power on

Remotely reboot a node:

ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power reset
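
To sweep every node at once, a short loop helps. This is a sketch assuming 16 nodes reachable on the as-delivered BMC addresses (computeNN at 10.10.10.2NN); adjust the range and credentials to your site:

for n in $(seq -w 0 15); do
    ipmitool -U chpc -P bmc123qwe -H 10.10.10.2${n} power status
done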

πŸ“œ Scripts

There is a resource script setbmc.sh to enable faster manual configuration (see HPC Ecosystems GitHub).
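
The script itself lives in the HPC Ecosystems GitHub. Purely as an illustration (not the actual contents of setbmc.sh), moving a node's BMC from its as-delivered address onto the recommended management network generally involves in-band ipmitool lan set calls run on the node itself, for example:

ipmitool lan set 1 ipsrc static           # channel 1 is typical, but varies by BMC
ipmitool lan set 1 ipaddr 10.10.11.112    # recommended BMC address for compute12
ipmitool lan set 1 netmask 255.255.255.0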

πŸŽ“ Learning Steps

Extract node configuration from Warewulf and store in input.local for future provisioning. Determine the correct ordering of nodes:

  • Use wwnodescan to quickly add nodes; check the ordering carefully.
  • Use BMC commands to flash (blink the identify LED of) a node to visually identify it, as sketched below.
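
A minimal sketch of blinking a node's identify LED for 60 seconds with ipmitool's chassis identify subcommand, assuming the as-delivered BMC addressing and default credentials:

ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 chassis identify 60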

πŸ“‹ Cheatsheet

⚑ IPMI Quick Commands

ipmitool -U chpc -P bmc123qwe -H 10.10.10.200 power status          # power state of compute00 (as-delivered BMC address)
ipmitool -U chpc -P bmc123qwe -H 10.10.10.203 power on              # power up compute03
ipmitool -U chpc -P bmc123qwe -H 10.10.10.213 power off             # power down compute13
ipmitool -U chpc -P bmc123qwe -H 10.10.10.202 sdr list              # list all sensor readings on compute02
ipmitool -U chpc -P bmc123qwe -H 10.10.10.202 sdr type Temperature  # temperature sensors only