
HPC Ecosystems OpenHPC 2.x System Administrator 101

Contents

How To - HPC Ecosystems smshost cheatsheet.docx

Adding Users to the cluster (OpenHPC 2.x)

Warewulf uses the `wwsh file *` commands to manage the files it imports and propagates to nodes. The currently imported files can be viewed with `wwsh file list`:

```
[root@smshost ~]# wwsh file list
dynamic_hosts   : rw-r--r-- 0 root  root      3800 /etc/hosts
group           : rw-r--r-- 1 root  root      1086 /etc/group
hosts           : rw-r--r-- 1 root  root      3800 /etc/hosts
ifcfg-ib0.ww    : rw-r--r-- 1 root  root       280 /etc/sysconfig/network-scripts/ifcfg-ib0
munge.key       : r-------- 1 munge munge     1024 /etc/munge/munge.key
network         : rw-r--r-- 1 root  root        16 /etc/sysconfig/network
passwd          : rw-r--r-- 1 root  root      2829 /etc/passwd
shadow          : rw-r----- 1 root  root      1556 /etc/shadow
```

Adding users is done on the smshost, and then propagated to compute nodes via Warewulf:

- Add users on the smshost using the traditional `sudo useradd` approach (see the worked example below).
- Sync the credential files into the Warewulf datastore:

  ```bash
  wwsh file resync passwd shadow group
  ```

- To force propagation immediately, run the following on the compute nodes:

  ```bash
  /warewulf/bin/wwgetfiles
  ```
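
A worked example of the full sequence, run on the smshost (the username and node list are placeholders; `pdsh` is assumed to be available):

```bash
# Create the account on the smshost as usual
sudo useradd -m testuser
sudo passwd testuser

# Re-import the updated credential files into the Warewulf datastore
sudo wwsh file resync passwd shadow group

# Optionally force the compute nodes to pull the updated files now,
# instead of waiting for the next periodic sync
sudo pdsh -w compute[00-03] /warewulf/bin/wwgetfiles
```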

Installing Software to compute nodes (OpenHPC 2.x)

Summary

While most of the provisioned image's configuration is conducted in a chroot filesystem, these chroots cannot be directly provisioned by Warewulf. Once we are satisfied with our chroot configuration, we must encapsulate and compress this filesystem into a Virtual Node File System (VNFS) image which Warewulf can provision. You can think of the chroot behaving as the source code, and the VNFS behaving as the compiled binary of that source.

```bash
[root@smshost ~]# wwvnfs --chroot $CHROOT
```

Software installation steps

1. Install software into the compute node root filesystem (chroot):

   ```bash
   dnf install fail2ban --installroot $CHROOT
   ```

2. Rebuild the VNFS:

   ```bash
   sudo wwvnfs --chroot $CHROOT
   ```

3. Reboot the compute nodes.
4. Verify that the scheduler is running on the rebooted nodes (see the check below).
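
A minimal post-reboot check, assuming Slurm as the scheduler and compute00 as an example node name (`pdsh` is assumed to be available on the smshost):

```bash
# Confirm the node has re-registered with the scheduler and returned to service
sinfo -N -l
scontrol show node compute00

# Confirm the newly installed package is present in the provisioned image
pdsh -w compute00 rpm -q fail2ban
```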

Install system software for compute nodes (OpenHPC 2.x)

The directory structure on the smshost that represents the root filesystem for the compute node (aka the chroot) has its default location defined in input.local, and will likely be: /opt/ohpc/admin/images/rocky8.6

```bash
export CHROOT=/opt/ohpc/admin/images/rocky8.6
sudo dnf -y --installroot $CHROOT install python37
```

The above command installs Python 3.7 into the root filesystem of the compute node image. As before, rebuild the VNFS with wwvnfs --chroot $CHROOT and reboot the nodes for the change to reach them. A quick way to confirm the installation inside the chroot is shown below.
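
Both checks are run on the smshost against the image (package name as used in the dnf command above):

```bash
# Query the image's RPM database for the package installed above
sudo rpm -q --root=$CHROOT python37

# Or list matching packages via dnf against the same installroot
sudo dnf --installroot $CHROOT list installed 'python3*'
```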

Install Software Apps for Users (OpenHPC 2.x)

Python 3 interpreter (installed under the compiler tree in /opt/ohpc/pub): download the source code, extract it, and change into the extracted directory. Then configure, build, and install:

```bash
# PYTHON_VERSION should be set beforehand, e.g. export PYTHON_VERSION=3.10.12
./configure --enable-optimizations --with-ensurepip=install --enable-shared \
    --prefix=/opt/ohpc/pub/compiler/python/${PYTHON_VERSION}

make -j$(nproc)
sudo make install
```

PATH warnings

```
WARNING: The scripts pip3 and pip3.10 are installed in '/opt/ohpc/pub/compiler/python/3.10.12/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
```

We solve this PATH warning with module files!

Update application module files (OpenHPC 2.x)

Copy a template from a directory on $MODULEPATH, for example /opt/ohpc/pub/modulefiles/, and adapt it for the new application:

```tcl
#%Module1.0#####################################################################
proc ModulesHelp { } {
    puts stderr " "
    puts stderr "This module loads Python3.10.12"
    puts stderr " "
    puts stderr "See the man pages for Python3 for detailed information"
    puts stderr "on available compiler options and command-line syntax."
    puts stderr " "
    puts stderr "\nVersion 3.10.12\n"
}

module-whatis "Name: Python3"
module-whatis "Version: 3.10.12"
module-whatis "Category: compiler, runtime support"
module-whatis "Description: Python3"
module-whatis "URL: https://www.python.org/downloads/release/python-31012/"

set             version         3.10.12

prepend-path    PATH            /opt/ohpc/pub/compiler/python/3.10.12/bin
prepend-path    MANPATH         /opt/ohpc/pub/compiler/python/3.10.12/share/man
prepend-path    INCLUDE         /opt/ohpc/pub/compiler/python/3.10.12/include
prepend-path    LD_LIBRARY_PATH /opt/ohpc/pub/compiler/python/3.10.12/lib

#family "compiler"
```
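
To make the module available, save the file under the modulefiles tree and load it. The directory and file name below (python3/3.10.12) are an assumed layout; adjust to your site's conventions:

```bash
sudo mkdir -p /opt/ohpc/pub/modulefiles/python3
# Save the modulefile above as /opt/ohpc/pub/modulefiles/python3/3.10.12

module avail
module load python3/3.10.12
python3 --version
pip3 --version   # the earlier PATH warning is now resolved
```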

IPMI / BMC (Remote Management)

To remotely control the hardware systems (e.g. reboot, power up, power down) we use IPMI to interface with the Baseboard Management Controller (BMC). The supplied systems will typically be configured with dual-role / shared interfaces, where a single Ethernet port serves the role of both production networking and baseboard management.

Although the BMC is initially configured to share the same network range as the production network, the two are separate interfaces: the BMC can be misconfigured while the production network (standard networking) still works, and likewise the BMC can be correctly configured and reachable while the compute hardware itself is faulty.

It is advisable to separate the BMC network from the production network. The HPC Ecosystems Project standard is to separate the networks, but delivered systems share the same network; it is left as an exercise (and an option) for site administrators to define a BMC network that fits their environment.

The nodes are configured with a default BMC username of chpc and password bmc123qwe (some nodes are not configured correctly; fixing these is left as an exercise for the site).

IP Address Conventions

The table below illustrates the standards adopted by HPC Ecosystems relating to production network, management network, and the delivered interim network (to be changed by site administrators).

|                       | Production network | BMC / management network (recommended) | BMC / management network (on delivery) |
|-----------------------|--------------------|----------------------------------------|----------------------------------------|
| Subnet                | 10.10.10.0/24      | 10.10.11.0/24                          | 10.10.10.0/24 (shared with production) |
| Node address pattern  | 10.10.10.1xx       | 10.10.11.1xx                           | 10.10.10.2xx                           |
| Example: compute00    | 10.10.10.100       | 10.10.11.100                           | 10.10.10.200                           |
| Example: compute12    | 10.10.10.112       | 10.10.11.112                           | 10.10.10.212                           |
| Example: compute44    | 10.10.10.144       | 10.10.11.144                           | 10.10.10.244                           |

The most common IPMI commands

Check the status of a node (to verify whether it is indeed powered on if it is otherwise unreachable):
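
For example, using the same credentials and an address from the on-delivery BMC range:

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power status
```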

Remotely power down a node

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power off
```

Remotely power up a node

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power on
```

Remotely reboot a node

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power reset
```

Scripts

A helper script, setbmc.sh, is available to enable faster manual configuration (see the HPC Ecosystems GitHub).
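
For reference, manual BMC network configuration with ipmitool looks roughly like the sketch below (the channel number 1 and the target address are assumptions; run locally on the node, in-band):

```bash
# Point the node's BMC at the recommended management range
sudo ipmitool lan set 1 ipsrc static
sudo ipmitool lan set 1 ipaddr 10.10.11.100
sudo ipmitool lan set 1 netmask 255.255.255.0

# Verify the new settings
sudo ipmitool lan print 1
```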

Learning Steps

Node Information & Cluster Configuration

Extract the node configuration information from Warewulf and store it in the input.local file for future provisioning.

Determine the correct ordering of the nodes.

- HINT: Warewulf supports a nodescan option to quickly add nodes; it is possible that the nodes are not added in the correct order.
- HINT: you can use BMC commands to make nodes flash (their identify LED) so you can physically identify them (see the example below).
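
A quick way to do this with IPMI (the duration in seconds and the target address are illustrative):

```bash
# Flash the chassis identify LED for 60 seconds on the chosen node
ipmitool -U chpc -P bmc123qwe -H 10.10.10.203 chassis identify 60
```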

Cheatsheet

IPMI Quick Commands

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.200 power status          # query power state
ipmitool -U chpc -P bmc123qwe -H 10.10.10.203 power on              # power a node up
ipmitool -U chpc -P bmc123qwe -H 10.10.10.213 power off             # power a node down
ipmitool -U chpc -P bmc123qwe -H 10.10.10.202 sdr list              # list all sensor readings
ipmitool -U chpc -P bmc123qwe -H 10.10.10.202 sdr type Temperature  # temperature sensors only
```