# HPC Ecosystems OpenHPC 2.x System Administrator 101
## Contents
- Adding Users to cluster (OpenHPC 2.x)
- Installing Software to compute nodes
- Summary
- Software installation steps
- Install system software for compute nodes
- Install Software Apps for Users
- PATH warnings
- Update application module files
- IPMI / BMC (remote management)
- Learning Steps
- Node Information & Cluster Configuration
- Cheatsheet
- How To - HPC Ecosystems smshost cheatsheet.docx
## Adding Users to cluster (OpenHPC 2.x)

Warewulf uses `wwsh file *` to control the files it imports and synchronises. The currently managed files can be viewed with:

```bash
wwsh file list
```

Adding users is done on the smshost and then propagated to the compute nodes via Warewulf:

- Add users using the traditional `sudo useradd` approach.
- Sync the credential files into Warewulf:

```bash
wwsh file resync passwd shadow group
```

- To force propagation, run the following on the compute nodes:

```bash
/warewulf/bin/wwgetfiles
```
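As a minimal end-to-end sketch (the username jdoe and node name compute00 are hypothetical, chosen only for illustration):

```bash
# On the smshost: create the user and set a password
sudo useradd -m jdoe
sudo passwd jdoe

# Push the updated credential files into the Warewulf datastore
sudo wwsh file resync passwd shadow group

# Force an immediate sync on a compute node instead of waiting for the next pull
ssh compute00 /warewulf/bin/wwgetfiles
```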
## Installing Software to compute nodes (OpenHPC 2.x)

### Summary
While most of the provisioned image's configuration is conducted in a chroot filesystem, these chroots cannot be directly provisioned by Warewulf. Once we are satisfied with our chroot configuration, we must encapsulate and compress this filesystem into a Virtual Node File System (VNFS) image which Warewulf can provision. You can think of the chroot behaving as the source code, and the VNFS behaving as the compiled binary of that source.
```bash
[root@smshost ~]# wwvnfs --chroot $CHROOT
```
### Software installation steps
- Install software into the compute node root filesystem (chroot):

```bash
sudo dnf install fail2ban --installroot $CHROOT
```

- Rebuild the VNFS:

```bash
sudo wwvnfs --chroot $CHROOT
```

- Reboot the compute nodes.
- Verify that the scheduler is running (a worked sketch of the full cycle follows below).
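Putting the steps together, a hedged sketch of the full cycle (assuming $CHROOT is set as in the next section; the BMC address is illustrative, and sinfo assumes your site follows the OpenHPC Slurm recipe):

```bash
# 1. Install the package into the compute node image (chroot)
sudo dnf -y install fail2ban --installroot $CHROOT

# 2. Rebuild and compress the VNFS image that Warewulf provisions
sudo wwvnfs --chroot $CHROOT

# 3. Reboot a compute node so it re-provisions with the new image
#    (BMC address is illustrative; see the IPMI section below)
ipmitool -U chpc -P bmc123qwe -H 10.10.10.200 power reset

# 4. Verify the scheduler sees the node again (Slurm example)
sinfo
```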
### Install system software for compute nodes (OpenHPC 2.x)

The directory structure on the smshost that represents the root filesystem of the compute node (aka the chroot) has its default location defined in input.local, and will likely be /opt/ohpc/admin/images/rocky8.6:

```bash
export CHROOT=/opt/ohpc/admin/images/rocky8.6
sudo dnf -y --installroot $CHROOT install python37
```

The above command installs Python 3.7 into the root filesystem of the compute node image.
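As an optional check (a sketch; the package glob is only illustrative), you can confirm the package landed inside the image rather than on the smshost itself:

```bash
# Query the compute node image's package database, not the smshost's own
sudo dnf --installroot $CHROOT list installed "python3*"
```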
### Install Software Apps for Users (OpenHPC 2.x)

Python 3 (built from source): download the source code, extract it, and change into the extracted directory. Then configure, build, and install:

```bash
./configure --enable-optimizations --with-ensurepip=install --enable-shared \
    --prefix=/opt/ohpc/pub/compiler/python/${PYTHON_VERSION}
make -j$(nproc)
sudo make install
```
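For completeness, a hedged sketch of the "download, extract, go into folder" step above (the version number is only an example; set PYTHON_VERSION to the release you actually need):

```bash
# Example only: choose the release you need from python.org
export PYTHON_VERSION=3.10.13

# Download, unpack, and enter the source tree
wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz
tar xzf Python-${PYTHON_VERSION}.tgz
cd Python-${PYTHON_VERSION}
```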
### PATH warnings

Because Python was installed to a non-default prefix (/opt/ohpc/pub/compiler/python/${PYTHON_VERSION}), its binaries are not on users' default PATH and the install step typically warns about this. We solve these PATH warnings with module files!
### Update application module files (OpenHPC 2.x)

Copy a template module file from $MODULEPATH (typically /opt/ohpc/pub/modulefiles/) and adapt it for the newly installed application.
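As a minimal sketch (the module directory name and version are assumptions that follow the Python install prefix above, not a prescribed layout), a module file for the source-built Python could be created like this:

```bash
# Create a module directory for the application (path/version are examples)
sudo mkdir -p /opt/ohpc/pub/modulefiles/python

# Write a minimal TCL module file exposing the install prefix
sudo tee /opt/ohpc/pub/modulefiles/python/${PYTHON_VERSION} > /dev/null <<EOF
#%Module1.0
module-whatis "Python ${PYTHON_VERSION} built from source"
set     prefix  /opt/ohpc/pub/compiler/python/${PYTHON_VERSION}
prepend-path    PATH            \$prefix/bin
prepend-path    LD_LIBRARY_PATH \$prefix/lib
prepend-path    MANPATH         \$prefix/share/man
EOF
```

Users can then load the application with `module load python/<version>`, since /opt/ohpc/pub/modulefiles is already on MODULEPATH as noted above.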
## IPMI / BMC (Remote Management)
To remotely control the hardware systems (e.g. reboot, power up, power down) we use IPMI to interface with the Baseboard Management Controller (BMC). The supplied systems will typically be configured with dual-role / shared interfaces, where a single ethernet port serves the role of both production and baseboard management.
While the BMC is initially configured to share the same network range as the production network, they are two separate interfaces: the BMC can be misconfigured while the production network (standard networking) still works, and likewise the BMC can be correctly configured and reachable while the compute hardware itself is faulty.
It is advisable to separate the BMC network from the production network. The HPC Ecosystems Project standard is to separate the networks, but delivered systems share the same network; it is left as an exercise (and option) for site administrators to define their BMC network as it fits their environment.
The nodes are configured with a default username of chpc and password bmc123qwe (some nodes are not configured correctly and are left as an exercise for the site).
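If a site chooses to move the BMCs onto a dedicated management range, the LAN settings can be adjusted with ipmitool. The sketch below assumes it is run in-band on the node itself, that the BMC uses LAN channel 1, and uses illustrative addresses; the setbmc.sh script mentioned under Scripts below exists to make this kind of configuration faster.

```bash
# Check the current BMC LAN configuration (channel 1 is an assumption)
sudo ipmitool lan print 1

# Example: move this node's BMC onto the recommended management range
sudo ipmitool lan set 1 ipsrc static
sudo ipmitool lan set 1 ipaddr 10.10.11.100
sudo ipmitool lan set 1 netmask 255.255.255.0
sudo ipmitool lan set 1 defgw ipaddr 10.10.11.1
```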
IP Address Conventions.
The table below illustrates the standards adopted by HPC Ecosystems relating to production network, management network, and the delivered interim network (to be changed by site administrators).
| Node | Production Network (10.10.10.0/24) | BMC / management Network, recommended (10.10.11.0/24) | BMC / management Network, on delivery (shares 10.10.10.0/24) |
|---|---|---|---|
| computeXX (general pattern) | 10.10.10.1xx | 10.10.11.1xx | 10.10.10.2xx |
| Example: compute00 | 10.10.10.100 | 10.10.11.100 | 10.10.10.200 |
| Example: compute12 | 10.10.10.112 | 10.10.11.112 | 10.10.10.212 |
| Example: compute44 | 10.10.10.144 | 10.10.11.144 | 10.10.10.244 |
The most common IPMI commands:

Check the status of a node (to verify whether it is indeed powered on if it is otherwise unreachable):

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power status
```

Remotely power down a node:

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power off
```

Remotely power up a node:

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power on
```

Remotely reboot a node:

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.208 power reset
```
Scripts

A script, setbmc.sh, is available to enable faster manual BMC configuration (see the HPC Ecosystems GitHub).
## Learning Steps
### Node Information & Cluster Configuration

Extract the node configuration information from Warewulf and store it in the input.local file for future provisioning.

Determine the correct ordering of the nodes.

- HINT: Warewulf supports a nodescan option to quickly add nodes, so it is possible that the nodes were not added in the correct order.
- HINT: you can use BMC commands to make a node's identify light flash so you can locate it physically (see the sketch below).
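A hedged sketch of both hints (addresses and durations are illustrative): wwsh node list shows the nodes and hardware addresses as Warewulf currently knows them, and the BMC chassis identify command flashes a node's identify LED so it can be found in the rack.

```bash
# List the nodes Warewulf knows about, with their HW (MAC) addresses and IPs
sudo wwsh node list

# Flash the identify LED on a node's BMC for 30 seconds to locate it physically
ipmitool -U chpc -P bmc123qwe -H 10.10.10.200 chassis identify 30
```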
## Cheatsheet

### IPMI Quick Commands

```bash
ipmitool -U chpc -P bmc123qwe -H 10.10.10.200 power status
ipmitool -U chpc -P bmc123qwe -H 10.10.10.203 power on
ipmitool -U chpc -P bmc123qwe -H 10.10.10.213 power off
ipmitool -U chpc -P bmc123qwe -H 10.10.10.202 sdr list
ipmitool -U chpc -P bmc123qwe -H 10.10.10.202 sdr type Temperature
```