Dec 2024 Phoenix Maintenance Changelog
This document provides a detailed overview of the updates and improvements made during the recent maintenance period. These changes are aimed at enhancing the system's reliability, performance, and user experience. Below is a breakdown of each major improvement.
Notable Changes
Phoenix System Software and Security Upgrades
The cluster's operating system was upgraded from Rocky Linux 8.9 to 8.10. This upgrade includes improved security features, performance optimizations, and support for newer libraries and tools. These updates enhance system stability and compatibility with modern software requirements.
Phoenix Slurm Scheduler Upgrade
The Slurm Scheduler was upgraded from version 23.11.5 to 24.5.0. This update brings better job scheduling algorithms, improved resource management, and compatibility with newer Slurm features. The new version also resolves several bugs, enhancing the overall user experience.
Phoenix Mamba and Jupyter Environment Updates
The Mamba package manager was updated from version 1.5.1 to 1.5.9, alongside updates to the Jupyter environments. These updates improve compatibility with newer Python libraries and address performance and stability issues.
If you need to use the older Mamba environment, you can load it with:
module load mamba/.1.5.1
Instead of:
module load mamba/latest
High-Availability Networking Repairs
Critical repairs were completed on the high-availability networking infrastructure to address reliability issues. These changes ensure a more robust and fault-tolerant network, reducing the risk of disruptions and improving overall connectivity for compute nodes and services.
Improved Zsh Compatibility
Updates were made to improve compatibility with the Zsh shell:
- Bash functions were migrated to standalone bash scripts, ensuring they work as expected regardless of the shell being used.
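To illustrate why this migration helps: a function defined in a Bash startup file exists only inside Bash sessions, whereas an executable script with a Bash shebang line is run by Bash no matter which shell invokes it. The following is a minimal sketch of that idea, not one of the actual Phoenix scripts; the command name hello_cluster and the temporary directory are assumptions made for illustration.

```shell
# Hypothetical example: "hello_cluster" is an illustrative name,
# not a real Phoenix command.
demo_dir="$(mktemp -d)"

# Write a standalone script. The shebang makes the system run it with bash,
# so bash-only syntax inside (such as [[ ... ]]) is safe even when the
# command is called from zsh or another shell.
cat > "$demo_dir/hello_cluster" <<'EOF'
#!/usr/bin/env bash
if [[ -n "${1:-}" ]]; then
    echo "hello, $1"
else
    echo "hello, world"
fi
EOF
chmod +x "$demo_dir/hello_cluster"

# Any shell that has the directory on its PATH can now run the command.
PATH="$demo_dir:$PATH"
hello_cluster phoenix
```

The same file behaves identically whether the calling shell is Bash or Zsh, which is the property the migration relies on.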
Phoenix Rebuild of OpenMPI for Broader Application Support
OpenMPI was rebuilt to expand compatibility and resolve prior issues:
- Previously, OpenMPI was linked against compilers optimized for AVX512 instructions, causing silent failures on nodes lacking AVX512 support.
- The new version, 4.1.7, is available via:
module load openmpi/4.1.7
- The older version remains accessible via:
module load openmpi/4.1.5
Users are encouraged to try the new module, as it will become the default in the future. However, the older module will remain available for now.
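For users who want to confirm whether a particular node supports AVX512, one option on Linux is to look for the avx512f (AVX-512 Foundation) flag in /proc/cpuinfo. The sketch below is an illustration only and is not part of the maintenance tooling; the helper name has_avx512 is an assumption.

```shell
# Succeeds (exit status 0) if the given space-separated CPU flag list
# contains the AVX-512 Foundation flag "avx512f".
# "has_avx512" is a hypothetical helper name used for this sketch.
has_avx512() {
    case " $1 " in
        *" avx512f "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On Linux, the CPU feature flags are listed on the "flags" line of /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
    flags="$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
    if has_avx512 "$flags"; then
        echo "AVX512 supported on $(uname -n)"
    else
        echo "AVX512 not supported on $(uname -n)"
    fi
fi
```

Running this inside a job reports on the compute node the job landed on rather than the login node.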
Other Notable Changes
- The thisjob script has been enhanced to automatically check $SLURM_JOB_ID if no job ID is provided.
- Added bash-completion support for interactive and other Slurm commands.
- Automated node health checks have been revised.
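The fallback described for thisjob can be sketched as follows. This is not the actual thisjob source; the function name resolve_job_id is an assumption, and the only fact relied on is that Slurm sets SLURM_JOB_ID in the environment of a running job.

```shell
# Hypothetical sketch of the fallback logic; not the real thisjob script.
resolve_job_id() {
    # Prefer an explicit argument; otherwise fall back to the SLURM_JOB_ID
    # variable that Slurm sets inside a running job.
    job_id="${1:-${SLURM_JOB_ID:-}}"
    if [ -z "$job_id" ]; then
        echo "error: no job ID given and SLURM_JOB_ID is not set" >&2
        return 1
    fi
    printf '%s\n' "$job_id"
}
```

Called with an argument, the function prints that argument. Called with no argument inside a job, it prints the current job's ID. Called with no argument outside a job, it fails with an error message.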
MPI Performance Metric
We used the OSU Micro-Benchmarks (OMB) v7.5 from Ohio State University to check the status of nodes before and after maintenance. These tests measure bandwidth and latency on randomly selected, unique node pairs across all nodes, using all six MPI modules on Phoenix.
This workflow runs a large number of test jobs to verify the health of:
- Individual nodes
- MPI modules
- Mamba module
- Slurm scheduler
The goal is to ensure the system functions properly before the cluster is released at the end of maintenance.
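One simple way to form random, unique node pairs is to shuffle the node list and pair adjacent entries. The sketch below shows that approach under stated assumptions: it is not the workflow actually used on Phoenix, the function name make_node_pairs is hypothetical, the node names are placeholders, and it relies on the GNU coreutils shuf command.

```shell
# Pair up nodes at random so that each node appears in at most one pair.
# Reads one node name per line on stdin; prints "nodeA nodeB" on each line.
# "make_node_pairs" is a hypothetical helper name used for this sketch.
make_node_pairs() {
    # shuf randomizes the order, paste joins every two lines into one,
    # and awk drops a leftover unpaired node when the count is odd.
    shuf | paste -d' ' - - | awk 'NF == 2'
}

# Placeholder node names, not real Phoenix hosts.
printf '%s\n' node001 node002 node003 node004 node005 | make_node_pairs
```

Each resulting pair could then be handed to a two-node job that runs a point-to-point test such as osu_latency or osu_bw from the OMB suite.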
Additional Help
If you would like any additional information about these changes, or find these changes are negatively impacting your work, please feel free to reach out.
If you require further assistance, contact the Research Computing Team:
- Visit the RTO Request Help page to create a support ticket.
- Use the #rc-support Slack channel for quick inquiries.
- Attend our office hours for live assistance.
We also offer a series of Educational Opportunities and Workshops.