
Dec 2024 Phoenix Maintenance Changelog

Research Computing Team
RC Docs Maintainers

This changelog summarizes the updates made during the December 2024 maintenance period. The changes aim to improve the system's reliability, performance, and user experience. Each major change is described below.

Notable Changes

Phoenix System Software and Security Upgrades

The cluster's operating system was upgraded from Rocky Linux 8.9 to 8.10. This upgrade includes improved security features, performance optimizations, and support for newer libraries and tools. These updates enhance system stability and compatibility with modern software requirements.

Phoenix Slurm Scheduler Upgrade

The Slurm scheduler was upgraded from version 23.11.5 to 24.05.0. This update brings better job scheduling algorithms, improved resource management, and access to newer Slurm features. The new version also resolves several bugs, improving the overall user experience.

Phoenix Mamba and Jupyter Environment Updates

The Mamba package manager was updated from version 1.5.1 to 1.5.9, alongside updates to the Jupyter environments. These updates improve compatibility with newer Python libraries and address performance and stability issues.

If you need to use the older Mamba environment, you can load it with:

module load mamba/.1.5.1

Instead of:

module load mamba/latest

High-Availability Networking Repairs

Critical repairs were completed on the high-availability networking infrastructure to address reliability issues. These changes ensure a more robust and fault-tolerant network, reducing the risk of disruptions and improving overall connectivity for compute nodes and services.

Improved Zsh Compatibility

Compatibility with the Zsh shell was improved:

  • Bash functions were migrated to standalone bash scripts, ensuring they work as expected regardless of the shell being used.
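To illustrate why this helps: a function defined in a bash startup file exists only in bash sessions, while an executable script with a bash shebang runs under bash no matter which shell calls it. The sketch below uses a hypothetical helper named hello_cluster (not an actual Phoenix command) and a temporary directory, both assumed for illustration only.

```shell
#!/bin/bash
# Hypothetical example. Suppose a helper used to be a bash function such as
#   hello_cluster() { echo "hello from ${1:-phoenix}"; }
# defined in a bash-only startup file, and was therefore missing in zsh.

# Write the same logic as a standalone script with a bash shebang.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/hello_cluster" <<'EOF'
#!/usr/bin/env bash
echo "hello from ${1:-phoenix}"
EOF
chmod +x "$demo_dir/hello_cluster"

# Any shell that finds the script on PATH can now run it; the shebang
# ensures the body is interpreted by bash.
PATH="$demo_dir:$PATH"
hello_cluster_output="$(hello_cluster node001)"
echo "$hello_cluster_output"

# Clean up the temporary directory.
rm -rf "$demo_dir"
```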

Phoenix Rebuild of OpenMPI for Broader Application Support

OpenMPI was rebuilt to expand compatibility and resolve prior issues:

  • Previously, OpenMPI was linked against compilers optimized for AVX512 instructions, causing silent failures on nodes lacking AVX512 support.
  • The new version 4.1.7 is available via:
module load openmpi/4.1.7
  • The older version remains accessible via:
module load openmpi/4.1.5

Users are encouraged to try the new module, as it will become the default in the future. However, the older module will remain available for now.
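If you are unsure whether a node supports AVX512, one way to check is to look for the avx512f flag that the Linux kernel reports in /proc/cpuinfo. The following is a minimal sketch; the function name has_avx512 is made up for this example, and the check only describes the CPU, not how any particular module was built.

```shell
#!/bin/bash
# Return success if the CPU flags in the given cpuinfo file include
# avx512f (the AVX-512 Foundation flag). Defaults to /proc/cpuinfo.
has_avx512() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    grep -qw 'avx512f' "$cpuinfo"
}

# Report the result for the node this runs on.
if has_avx512; then
    echo "AVX512 supported on $(uname -n)"
else
    echo "AVX512 not supported on $(uname -n)"
fi
```

Running this inside a job shows whether the allocated node has AVX512 support.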

Other Notable Changes

  • The thisjob script has been enhanced to automatically check $SLURM_JOB_ID if no job ID is provided.
  • Added bash-completion support for the interactive command and other Slurm commands.
  • Automated node health checks have been revised.
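The $SLURM_JOB_ID fallback mentioned for thisjob can be sketched as follows. This is not the actual thisjob source; resolve_job_id is a hypothetical helper that only illustrates the pattern of preferring an explicit argument and falling back to the environment variable Slurm sets inside a job.

```shell
#!/bin/bash
# Hypothetical helper illustrating the fallback; not the real thisjob code.
resolve_job_id() {
    # Use the first argument if given, otherwise fall back to $SLURM_JOB_ID.
    local job_id="${1:-${SLURM_JOB_ID:-}}"
    if [ -z "$job_id" ]; then
        echo "error: no job ID given and SLURM_JOB_ID is not set" >&2
        return 1
    fi
    printf '%s\n' "$job_id"
}
```

Inside a running job, calling resolve_job_id with no arguments would print that job's ID; outside a job, an ID has to be passed explicitly.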

MPI Performance Metric

We used the OSU Micro-Benchmarks (OMB) v7.5 from Ohio State University to check the status of nodes before and after maintenance. These tests measure bandwidth and latency on randomly selected, unique node pairs across all nodes, using all six MPI modules on Phoenix.

This workflow runs a large number of test jobs to verify the health of:

  • Individual nodes
  • MPI modules
  • Mamba module
  • Slurm scheduler

The goal is to ensure the system functions properly before the cluster is released at the end of maintenance.
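As a rough illustration of how randomly selected, unique node pairs could be formed, the sketch below shuffles a list of node names and groups them two at a time, so no node appears in more than one pair. This is one possible reading of the pairing, not the team's actual workflow; make_node_pairs and the node names are hypothetical, and shuf assumes GNU coreutils.

```shell
#!/bin/bash
# Read node names (one per line) on stdin, shuffle them, and print
# disjoint pairs, one pair per line. With an odd number of nodes the
# leftover node is dropped.
make_node_pairs() {
    shuf | paste -d ' ' - - | awk 'NF == 2'
}

# Hypothetical node names, for illustration only.
printf '%s\n' node01 node02 node03 node04 node05 | make_node_pairs
```

On a real cluster the node list might come from something like `sinfo -N -h -o '%N' | sort -u`, and each pair could then be handed to a two-node job running an OMB test such as osu_latency or osu_bw. The exact job setup used during maintenance is not described here.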

Additional Help

If you would like more information about these changes, or find that they negatively impact your work, contact the Research Computing Team:

  • Visit the RTO Request Help page to create a support ticket.
  • Use the #rc-support Slack Channel for quick inquiries.
  • Attend our office hours for live assistance.

We also offer a series of Educational Opportunities and Workshops.

April 2025 Phoenix Maintenance Changelog

Research Computing Team
RC Docs Maintainers

This document outlines the latest updates and improvements deployed during the April 2025 maintenance. These enhancements are designed to improve system performance, security, and usability across the Phoenix cluster.

Notable Changes

Phoenix System and Security Updates

  • Applied latest Rocky Linux OS security updates to ensure continued protection against known vulnerabilities.
  • The Slurm workload manager was upgraded from 24.05.1 to 24.05.3, which applies critical security patches and improves scheduler stability.
  • Per-node energy tracking has been enabled, allowing for greater insight into system power usage.

Phoenix Resource and Portal Enhancements

  • Added 16 additional GPU MIG instances, increasing availability for GPU shard-based workloads.
  • The web portal was upgraded from version 3.0.3 to 3.1.7.

Jupyter and Environment Manager Updates

  • Jupyter Lab updated to the latest stable version, offering improved performance and UI features.
  • Mamba environment manager upgraded to 1.5.10 for better compatibility with modern Python packages.

Improved Job Submission Experience

The job_submit plugin was modernized to improve feedback for interactive job submissions. When arguments are omitted, the submission now prints the default values that were assigned; when all arguments are given, no extra messages appear. For example:

$ salloc -t 240
salloc: QOS not specified; assigning "public" qos
salloc: cpus-per-task not specified; assigning 1 core
salloc: time_limit <= 240 and Partition not specified; assigning "htc" partition
salloc: Pending job allocation 19824107
salloc: job 19824107 queued and waiting for resources

$ salloc -p general -t 240 -q public -c 1
salloc: Pending job allocation 19824117
salloc: job 19824117 queued and waiting for resources
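The defaulting behavior visible in the output above can be mimicked with a small shell function. This is only an illustration of the three messages shown; the function name describe_defaults and its argument order are assumptions made for this example, and it does not reproduce the actual plugin or any rules beyond those in the example output.

```shell
#!/bin/bash
# Illustrative only: mimics the three defaulting messages shown above.
# Arguments: qos, cpus_per_task, time_limit_minutes, partition
# (an empty string means the option was not given).
describe_defaults() {
    local qos="$1" cpus="$2" time_limit="$3" partition="$4"
    if [ -z "$qos" ]; then
        echo 'QOS not specified; assigning "public" qos'
    fi
    if [ -z "$cpus" ]; then
        echo 'cpus-per-task not specified; assigning 1 core'
    fi
    # Only the <= 240 minute case appears in the changelog; other cases
    # are left out rather than guessed.
    if [ -z "$partition" ] && [ -n "$time_limit" ] && [ "$time_limit" -le 240 ]; then
        echo 'time_limit <= 240 and Partition not specified; assigning "htc" partition'
    fi
}

describe_defaults "" "" 240 ""          # like: salloc -t 240
describe_defaults public 1 240 general  # like: salloc -p general -t 240 -q public -c 1
```

The first call prints all three messages, and the second prints nothing, matching the two salloc examples.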

Technical Updates

Infrastructure and Firmware

  • Warewulf updated to 4.5.8-1 to enhance node provisioning and cluster management.
  • Grace Hopper firmware upgraded to version 3.17.0.
  • Dell PowerStore firmware updated from 2.1.1.1 to 3.6.1.3.
  • Firewall firmware received critical updates to ensure network security.
  • An arbiter process was added to the soldtn node for improved coordination and fault tolerance.

Jupyter and Python Tooling

  • The Jupyter Notebook environment has been updated to the most recent stable version.
  • Mamba now at version 1.5.10, improving environment creation speed and dependency resolution.

MPI Performance Metric

We used the OSU Micro-Benchmarks (OMB) v7.4 from Ohio State University to validate the health of the Phoenix system before and after maintenance. These tests assess bandwidth and latency across randomly selected node pairs using all eight MPI modules on Phoenix.

This ensures:

  • Proper node performance
  • MPI module functionality
  • Mamba module integrity
  • Slurm scheduler behavior

A large number of test jobs were submitted to verify the overall system health.

To view performance comparisons from before and after this maintenance, visit: OMB tests - Google Drive

Additional Help

If you need assistance or notice any issues following these changes, please contact the Research Computing Team:

  • Submit a ticket via the RTO Request Help page
  • Join the #rc-support Slack channel for quick questions
  • Attend office hours for real-time support

For more information on our Educational Opportunities and Workshops, please visit our events page.