May 2026 Maintenance - Sol Upgrades

Research Computing Team
RC Docs Maintainers

This post summarizes the updates and improvements deployed during the May 2026 maintenance window on Sol. Sol received a cluster-wide firmware refresh, an updated kernel and driver stack, a Slurm upgrade with a new job submit plugin, and a refreshed GPU resource naming scheme to improve clarity for users.

GPU Resource Naming

The naming of A100 GPU resources on Sol has been updated to improve readability and make it easier to target a specific class of GPU. MIG (Multi-Instance GPU) functionality, performance, and availability are unchanged; only the resource names that users specify in job scripts have been updated.

New naming convention:

  • 20 GB A100 "MIG" slice → a100.20gb
  • 40 GB A100 GPU → a100.40gb
  • 80 GB A100 GPU → a100 (unchanged for compatibility)

To request a specific GPU type, update your job scripts and interactive commands accordingly:

#SBATCH -G a100.20gb:1
#SBATCH -G a100.40gb:1

Jobs that were pending in the queue specifically targeting the renamed resources will need to be canceled and resubmitted using the new names.
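As a rough illustration, a minimal batch script requesting one 20 GB MIG slice under the new naming might look like the sketch below. Only the `-G a100.20gb:1` line comes from the naming convention above; the time limit, output file name, and the `gpu_summary` helper are placeholder assumptions to adapt to your own workflow.

```shell
#!/bin/bash
#SBATCH -G a100.20gb:1        # one 20 GB A100 MIG slice (new resource name)
#SBATCH -t 0-00:10:00         # assumed 10-minute limit; adjust as needed
#SBATCH -o gpu-check.%j.out   # hypothetical output file name

# Hypothetical helper: report the job ID and the GPU(s) Slurm exposed to the job.
# Outside of Slurm both values fall back to "none".
gpu_summary() {
    echo "job=${SLURM_JOB_ID:-none} gpus=${CUDA_VISIBLE_DEVICES:-none}"
}

gpu_summary

# List the visible GPUs if the NVIDIA tools are available on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
fi
```

Pending jobs that still reference the old names can be listed with `squeue --me --states=PENDING` and canceled with `scancel <jobid>` before resubmitting the updated script with `sbatch`.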

Firmware and NVIDIA Driver Updates

Firmware was refreshed across the cluster during this maintenance window. GPU nodes, CPU nodes, and login nodes were all brought current.

As nodes rebooted into the new image, they picked up:

  • Latest kernel and security updates
  • A custom BeeGFS client patch enabling relatime (relative access timestamps), matching the change applied to Phoenix in the March 2026 maintenance. This significantly improves detection of active scratch files, reducing false positives when identifying data for removal.
  • NVIDIA GPU driver updated to 595.71.05 to support CUDA 13.2
  • ROCm and AMD GPU drivers updated to 7.2.3
  • Updated NVIDIA DOCA drivers for ConnectX devices

Slurm Updates

Slurm has been upgraded to 25.11.6 on Sol, bringing in upstream bug fixes and security updates. Alongside the version bump, a new job_submit.lua plugin was deployed with an accompanying test suite that exercises every condo QOS, ensuring job submission rules behave consistently across partitions. Partition sets have also been simplified to reduce confusion when selecting where to run jobs, and outdated class accounts have been removed.

GPU Benchmarking

A subset of the public A100_80 GPUs was benchmarked with the NVIDIA HPC-Benchmarks container, specifically its HPL benchmark.

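| Year | GPUs | Rmax (PFLOPS) | Per-GPU (TFLOPS) | Rpeak (TFLOPS) | Efficiency |
|------|------|---------------|------------------|----------------|------------|
| 2025 | 240 | 2.620 | 10.92 | 4,680 | 56.0% |
| 2026 | 224 | 2.617 | 11.68 | 4,368 | 59.9% |
| Δ | −16 | −0.1% | +7.0% | | +3.9 pp |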

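With 16 fewer GPUs, the 2026 run reached essentially the same Rmax (2.617 vs. 2.620 PFLOPS), so per-GPU throughput rose 7.0% (10.92 to 11.68 TFLOPS) and HPL efficiency improved from 56.0% to 59.9%. The 2026 software/driver stack delivers a measurable efficiency gain over the 2025 baseline.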
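The derived columns in the table can be reproduced from the raw values. The sketch below assumes per-GPU throughput is Rmax divided by the GPU count and efficiency is Rmax divided by Rpeak; `hpl_stats` is a hypothetical helper name, not part of the benchmark container.

```shell
#!/bin/bash
# Recompute the derived HPL columns from the raw table values.
# hpl_stats is a hypothetical helper, not part of any benchmark tooling.
#   $1 = GPU count, $2 = Rmax in PFLOPS, $3 = Rpeak in TFLOPS
# Prints: per-GPU TFLOPS, then efficiency in percent.
hpl_stats() {
    awk -v gpus="$1" -v rmax_pf="$2" -v rpeak_tf="$3" 'BEGIN {
        rmax_tf = rmax_pf * 1000                # PFLOPS -> TFLOPS
        printf "%.2f %.1f\n", rmax_tf / gpus, 100 * rmax_tf / rpeak_tf
    }'
}

hpl_stats 240 2.620 4680   # 2025 run: prints "10.92 56.0"
hpl_stats 224 2.617 4368   # 2026 run: prints "11.68 59.9"
```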

Technical Updates

  • Slurm updated to 25.11.6 on Sol
  • BeeGFS servers updated from 7.4.6 to 7.4.7
  • BeeGFS relatime patch applied to Sol clients, improving detection of active scratch files
  • NVIDIA GPU driver updated to 595.71.05 (CUDA 13.2 support) across the cluster
  • ROCm and AMD GPU drivers updated to 7.2.3
  • NVIDIA DOCA drivers updated for ConnectX devices
  • Home directories now statically mounted on Sol, improving reliability (/data directories remain dynamically mounted on demand)
  • Open OnDemand updated on Sol
  • Mamba, Jupyter, and Jupyter AI updated to the latest versions
  • Obsolete Mamba environments and Jupyter kernels hidden from the OOD interface
  • General firmware, kernel, and security updates applied to all systems

March 2026 Maintenance - Phoenix Network and Storage Upgrades

Research Computing Team
RC Docs Maintainers

This post summarizes the updates and improvements deployed during the March 2026 maintenance window. Phoenix received major enhancements to its Ethernet networking and bug fixes to the Scratch filesystem.

Network Upgrades

Phoenix's core Ethernet network has been upgraded from 10/25 Gb/s to 40/100 Gb/s, quadrupling the aggregate bandwidth between racks to up to 200 Gb/s. This provides significantly improved performance for data transfers and network communication.

Scratch Filesystem Improvements

Over the past few months, many users have noticed lag or delays while traversing scratch. This was caused by the client constantly trying to connect over the OmniPath network, even when nodes did not have OmniPath connectivity. This issue has been resolved by forcing the client to maintain connections over the existing network. Users should see improved performance when traversing and accessing scratch.

RDMA connections have also been enabled for scratch, which should further improve performance for users with RDMA-capable network interfaces. RDMA had previously been unavailable because of a BeeGFS client bug that disabled RDMA whenever IPv6 was disabled, as it is on Phoenix.

We have also applied a custom patch to the BeeGFS client so that it properly supports relatime (relative access timestamps), which should resolve issues with files appearing to have incorrect timestamps when accessed from Phoenix. With relatime, a read or write updates a file's access time only if the previous access time is older than the modification time or more than 24 hours old. Testing has shown that this does not affect scratch performance and yields more accurate access times for files on scratch, which matters for the accuracy of the scratch data retention policy. This patch will be applied to Sol at the next maintenance window.
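The relatime rule described above can be sketched as a small shell function. This follows the wording of this post only and is not the patched BeeGFS client code; `relatime_should_update` is a hypothetical name, and all timestamps are assumed to be epoch seconds.

```shell
#!/bin/bash
# Sketch of the relatime rule as described above (timestamps in epoch seconds).
# relatime_should_update is a hypothetical name; it exits 0 when an access
# would refresh atime and 1 when the existing atime would be kept.
relatime_should_update() {
    local atime=$1 mtime=$2 now=$3
    local day=86400                       # 24 hours in seconds
    if (( atime < mtime )); then          # atime older than the last modification
        return 0
    fi
    if (( now - atime > day )); then      # atime more than 24 hours old
        return 0
    fi
    return 1
}

# Example: the file was modified after it was last read,
# so the next access would update atime.
if relatime_should_update 1000 2000 2500; then
    echo "atime would be updated"
fi
```

On systems with GNU coreutils, `stat -c '%X %Y' FILE` prints a file's access and modification times as epoch seconds, which can be passed to the function along with `date +%s`.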

Scratch Benchmarking

IO500, a widely used benchmark suite for evaluating storage systems in high-performance computing environments, was run on Phoenix's scratch filesystem to measure the effect of the recent updates. The results show that Phoenix's scratch filesystem delivers high performance across a variety of workloads, including small file access, large file access, and metadata operations. Phoenix outperforms Sol's scratch filesystem in all but two tests (ior-hard-write and mdtest-hard-write), which is notable given that Sol's scratch runs on newer hardware.

Benchmark Results:

| Test Name | Sol (March 2023) | Phoenix (March 2026) |
|-----------|------------------|----------------------|
| ior-easy-write | 12.49 GiB/s | 22.15 GiB/s |
| ior-hard-write | 1.03 GiB/s | 0.54 GiB/s |
| ior-easy-read | 17.15 GiB/s | 27.31 GiB/s |
| ior-hard-read | 1.69 GiB/s | 2.28 GiB/s |
| find | 1433.96 kIOPS | 1994.09 kIOPS |
| mdtest-easy-write | 86.99 kIOPS | 130.48 kIOPS |
| mdtest-hard-write | 8.63 kIOPS | 7.36 kIOPS |
| mdtest-easy-stat | 333.97 kIOPS | 961.19 kIOPS |
| mdtest-hard-stat | 86.93 kIOPS | 236.19 kIOPS |
| mdtest-easy-delete | 58.25 kIOPS | 129.70 kIOPS |
| mdtest-hard-delete | 10.21 kIOPS | 10.37 kIOPS |
| mdtest-hard-read | 11.38 kIOPS | 20.12 kIOPS |
| Overall Bandwidth | 4.40 GiB/s | 5.23 GiB/s |
| Overall IOPS | 61.76 kIOPS | 102.06 kIOPS |
| Overall Score | 16.48 | 23.09 |

NVIDIA GPU Updates

The NVIDIA GPU driver has been updated to 595.45.04, which supports CUDA 13.2. However, NVIDIA has removed support for the Tesla V100 and GTX 1080 Ti GPUs from its latest driver releases. As a result, the V100 and GTX 1080 Ti GPUs on Phoenix remain on an older driver version (580.95.05) that supports CUDA 13.0.

Technical Updates

  • Slurm updated to 25.11.3
  • Web portals updated to 4.0.10 on Phoenix.
  • Mamba package manager updated to 2.5.0 on Phoenix.
  • Jupyter updated on Phoenix.
  • Obsolete VASP modules removed to clean up module list
  • General security updates applied to all systems
  • InfiniBand cards installed in pcg085-pcg088
  • Home directories are now statically mounted on Phoenix, improving reliability (/data directories are still dynamically mounted on demand)

January 2026 Maintenance - Introducing Phoenix Scratch

Research Computing Team
RC Docs Maintainers

This post summarizes the updates and improvements deployed during the January 2026 maintenance window. Changes to Sol were minimal, while Phoenix received a major enhancement with the introduction of Phoenix Scratch, a new high performance scratch storage system. Together, these updates improve performance, security, and overall usability across the clusters.

Introducing Phoenix Scratch

Photo: Two racks housing the storage and networking infrastructure for Phoenix Scratch in the Iron Mountain Data Center.

We are pleased to announce the availability of Phoenix Scratch, a new high performance parallel scratch filesystem designed to support data intensive workloads on Phoenix.

  • 3 PiB of shared storage space
  • Directly mountable on InfiniBand, Omni-Path, and Ethernet fabrics
  • High throughput and low latency for I/O intensive applications
  • Parallel file system support for seamless integration with existing workflows
  • NVMe backed metadata for improved responsiveness
  • Policy based quotas for flexible and fair storage management

As with Sol Scratch, Phoenix Scratch is intended for temporary storage only. It is well suited for active job data, checkpoints, and intermediate results, but it should not be used for long-term data retention. The automatic 90-day data retention policy applies to both Phoenix Scratch and Sol Scratch. Users should ensure that important data is backed up to appropriate long-term storage systems.
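To get a rough idea of which files may be approaching the retention window, something like the sketch below could be used. It assumes that retention is judged by access time, which this post does not spell out; `stale_scratch_files` is a hypothetical helper and the scratch path in the usage comment is a placeholder.

```shell
#!/bin/bash
# List regular files under a directory whose last access is more than N full
# days old (find's -atime +N test). stale_scratch_files is a hypothetical
# helper; the 90-day default mirrors the retention window, but the exact
# purge criteria are an assumption here.
stale_scratch_files() {
    local dir=$1 days=${2:-90}
    find "$dir" -type f -atime +"$days" -print
}

# Hypothetical usage: replace the path with your own scratch directory.
# stale_scratch_files "/scratch/$USER"
```

A shorter window, for example `stale_scratch_files DIR 60`, gives earlier warning about data that should be copied to long-term storage.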

Behind the Build

Bringing Phoenix Scratch online required significant infrastructure work, including over 800 meters of cabling and the installation of 507 HDDs, 60 SSDs, and 12 NVMe drives across 19 servers. The system is composed of three primary components:

Metadata Servers (MDS): Six high performance servers dedicated to metadata operations, each equipped with NVMe drives. These systems were built using nodes donated by Cirrus Logic and customized to support NVMe storage with drives donated by Intel.

Object Storage Servers (OSS): Thirteen storage servers providing the bulk data capacity and throughput. These systems use a mix of HDDs and SSDs and repurpose hardware previously deployed as the Cholla Storage System.

Networking: Each storage node is connected via 100 Gb Omni-Path, 100 Gb InfiniBand, and dual 40 Gb Ethernet links, ensuring high bandwidth and low latency access regardless of interconnect.

We extend our sincere thanks to our partners at Intel and Cirrus Logic for their generous hardware contributions, which made Phoenix Scratch possible. We also thank our researchers for their patience while Phoenix continued to operate during the deployment and integration of this new filesystem.

Transferring Data to Phoenix Scratch

Globus is the recommended method for transferring data to and from Phoenix Scratch. We have created a Globus collection specifically for Phoenix Scratch. For more information, see our documentation on transferring data between supercomputers.

Technical Updates

Infrastructure and Firmware

  • Duo 2FA enabled for password-based SSH. More information can be found in the Duo 2FA Documentation.
  • Slurm upgraded to 25.11.1 on Sol and Phoenix.
  • Web portals updated to 4.0.8 on Sol and Phoenix.
  • Swap enabled on Sol and Phoenix login nodes to improve stability during high memory usage.
  • InfiniBand fabrics separated for Sol and Phoenix to improve performance and stability.
  • OmniPath fabric managers updated to 12.0.1
  • Updated Horizon project storage with latest patch release
  • Upgraded Hypervisors to latest stable release.
  • Load-balancing infrastructure updated to improve both security and reliability.