MaSIF on the HPC Cluster

MaSIF (Molecular Surface Interaction Fingerprints) predicts protein interaction patterns using geometric deep learning on molecular surfaces.

| Application | Task | Output |
| --- | --- | --- |
| MaSIF-site | Predict protein–protein interaction sites | Per-vertex interaction probability |
| MaSIF-ligand | Classify ligand binding pocket type | Ligand class (7 types) |
| MaSIF-search | Scan for structural binding partners | Ranked binding configurations |

Reference: Gainza et al., Nature Methods 17, 184–192 (2020).


Getting Started

1. Load the module

module load masif/1.0

This sets MASIF_ROOT and MASIF_SIF, and provides the masif-exec, masif-shell, masif_data, and copy_data shell functions.

2. Copy the working files to scratch

The installed scripts and data under /packages/apps/masif/1.0/ are read-only. Use the copy_data shell function to get a personal writable copy on scratch:

copy_data

This copies everything to /scratch/$USER/masif/cpu/ and changes into that directory. You only need to do this once. After that, run all jobs from your copy.

warning

The data directories contain only scripts, lists, and pre-trained model weights — not preprocessed surface data. Preprocessing writes to data_preparation/ inside each application directory. A full dataset (all proteins from the paper) requires ~400 GB. Plan accordingly.

3. Bind scratch into the container

Apptainer does not automatically bind /scratch. Set MASIF_BINDS before running any commands or submitting jobs — add this to your ~/.bashrc so it persists across sessions:

echo 'export MASIF_BINDS=/scratch/$USER/masif/cpu:/scratch/$USER/masif/cpu' >> ~/.bashrc
source ~/.bashrc

All Slurm scripts read MASIF_BINDS automatically — no changes needed in the scripts. If your data lives elsewhere, point MASIF_BINDS to that path instead.
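
To confirm the bind took effect, you can list your scratch copy from inside the container (a quick sanity check; masif-exec is provided by the module):

```shell
# Should show your writable copy; an error or empty listing means
# MASIF_BINDS points at the wrong path or copy_data has not been run yet
masif-exec ls /scratch/$USER/masif/cpu
```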

4. (Optional) Browse the read-only install

masif_data          # cd to /packages/apps/masif/1.0/userdata

Interactive Use

Two shell functions are available for running commands inside the MaSIF container:

# Run a single command
masif-exec bash -c "cd $PWD && ./data_prepare_one.sh 4ZQK_A"

# Open an interactive shell
masif-shell

Both automatically bind MASIF_ROOT into the container. To bind additional paths (e.g. a project directory outside your scratch working directory), set MASIF_BINDS; note that exporting it again replaces any earlier value:

export MASIF_BINDS=/scratch/myproject:/scratch/myproject
masif-exec bash -c "cd $PWD && ./data_prepare_one.sh 4ZQK_A"

MaSIF-site — PPI Site Prediction

Predicts which surface residues are likely to participate in protein–protein interactions.

Quick start — single protein

cd /scratch/$USER/masif/cpu/masif_site

# Preprocess: download PDB, compute surface mesh + electrostatics + patches (~1–2 min)
masif-exec bash -c "cd $PWD && ./data_prepare_one.sh 4ZQK_A"

# Predict interaction sites
masif-exec bash -c "cd $PWD && ./predict_site.sh 4ZQK_A"

# Colour surface by predicted score (writes a .ply file)
masif-exec bash -c "cd $PWD && ./color_site.sh 4ZQK_A"

Output:

  • output/all_feat_3l/pred_data/pred_4ZQK_A.npy — per-vertex scores
  • output/all_feat_3l/pred_surfaces/4ZQK_A.ply — coloured surface mesh (open in PyMOL)
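
The scores file can be inspected directly with numpy (a minimal sketch; assumes numpy is available on the login node, otherwise run the same command through masif-exec):

```shell
# Print the number of surface vertices and the score range;
# scores near 1.0 mark likely interaction sites
PRED=output/all_feat_3l/pred_data/pred_4ZQK_A.npy
python -c "import sys, numpy as np; p = np.load(sys.argv[1]); print(p.shape, float(p.min()), float(p.max()))" "$PRED"
```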

Using your own PDB file

masif-exec bash -c "cd $PWD && ./data_prepare_one.sh --file /path/to/protein.pdb 4ZQK_A"

Multi-chain input

  • Single chain: 4ZQK_A
  • Complex where chains A and B interact: 1AKJ_A_B
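
Both forms work with the same helper scripts, so a handful of structures can be processed in a serial loop (a sketch; for long lists use the Slurm array scripts below instead):

```shell
# Preprocess and predict for a few IDs back to back
for id in 4ZQK_A 1AKJ_A_B; do
  masif-exec bash -c "cd $PWD && ./data_prepare_one.sh $id && ./predict_site.sh $id"
done
```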

Slurm resource reference

MaSIF-site Slurm script resources
| Script | Resources | Input | Logs |
| --- | --- | --- | --- |
| data_prepare.slurm | CPU array, 2 cores, 16 GB, 3 h/task | lists/full_list.txt | exelogs/data_prepare.<jobid>_<taskid>.{out,err} |
| masif_site_train.slurm | 1 GPU, 4 cores, 32 GB, 40 h | preprocessed patches | exelogs/masif_site_train.<jobid>.{out,err} |
| masif_site_eval.slurm | 1 GPU, 4 cores, 32 GB, 40 h | trained model | exelogs/masif_site_eval.<jobid>.{out,err} |
| predict_site.slurm | CPU array, 2 cores, 16 GB, 3 h/task | lists/full_list.txt | exelogs/predict_site.<jobid>_<taskid>.{out,err} |

tip

Adjust #SBATCH --partition= and #SBATCH --array= to match your protein list length and cluster partition names before submitting.
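
The array bound can also be set at submission time instead of editing the script (assuming one protein per line in lists/full_list.txt; a command-line --array overrides the in-script #SBATCH directive):

```shell
# One array task per entry in the protein list
N=$(wc -l < lists/full_list.txt)
sbatch --array=1-"$N" data_prepare.slurm
```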

danger

The current container uses TensorFlow 1.12, which requires CUDA ≤10. A100/Ampere GPUs (CUDA 11+) are not compatible. A GPU-enabled container will be provided separately. Training and evaluation will fall back to CPU in the meantime, but will be significantly slower.


MaSIF-ligand — Ligand Pocket Classification

Classifies binding pockets into 7 ligand categories using 12 Å geodesic patches.

cd /scratch/$USER/masif/cpu/masif_ligand

# 1. Preprocess proteins (CPU array job)
sbatch data_prepare.slurm

# 2. Generate TFRecords for training
sbatch make_tfrecord.slurm

# 3. Train the classifier (GPU)
sbatch train_model.slurm

# 4. Evaluate on test set (GPU)
sbatch evaluate_test.slurm

Protein lists are numpy arrays in lists/:

  • train_pdbs_sequence.npy
  • val_pdbs_sequence.npy
  • test_pdbs_sequence.npy

Output in test_set_predictions/:

  • <PDB>_<chains>_labels.npy — ground truth labels
  • <PDB>_<chains>_logits.npy — predicted logits
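
From those two files, overall accuracy can be computed with a short numpy script (a sketch, assuming the predicted class is the argmax over the 7 logits):

```shell
python - <<'EOF'
# Top-1 accuracy across all predicted complexes in test_set_predictions/
import glob
import numpy as np

correct = total = 0
for lab_f in glob.glob('test_set_predictions/*_labels.npy'):
    labels = np.load(lab_f)
    logits = np.load(lab_f.replace('_labels.npy', '_logits.npy'))
    pred = np.argmax(logits, axis=-1)
    correct += int(np.sum(pred == labels))
    total += labels.size
print(f'top-1 accuracy: {correct}/{total}')
EOF
```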

MaSIF-ligand Slurm script resources

| Script | Resources |
| --- | --- |
| data_prepare.slurm | CPU array, 1 core, 16 GB, 2 h/task |
| make_tfrecord.slurm | CPU, 1 core, 8 GB, 48 h |
| train_model.slurm | 1 GPU, 1 core, 16 GB, 48 h |
| evaluate_test.slurm | 1 GPU, 1 core, 16 GB, 24 h |

MaSIF-search — PPI Surface Scanning

Scans a database of protein surfaces for structural binding partners of a query patch.

cd /scratch/$USER/masif/cpu/masif_ppi_search

# 1. Preprocess proteins (CPU array job)
sbatch data_prepare.slurm

# 2. Cache training patch pairs (shape-complementarity filtered)
masif-exec bash -c "cd $PWD && ./cache_nn.sh nn_models.sc05.custom_params"

# 3. Train the descriptor network (GPU)
sbatch masif_ppi_search_train.slurm

# 4. Compute descriptors for search (GPU)
sbatch masif_ppi_search_comp_desc.slurm

# 5. Compute GIF descriptors (optional, for GIF-based search)
masif-exec bash -c "cd $PWD && ./compute_gif_descriptors.sh"

MaSIF-search Slurm script resources

| Script | Resources |
| --- | --- |
| data_prepare.slurm | CPU array, 1 core, 8 GB, 1 h/task |
| masif_ppi_search_train.slurm | 1 GPU, 1 core, 32 GB, 40 h |
| masif_ppi_search_comp_desc.slurm | 1 GPU, 1 core, 32 GB, 20 h |

Unbound benchmark variant

Scripts for the unbound docking benchmark are in /scratch/$USER/masif/cpu/masif_ppi_search_ub/:

cd /scratch/$USER/masif/cpu/masif_ppi_search_ub
sbatch data_prepare.slurm                 # processes lists/benchmark_list_ub.txt
sbatch masif_ppi_search_comp_desc.slurm   # compute descriptors

MaSIF-pdl1 Benchmark

Reproduces the PDL1 benchmark from the paper.

cd /scratch/$USER/masif/cpu/masif_pdl1_benchmark
sbatch data_prepare.slurm # CPU array, lists/full_list.txt
masif-exec bash -c "cd $PWD && ./run_benchmark_nn.sh"

MaSIF-peptides

Evaluates MaSIF-site and MaSIF-search on peptide–protein interactions.

cd /scratch/$USER/masif/cpu/masif_peptides

# Extract helix data (CPU array, lists/bc-100-list.txt)
sbatch data_extract_helix.slurm

# Precompute patches (CPU array, lists/all_peptides.txt)
sbatch data_precompute_patches.slurm

# Evaluate (CPU array, reads from in/x<task_id> split files)
sbatch masif_site_masif_search_eval.slurm
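
If the in/x<task_id> split files need to be regenerated, the standard split utility can produce them (a sketch; the chunk size and suffix format here are assumptions, so check the resulting names against what the eval script expects):

```shell
# Split the peptide list into numbered chunks: in/x1, in/x2, ...
mkdir -p in
split -l 100 --numeric-suffixes=1 --suffix-length=1 lists/all_peptides.txt in/x
```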

Adjusting Slurm Scripts

Before submitting, review the partition and array directives in each script:

#SBATCH --partition=short    # change to your cluster's CPU partition name
#SBATCH --partition=gpu      # change to your cluster's GPU partition name
#SBATCH --array=1-1000       # adjust upper bound to match your list length

The SIF image path defaults to /packages/apps/simg/masif.sif. Override per-job:

MASIF_SIF=/other/path/masif.sif sbatch data_prepare.slurm

Or export for a batch of submissions:

export MASIF_SIF=/other/path/masif.sif
sbatch data_prepare.slurm
sbatch masif_site_train.slurm
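
Submissions can also be chained so training starts only after preprocessing succeeds (a sketch using Slurm job dependencies; --parsable makes sbatch print just the job ID, and afterok waits for every array task to finish successfully):

```shell
# Queue training behind the preprocessing array job
prep_id=$(sbatch --parsable data_prepare.slurm)
sbatch --dependency=afterok:"$prep_id" masif_site_train.slurm
```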

Extra bind mounts for data outside /scratch/$USER/masif/cpu:

export MASIF_BINDS=/scratch/myproject/data:/scratch/myproject/data
sbatch data_prepare.slurm

Visualising Results in PyMOL

.ply surface files are produced by color_site.sh. To view them:

  1. Install the MaSIF PyMOL plugin on your local machine: see /packages/apps/masif/1.0/pymol_plugin_installation.md
  2. Copy the .ply file from the cluster to your machine
  3. In PyMOL:
    loadply 4ZQK_A.ply
  4. Hide all objects except those containing iface to show the predicted interaction site coloured by score.
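
For step 2, scp works from your local machine (the hostname and username here are placeholders; substitute your cluster's login node and account):

```shell
# Run on your local machine, not on the cluster
scp <user>@cluster.example.edu:/scratch/<user>/masif/cpu/masif_site/output/all_feat_3l/pred_surfaces/4ZQK_A.ply .
```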

Checking Job Status

squeue -u $USER                                         # running/pending jobs
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed  # completed job summary
cat exelogs/data_prepare.<jobid>_1.out                  # inspect one task's log

Troubleshooting

MASIF_ROOT not set / python can't find source files

Make sure you loaded the module before submitting: module load masif. The Slurm scripts fall back to /packages/apps/masif/1.0 if the variable is unset, but loading the module is the safer option.

Permission denied writing output files

You are running from the read-only install at /packages/apps/masif/1.0/. Run copy_data and submit jobs from /scratch/$USER/masif/cpu/ instead.

fatal: not a git repository

This warning can appear if a script's git rev-parse fallback fires. It is harmless — the scripts use MASIF_ROOT when set, which takes precedence over git.

Container can't find my data files

Files outside MASIF_ROOT and your working directory are not visible inside the container. Set MASIF_BINDS to bind your data location, or use --bind directly with apptainer exec.

GPU jobs run on CPU / TensorFlow doesn't see the GPU

TF 1.12 requires CUDA ≤10. Cluster GPUs with CUDA 11+ (e.g. A100) are not supported by this container. A GPU-compatible container will be provided separately.


Known Limitations

  • TensorFlow 1.12 only — no TF2 / eager mode. CUDA ≤10 required for GPU acceleration.
  • Preprocessing takes ~1–2 min/protein (bottleneck: MDS geodesic coordinates and APBS electrostatics). Use array jobs for large datasets.
  • Results may differ slightly from the published paper because the MATLAB preprocessing pipeline has been replaced with Python equivalents. To reproduce exact paper results, see masif_paper on GitHub.
  • DSSP is not included in the container and is not required for any of the three main applications.