High performance and technical computing is all about scale and speed. Many applications, such as drug discovery and genomics, financial services, and image processing, require access to a large and diverse set of computing resources on demand. With more and faster computing power, you can convert an idea into a discovery, a hypothesis into a cure, or an inspiration into a product. Google Cloud provides the HPC community with on-demand access to large amounts of high-performance resources with Compute Engine. But a challenge remains: how do you harness these powerful resources to execute your HPC jobs quickly, and seamlessly augment an existing HPC cluster with Compute Engine capacity?
To help with this problem, we teamed up with SchedMD to release a preview of tools that make it easier to launch the Slurm workload manager on Compute Engine, and to expand your existing cluster when you need extra resources. This integration was built by the experts at SchedMD in accordance with Slurm best practices. Slurm is a leading open-source HPC workload manager, used by many of the TOP500 supercomputers around the world.
Step 1: Grab the Cloud Deployment Manager scripts from SchedMD’s GitHub repository. Review the included README.md for more information. You may want to customize the Deployment Manager scripts for your needs; many cluster parameters can be configured in the included configuration file. At a minimum, you need to edit slurm-cluster.yaml to paste in your munge_key, specify your GCP username in default_users, and set the Slurm version you want to use (e.g., 17.11.5).
Step 2: Run the following command from Cloud Shell or from a local terminal with the gcloud command-line tool installed:
gcloud deployment-manager deployments create slurm --config slurm-cluster.yaml
Then, navigate to the Deployment Manager section of the developer console and confirm that your deployment succeeded.
In the Compute Engine section of the developer console, you’ll see that Deployment Manager created a number of VM instances for you, among them a Slurm login node. After the VMs are provisioned, Slurm is installed and configured on them. You can then SSH into the login node by clicking the SSH button in the console or by running:
gcloud compute ssh login1 --zone=us-west1-a
(Note: You may need to change the zone if you modified it in the deployment configuration.)
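If you prefer to stay in the terminal, the same checks can be sketched as a few small shell functions. This is only an illustration: the function names and the GCLOUD, DEPLOYMENT, LOGIN_NODE, and ZONE variables are assumptions made for this example and are not part of the SchedMD scripts. The defaults mirror the deployment name, login node, and zone used in this post.

```shell
#!/bin/bash
# Sketch: check the deployment and reach the login node from a terminal.
# GCLOUD, DEPLOYMENT, LOGIN_NODE, and ZONE are illustrative variables
# (assumptions for this example); override them to match your setup.
GCLOUD="${GCLOUD:-gcloud}"
DEPLOYMENT="${DEPLOYMENT:-slurm}"
LOGIN_NODE="${LOGIN_NODE:-login1}"
ZONE="${ZONE:-us-west1-a}"

# Show the status and resources of the deployment.
describe_deployment() {
  "$GCLOUD" deployment-manager deployments describe "$DEPLOYMENT"
}

# List the VM instances in the current project, which should include
# the ones Deployment Manager created.
list_instances() {
  "$GCLOUD" compute instances list
}

# Open an SSH session on the login node in the configured zone.
ssh_login_node() {
  "$GCLOUD" compute ssh "$LOGIN_NODE" --zone="$ZONE"
}
```

After sourcing the file, you might call describe_deployment and list_instances to confirm the VMs exist, then ssh_login_node to log in. If you changed the zone in your configuration, set ZONE before calling ssh_login_node.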
Once you’ve logged in, you can interact with Slurm and submit jobs as usual using sbatch. For example, copy the sample script below into a new file:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=out_%j.txt
#
#SBATCH --nodes=2
srun hostname
sleep 60
Then, submit it with sbatch. You can then observe the job being distributed to the compute nodes using the squeue command. If the submitted job requires more resources than were initially deployed, new instances are created automatically, up to the maximum specified in slurm-cluster.yaml. To try this, set #SBATCH --nodes=4 and resubmit the job. Once the ephemeral compute instances have been idle for the configured period of time, they'll be deprovisioned.
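The submit-and-watch workflow can also be scripted. The sketch below is one possible approach, not part of the SchedMD tooling: the helper names and the POLL_SECONDS variable are made up for this example. It relies on sbatch printing "Submitted batch job <id>" on success, and it assumes squeue stops listing a job once the job has ended.

```shell
#!/bin/bash
# Sketch: submit a batch script, wait for it to leave the queue, and report
# the name of its output file. Helper names and POLL_SECONDS are
# illustrative assumptions.
POLL_SECONDS="${POLL_SECONDS:-10}"

# sbatch reports success as "Submitted batch job <id>"; keep the last word.
job_id_from_sbatch() {
  local message="$1"
  echo "${message##* }"
}

# The sample script sets --output=out_%j.txt, where %j expands to the job ID.
output_file_for_job() {
  echo "out_$1.txt"
}

# Poll squeue until the job is no longer listed (-h drops the header row).
wait_for_job() {
  local job_id="$1"
  while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
    sleep "$POLL_SECONDS"
  done
}

# Submit the script given as $1 and report where its output was written.
submit_and_wait() {
  local message job_id
  message="$(sbatch "$1")" || return 1
  job_id="$(job_id_from_sbatch "$message")"
  wait_for_job "$job_id"
  echo "Job $job_id left the queue; output in $(output_file_for_job "$job_id")"
}
```

On the login node, you could source this file and run submit_and_wait with the path to your job script. A job that leaves the queue has not necessarily succeeded, so check the output file afterwards.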
Note that for your convenience the deployment manager scripts set up NFS as part of the deployment.
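To see which directories are shared over NFS on a given node, you can inspect the mount table. The function below is a small sketch written for this post (the name list_nfs_mounts is hypothetical). It reads /proc/mounts by default and prints the mount points whose filesystem type is nfs or nfs4.

```shell
#!/bin/bash
# Sketch: list NFS mount points on the current node.
# Each line of /proc/mounts has the form
#   <device> <mount point> <fs type> <options> <dump> <pass>
# so the filesystem type is field 3 and the mount point is field 2.
list_nfs_mounts() {
  local mounts_file="${1:-/proc/mounts}"
  awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "$mounts_file"
}
```

Running list_nfs_mounts on the login node and on a compute node should print the same shared directories if the NFS setup completed.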
Check out the included README for more information. If you need help getting started with Slurm, check out the quick start guide or contact SchedMD.