Digital research alliance GPU: FAST

Introduction

A general introduction to the system is available here.
Technical support can be requested here.

Apply for a CCDB account

**The instructions below only apply if you are registering for the first time! Even users with deactivated accounts do not need to apply for a new one. They can use the renewal link.**

Academic principal investigators (full-time or part-time faculty members at a Canadian university, college or research hospital that is eligible to hold CFI grants) must register with the CCDB first and wait up to 2 business days for approval. Once they receive an RI (Role Identifier), all of their group members can register with the CCDB.

Group members go through the same registration process, except that they must indicate their sponsor's RI and do not need to provide information about the research area. The sponsor then approves each group member through the email they receive.

Registration with CCDB

Click here to go to the login page.
Click Register.
Read the policies and consents here.
After giving all the required consents, click Submit and then mark No.
Complete the form (register with your institutional email address).
You will receive a confirmation email. If not, contact: accounts@tech.alliancecan.ca
Click on the confirmation link sent to your email:
- If you are a PI, an administrator will process the account.
- If you are a group member, an email will be sent to your sponsor for confirmation.
After approval, a notification email will let you know that your account has been activated.

What can you do at the portal?

Register
Manage personal information and roles
Apply to resource allocation competitions
Manage RAP information and membership

Resource Allocation Projects (RAP)

National computational resources are accessed through Resource Allocation Projects (RAPs), each identified by a Resource Allocation Project Identifier (RAPI) and a group name. Computing resources are allocated to groups of researchers, not individuals, and facility usage statistics are reported under an identifier that carries no implicit meaning. A RAP represents the group of researchers to whom resources are allocated. There are two main types of RAPs:

Default RAP: When a PI role is activated, a default RAP is automatically created to manage quotas for storage and cloud resources. The default RAP enables PIs and sponsored users to opportunistically use compute resources at the lowest priority level. On CCDB, it uses the convention def-profname.
RAC RAP: A RAP is created for a PI when they receive an award through the RAC application process. The RAC RAPI typically takes the form abc-123-ab, with an associated group name typically of the form rrg-profname-xx or rpp-profname-xx for HPC allocations, and crg-profname-xx or cpp-profname-xx for Cloud allocations, depending on the competition.

Group Names

A group name is an alias of the Resource Allocation Project Identifier (RAPI). Each RAPI has a unique group name (one-to-one mapping), but it is often easier for users to remember the group name.

Typically, group names follow this convention (where “xx” represents some sequence of digits and letters):

Default RAP: def-[profname][-xx]
RRG/HPC resource RAP: rrg-[profname][-xx]
RPP/HPC resource RAP: rpp-[profname][-xx]
RRG/Cloud resource RAP: crg-[profname][-xx]
RPP/Cloud resource RAP: cpp-[profname][-xx]

The group name is used as a POSIX group name with an associated POSIX group ID and is propagated through LDAP in the dn attribute: dn: cn=rpp-profname,ou=Group,dc=computecanada,dc=ca
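The naming convention above can be read mechanically from the prefix. The following is a minimal sketch, assuming only the five prefixes listed in this section; `rap_type` is a hypothetical helper name, not a command provided on the clusters.

```bash
# rap_type GROUP_NAME: print the kind of RAP suggested by the group-name prefix.
# Illustrative helper only; it knows just the prefixes listed in this section.
rap_type() {
    case "$1" in
        def-*) echo "Default RAP" ;;
        rrg-*) echo "RRG/HPC resource RAP" ;;
        rpp-*) echo "RPP/HPC resource RAP" ;;
        crg-*) echo "RRG/Cloud resource RAP" ;;
        cpp-*) echo "RPP/Cloud resource RAP" ;;
        *)     echo "unknown"; return 1 ;;
    esac
}

rap_type def-profname      # prints: Default RAP
rap_type rrg-profname-ab   # prints: RRG/HPC resource RAP
```

Because group names are POSIX groups, the standard command `id -Gn` should list the groups your account belongs to once you are logged in to a cluster.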

Available national systems and software

National systems can be found here.
Available software can be found here.

You can also use the commands below to look for available software:

$ module reset : reset the modules
$ module avail : list available software
$ module spider keyword : search for module by keyword
$ module list : list loaded modules
$ module load NAME : load a module

Connecting to clusters via SSH

For Windows, the free software MobaXterm is suggested. It combines:
- SSH client (to log in to systems)
- SFTP client (to copy files)
- X Window server (to run graphical applications)
Linux and Mac users can use command-line tools such as ssh, scp and rsync.
- **For Mac only**: running graphical applications remotely requires the free software XQuartz.

Accessing and managing files

/home: 50 GB, 500K files; backed up regularly
/project: 1 TB (extendable to 10 TB) per group, 500K files; backed up
/scratch: 20 TB per user (up to 100 TB), 1M files; files have a 2-month lifetime
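Since each space limits both total size and number of files, it can help to check a directory against both. The sketch below uses only the standard tools `find`, `wc` and `du`; `usage_summary` is a hypothetical helper name, and the clusters may offer their own reporting tools that are more accurate.

```bash
# usage_summary DIR: print the number of regular files and the total size under DIR.
usage_summary() {
    local dir="$1"
    local nfiles
    nfiles=$(find "$dir" -type f | wc -l)
    # $(( )) strips the padding that some wc implementations add
    echo "files: $((nfiles))"
    echo "size:  $(du -sh "$dir" | cut -f1)"
}

# Summarize the current directory.
usage_summary .
```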

Submitting a job

Jobs are submitted to the SLURM scheduler with a job script:

$ sbatch myjob.sh

#!/bin/bash

<specify resource requirements>
<load necessary modules>
<setup environment>
<run your program>
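As an illustration, the four placeholders above could be filled in as follows. This is a sketch under stated assumptions, not a recommended configuration: the account name, time limit and resource sizes are placeholder values, `requirements.txt` and `train.py` are hypothetical files, and the `--no-index` flags assume the cluster provides pre-built Python packages. The snippet writes the script to myjob.sh and checks its syntax; submission would then use sbatch as shown above.

```bash
# Write an example job script to myjob.sh (the quoted 'EOF' keeps $VARS literal).
cat > myjob.sh <<'EOF'
#!/bin/bash
# --- specify resource requirements (placeholder values) ---
#SBATCH --account=def-profname
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gpus-per-node=1

# --- load necessary modules ---
module load python

# --- set up the environment on node-local storage ---
virtualenv --no-download "$SLURM_TMPDIR/env"
source "$SLURM_TMPDIR/env/bin/activate"
pip install --no-index --upgrade pip
# requirements.txt is a hypothetical file listing your packages
pip install --no-index -r requirements.txt

# --- run your program (train.py is a hypothetical script) ---
python train.py
EOF

# Check the script for syntax errors without running it.
bash -n myjob.sh && echo "myjob.sh: syntax OK"
```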

Useful SLURM commands:

$ squeue -u USERNAME : check running/pending jobs
$ sq : short form of squeue for your own jobs
$ sacct : show past jobs
$ scancel JOBID : cancel a job
$ salloc <params> : start a short interactive job (24-hour limit)

AI and machine learning

This hyperlink leads to a webpage containing a tutorial on machine learning and detailed guidelines on how to set up your virtual environment.

Clusters used for machine learning applications require special care due to their differences from local machines. Clusters use a distributed filesystem and accessing files has different performance implications. AI practitioners should choose wisely where to put their data and follow good practices listed in the sections below:

Avoid Anaconda

Using Anaconda is discouraged; use virtualenv instead.

How to install software packages

Managing the dataset

Storage and file management

The clusters offer diverse storage options to accommodate the needs of different users, ranging from high-speed temporary local storage to long-term storage. Users can choose the storage medium that best suits their needs by referring to the documentation on Storage and file management.

Choosing the right storage type for your dataset

Datasets of around 10 GB or less can fit in memory.
- Avoid reading data from disk during machine learning tasks.
Datasets of around 100 GB or less should be transferred to the local storage of the compute node at the beginning of the job.
- Local storage is faster and more reliable than shared storage.
- A temporary directory is available for each job at $SLURM_TMPDIR.
Datasets larger than 100 GB should be left in shared storage.
- Project space can be used for permanent storage.
- Scratch space is not for permanent storage.
- Shared storage is for low-frequency storing and reading.
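For the middle case, staging a dataset onto node-local storage at the start of a job might look like the sketch below. It assumes the dataset is stored as a single tar archive; `stage_dataset` is a hypothetical helper name, and the archive path in the usage comment is a placeholder.

```bash
# stage_dataset ARCHIVE: unpack a tar archive into node-local storage and
# print the directory that now holds the data.
stage_dataset() {
    local archive="$1"
    # Inside a job, SLURM_TMPDIR is set by the scheduler; the mktemp fallback
    # only exists so the function can be tried outside a job.
    local base="${SLURM_TMPDIR:-$(mktemp -d)}"
    mkdir -p "$base/data"
    tar -xf "$archive" -C "$base/data"
    echo "$base/data"
}

# Inside a job script (hypothetical archive path):
# DATA_DIR=$(stage_dataset ~/projects/def-profname/$USER/dataset.tar)
```

The training program can then read from the printed directory instead of from shared storage.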

Datasets containing lots of small files (e.g. image datasets)

In machine learning, large collections of small files can be problematic: they can exceed filesystem quotas and slow software down. It is recommended to store such data in large single-file archives on a distributed filesystem. More information can be found in the documentation on Handling large collections of files.
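A minimal sketch of the single-archive approach, using plain `tar` (other archive formats are possible and may suit some workloads better). `pack_dataset` is a hypothetical helper name and the demo data is generated on the spot.

```bash
# pack_dataset DIR ARCHIVE: bundle everything under DIR into one tar archive.
pack_dataset() {
    local dir="$1" archive="$2"
    # -C makes the paths stored in the archive relative to DIR's parent
    tar -cf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Example with throwaway data: 100 small files become a single archive.
demo=$(mktemp -d)
mkdir "$demo/images"
for i in $(seq 1 100); do
    echo "pixel data $i" > "$demo/images/img_$i.txt"
done
pack_dataset "$demo/images" "$demo/images.tar"

# Count the packed files without extracting them (prints 100).
tar -tf "$demo/images.tar" | grep -c 'img_'
```

The archive counts as one file against the quota, while the original directory counted as one hundred.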

Long running computations

Checkpointing is recommended for long computations: it prevents losing all the work in case of an outage, and it lets you split the computation into shorter jobs, which get higher priority. Most machine learning libraries support checkpointing, and a general checkpointing solution is available if needed.

Checkpointing with PyTorch
Checkpointing with TensorFlow
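Independently of any library, the checkpoint-and-resume pattern itself can be sketched in a few lines. The example below is a generic illustration, not the PyTorch or TensorFlow mechanism: the "work" is a stand-in counter, and `run_steps` is a hypothetical helper name.

```bash
# run_steps CKPT TOTAL: do TOTAL steps of (stand-in) work, saving progress
# in CKPT after each step and resuming from CKPT if it already exists.
run_steps() {
    local ckpt="$1" total="$2" step=0
    if [ -f "$ckpt" ]; then
        step=$(cat "$ckpt")
    fi
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... one unit of real work (e.g. one training epoch) would go here ...
        # Write to a temporary file and rename it, so an interruption
        # never leaves a half-written checkpoint behind.
        echo "$step" > "$ckpt.tmp"
        mv "$ckpt.tmp" "$ckpt"
    done
    echo "finished at step $step"
}

ckpt=$(mktemp -d)/progress.ckpt
echo 3 > "$ckpt"       # pretend an earlier job stopped after step 3
run_steps "$ckpt" 5    # resumes at step 4 and prints: finished at step 5
```

A job that hits its time limit can then be resubmitted with the same checkpoint file and continue where it stopped.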

Large-scale machine learning (big data)

PyTorch and TensorFlow provide utilities to handle large-scale training natively, but scaling classic machine learning methods is not as widely discussed and can be a frustrating problem to solve. This is a guide that might help.

References

Technical documentation

Page data
Keywords	safety, laboratory
Authors	Kimia Ketabforoosh
License	CC-BY-SA-4.0
Organizations	FAST
Language	English (en)
Created	May 5, 2023 by Kimia Ketabforoosh
Modified	January 29, 2024 by Felipe Schenone
Cite as	Kimia Ketabforoosh (2023–2024). "Digital research alliance GPU: FAST". Appropedia. Retrieved July 27, 2024.