Kaen-Coon AI Cluster by College of Computing, Khon Kaen University

Kaen-Coon AI Cluster is a high-performance computing (HPC) platform, powered by Slurm and LiCO, providing resources for large-scale research and artificial intelligence workloads. This documentation serves as a tutorial and brief guide to using the Kaen-Coon AI Cluster, for both users and administrators.

Registration can be found here.

Eligibility

  • College of Computing students: available at no cost, with a project proposal approved by the advisor.
  • College of Computing researchers and lecturers: available at no cost.
  • Students elsewhere within Khon Kaen University: available with a project proposal approved by the advisor, subject to hourly fees.
  • People outside Khon Kaen University: not currently available; to be announced.

Contribution

Please refer to this page

Glossary

CUDA
Compute Unified Device Architecture. NVIDIA's parallel-computing platform, available on NVIDIA GPUs. Programs interact with it through the CUDA driver and the CUDA toolkit, typically from C/C++ or through language bindings (including Python libraries).
H100
One of NVIDIA’s data-center GPU lineups (Hopper generation).1
MIG
Multi-Instance GPU. A partitioning system for NVIDIA GPUs that allows multiple isolated instances to run simultaneously on one physical GPU, each with dedicated resources. 2
Slurm
A workload management and job scheduling system for multiple computing nodes.
LiCO
Web-based software developed by Lenovo. It provides access to Slurm through a web interface, instead of the usual Slurm command-line tools.

Slurm Glossaries

Job
A unit of workload for Slurm to process; usually expressed in a shell script.
Node
One single unit of computer within a cluster.
Cluster
A group of computing nodes.
Container
Operating-system-level virtualization that isolates a process from other processes. In Slurm, containers are another way of providing the required packages within the job environment.
Partition
A grouping of nodes. Each partition may have different specifications.
Queue
A waiting line for jobs. Some queues may have priorities (preemptive jobs) or access to specific partition of the cluster.
Wall Time
The maximum period of time a job is allowed to run. If the job is still running when its running time exceeds the wall time, Slurm stops the job immediately by sending SIGTERM.
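The terms above come together in a job script. The sketch below is a minimal illustration, not a cluster-specific recipe: the partition name `gpu` and the time limit are placeholder values.

```shell
#!/bin/bash
# Minimal Slurm job script illustrating the glossary terms.
# (The partition name "gpu" is a placeholder; use a partition/queue you can access.)
#SBATCH --job-name=example_job   # Job: a unit of workload, expressed as a shell script
#SBATCH --partition=gpu          # Partition: a grouping of nodes
#SBATCH --nodes=1                # Node: one computer within the cluster
#SBATCH --time=01:00:00          # Wall Time: the job is stopped after 1 hour

echo "Job running on node: $(hostname)"
```

The `#SBATCH` lines are comments to the shell but are read by Slurm as resource requests when the script is submitted.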

Additional Slurm terminologies/concepts can be found here.


  1. https://www.nvidia.com/en-us/data-center/h100/

  2. https://www.nvidia.com/en-us/technologies/multi-instance-gpu/

System Architecture

  • lico: login node
  • cn1: computing node
    • 2x AMD EPYC 9224: 24 Core 2.5 GHz
    • 16x 64 GB memory
    • 2x 960GB M.2 NVMe SSD + 1x 15.36TB NVMe SSD
    • 4x NVIDIA H100 80GB HBM3
      • 1 Full 80GB
      • 3 (7x 10GB MIGs)
  • cn2: computing node
    • 2x AMD EPYC 9224: 24 Core 2.5 GHz
    • 16x 64 GB memory
    • 2x 960GB M.2 NVMe SSD + 1x 15.36TB NVMe SSD
    • 4x NVIDIA H100 80GB HBM3
      • 4 Full 80GB

Cluster nodes diagram

For Users

To interact with the cluster, go to https://10.198.253.15:8000. All operations happen on this website through the web interface.

Note

Currently, the cluster portal uses a self-signed certificate, so the browser may flag the website as a potential threat. It is safe to proceed.

Note

Make sure that you already obtained username/password for the cluster. Registration can be found here.

Before running a program, we must first upload scripts and data to the cluster, so that there is something to run and compute when the job starts.

Then, we can create and submit a job to complete the desired workload.

If the job needs a specific environment, we can create a container and run the job inside it, allowing the script to use custom programs.

Files Operations

LiCO provides a tree view of the file directories, making file operations easy through the web interface.

  1. Click the console icon button on the upper-right navigation bar.
  2. If the file tree does not appear, click the expand tab on the lower-left side.
  3. The file tree should appear on the left side.

Creating/Uploading files

Click the create directory or create file icon to create a directory or a file at the current path.

You can also drag and drop files/directories from your computer into the file tree window. The file(s) will be transferred from your computer to the cluster automatically.

Editing files

  1. Right-click on the file inside the file tree.
  2. Select “Edit”.
  3. The web page will show a buffer window; edit the file there.
  4. Click “Save” to save the change.

Tip

It is often more convenient to edit the file(s) on your computer with your preferred IDE, and then overwrite the file(s) on the cluster by uploading.

Deleting files

  1. Select the file(s) to be deleted from the file tree.
  2. Click delete icon button.
  3. Confirm deletion.

Advanced

Since the cluster is based on Slurm, files can also be transferred directly with scp or rsync, treating the cluster like any other remote server.

scp -r your/project/directory your_username@10.198.253.15:~/your/working/directory
rsync -avPh your/project/directory your_username@10.198.253.15:~/your/working/directory

Interactive file editing on the cluster can also be done over ssh with the given credentials.

Submitting a Job

Create a Job

Before creating a job, you may want to transfer some script(s) and dataset to the cluster.

  1. Go to the menu item “Job Templates” on the left side.
  2. Select a suitable template for your job (usually Common Job). Here are some frequently used templates:
    • Common Job: covers most needs for running bare scripts.
    • Singularity: runs jobs inside custom containers.
    • PyTorch: exposes the PyTorch library for the job to use.
  3. Fill in the job details:
    • Template Information:
      • Job Name: the job name; alphanumeric characters and underscores only.
      • Workspace: the directory where the job will start. Log files will also be generated here.
    • Resource Options:
      • Queue: choose a queue you are allowed to use.
      • Exclusive: whether to disallow other jobs from running on the same node. Keep this unchecked.
      • Node: the number of computers to run the job on; usually 1.
      • CPU Cores Per Node: the number of CPU cores per computer. Must be under the limit.
      • GPU per Node: either 0 or 1.
        • GPU Per Node: the number of GPUs per computer; usually 1.
        • GPU Resource Type: gpu:STU_GPU if you are a student (MIG); gpu:Full_GPU if you are a lecturer/researcher.
      • Memory Used: the memory per job, in MB. Must be under the limit.
      • Wall Time: the maximum duration the job is allowed to run. Must be under the limit.
  4. Click “Submit” to create the job and submit it to the cluster.
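Behind the web form, LiCO submits a Slurm batch script. As a rough sketch of how the form fields map to Slurm directives (the queue name `normal`, the resource values, and the final command are placeholders; the gres names come from the GPU Resource Type field above):

```shell
#!/bin/bash
# Rough sbatch equivalent of the LiCO job form. All values are examples;
# "normal" is a placeholder queue name -- use a queue you are allowed on.
#SBATCH --job-name=my_job_01      # Job Name (alphanumeric and underscores)
#SBATCH --partition=normal        # Queue
#SBATCH --nodes=1                 # Node
#SBATCH --cpus-per-task=3         # CPU Cores Per Node (student limit)
#SBATCH --gres=gpu:STU_GPU:1      # GPU Resource Type (student MIG)
#SBATCH --mem=8192                # Memory Used, in MB
#SBATCH --time=02:00:00           # Wall Time (hh:mm:ss)

echo "replace this line with your program, e.g. python train.py"
```

This mapping is only for orientation; on this cluster you normally submit through the web interface rather than with sbatch directly.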

Monitor a Job

After submitting a job, the job will go through a queue and then be run by a node.

Most monitoring happens on the “Job Monitoring” page, reached from the menu on the left side. After entering, there will be three tabs.

Queue type

  • Running: jobs currently being run by nodes are shown here.
  • Waiting: jobs being held in the queue.
  • Completed: the history of jobs already run.

Tip

To inspect why your job is still queueing, go to the “Waiting” tab, click the waiting job’s scheduler ID, and check the value of the Reason field. Under normal circumstances, the reason simply indicates that the job is waiting for its turn (for example, Priority or Resources).

Assuming the job is out of queue and running, you can monitor it by:

  1. Click the “Running” tab. Your running jobs will be listed here (once they have passed the queue).
  2. Click on the job name to see further details of the running job. There are tabs for each aspect:
    • Log: shows all stdout from the running job. Useful for tracking its current state.
    • Resource: shows all resources used by the running job.

After the job has finished running:

  1. Click the “Completed” tab. You will see the history of completed jobs.
  2. Click on the job name to see further details of that job.
  3. Click the “Log” tab to show the complete stdout from that job.
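For those who prefer the command line (over ssh, as described under Files Operations), standard Slurm tools such as squeue and sacct cover the same monitoring tasks. The snippet below is guarded so it is a no-op off the cluster; the job ID 12345 is a placeholder:

```shell
# Run these on the cluster's login node over ssh.
# (Guarded with `command -v` so the snippet does nothing on machines without Slurm.)
if command -v squeue >/dev/null; then
    # list your queued and running jobs; the REASON column explains waiting jobs
    squeue -u "$USER"
fi
if command -v sacct >/dev/null; then
    # show the state and elapsed time of a job (12345 is a placeholder job ID)
    sacct -j 12345 --format=JobID,JobName,State,Elapsed
fi
```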

Copy a Job

In case you want to rerun a job or modify a few of its details, you can quickly copy the details from the previous job, so you do not have to recreate a new job from scratch.

  1. Go to the “Running”, “Waiting”, or “Completed” tab, whichever has the job.
  2. Click the “Action” button on the right side of the job you intend to copy.
  3. Click “Copy”; this will create another job with the details pre-filled.
  4. Edit the details, then click “Submit” to submit the new job.

Creating a Container

To be added.

Limitations

Here are the current resource limitations for each user role.

Resource      Student    Researcher/Lecturer
CPU/Job       3 cores    16 cores
Memory/Job    TBD        36 GiB
Wall Time     3 Hours    5 Days
GPU/Job       1 MIG      1 Full Card

Important

Make sure that your job’s resource requests do not exceed the given limitations. The job might be stuck in the queue forever if the requested resources exceed your role’s quotas.

Frequent Problems

Why does my submitted job appear to be stuck in the queue forever?

One way to diagnose the problem is to click on the Scheduler ID and look at the scheduler’s description. If the Reason variable is:

  • QOSMaxWallDurationPerJobLimit: your job requested too long a wall time and will stay in the queue forever. Please cancel the current job, copy it, and adjust the wall time to be under your role’s limit.

Why can’t my job use the available GPU?

Especially for student groups, only MIG GPUs are allowed. By default, scripts usually request full GPUs for computation, not MIG instances. Since the job is only allocated MIG GPUs, requesting a full GPU inside the job can be problematic. To make your job use MIG GPUs, an environment variable must be set: declare CUDA_VISIBLE_DEVICES as a MIG GPU ID, so that the “CUDA device” is a MIG instance. You can copy the following lines into your script before the line that executes your program.

export CUDA_VISIBLE_DEVICES=$(
    nvidia-smi -L | \
        grep -Eo 'MIG-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | \
        head -n 1
)

The script lists all available GPUs with their UUIDs, grabs the first MIG GPU ID (the one Slurm allocated for us), and declares it as the CUDA device to be used.
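You can sanity-check the regular expression above against a sample of `nvidia-smi -L`-style output. The UUIDs below are made up for illustration; real output on the cluster will differ:

```shell
# Sample lines in the format printed by `nvidia-smi -L` (UUIDs are made up).
sample='GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 1g.10gb Device 0: (UUID: MIG-0a1b2c3d-4e5f-6789-abcd-ef0123456789)'

# Extract the first MIG UUID; the plain GPU UUID is skipped because it
# lacks the "MIG-" prefix required by the pattern.
echo "$sample" | \
    grep -Eo 'MIG-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | \
    head -n 1
# prints: MIG-0a1b2c3d-4e5f-6789-abcd-ef0123456789
```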

For Administrator

To be added.

Reference

  • AI Cluster Project Report by PTOA: PDF
  • Cluster Manual for Researcher, by PTOA: Thai PDF
  • Cluster Manual for Students, by PTOA: Thai PDF

Contribute

The whole documentation is written in markdown files at https://github.com/anonymaew/kaen-coon-doc.

If you find a page missing information or needing edits, you can edit the page by clicking the pencil icon in the top-right corner. This will fork the repository, allowing you to edit the documentation. Then, please submit a pull request.

You are also welcome to leave comments and suggestions without writing documentation by creating a new issue.

Workflow