Easy Training Reproduction: A Practical Guide for Sapientinc HRM

by JurnalWarga.com

Hey guys! Ever felt the frustration of trying to reproduce training runs, battling with incompatible package versions, and spending ages compiling Flash Attention? It's a real time-sink, right? Well, this guide is designed to help you sidestep those headaches. We'll walk through a straightforward approach to reproduce training setups quickly and efficiently, focusing on avoiding those time-consuming compilation steps. Let's dive in!

The key to streamlining this process is leveraging pre-compiled binaries. These are your secret weapon for dodging lengthy compilations. You can check out the available pre-compiled binaries here for Flash Attention. For a deeper dive into why this works, check out this discussion. This guide focuses on utilizing these pre-compiled options to make your life easier. Currently, we're sticking with PyTorch 2.7 because, at the time of writing, there isn't a pre-compiled wheel binary for Flash Attention for PyTorch 2.8. Of course, if you're not worried about compilation times, feel free to use PyTorch 2.8!
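To make the wheel-matching point concrete, here is a small hypothetical helper (not part of any repo) that builds the platform tag you would look for among the pre-compiled Flash Attention wheels. The tag format mirrors the naming convention used on the Flash Attention releases page at the time of writing; treat it as illustrative rather than authoritative:

```python
# Hypothetical helper: build the platform tag to look for among pre-compiled
# Flash Attention wheels. Wheels are published per (CUDA, PyTorch, Python,
# C++ ABI) combination, which is why this guide pins PyTorch 2.7 -- there is
# no wheel tagged for 2.8 yet.
def flash_attn_wheel_tag(cuda: str, torch: str, py: str, cxx11abi: bool) -> str:
    abi = "TRUE" if cxx11abi else "FALSE"
    return f"cu{cuda}torch{torch}cxx11abi{abi}-cp{py}-cp{py}-linux_x86_64"

# A PyTorch 2.7 / CUDA 12 / CPython 3.10 environment would match:
print(flash_attn_wheel_tag(cuda="12", torch="2.7", py="310", cxx11abi=False))
# cu12torch2.7cxx11abiFALSE-cp310-cp310-linux_x86_64
```

The point is simply that each wheel is built against one specific PyTorch version, so switching to PyTorch 2.8 means no tag matches and pip falls back to compiling from source.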

Training on B100 and B200 GPUs

For those of you working with B100 and B200 GPUs, this section is your go-to. The recommended Docker image is pytorch/pytorch:2.7.0-cuda12.6-cudnn9-devel, which bundles PyTorch 2.7 with matching CUDA and cuDNN versions. Starting from a pre-built image keeps your environment consistent across machines, avoids the classic "it works on my machine" problem, and spares you from assembling a complex stack of libraries and frameworks by hand, so you can focus on the training itself.

After logging into your instance, you only need four commands. pip install flash_attn installs Flash Attention, an optimized attention implementation that reduces memory usage and computational cost, which speeds up transformer training and lets you fit larger models. git clone https://github.com/sapientinc/HRM and cd HRM/ fetch the project (code, training scripts, and configuration files) and move into its directory. Finally, pip install -r requirements.txt installs the remaining Python dependencies the training scripts expect.

With the environment set up, read the README in the HRM repository before launching anything: it describes the available training tasks, their datasets and parameters, and how to monitor progress, save checkpoints, and evaluate your models. One field you'll need to adapt is --nproc-per-node, which sets how many GPU processes the launcher starts. Match it to the number of GPUs on your machine: more processes spread the workload and shorten training, but asking for more than your hardware (or its memory) can handle will cause failures. As a reference point, the Sudoku Extreme 1k task takes around 1 hour and 30 minutes on a single B100; actual times vary with model size, dataset size, and training parameters.

pip install flash_attn
git clone https://github.com/sapientinc/HRM
cd HRM/
pip install -r requirements.txt

Then follow the instructions in the README file for your chosen training task, adapting the --nproc-per-node argument to match the number of GPUs you're using.

Training on RTX 5090 GPUs

Now, let's talk about RTX 5090 GPUs. These cards require CUDA 12.8 as a minimum, and the adam-atan2 package isn't compatible with this setup, so we'll swap in adam-atan2-pytorch instead. This kind of small substitution is common when working with cutting-edge hardware; it lets you keep the same optimizer while staying compatible with the 5090. The recommended Docker image here is pytorch/pytorch:2.7.0-cuda12.8-cudnn9-devel, which pairs PyTorch 2.7 with CUDA 12.8 and cuDNN 9.

The command sequence below mirrors the B100/B200 setup: install Flash Attention with pip install flash_attn, clone the HRM repository, and enter its directory. On top of that, there are a few fix-ups for the optimizer swap. Before installing the requirements, sed -i 's/adam-atan2/adam-atan2-pytorch/g' requirements.txt rewrites the dependency list so that pip install -r requirements.txt pulls in the compatible package. The remaining sed commands patch pretrain.py to match: the module rename (adam_atan2 to adam_atan2_pytorch) and the class rename (AdamATan2 to AdamAtan2) account for the replacement package's different naming and casing, and the last edit changes the optimizer's initial learning rate from lr=0, to lr=0.0001, — most likely needed because the replacement optimizer does not accept a zero learning rate as a placeholder value, so a small nonzero one is used instead.

Use this docker image: pytorch/pytorch:2.7.0-cuda12.8-cudnn9-devel

Then login to the instance and run these commands:

pip install flash_attn
git clone https://github.com/sapientinc/HRM
cd HRM
sed -i 's/adam-atan2/adam-atan2-pytorch/g' requirements.txt 
pip install -r requirements.txt
sed -i 's/adam_atan2/adam_atan2_pytorch/g' pretrain.py
sed -i 's/AdamATan2/AdamAtan2/g' pretrain.py
sed -i 's/lr=0,/lr=0.0001,/g' pretrain.py
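To see exactly what the pretrain.py edits amount to, here is the same set of substitutions applied in Python to representative lines. The actual contents of pretrain.py may differ; this only illustrates the renames:

```python
import re

# Representative lines of the kind the sed commands rewrite; the real
# contents of pretrain.py may differ.
imp = "from adam_atan2 import AdamATan2"
imp = re.sub(r"adam_atan2", "adam_atan2_pytorch", imp)  # module rename
imp = re.sub(r"AdamATan2", "AdamAtan2", imp)            # class-name casing
print(imp)  # from adam_atan2_pytorch import AdamAtan2

call = "AdamATan2(model.parameters(), lr=0, weight_decay=1.0)"
call = re.sub(r"AdamATan2", "AdamAtan2", call)          # class-name casing
call = re.sub(r"lr=0,", "lr=0.0001,", call)             # nonzero initial lr
print(call)  # AdamAtan2(model.parameters(), lr=0.0001, weight_decay=1.0)
```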

Again, follow the README instructions for your specific training task, adapting --nproc-per-node as needed. On a single 5090, Sudoku Extreme 1k takes less than 2 hours. You can shave that down to around 1 hour and 35 minutes by setting eval_interval to the same value as epochs (20000 by default), so evaluation runs only once at the end instead of periodically during training.
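For reference, that end-of-run-only evaluation would look something like the command below. The Hydra-style key=value overrides and the data_path follow the pattern in the HRM README, but double-check both against the README for your task — they are assumptions here, not something this guide verified:

```shell
# Sketch only: evaluate once at the end instead of every eval_interval epochs.
# Flag names and data_path follow the HRM README's conventions; adjust for
# your task and GPU count.
OMP_NUM_THREADS=8 torchrun --nproc-per-node 1 pretrain.py \
    data_path=data/sudoku-extreme-1k-aug-1000 \
    epochs=20000 eval_interval=20000
```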

So there you have it! By following these steps and leveraging pre-compiled binaries, you can significantly reduce the time and hassle involved in reproducing training runs. Happy training!