Punica is a very interesting project that shows how to run multiple LoRA models on a single GPU. A few things need to be done to make this project work locally and to avoid issues like:
- _kernels.rms_norm(o, x, w, eps) RuntimeError: output must be a CUDA tensor
- /torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
- raise RuntimeError(message) from e
- RuntimeError: Error compiling objects for extension
- error: subprocess-exited-with-error
- the rich module not being installed, and so on
Here are the steps:
1. Change the NVCC version; I downgraded mine to CUDA 12.1. You can check your current version with nvcc --version.
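If you have several CUDA toolkits installed side by side, a minimal way to point your shell at 12.1 (assuming the default install location /usr/local/cuda-12.1):
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
nvcc --version  # should now report release 12.1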
2. Install GCC and G++ (version 10)
MAX_GCC_VERSION=10
sudo apt install gcc-$MAX_GCC_VERSION g++-$MAX_GCC_VERSION
# Register version 10 as the default gcc AND g++ so nvcc picks both up
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-$MAX_GCC_VERSION $MAX_GCC_VERSION
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-$MAX_GCC_VERSION $MAX_GCC_VERSION
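A quick check that both compilers now resolve to version 10:
gcc --version | head -n1  # should report gcc 10.x
g++ --version | head -n1  # should report g++ 10.x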
3. Install the right torch version based on your CUDA version
pip install torch==2.5.1+cu121 --index-url https://download.pytorch.org/whl/cu121
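A quick sanity check that the installed wheel matches your toolkit; this should print 2.5.1+cu121, 12.1, and True:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"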
4. Build from source!
pip install ninja numpy torch
# Clone punica
git clone https://github.com/punica-ai/punica.git
cd punica
git submodule sync
git submodule update --init
# If you encounter problems during compilation, set TORCH_CUDA_ARCH_LIST to your GPU architecture.
# I'm using an RTX 4090 (Ada), so the compute capability is 8.9. Check yours, e.g.:
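# One way to check: ask PyTorch for your GPU's compute capability
python -c "import torch; print(torch.cuda.get_device_capability())"  # an RTX 4090 prints (8, 9)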
export TORCH_CUDA_ARCH_LIST="8.9"
# Build and install punica
pip install -v --no-build-isolation .
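Once the build finishes, a minimal smoke test is simply importing the package (note: some of Punica's CUDA kernels may only be loaded on first use, so this mainly confirms the install itself):
python -c "import punica; print('punica imported OK')"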
Why does building from source work? Because Punica relies on SGMV (Segmented Gather Matrix-Vector multiplication), a new custom CUDA kernel design that has to be compiled for your specific GPU architecture and CUDA toolkit.