Fix the problem with running Vertex AI local-run using the GPU-based training Docker image asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest,
which produces an error with the Transformers Trainer() when run with:
gcloud ai custom-jobs local-run --gpu --executor-image-uri=asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest --local-package-path=YOUR_PYTHON_PACKAGE --script=YOUR_SCRIPT_PYTHON_FILE
The following error appears:
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of
Transformers. Use `eval_strategy` instead
warnings.warn(
Setting up Trainer...
Starting training...
0%| | 0/3060 [00:00<?, ?it/s]terminate called after throwing an instance of 'std::runtime_error'
what(): torch_xla/csrc/runtime/runtime.cc:31 : $PJRT_DEVICE is not set.
exit status 139
ERROR: (gcloud.ai.custom-jobs.local-run)
Docker failed with error code 139.
Command: docker run --rm --runtime nvidia -v -e --ipc host
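The failing line, what(): torch_xla/csrc/runtime/runtime.cc:31 : $PJRT_DEVICE is not set.,
is apparently caused by a PyTorch issue: the path in the message points to torch_xla (PyTorch/XLA), not to the training script itself.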
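
One possible workaround, offered as a sketch and not confirmed by the report above, is to give torch_xla the variable it complains about before the training script imports torch_xla or transformers. The helper name ensure_pjrt_device is hypothetical, and the value "CUDA" is an assumption about what the torch_xla build in this image accepts; it has not been tested against this container.

```python
import os


def ensure_pjrt_device(default: str = "CUDA") -> str:
    """Set PJRT_DEVICE only if it is not already set; return the value in effect.

    "CUDA" is an assumed value for a GPU container, not something taken
    from the error log.
    """
    # setdefault leaves any value already exported by the caller untouched.
    return os.environ.setdefault("PJRT_DEVICE", default)


# Run this at the very top of the training script, before any import of
# torch_xla or transformers, so the runtime sees the variable at start-up.
device = ensure_pjrt_device()
print(f"PJRT_DEVICE={device}")
```

If the script does not need XLA at all, removing torch_xla from the environment may be a simpler route, but that too is an assumption to verify against the image.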