
Fix Vertex AI Custom Job torch_xla "$PJRT_DEVICE is not set" Error

This post fixes a problem with running a Vertex AI custom job locally (local-run) on the GPU-based PyTorch training container asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest. Training with the Transformers Trainer() fails when the job is started with:

gcloud ai custom-jobs local-run --gpu --executor-image-uri=asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest --local-package-path=YOUR_PYTHON_PACKAGE --script=YOUR_SCRIPT_PYTHON_FILE

The following error appears:

/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Setting up Trainer...
Starting training...
  0%|          | 0/3060 [00:00<?, ?it/s]terminate called after throwing an instance of 'std::runtime_error'
  what():  torch_xla/csrc/runtime/runtime.cc:31 : $PJRT_DEVICE is not set.

exit status 139
ERROR: (gcloud.ai.custom-jobs.local-run) 
        Docker failed with error code 139.
        Command: docker run --rm --runtime nvidia -v -e  --ipc host 

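The line what(): torch_xla/csrc/runtime/runtime.cc:31 : $PJRT_DEVICE is not set. points to PyTorch/XLA (torch_xla), which is installed in this PyTorch image. Its runtime aborts because the $PJRT_DEVICE environment variable is not set, so the problem appears to lie in the PyTorch image rather than in the training code.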
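To confirm this cause inside the container, a quick check can be run before importing transformers. This is a minimal sketch: the helper names torch_xla_installed and pjrt_device_missing are hypothetical, and the only assumptions are that torch_xla is importable under the module name torch_xla (as the path in the error suggests) and that the runtime reads the PJRT_DEVICE environment variable.

```python
import importlib.util
import os
from typing import Mapping


def torch_xla_installed() -> bool:
    """Hypothetical helper: True if the torch_xla package can be found.

    find_spec locates the package without importing it, so this check
    cannot itself trigger the torch_xla runtime error.
    """
    return importlib.util.find_spec("torch_xla") is not None


def pjrt_device_missing(has_torch_xla: bool, env: Mapping[str, str]) -> bool:
    """Hypothetical helper: True when torch_xla is present but PJRT_DEVICE is unset.

    This is the combination reported by the error above.
    """
    return has_torch_xla and not env.get("PJRT_DEVICE")


if __name__ == "__main__":
    if pjrt_device_missing(torch_xla_installed(), os.environ):
        print("torch_xla is installed but PJRT_DEVICE is not set")
    else:
        print("No torch_xla / PJRT_DEVICE conflict detected")
```

Passing the environment in as a mapping keeps the check easy to try against different settings without touching os.environ.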

Switching the image to tf-gpu.2-17.py310:latest produces a different error:

RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
Failed to import transformers.integrations.integration_utils because of the following error (look up to see its traceback):
Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.
exit status 1
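The rule stated in the error above (Keras 3 is rejected unless the backwards-compatible tf-keras package is installed) can be checked before training starts. This is a minimal sketch: installed_version and keras_ok_for_transformers are hypothetical helpers, and it assumes the distribution name "keras" and that tf-keras installs the importable module tf_keras.

```python
import importlib.metadata
import importlib.util
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


def keras_ok_for_transformers(keras_version: str, has_tf_keras: bool) -> bool:
    """Hypothetical helper mirroring the error message:
    Keras 2 is accepted, Keras 3 only if tf-keras is also installed."""
    major = int(keras_version.split(".")[0])
    return has_tf_keras or major < 3


if __name__ == "__main__":
    keras_version = installed_version("keras")
    # tf-keras provides the module "tf_keras"; find_spec does not import it.
    has_tf_keras = importlib.util.find_spec("tf_keras") is not None
    if keras_version is None:
        print("keras is not installed")
    elif keras_ok_for_transformers(keras_version, has_tf_keras):
        print(f"Keras {keras_version}: OK for Transformers' TensorFlow code")
    else:
        print(f"Keras {keras_version} is Keras 3 and tf-keras is missing")
```

Running this inside a candidate image shows in advance whether the Transformers import will fail with the Keras 3 error.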

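The solution is simple: use an older TensorFlow GPU-based Vertex AI training container. TensorFlow 2.16 and later default to Keras 3, while TensorFlow 2.15 and earlier ship Keras 2, which Transformers still supports.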

These images work with the latest Transformers:

  • asia-docker.pkg.dev/vertex-ai/training/tf-gpu.2-15.py310:latest
  • tf-gpu.2-12.py310:latest also works
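If the job is launched from a Python script, the same local-run command can be rebuilt with one of these images as the executor. This is a sketch that reuses only the flags from the command at the top of this post; local_run_command is a hypothetical helper, and the package path and script name in the example are placeholders.

```python
import shlex
from typing import List

# A working image from the list above.
TF_GPU_IMAGE = "asia-docker.pkg.dev/vertex-ai/training/tf-gpu.2-15.py310:latest"


def local_run_command(package_path: str, script: str,
                      image_uri: str = TF_GPU_IMAGE) -> List[str]:
    """Hypothetical helper: build the gcloud local-run argv with a chosen image."""
    return [
        "gcloud", "ai", "custom-jobs", "local-run",
        "--gpu",
        f"--executor-image-uri={image_uri}",
        f"--local-package-path={package_path}",
        f"--script={script}",
    ]


if __name__ == "__main__":
    # Placeholder package path and script; replace with your own.
    cmd = local_run_command("trainer_pkg", "task.py")
    print(shlex.join(cmd))
    # To launch it: import subprocess; subprocess.run(cmd, check=True)
```

Building the arguments as a list avoids shell quoting issues when the command is passed to subprocess.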
