How to train PyTorch Neural Networks on a TPU with multi-core processing on Kaggle?


I've been trying to train the resnet50 model provided by PyTorch on a Kaggle TPU v5e-8 using the torch_xla package, because my Kaggle GPU quota is almost depleted, but I keep running into problems. Training on a single TPU core works; the moment I try multi-core processing, all hell breaks loose. For reference, I've used the code snippets from XLA's GitHub page as well as the official PyTorch XLA docs, but I still get the same error every time, and AI assistants haven't been helpful either. Passing the debugging argument debug_single_process=True to torch_xla.launch() to restrict training to a single core is the only way to get it to work, but that defeats the purpose.
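The pattern I'm following is roughly the one from the docs. A minimal sketch of that multi-core setup (assuming torch_xla 2.x on a TPU VM; the tiny nn.Linear model and the _mp_fn name are placeholders, not my actual resnet50 code):

```python
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

def _mp_fn(index):
    # Each spawned process gets its own TPU core as its XLA device.
    device = xm.xla_device()
    model = nn.Linear(10, 2).to(device)  # stand-in for resnet50
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # In real training you'd wrap a DataLoader with
    # torch_xla.distributed.parallel_loader.MpDeviceLoader(loader, device)
    # so each core sees its own shard of the data.
    for step in range(3):
        data = torch.randn(8, 10, device=device)
        target = torch.randint(0, 2, (8,), device=device)
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(data), target)
        loss.backward()
        # optimizer_step all-reduces gradients across TPU cores
        # before applying the update.
        xm.optimizer_step(optimizer)

if __name__ == "__main__":
    # Launches one process per TPU core (8 on a v5e-8).
    # Passing debug_single_process=True here is the only variant
    # that works for me, but it uses just one core.
    torch_xla.launch(_mp_fn, args=())
```

This only runs on an actual TPU host, which is why I can't isolate the failure any further locally.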

Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 8 worker addresses, got 1.