RuntimeError: "element 0 of tensors does not require grad" when using custom autograd function with create_graph=True


Here is a unique, high-quality Machine Learning question framed for Stack Overflow.

It focuses on a subtle, advanced issue in PyTorch: implementing a custom autograd.Function that breaks when computing second-order derivatives (commonly needed for techniques like Gradient Penalty in WGANs, Hessian-vector products, or Meta-Learning/MAML).

This question is designed to be "un-Googleable" at a glance but solvable by an expert, which usually garners high engagement.

Stack Overflow Question Draft

Title: RuntimeError: "element 0 of tensors does not require grad" when using custom autograd function with create_graph=True

Tags: python pytorch autograd deep-learning gradient-descent

Body:

I am implementing a custom activation function (a variant of Swish) in PyTorch to reduce memory usage. I wrote it as a torch.autograd.Function, defining both the forward and backward static methods.
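A simplified sketch of the kind of Function I mean (not the exact class, but the same pattern of saving the input and recomputing the sigmoid in backward):

```python
import torch

class CustomSwish(torch.autograd.Function):
    """Memory-saving Swish: store only the input and recompute
    sigmoid in backward instead of keeping the activation output."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        s = torch.sigmoid(x)
        # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        return grad_output * s * (1 + x * (1 - s))
```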

Standard training (first-order derivatives) works perfectly. However, I am now trying to use this custom layer in a WGAN-GP (Wasserstein GAN with Gradient Penalty) setup. This requires computing the gradient of the gradients (double backprop) using torch.autograd.grad(..., create_graph=True).
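The penalty term is computed in the usual WGAN-GP way, roughly like this (illustrative sketch; `critic` stands for the network that contains the custom activation):

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: gradient norm of the critic output w.r.t.
    random interpolations between real and fake samples."""
    # One interpolation coefficient per sample, broadcast over the remaining dims.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

    critic_out = critic(interpolates)
    grads = torch.autograd.grad(
        outputs=critic_out,
        inputs=interpolates,
        grad_outputs=torch.ones_like(critic_out),
        create_graph=True,   # the penalty itself must be differentiable
        retain_graph=True,
    )[0]
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```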

As soon as I enable create_graph=True, I get the following error: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.

What I have tried:

I verified that x.requires_grad is True.

I tried replacing my custom function with torch.nn.SiLU() (built-in Swish), and the double backward works perfectly, so the issue is definitely in my CustomSwish class (a minimal isolation harness is sketched below).

I removed ctx.save_for_backward and recalculated inputs in backward, but the error persists.
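A harness along these lines isolates the comparison from the rest of the GAN code (illustrative; CustomSwish refers to the sketch above):

```python
import torch

def check_double_backward(act):
    """Forward pass, first-order grad with create_graph=True,
    then backprop through that gradient (second-order step)."""
    x = torch.randn(8, 4, requires_grad=True)
    y = act(x).sum()
    (g,) = torch.autograd.grad(y, x, create_graph=True)
    g.pow(2).sum().backward()  # raises if the first-order graph was not built

check_double_backward(torch.nn.SiLU())      # built-in Swish
check_double_backward(CustomSwish.apply)    # custom Function under test
```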

Question: Why does my custom autograd.Function break the computation graph during the second backward pass, even though I am using differentiable PyTorch operations inside backward? How do I properly implement a custom function that supports higher-order derivatives?
