Below is a draft machine-learning question framed for Stack Overflow.
It focuses on a subtle issue in PyTorch: a custom autograd.Function that works for first-order training but breaks when computing second-order derivatives (needed for techniques such as the gradient penalty in WGAN-GP, Hessian-vector products, and meta-learning/MAML).
The question is hard to answer from a quick search but solvable by an expert, which tends to attract good engagement.
Stack Overflow Question Draft
Title: RuntimeError: "element 0 of tensors does not require grad" when using custom autograd function with create_graph=True
Tags: python pytorch autograd deep-learning gradient-descent
Body:
I am implementing a custom activation function (a variant of Swish) in PyTorch to reduce memory usage. It is written as a torch.autograd.Function with both the forward and backward static methods defined.
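A stripped-down sketch of the class (simplified here, but the structure matches my real code):

```python
import torch


class CustomSwish(torch.autograd.Function):
    """Swish variant f(x) = x * sigmoid(x); beta is fixed to 1 in this sketch."""

    @staticmethod
    def forward(ctx, x):
        # Save only the raw input to keep memory low; the sigmoid is
        # recomputed in backward instead of being stored.
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        sig = torch.sigmoid(x)
        # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        return grad_output * sig * (1 + x * (1 - sig))


def custom_swish(x):
    return CustomSwish.apply(x)
```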
Standard training (first-order derivatives) works perfectly. However, I am now trying to use this custom layer in a WGAN-GP (Wasserstein GAN with Gradient Penalty) setup. This requires computing the gradient of the gradients (double backprop) using torch.autograd.grad(..., create_graph=True).
As soon as I enable create_graph=True, I get the following error: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.
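The gradient-penalty step is the standard WGAN-GP recipe; the snippet below (reusing custom_swish from the sketch above, with a toy critic and illustrative sizes) shows the code path where the error appears for me:

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Toy critic using the custom activation; layer sizes are illustrative."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        return self.fc2(custom_swish(self.fc1(x)))


critic = Critic()
real = torch.randn(8, 64)
fake = torch.randn(8, 64)

# Random interpolation between real and fake samples.
alpha = torch.rand(real.size(0), 1)
interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
d_interpolates = critic(interpolates)

# Gradient of the critic output w.r.t. the interpolated inputs;
# create_graph=True keeps this gradient differentiable for the penalty term.
gradients = torch.autograd.grad(
    outputs=d_interpolates,
    inputs=interpolates,
    grad_outputs=torch.ones_like(d_interpolates),
    create_graph=True,
    retain_graph=True,
)[0]

gradient_penalty = ((gradients.norm(2, dim=1) - 1) ** 2).mean()
gradient_penalty.backward()  # second backward pass through the first-order gradients
```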
What I have tried:
I verified that the input tensor has requires_grad=True.
I tried replacing my custom function with torch.nn.SiLU() (the built-in Swish), and double backward works perfectly, so the issue is definitely in my CustomSwish class (a minimal version of this check is shown just below).
I removed ctx.save_for_backward and recalculated the needed values inside backward instead, but the error persists.
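For reference, the built-in vs. custom comparison mentioned above was done with a small isolated test along these lines (toy tensor, no GAN code involved):

```python
import torch
import torch.nn.functional as F

x = torch.randn(16, requires_grad=True)

# Flip this to compare the built-in Swish (F.silu) against the custom op.
use_builtin = False
y = F.silu(x) if use_builtin else custom_swish(x)

# First-order gradient, kept differentiable via create_graph=True.
grad_x, = torch.autograd.grad(y.sum(), x, create_graph=True)

# Second-order step: backpropagate through the first-order gradient.
grad_x.sum().backward()
print(x.grad)
```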
Question: Why does my custom autograd.Function break the computation graph during the second backward pass, even though I am using differentiable PyTorch operations inside backward? How do I properly implement a custom function that supports higher-order derivatives?
