Serious question, my fine-tuning job on a custom model got stuck for 36 hours

I was working on a small text model for a local business in Austin, just a simple task to sort customer emails. The training data was clean, the setup looked good, but the job just hung at 85% for a day and a half. I checked everything: cloud credits, data pipeline, even the specific GPU cluster. Turns out there was a weird bug in the checkpoint saving code from the framework I was using, PyTorch Lightning. It wasn't failing, just pausing forever. Has anyone else hit a wall with a training job that seemed fine but just stopped moving?
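For anyone hitting the same thing: the frustrating part of a hang like this is that nothing fails loudly. A generic stall watchdog can surface it. This is a plain-Python sketch (the class and names are mine, not a PyTorch Lightning API): call `heartbeat()` from a training hook, e.g. at the end of every batch or after each checkpoint save, and a background thread flags the job if the heartbeat stops.

```python
import threading
import time

class StallWatchdog:
    """Flags a stall if heartbeat() isn't called within `timeout` seconds.

    Hypothetical helper for illustration: wire heartbeat() into whatever
    per-batch or per-checkpoint hook your framework exposes, so a silent
    hang gets reported instead of sitting at 85% for a day and a half.
    """

    def __init__(self, timeout=600.0, poll=1.0, on_stall=None):
        self.timeout = timeout          # seconds of silence before alarm
        self.poll = poll                # how often the watcher checks
        self.on_stall = on_stall or (lambda: print("training appears stalled"))
        self.stalled = False
        self._last = time.monotonic()   # monotonic clock: immune to NTP jumps
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)

    def heartbeat(self):
        """Call this from a training hook to prove the job is alive."""
        self._last = time.monotonic()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _watch(self):
        # Event.wait(poll) returns True only when stop() was requested.
        while not self._stop.wait(self.poll):
            if time.monotonic() - self._last > self.timeout:
                self.stalled = True
                self.on_stall()
                return
```

It wouldn't fix the checkpoint bug, but it turns an invisible hang into an alert you can act on (log, page yourself, or kill and restore from the last good checkpoint).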
2 comments

jordan305 · 5d ago
Did you try rolling back to an older version of Lightning?
3
anthony426 · 5d ago · Most Upvoted
Ha, I'm too scared to even update it!
7