Serious question, my fine-tuning job on a custom model got stuck for 36 hours

I was working on a small text model for a local business in Austin, just a simple task to sort customer emails. The training data was clean, the setup looked good, but the job just hung at 85% for a day and a half. I checked everything: cloud credits, data pipeline, even the specific GPU cluster. Turns out there was a weird bug in the checkpoint saving code from the framework I was using, PyTorch Lightning. It wasn't failing, just pausing forever. Has anyone else hit a wall with a training job that seemed fine but just stopped moving?
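For anyone hitting the same thing: the frustrating part of a hang like this is that nothing fails loudly. A generic stall watchdog can surface it. This is a plain-Python sketch (the class and names are mine, not a PyTorch Lightning API): call `heartbeat()` from a training hook, e.g. at the end of every batch or after each checkpoint save, and a background thread flags the job if the heartbeat stops.

```python
import threading
import time

class StallWatchdog:
    """Flags a stall if heartbeat() isn't called within `timeout` seconds.

    Hypothetical helper for illustration: wire heartbeat() into whatever
    per-batch or per-checkpoint hook your framework exposes, so a silent
    hang gets reported instead of sitting at 85% for a day and a half.
    """

    def __init__(self, timeout=600.0, poll=1.0, on_stall=None):
        self.timeout = timeout          # seconds of silence before alarm
        self.poll = poll                # how often the watcher checks
        self.on_stall = on_stall or (lambda: print("training appears stalled"))
        self.stalled = False
        self._last = time.monotonic()   # monotonic clock: immune to NTP jumps
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)

    def heartbeat(self):
        """Call this from a training hook to prove the job is alive."""
        self._last = time.monotonic()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _watch(self):
        # Event.wait(poll) returns True only when stop() was requested.
        while not self._stop.wait(self.poll):
            if time.monotonic() - self._last > self.timeout:
                self.stalled = True
                self.on_stall()
                return
```

It wouldn't fix the checkpoint bug, but it turns an invisible hang into an alert you can act on (log, page yourself, or kill and restore from the last good checkpoint).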
2 comments

jordan305 · 5d ago
Did you try rolling back to an older version of Lightning?
3
anthony426 · 5d ago · Most Upvoted
Ha, I'm too scared to even update it!
7