Hi All.
I am wondering what the best solution would be for the following case.
Just an example:
While running a deep learning job with 60 epochs, I wish to run an evaluation every 5 epochs.
The evaluation is asynchronous and can run in parallel with the training job.
One solution is to create a DAG in which the training job exits every 5 epochs, an evaluation
job is run, and the next training job continues with the following epochs.
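
To make this first option concrete, here is a rough Python sketch that writes such a DAG. It only uses the standard DAGMan JOB, VARS and PARENT/CHILD commands. The submit file names (train.sub, eval.sub) and the macro names (start_epoch, end_epoch, epoch) are made up for illustration, and it assumes the training job can resume from a checkpoint.

```python
def build_dag(total_epochs=60, eval_every=5,
              train_sub="train.sub", eval_sub="eval.sub"):
    """Return DAGMan input text for training split into chunks.

    train.sub and eval.sub are hypothetical submit files that are assumed
    to read the start_epoch / end_epoch / epoch macros set via VARS.
    """
    if total_epochs % eval_every != 0:
        raise ValueError("total_epochs must be a multiple of eval_every")
    chunks = total_epochs // eval_every
    lines = []
    for i in range(chunks):
        start = i * eval_every + 1
        end = start + eval_every - 1
        lines.append(f"JOB train_{i} {train_sub}")
        lines.append(f'VARS train_{i} start_epoch="{start}" end_epoch="{end}"')
        lines.append(f"JOB eval_{i} {eval_sub}")
        lines.append(f'VARS eval_{i} epoch="{end}"')
    for i in range(chunks):
        # Each training chunk releases its evaluation and the next chunk,
        # so the evaluation runs in parallel with further training.
        children = [f"eval_{i}"]
        if i + 1 < chunks:
            children.append(f"train_{i + 1}")
        lines.append(f"PARENT train_{i} CHILD {' '.join(children)}")
    return "\n".join(lines) + "\n"


dag_text = build_dag()
print(dag_text)
```

The output can be saved to a file and submitted with condor_submit_dag. The drawback is that training stops and restarts 12 times, so each chunk pays the start-up and checkpoint-loading cost again.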
Another way might be to use a DAG with a service node: the training job would use condor_chirp
to update its progress, and the script (the service node) would submit evaluation jobs
according to that progress.
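
A rough sketch of what the service-node script could look like, in Python. It assumes the training job reports progress with something like "condor_chirp set_job_attr TrainEpoch 5" (which, as far as I know, needs +WantIOProxy = true in its submit file) and that the script polls the job ad with condor_q. The attribute name TrainEpoch, the file eval.sub and its epoch macro are hypothetical, and passing the macro through condor_submit -append is my assumption about how eval.sub would receive it.

```python
import subprocess
import time


def due_evaluations(current_epoch, last_evaluated, every=5):
    """Return the checkpoint epochs that are finished but not yet evaluated."""
    return list(range(last_evaluated + every, current_epoch + 1, every))


def read_epoch(cluster_id, attr="TrainEpoch"):
    """Read the chirped attribute from the job ad (attr name is hypothetical).

    Returns None if the job is no longer in the queue, 0 if the attribute
    has not been set yet. Assumes the cluster holds a single job.
    """
    out = subprocess.run(
        ["condor_q", str(cluster_id), "-af", attr],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if not out:
        return None
    return int(out) if out.isdigit() else 0


def watch(cluster_id, total_epochs=60, every=5, poll_seconds=60):
    """Poll the training job and submit one evaluation per finished checkpoint."""
    last = 0
    while last < total_epochs:
        epoch = read_epoch(cluster_id)
        if epoch is None:
            break  # training job has left the queue
        for e in due_evaluations(epoch, last, every):
            # eval.sub is a hypothetical submit file that uses $(epoch).
            subprocess.run(
                ["condor_submit", "-append", f"epoch={e}", "eval.sub"],
                check=True,
            )
            last = e
        time.sleep(poll_seconds)
```

The script would be started with the training job's cluster id, e.g. watch(1234). One gap in this sketch: if the training job leaves the queue between two polls, the last checkpoints are never evaluated, so a real script would need a final pass (for example by reading the job's history).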
Maybe there is a better way?
Thanks
David