Mailing List Archives
Re: [HTCondor-users] transfer_in/output_files only if they exist
- Date: Thu, 28 Feb 2019 13:58:43 +0000
- From: Duncan Brown <dabrown@xxxxxxx>
- Subject: Re: [HTCondor-users] transfer_in/output_files only if they exist
Hi Todd,
I've encountered the following error:
6070226.0 dbrown 2/27 17:41 Error from slot1@CRUSH-SUGWG-OSG-10-5-228-85: Job did not exit as promised when sent its checkpoint signal. Promised exit was with exit code 255, actual exit status was with exit code 255.
The logs are not particularly illuminating. Starter:
02/27/19 17:37:37 (pid:40934) GGT GGT GGT about to set io wait to 0
02/27/19 17:37:37 (pid:40934) GGT GGT GGT about to set io wait to 0
02/27/19 17:41:15 (pid:40934) Periodic Checkpointing all jobs.
02/27/19 17:41:15 (pid:40934) Process exited, pid=42189, status=255
02/27/19 17:41:15 (pid:40934) Hold all jobs
02/27/19 17:41:15 (pid:40934) GGT GGT GGT about to set io wait to 0
02/27/19 17:41:15 (pid:40934) GGT GGT GGT about to set io wait to 0
02/27/19 17:41:15 (pid:40934) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.5.2.4:5483>.
02/27/19 17:41:15 (pid:40934) IO: Failed to read packet header
02/27/19 17:41:15 (pid:40934) Lost connection to shadow, waiting 2400 secs for reconnect
02/27/19 17:41:15 (pid:40934) Returning from CStarter::JobReaper()
02/27/19 17:41:15 (pid:40934) Got SIGQUIT. Performing fast shutdown.
02/27/19 17:41:15 (pid:40934) ShutdownFast all jobs.
02/27/19 17:41:15 (pid:40934) GGT GGT GGT about to set io wait to 0
02/27/19 17:41:15 (pid:40934) CREDMON: Couldn't find dir "40934" in /var/lib/condor/credential
Shadow:
02/27/19 14:47:11 Initializing a VANILLA shadow for job 6070226.0
02/27/19 14:47:11 (6070226.0) (2132975): Request to run on slot1@CRUSH-SUGWG-OSG-10-5-228-85 <10.5.228.85:9618?addrs=10.5.228.85-9618&noUDP&sock=2273_aadb_3> was ACCEPTED
02/27/19 14:47:11 (6070226.0) (2132975): File transfer completed successfully.
02/27/19 14:47:11 (6070226.0) (2132975): WriteUserLog checking for event log rotation, but no lock
02/27/19 14:47:20 (6070226.0) (2132975): WriteUserLog checking for event log rotation, but no lock
02/27/19 14:52:21 (6070226.0) (2132975): WriteUserLog checking for event log rotation, but no lock
02/27/19 14:55:31 (6070226.0) (2132975): File transfer completed successfully.
02/27/19 17:41:15 (6070226.0) (2132975): WriteUserLog checking for event log rotation, but no lock
02/27/19 17:41:15 (6070226.0) (2132975): Job 6070226.0 going into Hold state (code 36,65280): Error from slot1@CRUSH-SUGWG-OSG-10-5-228-85: Job did not exit as promised when sent its checkpoint signal. Promised exit was with exit code 255, actual exit status was with exit code 255.
02/27/19 17:41:15 (6070226.0) (2132975): **** condor_shadow (condor_SHADOW) pid 2132975 EXITING WITH STATUS 112
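A note on the hold reason's "(code 36,65280)": if the second number is a raw wait(2)-style status as encoded on Linux (an assumption; the log does not say so), it decodes to a normal exit with code 255, which agrees with both halves of the error message. A minimal sketch of that arithmetic, where the function name decode_wait_status is purely illustrative:

```shell
#!/bin/bash
# Decode a Linux-style wait status into its parts.
# Assumption: 65280 from the hold reason is a raw wait(2) status.
# Simplified: stopped/continued statuses are not handled.
decode_wait_status() {
    local status=$1
    local signal=$(( status & 0x7f ))            # low 7 bits: terminating signal, 0 if none
    local exit_code=$(( (status >> 8) & 0xff ))  # bits 8-15: exit code
    if [ "$signal" -eq 0 ]; then
        echo "exited with code ${exit_code}"
    else
        echo "killed by signal ${signal}"
    fi
}

decode_wait_status 65280   # prints "exited with code 255"
```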
Submit file:
universe = vanilla
executable = testwrapper.sh
arguments = ./testjob.sh
output = testjob-$(cluster).out
error = testjob-$(cluster).err
log = testjob-$(cluster).log
transfer_executable = True
transfer_input_files = testjob.sh, my.input
transfer_output_files = my.output, my.checkpoint, wrapper.checkpoint, wrapper.log
when_to_transfer_output = ON_EXIT_OR_EVICT
+CheckpointExitBySignal = False
+CheckpointExitCode = 255
+WantCheckpointSignal = True
+WantFTOnCheckpoint = True
+CheckpointSig = 10
kill_sig = 10
queue
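The submit file refers to the signal by number (10) while the wrapper traps it by name (USR1). Signal numbers are platform-specific, so it may be worth confirming the mapping on the execute node. A small sketch using the shell's built-in signal table (signal_name is a hypothetical helper):

```shell
#!/bin/bash
# Map a signal number to its name using bash's built-in kill.
signal_name() {
    kill -l "$1"
}

signal_name 10
```

On x86-64 Linux this prints USR1; other platforms may number signals differently.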
Relevant checkpointing code from executable:
function checkpoint_trap {
    echo "checkpoint_trap function called, sending SIGTSTP to pid ${prog_pid}" &>> wrapper.log
    /bin/kill -s TSTP ${prog_pid} &>> wrapper.log
    exit 255
}
trap checkpoint_trap USR1
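For comparison, here is a self-contained sketch of one possible variation on the trap above. It is not a diagnosis of the hold. It assumes the payload finishes its checkpoint and exits when sent SIGTERM (the snippet above sends SIGTSTP, whose default action is to stop a process, not terminate it), and it waits for the payload to finish before exiting with 255. The function name run_with_checkpoint and the choice of SIGTERM are illustrative, not taken from the real wrapper:

```shell
#!/bin/bash
# Run "$@" as a child process. On USR1, ask the child to exit, wait for
# it to finish, then exit with 255.
# Assumption: the child writes its checkpoint and exits on SIGTERM.
# Note: this sets a trap and may call exit, so it is meant to be the
# main body of a wrapper script (or run in a background subshell).
run_with_checkpoint() {
    checkpoint_trap() {
        echo "checkpoint requested, sending SIGTERM to pid ${prog_pid}" >> wrapper.log
        kill -s TERM "${prog_pid}" 2>/dev/null
        wait "${prog_pid}" 2>/dev/null   # let the child finish before we exit
        exit 255
    }
    trap checkpoint_trap USR1

    "$@" &
    prog_pid=$!
    # If USR1 arrives here, wait is interrupted and checkpoint_trap runs.
    wait "${prog_pid}"
}
```

A wrapper script could end with run_with_checkpoint "$@", so that without a checkpoint request it finishes with the payload's own exit status.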
Will keep debugging. Any ideas?
Cheers,
Duncan.
> On Feb 15, 2019, at 6:08 PM, Todd L Miller <tlmiller@xxxxxxxxxxx> wrote:
>
>> Follow-up question: is there a way to set something like
>>
>> periodic_transfer_spool = 3600
>>
>> so that the contents of the job's spool directory can be transferred back to the shadow's spool periodically? In combination with ON_EXIT_OR_EVICT that would give me periodic checkpointing if the job dies unexpectedly, in addition to when it is cleanly evicted.
>
> We have some experimental (meaning, probably at least partially broken) features intended to support this kind of use case. They're both designed around the observation that HTCondor has no real way of knowing when it's safe to transfer the job sandbox if the job is still running, but that if you're creating checkpoints, your job is going to know how to restart from them.
>
> If you want the job to periodically checkpoint, you can request that HTCondor send a signal to it every so often; when it exits successfully, HTCondor performs file transfer (as if the job had been evicted), but instead of going back into the queue, HTCondor just restarts the job right where it was running.
>
> If the job generates checkpoints on its own, you can also configure HTCondor to recognize, for example, that when the job exits with code 88, that means to perform file transfer (as if the job had been evicted), and then restart the job right where it had been running.
>
> See the following page on our Wiki for details, and do please let me know if either feature works for you. Thanks.
>
> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalSupportForPeriodicCheckpointingInVanillaUniverse
>
> - ToddM
--
Duncan Brown Room 263-1, Physics Department
Charles Brightman Professor of Physics Syracuse University, NY 13244
http://dabrown.expressions.syr.edu Phone: 315 443 5993