Re: [HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1
- Date: Wed, 04 Nov 2015 19:15:20 +0000
- From: "Feldt, Andrew N." <afeldt@xxxxxx>
- Subject: Re: [HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1
Folks,
I neglected to note that this is on RHEL 6.7 in an NIS/NFS environment. Any thoughts on how to make checkpointing work in this environment are welcome!
Andy
> On Nov 2, 2015, at 3:11 PM, Feldt, Andrew N. <afeldt@xxxxxx> wrote:
>
> I recently found that our HTCondor jobs were never vacating because we had not set up a way to run condor_kbdd. So I set it up so that it is started for any user who logs into GNOME and killed when that user logs out. But then I started getting reports of "user aborted" jobs. Some debugging showed that, while nothing bad happens when a checkpoint is made, a job that tries to restart from a checkpoint fails. This shows up in the user's log file as:
>
> 001 (008.000.000) 11/02 11:18:32 Job executing on host: <129.15.nn.nn:9757?addrs=129.15.nn.nn-9757>
> ...
> 005 (008.000.000) 11/02 11:18:33 Job terminated.
> (0) Abnormal termination (signal 6)
> (0) No core file
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> Usr 0 00:10:04, Sys 0 00:00:00 - Total Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> 334 - Run Bytes Sent By Job
> 4097614 - Run Bytes Received By Job
> 0 - Total Bytes Sent By Job
> 0 - Total Bytes Received By Job
> ...
> 009 (008.000.000) 11/02 11:18:33 Job was aborted by the user.
>
>
> In the shadow log file for the job, I see:
>
> 11/02/15 11:18:33 (8.0) (2889688):Read: Opened "/var/lib/condor/spool/8/0/cluster8.proc0.subproc0" via file stream
> 11/02/15 11:18:33 (8.0) (2889688):Read: Read headers OK
> 11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[0](DATA) OK
> 11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[1](STACK) OK
> 11/02/15 11:18:33 (8.0) (2889688):Read: Read all SegMaps OK
> 11/02/15 11:18:33 (8.0) (2889688):Read: Found a DATA block, increasing heap from 0x887000 to 0x986000
> 11/02/15 11:18:33 (8.0) (2889688):Read: About to overwrite 1789952 bytes starting at 0x7d1000(DATA)
> 11/02/15 11:18:33 (8.0) (2889688):Reaped child status - pid 2889690 exited with status 0
> 11/02/15 11:18:33 (8.0) (2889688):Read: *** longjmp causes uninitialized stack frame ***: condor_exec.8.0 terminated
>
> followed by a Backtrace, followed by:
>
> 11/02/15 11:18:33 (8.0) (2889688):Shadow: Job 8.0 exited, termsig = 6, coredump = 0, retcode = 0
> 11/02/15 11:18:33 (8.0) (2889688):Shadow: was killed by signal 6.
> 11/02/15 11:18:33 (8.0) (2889688):user_time = 0 ticks
> 11/02/15 11:18:33 (8.0) (2889688):sys_time = 2 ticks
> 11/02/15 11:18:33 (8.0) (2889688):Static Policy: removing job because OnExitRemove has become true
> 11/02/15 11:18:33 (8.0) (2889688):********** Shadow Exiting(102) **********
>
> On the RemoteHost, in the StartLog, I see:
>
> 11/02/15 10:59:39 Starter pid 3317011 exited with status 0
> 11/02/15 10:59:39 slot1: State change: starter exited
> 11/02/15 10:59:39 slot1: State change: No preempting claim, returning to owner
> 11/02/15 10:59:39 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 11/02/15 11:18:03 slot1: State change: IS_OWNER is false
> 11/02/15 11:18:03 slot1: Changing state: Owner -> Unclaimed
> 11/02/15 11:18:32 slot1: Request accepted.
> 11/02/15 11:18:32 slot1: Remote owner is feldt@xxxxxxxxxx
> 11/02/15 11:18:32 slot1: State change: claiming protocol successful
> 11/02/15 11:18:32 slot1: Changing state: Unclaimed -> Claimed
> 11/02/15 11:18:32 slot1: Got activate_claim request from shadow (129.15.nn.nn)
> 11/02/15 11:18:32 slot1: Remote job ID is 8.0
> 11/02/15 11:18:32 slot1: Got universe "STANDARD" (1) from request classad
> 11/02/15 11:18:32 slot1: State change: claim-activation protocol successful
> 11/02/15 11:18:32 slot1: Changing activity: Idle -> Busy
> 11/02/15 11:18:33 condor_write(): Socket closed when trying to write 28 bytes to <129.15.nn.nn:9682>, fd is 8
> 11/02/15 11:18:33 Buf::write(): condor_write() failed
> 11/02/15 11:18:33 slot1: Called deactivate_claim_forcibly()
> 11/02/15 11:18:33 Starter pid 3319125 exited with status 0
> 11/02/15 11:18:33 slot1: State change: starter exited
>
> So, it looks like the job either dies trying to write to the shadow on the submitting host or is unable to execute the checkpointed file. Note that this is not a firewall issue. We have all ports open between the submit host and the startd host.
>
> I did see one SELinux issue and added the following local policy rule based on it:
>
> allow hald_t condor_master_t:bus send_msg;
>
> This did not help and so I am stuck.
>
> Andy
>
>
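For anyone checking their own shadow logs for the same symptom, here is a minimal sketch that picks out jobs killed by a signal. It assumes the line format seen in the one "Shadow: Job ... exited, termsig = ..." line quoted above; the function name `find_signal_exits` and the regular expression are hypothetical helpers written for this illustration, not part of HTCondor.

```python
import re
import signal

# Assumed format, taken from the single shadow-log line quoted above:
# "... Shadow: Job 8.0 exited, termsig = 6, coredump = 0, retcode = 0"
EXIT_RE = re.compile(
    r"Shadow: Job (?P<job>\d+\.\d+) exited, "
    r"termsig = (?P<termsig>\d+), "
    r"coredump = (?P<coredump>\d+), "
    r"retcode = (?P<retcode>\d+)"
)


def find_signal_exits(lines):
    """Return (job_id, signal_number, signal_name) for jobs killed by a signal."""
    results = []
    for line in lines:
        match = EXIT_RE.search(line)
        if match is None:
            continue
        signum = int(match.group("termsig"))
        if signum == 0:
            continue  # termsig of 0 is treated here as "not killed by a signal"
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"  # number not defined on the platform running this script
        results.append((match.group("job"), signum, name))
    return results


if __name__ == "__main__":
    sample = [
        "11/02/15 11:18:33 (8.0) (2889688):Shadow: Job 8.0 exited, "
        "termsig = 6, coredump = 0, retcode = 0",
        "11/02/15 11:18:33 (8.0) (2889688):user_time = 0 ticks",
    ]
    for job, signum, name in find_signal_exits(sample):
        print(f"job {job} killed by signal {signum} ({name})")
```

The signal name is looked up on the machine running the script, so it only matches the execute host's naming when both run the same platform.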