Hi,
One of our jobs went into the X state as soon as it was released
with condor_release. The job had first been put on hold with
condor_hold, apparently before the shadow exited. To restart the job
we issued condor_release. condor_release reported success, but
condor_q now shows the job in the X state.
The job was submitted through the SOAP API. We are using version 7.2.3.
I think the logs below will help to find out what sent the job to the
X state.
In Schedd log:
8/9 21:03:03 (pid:5215) abort_job_myself: 514.0 action:Hold
log_hold:true notify:true
8/9 21:03:03 (pid:5215) Found shadow record for job 514.0, host =
<192.168.10.92:9620>
8/9 21:03:14 (pid:5215) No HoldReasonSubCode found for job 514.0
8/9 21:03:16 (pid:5215) Writing record to user
logfile=/mail/condor/log/VM_514_0.log owner=idealgrid
8/9 21:03:19 (pid:5215) FileLock object is updating timestamp on:
/mail/condor/log/VM_514_0.log
8/9 21:03:19 (pid:5215) FileLock::obtain(1) - @1249831999.611700 lock
on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:21 (pid:5215) FileLock::obtain(2) - @1249832001.150186 lock
on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:22 (pid:5215) Shadow pid 6457 for job 514.0 exited with
status 102
8/9 21:03:22 (pid:5215) Deleting shadow rec for PID 6457, job (514.0)
8/9 21:03:22 (pid:5215) Writing record to user
logfile=/mail/condor/log/VM_514_0.log owner=idealgrid
8/9 21:03:22 (pid:5215) FileLock object is updating timestamp on:
/mail/condor/log/VM_514_0.log
8/9 21:03:22 (pid:5215) FileLock::obtain(1) - @1249832002.754296 lock
on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:24 (pid:5215) FileLock::obtain(2) - @1249832004.133541 lock
on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:24 (pid:5215) Job 514.0 is finished
8/9 21:03:24 (pid:5215) Job cleanup for 514.0 will not block, calling
jobIsFinished() directly
8/9 21:03:24 (pid:5215) jobIsFinished() completed, calling
DestroyProc(514.0)
In ShadowLog:
8/9 21:03:03 (514.0) (6457): In handleJobRemoval(), sig 10
8/9 21:03:03 (514.0) (6457): setting exit reason on
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx to 102
8/9 21:03:03 (514.0) (6457): Resource
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from EXECUTING to
FINISHED
8/9 21:03:03 (514.0) (6457): Entering
DCStartd::deactivateClaim(forceful)
8/9 21:03:04 (514.0) (6457): DCStartd::deactivateClaim: successfully
sent command
8/9 21:03:04 (514.0) (6457): Killed starter (fast) at
<192.168.10.92:9620>
8/9 21:03:16 (514.0) (6457): Inside RemoteResource::updateFromStarter()
8/9 21:03:19 (514.0) (6457): Inside RemoteResource::resourceExit()
8/9 21:03:19 (514.0) (6457): setting exit reason on
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx to 107
8/9 21:03:19 (514.0) (6457): Resource
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from FINISHED to
FINISHED
8/9 21:03:19 (514.0) (6457): Job 514.0 is being evicted from
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
8/9 21:03:19 (514.0) (6457): FileLock::obtain(1) - @1249831999.591092
lock on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:19 (514.0) (6457): FileLock::obtain(2) - @1249831999.610029
lock on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(LastJobLeaseRenewal = 1249831999)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(RemoteSysCpu = 4.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(RemoteUserCpu = 3435.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(LastVacateTime = 1249831999)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(BytesSent = 0.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(BytesRecvd = 9785.000000)
8/9 21:03:22 (514.0) (6457): **** condor_shadow (condor_SHADOW) pid
6457 EXITING WITH STATUS 102
In StarterLog:
8/9 21:03:04 ProcAPI::buildFamily() Found daddypid on the system: 11157
8/9 21:03:08 Got SIGQUIT. Performing fast shutdown.
8/9 21:03:08 ShutdownFast all jobs.
8/9 21:03:08 Inside VMProc::ShutdownFast()
8/9 21:03:08 Inside VMProc::StopVM
8/9 21:03:08 VMGAHP[11157] <- 'CONDOR_VM_STOP 243 1'
8/9 21:03:09 VMGAHP[11157] -> 'S'
8/9 21:03:10 VMGAHP[11157] <- 'RESULTS'
8/9 21:03:11 VMGAHP[11157] -> 'R'
8/9 21:03:11 VMGAHP[11157] -> 'S' '1'
8/9 21:03:11 VMGAHP[11157] -> '243' '0' 'NULL'
8/9 21:03:11 PID for VM is changed from [23754] to [0]
8/9 21:03:12 Inside VM_GAHP_SERVER::cleanup()
8/9 21:03:12 VMGAHP[11157] <- 'QUIT'
8/9 21:03:17 VMGAHP[11157] -> 'S'
8/9 21:03:18 VMGahpServer::killVM() failed!
8/9 21:03:18 End of VM_GAHP_SERVER::cleanup
8/9 21:03:19 Inside VMProc::cleanup()
8/9 21:03:19 ProcAPI::buildFamily() Found daddypid on the system: 11157
In UserLog:
001 (514.000.000) 08/09 15:39:59 Job executing on host:
<192.168.10.92:9620>
...
004 (514.000.000) 08/09 21:03:19 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:57:15, Sys 0 00:00:04 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
1957 - Run Bytes Received By Job
...
013 (514.000.000) 08/09 21:03:19 Job was released.
via condor_release (by user daemon)
...
009 (514.000.000) 08/09 21:03:22 Job was aborted by the user.
...
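To make the order of events in the user log easier to see, here is a minimal Python sketch that extracts only the event header lines (three-digit event code, job id, date, time). The function name parse_events, the EVENT_NAMES table and the SAMPLE text are illustrative assumptions, not part of Condor. The regular expression is fitted to the header lines in the excerpt above and may need adjusting for other log layouts.

```python
import re

# Header lines look like: "013 (514.000.000) 08/09 21:03:19 Job was released."
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$"
)

# Partial table of user log event codes (only the common ones).
EVENT_NAMES = {
    "000": "submitted",
    "001": "executing",
    "004": "evicted",
    "005": "terminated",
    "009": "aborted",
    "012": "held",
    "013": "released",
}


def parse_events(text):
    """Return (code, name, job_id, date, time) for each event header line."""
    events = []
    for line in text.splitlines():
        match = EVENT_HEADER.match(line.strip())
        if not match:
            continue  # detail lines and "..." separators are skipped
        code, cluster, proc, _subproc, date, time, _message = match.groups()
        job_id = f"{int(cluster)}.{int(proc)}"
        events.append((code, EVENT_NAMES.get(code, "unknown"), job_id, date, time))
    return events


# Hypothetical sample: the header lines copied from the excerpt above.
SAMPLE = """\
001 (514.000.000) 08/09 15:39:59 Job executing on host:
<192.168.10.92:9620>
...
004 (514.000.000) 08/09 21:03:19 Job was evicted.
(0) Job was not checkpointed.
...
013 (514.000.000) 08/09 21:03:19 Job was released.
via condor_release (by user daemon)
...
009 (514.000.000) 08/09 21:03:22 Job was aborted by the user.
...
"""

if __name__ == "__main__":
    for code, name, job_id, date, time in parse_events(SAMPLE):
        print(code, name, job_id, date, time)
```

Run against the excerpt, this lists events 001, 004, 013 and 009 in that order. No 012 (held) event appears in the part of the log shown here.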
Thanks,
Johnson