Hi Dan, Not for this particular job: # grep 875666 ShadowLog 8/25 12:12:13 (875666.0) (8844): Asked to write event of number 6. 8/25 12:17:05 (875666.0) (8844): Asked to write event of number 6. 8/25 12:17:05 (875666.0) (8844): ZKM: setting default map to (null) 8/25 12:27:05 (875666.0) (8844): Asked to write event of number 6. # grep 875666 ShadowLog.old 8/25 11:45:40 (875666.0) (28489): Got SIGTERM. Performing graceful shutdown. 8/25 11:45:40 (875666.0) (28489): Job 875666.0 is being evicted 8/25 11:45:40 (875666.0) (28489): Asked to write event of number 4. 8/25 11:45:44 (875666.0) (28489): ZKM: setting default map to (null) 8/25 11:45:44 (875666.0) (28489): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107 8/25 11:46:59 Initializing a VANILLA shadow for job 875666.0 8/25 11:46:59 (875666.0) (8844): ZKM: setting default map to (null) 8/25 11:46:59 (875666.0) (8844): Request to run on <10.255.255.207:52493> was ACCEPTED 8/25 11:47:05 (875666.0) (8844): Asked to write event of number 1. 8/25 11:52:13 (875666.0) (8844): Asked to write event of number 6. 8/25 11:57:13 (875666.0) (8844): Asked to write event of number 6. 8/25 12:02:05 (875666.0) (8844): ZKM: setting default map to (null) 8/25 12:02:13 (875666.0) (8844): Asked to write event of number 6. 8/25 12:07:13 (875666.0) (8844): Asked to write event of number 6. The entry "Got SIGTERM. Performing graceful shutdown." corresponds to the time that I ran 'condor_vacate_job' manually to restart the job. Any older shadow log entries for this particular job However, I do see the following for another job: 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration: 1200 seconds 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1166 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter <10.255.255.146:36787> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection re-established 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23. 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to unknown user from host <10.255.255.146:59311> for command 61001 (FILETRANS_DOWNLOAD), access level WRITE 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter <10.255.255.146:36787> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 22. 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1200 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter <10.255.255.146:36787> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection re-established 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23. 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to unknown user from host <10.255.255.146:33141> for command 61001 (FILETRANS_DOWNLOAD), access level WRITE 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter <10.255.255.146:36787> How can I determine who this "unknown user" is? --Mike Dan Bradley wrote: > Hi Mike, > > Are there any clues in the corresponding ShadowLog (on the submit side)? > > --Dan > > Michael Thomas wrote: >> I recently started seeing jobs fail with the errors below. These jobs >> come into our cluster from the globus job manager, which explicitly >> disables streaming output and transfers the output files when the jobs >> finish (via the NFSLite package from the VDT). The file transfer is now >> failing, which ultimately results in jobs being requeued and run again >> and again. >> >> These errors seem to have started at about the same time that I changed >> this particular grid user's shell from /bin/bash to /bin/true. But >> other users with a shell of /bin/true don't have problems with this >> output file transfer. >> >> Where else should I look for more information on what's going wrong? >> >> --Mike >> >> 8/24 20:49:03 ****************************************************** >> 8/24 20:49:03 ** condor_starter (CONDOR_STARTER) STARTING UP >> 8/24 20:49:03 ** /opt/condor/sbin/condor_starter >> 8/24 20:49:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $ >> 8/24 20:49:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $ >> 8/24 20:49:03 ** PID = 25226 >> 8/24 20:49:03 ** Log last touched 8/24 20:49:01 >> 8/24 20:49:03 ****************************************************** >> 8/24 20:49:03 Using config source: /home/condor/condor_config >> 8/24 20:49:03 Using local config sources: >> 8/24 20:49:03 /share/apps/condor/hosts/cithep230/condor_config.local >> 8/24 20:49:03 DaemonCore: Command Socket at <10.255.255.156:45962> >> 8/24 20:49:03 Done setting resource limits >> 8/24 20:49:03 Communicating with shadow <10.255.255.216:48267> >> 8/24 20:49:03 Submitting machine is "gatekeeper-0-2.local" >> 8/24 20:49:03 setting the orig job name in starter >> 8/24 20:49:03 setting the orig job iwd in starter >> 8/24 20:49:03 File transfer completed successfully. >> 8/24 20:49:04 Job 875666.0 set to execute immediately >> 8/24 20:49:04 Starting a VANILLA universe job with ID: 875666.0 >> 8/24 20:49:04 IWD: /state/partition1/tmp/cithep230/execute/dir_25226 >> 8/24 20:49:04 Output file: >> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stdout >> 8/24 20:49:04 Error file: >> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stderr >> 8/24 20:49:10 Using wrapper >> /opt/condor/bin/condor_nfslite_job_wrapper.sh to exec >> Summer08-QCD_EMenriched_Pt30to80-IDEAL_V6_v1-32774-JobSpec.xml >> 8/24 20:49:10 Create_Process succeeded, pid=25229 >> 8/25 08:19:58 Process exited, pid=25229, status=0 >> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >> failure reading 5 bytes from unknown source. >> 8/25 08:19:58 IO: Failed to read packet header >> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >> 8/25 08:19:58 File transfer failed, forcing disconnect. >> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >> or for a reconnect attempt >> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0> >> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267> >> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267> >> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >> failure reading 5 bytes from unknown source. >> 8/25 08:19:58 IO: Failed to read packet header >> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >> 8/25 08:19:58 File transfer failed, forcing disconnect. >> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >> or for a reconnect attempt >> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0> >> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267> >> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267> >> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >> failure reading 5 bytes from unknown source. >> 8/25 08:19:58 IO: Failed to read packet header >> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >> 8/25 08:19:58 File transfer failed, forcing disconnect. >> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >> or for a reconnect attempt >> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0> >> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267> >> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267> >> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >> failure reading 5 bytes from unknown source. >> 8/25 08:19:58 IO: Failed to read packet header >> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >> 8/25 08:19:58 File transfer failed, forcing disconnect. >> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >> or for a reconnect attempt >> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0> >> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267> >> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267> >> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >> failure reading 5 bytes from unknown source. >> 8/25 08:19:58 IO: Failed to read packet header >> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >> or for a reconnect attempt >> 8/25 08:19:58 Got SIGQUIT. Performing fast shutdown. >> 8/25 08:19:58 ShutdownFast all jobs. >> 8/25 08:19:58 Result of "get_usage" operation from ProcD: ERROR: No >> family with the given PID is registered >> 8/25 08:19:58 error getting family usage in VanillaProc::PublishUpdateAd() >> 8/25 08:19:58 condor_write(): Socket closed when trying to write 67 >> bytes to <10.255.255.216:43187>, fd is 5 >> 8/25 08:19:58 Buf::write(): condor_write() failed >> 8/25 08:19:58 Failed to send job exit status to shadow >> 8/25 08:19:58 JobExit() failed, waiting for job lease to expire or for a >> reconnect attempt >> 8/25 08:19:58 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0 >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Condor-users mailing list >> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a >> subject: Unsubscribe >> You can also unsubscribe by visiting >> https://lists.cs.wisc.edu/mailman/listinfo/condor-users >> >> The archives can be found at: >> https://lists.cs.wisc.edu/archive/condor-users/ >> > _______________________________________________ > Condor-users mailing list > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a > subject: Unsubscribe > You can also unsubscribe by visiting > https://lists.cs.wisc.edu/mailman/listinfo/condor-users > > The archives can be found at: > https://lists.cs.wisc.edu/archive/condor-users/
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature