Not sure if it matters, but this did not happen with Condor 7.0.3; it only started after a couple of weeks of running Condor 7.0.4. We also use GSI security in our Condor setup, but we did not change the GSI config settings when we upgraded from 7.0.3 to 7.0.4.

--Mike

Michael Thomas wrote:
> Hi Dan,
>
> Not for this particular job:
>
> # grep 875666 ShadowLog
> 8/25 12:12:13 (875666.0) (8844): Asked to write event of number 6.
> 8/25 12:17:05 (875666.0) (8844): Asked to write event of number 6.
> 8/25 12:17:05 (875666.0) (8844): ZKM: setting default map to (null)
> 8/25 12:27:05 (875666.0) (8844): Asked to write event of number 6.
>
> # grep 875666 ShadowLog.old
> 8/25 11:45:40 (875666.0) (28489): Got SIGTERM. Performing graceful shutdown.
> 8/25 11:45:40 (875666.0) (28489): Job 875666.0 is being evicted
> 8/25 11:45:40 (875666.0) (28489): Asked to write event of number 4.
> 8/25 11:45:44 (875666.0) (28489): ZKM: setting default map to (null)
> 8/25 11:45:44 (875666.0) (28489): **** condor_shadow (condor_SHADOW)
> EXITING WITH STATUS 107
> 8/25 11:46:59 Initializing a VANILLA shadow for job 875666.0
> 8/25 11:46:59 (875666.0) (8844): ZKM: setting default map to (null)
> 8/25 11:46:59 (875666.0) (8844): Request to run on
> <10.255.255.207:52493> was ACCEPTED
> 8/25 11:47:05 (875666.0) (8844): Asked to write event of number 1.
> 8/25 11:52:13 (875666.0) (8844): Asked to write event of number 6.
> 8/25 11:57:13 (875666.0) (8844): Asked to write event of number 6.
> 8/25 12:02:05 (875666.0) (8844): ZKM: setting default map to (null)
> 8/25 12:02:13 (875666.0) (8844): Asked to write event of number 6.
> 8/25 12:07:13 (875666.0) (8844): Asked to write event of number 6.
>
> The entry "Got SIGTERM. Performing graceful shutdown." corresponds to
> the time that I ran 'condor_vacate_job' manually to restart the job.
> Any older shadow log entries for this particular job
>
> However, I do see the following for another job:
>
> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration: 1200 seconds
> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1166
> 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter
> 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter
> <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection
> re-established
> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23.
> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to
> unknown user from host <10.255.255.146:59311> for command 61001
> (FILETRANS_DOWNLOAD), access level WRITE
> 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter
> <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 22.
> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1200
> 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter
> 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter
> <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection
> re-established
> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23.
> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to
> unknown user from host <10.255.255.146:33141> for command 61001
> (FILETRANS_DOWNLOAD), access level WRITE
> 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter
> <10.255.255.146:36787>
>
> How can I determine who this "unknown user" is?
>
> --Mike
>
> Dan Bradley wrote:
>> Hi Mike,
>>
>> Are there any clues in the corresponding ShadowLog (on the submit side)?
>>
>> --Dan
>>
>> Michael Thomas wrote:
>>> I recently started seeing jobs fail with the errors below. These jobs
>>> come into our cluster from the globus job manager, which explicitly
>>> disables streaming output and transfers the output files when the jobs
>>> finish (via the NFSLite package from the VDT). The file transfer is now
>>> failing, which ultimately results in jobs being requeued and run again
>>> and again.
>>>
>>> These errors seem to have started at about the same time that I changed
>>> this particular grid user's shell from /bin/bash to /bin/true. But
>>> other users with a shell of /bin/true don't have problems with this
>>> output file transfer.
>>>
>>> Where else should I look for more information on what's going wrong?
>>>
>>> --Mike
>>>
>>> 8/24 20:49:03 ******************************************************
>>> 8/24 20:49:03 ** condor_starter (CONDOR_STARTER) STARTING UP
>>> 8/24 20:49:03 ** /opt/condor/sbin/condor_starter
>>> 8/24 20:49:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
>>> 8/24 20:49:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
>>> 8/24 20:49:03 ** PID = 25226
>>> 8/24 20:49:03 ** Log last touched 8/24 20:49:01
>>> 8/24 20:49:03 ******************************************************
>>> 8/24 20:49:03 Using config source: /home/condor/condor_config
>>> 8/24 20:49:03 Using local config sources:
>>> 8/24 20:49:03 /share/apps/condor/hosts/cithep230/condor_config.local
>>> 8/24 20:49:03 DaemonCore: Command Socket at <10.255.255.156:45962>
>>> 8/24 20:49:03 Done setting resource limits
>>> 8/24 20:49:03 Communicating with shadow <10.255.255.216:48267>
>>> 8/24 20:49:03 Submitting machine is "gatekeeper-0-2.local"
>>> 8/24 20:49:03 setting the orig job name in starter
>>> 8/24 20:49:03 setting the orig job iwd in starter
>>> 8/24 20:49:03 File transfer completed successfully.
>>> 8/24 20:49:04 Job 875666.0 set to execute immediately
>>> 8/24 20:49:04 Starting a VANILLA universe job with ID: 875666.0
>>> 8/24 20:49:04 IWD: /state/partition1/tmp/cithep230/execute/dir_25226
>>> 8/24 20:49:04 Output file:
>>> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stdout
>>> 8/24 20:49:04 Error file:
>>> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stderr
>>> 8/24 20:49:10 Using wrapper
>>> /opt/condor/bin/condor_nfslite_job_wrapper.sh to exec
>>> Summer08-QCD_EMenriched_Pt30to80-IDEAL_V6_v1-32774-JobSpec.xml
>>> 8/24 20:49:10 Create_Process succeeded, pid=25229
>>> 8/25 08:19:58 Process exited, pid=25229, status=0
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>> failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>> or for a reconnect attempt
>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>> failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>> or for a reconnect attempt
>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>> failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>> or for a reconnect attempt
>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>> failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>> or for a reconnect attempt
>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>> failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>> or for a reconnect attempt
>>> 8/25 08:19:58 Got SIGQUIT. Performing fast shutdown.
>>> 8/25 08:19:58 ShutdownFast all jobs.
>>> 8/25 08:19:58 Result of "get_usage" operation from ProcD: ERROR: No
>>> family with the given PID is registered
>>> 8/25 08:19:58 error getting family usage in VanillaProc::PublishUpdateAd()
>>> 8/25 08:19:58 condor_write(): Socket closed when trying to write 67
>>> bytes to <10.255.255.216:43187>, fd is 5
>>> 8/25 08:19:58 Buf::write(): condor_write() failed
>>> 8/25 08:19:58 Failed to send job exit status to shadow
>>> 8/25 08:19:58 JobExit() failed, waiting for job lease to expire or for a
>>> reconnect attempt
>>> 8/25 08:19:58 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Condor-users mailing list
>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/condor-users/
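
For anyone chasing the same symptom, here is a minimal sketch (not part of Condor) of one way to pull the PERMISSION DENIED entries out of a ShadowLog and count them per job and peer host. The regular expression is inferred only from the log lines quoted in this thread, with the mail-wrapped lines rejoined; the names DENIED, summarize_denials and sample are illustrative assumptions, and other Condor versions or debug levels may format these entries differently.

```python
import re
from collections import Counter

# Pattern inferred from the ShadowLog lines quoted in this thread only.
# It is an assumption that real log entries sit on a single line like this.
DENIED = re.compile(
    r"\((?P<job>\d+\.\d+)\)\s+\((?P<pid>\d+)\):\s+"
    r"DaemonCore: PERMISSION DENIED to (?P<user>.+?) "
    r"from host <(?P<host>[\d.]+):(?P<port>\d+)> "
    r"for command (?P<cmd>\d+) \((?P<name>\w+)\), access level (?P<level>\w+)"
)


def summarize_denials(lines):
    """Count denials per (job id, user, peer host, command name, access level)."""
    counts = Counter()
    for line in lines:
        m = DENIED.search(line)
        if m:
            key = (m["job"], m["user"], m["host"], m["name"], m["level"])
            counts[key] += 1
    return counts


# Sample lines copied from the quoted ShadowLog, with wrapped lines rejoined.
sample = [
    "8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection re-established",
    "8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to unknown user "
    "from host <10.255.255.146:59311> for command 61001 (FILETRANS_DOWNLOAD), "
    "access level WRITE",
    "8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to unknown user "
    "from host <10.255.255.146:33141> for command 61001 (FILETRANS_DOWNLOAD), "
    "access level WRITE",
]

if __name__ == "__main__":
    # Print one summary row per distinct denial, most frequent first.
    for key, n in summarize_denials(sample).most_common():
        print(n, *key)
```

To run it against a real log, pass an open file instead of the sample list, for example summarize_denials(open("ShadowLog")). Grouping by peer host rather than by port is a deliberate choice here: in the quoted entries the source port changes on every reconnect attempt while the host stays the same.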