Hi Dan, I've set these log settings and will get back to you once I see the error appear in the shadow logs again. Regards, --Mike Dan Bradley wrote: > Mike, > > > This line from your ShadowLog would certainly cause the sort of behavior > you mention: > > 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to > unknown user from host <10.255.255.146:59311> for command 61001 > (FILETRANS_DOWNLOAD), access level WRITE > > > It sounds like the starter is connecting to the shadow to do file > transfers, but no authentication is taking place and the shadow is > rejecting the command. > > Please add D_SECURITY to SHADOW_DEBUG in your submit node condor > configuration. Since this may get a little involved, let's take this > off list. Once a new shadow starts and hits this problem, there should > be a bunch of extra information in the log. Could you please post that > to condor-admin@xxxxxxxxxxx? > > Thanks, > --Dan > > > Michael Thomas wrote: >> Hi Dan, >> >> Not for this particular job: >> >> # grep 875666 ShadowLog >> 8/25 12:12:13 (875666.0) (8844): Asked to write event of number 6. >> 8/25 12:17:05 (875666.0) (8844): Asked to write event of number 6. >> 8/25 12:17:05 (875666.0) (8844): ZKM: setting default map to (null) >> 8/25 12:27:05 (875666.0) (8844): Asked to write event of number 6. >> >> # grep 875666 ShadowLog.old >> 8/25 11:45:40 (875666.0) (28489): Got SIGTERM. Performing graceful shutdown. >> 8/25 11:45:40 (875666.0) (28489): Job 875666.0 is being evicted >> 8/25 11:45:40 (875666.0) (28489): Asked to write event of number 4. >> 8/25 11:45:44 (875666.0) (28489): ZKM: setting default map to (null) >> 8/25 11:45:44 (875666.0) (28489): **** condor_shadow (condor_SHADOW) >> EXITING WITH STATUS 107 >> 8/25 11:46:59 Initializing a VANILLA shadow for job 875666.0 >> 8/25 11:46:59 (875666.0) (8844): ZKM: setting default map to (null) >> 8/25 11:46:59 (875666.0) (8844): Request to run on >> <10.255.255.207:52493> was ACCEPTED >> 8/25 11:47:05 (875666.0) (8844): Asked to write event of number 1. >> 8/25 11:52:13 (875666.0) (8844): Asked to write event of number 6. >> 8/25 11:57:13 (875666.0) (8844): Asked to write event of number 6. >> 8/25 12:02:05 (875666.0) (8844): ZKM: setting default map to (null) >> 8/25 12:02:13 (875666.0) (8844): Asked to write event of number 6. >> 8/25 12:07:13 (875666.0) (8844): Asked to write event of number 6. >> >> The entry "Got SIGTERM. Performing graceful shutdown." corresponds to >> the time that I ran 'condor_vacate_job' manually to restart the job. >> Any older shadow log entries for this particular job >> >> However, I do see the following for another job: >> >> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration: 1200 seconds >> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1166 >> 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter >> 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787> >> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter >> <10.255.255.146:36787> >> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection >> re-established >> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23. >> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to >> unknown user from host <10.255.255.146:59311> for command 61001 >> (FILETRANS_DOWNLOAD), access level WRITE >> 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter >> <10.255.255.146:36787> >> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 22. >> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1200 >> 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter >> 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787> >> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter >> <10.255.255.146:36787> >> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection >> re-established >> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23. >> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to >> unknown user from host <10.255.255.146:33141> for command 61001 >> (FILETRANS_DOWNLOAD), access level WRITE >> 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter >> <10.255.255.146:36787> >> >> How can I determine who this "unknown user" is? >> >> --Mike >> >> Dan Bradley wrote: >> >>> Hi Mike, >>> >>> Are there any clues in the corresponding ShadowLog (on the submit side)? >>> >>> --Dan >>> >>> Michael Thomas wrote: >>> >>>> I recently started seeing jobs fail with the errors below. These jobs >>>> come into our cluster from the globus job manager, which explicitly >>>> disables streaming output and transfers the output files when the jobs >>>> finish (via the NFSLite package from the VDT). The file transfer is now >>>> failing, which ultimately results in jobs being requeued and run again >>>> and again. >>>> >>>> These errors seem to have started at about the same time that I changed >>>> this particular grid user's shell from /bin/bash to /bin/true. But >>>> other users with a shell of /bin/true don't have problems with this >>>> output file transfer. >>>> >>>> Where else should I look for more information on what's going wrong? >>>> >>>> --Mike >>>> >>>> 8/24 20:49:03 ****************************************************** >>>> 8/24 20:49:03 ** condor_starter (CONDOR_STARTER) STARTING UP >>>> 8/24 20:49:03 ** /opt/condor/sbin/condor_starter >>>> 8/24 20:49:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $ >>>> 8/24 20:49:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $ >>>> 8/24 20:49:03 ** PID = 25226 >>>> 8/24 20:49:03 ** Log last touched 8/24 20:49:01 >>>> 8/24 20:49:03 ****************************************************** >>>> 8/24 20:49:03 Using config source: /home/condor/condor_config >>>> 8/24 20:49:03 Using local config sources: >>>> 8/24 20:49:03 /share/apps/condor/hosts/cithep230/condor_config.local >>>> 8/24 20:49:03 DaemonCore: Command Socket at <10.255.255.156:45962> >>>> 8/24 20:49:03 Done setting resource limits >>>> 8/24 20:49:03 Communicating with shadow <10.255.255.216:48267> >>>> 8/24 20:49:03 Submitting machine is "gatekeeper-0-2.local" >>>> 8/24 20:49:03 setting the orig job name in starter >>>> 8/24 20:49:03 setting the orig job iwd in starter >>>> 8/24 20:49:03 File transfer completed successfully. >>>> 8/24 20:49:04 Job 875666.0 set to execute immediately >>>> 8/24 20:49:04 Starting a VANILLA universe job with ID: 875666.0 >>>> 8/24 20:49:04 IWD: /state/partition1/tmp/cithep230/execute/dir_25226 >>>> 8/24 20:49:04 Output file: >>>> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stdout >>>> 8/24 20:49:04 Error file: >>>> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stderr >>>> 8/24 20:49:10 Using wrapper >>>> /opt/condor/bin/condor_nfslite_job_wrapper.sh to exec >>>> Summer08-QCD_EMenriched_Pt30to80-IDEAL_V6_v1-32774-JobSpec.xml >>>> 8/24 20:49:10 Create_Process succeeded, pid=25229 >>>> 8/25 08:19:58 Process exited, pid=25229, status=0 >>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >>>> failure reading 5 bytes from unknown source. >>>> 8/25 08:19:58 IO: Failed to read packet header >>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >>>> 8/25 08:19:58 File transfer failed, forcing disconnect. >>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >>>> or for a reconnect attempt >>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0> >>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267> >>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267> >>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >>>> failure reading 5 bytes from unknown source. >>>> 8/25 08:19:58 IO: Failed to read packet header >>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >>>> 8/25 08:19:58 File transfer failed, forcing disconnect. >>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >>>> or for a reconnect attempt >>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0> >>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267> >>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267> >>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >>>> failure reading 5 bytes from unknown source. >>>> 8/25 08:19:58 IO: Failed to read packet header >>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >>>> 8/25 08:19:58 File transfer failed, forcing disconnect. >>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >>>> or for a reconnect attempt >>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0> >>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267> >>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267> >>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >>>> failure reading 5 bytes from unknown source. >>>> 8/25 08:19:58 IO: Failed to read packet header >>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >>>> 8/25 08:19:58 File transfer failed, forcing disconnect. >>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >>>> or for a reconnect attempt >>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0> >>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267> >>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267> >>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming >>>> failure reading 5 bytes from unknown source. >>>> 8/25 08:19:58 IO: Failed to read packet header >>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156. >>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire >>>> or for a reconnect attempt >>>> 8/25 08:19:58 Got SIGQUIT. Performing fast shutdown. >>>> 8/25 08:19:58 ShutdownFast all jobs. >>>> 8/25 08:19:58 Result of "get_usage" operation from ProcD: ERROR: No >>>> family with the given PID is registered >>>> 8/25 08:19:58 error getting family usage in VanillaProc::PublishUpdateAd() >>>> 8/25 08:19:58 condor_write(): Socket closed when trying to write 67 >>>> bytes to <10.255.255.216:43187>, fd is 5 >>>> 8/25 08:19:58 Buf::write(): condor_write() failed >>>> 8/25 08:19:58 Failed to send job exit status to shadow >>>> 8/25 08:19:58 JobExit() failed, waiting for job lease to expire or for a >>>> reconnect attempt >>>> 8/25 08:19:58 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0 >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> _______________________________________________ >>>> Condor-users mailing list >>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a >>>> subject: Unsubscribe >>>> You can also unsubscribe by visiting >>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users >>>> >>>> The archives can be found at: >>>> https://lists.cs.wisc.edu/archive/condor-users/ >>>> >>>> >>> _______________________________________________ >>> Condor-users mailing list >>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a >>> subject: Unsubscribe >>> You can also unsubscribe by visiting >>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users >>> >>> The archives can be found at: >>> https://lists.cs.wisc.edu/archive/condor-users/ >>> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Condor-users mailing list >> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a >> subject: Unsubscribe >> You can also unsubscribe by visiting >> https://lists.cs.wisc.edu/mailman/listinfo/condor-users >> >> The archives can be found at: >> https://lists.cs.wisc.edu/archive/condor-users/ >> > _______________________________________________ > Condor-users mailing list > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a > subject: Unsubscribe > You can also unsubscribe by visiting > https://lists.cs.wisc.edu/mailman/listinfo/condor-users > > The archives can be found at: > https://lists.cs.wisc.edu/archive/condor-users/
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature