For starters, you'll need to look at the StarterLog on the execute machine
to see why condor_starter died.
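
If you're not sure where that log lives, the STARTER_LOG configuration
variable on the execute node points to it (by default it sits in the
daemons' LOG directory, and on an SMP machine there is usually one per
slot, e.g. StarterLog.vm1, StarterLog.vm2). Something along these lines,
run on the execute machine, should show what the starter was doing when
the shadow lost contact; the path below is only a guess based on the LOG
directory visible in your ShadowLog excerpts:

  condor_config_val STARTER_LOG
  tail -n 200 /home/condor/log/StarterLog.vm4
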
Pasquale Tricarico wrote:
>Sorry for asking again, but does really nobody have any idea how to fix
>this problem?
>
>Thanks,
>Pasquale
>
>On 3/30/07, Pasquale Tricarico <tricaric@xxxxxxxxx> wrote:
>
>
>>Hi,
>>
>>We're still working on this problem.
>>
>>I just want to add that in the job log, we get this entry:
>>
>>...
>>007 (1454.000.000) 03/30 02:59:50 Shadow exception!
>> Can no longer talk to condor_starter <10.7.7.12:40786>
>> 1090105856 - Run Bytes Sent By Job
>> 151425472 - Run Bytes Received By Job
>>...
>>
>>for one of the crunching nodes. So why is condor_starter dying
>>on the remote node? Is there any configuration parameter that can be
>>used to prevent this?
>>
>>Thanks,
>>Pasquale
>>
>>
>>On 3/26/07, Pasquale Tricarico <tricaric@xxxxxxxxx> wrote:
>>
>>
>>>Hi,
>>>
>>><feel free to jump directly to the question at the end of the message>
>>>
>>>We have a problem with an MPI job on a system running:
>>>
>>>$CondorVersion: 6.8.2 Oct 12 2006 $
>>>$CondorPlatform: X86_64-LINUX_RHEL3 $
>>>
>>>Let me start by saying that I'm the admin of the cluster, so I'm
>>>debugging an application that is a gray box to me (though I can ask the
>>>user for details about it).
>>>
>>>Simply put, each job runs on a different node, communicates with the
>>>other jobs, and writes a large number of big files at the end of its
>>>execution. We are NOT using a shared file system (NFS), so all the
>>>generated files are copied back to the head node, which is used for
>>>submission only.
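>>>
>>>To make that concrete, the setup corresponds roughly to a submit file
>>>like this sketch (the executable names and node count are made up, not
>>>the real ones):
>>>
>>>---
>>># parallel-universe MPI job; output comes back via Condor file transfer
>>>universe = parallel
>>>executable = mp1script
>>>arguments = my_mpi_program
>>>machine_count = 5
>>>should_transfer_files = YES
>>>when_to_transfer_output = ON_EXIT
>>>log = mpi_job.log
>>>queue
>>>---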
>>>
>>>So the job with ID=0 completes its execution (usually it's the first to
>>>complete), and from the ShadowLog I see that all the files generated
>>>on that node are being copied to the head node correctly. Soon after,
>>>while the first node is still transferring files, two more nodes
>>>complete their jobs and start to transfer... and in the ShadowLog we get
>>>healthy-looking entries such as:
>>>
>>>---
>>>/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
>>>MpiResource::resourceExit()
>>>/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
>>>RemoteResource::resourceExit()
>>>/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): setting
>>>exit reason on vm4@xxxxxxxxxx to 100
>>>/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615):
>>>Resource vm4@xxxxxxxxxx changing state from EXECUTING to FINISHED
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Entering shutDownLogic(r=100)
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615): Normal exit
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm2@xxxxxxxxxxxxx FINISHED 100
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm2@xxxxxxxxxxxxx EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm4@xxxxxxxxxxxxx FINISHED 100
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm4@xxxxxxxxxxxxx EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm1@xxxxxxxxxxxxx EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm3@xxxxxxxxxxxxx EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm2@xxxxxxxxxxxxx EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm4@xxxxxxxxxxxxx EXECUTING -1
>>>---
>>>
>>>But then, suddenly, I get something like:
>>>
>>>---
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>entering FileTransfer::HandleCommands
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>FileTransfer::HandleCommands read transkey=5#4604675c407de5b75d04959c
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>entering FileTransfer::Download
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>entering FileTransfer::DoDownload sync=0
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>get_file(): going to write to filename /home/x.y/z.dat
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>get_file: Receiving 2267460 bytes
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>condor_read(): Socket closed when trying to read 65536 bytes from
>>><10.X.Y.13:53477>
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>ReliSock::get_bytes_nobuffer: Failed to receive file.
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>get_file: wrote 65536 bytes to file
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>get_file(): ERROR: received 65536 bytes, expected 2267460!
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>DoDownload: SHADOW at 10.X.Y.250 failed to receive file
>>>/home/x.y/z.dat
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>DoDownload: exiting at 1509
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>condor_read(): Socket closed when trying to read 5 bytes from
>>><10.X.Y.12:40786>
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): IO: EOF
>>>reading packet header
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): ERROR
>>>"Can no longer talk to condor_starter <10.X.Y.12:40786>" at line 123
>>>in file NTreceivers.C
>>>---
>>>
>>>and then the MPI job is completely restarted...
>>>
>>>So my questions are: Is it OK for Condor if only a fraction of the
>>>MPI jobs originally submitted are running at a given time? Or does
>>>Condor interpret that as a problem and automatically restart the
>>>entire MPI job? Should all the jobs complete their execution at pretty
>>>much the same time?
>>>
>>>Thank you very much for any insight you may provide on this problem.
>>>
>>>Pasquale
>>>
>>>
>>>
>>--
>>This space intentionally has nothing but text explaining why this
>>space has nothing but text explaining that this space would otherwise
>>have been left blank, and would otherwise have been left blank.
>>
>>
>>
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR