Mailing List Archives
Authenticated access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Condor-users] log file indicates termination of job, but output file is empty !?!
- Date: Wed, 24 Nov 2010 20:47:13 -0800 (PST)
- From: Rob <spamrefuse@xxxxxxxxx>
- Subject: Re: [Condor-users] log file indicates termination of job, but output file is empty !?!
Edier Alberto Zapata Hernández wrote:
> On 11/24/10, Rob <spamrefuse@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> I am testing my condor pool by sending a large amount of jobs to it:
>>
>> #---- Condor submit file
>> Universe = Vanilla
>> Executable = sleeper.exe
>> should_transfer_files = YES
>> when_to_transfer_output = ON_EXIT
>>
>> Requirements = (target.Arch == "INTEL") && (target.OpSys == "WINNT51")
>>
>> output = 005/$(Cluster)_$(PROCESS).out
>> log = 005/$(Cluster)_$(PROCESS).log
>> log_xml = true
>>
>> arguments = "5"
>> Queue 15000
>> #----
>>
>>
>> The 'arguments = "5"' tells the sleeper.exe to sleep for 5 minutes, so I
>> know
>> that this job will run for close to 5 minutes on a pool PC.
>>
>> Most of the jobs complete nicely, giving the report in the .log file and its
>> output in the .out file.
>>
>> However, some jobs indicate that they have completed (see below), but the
>> output
>> file remains empty.
>> Notice that the "SentBytes" and "TotalSentBytes" at the end of the log file
>> are
>> both zero in this case!
>>
>> Any idea why and how this happens?
>> Should I investigate further? If yes, how?
>
> Good morning Rob,
> I recommend add the Error variable too, maybe there's something wrong
> in the execute nodes, and the Standard output don't catch it.
> You can add a line like this:
> error = 005/$(Cluster)_$(PROCESS).error
> and test again.
Thank you. I have tested again using the error entry in the submit file.
I found that both the error and output files are empty when this problem
re-occured.
The log file tells me that the job has terminated.
This is what I think is relevant in the log files:
(Notice the exit status "-1073741502"; what does that mean?)
ShadowLog:11/25 12:55:13 Initializing a VANILLA shadow for job 322.1464
ShadowLog:11/25 12:55:13 (322.1464) (19248): Request to run on slot1@34-5
<115.125.128.213:1045> was ACCEPTED
SchedLog:11/25 12:55:13 (pid:2043) Starting add_shadow_birthdate(322.1464)
SchedLog:11/25 12:55:13 (pid:2043) Started shadow for job 322.1464 on slot1@34-5
<115.125.128.213:1045> for spamrefuse@xxxxxxxxxxxxxx, (shadow pid = 19248)
ShadowLog:11/25 12:55:20 (322.1464) (19248): Job 322.1464 terminated: exited
with status -1073741502
ShadowLog:11/25 12:55:20 (322.1464) (19248): **** condor_shadow (condor_SHADOW)
pid 19248 EXITING WITH STATUS 100
SchedLog:11/25 12:55:20 (pid:2043) Shadow pid 19248 for job 322.1464 exited with
status 100
Here are some more details of what I have done:
$ cat 010.condor
Universe = Vanilla
Executable = sleeper.exe
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Requirements = (target.Arch == "INTEL") && (target.OpSys == "WINNT51")
output = 010/$(Cluster)_$(PROCESS).out
error = 010/$(Cluster)_$(PROCESS).err
log = 010/$(Cluster)_$(PROCESS).log
log_xml = true
arguments = "10"
Queue 10000
$ condor_submit 010.condor
[...snip...and wait for some time....]
$ ls -1s 010/322_1464.*
0K 010/322_1464.err
8K 010/322_1464.log
0K 010/322_1464.out
$ cat 010/322_1464.log
<c>
<a n="MyType"><s>SubmitEvent</s></a>
<a n="EventTypeNumber"><i>0</i></a>
<a n="EventTime"><s>2010-11-25T08:45:51</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="SubmitHost"><s><115.145.140.71:60614></s></a>
</c>
<c>
<a n="MyType"><s>ExecuteEvent</s></a>
<a n="EventTypeNumber"><i>1</i></a>
<a n="EventTime"><s>2010-11-25T12:05:21</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="ExecuteHost"><s><115.145.228.233:1048></s></a>
</c>
<c>
<a n="MyType"><s>JobImageSizeEvent</s></a>
<a n="EventTypeNumber"><i>6</i></a>
<a n="EventTime"><s>2010-11-25T12:05:29</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="Size"><i>756</i></a>
</c>
<c>
<a n="MyType"><s>JobSuspendedEvent</s></a>
<a n="EventTypeNumber"><i>10</i></a>
<a n="EventTime"><s>2010-11-25T12:08:39</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="NumberOfPIDs"><i>1</i></a>
</c>
<c>
<a n="MyType"><s>JobUnsuspendedEvent</s></a>
<a n="EventTypeNumber"><i>11</i></a>
<a n="EventTime"><s>2010-11-25T12:13:44</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
</c>
<c>
<a n="MyType"><s>JobEvictedEvent</s></a>
<a n="EventTypeNumber"><i>4</i></a>
<a n="EventTime"><s>2010-11-25T12:13:44</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="Checkpointed"><b v="f"/></a>
<a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="SentBytes"><r>0.000000000000000E+00</r></a>
<a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
<a n="TerminatedAndRequeued"><b v="f"/></a>
<a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
<a n="MyType"><s>ExecuteEvent</s></a>
<a n="EventTypeNumber"><i>1</i></a>
<a n="EventTime"><s>2010-11-25T12:14:16</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="ExecuteHost"><s><115.145.228.111:1047></s></a>
</c>
<c>
<a n="MyType"><s>JobSuspendedEvent</s></a>
<a n="EventTypeNumber"><i>10</i></a>
<a n="EventTime"><s>2010-11-25T12:23:37</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="NumberOfPIDs"><i>1</i></a>
</c>
<c>
<a n="MyType"><s>JobUnsuspendedEvent</s></a>
<a n="EventTypeNumber"><i>11</i></a>
<a n="EventTime"><s>2010-11-25T12:28:42</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
</c>
<c>
<a n="MyType"><s>JobEvictedEvent</s></a>
<a n="EventTypeNumber"><i>4</i></a>
<a n="EventTime"><s>2010-11-25T12:28:42</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="Checkpointed"><b v="f"/></a>
<a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="SentBytes"><r>0.000000000000000E+00</r></a>
<a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
<a n="TerminatedAndRequeued"><b v="f"/></a>
<a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
<a n="MyType"><s>ExecuteEvent</s></a>
<a n="EventTypeNumber"><i>1</i></a>
<a n="EventTime"><s>2010-11-25T12:29:01</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="ExecuteHost"><s><115.145.228.116:1046></s></a>
</c>
<c>
<a n="MyType"><s>JobSuspendedEvent</s></a>
<a n="EventTypeNumber"><i>10</i></a>
<a n="EventTime"><s>2010-11-25T12:38:48</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="NumberOfPIDs"><i>1</i></a>
</c>
<c>
<a n="MyType"><s>JobUnsuspendedEvent</s></a>
<a n="EventTypeNumber"><i>11</i></a>
<a n="EventTime"><s>2010-11-25T12:43:53</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
</c>
<c>
<a n="MyType"><s>JobEvictedEvent</s></a>
<a n="EventTypeNumber"><i>4</i></a>
<a n="EventTime"><s>2010-11-25T12:43:53</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="Checkpointed"><b v="f"/></a>
<a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="SentBytes"><r>0.000000000000000E+00</r></a>
<a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
<a n="TerminatedAndRequeued"><b v="f"/></a>
<a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
<a n="MyType"><s>ExecuteEvent</s></a>
<a n="EventTypeNumber"><i>1</i></a>
<a n="EventTime"><s>2010-11-25T12:44:03</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="ExecuteHost"><s><115.145.228.94:1050></s></a>
</c>
<c>
<a n="MyType"><s>JobSuspendedEvent</s></a>
<a n="EventTypeNumber"><i>10</i></a>
<a n="EventTime"><s>2010-11-25T12:50:06</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="NumberOfPIDs"><i>1</i></a>
</c>
<c>
<a n="MyType"><s>JobUnsuspendedEvent</s></a>
<a n="EventTypeNumber"><i>11</i></a>
<a n="EventTime"><s>2010-11-25T12:55:11</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
</c>
<c>
<a n="MyType"><s>JobEvictedEvent</s></a>
<a n="EventTypeNumber"><i>4</i></a>
<a n="EventTime"><s>2010-11-25T12:55:11</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="Checkpointed"><b v="f"/></a>
<a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="SentBytes"><r>0.000000000000000E+00</r></a>
<a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
<a n="TerminatedAndRequeued"><b v="f"/></a>
<a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
<a n="MyType"><s>ExecuteEvent</s></a>
<a n="EventTypeNumber"><i>1</i></a>
<a n="EventTime"><s>2010-11-25T12:55:19</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="ExecuteHost"><s><115.145.228.213:1045></s></a>
</c>
<c>
<a n="MyType"><s>JobTerminatedEvent</s></a>
<a n="EventTypeNumber"><i>5</i></a>
<a n="EventTime"><s>2010-11-25T12:55:20</s></a>
<a n="Cluster"><i>322</i></a>
<a n="Proc"><i>1464</i></a>
<a n="Subproc"><i>0</i></a>
<a n="TerminatedNormally"><b v="t"/></a>
<a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="TotalLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="TotalRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="SentBytes"><r>0.000000000000000E+00</r></a>
<a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
<a n="TotalSentBytes"><r>0.000000000000000E+00</r></a>
<a n="TotalReceivedBytes"><r>1.186800000000000E+05</r></a>
</c>
Notice that "SentBytes" and "TotalReceivedBytes" are both zero; what does that
mean?
Thank you!
Rob.