Re: [Condor-users] job running on two hosts?
- Date: Wed, 17 Nov 2004 09:49:10 -0500
- From: Dan Christensen <jdc@xxxxxx>
- Subject: Re: [Condor-users] job running on two hosts?

Dan Christensen <jdc@xxxxxx> writes:

> I'm running a standard universe job, and this is what the log file
> says:
>
> 000 (024.006.000) 11/15 22:57:33 Job submitted from host: <129.100.75.77:9657>
> ...
> 001 (024.006.000) 11/16 00:05:40 Job executing on host: <129.100.75.77:9668>
> ...
> 001 (024.006.000) 11/16 02:13:35 Job executing on host: <129.100.75.60:9622>
> ...
> 005 (024.006.000) 11/16 02:13:35 Job terminated.
> (1) Normal termination (return value 1)
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> 1169 - Run Bytes Sent By Job
> 1751048 - Run Bytes Received By Job
> 1169 - Total Bytes Sent By Job
> 1751048 - Total Bytes Received By Job
> ...
>
> There's no explanation of why the job was rerun on the second host.
> [Is this a bug in the logging?]
>
> And when it ran the second time, it seemed to start at the beginning,
> because it tried to open its output file, and it noticed that it
> already existed and quit right away.
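
The gap in the log quoted above (two "Job executing" events in a row with nothing logged between them) can be spotted mechanically. What follows is only a rough sketch, assuming every event header in the user log has the same shape as the ones shown above (three-digit event number, job id in parentheses, date, time, message). The name find_unexplained_restarts is made up for illustration and is not part of Condor; the only event number it relies on is 001, taken from the excerpt.

```python
import re

# Matches event header lines such as:
#   001 (024.006.000) 11/16 00:05:40 Job executing on host: <129.100.75.77:9668>
# The layout is an assumption based on the log excerpt in this message.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)

EXECUTE = "001"  # event number shown next to "Job executing on host" above


def find_unexplained_restarts(lines):
    """Return (job_id, first_start, second_start) for every pair of
    back-to-back execute events with no other event logged in between."""
    last_event = {}  # job id -> (event number, timestamp) of the latest event
    restarts = []
    for raw in lines:
        # Tolerate mail-style quoting ("> ") in front of each line.
        line = re.sub(r"^>\s?", "", raw).rstrip()
        match = EVENT_RE.match(line)
        if match is None:
            continue  # "..." separators and indented detail lines
        code, job_id, stamp, _message = match.groups()
        previous = last_event.get(job_id)
        if code == EXECUTE and previous is not None and previous[0] == EXECUTE:
            restarts.append((job_id, previous[1], stamp))
        last_event[job_id] = (code, stamp)
    return restarts
```

Fed the lines of a user log file (for example via open(path) on whatever log file the submit description names), it returns one tuple per suspicious restart; an empty list means every execute event was preceded by some other logged event.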

Here's another clue I just found: I got an e-mail from Condor saying
that condor_schedd died on 129.100.75.77 due to a SEGV. I guess that
would explain the missing information in the user log file.

> Date: Tue, 16 Nov 2004 02:11:34 -0500
>
> "/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
> Condor will automatically restart this process in 10 seconds.

But now the question is, why did it die?

Dan