Re: [Condor-users] Logging what compute node a job executed/failed on
- Date: Thu, 26 Oct 2006 21:16:21 +0100
- From: "Shaun J. O'Callaghan" <Shaun.OCallaghan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Logging what compute node a job executed/failed on
Cheers Dan,
Shaun
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: 26 October 2006 16:08
To: Condor-Users Mail List
Subject: Re: [Condor-users] Logging what compute node a job executed/failed on
The schedd history file (in your SPOOL directory) contains a record of
completed jobs, including LastRemoteHost. You can either scan through
this file with your own script, or you can run queries with
condor_history. Example:
condor_history -format "%s" ClusterId -format ".%s" ProcId -format " %s\n" LastRemoteHost
If you do use condor_history, be aware that it is much more efficient to
run one big bulk query than to run condor_history individually for a
long list of jobs. Also be aware that the history file may be
periodically rotated, depending on your configuration.
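
As a rough illustration of the "one big bulk query" approach, here is a minimal Python sketch. It assumes condor_history is on your PATH and that each output line looks like "cluster.proc host", which is what the example format string above should produce. The function names (parse_history_output, last_remote_hosts, hosts_for) are made up for this sketch and are not part of Condor.

```python
import subprocess

# Same query as the condor_history example above, written as an argument
# list so no shell quoting is needed. The raw string keeps the literal
# backslash-n, exactly as the shell would pass it inside double quotes.
HISTORY_CMD = [
    "condor_history",
    "-format", "%s", "ClusterId",
    "-format", ".%s", "ProcId",
    "-format", r" %s\n", "LastRemoteHost",
]


def parse_history_output(text):
    """Map 'cluster.proc' job ids to the host reported on each line."""
    hosts = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines and lines without a host field
        job_id, host = parts
        hosts[job_id] = host.strip()
    return hosts


def last_remote_hosts():
    """Run a single bulk condor_history query and parse its output."""
    result = subprocess.run(
        HISTORY_CMD, capture_output=True, text=True, check=True
    )
    return parse_history_output(result.stdout)


def hosts_for(job_ids, hosts):
    """Look up many job ids in the mapping; unknown ids map to None."""
    return {job_id: hosts.get(job_id) for job_id in job_ids}
```

The idea is to call last_remote_hosts() once, then use hosts_for() on your list of failed job ids, instead of invoking condor_history once per job. Jobs that have been rotated out of the history file will simply come back as None.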
--Dan
Shaun J. O'Callaghan wrote:
>
> Is there a way to get a little more information about Condor jobs, such
> as where they ran and exactly what happened, other than having a
> separate log file for each job, e.g.
>
> Log = log_$(PROCESS).log
>
> In the submit file?
>
> There's an issue when we're submitting 1000+ jobs and we need to know
> which ones failed, and where they executed. We can of course get the
> failures via the return codes and error output but it would be helpful
> to know exactly where this job executed. All we have at the minute is
>
> 001 (021.000.000) 09/29 09:58:54 Job executing on host: <xxx.xxx.xxx.xxx:1104>
>
> And while this is useful, it would be helpful to have the execute node
> actually in the following:
>
> 005 (021.000.000) 09/29 09:58:55 Job terminated.
> (0) Abnormal termination (signal 53)
> (0) No core file
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> 0 - Run Bytes Sent By Job
> 384684 - Run Bytes Received By Job
> 0 - Total Bytes Sent By Job
> 384684 - Total Bytes Received By Job
> .
>
> Rather than just the job id. E.g. what about:
>
> 005 (021.000.000) 09/29 09:58:55 Job terminated (after executing on node xxx.xxx.xxx.xxx)
>
> This probably seems trivial, but if anyone can suggest other methods
> I'd be more than happy to hear them.
>
> Kind Regards,
>
> Shaun
>
> ------------------------------------------------------------------------
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at either
> https://lists.cs.wisc.edu/archive/condor-users/
> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR