[Condor-users] I'm having shadow exceptions also
- Date: Thu, 16 Feb 2006 12:07:59 -0600 (CST)
- From: Beverly Seavey <seavey@xxxxxxxxxxx>
- Subject: [Condor-users] I'm having shadow exceptions also
Below are the tails of the logs for a couple of my running jobs. The run
time keeps going up, but the output file hasn't grown for a day or
two.
_____________________________________________________
007 (4275.000.000) 02/14 05:18:36 Shadow exception!
ckpt server store failed
111447288 - Run Bytes Sent By Job
12501003 - Run Bytes Received By Job
...
001 (4275.000.000) 02/14 05:19:07 Job executing on host:
<144.92.73.157:9699>
...
006 (4275.000.000) 02/14 08:19:01 Image size of job updated: 19250
...
==> columnsD16-17_L8-9_NT9_S1_4x100fairCloseStay45_60Psign.log <==
...
007 (4276.000.000) 02/14 05:18:47 Shadow exception!
ckpt server store failed
46565200 - Run Bytes Sent By Job
12500685 - Run Bytes Received By Job
...
001 (4276.000.000) 02/14 05:19:14 Job executing on host:
<144.92.73.157:9699>
...
006 (4276.000.000) 02/14 08:19:14 Image size of job updated: 16070
...
________________________________________________________________
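For anyone who wants to scan logs like these for the same pattern, here is a minimal Python sketch that picks out the event header lines and counts them by event number. It assumes only the layout visible in the excerpts above (event number, job id in parentheses, date and time, then text); the names `EVENT_RE` and `parse_events` are made up for this illustration and are not part of Condor.

```python
import re
from collections import Counter

# Event header line as it appears in the excerpts above (an assumption
# based only on those excerpts):
#   <event number> (<cluster>.<proc>.<subproc>) <MM/DD HH:MM:SS> <text>
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)

def parse_events(text):
    """Return a list of (event_number, job_id, timestamp, message) tuples."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            code, cluster, proc, _sub, stamp, msg = m.groups()
            events.append((code, f"{int(cluster)}.{int(proc)}", stamp, msg))
    return events

# Sample copied from the second log tail above.
SAMPLE = """\
007 (4276.000.000) 02/14 05:18:47 Shadow exception!
ckpt server store failed
46565200 - Run Bytes Sent By Job
12500685 - Run Bytes Received By Job
...
001 (4276.000.000) 02/14 05:19:14 Job executing on host:
<144.92.73.157:9699>
...
006 (4276.000.000) 02/14 08:19:14 Image size of job updated: 16070
...
"""

events = parse_events(SAMPLE)
counts = Counter(code for code, _job, _stamp, _msg in events)

for code, job, stamp, msg in events:
    print(code, job, stamp, msg)
```

Detail lines such as "ckpt server store failed" do not match the header pattern and are skipped, so this only gives a timeline of events per job, not the reason for each one.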
Is this a sign that my job is too big? I noticed the image size keeps
getting updated. Or is this something that just happens sometimes?
What other parameters can I look at to figure out what is going on?
condor_q -analyze gives
__________________________________________________________________
---
4276.000: Request is being serviced
---
4277.000: Request is being serviced
---
4278.000: Run analysis summary. Of 113 machines,
73 are rejected by your job's requirements
21 reject your job because of their own requirements
8 match, but are serving users with a better priority in the pool
11 match, but reject the job for unknown reasons
0 match, but will not currently preempt their existing job
0 are available to run your job
Last successful match: Thu Feb 16 11:56:47 2006
______________________________________________________________
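As a sanity check on the summary for 4278.000, the per-reason counts can be added up and compared with the machine total. This is only a rough Python sketch over the text pasted above; `parse_summary` is a hypothetical helper written for this post, not a Condor tool.

```python
import re

# Summary text copied from the condor_q -analyze output above.
SUMMARY = """\
4278.000: Run analysis summary. Of 113 machines,
73 are rejected by your job's requirements
21 reject your job because of their own requirements
8 match, but are serving users with a better priority in the pool
11 match, but reject the job for unknown reasons
0 match, but will not currently preempt their existing job
0 are available to run your job
"""

def parse_summary(text):
    """Return (total_machines, [(count, reason), ...]) from the summary text."""
    total = None
    buckets = []
    for line in text.splitlines():
        line = line.strip()
        m = re.search(r"Of (\d+) machines", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.match(r"(\d+) (.+)$", line)
        if m:
            buckets.append((int(m.group(1)), m.group(2)))
    return total, buckets

total, buckets = parse_summary(SUMMARY)
accounted = sum(count for count, _reason in buckets)
rejected_by_job = buckets[0][0]

print(f"{accounted} of {total} machines accounted for")
print(f"{rejected_by_job / total:.0%} rejected by the job's own requirements")
```

The counts add up to the full 113 machines, and the largest single group is the 73 machines rejected by the job's own requirements, which is why I am asking about them below.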
Is there something I can change to make my requirements more generic,
so that they reject fewer machines?
My submit description file is:
___________________________________
########################
# Submit description file for columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60Psign program
########################
Executable = columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60PsignC
Requirements = OpSys =!= UNDEFINED
notification = Always
notify_user = seavey@xxxxxxxxxxx
Universe = standard
Output = columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60Psign.out
input = post-genM_D16-17_L8-9_NT9_S10fairCloseStay45_60P.params
Log = columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60Psign.log
error = columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60Psign.error
Queue
_____________________________________