
RE: [Condor-users] Shadow exceptions on Window Machines



Alain,

Thanks,

Here are the contents of the StarterLog on the execute machine at the
time the shadow exception occurred:

3/24 03:13:43 STARTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
3/24 03:13:43 Process exited, pid=3456, status=0
3/24 03:13:43 in VanillaProc::JobCleanup()
3/24 03:13:43 ProcAPI: pid # 3456 was not found
3/24 03:13:43 ProcAPI: pid # 3032 was not found
3/24 03:13:43 ProcAPI: pid # 3456 was not found
3/24 03:13:43 Inside OsProc::JobCleanup()
3/24 03:13:43 TokenCache contents: 
condor-reuse-vm1@.
3/24 03:13:43 Reaper: all=1 handled=1 ShuttingDown=0
3/24 03:13:43 TokenCache contents: 
condor-reuse-vm1@.
3/24 03:13:43 entering FileTransfer::UploadFiles (final_transfer=1)
3/24 03:13:43 Sending changed file avv_ts000.dat, mod=1111620069, dow=1111620036
3/24 03:13:43 Sending changed file avv_ts001.dat, mod=1111623994, dow=1111620036

[Many additional "Sending changed file" lines omitted]

3/24 03:13:43 Sending changed file sumlayers009.dat, mod=1111655622, dow=1111620036
3/24 03:13:43 STARTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
3/24 03:13:43 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
3/24 03:13:43 FileTransfer::UploadFiles: sent TransKey=1#4241f9c01d32a73e28786e24
3/24 03:13:43 entering FileTransfer::Upload
3/24 03:13:43 entering FileTransfer::DoUpload
3/24 03:13:43 condor_write(): send() returned -1, timeout=0, errno=10054.  Assuming failure.
3/24 03:13:43 Buf::write(): condor_write() failed
3/24 03:13:43 ERROR "Assertion ERROR on (filetrans->UploadFiles(true, final_transfer))" at line 336 in file ..\src\condor_starter.V6.1\jic_shadow.C
3/24 03:13:43 ShutdownFast all jobs.
3/24 03:13:43 Got ShutdownFast when no jobs running.
3/24 03:13:43 Error disabling account condor-reuse-vm1 (ACCESS DENIED)
3/24 03:13:44 ******************************************************
3/24 03:13:44 ** condor_starter (CONDOR_STARTER) STARTING UP
3/24 03:13:44 ** C:\Condor\bin\condor_starter.exe
3/24 03:13:44 ** $CondorVersion: 6.6.7 Oct 14 2004 $
3/24 03:13:44 ** $CondorPlatform: INTEL-WINNT40 $
3/24 03:13:44 ** PID = 3120
3/24 03:13:44 ******************************************************
3/24 03:13:44 Using config file: C:\Condor\condor_config
3/24 03:13:44 Using local config files: C:\Condor/condor_config.local

The following lines seem to appear whenever a shadow exception is
encountered, but not for jobs that complete properly:

3/24 03:13:43 condor_write(): send() returned -1, timeout=0, errno=10054.  Assuming failure.
3/24 03:13:43 Buf::write(): condor_write() failed
3/24 03:13:43 ERROR "Assertion ERROR on (filetrans->UploadFiles(true, final_transfer))" at line 336 in file ..\src\condor_starter.V6.1\jic_shadow.C
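
For reference, errno=10054 is the Winsock code WSAECONNRESET (connection
reset by peer), which suggests the other end of the file transfer closed
the connection while the starter was still sending. Below is a minimal
Python sketch for pulling such codes out of a log excerpt. The lookup
table and the function name explain_errnos are illustrative assumptions,
not part of Condor, and the table is far from exhaustive.

```python
import re

# Small hand-written table of Winsock error codes (assumption: not exhaustive).
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (software caused connection abort)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

ERRNO_RE = re.compile(r"errno=(\d+)")


def explain_errnos(log_text):
    """Return (code, description) pairs for every errno=N found in log_text."""
    results = []
    for match in ERRNO_RE.finditer(log_text):
        code = int(match.group(1))
        results.append((code, WINSOCK_ERRORS.get(code, "unknown to this table")))
    return results


# The entry as it appears in the StarterLog above (wrapped by the mailer).
sample = (
    "3/24 03:13:43 condor_write(): send() returned -1, timeout=0,\n"
    "errno=10054.  Assuming failure.\n"
)
for code, meaning in explain_errnos(sample):
    print(code, meaning)
```

Because the search runs over the whole text rather than line by line, it
still finds the code when the mailer has wrapped the log entry.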

Any ideas?



Richard Dodge
Kimberly-Clark Corporation
2100 Winchester Rd.
Neenah, WI 54956
(920) 721-5134
Fax: (920) 721-7748
rdodge@xxxxxxx


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Alain Roy
Sent: Wednesday, March 30, 2005 1:29 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Shadow exceptions on Window Machines



>What are shadow exceptions and what can I do to avoid them?

The condor_shadow is a program that watches over a job. There is one
shadow per job, and it runs on the submission computer. When there is an
exception, there has been some sort of problem that prevents the shadow
from continuing. This could be anything from a permissions problem to a
programming error on our part.

The condor_starter is a program that watches over a job, but it runs on
the execution machine. It can also have an exception that causes your
job to fail.

>007 (3387.000.000) 03/24 03:13:43 Shadow exception!
>         Can no longer talk to condor_starter on execute machine
>(172.16.204.38)

Do two things:

1) Look in the ShadowLog for messages from around 3:13 and see what
error messages you have.

2) On the execution computer (172.16.204.38), look in the StarterLog for
messages around 3:13 and see what error messages you have.

One of these log files is likely to point the finger at the problem. If
it doesn't, we can increase the amount of debugging output in the log
files and try again.
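
The two log searches above can be sketched as a small Python filter.
This is only an illustration under stated assumptions: the function name
suspicious_lines, the keyword list, and the default year are all made up
for the example (the log timestamps carry no year), and nothing here is
a Condor tool.

```python
import re
from datetime import datetime, timedelta

# Log entries in this thread start with "M/D HH:MM:SS ".
STAMP_RE = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")
# Assumed keywords; adjust to taste.
KEYWORDS = ("error", "failed", "failure", "exception")


def suspicious_lines(lines, month, day, hhmm, window_minutes=2, year=2005):
    """Return log entries near the given time that contain an error keyword."""
    hour, minute = map(int, hhmm.split(":"))
    center = datetime(year, month, day, hour, minute)
    window = timedelta(minutes=window_minutes)

    # Group lines into entries; a line without a timestamp is treated as
    # the wrapped continuation of the previous entry.
    entries = []
    for line in lines:
        m = STAMP_RE.match(line)
        if m:
            mo, d, h, mi, s = map(int, m.groups())
            entries.append([datetime(year, mo, d, h, mi, s), line.rstrip("\n")])
        elif entries:
            entries[-1][1] += " " + line.strip()

    hits = []
    for stamp, text in entries:
        near = abs(stamp - center) <= window
        if near and any(k in text.lower() for k in KEYWORDS):
            hits.append(text)
    return hits


demo = [
    "3/24 03:13:43 Process exited, pid=3456, status=0\n",
    "3/24 03:13:43 condor_write(): send() returned -1, timeout=0,\n",
    "errno=10054.  Assuming failure.\n",
    "3/24 03:13:43 Buf::write(): condor_write() failed\n",
]
for hit in suspicious_lines(demo, 3, 24, "03:13"):
    print(hit)
```

To run it against a real file, pass an open file object (for example the
StarterLog on the execution machine, wherever your installation keeps
it) as the first argument in place of the demo list.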

You might ask--why do you have to go digging through log files in order
to find the problem? In some cases, we should have implemented a better
method of propagating errors to you via the user log file. In other
cases, it's really hard to figure out how to propagate the error
messages because of the nature of the problem. As we are able to improve
the error reporting, we do. Given the wide variety of problems that
occur, this is a hard job.

I hope this helps to understand the problem.

-alain



