Mailing List Archives
RE: [condor-users] Condor falling over overnight
- Date: Fri, 6 Feb 2004 15:58:01 -0800
- From: "Simon Hoyle" <shoyle@xxxxxxxxx>
- Subject: RE: [condor-users] Condor falling over overnight
We are also getting the error message about 'GetCursorPos' - described in
http://www.cs.wisc.edu/~lists/archive/condor-users/msg00521.html
Could this be causing the problem?
The above message says this problem should be fixed in release 6.6.1. Any
idea when this will be released?
Until it is released, would an effective workaround be to install version
6.4.7 instead?
Thanks,
Simon
Simon Hoyle,
Inter-American Tropical Tuna Commission
Scripps Institute of Oceanography
8604 La Jolla Shores Drive, La Jolla, CA 92037, USA
Tel: (858) 546-7027 Fax: (858) 546-7133
-----Original Message-----
From: owner-condor-users@xxxxxxxxxxx [mailto:owner-condor-users@xxxxxxxxxxx]
On Behalf Of Simon Hoyle
Sent: Friday, February 06, 2004 1:03 PM
To: condor-users@xxxxxxxxxxx
Subject: [condor-users] Condor falling over overnight
Hi,
We are having some - probably very basic - problems getting Condor
running at our site. This is the first time I have tried to set up
Condor. The OS is Windows XP Professional.
Short jobs that run during the day have worked, but longer overnight
jobs are failing. The following appears regularly in the StarterLog on
each machine in the pool (a shutdown about 15 minutes after the process
is created), and I don't know why.
1/20 23:28:42 File transfer completed successfully.
1/20 23:28:43 Starting a VANILLA universe job with ID: 4.1
1/20 23:28:43 IWD: C:\Condor/execute\dir_3892
1/20 23:28:43 Output file: C:\Condor/execute\dir_3892\test.out
1/20 23:28:43 Renice expr "10" evaluated to 10
1/20 23:28:43 About to exec C:\WINDOWS\System32\cmd.exe /Q /C condor_exec.bat 1
1/20 23:28:43 Create_Process succeeded, pid=1304
1/20 23:43:46 Got SIGQUIT. Performing fast shutdown.
1/20 23:43:46 ShutdownFast all jobs.
1/20 23:44:42 Got SIGTERM. Performing graceful shutdown.
1/20 23:44:42 ShutdownGraceful all jobs.
1/20 23:44:46 Our Parent process (pid 1780) exited; shutting down
1/20 23:44:46 Process exited, pid=1304, status=0
1/20 23:44:46 condor_write(): send() returned -1, timeout=300, errno=10054. Assuming failure.
1/20 23:44:46 Buf::write(): condor_write() failed
1/20 23:44:46 ERROR "Assertion ERROR on (result)" at line 266 in file ..\src\condor_starter.V6.1\NTsenders.C
1/20 23:44:46 ShutdownFast all jobs.
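To check the "15 minutes" figure against the excerpt above, here is a minimal Python sketch. It is not part of Condor: the function names are made up for illustration, and the year is an assumption, since the log timestamps carry only month and day.

```python
from datetime import datetime

# Three lines copied from the StarterLog excerpt above.
STARTER_LOG = """\
1/20 23:28:43 Create_Process succeeded, pid=1304
1/20 23:43:46 Got SIGQUIT. Performing fast shutdown.
1/20 23:44:42 Got SIGTERM. Performing graceful shutdown.
"""

ASSUMED_YEAR = 2004  # assumption: the log lines themselves contain no year


def parse_stamp(line):
    """Parse the 'M/D HH:MM:SS' prefix of a log line into a datetime."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(
        f"{ASSUMED_YEAR} {date_part} {time_part}", "%Y %m/%d %H:%M:%S"
    )


def seconds_between(log_text, start_marker, end_marker):
    """Seconds from the first start_marker line to the next end_marker line."""
    start = end = None
    for line in log_text.splitlines():
        if start is None and start_marker in line:
            start = parse_stamp(line)
        elif start is not None and end_marker in line:
            end = parse_stamp(line)
            break
    if start is None or end is None:
        raise ValueError("marker not found in log text")
    return (end - start).total_seconds()


gap = seconds_between(STARTER_LOG, "Create_Process succeeded", "Got SIGQUIT")
minutes, seconds = divmod(int(gap), 60)
print(f"{gap:.0f} s ({minutes} min {seconds} s)")
```

Running the same function over a whole StarterLog would show whether the gap is the same on every machine, which would suggest a timer or policy rather than a crash in the job itself.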
The following also appears regularly in the ShadowLog:
1/29 03:27:24 ******************************************************
1/29 03:27:24 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/29 03:27:24 ** $CondorVersion: 6.6.0 Nov 24 2003 $
1/29 03:27:24 ** $CondorPlatform: INTEL-WINNT40 $
1/29 03:27:24 ** PID = 3928
1/29 03:27:24 ******************************************************
1/29 03:27:24 Using config file: C:\Condor\condor_config
1/29 03:27:24 Using local config files: C:\Condor/condor_config.local
1/29 03:27:24 DaemonCore: Command Socket at <192.168.0.74:3222>
1/29 03:27:25 Initializing a VANILLA shadow
1/29 03:27:25 (5.1) (3928): Request to run on <192.168.0.74:1033> was ACCEPTED
1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:26 ******************************************************
1/29 03:27:26 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/29 03:27:26 ** $CondorVersion: 6.6.0 Nov 24 2003 $
1/29 03:27:26 ** $CondorPlatform: INTEL-WINNT40 $
1/29 03:27:26 ** PID = 3172
1/29 03:27:26 ******************************************************
1/29 03:27:26 Using config file: C:\Condor\condor_config
1/29 03:27:26 Using local config files: C:\Condor/condor_config.local
1/29 03:27:26 DaemonCore: Command Socket at <192.168.0.74:3239>
1/29 03:27:27 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:27 Initializing a VANILLA shadow
1/29 03:27:27 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:28 (5.2) (3172): Request to run on <192.168.0.36:2603> was ACCEPTED
1/29 03:27:28 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:28 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:28 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:31 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
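A rough way to see how the account-lookup failures are spread across jobs is to tally them per job ID and shadow PID. The Python sketch below is only an illustration: the regular expression and the function name are my own, and the sample lines are copied from the excerpt above with the mail-wrapped continuations rejoined. In the standard Windows system error table, 1722 is RPC_S_SERVER_UNAVAILABLE ("The RPC server is unavailable"); whether that is what Condor reports here is an assumption.

```python
import re
from collections import Counter

# Sample lines from the ShadowLog excerpt above (wrapped lines rejoined).
SHADOW_LOG = """\
1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.0) (2384): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:25 (5.1) (3928): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
1/29 03:27:28 (5.2) (3172): Request to run on <192.168.0.36:2603> was ACCEPTED
1/29 03:27:28 (5.2) (3172): perm::init: Lookup Account Name shoyle failed (err=1722), using Everyone
"""

# Matches "(job) (pid): perm::init: Lookup Account Name <name> failed (err=N)".
PERM_FAILURE = re.compile(
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): perm::init: "
    r"Lookup Account Name (?P<account>\S+) failed \(err=(?P<err>\d+)\)"
)


def tally_failures(log_text):
    """Count perm::init lookup failures keyed by (job ID, shadow PID, error code)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PERM_FAILURE.search(line)
        if match:
            counts[(match["job"], match["pid"], match["err"])] += 1
    return counts


failures = tally_failures(SHADOW_LOG)
for (job, pid, err), n in sorted(failures.items()):
    print(f"job {job} (shadow pid {pid}): {n} lookup failure(s), err={err}")
```

If every job shows the same error code, the cause is more likely the submit machine's account lookup than anything in an individual job.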
We're also seeing all of the Condor daemons exiting on the central
manager overnight whenever a large job is submitted.
Messages 597 and 137 on this list also had (err=1722), but the list has
no information about how the problems were resolved.
I sent a query about this problem to condor-admin over a week ago, but
have had no reply apart from the automatic one.
Hope someone can help, thanks,
Simon
Simon Hoyle,
Inter-American Tropical Tuna Commission
Scripps Institute of Oceanography
8604 La Jolla Shores Drive, La Jolla, CA 92037, USA
Tel: (858) 546-7027 Fax: (858) 546-7133
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>