Re: [condor-users] Bad Condor jobs killing GUI apps
- Date: Thu, 08 Apr 2004 10:11:53 +0200
- From: Alain EMPAIN <alain.empain@xxxxxxxxx>
- Subject: Re: [condor-users] Bad Condor jobs killing GUI apps
I wonder whether this kind of behavior has also been seen in Linux pools?
Perhaps XP is to blame rather than Condor?
Cheers,
Alain
On Thu, 2004-04-08 at 09:59, Chris Tottle wrote:
> I have also noticed something very similar, and this has happened to us on
> 6.6.0, 6.6.1 and 6.6.2 running on Windows XP Pro.
>
> The w/s in question were all installed using the install program, and the user
> is allowed to both run jobs on the w/s and submit from that w/s.
>
> Since we are still evaluating Condor, the machine was set up to always run
> Condor and process jobs.
>
> The following has happened to several users...
>
> The user has logged onto the w/s, started Word or something, and then gone to a
> meeting (leaving themselves logged in). During the meeting Condor starts
> processing jobs on the w/s.
>
> When the user comes back to use the w/s and moves the mouse, the w/s has
> "crashed". All of the icons have vanished from the System Tray, and the w/s has
> lost network connectivity.
>
> This seems to be an intermittent fault, and like Andy, we are also unable to
> reproduce the error.
>
> My StarterLog file looks very similar to the one that Andy posted.
>
> Regards,
>
>
> Chris Tottle
> ISG Windows Development (Team Leader)
> INFOS
> Cardiff University
> 39 - 41 Park Place
> Cardiff
> CF10 3BB
>
> 029 20875221
>
> >>> agoar@xxxxxxxxxx 07/04/2004 15:48:20 >>>
> We have an issue with running Condor that has proven very difficult to
> debug.
>
> Our Condor pool consists of about 2000 desktop machines (all running
> Windows XP). Condor uses the machines when they are idle (mostly at
> night).
>
> We have received occasional reports from users that when they come in in
> the morning all (or almost all) of their GUI apps had been shut down.
> Users were reporting that if they disable Condor on their machine (thus
> removing the machine from the pool), then the problem would go away. At
> first we thought the GUI apps shutting down had nothing to do with
> Condor. But it has happened enough times, and we have now seen the
> behavior for ourselves, so we are convinced there is a link. One of our
> admins was standing by a PC in the pool, when all of a sudden all the
> GUI apps shut down. He looked at the Condor log files, and verified that
> a job had just finished running on the machine. The starter log file
> contains the following lines just before the GUI apps started shutting
> down:
>
> 4/6 08:39:21 ******************************************************
> 4/6 08:39:21 ** condor_starter (CONDOR_STARTER) STARTING UP
> 4/6 08:39:21 ** $CondorVersion: 6.4.7 Jan 27 2003 $
> 4/6 08:39:21 ** $CondorPlatform: INTEL-WINNT40 $
> 4/6 08:39:21 ** PID = 3236
> 4/6 08:39:21 ******************************************************
> 4/6 08:39:21 DaemonCore: Command Socket at <10.104.41.216:3239>
> 4/6 08:39:21 Submitting machine is "admin-srv50.micron.com"
> 4/6 08:39:21 entering init_user_ids()...watch out.
> 4/6 08:39:22 File transfer completed successfully.
> 4/6 08:39:23 Starting a VANILLA universe job.
> 4/6 08:39:23 Output file:
> C:\Progra~1\Condor/execute\dir_3236\admin-srv50_tppprod_21097_EngExt.bat
> out
> 4/6 08:39:23 Error file:
> C:\Progra~1\Condor/execute\dir_3236\admin-srv50_tppprod_21097_EngExt.bat
> err
> 4/6 08:39:23 About to exec C:\WINNT\System32\cmd.exe /Q /C condor_exec.bat
> 4/6 08:39:23 Create_Process succeeded, pid=3320
> 4/6 08:40:04 Job exited, pid=3320, status=0
> 4/6 08:40:06 Got SIGQUIT. Performing fast shutdown.
> 4/6 08:40:06 ShutdownFast all jobs.
> 4/6 08:40:06 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
>
> Can someone explain the "Got SIGQUIT..." line? What's a fast shutdown? Is
> this normal? Has anyone seen cases where the Condor starter daemon
> finishing a job affects the interactive apps running on the same
> machine?
>
> So far, we have not been able to reproduce the issue at will (although
> we are still trying). It does seem to be a specific job that causes this
> every time.
>
> Thanks.
>
> Andy Goar
> Middleware Group
> Micron Technology Inc.
> email: agoar@xxxxxxxxxx
> Phone: (208)368-3254
> Support: (208)368-4850
> "Three things are certain: Death, taxes, and lost data. Guess
> which has occurred?"
>
>
>
> Condor Support Information:
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> unsubscribe condor-users <your_email_address>
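Andy's admin confirmed the link by checking the log by hand. To compare user reports against Condor activity more systematically, one option is to scan the StarterLog for the events that appear in the excerpt quoted above. The following is a minimal sketch. It assumes the "M/D HH:MM:SS message" line format seen in that excerpt, with the "> " quote markers removed. The function name `starter_events`, the event labels, and the matched substrings are hypothetical choices made for illustration and are not part of Condor.

```python
import re
from datetime import datetime

# Assumed line format, taken from the quoted StarterLog: "4/6 08:39:21 message".
LINE_RE = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (.*)$")

# Hypothetical labels, keyed by substrings that appear in the quoted excerpt.
EVENTS = {
    "STARTING UP": "starter_start",
    "Create_Process succeeded": "job_start",
    "Job exited": "job_exit",
    "Got SIGQUIT": "fast_shutdown",
    "EXITING WITH": "starter_exit",
}


def starter_events(lines, year=2004):
    """Return (timestamp, label, message) tuples for recognised StarterLog lines.

    The log omits the year, so the caller supplies it.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip banners and wrapped continuation lines without a timestamp
        month, day, hh, mm, ss = (int(g) for g in m.groups()[:5])
        message = m.group(6)
        for needle, label in EVENTS.items():
            if needle in message:
                events.append((datetime(year, month, day, hh, mm, ss), label, message))
                break
    return events


if __name__ == "__main__":
    sample = [
        "4/6 08:39:21 ** condor_starter (CONDOR_STARTER) STARTING UP",
        "4/6 08:39:23 Create_Process succeeded, pid=3320",
        "4/6 08:40:04 Job exited, pid=3320, status=0",
        "4/6 08:40:06 Got SIGQUIT. Performing fast shutdown.",
        "4/6 08:40:06 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0",
    ]
    for ts, label, _ in starter_events(sample):
        print(ts.strftime("%m/%d %H:%M:%S"), label)
```

On the quoted excerpt, this shows the fast shutdown arriving two seconds after the job exited. Lining up those timestamps with the times users report losing their applications is one way to test the suspected link across many machines.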
--
------------------------------------------------------------
Dr Alain Empain <alain.empain@xxxxxxxxx> <alain@xxxxxxxxxx>
Bioinformatics, Molecular Genetics,
Fac. Med. Vet., University of Liège, Belgium
Bd de Colonster, B43 B-4000 Liège (Sart-Tilman)
WORK: +32 4 366 3821 FAX: +32 4 366 4122
HOME: rue des Martyrs, 7 B-4550 Nandrin
+32 85 51 23 41 GSM: +32 497 70 17 64