
[HTCondor-users] PeriodicRemove crashes schedd



I tried to use a PeriodicRemove expression in my job submit file:
 
PeriodicRemove =  ( JobStatus == 4 && JobWantsFileTransfer == false && JobWantsPostprocessing == false && JobWantsDeletion == true ) || ( JobStatus == 4 || JobStatus == 5) && (CurrentTime - EnteredCurrentStatus > 3600*24*10)
 
JobWantsFileTransfer, JobWantsPostprocessing, and JobWantsDeletion are custom attributes that I also set for the job in the submit file.
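
For readers checking the logic, the grouping of the expression above can be sketched in Python. This is only an illustration, not HTCondor code: the function name `periodic_remove` and the dictionary standing in for the job ad are hypothetical, and the sketch assumes every attribute is defined (it does not model ClassAd UNDEFINED semantics). As in ClassAds, `and` binds tighter than `or`, so the time check applies only to the second alternative. JobStatus 4 is Completed and 5 is Held.

```python
import time

# 3600*24*10 from the original expression: ten days in seconds.
TEN_DAYS = 3600 * 24 * 10


def periodic_remove(ad, now=None):
    """Evaluate the same condition as the PeriodicRemove expression.

    `ad` is a plain dict standing in for the job ad (an assumption of
    this sketch); `now` stands in for CurrentTime.
    """
    now = time.time() if now is None else now

    # First alternative: completed job with nothing left to do and
    # deletion requested.
    done_and_deletable = (
        ad["JobStatus"] == 4
        and ad["JobWantsFileTransfer"] is False
        and ad["JobWantsPostprocessing"] is False
        and ad["JobWantsDeletion"] is True
    )

    # Second alternative: completed or held for more than ten days.
    stale = (
        ad["JobStatus"] in (4, 5)
        and now - ad["EnteredCurrentStatus"] > TEN_DAYS
    )

    return done_and_deletable or stale
```

With a completed job whose three custom attributes are false, false, and true, the function returns True immediately, regardless of how long the job has been in that state. That matches the log further down, where the expression evaluates to TRUE for job 53.0.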
 
I submit jobs to a dedicated scheduler using remote submit:
 
condor_submit -remote ...
 
My client runs Windows 7 and the remote scheduler runs Windows Server 2008 R2.
 
Now, when the scheduler evaluates this expression to true, it reports that it is aborting the job and then crashes. The Master restarts it, but the restarted schedd is not able to reconnect to the Master, which makes the problem even worse. The schedd log file shows the following entries:
 
05/30/13 13:29:42 (pid:1336) Calling Handler <SecManStartCommand::WaitForSocketCallback RESCHEDULE> (3)
05/30/13 13:29:42 (pid:1336) Return from Handler <SecManStartCommand::WaitForSocketCallback RESCHEDULE> 0.0000s
05/30/13 13:29:44 (pid:1336) Job 53.0 aborted: The job attribute PeriodicRemove expression '( JobStatus == 4 && JobWantsFileTransfer == false && JobWantsPostprocessing == false && JobWantsDeletion == true ) || ( JobStatus == 4 || JobStatus == 5 ) && ( CurrentTime - EnteredCurrentStatus > 3600 * 24 * 10 )' evaluated to TRUE
05/30/13 13:29:54 (pid:1500) Locale: English_United States.1252
05/30/13 13:29:54 (pid:1500) Setting maximum accepts per cycle 8.
05/30/13 13:29:54 (pid:1500) ******************************************************
05/30/13 13:29:54 (pid:1500) ** condor_schedd.exe (CONDOR_SCHEDD) STARTING UP
05/30/13 13:29:54 (pid:1500) ** D:\ResourceManagementSystem\condor\bin\condor_schedd.exe
05/30/13 13:29:54 (pid:1500) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/30/13 13:29:54 (pid:1500) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/30/13 13:29:54 (pid:1500) ** $CondorVersion: 7.8.7 Apr 29 2013 $
05/30/13 13:29:54 (pid:1500) ** $CondorPlatform: X86-WINDOWS_5.1 $
05/30/13 13:29:54 (pid:1500) ** PID = 1500
05/30/13 13:29:54 (pid:1500) ** Log last touched 5/30 12:29:44
05/30/13 13:29:54 (pid:1500) ******************************************************
05/30/13 13:29:54 (pid:1500) Using config source: D:\ResourceManagementSystem/condor/etc/condor_config
05/30/13 13:29:54 (pid:1500) Using local config sources:
05/30/13 13:29:54 (pid:1500)    D:\ResourceManagementSystem/condor/etc/condor_config.local
05/30/13 13:29:54 (pid:1500) SharedPortEndpoint: listener already created.
05/30/13 13:29:54 (pid:1500) DaemonCore: command socket at <10.10.0.93:9619?sock=1720_2c2e_6>
05/30/13 13:29:54 (pid:1500) DaemonCore: private command socket at <10.10.0.93:9619?sock=1720_2c2e_6>
05/30/13 13:29:54 (pid:1500) Setting maximum accepts per cycle 8.
05/30/13 13:29:54 (pid:1500) History file rotation is enabled.
05/30/13 13:29:54 (pid:1500)   Maximum history file size is: 20971520 bytes
05/30/13 13:29:54 (pid:1500)   Number of rotated history files is: 2
05/30/13 13:29:55 (pid:1500) About to rotate ClassAd log D:\ResourceManagementSystem/condor/spool/job_queue.log
05/30/13 13:29:55 (pid:1500) Job 53.0 has no Owner attribute.  Removing....
05/30/13 13:29:55 (pid:1500) 56.0: JobLeaseDuration remaining: 1045
05/30/13 13:29:55 (pid:1500) 26.0: JobLeaseDuration remaining: 1085
05/30/13 13:29:55 (pid:1500) 25.0: JobLeaseDuration remaining: 963
05/30/13 13:36:25 (pid:1500) condor_read(): timeout reading 5 bytes from daemon at <10.10.0.93:9619>.
05/30/13 13:36:25 (pid:1500) IO: Failed to read packet header
05/30/13 13:36:25 (pid:1500) SECMAN: no classad from server, failing
05/30/13 13:36:25 (pid:1500) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.10.0.93:9619> (try 1 of 3): SECMAN:2007:Failed to end classad message.
05/30/13 13:42:55 (pid:1500) condor_read(): timeout reading 5 bytes from daemon at <10.10.0.93:9619>.
05/30/13 13:42:55 (pid:1500) IO: Failed to read packet header
05/30/13 13:42:55 (pid:1500) SECMAN: no classad from server, failing
05/30/13 13:42:55 (pid:1500) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.10.0.93:9619> (try 2 of 3): SECMAN:2007:Failed to end classad message.|SECMAN:2007:Failed to end classad message.
05/30/13 13:42:55 (pid:1500) ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
05/30/13 13:42:55 (pid:1500) ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <10.10.0.93:9619?sock=1720_2c2e>" at line 9128 in file z:\home\felixwolfheimer\drm-development\trunk\condor\condor-7.8.7\src\condor_daemon_core.v6\daemon_core.cpp
05/30/13 13:42:55 (pid:1500) Cron: Killing all jobs
05/30/13 13:42:55 (pid:1500) CronJobList: Deleting all jobs
05/30/13 13:42:55 (pid:1500) Cron: Killing all jobs
05/30/13 13:42:55 (pid:1500) CronJobList: Deleting all jobs
....
Here it terminates and the master tries to restart it again (with the same effect).
 
Any idea/advice is welcome.