[condor-users] why are jobs always evicted on remote machines?
- Date: Tue, 28 Oct 2003 15:19:25 +0100 (CET)
- From: habib mazouni <mmhforce@xxxxxxxx>
- Subject: [condor-users] why are jobs always evicted on remote machines?
Hello,
I have already sent three messages but unfortunately received no answer, so I will summarize my problem once again:
I have a 4-node Linux cluster running Condor. I have tried, without success, to run jobs on the remote nodes: the jobs are evicted there, and in the end all executions run locally on the submitting machine. I do not understand why the jobs cannot run on the remote machines.
********************************************************
my submit file has a simple structure (how I build and submit the job is shown right after the file):
********************************************************
universe = standard
Executable = /home/condor/test
initialdir = /home/condor
transfer_executable = TRUE
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Output = out.$(process)
Log = log.$(process)
Queue 15
*******************************************************
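For reference, the executable was relinked for the standard universe with condor_compile, and I submit the file in the usual way, roughly like this (the source and submit file names here are just examples):

[condor@node1 condor]$ condor_compile gcc -o test test.c
[condor@node1 condor]$ condor_submit test.sub
[condor@node1 condor]$ condor_q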
here is the relevant part of the log file log.1:
*******************************************************
[condor@node1 condor]$ cat log.1
000 (133.001.000) 10/28 14:17:17 Job submitted from
host: <130.98.172.55:58106>...
001 (133.001.000) 10/28 14:17:49 Job executing on
host: <130.98.172.56:37429>
...
004 (133.001.000) 10/28 14:17:50 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
224 - Run Bytes Sent By Job
3587556 - Run Bytes Received By Job
...
001 (133.001.000) 10/28 14:18:07 Job executing on
host: <130.98.172.56:37429>
...
004 (133.001.000) 10/28 14:18:07 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
224 - Run Bytes Sent By Job
3587556 - Run Bytes Received By Job
...
001 (133.001.000) 10/28 14:18:11 Job executing on
host: <130.98.172.55:58105>
...
005 (133.001.000) 10/28 14:18:42 Job terminated.
(1) Normal termination (return value 13)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote
Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
893 - Run Bytes Sent By Job
3587999 - Run Bytes Received By Job
1341 - Total Bytes Sent By Job
10763111 - Total Bytes Received By Job
...
*******************************************************
concerning the job queue, I obtain the following when I run condor_q -analyze:
*******************************************************
[root@node1 bin]# ./condor_q -analyze
-- Submitter: node1.xtrem.der.edf.fr :
<130.98.172.55:58106> : node1.xtrem.der.edf.fr
ID OWNER SUBMITTED RUN_TIME ST
PRI SIZE CMD
---
133.002: Request is being serviced
---
133.003: Run analysis summary. Of 3 machines,
0 are rejected by your job's requirements
0 reject your job because of their own
requirements
1 match, but are serving users with a better
priority in the pool
2 match, but prefer another specific job despite
its worse user-priority
0 match, but cannot currently preempt their
existing job
0 are available to run your job
Last successful match: Tue Oct 28 14:18:57 2003
---
133.005: Run analysis summary. Of 3 machines,
0 are rejected by your job's requirements
0 reject your job because of their own
requirements
1 match, but are serving users with a better
priority in the pool
2 match, but prefer another specific job despite
its worse user-priority
0 match, but cannot currently preempt their
existing job
0 are available to run your job
No successful match recorded.
Last failed match: Tue Oct 28 14:18:57 2003
Reason for last match failure: no match found
---
133.006: Run analysis summary. Of 3 machines,
0 are rejected by your job's requirements
0 reject your job because of their own
requirements
1 match, but are serving users with a better
priority in the pool
2 match, but prefer another specific job despite
its worse user-priority
0 match, but cannot currently preempt their
existing job
0 are available to run your job
---
133.007: Run analysis summary. Of 3 machines,
0 are rejected by your job's requirements
0 reject your job because of their own
requirements
1 match, but are serving users with a better
priority in the pool
2 match, but prefer another specific job despite
its worse user-priority
0 match, but cannot currently preempt their
existing job
0 are available to run your job
5 jobs; 4 idle, 1 running, 0 held
[root@node1 bin]# ./condor_q -analyze
-- Submitter: node1.xtrem.der.edf.fr :
<130.98.172.55:58106> : node1.xtrem.der.edf.fr
ID OWNER SUBMITTED RUN_TIME ST
PRI SIZE CMD
---
133.002: Request is being serviced
---
133.003: Request is being serviced
---
133.005: Request is being serviced
---
133.006: Run analysis summary. Of 3 machines,
0 are rejected by your job's requirements
0 reject your job because of their own
requirements
3 match, but are serving users with a better
priority in the pool
0 match, but prefer another specific job despite
its worse user-priority
0 match, but cannot currently preempt their
existing job
0 are available to run your job
No successful match recorded.
Last failed match: Tue Oct 28 14:19:17 2003
Reason for last match failure: no match found
---
133.007: Run analysis summary. Of 3 machines,
0 are rejected by your job's requirements
0 reject your job because of their own
requirements
3 match, but are serving users with a better
priority in the pool
0 match, but prefer another specific job despite
its worse user-priority
0 match, but cannot currently preempt their
existing job
0 are available to run your job
5 jobs; 2 idle, 3 running, 0 held
[root@node1 bin]# ./condor_q -analyze
-- Submitter: node1.xtrem.der.edf.fr :
<130.98.172.55:58106> : node1.xtrem.der.edf.fr
ID OWNER SUBMITTED RUN_TIME ST
PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
********************************************************
the relevant part of my SchedLog file:
********************************************************
10/28 14:17:17 DaemonCore: Command received via UDP
from host <130.98.172.55:33426>
10/28 14:17:17 DaemonCore: received command 421
(RESCHEDULE), calling handler (reschedule_negotiator)
10/28 14:17:17 Sent ad to central manager for
condor@xxxxxxxxxxxxxxxxxxxxxx
10/28 14:17:17 Called reschedule_negotiator()
10/28 14:17:35 Activity on stashed negotiator socket
10/28 14:17:35 Negotiating for owner:
condor@xxxxxxxxxxxxxxxxxxxxxx
10/28 14:17:35 Checking consistency running and
runnable jobs
10/28 14:17:35 Tables are consistent
10/28 14:17:35 Out of servers - 3 jobs matched, 5 jobs
idle, 1 jobs rejected
10/28 14:17:38 Started shadow for job 133.0 on
"<130.98.172.55:58105>", (shadow pid = 25865)
10/28 14:17:40 Started shadow for job 133.1 on
"<130.98.172.56:37429>", (shadow pid = 25869)
10/28 14:17:43 Started shadow for job 133.2 on
"<130.98.172.57:45074>", (shadow pid = 25871)
10/28 14:17:43 Sent ad to central manager for
condor@xxxxxxxxxxxxxxxxxxxxxx
10/28 14:17:50 Sent RELEASE_CLAIM to startd on
<130.98.172.56:37429>
10/28 14:17:50 Match record (<130.98.172.56:37429>,
133, 1) deleted
10/28 14:17:51 DaemonCore: Command received via TCP
from host <130.98.172.56:46324>
10/28 14:17:51 DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
10/28 14:17:51 Got VACATE_SERVICE from
<130.98.172.56:46324>
10/28 14:17:51 Sent RELEASE_CLAIM to startd on
<130.98.172.57:45074>
10/28 14:17:51 Match record (<130.98.172.57:45074>,
133, 2) deleted
10/28 14:17:52 DaemonCore: Command received via TCP
from host <130.98.172.57:48867>
10/28 14:17:52 DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
10/28 14:17:52 Got VACATE_SERVICE from
<130.98.172.57:48867>
10/28 14:17:56 Activity on stashed negotiator socket
10/28 14:17:56 Negotiating for owner:
condor@xxxxxxxxxxxxxxxxxxxxxx
10/28 14:17:56 Checking consistency running and
runnable jobs
10/28 14:17:56 Tables are consistent
10/28 14:17:56 Out of servers - 2 jobs matched, 5 jobs
idle, 1 jobs rejected
*******************************************************
the StarterLog file of node3, on which the jobs were evicted:
*******************************************************
[condor@node3 log]$ cat StarterLog
Now in new log file /home/condor/log/StarterLog
GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT
PERIODIC_CKPT ]
10/28 14:19:31 *FSM* Got asynchronous event "DIE"
10/28 14:19:31 *FSM* Executing transition function
"req_die"
10/28 14:19:31 *FSM* Transitioning to state
"TERMINATE"
10/28 14:19:31 *FSM* Executing state func
"terminate_all()" [ ]
10/28 14:19:31 *FSM* Transitioning to state
"SEND_STATUS_ALL"
10/28 14:19:31 *FSM* Executing state func
"dispose_all()" [ ]
10/28 14:19:31 *FSM* Reached state "END"
10/28 14:19:31 ********* STARTER terminating normally
**********
10/28 14:19:43 ********** STARTER starting up
***********
10/28 14:19:43 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/28 14:19:43 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/28 14:19:43
******************************************
10/28 14:19:43 Submitting machine is
"node1.xtrem.der.edf.fr"
10/28 14:19:43 EventHandler {
10/28 14:19:43 func = 0x80706d0
10/28 14:19:43 mask = SIGALRM SIGHUP SIGINT SIGUSR1
SIGUSR2 SIGCHLD SIGTSTP
10/28 14:19:43 }
10/28 14:19:43 Done setting resource limits
10/28 14:19:43 *FSM* Transitioning to state
"GET_PROC"
10/28 14:19:43 *FSM* Executing state func
"get_proc()" [ ]
10/28 14:19:43 Entering get_proc()
10/28 14:19:43 Entering get_job_info()
10/28 14:19:43 Startup Info:
10/28 14:19:43 Version Number: 1
10/28 14:19:43 Id: 133.5
10/28 14:19:43 JobClass: STANDARD
10/28 14:19:43 Uid: 504
10/28 14:19:43 Gid: 505
10/28 14:19:43 VirtPid: -1
10/28 14:19:43 SoftKillSignal: 20
10/28 14:19:43 Cmd: "/home/condor/test"
10/28 14:19:43 Args: ""
10/28 14:19:43 Env: ""
10/28 14:19:43 Iwd: "/home/condor"
10/28 14:19:43 Ckpt Wanted: TRUE
10/28 14:19:43 Is Restart: FALSE
10/28 14:19:43 Core Limit Valid: TRUE
10/28 14:19:43 Coredump Limit 0
10/28 14:19:43 User uid set to 99
10/28 14:19:43 User uid set to 99
10/28 14:19:43 User Process 133.5 {
10/28 14:19:43 cmd = /home/condor/test
10/28 14:19:43 args =
10/28 14:19:43 env =
10/28 14:19:43 local_dir = dir_13235
10/28 14:19:43 cur_ckpt =
dir_13235/condor_exec.133.5
10/28 14:19:43 core_name = dir_13235/core
10/28 14:19:43 uid = 99, gid = 99
10/28 14:19:43 v_pid = -1
10/28 14:19:43 pid = (NOT CURRENTLY EXECUTING)
10/28 14:19:43 exit_status_valid = FALSE
10/28 14:19:43 exit_status = (NEVER BEEN EXECUTED)
10/28 14:19:43 ckpt_wanted = TRUE
10/28 14:19:43 coredump_limit_exists = TRUE
10/28 14:19:43 coredump_limit = 0
10/28 14:19:43 soft_kill_sig = 20
10/28 14:19:43 job_class = STANDARD
10/28 14:19:43 state = NEW
10/28 14:19:43 new_ckpt_created = FALSE
10/28 14:19:43 ckpt_transferred = FALSE
10/28 14:19:43 core_created = FALSE
10/28 14:19:43 core_transferred = FALSE
10/28 14:19:43 exit_requested = FALSE
10/28 14:19:43 image_size = -1 blocks
10/28 14:19:43 user_time = 0
10/28 14:19:43 sys_time = 0
10/28 14:19:43 guaranteed_user_time = 0
10/28 14:19:43 guaranteed_sys_time = 0
10/28 14:19:43 }
10/28 14:19:43 *FSM* Transitioning to state
"GET_EXEC"
10/28 14:19:43 *FSM* Executing state func
"get_exec()" [ SUSPEND VACATE DIE ]
10/28 14:19:43 Entering get_exec()
10/28 14:19:43 Executable is located on submitting
host
10/28 14:19:43 Expanded executable name is
"/home/condor/spool/cluster133.ickpt.subproc0"
10/28 14:19:43 Going to try 3 attempts at getting the
inital executable
10/28 14:19:43 Entering get_file(
/home/condor/spool/cluster133.ickpt.subproc0,
dir_13235/condor_exec.133.5, 0755 )
10/28 14:19:44 Opened
"/home/condor/spool/cluster133.ickpt.subproc0" via
file stream
10/28 14:19:49 Get_file() transferred 3587233 bytes,
587500 bytes/second
10/28 14:19:49 Fetched orig ckpt file
"/home/condor/spool/cluster133.ickpt.subproc0" into
"dir_13235/condor_exec.133.5" with 1 attempt
10/28 14:19:50 Executable
'dir_13235/condor_exec.133.5' is linked with
"$CondorVersion: 6.4.7 Jan 26 2003 $" on a
"$CondorPlatform: INTEL-LINUX-GLIBC22 $"
10/28 14:19:50 *FSM* Executing transition function
"spawn_all"
10/28 14:19:50 Pipe built
10/28 14:19:50 New pipe_fds[14,1]
10/28 14:19:50 cmd_fd = 14
10/28 14:19:50 Calling execve(
"/home/condor/execute/dir_13235/condor_exec.133.5",
"condor_exec.133.5", "-_condor_cmd_fd", "14", 0,
"CONDOR_VM=vm1",
"CONDOR_SCRATCH_DIR=/home/condor/execute/dir_13235", 0
)
10/28 14:19:50 Started user job - PID = 13236
10/28 14:19:50 cmd_fp = 0x82b2d30
10/28 14:19:50 end
10/28 14:19:50 *FSM* Transitioning to state
"SUPERVISE"
10/28 14:19:50 *FSM* Executing state func
"supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM
DIE CHILD_EXIT PERIODIC_CKPT ]
10/28 14:19:50 *FSM* Got asynchronous event
"CHILD_EXIT"
10/28 14:19:50 *FSM* Executing transition function
"reaper"
10/28 14:19:50 Process 13236 exited with status 129
10/28 14:19:50 EXEC of user process failed, probably
insufficient swap
10/28 14:19:50 *FSM* Transitioning to state
"PROC_EXIT"
10/28 14:19:50 *FSM* Executing state func
"proc_exit()" [ DIE ]
10/28 14:19:50 *FSM* Executing transition function
"dispose_one"
10/28 14:19:50 Sending final status for process 133.5
10/28 14:19:50 STATUS encoded as CKPT, *NOT*
TRANSFERRED
10/28 14:19:50 User time = 0.000000 seconds
10/28 14:19:50 System time = 0.000000 seconds
10/28 14:19:50 Unlinked "dir_13235/condor_exec.133.5"
10/28 14:19:50 Can't unlink "dir_13235/core" - errno =
2
10/28 14:19:50 Removed directory "dir_13235"
10/28 14:19:50 *FSM* Transitioning to state
"SUPERVISE"
10/28 14:19:50 *FSM* Executing state func
"supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM
DIE CHILD_EXIT PERIODIC_CKPT ]
10/28 14:19:50 *FSM* Got asynchronous event "DIE"
10/28 14:19:50 *FSM* Executing transition function
"req_die"
10/28 14:19:50 *FSM* Transitioning to state
"TERMINATE"
10/28 14:19:50 *FSM* Executing state func
"terminate_all()" [ ]
10/28 14:19:50 *FSM* Transitioning to state
"SEND_STATUS_ALL"
10/28 14:19:50 *FSM* Executing state func
"dispose_all()" [ ]
10/28 14:19:50 *FSM* Reached state "END"
10/28 14:19:50 ********* STARTER terminating normally
**********
...
...
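The line "EXEC of user process failed, probably insufficient swap" in this StarterLog looks suspicious to me. My guess (and it is only a guess) is that the execute nodes have little or no swap space configured, and that the starter refuses to exec the job because of it. The first thing I would check on each execute node is the swap situation, for example:

[condor@node3 log]$ free -m          # is there any swap at all on the execute node?
[condor@node3 log]$ cat /proc/swaps  # list the active swap devices/files

If there really is no swap, I suppose adding some swap (or a swap file) on the execute machines is the remedy, but I would appreciate confirmation.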
*******************************************************
When I checked the log file of the collector, I noticed that the negotiation period is rather short. How can I change it? (A sketch of what I would try follows the log excerpt below.)
*******************************************************
10/28 15:01:38 ---------- Started Negotiation Cycle
----------
10/28 15:01:38 Phase 1: Obtaining ads from collector
...
10/28 15:01:38 Getting all public ads ...
10/28 15:01:38 Sorting 9 ads ...
10/28 15:01:38 Getting startd private ads ...
10/28 15:01:38 Got ads: 9 public and 3 private
10/28 15:01:38 Public ads include 0 submitter, 3
startd
10/28 15:01:38 Phase 2: Performing accounting ...
10/28 15:01:38 Phase 3: Sorting submitter ads by
priority ...
10/28 15:01:38 Phase 4.1: Negotiating with schedds
...
10/28 15:01:38 ---------- Finished Negotiation Cycle
----------
10/28 15:01:58 ---------- Started Negotiation Cycle
----------
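If I understand the manual correctly, the pause between negotiation cycles is controlled by NEGOTIATOR_INTERVAL (a value in seconds) in the condor_config on the central manager, followed by a condor_reconfig of the negotiator. A minimal sketch of what I would try, assuming this is indeed the right knob:

# lengthen the pause between negotiation cycles from ~20s to 5 minutes
NEGOTIATOR_INTERVAL = 300

Is that the correct way to change it, or is there something else I should tune?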
Any help would be appreciated!
MAZOUNI habib.