[Condor-users] Checkpoint server installation problem.
- Date: Wed, 27 Jan 2010 02:59:23 +0900
- From: Genie Jhang <geniejhang@xxxxxxxxxxx>
- Subject: [Condor-users] Checkpoint server installation problem.
Hi. Genie again.
Sorry for asking questions day after day.
This time it's about the checkpoint server.
I'm stuck on the passage below from the manual. I don't understand what its second line means.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Described in section 3.3.9. To have the checkpoint server managed by the condor_master, the DAEMON_LIST variable's value must list both MASTER and CKPT_SERVER.
Also add STARTD to allow jobs to run on the checkpoint server machine. Similarly, add SCHEDD to permit the submission of jobs from the checkpoint server machine.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
I added the lines below to the condor_config file on all our machines:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
DAEMON_LIST = MASTER, STARTD, SCHEDD, CKPT_SERVER
CKPT_SERVER = $(SBIN)/condor_ckpt_server
USE_CKPT_SERVER = True
CKPT_SERVER_HOST = 192.168.0.109
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
and the lines below to the condor_config.local file on 192.168.0.109:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CKPT_SERVER_DIR = /data/ckpt_server
CKPT_SERVER_LOG = $(LOG)/CkptServerLog
MAX_CKPT_SERVER_LOG = 1000000
CKPT_SERVER_DEBUG = D_ALWAYS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Then, my MasterLog file says
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
01/25 17:47:50 Started process "/condor/sbin/condor_ckpt_server", pid and pgroup = 9895
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
and the CkptServerLog file says
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
01/25 17:47:50 ******************************************************
01/25 17:47:50 ** condor_ckpt_server (CONDOR_CKPT_SERVER) STARTING UP
01/25 17:47:50 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
01/25 17:47:50 ** $CondorPlatform: I386-LINUX_RHEL3 $
01/25 17:47:50 ** PID = 9895
01/25 17:47:50 ******************************************************
01/25 17:47:50 CKPT_SERVER running in directory /data/ckpt_server
01/25 17:47:50 Server Initializing
01/25 17:47:50 Server:
01/25 17:47:50 pheko09
01/25 17:47:50 Store Request Port: 5651
01/25 17:47:50 Store Request Socket Descriptor: 3
01/25 17:47:50 Store Request Buffer Size: 87380
01/25 17:47:50 Restore Request Port: 5652
01/25 17:47:50 Restore Request Socket Descriptor: 4
01/25 17:47:50 Restore Request Buffer Size: 87380
01/25 17:47:50 Service Request Port: 5653
01/25 17:47:50 Service Request Socket Descriptor: 5
01/25 17:47:50 Service Request Buffer Size: 87380
01/25 17:47:50 Signal handlers installed: SIGCHLD
01/25 17:47:50 SIGUSR1
01/25 17:47:50 SIGUSR2
01/25 17:47:50 SIGALRM
01/25 17:47:50 Total allowable transfers: 50
01/25 17:47:50 Number of storing transfers: 50
01/25 17:47:50 Number of restoring transfers: 50
01/25 17:47:50 Sending initial ckpt server ad to collector
01/25 17:47:50 ----------------------------------------------------
01/25 17:47:50 Begin removing stale checkpoint files.
01/25 17:47:50 Done removing stale checkpoint files.
01/25 17:47:50 Next stale checkpoint file check in 86400 seconds.
01/25 17:52:50 Sending ckpt server ad to collector...
01/25 17:57:50 Sending ckpt server ad to collector...
01/25 18:02:50 Sending ckpt server ad to collector...
01/25 18:07:50 Sending ckpt server ad to collector...
01/25 18:12:50 Sending ckpt server ad to collector...
01/25 18:17:50 Sending ckpt server ad to collector...
01/25 18:22:50 Sending ckpt server ad to collector...
01/25 18:27:50 Sending ckpt server ad to collector...
01/25 18:32:50 Sending ckpt server ad to collector...
01/25 18:37:50 Sending ckpt server ad to collector...
01/25 18:42:50 Sending ckpt server ad to collector...
01/25 18:47:50 Sending ckpt server ad to collector...
01/25 18:52:50 Sending ckpt server ad to collector...
01/25 18:57:50 Sending ckpt server ad to collector...
01/25 19:02:50 Sending ckpt server ad to collector...
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
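The log above reports three listening ports (5651 for store, 5652 for restore, 5653 for service requests). A minimal sketch of one way to confirm from another machine in the pool that those ports are reachable, assuming they are plain TCP listening ports and that 192.168.0.109 is the address set in CKPT_SERVER_HOST. The helper name `port_is_open` is hypothetical and not part of Condor.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Host taken from CKPT_SERVER_HOST, ports taken from the CkptServerLog excerpt.
    ckpt_server_host = "192.168.0.109"
    ckpt_ports = {"store": 5651, "restore": 5652, "service": 5653}
    for name, port in ckpt_ports.items():
        state = "open" if port_is_open(ckpt_server_host, port) else "unreachable"
        print(f"{name} request port {port}: {state}")
```

An "unreachable" result would point at a firewall or routing problem between the machines rather than at the checkpoint server configuration itself.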
But there are no files in the /data/ckpt_server directory, even though the condor user owns it and its permissions are 755.
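The ownership, mode, and contents of that directory can be read back in one step. A rough sketch, assuming a Unix machine and the /data/ckpt_server path from CKPT_SERVER_DIR. The function name `describe_dir` is hypothetical.

```python
import os
import pwd
import stat


def describe_dir(path):
    """Return (owner_name, octal_mode, entry_count) for a directory."""
    info = os.stat(path)
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        # No passwd entry for this uid; fall back to the numeric id.
        owner = str(info.st_uid)
    mode = format(stat.S_IMODE(info.st_mode), "03o")
    entries = len(os.listdir(path))
    return owner, mode, entries


if __name__ == "__main__":
    # Path taken from CKPT_SERVER_DIR in condor_config.local.
    owner, mode, entries = describe_dir("/data/ckpt_server")
    print(f"owner={owner} mode={mode} entries={entries}")
```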
What did I do wrong?
Thanks for reading this long mail.