Re: [Condor-users] Checkpoint server installation problem.
- Date: Wed, 27 Jan 2010 10:13:48 -0500
- From: Preston Smith <psmith@xxxxxxxxxx>
- Subject: Re: [Condor-users] Checkpoint server installation problem.
I think you're fine - the checkpoint server is ready to do its thing,
sending updates to the collector, just waiting for checkpoints to
handle.
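
As a rough sketch of what I'm reading out of your CkptServerLog: the function name and the patterns below are my own, copied from the 7.4.1 log lines you pasted, and other Condor versions may word them differently.

```python
import re

# Patterns taken from the CkptServerLog quoted in this thread (Condor 7.4.1).
# They are assumptions about the log wording, not a documented format.
STARTUP = "STARTING UP"
AD_SENT = re.compile(r"Sending (initial )?ckpt server ad to collector")
PORT = re.compile(r"(Store|Restore|Service) Request Port:\s+(\d+)")


def summarize_ckpt_log(lines):
    """Summarize an iterable of CkptServerLog lines.

    Returns a dict recording whether the startup banner was seen, how many
    ads were sent to the collector, and which request ports were opened.
    """
    summary = {"started": False, "ads_sent": 0, "ports": {}}
    for line in lines:
        if STARTUP in line:
            summary["started"] = True
        if AD_SENT.search(line):
            summary["ads_sent"] += 1
        match = PORT.search(line)
        if match:
            summary["ports"][match.group(1)] = int(match.group(2))
    return summary


if __name__ == "__main__":
    sample = [
        "01/25 17:47:50 ** condor_ckpt_server (CONDOR_CKPT_SERVER) STARTING UP",
        "01/25 17:47:50 Store Request Port: 5651",
        "01/25 17:47:50 Restore Request Port: 5652",
        "01/25 17:47:50 Service Request Port: 5653",
        "01/25 17:47:50 Sending initial ckpt server ad to collector",
        "01/25 17:52:50 Sending ckpt server ad to collector...",
    ]
    print(summarize_ckpt_log(sample))
```

A started banner, three open ports, and a growing ad count is what an idle but healthy server looks like; none of those lines say anything about stored checkpoints.
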
Have you tried running a standard universe job, and forcing it to
checkpoint with condor_checkpoint?
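
Something along these lines would do for a test. This is only a sketch: the program name ckpt_test and the output file names are placeholders I made up, and the executable has to be relinked with condor_compile before it can run in the standard universe.

```python
def build_submit_description(executable, tag="ckpt_test"):
    """Return the text of a minimal standard universe submit description.

    `executable` is assumed to be already relinked with condor_compile.
    `tag` is a hypothetical base name for the output, error, and log files.
    """
    lines = [
        "universe   = standard",
        f"executable = {executable}",
        f"output     = {tag}.out",
        f"error      = {tag}.err",
        f"log        = {tag}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Save the printed text to a file and pass that file to condor_submit.
    # Once the job is running, condor_checkpoint with the execute machine's
    # hostname asks the startd there to checkpoint it.
    print(build_submit_description("ckpt_test"), end="")
```

If the checkpoint goes through, files should show up under your CKPT_SERVER_DIR and the CkptServerLog should record a store request.
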
On Tue, Jan 26, 2010 at 12:59 PM, Genie Jhang <geniejhang@xxxxxxxxxxx> wrote:
> Hi. Genie again.
>
> I'm sorry to keep asking questions day after day.
>
> This time it's about the checkpoint server.
>
> I read through the page,
> http://www.cs.wisc.edu/condor/manual/v7.4/3_8Checkpoint_Server.html, to
> learn how to install it.
>
> And I'm stuck with the line below. I don't know what the second line means.
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Described in section 3.3.9. To have the checkpoint server managed by the
> condor_master, the DAEMON_LIST variable's value must list both MASTER and
> CKPT_SERVER.
> Also add STARTD to allow jobs to run on the checkpoint server machine.
> Similarly, add SCHEDD to permit the submission of jobs from the checkpoint
> server machine.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> I added the lines below to the condor_config file on all our machines.
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> DAEMON_LIST = MASTER, STARTD, SCHEDD, CKPT_SERVER
> CKPT_SERVER = $(SBIN)/condor_ckpt_server
> USE_CKPT_SERVER = True
> CKPT_SERVER_HOST = 192.168.0.109
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> and these lines to the file condor_config.local on 192.168.0.109:
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> CKPT_SERVER_DIR = /data/ckpt_server
> CKPT_SERVER_LOG = $(LOG)/CkptServerLog
> MAX_CKPT_SERVER_LOG = 1000000
> CKPT_SERVER_DEBUG = D_ALWAYS
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Then, my MasterLog file says
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 01/25 17:47:50 Started process "/condor/sbin/condor_ckpt_server", pid and pgroup = 9895
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> and CkptServerLog file says
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 01/25 17:47:50 ******************************************************
> 01/25 17:47:50 ** condor_ckpt_server (CONDOR_CKPT_SERVER) STARTING UP
> 01/25 17:47:50 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
> 01/25 17:47:50 ** $CondorPlatform: I386-LINUX_RHEL3 $
> 01/25 17:47:50 ** PID = 9895
> 01/25 17:47:50 ******************************************************
> 01/25 17:47:50 CKPT_SERVER running in directory /data/ckpt_server
> 01/25 17:47:50 Server Initializing
> 01/25 17:47:50 Server:
> 01/25 17:47:50 pheko09
> 01/25 17:47:50 Store Request Port: 5651
> 01/25 17:47:50 Store Request Socket Descriptor: 3
> 01/25 17:47:50 Store Request Buffer Size: 87380
> 01/25 17:47:50 Restore Request Port: 5652
> 01/25 17:47:50 Restore Request Socket Descriptor: 4
> 01/25 17:47:50 Restore Request Buffer Size: 87380
> 01/25 17:47:50 Service Request Port: 5653
> 01/25 17:47:50 Service Request Socket Descriptor: 5
> 01/25 17:47:50 Service Request Buffer Size: 87380
> 01/25 17:47:50 Signal handlers installed: SIGCHLD
> 01/25 17:47:50 SIGUSR1
> 01/25 17:47:50 SIGUSR2
> 01/25 17:47:50 SIGALRM
> 01/25 17:47:50 Total allowable transfers: 50
> 01/25 17:47:50 Number of storing transfers: 50
> 01/25 17:47:50 Number of restoring transfers: 50
> 01/25 17:47:50 Sending initial ckpt server ad to collector
> 01/25 17:47:50 ----------------------------------------------------
> 01/25 17:47:50 Begin removing stale checkpoint files.
> 01/25 17:47:50 Done removing stale checkpoint files.
> 01/25 17:47:50 Next stale checkpoint file check in 86400 seconds.
> 01/25 17:52:50 Sending ckpt server ad to collector...
> 01/25 17:57:50 Sending ckpt server ad to collector...
> 01/25 18:02:50 Sending ckpt server ad to collector...
> 01/25 18:07:50 Sending ckpt server ad to collector...
> 01/25 18:12:50 Sending ckpt server ad to collector...
> 01/25 18:17:50 Sending ckpt server ad to collector...
> 01/25 18:22:50 Sending ckpt server ad to collector...
> 01/25 18:27:50 Sending ckpt server ad to collector...
> 01/25 18:32:50 Sending ckpt server ad to collector...
> 01/25 18:37:50 Sending ckpt server ad to collector...
> 01/25 18:42:50 Sending ckpt server ad to collector...
> 01/25 18:47:50 Sending ckpt server ad to collector...
> 01/25 18:52:50 Sending ckpt server ad to collector...
> 01/25 18:57:50 Sending ckpt server ad to collector...
> 01/25 19:02:50 Sending ckpt server ad to collector...
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> And there's no file in the /data/ckpt_server directory, even though the
> condor user owns it and its permissions are 755.
>
> What did I do wrong?
>
> Thanks for reading this long mail.
>
--
Preston M. Smith
psmith@xxxxxxxxxx
Sr. UNIX Systems Administrator
Rosen Center for Advanced Computing, Purdue University