--- Begin Message ---
- Date: Tue, 18 Mar 2008 18:40:42 +0530
- From: JohnsonKoilraj <johnson.raj@xxxxxxxxx>
- Subject: Re: [Condor-users] VMGAHP_ERR_CRITICAL
Hi Yoon,
Thanks, it's working. Thank you for your explanation.
Now, when I submit a job, it says:
1. "1 match but reject the job for unknown reasons"
Where can I find out the reasons? (No VMGAHPLOG file was created.)
I can't find them in the daemon log files either.
2. "1 are rejected by your job's requirements"
When I run condor_q -analyze, it reports the statement above and
shows some of the expressions I gave in the job description
file. How can I trace which requirement is not correct?
This is the setup I need:
1. Zeus (central manager, submitter), username: idealgrid
2. Pluto (executor), username: idealgrid
Before making this setup, I tried to start a VM on Pluto from Pluto
and also from Zeus. It shows Error No 2. (I am using the same job file
that I sent to you.)
Until now I used the user Johnson to submit jobs, but I have now changed
the user to idealgrid. VMGAHPLOG.idealgrid shows the same
VMGAHP_ERR_CRITICAL. This happened when I tried to submit a job from
Zeus to Zeus. (I have attached the file.)
On Mon, 2008-03-17 at 12:22 -0500, Jaeyoung Yoon wrote:
> Hi Johnson,
>
> Let me clarify your problem in your system environment. As you said,
> you have two machines (Zeus, Pluto) for Condor.
>
> 1. First of all, because your job submit description file has
> 'Requirements = (Machine == "zeus.pesgrid.wipro.com")', your VM job
> can be executed only on Zeus.
>
> 2. When you try to submit a VM job from Zeus, the job should be
> assigned to Zeus. And I think you should have NO problem.
>
> 3. When you try to submit a VM job from Pluto, the job should also be
> assigned to Zeus due to your job requirements. And I think this is
> the case where you have the VMGAHP problem.
>
> Here are my observations for your case.
>
> In your environment, your Condor daemons on Zeus seem to run as root
> with "CONDOR_IDS=daemon,daemon". So ordinary Condor jobs (Vanilla,
> Standard, Java) from another machine (Pluto) will run as "Nobody" or
> as the same UID as on the submit machine.
>
> VMware requires that a user starting a virtual machine have a
> writable working directory. Your problem happened because UID=2
> (daemon) doesn't have a writable working directory, just as "Nobody"
> doesn't. Unlike ordinary Condor jobs, VM jobs don't use "Nobody" when
> the UID on the submit machine doesn't exist on the execute machine.
> Instead of "Nobody", VM jobs try to use the UID of the Condor daemon,
> generally "condor".
>
> From the VMGAHP log files you sent, you can see what happened on
> Zeus.
>
> When you submit a VM job from Zeus to Zeus, the VMGAHPLOG.Johnson log
> says that your VM job successfully ran as "UID=Johnson".
> But when you submit a VM job from Pluto to Zeus, VMGAHPLOG.daemon says
> that your VM job tried to run as "UID=daemon" and failed.
>
> So here is a solution for you.
>
> If you run Condor as root and have specified CONDOR_IDS=daemon,daemon,
> please add the following configuration parameter to the Condor
> configuration file on Zeus.
> VM_UNIV_NOBODY_USER = "login name of a user who has home directory"
>
> With the above parameter, VM jobs from Pluto will use the UID
> specified in "VM_UNIV_NOBODY_USER".
>
> In section 3.3.26 of the Condor manual, you can see the configuration
> parameters for the VM universe.
>
> If you have questions, please let me know.
>
> Best,
>
> -Jaeyoung
>
>
> On Mon, Mar 17, 2008 at 8:01 AM, JohnsonKoilraj
> <johnson.raj@xxxxxxxxx> wrote:
> Hi Yoon,
>
> How are you.
> Here is the scenario. I have 2 systems in my Condor pool.
> 1. Zeus (central manager, submitter, executor), username: Johnson
>
> 2. Pluto (submitter, executor), username: condor (the user who
> submits jobs)
>
> Now, I can start a VM on Zeus from Zeus.
> When I try to start a VM on Pluto from Zeus, no match is found.
>
> When I try to start a VM on Zeus from Pluto, the error occurs.
>
> I am using Condor 7.0.1 and VMware Server 1.0.4.
>
> 1. I have attached the job description file (firstvm.sh).
>
> 2. I have attached VMGAHPLOG.daemon. (I think Condor updated that
> file when I submitted the job from Pluto (condor) to Zeus.)
>
> 3. I have attached VMGAHPLOG.Johnson. (This file was updated when the
> VM was started on Zeus from Zeus (Johnson).)
>
> 4. I have attached the log file created by the job description file.
>
> Thank you for your response
>
>
>
>
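The user-selection rule described in the quoted reply can be sketched in a few lines of Python. This is only a model of that description, not Condor's actual code: the function name pick_vm_job_user, its parameters, and the set of accounts assumed to exist on Zeus are all hypothetical and chosen for illustration.

```python
def pick_vm_job_user(submit_user, execute_machine_users,
                     condor_daemon_user="condor", vm_univ_nobody_user=None):
    """Pick the account a VM universe job would run as, per the quoted reply.

    Hypothetical helper: it models only the reply's description and is not
    Condor's actual implementation.
    """
    # The submitting user also exists on the execute machine: run as that user.
    if submit_user in execute_machine_users:
        return submit_user
    # VM jobs do not fall back to "nobody". If the administrator has set
    # VM_UNIV_NOBODY_USER, that login name is used instead.
    if vm_univ_nobody_user:
        return vm_univ_nobody_user
    # With nothing configured, the UID of the Condor daemons is used.
    return condor_daemon_user


# Assumed accounts on Zeus, for illustration only.
zeus_users = {"Johnson", "daemon"}

# Zeus to Zeus: the submitting user exists locally.
zeus_to_zeus = pick_vm_job_user("Johnson", zeus_users, condor_daemon_user="daemon")

# Pluto to Zeus: "condor" is assumed not to exist on Zeus, so the daemon UID is used.
pluto_to_zeus = pick_vm_job_user("condor", zeus_users, condor_daemon_user="daemon")

# Pluto to Zeus with VM_UNIV_NOBODY_USER set to a user who has a home directory.
pluto_fixed = pick_vm_job_user("condor", zeus_users, condor_daemon_user="daemon",
                               vm_univ_nobody_user="Johnson")

print(zeus_to_zeus, pluto_to_zeus, pluto_fixed)
```

In this model the second case lands on "daemon", the account that has no writable working directory, which is the failure the reply describes.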
3/18 18:25:23 ******************************************************
3/18 18:25:23 ** condor_vm-gahp (CONDOR_VM_GAHP) STARTING UP
3/18 18:25:23 ** /opt/condor-7.0.1/sbin/condor_vm-gahp
3/18 18:25:23 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
3/18 18:25:23 ** $CondorPlatform: I386-LINUX_RHEL5 $
3/18 18:25:23 ** PID = 3584
3/18 18:25:23 ** Log last touched time unavailable (No such file or directory)
3/18 18:25:23 ******************************************************
3/18 18:25:23 Using config source: /opt/condor-7.0.1/etc/condor_config
3/18 18:25:23 Using local config sources:
3/18 18:25:23 /opt/condor-7.0.1/local.zeus/condor_config.local
3/18 18:25:23 DaemonCore: Command Socket at <10.201.40.155:47380>
3/18 18:25:24 Initialized the following authorization table:
3/18 18:25:24 host 10.201.40.155: user *: WRITE,NEGOTIATOR,ADMINISTRATOR,OWNER,DAEMON,ADVERTISE_STARTD,ADVERTISE_SCHEDD,ADVERTISE_MASTER
3/18 18:25:24 Will use UDP to update collector zeus.pesgrid.wipro.com <10.201.40.155:9618>
3/18 18:25:24 VMGAHP[3584]: VM-GAHP initialized with run-mode 1
3/18 18:25:24 VMGAHP[3584]: Initial UID/GUID=49527/49527, EUID/EGUID=49527/49527, Condor UID/GID=49527,49527
3/18 18:25:24 VMGAHP[3584]: Initialize Uids: caller=idealgrid, job user=idealgrid
3/18 18:25:24 VMGAHP[3584]: VM_HARDWARE_VT is undefined, using default value of False
3/18 18:25:24 VMGAHP[3584]: Worker Env = VMGAHP_WORKING_DIR=/opt/condor-7.0.1/local.zeus/execute/dir_3573 VMGAHP_USER_GID=49527 CONDOR_IDS=49527.49527 VMGAHP_VMTYPE=vmware VMGAHP_USER_UID=49527 VMGAHP_CONFIG=/opt/condor-7.0.1/etc/condor_vmgahp_config.vmware
3/18 18:25:24 VMGAHP[3584]: Starting worker : /opt/condor-7.0.1/sbin/condor_vm-gahp -f -t -M 2
3/18 18:25:24 Create_Process: using fast clone() to create child process.
3/18 18:25:24 VMGAHP[3584]: Worker pid=3585
3/18 18:25:24 VMGAHP[3584]: Command: COMMANDS
3/18 18:25:24 Getting monitoring info for pid 3584
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ******************************************************
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** condor_vm-gahp (CONDOR_VM_GAHP) STARTING UP
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** /opt/condor-7.0.1/sbin/condor_vm-gahp
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** $CondorPlatform: I386-LINUX_RHEL5 $
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** PID = 3585
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** Log last touched time unavailable (Success)
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ******************************************************
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 Using config source: /opt/condor-7.0.1/etc/condor_config
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 Using local config sources:
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 /opt/condor-7.0.1/local.zeus/condor_config.local
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 DaemonCore: Command Socket at <10.201.40.155:34018>
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 Initialized the following authorization table:
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 host 10.201.40.155: user *: WRITE,NEGOTIATOR,ADMINISTRATOR,OWNER,DAEMON,ADVERTISE_STARTD,ADVERTISE_SCHEDD,ADVERTISE_MASTER
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 Will use UDP to update collector zeus.pesgrid.wipro.com <10.201.40.155:9618>
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: VM-GAHP initialized with run-mode 2
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: Initial UID/GUID=49527/49527, EUID/EGUID=49527/49527, Condor UID/GID=49527,49527
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: Initialize Uids: caller=idealgrid, job user=idealgrid
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: VM_HARDWARE_VT is undefined, using default value of False
3/18 18:25:25 VMGAHP[3584]: Command: SUPPORT_VMS
3/18 18:25:25 DaemonCore: in SendAliveToParent()
3/18 18:25:25 DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3573 is alive.
3/18 18:25:45 condor_read(): timeout reading 5 bytes from <10.201.40.155:33190>.
3/18 18:25:45 IO: Failed to read packet header
3/18 18:25:45 Failed to read ClassAd size.
3/18 18:25:45 DaemonCore: Leaving SendAliveToParent() - success
3/18 18:25:45 VMGAHP[3584]: Command: ASYNC_MODE_ON
3/18 18:25:45 DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3573 is alive.
3/18 18:25:45 VMGAHP[3584]: Command: CLASSAD
3/18 18:25:45 DaemonCore: Command received via UDP from host <10.201.40.155:32836>, access level IMMEDIATE_FAMILY
3/18 18:25:45 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
3/18 18:25:46 VMGAHP[3584]: Sending Job ClassAd to worker
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Command: CLASSAD
3/18 18:25:48 VMGAHP[3584]: Command: CONDOR_VM_START
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Command: CONDOR_VM_START
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: USE_SCRIPT_TO_CREATE_CONFIG is undefined, using default value of False
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Start
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Snapshot
3/18 18:25:49 VMGAHP[3584]: Command: RESULTS
3/18 18:25:50 VMGAHP[3584]: Worker[3585]: register(/opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx) = 1
3/18 18:25:52 VMGAHP[3584]: Command: RESULTS
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: /usr/bin/vmware-cmd: Could not connect to VM /opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: (VMControl error -14: Unexpected response from vmware-authd: The process exited with an error:
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: End of error message)
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: (ERROR) Can't create vm with /opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Failed to execute my_system: perl /opt/condor-7.0.1/sbin/condor_vm_vmware.pl start /opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx /opt/condor-7.0.1/local.zeus/execute/dir_3573/vmware_status.condor
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Unregister
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: unregister(/opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx) = 1
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Result "2 1 VMGAHP_ERR_CRITICAL"
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Shutdown
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: executeStart fail!
3/18 18:25:55 VMGAHP[3584]: Command: RESULTS
3/18 18:25:57 VMGAHP[3584]: Command: QUIT
3/18 18:25:57 VMGAHP[3584]: Started timer to call quitFast in 30 seconds
3/18 18:25:57 VMGAHP[3584]: Worker[3585]: Command: QUIT
3/18 18:25:59 VMGAHP[3584]: EOF reached on DaemonCore pipe 65539
3/18 18:25:59 VMGAHP[3584]: VM GAHP Worker result buffer closed, exiting...
3/18 18:25:59 VMGAHP[3584]: Inside VMwareType::killVMFast
3/18 18:25:59 VMGAHP[3584]: killVMFast is called
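For a log of this length it can help to pull out only the failure lines. The following is a minimal Python sketch, assuming the timestamp layout seen in the excerpt above; the marker list and the function name find_vmgahp_errors are illustrative choices, not part of Condor.

```python
import re

# Markers taken from the log excerpt above; the list is illustrative, not exhaustive.
ERROR_MARKERS = ("VMGAHP_ERR", "(ERROR)", "VMControl error", "Failed to execute")

# Leading "M/D HH:MM:SS " timestamp as it appears in the excerpt.
TIMESTAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def find_vmgahp_errors(lines):
    """Return (timestamp, message) pairs for lines that contain an error marker."""
    hits = []
    for line in lines:
        match = TIMESTAMP.match(line.rstrip("\n"))
        if not match:
            continue  # skip lines that do not start with a timestamp
        timestamp, message = match.groups()
        if any(marker in message for marker in ERROR_MARKERS):
            hits.append((timestamp, message))
    return hits


# Three lines copied from the log above.
sample = [
    "3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Start",
    "3/18 18:25:54 VMGAHP[3584]: Worker[3585]: (VMControl error -14: "
    "Unexpected response from vmware-authd: The process exited with an error:",
    '3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Result "2 1 VMGAHP_ERR_CRITICAL"',
]

for timestamp, message in find_vmgahp_errors(sample):
    print(timestamp, message)
```

To scan a whole file, pass an open file object, for example find_vmgahp_errors(open(path)), where path is wherever the VMGAHPLOG file lives on your system.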
--- End Message ---