Hi all,
I apologize for packing multiple issues into a single email. I have multiple jobs in a similar condition and am hoping for enough insight to break this down into separate issues.
I am working on an hveto job that runs on a week's worth of data as well as the current 24 hours. It is a DAG of three sequential jobs.
The memory required by each job is data dependent.
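For reference, the DAG is shaped roughly like this (a minimal sketch; the job and submit-file names are placeholders, not my actual ones):

```
# Three sequential jobs: each stage waits for the previous one.
JOB  stage1  stage1.sub
JOB  stage2  stage2.sub
JOB  stage3  stage3.sub
PARENT stage1 CHILD stage2
PARENT stage2 CHILD stage3
```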
Running 7 DAGs with RequestMemory=4G, 2 of them completed; the other 5 are hung in the Condor Running state. condor_ssh_to_job showed the Python executables in the interruptible-wait state (per ps) at about 14 hours of runtime. Now, at 20 hours, trying to ssh to a job puts it on hold with the message:
Job has gone over cgroup memory limit of 4096 megabytes. Peak usage: 4097 megabytes. Consider resubmitting with a higher request_memory.
The same thing happened when I used condor_vacate_job on one of them.
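For context, the memory request is the only relevant knob I am setting in the submit description; a minimal sketch (executable name is a placeholder, other lines elided):

```
# submit file fragment (illustrative)
executable     = run_hveto.sh   # placeholder name
request_memory = 6000           # interpreted as MB; was 4G originally
queue
```

The analysis summaries below are from condor_q -better-analyze, first with the original request and then with the raised one.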
1102.000: Run analysis summary ignoring user priority. Of 550 machines,
    550 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are able to run your job
WARNING: Be advised:
  No machines matched the job's constraints
After raising request_memory to 6000:
1102.000: Run analysis summary ignoring user priority. Of 550 machines,
      1 are rejected by your job's requirements
     14 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    535 are able to run your job