
[HTCondor-users] MPI issues on Windows



My MPI jobs were "completing" and giving me output, but was it the expected output? I hadn't a clue, but something seemed off. Fortunately, in the thread "Shadow exit with status 108", Jaime suggested comparing the ShadowLog with the StartLogs. Thank you!


I am running 23.0.1 on a small pool: a CM w/credD host, a dedicated AP host, 2 EP hosts.
MPI Job:   machine_count=12

What shall I make of these discrepancies? 

  1. On EP1 I get: New machine resource allocated
     on EP2 I get: New machine resource of type -1 allocated
       --- what is the significance of "type -1"?

  2. The first "Request to run..." includes the slot, but the others don't
    Request to run on slot1_2@ep1<...> was ACCEPTED vs 
    Request to run on <...> <...> was ACCEPTED 
        --- I assume the first is the "head" while the other 11 are "workers"

  3. Each slot shows "Got activate_claim request from shadow (ip-address)", but then, on the AP, ...
    ERROR in ParallelShadow::updateFromStarter: no Node defined in update ad, can't process!
         --- what's the story with update_ad?  I don't see such a file.
    followed by, for each slot: 
    on EP1 and EP2:  Called deactivate_claim_forcibly()
    on EP1 only:  Failed to open .job.ad file: 3
    on EP1 only:  Failed to write ToE tag to .job.ad file (0): No error
         --- why do EP1 and EP2 behave differently?

  4. on EP1: 
    condor_write(): Socket closed when trying to write 559 bytes to daemon at <EP1-ip-addr:9618>, fd is 1852
    Buf::write(): condor_write() failed
    SECMAN: failed to end classad message
    Send_Signal: Warning: could not send signal 3 (SIGQUIT) to pid 12936 (no longer exists)
    slot1_5[90.0]: Error sending signal to starter, errno = 0 (No error)
    meanwhile, on the AP:
    (90.0) (4744): condor_read(): Socket closed abnormally when trying to read 5 bytes from startd at <EP1-ip-addr:9618>, errno=10054 
    (90.0) (4744): Job 90.0 terminated: exited with status 0
    (90.0) (4744): Reporting job exit reason 100 and attempting to fetch new job.
    (90.0) (4744): ERROR: SharedPortEndpoint: Named pipe does not exist.
    (90.0) (4744): SharedPortEndpoint: StopListener could not stop the listener thread: 0
    (90.0) (4744): **** condor_shadow (condor_SHADOW) pid 4744 EXITING WITH STATUS 100

  5. On both EP1 and EP2:
    PERMISSION DENIED to AP-HOST$@mydomain from host AP-ip-addr for command 403 (DEACTIVATE_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request;
    DC_AUTHENTICATE: Command not authorized, done!

  6. Some of the output and error files get copied back to the AP, while others don't, which was my first clue that things weren't quite right.
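
For anyone wanting to repeat the ShadowLog vs. StartLog comparison, here is a minimal sketch of how it could be scripted, assuming the logs from the AP and the EPs have been copied to one place. The function name `matching_lines` and any file paths passed to it are hypothetical, not part of HTCondor; the script only does a plain substring match.

```python
from pathlib import Path


def matching_lines(log_paths, needle):
    """Return (log file name, line) pairs for every line containing `needle`.

    Files are read in the order given, so the result groups hits per log.
    """
    hits = []
    for path in map(Path, log_paths):
        # errors="replace" keeps going if a log contains odd bytes
        with path.open(errors="replace") as handle:
            for line in handle:
                if needle in line:
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

Calling it with the needle `90.0` should pick up both the `(90.0)` prefix used in the ShadowLog lines and the `slot1_5[90.0]` form in the StartLog lines quoted above.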

Are there additional resources for testing and exploring the MPI "Science" example?
Any ideas how to overcome these obstacles?

Thanks, 
Sam


