Hi Ben,
sorry for the long silence. It doesn't mean everything went
successfully; it was caused by the unavailability of our systems. We
had a security incident and all our systems went offline two weeks
ago. Next week I can (hopefully) test mutateLibcuda again.
Thanks for your reply; I'll report my results soon.
Best wishes,
Ilya
On 11.05.20 22:34, Benjamin Welton wrote:
>> Do I need to use a compute node?
>
> Yes, you will need to use a compute node to run the tool. It executes
> a small CUDA program to determine the location of the synchronization
> function in libcuda. Without a CUDA-capable graphics card, this test
> program will likely exit immediately and would give the error you are
> seeing. I would try running this on a compute node first before doing
> any other debugging.
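A quick way to verify that the node you are on can actually initialize CUDA is to probe the driver library directly. The sketch below uses Python's ctypes; cuInit and cuDeviceGetCount are CUDA driver API entry points, everything else is illustrative:

```python
import ctypes

def cuda_device_count():
    """Return the number of CUDA-capable devices, or -1 when the
    driver library is absent or fails to initialize (e.g. on a
    login node without a GPU)."""
    try:
        # libcuda ships with the NVIDIA driver, not the CUDA toolkit
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return -1  # no driver library at all
    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return -1  # driver present but no usable device
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return -1
    return count.value

if __name__ == "__main__":
    n = cuda_device_count()
    print("CUDA devices:", n if n >= 0 else "none found; use a compute node")
```

On a login node without a GPU this should report no devices, matching the immediate-exit behaviour Ben describes.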
>
> I have submitted a bug report on this issue because we should print a
> warning when the tool is run on a system without a CUDA-capable
> graphics card instead of failing with a random error
> (https://github.com/dyninst/tools/issues/15).
>
>> X86 with GCC 8.3.0
>
> This should be fine; there are no known issues with the tool or
> Dyninst with GCC 8.3. However, I have CC'd Tim Haines here in case
> there is some issue with Dyninst and GCC 8.3 that I am not aware of.
>
>> What else can go wrong here?
>
> There should be no issue. As mentioned, the kernel runtime limit was
> very unlikely to apply to your machine, but I figured it was worth
> mentioning in case the machine had some really strange setup.
>
> Ben
>
>
>
>
> On Mon, May 11, 2020 at 2:52 PM Ilya Zhukov <i.zhukov@xxxxxxxxxxxxx> wrote:
>
> Hello Ben and Nisarg,
>
> thank you for your help.
>
> > This test program is rewritten by the tool (using dyninst) and
> > executed. Was there a core file that was created for a program
> > called hang_devsync?
> I do not have any core file for "hang_devsync".
>
> > In any case there are three likely causes of this test program
> > crashing: 1) injecting the wrong libcuda.so into the test program.
> > This can occur if a parallel file system is in use and it contains a
> > libcuda that differs from the driver version in use by a compute
> > node (note: despite its name, libcuda is not part of the CUDA
> > toolkit, it is part of the GPU driver package itself). Check to make
> > sure the libcuda the tool is detecting and injecting into the
> > program matches the libcuda version applications run on the node
> > actually use (simplest way to check this is to manually run
> > hang_devsync on the compute node under GDB and check using "info
> > shared" what libcuda was dlopen'd by libcudart; this path should
> > match what was displayed by the tool in its log).
> In both cases I use the same library. My installation was done on the
> login nodes, where I do not have GPUs. Do I need to use a compute node?
>
> > 2) Dyninst instrumentation error. What platform (x86, PPC, etc.)
> > are you using this tool on?
> x86. I use JUWELS [1].
> > What version of Dyninst are you using?
> v10.1.0-41-g194dda7
> > What version of GCC/Clang is being used for compilation of Dyninst?
> GCC 8.3.0
> (cmake/make logs attached)
>
> > 3) (unlikely given that you appear to be running on a cluster) as
> > Nisarg mentioned, there is a timeout for CUDA kernels that run
> > longer than 5 seconds on machines that are using the Nvidia card as
> > a display adapter. This is a problem for the test program, which
> > spin-locks in a single kernel for a long time. You can test if this
> > is an issue by directly launching hang_devsync and seeing if it
> > exits (this program will never return if it is working correctly).
> "hang_devsync" exits immediately when I execute it. And our GPU
> experts say that there is no such thing as a kernel runtime limit on
> JUWELS. What else can go wrong here?
>
> Thanks,
> Ilya
>
> [1]
> https://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUWELS/Configuration/Configuration_node.html
>
> On 11.05.20 16:33, Benjamin Welton wrote:
> > Hello Ilya,
> >
> > As Nisarg mentioned, the likely issue here is that the test program
> > that is launched to determine the location of the internal
> > synchronization function (hang_devsync) did not complete (most
> > likely it crashed).
> >
> > This test program is rewritten by the tool (using dyninst) and
> > executed. Was there a core file that was created for a program
> > called hang_devsync?
> >
> > In any case there are three likely causes of this test program
> > crashing:
> > 1) injecting the wrong libcuda.so into the test program. This can
> > occur if a parallel file system is in use and it contains a libcuda
> > that differs from the driver version in use by a compute node
> > (note: despite its name, libcuda is not part of the CUDA toolkit,
> > it is part of the GPU driver package itself). Check to make sure
> > the libcuda the tool is detecting and injecting into the program
> > matches the libcuda version applications run on the node actually
> > use (simplest way to check this is to manually run hang_devsync on
> > the compute node under GDB and check using "info shared" what
> > libcuda was dlopen'd by libcudart; this path should match what was
> > displayed by the tool in its log).
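The GDB check described above can also be approximated by reading the process's memory map. The following Linux-only Python sketch (the helper name is hypothetical) lists the libcuda actually mapped into a running process; pass the PID of a running hang_devsync and compare the printed path with the one in the tool's log:

```python
import os
import re
import sys

def loaded_libcuda(pid):
    """List the libcuda.so* paths mapped into process `pid`.
    Reads /proc/<pid>/maps, so this works on Linux only."""
    paths = set()
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            # the mapped file path, if any, is the last field of the line
            match = re.search(r"(/\S*libcuda\.so\S*)\s*$", line)
            if match:
                paths.add(match.group(1))
    return sorted(paths)

if __name__ == "__main__":
    # pass the PID of a running hang_devsync; defaults to this process
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    libs = loaded_libcuda(pid)
    print("\n".join(libs) if libs else "no libcuda mapped")
```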
> >
> > 2) Dyninst instrumentation error. What platform (x86, PPC, etc.)
> > are you using this tool on? What version of Dyninst are you using?
> > What version of GCC/Clang is being used for compilation of Dyninst?
> >
> > 3) (unlikely given that you appear to be running on a cluster) as
> > Nisarg mentioned, there is a timeout for CUDA kernels that run
> > longer than 5 seconds on machines that are using the Nvidia card as
> > a display adapter. This is a problem for the test program, which
> > spin-locks in a single kernel for a long time. You can test if this
> > is an issue by directly launching hang_devsync and seeing if it
> > exits (this program will never return if it is working correctly).
> >
> > Ben
> >
> > On Mon, May 11, 2020, 12:21 AM NISARG SHAH <nisargs@xxxxxxxxxxx> wrote:
> >
> >     Thanks Ilya!
> >
> >     It looks like the instrumentation that figures out the
> >     synchronization function in CUDA did not run completely to the
> >     end (it takes around 20-30 minutes to finish).
> >
> >     Do you know if the segfault occurs immediately (within 4-5s)
> >     after the last line is printed to screen ("Inserting signal
> >     start instra in main")? If so, the cause of the error might be
> >     CUDA's kernel runtime limit. You might need to increase or
> >     disable it altogether.
> >
> >
> >     Regards
> >     Nisarg
> >
> >     ------------------------------------------------------------------------
> >     *From:* Ilya Zhukov
> >     *Sent:* Sunday, May 10, 2020 4:52 AM
> >     *To:* NISARG SHAH; dyninst-api@xxxxxxxxxxx
> >     *Subject:* Re: [DynInst_API:] mutateLibcuda segfaults
> >
> >     Hi Nisarg,
> >
> >     I do not have the "MS_outputids.bin" file, but I have 5 *.dot
> >     files in the directory where I ran the program.
> >
> >     Cheers,
> >     Ilya
> >
> >     On 09.05.20 00:15, NISARG SHAH wrote:
> >     > Hi Ilya,
> >     >
> >     > From the backtrace, it looks like the error is due to the
> >     > program not being able to read from a temporary file
> >     > "MS_outputids.bin" that it creates initially. Can you check if
> >     > it exists in the directory from where you ran the program?
> >     > Also, can you check if 5 *.dot files are present in the same
> >     > directory?
> >     >
> >     > Thanks
> >     > Nisarg
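For reference, frame #0 of the backtrace sits inside fseek(), which is consistent with Nisarg's diagnosis: fseek() called on the NULL FILE* from a failed fopen() of that temporary file segfaults exactly like this. The check he asks for can be sketched as follows (the helper name is hypothetical; only the file names come from this thread):

```python
import glob
import os

def check_tool_outputs(directory="."):
    """Check for the intermediate files the tool is expected to have
    written: the temporary MS_outputids.bin and the *.dot files."""
    bin_present = os.path.isfile(os.path.join(directory, "MS_outputids.bin"))
    dot_files = sorted(glob.glob(os.path.join(directory, "*.dot")))
    return bin_present, dot_files

if __name__ == "__main__":
    bin_present, dot_files = check_tool_outputs()
    # a missing MS_outputids.bin would explain a crash in fseek()
    print("MS_outputids.bin present:", bin_present)
    print("*.dot files found:", len(dot_files))
```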
> >     >
> >     ------------------------------------------------------------------------
> >     > *From:* Dyninst-api <dyninst-api-bounces@xxxxxxxxxxx> on
> >     > behalf of Ilya Zhukov <i.zhukov@xxxxxxxxxxxxx>
> >     > *Sent:* Wednesday, May 6, 2020 7:16 AM
> >     > *To:* dyninst-api@xxxxxxxxxxx
> >     > *Subject:* [DynInst_API:] mutateLibcuda segfaults
> >     >
> >     > Dear dyninst developers,
> >     >
> >     > I'm testing your cuda_sync_analyzer tool on our cluster with
> >     > CUDA 10.1.105.
> >     >
> >     > I installed dyninst and cuda_sync_analyzer (cmake and make
> >     > logs attached) successfully. But I get a segmentation fault
> >     > when I create the fake CUDA library.
> >     > Here is a backtrace:
> >     >> #0  0x00002b0a9658c4bc in fseek () from /usr/lib64/libc.so.6
> >     >> #1  0x00002b0a93b7eb29 in LaunchIdentifySync::PostProcessing (this=this@entry=0x7fff1af88af0, allFound=...) at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/LaunchIdentifySync.cpp:90
> >     >> #2  0x00002b0a93b7c00f in CSA_FindSyncAddress(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/FindCudaSync.cpp:34
> >     >> #3  0x00000000004021fb in main () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/main.cpp:15
> >     >> #4  0x00002b0a96537505 in __libc_start_main () from /usr/lib64/libc.so.6
> >     >> #5  0x000000000040253e in _start () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/main.cpp:38
> >     >
> >     > Any help would be appreciated. If you need anything else, let
> >     > me know.
> >     >
> >     > Best wishes,
> >     > Ilya
> >     > --
> >     > Ilya Zhukov
> >     > Juelich Supercomputing Centre
> >     > Institute for Advanced Simulation
> >     > Forschungszentrum Juelich GmbH
> >     > 52425 Juelich, Germany
> >     >
> >     > Phone: +49-2461-61-2054
> >     > Fax: +49-2461-61-2810
> >     > E-mail: i.zhukov@xxxxxxxxxxxxx
> >     > WWW: http://www.fz-juelich.de/jsc
> >
>
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature