[HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup - CPU stuck for...
- Date: Mon, 17 Oct 2016 14:28:37 +0100
- From: Antonio Dorta <adorta@xxxxxx>
- Subject: [HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup - CPU stuck for...
Hi!
After running journalctl, I see errors like the following:
Oct 15 14:34:46 vial kernel: NMI watchdog: BUG: soft lockup - CPU#0
stuck for 22s! [condor_starter:685]
Oct 15 14:34:46 vial kernel: Modules linked in: bnep bluetooth fuse
nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptabl
Oct 15 14:34:46 vial kernel: drm_kms_helper e1000e drm serio_raw ptp
pps_core video vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE)
Oct 15 14:34:46 vial kernel: CPU: 0 PID: 685 Comm: condor_starter
Tainted: G W OEL 4.1.5-100.fc21.x86_64 #1
Oct 15 14:34:46 vial kernel: Hardware name: ASUS All Series/Q87M-E,
BIOS 1303 10/17/2014
Oct 15 14:34:46 vial kernel: task: ffff8801e7b3b160 ti:
ffff880103d84000 task.ti: ffff880103d84000
Oct 15 14:34:46 vial kernel: RIP: 0010:[<ffffffff813b6535>]
[<ffffffff813b6535>] copy_user_enhanced_fast_string+0x5/0x10
Oct 15 14:34:46 vial kernel: RSP: 0018:ffff880103d87c00 EFLAGS: 00010286
Oct 15 14:34:46 vial kernel: RAX: 00007ffdddbed000 RBX:
ffffea00031cccc0 RCX: 0000000000000760
Oct 15 14:34:46 vial kernel: RDX: 0000000000001000 RSI:
00007ffdddbed8a0 RDI: ffff8800c73338a0
Oct 15 14:34:46 vial kernel: RBP: ffff880103d87c38 R08:
0000000000001000 R09: ffff88020f908000
Oct 15 14:34:46 vial kernel: R10: ffff880103d879b8 R11:
ffffea00031cccc0 R12: ffff8802156174e0
Oct 15 14:34:46 vial kernel: R13: ffffea00031cccc0 R14:
00000000a2bb9665 R15: ffff880103d87b78
Oct 15 14:34:46 vial kernel: FS: 00007f1d4908db80(0000)
GS:ffff88021fa00000(0000) knlGS:0000000000000000
Oct 15 14:34:46 vial kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 15 14:34:46 vial kernel: CR2: 00007f1d490b8000 CR3:
000000020640b000 CR4: 00000000001406f0
Oct 15 14:34:46 vial kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 15 14:34:46 vial kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 15 14:34:46 vial kernel: Stack:
Oct 15 14:34:46 vial kernel: ffffffff813bc0ca 0000000000010286
0000000000001000 00000000000d9000
Oct 15 14:34:46 vial kernel: ffff880103d87e68 0000000000000000
ffff880211d8f448 ffff880103d87ce8
Oct 15 14:34:46 vial kernel: ffffffff811aa1d6 ffff880103d87ca8
ffffffff8124a254 ffff880103d87c78
Oct 15 14:34:46 vial kernel: Call Trace:
Oct 15 14:34:46 vial kernel: [<ffffffff813bc0ca>] ?
iov_iter_copy_from_user_atomic+0x8a/0x210
Oct 15 14:34:46 vial kernel: [<ffffffff811aa1d6>]
generic_perform_write+0xe6/0x1e0
Oct 15 14:34:46 vial kernel: [<ffffffff8124a254>] ? mntput+0x24/0x40
Oct 15 14:34:46 vial kernel: [<ffffffff811ac738>]
__generic_file_write_iter+0x188/0x1d0
Oct 15 14:34:46 vial kernel: [<ffffffff8109e3dd>] ? get_task_mm+0x1d/0x50
Oct 15 14:34:46 vial kernel: [<ffffffff812b0ec5>]
ext4_file_write_iter+0x255/0x4c0
Oct 15 14:34:46 vial kernel: [<ffffffff81297574>] ?
proc_single_show+0x54/0xa0
Oct 15 14:34:46 vial kernel: [<ffffffff81798936>] ? mutex_lock+0x16/0x40
Oct 15 14:34:46 vial kernel: [<ffffffff8124dd7d>] ? seq_read+0xbd/0x3d0
Oct 15 14:34:46 vial kernel: [<ffffffff81269b3c>] ? fsnotify+0x3ac/0x580
Oct 15 14:34:46 vial kernel: [<ffffffff81227f21>] __vfs_write+0xd1/0x110
Oct 15 14:34:46 vial kernel: [<ffffffff812285f9>] vfs_write+0xa9/0x1b0
Oct 15 14:34:46 vial kernel: [<ffffffff81798936>] ? mutex_lock+0x16/0x40
Oct 15 14:34:46 vial kernel: [<ffffffff812294b5>] SyS_write+0x55/0xd0
Oct 15 14:34:46 vial kernel: [<ffffffff8179a8ee>]
system_call_fastpath+0x12/0x71
Oct 15 14:34:46 vial kernel: Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1
c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00
00 00 00 0f 1f 00
When I check the HTCondor StarterLog file, the only related event I can
find is the following, which gives little information:
10/15/16 14:35:53 Starter pid 32644 died on signal 11 (signal 11
(Segmentation fault))
10/15/16 14:35:53 slot1: State change: starter exited
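To line up the two logs, here is a minimal Python sketch (not part of HTCondor) that parses the soft lockup message and the StarterLog line quoted above and computes the time between them. It assumes each log entry was originally a single line before the mail client wrapped it, that the journal timestamp belongs to 2016 (the year is taken from the StarterLog date), and that the month abbreviation is in English. The regular expressions and function names are illustrative only. Note that the two messages report different pids (685 and 32644), so the sketch only measures the gap and does not show that the events are related.

```python
import re
from datetime import datetime

# Pattern for the kernel soft lockup line as printed by journalctl.
LOCKUP_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"NMI watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Pattern for the "Starter pid ... died on signal ..." line in StarterLog.
STARTER_RE = re.compile(
    r"^(?P<ts>\d\d/\d\d/\d\d \d\d:\d\d:\d\d) Starter pid (?P<pid>\d+) "
    r"died on signal (?P<sig>\d+)"
)


def parse_lockup(line, year):
    """Return a dict describing a soft lockup line, or None if no match."""
    m = LOCKUP_RE.match(line)
    if m is None:
        return None
    # journalctl's default output has no year, so the caller supplies one.
    when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return {
        "when": when,
        "cpu": int(m["cpu"]),
        "secs": int(m["secs"]),
        "comm": m["comm"],
        "pid": int(m["pid"]),
    }


def parse_starter_death(line):
    """Return a dict describing a starter death line, or None if no match."""
    m = STARTER_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m["ts"], "%m/%d/%y %H:%M:%S")
    return {"when": when, "pid": int(m["pid"]), "signal": int(m["sig"])}


# The two lines from this message, rejoined into single lines (assumption).
lockup_line = (
    "Oct 15 14:34:46 vial kernel: NMI watchdog: BUG: soft lockup - CPU#0 "
    "stuck for 22s! [condor_starter:685]"
)
starter_line = (
    "10/15/16 14:35:53 Starter pid 32644 died on signal 11 "
    "(signal 11 (Segmentation fault))"
)

lockup = parse_lockup(lockup_line, year=2016)
death = parse_starter_death(starter_line)
gap_seconds = (death["when"] - lockup["when"]).total_seconds()

print(f"lockup: {lockup['comm']} pid {lockup['pid']} on CPU {lockup['cpu']}")
print(f"starter death: pid {death['pid']} signal {death['signal']}")
print(f"gap: {gap_seconds:.0f} s")
```

For the two lines quoted in this message, the computed gap is 67 seconds.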
Do you know what the problem is and how it can be fixed?
I'm running HTCondor on Fedora 21 Linux with the latest stable version,
HTCondor 8.4.9 (it was updated last week, although this problem also
happened with previous versions).
Thank you very much,
--
Antonio Dorta
Servicios Informáticos Específicos (SIE)
Investigación y Enseñanza
Instituto de Astrofísica de Canarias (IAC)
C/ Vía Láctea, s/n. 38205 - La Laguna, Santa Cruz de Tenerife
Office: 1124. Phone: 922 60 5278. Email: adorta@xxxxxx
Supercomputing at IAC:
http://www.iac.es/sieinvens/SINFIN/Main/supercomputing.php
----------------------------------------------------------------
WARNING: For more information on privacy and fulfilment of the Law
concerning the Protection of Data, consult
http://www.iac.es/disclaimer.php?lang=en