Mailing List Archive

Kernel panic with 2.6.32-30 under network activity
Hello,

I've had several kernel panics on a domU under network activity (multiple
rsyncs using rsh). I didn't manage to reproduce it manually, but it happened
5 times during the last month.
Each time, it is the same kernel trace.

I am using Debian 5.0.8 with this kernel/hypervisor:

ii  linux-image-2.6.32-bpo.5-amd64  2.6.32-30~bpo50+1  Linux 2.6.32 for 64-bit PCs
ii  xen-hypervisor-4.0-amd64        4.0.1-2            The Xen Hypervisor on AMD64

Here is the trace:

[469390.126691] alignment check: 0000 [#1] SMP
[469390.126711] last sysfs file: /sys/devices/virtual/net/lo/operstate
[469390.126718] CPU 0
[469390.126725] Modules linked in: snd_pcsp xen_netfront snd_pcm evdev
snd_timer snd soundcore snd_page_alloc ext3 jbd mbcache dm_mirror
dm_region_hash dm_log dm_snapshot dm_mod xen_blkfront thermal_sys
[469390.126772] Pid: 22077, comm: rsh Not tainted 2.6.32-bpo.5-amd64 #1
[469390.126779] RIP: e030:[<ffffffff8126093d>] [<ffffffff8126093d>]
eth_header+0x61/0x9c
[469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
[469390.126802] RAX: 00000000090f0900 RBX: 0000000000000008 RCX:
ffff88001ecd0cee
[469390.126811] RDX: 0000000000000800 RSI: 000000000000000e RDI:
ffff88001ecd0cee
[469390.126820] RBP: ffff8800029016d0 R08: 0000000000000000 R09:
0000000000000034
[469390.126829] R10: 000000000000000e R11: ffffffff81255821 R12:
ffff880002935144
[469390.126838] R13: 0000000000000034 R14: ffff88001fe80000 R15:
ffff88001fe80000
[469390.126851] FS: 00007f340c2276e0(0000) GS:ffff880002f4d000(0000)
knlGS:0000000000000000
[469390.126860] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[469390.126867] CR2: 00007fffb8f33a8c CR3: 000000001d875000 CR4:
0000000000002660
[469390.126877] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[469390.126886] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[469390.126895] Process rsh (pid: 22077, threadinfo ffff88001ec3e000, task
ffff88001ea61530)
[469390.126904] Stack:
[469390.126908] 0000000000000000 0000000000000000 ffff88001ecd0cfc
ffff88001f1a4ae8
[469390.126921] <0> ffff880002935100 ffff880002935140 0000000000000000
ffffffff81255a20
[469390.126937] <0> 0000000000000000 ffffffff8127743d 0000000000000000
ffff88001ecd0cfc
[469390.126954] Call Trace:
[469390.126963] [<ffffffff81255a20>] ? neigh_resolve_output+0x1ff/0x284
[469390.126974] [<ffffffff8127743d>] ? ip_finish_output2+0x1d6/0x22b
[469390.126983] [<ffffffff8127708f>] ? ip_queue_xmit+0x311/0x386
[469390.126994] [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
[469390.127003] [<ffffffff8100e242>] ? check_events+0x12/0x20
[469390.127013] [<ffffffff81287a47>] ? tcp_transmit_skb+0x648/0x687
[469390.127022] [<ffffffff8100e242>] ? check_events+0x12/0x20
[469390.127031] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.127040] [<ffffffff81289ec9>] ? tcp_write_xmit+0x874/0x96c
[469390.127049] [<ffffffff8128a00e>] ? __tcp_push_pending_frames+0x22/0x53
[469390.127059] [<ffffffff8127d409>] ? tcp_close+0x176/0x3d0
[469390.127069] [<ffffffff81299f0c>] ? inet_release+0x4e/0x54
[469390.127079] [<ffffffff812410d1>] ? sock_release+0x19/0x66
[469390.127087] [<ffffffff81241140>] ? sock_close+0x22/0x26
[469390.127097] [<ffffffff810ef879>] ? __fput+0x100/0x1af
[469390.127106] [<ffffffff810eccb6>] ? filp_close+0x5b/0x62
[469390.127116] [<ffffffff8104f878>] ? put_files_struct+0x64/0xc1
[469390.127127] [<ffffffff812fbb02>] ? _spin_lock_irq+0x7/0x22
[469390.127135] [<ffffffff81051141>] ? do_exit+0x236/0x6c6
[469390.127144] [<ffffffff8100c241>] ?
__raw_callee_save_xen_pud_val+0x11/0x1e
[469390.127154] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.127163] [<ffffffff8100c205>] ?
__raw_callee_save_xen_pmd_val+0x11/0x1e
[469390.127173] [<ffffffff81051647>] ? do_group_exit+0x76/0x9d
[469390.127183] [<ffffffff8105dec1>] ? get_signal_to_deliver+0x318/0x343
[469390.127193] [<ffffffff8101004f>] ? do_notify_resume+0x87/0x73f
[469390.127202] [<ffffffff812fbf45>] ? page_fault+0x25/0x30
[469390.127211] [<ffffffff812fc17a>] ? error_exit+0x2a/0x60
[469390.127219] [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
[469390.127228] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.127240] [<ffffffff8119564d>] ? __put_user_4+0x1d/0x30
[469390.128009] [<ffffffff81010e0e>] ? int_signal+0x12/0x17
[469390.128009] Code: 89 e8 86 e0 66 89 47 0c 48 85 ed 75 07 49 8b ae 20 02
00 00 8b 45 00 4d 85 e4 89 47 06 66 8b 45 04 66 89 47 0a 74 12 41 8b 04 24
<89> 07 66 41 8b 44 24 04 66 89 47 04 eb 18 41 f6 86 60 01 00 00
[469390.128009] RIP [<ffffffff8126093d>] eth_header+0x61/0x9c
[469390.128009] RSP <ffff88001ec3f9b8>
[469390.128009] ---[ end trace dd6b1396ef9d9a96 ]---
[469390.128009] Kernel panic - not syncing: Fatal exception in interrupt
[469390.128009] Pid: 22077, comm: rsh Tainted: G D
2.6.32-bpo.5-amd64 #1
[469390.128009] Call Trace:
[469390.128009] [<ffffffff812f9d03>] ? panic+0x86/0x143
[469390.128009] [<ffffffff812fbbca>] ? _spin_unlock_irqrestore+0xd/0xe
[469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.128009] [<ffffffff812fbbca>] ? _spin_unlock_irqrestore+0xd/0xe
[469390.128009] [<ffffffff8104e387>] ? release_console_sem+0x17e/0x1af
[469390.128009] [<ffffffff812fca65>] ? oops_end+0xa7/0xb4
[469390.128009] [<ffffffff81012416>] ? do_alignment_check+0x88/0x92
[469390.128009] [<ffffffff81011a75>] ? alignment_check+0x25/0x30
[469390.128009] [<ffffffff81255821>] ? neigh_resolve_output+0x0/0x284
[469390.128009] [<ffffffff8126093d>] ? eth_header+0x61/0x9c
[469390.128009] [<ffffffff81260900>] ? eth_header+0x24/0x9c
[469390.128009] [<ffffffff81255a20>] ? neigh_resolve_output+0x1ff/0x284
[469390.128009] [<ffffffff8127743d>] ? ip_finish_output2+0x1d6/0x22b
[469390.128009] [<ffffffff8127708f>] ? ip_queue_xmit+0x311/0x386
[469390.128009] [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
[469390.128009] [<ffffffff8100e242>] ? check_events+0x12/0x20
[469390.128009] [<ffffffff81287a47>] ? tcp_transmit_skb+0x648/0x687
[469390.128009] [<ffffffff8100e242>] ? check_events+0x12/0x20
[469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.128009] [<ffffffff81289ec9>] ? tcp_write_xmit+0x874/0x96c
[469390.128009] [<ffffffff8128a00e>] ? __tcp_push_pending_frames+0x22/0x53
[469390.128009] [<ffffffff8127d409>] ? tcp_close+0x176/0x3d0
[469390.128009] [<ffffffff81299f0c>] ? inet_release+0x4e/0x54
[469390.128009] [<ffffffff812410d1>] ? sock_release+0x19/0x66
[469390.128009] [<ffffffff81241140>] ? sock_close+0x22/0x26
[469390.128009] [<ffffffff810ef879>] ? __fput+0x100/0x1af
[469390.128009] [<ffffffff810eccb6>] ? filp_close+0x5b/0x62
[469390.128009] [<ffffffff8104f878>] ? put_files_struct+0x64/0xc1
[469390.128009] [<ffffffff812fbb02>] ? _spin_lock_irq+0x7/0x22
[469390.128009] [<ffffffff81051141>] ? do_exit+0x236/0x6c6
[469390.128009] [<ffffffff8100c241>] ?
__raw_callee_save_xen_pud_val+0x11/0x1e
[469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.128009] [<ffffffff8100c205>] ?
__raw_callee_save_xen_pmd_val+0x11/0x1e
[469390.128009] [<ffffffff81051647>] ? do_group_exit+0x76/0x9d
[469390.128009] [<ffffffff8105dec1>] ? get_signal_to_deliver+0x318/0x343
[469390.128009] [<ffffffff8101004f>] ? do_notify_resume+0x87/0x73f
[469390.128009] [<ffffffff812fbf45>] ? page_fault+0x25/0x30
[469390.128009] [<ffffffff812fc17a>] ? error_exit+0x2a/0x60
[469390.128009] [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
[469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.128009] [<ffffffff8119564d>] ? __put_user_4+0x1d/0x30
[469390.128009] [<ffffffff81010e0e>] ? int_signal+0x12/0x17

I found another post which may be the same bug (same kernel, network
activity ...):

http://jira.mongodb.org/browse/SERVER-2383

Any ideas?

Regards

Olivier
Re: Kernel panic with 2.6.32-30 under network activity
On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
> Hello,
>
> I've had several kernel panics on a domU under network activity (multiple
> rsyncs using rsh). I didn't manage to reproduce it manually, but it happened
> 5 times during the last month.

Does it happen all the time?
> Each time, it is the same kernel trace.
>
> I am using Debian 5.0.8 with kernel/hypervisor :
>
> ii linux-image-2.6.32-bpo.5-amd64 2.6.32-30~bpo50+1 Linux 2.6.32 for
> 64-bit PCs
> ii xen-hypervisor-4.0-amd64 4.0.1-2 The
> Xen Hypervisor on AMD64
>
> Here is the trace :
>
> [469390.126691] alignment check: 0000 [#1] SMP

alignment check? Was there anything else in the log before this? Was there
anything in the Dom0 log?

> [469390.126711] last sysfs file: /sys/devices/virtual/net/lo/operstate
> [469390.126718] CPU 0
> [469390.126725] Modules linked in: snd_pcsp xen_netfront snd_pcm evdev
> snd_timer snd soundcore snd_page_alloc ext3 jbd mbcache dm_mirror
> dm_region_hash dm_log dm_snapshot dm_mod xen_blkfront thermal_sys
> [469390.126772] Pid: 22077, comm: rsh Not tainted 2.6.32-bpo.5-amd64 #1
> [469390.126779] RIP: e030:[<ffffffff8126093d>] [<ffffffff8126093d>]
> eth_header+0x61/0x9c
> [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
> [469390.126802] RAX: 00000000090f0900 RBX: 0000000000000008 RCX:
> ffff88001ecd0cee
> [469390.126811] RDX: 0000000000000800 RSI: 000000000000000e RDI:
> ffff88001ecd0cee
> [469390.126820] RBP: ffff8800029016d0 R08: 0000000000000000 R09:
> 0000000000000034
> [469390.126829] R10: 000000000000000e R11: ffffffff81255821 R12:
> ffff880002935144
> [469390.126838] R13: 0000000000000034 R14: ffff88001fe80000 R15:
> ffff88001fe80000
> [469390.126851] FS: 00007f340c2276e0(0000) GS:ffff880002f4d000(0000)
> knlGS:0000000000000000
> [469390.126860] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [469390.126867] CR2: 00007fffb8f33a8c CR3: 000000001d875000 CR4:
> 0000000000002660
> [469390.126877] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [469390.126886] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [469390.126895] Process rsh (pid: 22077, threadinfo ffff88001ec3e000, task
> ffff88001ea61530)
> [469390.126904] Stack:
> [469390.126908] 0000000000000000 0000000000000000 ffff88001ecd0cfc
> ffff88001f1a4ae8
> [469390.126921] <0> ffff880002935100 ffff880002935140 0000000000000000
> ffffffff81255a20
> [469390.126937] <0> 0000000000000000 ffffffff8127743d 0000000000000000
> ffff88001ecd0cfc
> [469390.126954] Call Trace:
> [469390.126963] [<ffffffff81255a20>] ? neigh_resolve_output+0x1ff/0x284
> [469390.126974] [<ffffffff8127743d>] ? ip_finish_output2+0x1d6/0x22b
> [469390.126983] [<ffffffff8127708f>] ? ip_queue_xmit+0x311/0x386
> [469390.126994] [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
> [469390.127003] [<ffffffff8100e242>] ? check_events+0x12/0x20
> [469390.127013] [<ffffffff81287a47>] ? tcp_transmit_skb+0x648/0x687
> [469390.127022] [<ffffffff8100e242>] ? check_events+0x12/0x20
> [469390.127031] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> [469390.127040] [<ffffffff81289ec9>] ? tcp_write_xmit+0x874/0x96c
> [469390.127049] [<ffffffff8128a00e>] ? __tcp_push_pending_frames+0x22/0x53
> [469390.127059] [<ffffffff8127d409>] ? tcp_close+0x176/0x3d0
> [469390.127069] [<ffffffff81299f0c>] ? inet_release+0x4e/0x54
> [469390.127079] [<ffffffff812410d1>] ? sock_release+0x19/0x66
> [469390.127087] [<ffffffff81241140>] ? sock_close+0x22/0x26
> [469390.127097] [<ffffffff810ef879>] ? __fput+0x100/0x1af
> [469390.127106] [<ffffffff810eccb6>] ? filp_close+0x5b/0x62
> [469390.127116] [<ffffffff8104f878>] ? put_files_struct+0x64/0xc1
> [469390.127127] [<ffffffff812fbb02>] ? _spin_lock_irq+0x7/0x22
> [469390.127135] [<ffffffff81051141>] ? do_exit+0x236/0x6c6
> [469390.127144] [<ffffffff8100c241>] ?
> __raw_callee_save_xen_pud_val+0x11/0x1e
> [469390.127154] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> [469390.127163] [<ffffffff8100c205>] ?
> __raw_callee_save_xen_pmd_val+0x11/0x1e
> [469390.127173] [<ffffffff81051647>] ? do_group_exit+0x76/0x9d
> [469390.127183] [<ffffffff8105dec1>] ? get_signal_to_deliver+0x318/0x343
> [469390.127193] [<ffffffff8101004f>] ? do_notify_resume+0x87/0x73f
> [469390.127202] [<ffffffff812fbf45>] ? page_fault+0x25/0x30
> [469390.127211] [<ffffffff812fc17a>] ? error_exit+0x2a/0x60
> [469390.127219] [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
> [469390.127228] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> [469390.127240] [<ffffffff8119564d>] ? __put_user_4+0x1d/0x30
> [469390.128009] [<ffffffff81010e0e>] ? int_signal+0x12/0x17
> [469390.128009] Code: 89 e8 86 e0 66 89 47 0c 48 85 ed 75 07 49 8b ae 20 02
> 00 00 8b 45 00 4d 85 e4 89 47 06 66 8b 45 04 66 89 47 0a 74 12 41 8b 04 24
> <89> 07 66 41 8b 44 24 04 66 89 47 04 eb 18 41 f6 86 60 01 00 00
> [469390.128009] RIP [<ffffffff8126093d>] eth_header+0x61/0x9c
> [469390.128009] RSP <ffff88001ec3f9b8>
> [469390.128009] ---[ end trace dd6b1396ef9d9a96 ]---
> [469390.128009] Kernel panic - not syncing: Fatal exception in interrupt
> [469390.128009] Pid: 22077, comm: rsh Tainted: G D
> 2.6.32-bpo.5-amd64 #1
> [469390.128009] Call Trace:
> [469390.128009] [<ffffffff812f9d03>] ? panic+0x86/0x143
> [469390.128009] [<ffffffff812fbbca>] ? _spin_unlock_irqrestore+0xd/0xe
> [469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> [469390.128009] [<ffffffff812fbbca>] ? _spin_unlock_irqrestore+0xd/0xe
> [469390.128009] [<ffffffff8104e387>] ? release_console_sem+0x17e/0x1af
> [469390.128009] [<ffffffff812fca65>] ? oops_end+0xa7/0xb4
> [469390.128009] [<ffffffff81012416>] ? do_alignment_check+0x88/0x92
> [469390.128009] [<ffffffff81011a75>] ? alignment_check+0x25/0x30
> [469390.128009] [<ffffffff81255821>] ? neigh_resolve_output+0x0/0x284
> [469390.128009] [<ffffffff8126093d>] ? eth_header+0x61/0x9c
> [469390.128009] [<ffffffff81260900>] ? eth_header+0x24/0x9c
> [469390.128009] [<ffffffff81255a20>] ? neigh_resolve_output+0x1ff/0x284
> [469390.128009] [<ffffffff8127743d>] ? ip_finish_output2+0x1d6/0x22b
> [469390.128009] [<ffffffff8127708f>] ? ip_queue_xmit+0x311/0x386
> [469390.128009] [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
> [469390.128009] [<ffffffff8100e242>] ? check_events+0x12/0x20
> [469390.128009] [<ffffffff81287a47>] ? tcp_transmit_skb+0x648/0x687
> [469390.128009] [<ffffffff8100e242>] ? check_events+0x12/0x20
> [469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> [469390.128009] [<ffffffff81289ec9>] ? tcp_write_xmit+0x874/0x96c
> [469390.128009] [<ffffffff8128a00e>] ? __tcp_push_pending_frames+0x22/0x53
> [469390.128009] [<ffffffff8127d409>] ? tcp_close+0x176/0x3d0
> [469390.128009] [<ffffffff81299f0c>] ? inet_release+0x4e/0x54
> [469390.128009] [<ffffffff812410d1>] ? sock_release+0x19/0x66
> [469390.128009] [<ffffffff81241140>] ? sock_close+0x22/0x26
> [469390.128009] [<ffffffff810ef879>] ? __fput+0x100/0x1af
> [469390.128009] [<ffffffff810eccb6>] ? filp_close+0x5b/0x62
> [469390.128009] [<ffffffff8104f878>] ? put_files_struct+0x64/0xc1
> [469390.128009] [<ffffffff812fbb02>] ? _spin_lock_irq+0x7/0x22
> [469390.128009] [<ffffffff81051141>] ? do_exit+0x236/0x6c6
> [469390.128009] [<ffffffff8100c241>] ?
> __raw_callee_save_xen_pud_val+0x11/0x1e
> [469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> [469390.128009] [<ffffffff8100c205>] ?
> __raw_callee_save_xen_pmd_val+0x11/0x1e
> [469390.128009] [<ffffffff81051647>] ? do_group_exit+0x76/0x9d
> [469390.128009] [<ffffffff8105dec1>] ? get_signal_to_deliver+0x318/0x343
> [469390.128009] [<ffffffff8101004f>] ? do_notify_resume+0x87/0x73f
> [469390.128009] [<ffffffff812fbf45>] ? page_fault+0x25/0x30
> [469390.128009] [<ffffffff812fc17a>] ? error_exit+0x2a/0x60
> [469390.128009] [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
> [469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> [469390.128009] [<ffffffff8119564d>] ? __put_user_4+0x1d/0x30
> [469390.128009] [<ffffffff81010e0e>] ? int_signal+0x12/0x17
>
> I found another post, which may be the same bug (same kernel, network
> activity ... ) :
>
> http://jira.mongodb.org/browse/SERVER-2383
>
> Any ideas ?

None.. What type of CPU do you have? Are you pinning your
guest to a specific CPU?

Re: Kernel panic with 2.6.32-30 under network activity
>>> On 16.03.11 at 04:20, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
>> [469390.126691] alignment check: 0000 [#1] SMP
>
> alignment check? Was there anything else in the log before this? Was there
> anything in the Dom0 log?

This together with

>> [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286

makes me wonder if either eflags got restored from a corrupted
stack slot somewhere, or whether something in the kernel or one
of the modules intentionally played with EFLAGS.AC.
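
For reference, AC is bit 18 of EFLAGS, and the reported value 00050286 does
have that bit set (along with RF and IF). A minimal sketch that decodes the
relevant bits (hypothetical helper, nothing Xen-specific):

/* eflags_bits.c - decode the EFLAGS value printed in the oops above.
 * AC (alignment check) is bit 18; it should normally never be set while
 * kernel code is running. */
#include <stdio.h>

int main(void)
{
        unsigned long eflags = 0x00050286UL;   /* from "EFLAGS: 00050286" */

        printf("AC (bit 18) = %lu\n", (eflags >> 18) & 1);  /* prints 1 */
        printf("RF (bit 16) = %lu\n", (eflags >> 16) & 1);  /* prints 1 */
        printf("IF (bit 9)  = %lu\n", (eflags >> 9)  & 1);  /* prints 1 */
        return 0;
}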

Jan


Re: Kernel panic with 2.6.32-30 under network activity
Hello,

Yes, this bug happens quite often.

About my CPU, I am using:

model name : Intel(R) Xeon(R) CPU L5420 @ 2.50GHz

There is no log at all before this message on the domU. I got this message
from the Xen console.

This guest isn't pinned to a specific CPU:

Name        ID  VCPU  CPU  State   Time(s)  CPU Affinity
Domain-0     0     0    0  r--     18098.1  0
domU        15     0    1  -b-      3060.8  any cpu
domU        15     1    4  -b-      1693.4  any cpu

My dom0 is pinned:

release : 2.6.32-bpo.5-xen-amd64
version : #1 SMP Mon Jan 17 22:05:11 UTC 2011
machine : x86_64
nr_cpus : 8
nr_nodes : 1
cores_per_socket : 4
threads_per_core : 1
cpu_mhz : 2493
hw_caps : bfebfbff:20000800:00000000:00000940:000ce3bd:00000000:00000001:00000000
virt_caps : hvm
total_memory : 10239
free_memory : 405
node_to_cpu : node0:0-7
node_to_memory : node0:405
node_to_dma32_mem : node0:405
max_node_id : 0
xen_major : 4
xen_minor : 0
xen_extra : .1
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : unavailable
xen_commandline : dom0_mem=512M loglvl=all guest_loglvl=all dom0_max_vcpus=1 dom0_vcpus_pin console=vga,com1 com1=19200,8n1 clocksource=pit cpuidle=0
cc_compiler : gcc version 4.4.5 (Debian 4.4.5-10)
cc_compile_by : waldi
cc_compile_domain : debian.org
cc_compile_date : Wed Jan 12 14:04:06 UTC 2011
xend_config_format : 4

I was running top/vmstat before this crash and saw nothing strange (no
swapping, no load, not a lot of I/O ... just the network rsync).

Regarding logs in Dom0, "xm dmesg" shows:

(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935
(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935
(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935
(XEN) grant_table.c:204:d0 Increased maptrack size to 2 frames.
(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935
(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935

I don't know whether this is relevant or not. At the next kernel panic, I
will check whether another line gets appended.

Hope this helps.

Olivier


2011/3/16 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

> On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
> > Hello,
> >
> > I've had several kernel panics on a domU under network activity (multiple
> > rsyncs using rsh). I didn't manage to reproduce it manually, but it
> > happened 5 times during the last month.
>
> Does it happen all the time?
> > Each time, it is the same kernel trace.
> >
> > I am using Debian 5.0.8 with kernel/hypervisor :
> >
> > ii linux-image-2.6.32-bpo.5-amd64 2.6.32-30~bpo50+1 Linux 2.6.32
> for
> > 64-bit PCs
> > ii xen-hypervisor-4.0-amd64 4.0.1-2
> The
> > Xen Hypervisor on AMD64
> >
> > Here is the trace :
> >
> > [469390.126691] alignment check: 0000 [#1] SMP
>
> alignment check? Was there anything else in the log before this? Was there
> anything in the Dom0 log?
>
> > [469390.126711] last sysfs file: /sys/devices/virtual/net/lo/operstate
> > [469390.126718] CPU 0
> > [469390.126725] Modules linked in: snd_pcsp xen_netfront snd_pcm evdev
> > snd_timer snd soundcore snd_page_alloc ext3 jbd mbcache dm_mirror
> > dm_region_hash dm_log dm_snapshot dm_mod xen_blkfront thermal_sys
> > [469390.126772] Pid: 22077, comm: rsh Not tainted 2.6.32-bpo.5-amd64 #1
> > [469390.126779] RIP: e030:[<ffffffff8126093d>] [<ffffffff8126093d>]
> > eth_header+0x61/0x9c
> > [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
> > [469390.126802] RAX: 00000000090f0900 RBX: 0000000000000008 RCX:
> > ffff88001ecd0cee
> > [469390.126811] RDX: 0000000000000800 RSI: 000000000000000e RDI:
> > ffff88001ecd0cee
> > [469390.126820] RBP: ffff8800029016d0 R08: 0000000000000000 R09:
> > 0000000000000034
> > [469390.126829] R10: 000000000000000e R11: ffffffff81255821 R12:
> > ffff880002935144
> > [469390.126838] R13: 0000000000000034 R14: ffff88001fe80000 R15:
> > ffff88001fe80000
> > [469390.126851] FS: 00007f340c2276e0(0000) GS:ffff880002f4d000(0000)
> > knlGS:0000000000000000
> > [469390.126860] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [469390.126867] CR2: 00007fffb8f33a8c CR3: 000000001d875000 CR4:
> > 0000000000002660
> > [469390.126877] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [469390.126886] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> > 0000000000000400
> > [469390.126895] Process rsh (pid: 22077, threadinfo ffff88001ec3e000,
> task
> > ffff88001ea61530)
> > [469390.126904] Stack:
> > [469390.126908] 0000000000000000 0000000000000000 ffff88001ecd0cfc
> > ffff88001f1a4ae8
> > [469390.126921] <0> ffff880002935100 ffff880002935140 0000000000000000
> > ffffffff81255a20
> > [469390.126937] <0> 0000000000000000 ffffffff8127743d 0000000000000000
> > ffff88001ecd0cfc
> > [469390.126954] Call Trace:
> > [469390.126963] [<ffffffff81255a20>] ? neigh_resolve_output+0x1ff/0x284
> > [469390.126974] [<ffffffff8127743d>] ? ip_finish_output2+0x1d6/0x22b
> > [469390.126983] [<ffffffff8127708f>] ? ip_queue_xmit+0x311/0x386
> > [469390.126994] [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
> > [469390.127003] [<ffffffff8100e242>] ? check_events+0x12/0x20
> > [469390.127013] [<ffffffff81287a47>] ? tcp_transmit_skb+0x648/0x687
> > [469390.127022] [<ffffffff8100e242>] ? check_events+0x12/0x20
> > [469390.127031] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> > [469390.127040] [<ffffffff81289ec9>] ? tcp_write_xmit+0x874/0x96c
> > [469390.127049] [<ffffffff8128a00e>] ?
> __tcp_push_pending_frames+0x22/0x53
> > [469390.127059] [<ffffffff8127d409>] ? tcp_close+0x176/0x3d0
> > [469390.127069] [<ffffffff81299f0c>] ? inet_release+0x4e/0x54
> > [469390.127079] [<ffffffff812410d1>] ? sock_release+0x19/0x66
> > [469390.127087] [<ffffffff81241140>] ? sock_close+0x22/0x26
> > [469390.127097] [<ffffffff810ef879>] ? __fput+0x100/0x1af
> > [469390.127106] [<ffffffff810eccb6>] ? filp_close+0x5b/0x62
> > [469390.127116] [<ffffffff8104f878>] ? put_files_struct+0x64/0xc1
> > [469390.127127] [<ffffffff812fbb02>] ? _spin_lock_irq+0x7/0x22
> > [469390.127135] [<ffffffff81051141>] ? do_exit+0x236/0x6c6
> > [469390.127144] [<ffffffff8100c241>] ?
> > __raw_callee_save_xen_pud_val+0x11/0x1e
> > [469390.127154] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> > [469390.127163] [<ffffffff8100c205>] ?
> > __raw_callee_save_xen_pmd_val+0x11/0x1e
> > [469390.127173] [<ffffffff81051647>] ? do_group_exit+0x76/0x9d
> > [469390.127183] [<ffffffff8105dec1>] ? get_signal_to_deliver+0x318/0x343
> > [469390.127193] [<ffffffff8101004f>] ? do_notify_resume+0x87/0x73f
> > [469390.127202] [<ffffffff812fbf45>] ? page_fault+0x25/0x30
> > [469390.127211] [<ffffffff812fc17a>] ? error_exit+0x2a/0x60
> > [469390.127219] [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
> > [469390.127228] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> > [469390.127240] [<ffffffff8119564d>] ? __put_user_4+0x1d/0x30
> > [469390.128009] [<ffffffff81010e0e>] ? int_signal+0x12/0x17
> > [469390.128009] Code: 89 e8 86 e0 66 89 47 0c 48 85 ed 75 07 49 8b ae 20
> 02
> > 00 00 8b 45 00 4d 85 e4 89 47 06 66 8b 45 04 66 89 47 0a 74 12 41 8b 04
> 24
> > <89> 07 66 41 8b 44 24 04 66 89 47 04 eb 18 41 f6 86 60 01 00 00
> > [469390.128009] RIP [<ffffffff8126093d>] eth_header+0x61/0x9c
> > [469390.128009] RSP <ffff88001ec3f9b8>
> > [469390.128009] ---[ end trace dd6b1396ef9d9a96 ]---
> > [469390.128009] Kernel panic - not syncing: Fatal exception in interrupt
> > [469390.128009] Pid: 22077, comm: rsh Tainted: G D
> > 2.6.32-bpo.5-amd64 #1
> > [469390.128009] Call Trace:
> > [469390.128009] [<ffffffff812f9d03>] ? panic+0x86/0x143
> > [469390.128009] [<ffffffff812fbbca>] ? _spin_unlock_irqrestore+0xd/0xe
> > [469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> > [469390.128009] [<ffffffff812fbbca>] ? _spin_unlock_irqrestore+0xd/0xe
> > [469390.128009] [<ffffffff8104e387>] ? release_console_sem+0x17e/0x1af
> > [469390.128009] [<ffffffff812fca65>] ? oops_end+0xa7/0xb4
> > [469390.128009] [<ffffffff81012416>] ? do_alignment_check+0x88/0x92
> > [469390.128009] [<ffffffff81011a75>] ? alignment_check+0x25/0x30
> > [469390.128009] [<ffffffff81255821>] ? neigh_resolve_output+0x0/0x284
> > [469390.128009] [<ffffffff8126093d>] ? eth_header+0x61/0x9c
> > [469390.128009] [<ffffffff81260900>] ? eth_header+0x24/0x9c
> > [469390.128009] [<ffffffff81255a20>] ? neigh_resolve_output+0x1ff/0x284
> > [469390.128009] [<ffffffff8127743d>] ? ip_finish_output2+0x1d6/0x22b
> > [469390.128009] [<ffffffff8127708f>] ? ip_queue_xmit+0x311/0x386
> > [469390.128009] [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
> > [469390.128009] [<ffffffff8100e242>] ? check_events+0x12/0x20
> > [469390.128009] [<ffffffff81287a47>] ? tcp_transmit_skb+0x648/0x687
> > [469390.128009] [<ffffffff8100e242>] ? check_events+0x12/0x20
> > [469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> > [469390.128009] [<ffffffff81289ec9>] ? tcp_write_xmit+0x874/0x96c
> > [469390.128009] [<ffffffff8128a00e>] ?
> __tcp_push_pending_frames+0x22/0x53
> > [469390.128009] [<ffffffff8127d409>] ? tcp_close+0x176/0x3d0
> > [469390.128009] [<ffffffff81299f0c>] ? inet_release+0x4e/0x54
> > [469390.128009] [<ffffffff812410d1>] ? sock_release+0x19/0x66
> > [469390.128009] [<ffffffff81241140>] ? sock_close+0x22/0x26
> > [469390.128009] [<ffffffff810ef879>] ? __fput+0x100/0x1af
> > [469390.128009] [<ffffffff810eccb6>] ? filp_close+0x5b/0x62
> > [469390.128009] [<ffffffff8104f878>] ? put_files_struct+0x64/0xc1
> > [469390.128009] [<ffffffff812fbb02>] ? _spin_lock_irq+0x7/0x22
> > [469390.128009] [<ffffffff81051141>] ? do_exit+0x236/0x6c6
> > [469390.128009] [<ffffffff8100c241>] ?
> > __raw_callee_save_xen_pud_val+0x11/0x1e
> > [469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> > [469390.128009] [<ffffffff8100c205>] ?
> > __raw_callee_save_xen_pmd_val+0x11/0x1e
> > [469390.128009] [<ffffffff81051647>] ? do_group_exit+0x76/0x9d
> > [469390.128009] [<ffffffff8105dec1>] ? get_signal_to_deliver+0x318/0x343
> > [469390.128009] [<ffffffff8101004f>] ? do_notify_resume+0x87/0x73f
> > [469390.128009] [<ffffffff812fbf45>] ? page_fault+0x25/0x30
> > [469390.128009] [<ffffffff812fc17a>] ? error_exit+0x2a/0x60
> > [469390.128009] [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
> > [469390.128009] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
> > [469390.128009] [<ffffffff8119564d>] ? __put_user_4+0x1d/0x30
> > [469390.128009] [<ffffffff81010e0e>] ? int_signal+0x12/0x17
> >
> > I found another post, which may be the same bug (same kernel, network
> > activity ... ) :
> >
> > http://jira.mongodb.org/browse/SERVER-2383
> >
> > Any ideas ?
>
> None.. What type of CPU do you have? Are you pinning your
> guest to a specific CPU?
>
Re: Kernel panic with 2.6.32-30 under network activity
On Wed, 2011-03-16 at 09:34 +0000, Jan Beulich wrote:
> >>> On 16.03.11 at 04:20, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
> >> [469390.126691] alignment check: 0000 [#1] SMP
> >
> > alignment check? Was there anything else in the log before this? Was there
> > anything in the Dom0 log?
>
> This together with
>
> >> [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
>
> makes me wonder if either eflags got restored from a corrupted
> stack slot somewhere, or whether something in the kernel or one
> of the modules intentionally played with EFLAGS.AC.

Can a PV kernel running in ring-3 change AC?

The Intel manual says "They should not be modified by application
programs" over a list including AC, but the list also includes e.g. IOPL
and IF, so I suspect it meant "can not" rather than "should not"? In
which case it can't happen by accident.

The hypervisor appears to clear the guest's EFLAGS.AC on context switch
to a guest and failsafe bounce, but not in e.g. do_iret, so it's not
entirely clear what the policy is...

Ian.


Re: Kernel panic with 2.6.32-30 under network activity
>>> On 16.03.11 at 11:11, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2011-03-16 at 09:34 +0000, Jan Beulich wrote:
>> >>> On 16.03.11 at 04:20, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> > On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
>> >> [469390.126691] alignment check: 0000 [#1] SMP
>> >
>> > alignment check? Was there anything else in the log before this? Was there
>> > anything in the Dom0 log?
>>
>> This together with
>>
>> >> [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
>>
>> makes me wonder if either eflags got restored from a corrupted
>> stack slot somewhere, or whether something in the kernel or one
>> of the modules intentionally played with EFLAGS.AC.
>
> Can a PV kernel running in ring-3 change AC?

Yes. We had this problem until we cleared the flag in
create_bounce_frame().

> The Intel manual says "They should not be modified by application
> programs" over a list including AC but the list also includes e.g. IOPL
> and IF so I suspect it meant "can not" rather than "should not"? In
> which case it can't happen by accident.

No, afaik "should not" is the correct term.

> The hypervisor appears to clear the guest's EFLAGS.AC on context switch
> to a guest and failsafe bounce but not in e.g. do_iret so it's not
> entirely clear what his policy is...

do_iret() isn't increasing privilege, and hence restoring whatever
the outer context of iret had in place is correct. The important
thing is that on the transition to kernel mode the flag must always
get cleared (which I think has been the case since the problem
in create_bounce_frame() was fixed).
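
To make that concrete: with CR0.AM set (Linux sets it), ordinary ring-3 code
can switch EFLAGS.AC on by itself via pushf/popf, and every misaligned access
after that raises #AC. A minimal sketch of such a test program (hypothetical,
x86-64 only, expected to die with SIGBUS at the second store):

/* ac_demo.c - show that ring-3 code can set EFLAGS.AC on its own and that a
 * misaligned access then takes an alignment-check fault (SIGBUS on Linux). */
#include <stdio.h>
#include <stdint.h>

/* Keep the pushfq in tiny leaf functions so the transient stack write cannot
 * clobber red-zone locals of the caller. */
static void __attribute__((noinline)) set_ac(void)
{
        __asm__ volatile("pushfq; orq $0x40000, (%%rsp); popfq"
                         ::: "cc", "memory");            /* AC is bit 18 */
}

static void __attribute__((noinline)) clear_ac(void)
{
        __asm__ volatile("pushfq; andq $~0x40000, (%%rsp); popfq"
                         ::: "cc", "memory");
}

int main(void)
{
        char buf[8] __attribute__((aligned(8))) = { 0 };
        volatile uint32_t *p = (volatile uint32_t *)(buf + 1);  /* misaligned */

        *p = 1;                 /* fine: AC is not set yet */
        set_ac();
        *p = 2;                 /* #AC here -> SIGBUS kills the process */
        clear_ac();
        printf("unexpected: survived the misaligned store\n");
        return 0;
}

Since a 64-bit PV guest kernel also runs in ring 3, an AC bit that leaked into
guest-kernel context this way would turn the normally harmless unaligned
stores in eth_header() into exactly the alignment-check oops reported above.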

Jan


Re: Kernel panic with 2.6.32-30 under network activity
It happened again a few minutes ago. It is the same kernel stack each time
(alignment check: 0000 [#1] SMP, etc.).

The dom0 where all the faulty domUs are running is a dual Xeon 5420, so 8 real
cores are available.
20 domUs are running on it, with 35 vCPUs set up; is that too much? The bug
happens randomly across domUs.
I was running the same config with Xen 3.2 without any issue.

I found this old post:
http://lists.xensource.com/archives/html/xen-devel/2010-03/msg01561.html

It may be related: no issue with 2.6.24, but an issue with 2.6.32.


2011/3/16 Jan Beulich <JBeulich@novell.com>

> >>> On 16.03.11 at 11:11, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2011-03-16 at 09:34 +0000, Jan Beulich wrote:
> >> >>> On 16.03.11 at 04:20, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> wrote:
> >> > On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
> >> >> [469390.126691] alignment check: 0000 [#1] SMP
> >> >
> >> > alignment check? Was there anything else in the log before this? Was
> there
> >> > anything in the Dom0 log?
> >>
> >> This together with
> >>
> >> >> [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
> >>
> >> makes me wonder if either eflags got restored from a corrupted
> >> stack slot somewhere, or whether something in the kernel or one
> >> of the modules intentionally played with EFLAGS.AC.
> >
> > Can a PV kernel running in ring-3 change AC?
>
> Yes. We had this problem until we cleared the flag in
> create_bounce_frame().
>
> > The Intel manual says "They should not be modified by application
> > programs" over a list including AC but the list also includes e.g. IOPL
> > and IF so I suspect it meant "can not" rather than "should not"? In
> > which case it can't happen by accident.
>
> No, afaik "should not" is the correct term.
>
> > The hypervisor appears to clear the guest's EFLAGS.AC on context switch
> > to a guest and failsafe bounce but not in e.g. do_iret so it's not
> > entirely clear what his policy is...
>
> do_iret() isn't increasing privilege, and hence restoring whatever
> the outer context of iret had in place is correct. The important
> thing is that on the transition to kernel mode the flag must always
> get cleared (which I think has been the case since the problem
> in create_bounce_frame() was fixed).
>
> Jan
>
>
Re: Kernel panic with 2.6.32-30 under network activity
>>> On 17.03.11 at 11:34, Olivier Hanesse <olivier.hanesse@gmail.com> wrote:
> It happens again a few minutes ago. It is the same kernel stack each time
> (alignment check: 0000 [#1] SMP etc ...)
>
> The dom0 where all the faulty domU are running is a dual Xeon 5420 so 8 real
> cores available.
> 20 domUs are running on it, 35 vcpus are set up, is that too much ? The bug
> happens randomly on domUs
> I was running the same config with xen3.2 without any issue.

Are we to read this as "same kernels in DomU-s and Dom0"? If so,
that would hint at some subtle Xen regression. If not, you'd need
to be more precise as to what works and what doesn't, and would
possibly want to try intermediate versions to narrow when this
got introduced.

> I found this old post :
> http://lists.xensource.com/archives/html/xen-devel/2010-03/msg01561.html
>
> It may be related, no issue with 2.6.24, and issue with 2.6.32.

Yes, that indeed looks very similar. Nevertheless, without this
being generally reproducible, we'll have to rely on you doing
some analysis/debugging work on this.

Jan


Re: Kernel panic with 2.6.32-30 under network activity
2011/3/17 Jan Beulich <JBeulich@novell.com>

> >>> On 17.03.11 at 11:34, Olivier Hanesse <olivier.hanesse@gmail.com>
> wrote:
> > It happens again a few minutes ago. It is the same kernel stack each time
> > (alignment check: 0000 [#1] SMP etc ...)
> >
> > The dom0 where all the faulty domU are running is a dual Xeon 5420 so 8
> real
> > cores available.
> > 20 domUs are running on it, 35 vcpus are set up, is that too much ? The
> bug
> > happens randomly on domUs
> > I was running the same config with xen3.2 without any issue.
>
> Are we to read this as "same kernels in DomU-s and Dom0"? If so,
> that would hint at some subtle Xen regression. If not, you'd need
> to be more precise as to what works and what doesn't, and would
> possibly want to try intermediate versions to narrow when this
> got introduced.
>
>
Dom0 and DomU are using different kernels (both coming from the Debian
repository, but the same version):

domU:
ii  linux-image-2.6.32-bpo.5-amd64      2.6.32-30~bpo50+1  Linux 2.6.32 for 64-bit PCs

dom0:
ii  linux-image-2.6.32-bpo.5-xen-amd64  2.6.32-30~bpo50+1  Linux 2.6.32 for 64-bit PCs, Xen dom0 suppor

I was running Debian Lenny's kernel with Xen 3.2, so it was 2.6.26.


> > I found this old post :
> > http://lists.xensource.com/archives/html/xen-devel/2010-03/msg01561.html
> >
> > It may be related, no issue with 2.6.24, and issue with 2.6.32.
>
> Yes, that indeed looks very similar. Nevertheless, without this
> being generally reproducible, we'll have to rely on you doing
> some analysis/debugging work on this.
>

I "pinned" all domUs cpus in order that they don't use the same cpu as dom0
(which is pinned to cpu0).
I can run any analysis/debugging tools you want.
I will also try an older kernel (for example 2.6.32-10) and see what
happens.



>
> Jan
>
>
Regards

Olivier