Mailing List Archive

blktap bug: kernel oops when tapdisk process is terminated
Hello,

I've been trying to get blktap to work on kernel 3.17.3 under Debian wheezy, and came across what I think is a bug in the blktap module.
If a tapdisk process is killed while the block device it backs is being accessed, I get a kernel oops. It usually takes 2-3 attempts to trigger.

My setup:

Kernel version: 3.17.3
blktap-utils version: 2.0.90-1
xcp-xapi version: 1:1.3.2-15ubuntu4.18

The blktap kernel driver from the xapi-project GitHub page (https://github.com/xapi-project/blktap-dkms) failed to compile,
so I used Debian's 2.0.93 version, modified so that the module compiles. The diff against 2.0.93 is attached (blktap.patch).

All testing is performed outside of Xen, i.e. all commands are run in Dom0. I set up a VHD-backed block device
with rate limiting using td-rated, then run dd in the background and kill tapdisk while dd is running.

Preparation:
$ modprobe blktap
$ mkdir /var/run/blktap
$ vhd-util create -s 1024 -n /root/test.vhd
$ td-rated /var/run/blktap/x.sock -t token -- --rate=5M
$ cat > /var/tmp/limit.chain << EOF
valve:/var/run/blktap/x.sock
vhd:/root/test.vhd
EOF

These commands trigger the oops after a few cycles:

$ tap-ctl create -a x-chain:/var/tmp/limit.chain
/dev/xen/blktap-2/tapdevX
$ tap-ctl list # retrieve tapdisk PID
$ dd if=/dev/urandom of=/dev/xen/blktap-2/tapdevX bs=1M count=200 &
$ kill -9 tapdisk_PID # check dmesg, try again if bug didn't occur
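
For convenience, the cycle above could be wrapped in a small script. This is only a rough sketch, not something from the attached patch. It assumes that `tap-ctl list` prints a `pid=<number>` field for each tapdisk and that only one tapdisk is running; the function names (pid_from_list, oops_in_log, repro_once, repro_loop) and the sleep delays are made up for illustration.

```shell
#!/bin/sh
# Sketch only: automates the create/dd/kill cycle from the commands above.
# Assumption: `tap-ctl list` prints one line per tapdisk with a
# "pid=<number>" field. Adjust pid_from_list if your output differs.

# Read `tap-ctl list` output on stdin, print the first pid= value found.
pid_from_list() {
    sed -n 's/.*pid=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read a kernel log on stdin, succeed if it contains the oops message.
oops_in_log() {
    grep -q 'unable to handle kernel NULL pointer dereference'
}

# One attempt: create the tap device, start dd, kill tapdisk mid-write.
repro_once() {
    dev=$(tap-ctl create -a x-chain:/var/tmp/limit.chain) || return 1
    pid=$(tap-ctl list | pid_from_list)
    [ -n "$pid" ] || return 1
    dd if=/dev/urandom of="$dev" bs=1M count=200 &
    sleep 1            # hypothetical delay so dd has I/O in flight
    kill -9 "$pid"
}

# Repeat up to $1 times (default 5), stopping once dmesg shows the oops.
# Note: an oops left in dmesg from an earlier run also counts as a hit.
repro_loop() {
    n=${1:-5}
    i=1
    while [ "$i" -le "$n" ]; do
        repro_once || return 1
        sleep 2        # hypothetical delay to let the kernel log catch up
        if dmesg | oops_in_log; then
            echo "oops seen after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

The block only defines functions. After the preparation steps, sourcing it as root and calling `repro_loop 5` would run up to five attempts.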

The relevant part of dmesg is pasted below; the blktap-dkms patch is attached.

I'll gladly provide more information if need be. Also, if xcp-xapi is not a good place for this report, kindly point me to a more suitable list.

Thanks,
Krzysztof Godlewski

[Tue Jan 13 15:36:10 2015] block tda: sector-size: 512/512+0 capacity: 2097152 discard: 0+0 flush: 0x0
[Tue Jan 13 15:36:31 2015] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[Tue Jan 13 15:36:31 2015] IP: [<ffffffff81784593>] down_write+0x23/0x50
[Tue Jan 13 15:36:31 2015] PGD 172bf067 PUD 1728b067 PMD 0
[Tue Jan 13 15:36:31 2015] Oops: 0002 [#1] SMP
[Tue Jan 13 15:36:31 2015] Modules linked in: blktap(OE) xen_blkback(E) xen_netback(E) openvswitch(E) gre(E) vxlan(E) udp_tunnel(E) libcrc32c(E) xen_gntdev(E) xen_evtchn(E) xenfs(E) xen_privcmd(E) iscsi_tcp(E) libiscsi_tcp(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) fscache(E) sunrpc(E) ib_umad(E) ib_iser(E) rdma_cm(E) iw_cm(E) libiscsi(E) scsi_transport_iscsi(E) ib_ipoib(E) ib_cm(E) ib_sa(E) snd_pcm(E) snd_timer(E) snd(E) soundcore(E) iTCO_wdt(E) i7core_edac(E) ipmi_si(E) ib_mthca(E) ast(E) ttm(E) ib_mad(E) ipmi_msghandler(E) drm_kms_helper(E) ib_core(E) drm(E) edac_core(E) ioatdma(E) ib_addr(E) iTCO_vendor_support(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) pcspkr(E) evbug(E) lpc_ich(E) dcdbas(E) joydev(E) i2c_i801(E) mac_hid(E) hid_generic(E) usbkbd(E) usbmouse(E) usbhid(E) hid(E) ahci(E) libahci(E) igb(E) i2c_algo_bit(E) dca(E) ptp(E) pps_core(E)
[Tue Jan 13 15:36:31 2015] CPU: 4 PID: 9800 Comm: tapdisk Tainted: G OE 3.17.3 #1
[Tue Jan 13 15:36:31 2015] Hardware name: Dell XS23-TY3 / , BIOS 1.71 09/17/2013
[Tue Jan 13 15:36:31 2015] task: ffff8802e440da00 ti: ffff8800170ac000 task.ti: ffff8800170ac000
[Tue Jan 13 15:36:31 2015] RIP: e030:[<ffffffff81784593>] [<ffffffff81784593>] down_write+0x23/0x50
[Tue Jan 13 15:36:31 2015] RSP: e02b:ffff8800170afaf8 EFLAGS: 00010246
[Tue Jan 13 15:36:31 2015] RAX: 0000000000000060 RBX: 0000000000000060 RCX: 0000000000000001
[Tue Jan 13 15:36:31 2015] RDX: ffffffff00000001 RSI: 000000000000b000 RDI: 0000000000000060
[Tue Jan 13 15:36:31 2015] RBP: ffff8800170afb08 R08: ffff8800170bfcc0 R09: 00000001802c001c
[Tue Jan 13 15:36:31 2015] R10: ffffea00005c2f80 R11: 0000000000000001 R12: 0000000000000000
[Tue Jan 13 15:36:31 2015] R13: 00007f1882ef2000 R14: 000000000000b000 R15: ffff88001731f400
[Tue Jan 13 15:36:31 2015] FS: 00007f1883065740(0000) GS:ffff880032c80000(0000) knlGS:0000000000000000
[Tue Jan 13 15:36:31 2015] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[Tue Jan 13 15:36:31 2015] CR2: 0000000000000060 CR3: 0000000012522000 CR4: 0000000000002660
[Tue Jan 13 15:36:31 2015] Stack:
[Tue Jan 13 15:36:31 2015] ffff8800170afba8 0000000000000060 ffff8800170afb38 ffffffff811a27a0
[Tue Jan 13 15:36:31 2015] ffff8802be6052c0 ffff88001731f400 ffff88001731f400 00000000fffffffb
[Tue Jan 13 15:36:31 2015] ffff8800170afb58 ffffffffc062ff88 ffff8802e2cc9cc0 ffff8802be6052c0
[Tue Jan 13 15:36:31 2015] Call Trace:
[Tue Jan 13 15:36:31 2015] [<ffffffff811a27a0>] vm_munmap+0x40/0x70
[Tue Jan 13 15:36:31 2015] [<ffffffffc062ff88>] blktap_ring_unmap_request+0x48/0x90 [blktap]
[Tue Jan 13 15:36:31 2015] [<ffffffffc06306bf>] blktap_device_end_request+0x2f/0xe0 [blktap]
[Tue Jan 13 15:36:31 2015] [<ffffffffc062fe05>] blktap_ring_vm_close+0xb5/0x140 [blktap]
[Tue Jan 13 15:36:31 2015] [<ffffffff811a05b2>] remove_vma+0x32/0x70
[Tue Jan 13 15:36:31 2015] [<ffffffff811a38f4>] exit_mmap+0xf4/0x170
[Tue Jan 13 15:36:31 2015] [<ffffffff8122c29a>] ? exit_aio+0xca/0xe0
[Tue Jan 13 15:36:31 2015] [<ffffffff8106fa48>] mmput+0x68/0x120
[Tue Jan 13 15:36:31 2015] [<ffffffff81074f0c>] do_exit+0x27c/0xa70
[Tue Jan 13 15:36:31 2015] [<ffffffff8107578f>] do_group_exit+0x3f/0xa0
[Tue Jan 13 15:36:31 2015] [<ffffffff81081682>] get_signal+0x1d2/0x720
[Tue Jan 13 15:36:31 2015] [<ffffffff810134c3>] do_signal+0x33/0xac0
[Tue Jan 13 15:36:31 2015] [<ffffffff81662658>] ? SYSC_sendto+0x128/0x180
[Tue Jan 13 15:36:31 2015] [<ffffffff810df5ce>] ? ktime_get_ts64+0x4e/0xf0
[Tue Jan 13 15:36:31 2015] [<ffffffff811f60fc>] ? poll_select_copy_remaining+0xec/0x140
[Tue Jan 13 15:36:31 2015] [<ffffffff81013fc1>] do_notify_resume+0x71/0xc0
[Tue Jan 13 15:36:31 2015] [<ffffffff8178686a>] int_signal+0x12/0x17
[Tue Jan 13 15:36:31 2015] Code: 00 00 00 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 e8 8a dd ff ff 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 af 9b c1 ff 65 48 8b 04 25 00 c8
[Tue Jan 13 15:36:31 2015] RIP [<ffffffff81784593>] down_write+0x23/0x50
[Tue Jan 13 15:36:31 2015] RSP <ffff8800170afaf8>
[Tue Jan 13 15:36:31 2015] CR2: 0000000000000060
[Tue Jan 13 15:36:31 2015] ---[ end trace 56d4ea1ff18354f1 ]---
[Tue Jan 13 15:36:31 2015] Fixing recursive fault but reboot is needed!