Mailing List Archive

[syzbot] INFO: rcu detected stall in tx
Hello,

syzbot found the following issue on:

HEAD commit: 50987bec Merge tag 'trace-v5.12-rc7' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1065c5fcd00000
kernel config: https://syzkaller.appspot.com/x/.config?x=398c4d0fe6f66e68
dashboard link: https://syzkaller.appspot.com/bug?extid=e2eae5639e7203360018

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+e2eae5639e7203360018@syzkaller.appspotmail.com

usbtmc 5-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 5-1:0.0: unknown status received: -71
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 1-...!: (8580 ticks this GP) idle=72e/1/0x4000000000000000 softirq=20679/20679 fqs=0
(t=10500 jiffies g=27129 q=416)
rcu: rcu_preempt kthread starved for 10500 jiffies! g27129 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:29168 pid: 14 ppid: 2 flags:0x00004000
Call Trace:
context_switch kernel/sched/core.c:4322 [inline]
__schedule+0x911/0x21b0 kernel/sched/core.c:5073
schedule+0xcf/0x270 kernel/sched/core.c:5152
schedule_timeout+0x14a/0x250 kernel/time/timer.c:1892
rcu_gp_fqs_loop kernel/rcu/tree.c:2005 [inline]
rcu_gp_kthread+0xd07/0x2250 kernel/rcu/tree.c:2178
kthread+0x3b1/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
rcu: Stack dump where RCU GP kthread last ran:
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 3232 Comm: aoe_tx0 Not tainted 5.12.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:native_apic_mem_write+0x8/0x10 arch/x86/include/asm/apic.h:110
Code: c7 40 d9 36 8f e8 c8 11 86 00 eb b0 66 0f 1f 44 00 00 be 01 00 00 00 e9 36 c7 2c 00 cc cc cc cc cc cc 89 ff 89 b7 00 c0 5f ff <c3> 0f 1f 80 00 00 00 00 48 b8 00 00 00 00 00 fc ff df 53 89 fb 48
RSP: 0018:ffffc90000007ea8 EFLAGS: 00000046
RAX: dffffc0000000000 RBX: ffffffff8b0a78c0 RCX: 0000000000000020
RDX: 1ffffffff1614f1a RSI: 000000000001c285 RDI: 0000000000000380
RBP: ffff8880b9c1f2c0 R08: 000000000000003f R09: 0000000000000000
R10: ffffffff8166ecf7 R11: 0000000000000000 R12: 000000000001c285
R13: 0000000000000020 R14: ffff8880b9c26340 R15: 0000006120792e26
FS: 0000000000000000(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb9e6cdb380 CR3: 0000000018792000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
apic_write arch/x86/include/asm/apic.h:393 [inline]
lapic_next_event+0x4d/0x80 arch/x86/kernel/apic/apic.c:472
clockevents_program_event+0x254/0x370 kernel/time/clockevents.c:334
tick_program_event+0xac/0x140 kernel/time/tick-oneshot.c:44
hrtimer_interrupt+0x414/0xa00 kernel/time/hrtimer.c:1676
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1089 [inline]
__sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1106
sysvec_apic_timer_interrupt+0x8e/0xc0 arch/x86/kernel/apic/apic.c:1100
</IRQ>
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
RIP: 0010:preempt_count arch/x86/include/asm/preempt.h:27 [inline]
RIP: 0010:check_kcov_mode kernel/kcov.c:163 [inline]
RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x60 kernel/kcov.c:197
Code: f0 4d 89 03 e9 f2 fc ff ff b9 ff ff ff ff ba 08 00 00 00 4d 8b 03 48 0f bd ca 49 8b 45 00 48 63 c9 e9 64 ff ff ff 0f 1f 40 00 <65> 8b 05 39 fe 8d 7e 89 c1 48 8b 34 24 81 e1 00 01 00 00 65 48 8b
RSP: 0018:ffffc900030cf6f8 EFLAGS: 00000293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88801aff1c40 RSI: ffffffff815c2e4f RDI: 0000000000000003
RBP: ffffc900030cf738 R08: 0000000000000000 R09: ffffffff8fa9a96f
R10: ffffffff815c2e45 R11: 0000000000000000 R12: 000000000000002d
R13: ffff8880113db880 R14: 0000000000000000 R15: 0000000000000200
console_trylock_spinning kernel/printk/printk.c:1818 [inline]
vprintk_emit+0x3a5/0x560 kernel/printk/printk.c:2097
dev_vprintk_emit+0x36e/0x3b2 drivers/base/core.c:4434
dev_printk_emit+0xba/0xf1 drivers/base/core.c:4445
__netdev_printk+0x1c6/0x27a net/core/dev.c:11292
netdev_warn+0xd7/0x109 net/core/dev.c:11345
ieee802154_subif_start_xmit.cold+0x17/0x27 net/mac802154/tx.c:125
__netdev_start_xmit include/linux/netdevice.h:4825 [inline]
netdev_start_xmit include/linux/netdevice.h:4839 [inline]
xmit_one net/core/dev.c:3605 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3621
sch_direct_xmit+0x2e1/0xbd0 net/sched/sch_generic.c:313
qdisc_restart net/sched/sch_generic.c:376 [inline]
__qdisc_run+0x4ba/0x15f0 net/sched/sch_generic.c:384
qdisc_run include/net/pkt_sched.h:136 [inline]
qdisc_run include/net/pkt_sched.h:128 [inline]
__dev_xmit_skb net/core/dev.c:3807 [inline]
__dev_queue_xmit+0x14b9/0x2e00 net/core/dev.c:4162
tx+0x68/0xb0 drivers/block/aoe/aoenet.c:63
kthread+0x1e7/0x3a0 drivers/block/aoe/aoecmd.c:1230
kthread+0x3b1/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
NMI backtrace for cpu 1
CPU: 1 PID: 37 Comm: kworker/1:1 Not tainted 5.12.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events nsim_dev_trap_report_work
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
nmi_cpu_backtrace.cold+0x44/0xd7 lib/nmi_backtrace.c:105
nmi_trigger_cpumask_backtrace+0x1b3/0x230 lib/nmi_backtrace.c:62
trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
rcu_dump_cpu_stacks+0x222/0x2a7 kernel/rcu/tree_stall.h:341
print_cpu_stall kernel/rcu/tree_stall.h:622 [inline]
check_cpu_stall kernel/rcu/tree_stall.h:697 [inline]
rcu_pending kernel/rcu/tree.c:3830 [inline]
rcu_sched_clock_irq.cold+0x4f7/0x11dd kernel/rcu/tree.c:2650
update_process_times+0x16d/0x200 kernel/time/timer.c:1796
tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:226
tick_sched_timer+0x1b0/0x2d0 kernel/time/tick-sched.c:1369
__run_hrtimer kernel/time/hrtimer.c:1537 [inline]
__hrtimer_run_queues+0x1c0/0xe40 kernel/time/hrtimer.c:1601
hrtimer_interrupt+0x330/0xa00 kernel/time/hrtimer.c:1663
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1089 [inline]
__sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1106
sysvec_apic_timer_interrupt+0x40/0xc0 arch/x86/kernel/apic/apic.c:1100
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:161 [inline]
RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:191
Code: 74 24 10 e8 ba 19 54 f8 48 89 ef e8 f2 cf 54 f8 81 e3 00 02 00 00 75 25 9c 58 f6 c4 02 75 2d 48 85 db 74 01 fb bf 01 00 00 00 <e8> d3 9d 48 f8 65 8b 05 7c 68 fc 76 85 c0 74 0a 5b 5d c3 e8 40 59
RSP: 0018:ffffc90000dc0b28 EFLAGS: 00000206
RAX: 0000000000000002 RBX: 0000000000000200 RCX: 1ffffffff1f5f34a
RDX: 0000000000000000 RSI: 0000000000000103 RDI: 0000000000000001
RBP: ffff888144fa8000 R08: 0000000000000001 R09: ffffffff8fa9a99f
R10: 0000000000000001 R11: ffffc90013880000 R12: ffff888145047440
R13: ffff88801ee8e500 R14: dffffc0000000000 R15: ffff888011f69c00
spin_unlock_irqrestore include/linux/spinlock.h:409 [inline]
dummy_timer+0x12f1/0x32a0 drivers/usb/gadget/udc/dummy_hcd.c:1985
call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1431
expire_timers kernel/time/timer.c:1476 [inline]
__run_timers.part.0+0x67c/0xa50 kernel/time/timer.c:1745
__run_timers kernel/time/timer.c:1726 [inline]
run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1758
__do_softirq+0x29b/0x9f6 kernel/softirq.c:345
do_softirq.part.0+0xd9/0x130 kernel/softirq.c:248
</IRQ>
do_softirq kernel/softirq.c:240 [inline]
__local_bh_enable_ip+0x102/0x120 kernel/softirq.c:198
spin_unlock_bh include/linux/spinlock.h:399 [inline]
nsim_dev_trap_report drivers/net/netdevsim/dev.c:585 [inline]
nsim_dev_trap_report_work+0x867/0xbd0 drivers/net/netdevsim/dev.c:611
process_one_work+0x98d/0x1600 kernel/workqueue.c:2275
worker_thread+0x64c/0x1120 kernel/workqueue.c:2421
kthread+0x3b1/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 5-1:0.0: unknown status received: -71
usbtmc 5-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 5-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 5-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 2-1:0.0: unknown status received: -71
usbtmc 4-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: unknown status received: -71
usbtmc 3-1:0.0: usb_submit_urb failed: -19
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: unknown status received: -71
usbtmc 6-1:0.0: usb_submit_urb failed: -19


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
Re: [syzbot] INFO: rcu detected stall in tx
On Mon, Apr 19, 2021 at 9:19 AM syzbot
<syzbot+e2eae5639e7203360018@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 50987bec Merge tag 'trace-v5.12-rc7' of git://git.kernel.o..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1065c5fcd00000
> kernel config: https://syzkaller.appspot.com/x/.config?x=398c4d0fe6f66e68
> dashboard link: https://syzkaller.appspot.com/bug?extid=e2eae5639e7203360018
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+e2eae5639e7203360018@syzkaller.appspotmail.com
>
> usbtmc 5-1:0.0: unknown status received: -71
> usbtmc 3-1:0.0: unknown status received: -71
> usbtmc 5-1:0.0: unknown status received: -71

The log shows an infinite stream of these before the stall, so I
assume it's an infinite loop in usbtmc.
+usbtmc maintainers

[ 370.171634][ C0] usbtmc 6-1:0.0: unknown status received: -71
[ 370.177799][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.183912][ C0] usbtmc 4-1:0.0: unknown status received: -71
[ 370.190076][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.196194][ C0] usbtmc 2-1:0.0: unknown status received: -71
[ 370.202387][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.208460][ C0] usbtmc 6-1:0.0: unknown status received: -71
[ 370.214615][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.220736][ C0] usbtmc 4-1:0.0: unknown status received: -71
[ 370.226902][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.233005][ C0] usbtmc 2-1:0.0: unknown status received: -71
[ 370.239168][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.245271][ C0] usbtmc 6-1:0.0: unknown status received: -71
[ 370.251426][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.257552][ C0] usbtmc 4-1:0.0: unknown status received: -71
[ 370.263715][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.269819][ C0] usbtmc 2-1:0.0: unknown status received: -71
[ 370.275974][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.282100][ C0] usbtmc 6-1:0.0: unknown status received: -71
[ 370.288262][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.294399][ C0] usbtmc 4-1:0.0: unknown status received: -71
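For readers decoding the numbers in these messages: the kernel reports URB status as a negated errno value, so with Linux errno numbering -71 corresponds to EPROTO, and the -19 in the report's "usb_submit_urb failed" lines corresponds to ENODEV. A minimal userspace sketch of that mapping follows. The helper name urb_status_name is hypothetical and is not part of the usbtmc driver.

```c
#include <errno.h>

/*
 * Hypothetical helper (not kernel code): translate the negated errno
 * values printed in the log into symbolic names. The numeric values
 * assume Linux errno numbering.
 */
static const char *urb_status_name(int status)
{
	switch (-status) {
	case 0:
		return "OK";
	case EPROTO:
		return "EPROTO";	/* printed as -71 in the log */
	case ENODEV:
		return "ENODEV";	/* printed as -19 in the log */
	default:
		return "other";
	}
}
```

Calling urb_status_name(-71) on a Linux system returns the string for EPROTO, which matches the status code named later in this thread.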



RE: Re: [syzbot] INFO: rcu detected stall in tx
Hi all,

The error has been in usbtmc_interrupt(struct urb *urb) for five years: the status code EPROTO is not handled correctly.
It's not a showstopper, but we should fix it and check the status code the way usbtmc_read_bulk_cb() or
usb_skeleton.c does.
@Dave: Do you have time? Otherwise I can do it.
@Greg: Is it urgent?

- Guido
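
As an illustration of the kind of status check discussed in the message above, here is a userspace model of a completion handler's decision. It is a sketch under stated assumptions, not the driver's code and not the fix that was eventually merged: the names urb_action, interrupt_urb_action and completions_until_stop are hypothetical, the set of terminal codes is chosen for illustration, and the default branch models what the log suggests the handler did (report the status, then resubmit).

```c
#include <errno.h>

/* Userspace model only; none of these names exist in the kernel. */
enum urb_action { URB_RESUBMIT, URB_STOP };

/* Decide what a completion handler should do for a given URB status. */
static enum urb_action interrupt_urb_action(int status)
{
	switch (status) {
	case 0:
		return URB_RESUBMIT;	/* success: keep polling */
	case -ENOENT:
	case -ECONNRESET:
	case -ESHUTDOWN:
		return URB_STOP;	/* URB unlinked or device going away */
	case -EPROTO:
		return URB_STOP;	/* assumption: treat as terminal */
	default:
		return URB_RESUBMIT;	/* modelled old behaviour: log, retry */
	}
}

/*
 * Count how many times the handler would run if every completion
 * reported the same status. The cap stands in for "forever".
 */
static int completions_until_stop(int status, int cap)
{
	int n = 0;

	while (n < cap) {
		n++;
		if (interrupt_urb_action(status) == URB_STOP)
			break;
	}
	return n;
}
```

In this model a persistent -EPROTO stops after one completion, while any status that falls through to the default branch runs until the cap, which is the pattern the stream of "unknown status received" messages points to.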

-----Original Message-----
From: Dmitry
Sent: Monday, April 19, 2021 9:27 AM
Subject: Re: [syzbot] INFO: rcu detected stall in tx

On Mon, Apr 19, 2021 at 9:19 AM syzbot
<syzbot+e2eae5639e7203360018@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 50987bec Merge tag 'trace-v5.12-rc7' of git://git.kernel.o..
> git tree: upstream
> console output:
> https://syzkaller.appspot.com/x/log.txt?x=1065c5fcd00000
> kernel config:
> https://syzkaller.appspot.com/x/.config?x=398c4d0fe6f66e68
> dashboard link:
> https://syzkaller.appspot.com/bug?extid=e2eae5639e7203360018
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+e2eae5639e7203360018@syzkaller.appspotmail.com
>
> usbtmc 5-1:0.0: unknown status received: -71
> usbtmc 3-1:0.0: unknown status received: -71
> usbtmc 5-1:0.0: unknown status received: -71

The log shows an infinite stream of these before the stall, so I assume it's an infinite loop in usbtmc.
+usbtmc maintainers

[ 370.171634][ C0] usbtmc 6-1:0.0: unknown status received: -71
[ 370.177799][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.183912][ C0] usbtmc 4-1:0.0: unknown status received: -71
[ 370.190076][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.196194][ C0] usbtmc 2-1:0.0: unknown status received: -71
[ 370.202387][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.208460][ C0] usbtmc 6-1:0.0: unknown status received: -71
[ 370.214615][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.220736][ C0] usbtmc 4-1:0.0: unknown status received: -71
[ 370.226902][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.233005][ C0] usbtmc 2-1:0.0: unknown status received: -71
[ 370.239168][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.245271][ C0] usbtmc 6-1:0.0: unknown status received: -71
[ 370.251426][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.257552][ C0] usbtmc 4-1:0.0: unknown status received: -71
[ 370.263715][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.269819][ C0] usbtmc 2-1:0.0: unknown status received: -71
[ 370.275974][ C1] usbtmc 3-1:0.0: unknown status received: -71
[ 370.282100][ C0] usbtmc 6-1:0.0: unknown status received: -71
[ 370.288262][ C1] usbtmc 5-1:0.0: unknown status received: -71
[ 370.294399][ C0] usbtmc 4-1:0.0: unknown status received: -71



Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Mon, Apr 19, 2021 at 08:56:19PM +0000, Guido Kiener wrote:
> Hi all,
>
> The error has been in usbtmc_interrupt(struct urb *urb) for five years. The status code EPROTO is not handled correctly.
> It's not a showstopper, but we should fix it and check the status code the same way usbtmc_read_bulk_cb() or
> usb_skeleton.c does.
> @Dave: Do you have time? Otherwise I can do it.
> @Greg: Is it urgent?

No idea, but patches for known problems are always good to get completed
as soon as possible :)

thanks,

greg k-h
RE: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
Hi all,

Dave and I discussed the "self-detected stall on CPU" caused by the usbtmc driver.

What happened?
The callback handler usbtmc_interrupt(struct urb *urb) for the INT pipe receives an erroneous urb with status -EPROTO (-71).
See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/class/usbtmc.c?h=v5.12#n2340
-EPROTO does not abort or shut down the pipe, so the urb is resubmitted to receive the next packet. However, the callback handler usbtmc_interrupt is then called again with the same erroneous status -EPROTO, and this seems to result in an endless loop.
According to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/driver-api/usb/error-codes.rst?h=v5.12#n177
the error -EPROTO indicates a hardware problem or a bad cable.
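
The loop described above can be modelled in user space. The following is only a minimal sketch, not the kernel code: legacy_handler_resubmits() and count_callbacks() are hypothetical names, the list of terminal statuses is an assumption that only approximates the v5.12 handler, the errno values are the Linux ones, and the cap stands in for a device that never recovers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of an interrupt-URB completion handler that knows a
 * fixed list of terminal statuses and resubmits on everything else.
 * Returns true when the handler would resubmit the URB.
 */
static bool legacy_handler_resubmits(int status)
{
	switch (status) {
	case -ECONNRESET:
	case -ENOENT:
	case -ESHUTDOWN:
	case -EILSEQ:
	case -ETIME:
	case -EPIPE:
		return false;	/* URB terminated, clean up */
	default:
		return true;	/* success or "unknown status": resubmit */
	}
}

/*
 * Count how often the handler runs when every completion carries the same
 * status. The cap models "forever" so that the sketch terminates.
 */
static unsigned int count_callbacks(int status, unsigned int cap)
{
	unsigned int calls = 0;
	bool resubmitted = true;

	assert(cap > 0);
	while (resubmitted && calls < cap) {
		calls++;
		resubmitted = legacy_handler_resubmits(status);
	}
	return calls;
}
```

With a status of -EPROTO, count_callbacks() stops only at the cap, which mirrors the stream of "unknown status received: -71" messages in the log. With -EPIPE it stops after one call.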

Most USB drivers do not react in any specific way to these hardware problems and simply resubmit the urb. We assume these drivers will run into the same endless loop. Some other driver samples are:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/class/cdc-acm.c?h=v5.12#n379
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hid/usbhid/usbmouse.c?h=v5.12#n65

Possible solutions:
Hardware defects or bad cables seem to be a common problem for most USB drivers, and I assume we do not want to fix this problem in all class-specific drivers, but in lower-level host drivers, e.g.:
1. Use a counter and close the pipe after a number of detected errors
2. Delay the resubmission of the urb to avoid high cpu usage
3. Do nothing, since it is just a rare problem.
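
Options 1 and 2 could be combined roughly as follows. This is a sketch under stated assumptions, not a proposal for a concrete kernel interface: struct pipe_error_state, next_resubmit_delay_ms() and both limits are hypothetical, and the doubling delay is just one possible policy.

```c
#include <stdbool.h>

#define MAX_CONSECUTIVE_ERRORS	5	/* assumed limit before the pipe is closed */
#define BASE_DELAY_MS		10	/* assumed first retry delay */

/* Hypothetical per-pipe bookkeeping. */
struct pipe_error_state {
	unsigned int consecutive_errors;
};

/*
 * Decide what to do after a completion.
 * Returns the delay in milliseconds before the URB is resubmitted,
 * or -1 when the pipe should be closed instead.
 */
static int next_resubmit_delay_ms(struct pipe_error_state *st, bool urb_failed)
{
	if (!urb_failed) {
		st->consecutive_errors = 0;	/* a good packet resets the counter */
		return 0;
	}

	st->consecutive_errors++;
	if (st->consecutive_errors >= MAX_CONSECUTIVE_ERRORS)
		return -1;			/* option 1: give up */

	/* option 2: 10, 20, 40, 80 ms for errors 1 to 4 */
	return BASE_DELAY_MS << (st->consecutive_errors - 1);
}
```

Even a short delay breaks the tight callback loop, and the counter bounds the total number of retries on a pipe that never recovers.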

We've never seen this problem in our products and we do not dare to change anything.

- Guido

Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Mon, May 03, 2021 at 09:56:05PM +0000, Guido Kiener wrote:
> Hi all,
>
> Dave and I discussed the "self-detected stall on CPU" caused by the usbtmc driver.
>
> What happened?
> The callback handler usbtmc_interrupt(struct urb *urb) for the INT pipe receives an erroneous urb with status -EPROTO (-71).
> See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/class/usbtmc.c?h=v5.12#n2340
> -EPROTO does not abort or shut down the pipe, so the urb is resubmitted to receive the next packet. However, the callback handler usbtmc_interrupt is then called again with the same erroneous status -EPROTO, and this seems to result in an endless loop.
> According to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/driver-api/usb/error-codes.rst?h=v5.12#n177
> the error -EPROTO indicates a hardware problem or a bad cable.
>
> Most USB drivers do not react in any specific way to these hardware problems and simply resubmit the urb. We assume these drivers will run into the same endless loop. Some other driver samples are:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/class/cdc-acm.c?h=v5.12#n379
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hid/usbhid/usbmouse.c?h=v5.12#n65
>
> Possible solutions:
> Hardware defects or bad cables seem to be a common problem for most USB drivers, and I assume we do not want to fix this problem in all class-specific drivers, but in lower-level host drivers, e.g.:
> 1. Use a counter and close the pipe after a number of detected errors
> 2. Delay the resubmission of the urb to avoid high cpu usage
> 3. Do nothing, since it is just a rare problem.
>
> We've never seen this problem in our products and we do not dare to change anything.

Drivers are not consistent in the way they handle these errors, as you
have seen. A few try to take active measures, such as retries with
increasing timeouts. Many drivers just ignore them, which is not a very
good idea.

The general feeling among kernel USB developers is that a -EPROTO,
-EILSEQ, or -ETIME error should be regarded as fatal, much the same as
an unplug event. The driver should avoid resubmitting URBs and just
wait to be unbound from the device.
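
As a rough illustration of that rule, a completion handler's status check could be reduced to the following user-space sketch. enum urb_action and action_for_status() are hypothetical names, the errno values are the Linux ones, and stopping on every unrecognised status is a conservative choice of this sketch rather than something agreed on in this thread.

```c
#include <errno.h>

enum urb_action {
	URB_RESUBMIT,	/* keep the pipe running */
	URB_STOP,	/* do not resubmit; wait to be unbound */
};

/* Map a completion status to the action a class driver would take. */
static enum urb_action action_for_status(int status)
{
	switch (status) {
	case 0:
		return URB_RESUBMIT;	/* data arrived, ask for more */
	case -EPROTO:
	case -EILSEQ:
	case -ETIME:
		return URB_STOP;	/* regarded as fatal, like an unplug */
	case -ECONNRESET:
	case -ENOENT:
	case -ESHUTDOWN:
		return URB_STOP;	/* URB was unlinked or device is going away */
	default:
		return URB_STOP;	/* assumption: unknown errors also stop */
	}
}
```

The point of the sketch is only that no error path leads back to an unconditional resubmit.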

If you would like to audit drivers and fix them up to behave this way,
that would be great.

(FYI, by far the most common causes of these errors are: The user has
unplugged the USB cable, or the device's firmware has crashed. It is
quite rare for the cause to be intermittent, although not entirely
unheard of -- for example, someone once reported errors resulting from
EM or power-line interference caused by flickering fluorescent lights or
something of that sort. It's pretty safe to ignore this possibility.)

Alan Stern
RE: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
> -----Original Message-----
> From: Alan Stern <stern@rowland.harvard.edu>
> Sent: Tuesday, May 4, 2021 5:14 PM
> To: Kiener Guido 14DS1
> Subject: Re: Re: [syzbot] INFO: rcu detected stall in tx
>
> On Mon, May 03, 2021 at 09:56:05PM +0000, Guido Kiener wrote:
> > Hi all,
> >
> > Dave and I discussed the "self-detected stall on CPU" caused by the usbtmc
> driver.
> >
> > What happened?
> > The callback handler usbtmc_interrupt(struct urb *urb) for the INT pipe receives
> an erroneous urb with status -EPROTO (-71).
> > See
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > e/drivers/usb/class/usbtmc.c?h=v5.12#n2340
> > -EPROTO does not abort/shutdown the pipe and the urb is resubmitted to receive
> the next packet. However the callback handler usbtmc_interrupt is called again with
> the same erroneous status -EPROTO and this seems to result in an endless loop.
> > According to
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > e/Documentation/driver-api/usb/error-codes.rst?h=v5.12#n177
> > the error -EPROTO indicates a hardware problem or a bad cable.
> >
> > Most usb drivers do not react in a specific way on this hardware problems and
> resubmit the urb. We assume these drivers will run into the same endless loop.
> Some other driver samples are:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > e/drivers/usb/class/cdc-acm.c?h=v5.12#n379
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > e/drivers/hid/usbhid/usbmouse.c?h=v5.12#n65
> >
> > Possible solutions:
> > Hardware defects or bad cables seems to be a common problem for most usb
> drivers and I assume we do not want to fix this problem in all class specific drivers,
> but in lower level host drivers, e.g:
> > 1. Use a counter and close the pipe after a number of detected errors
> > 2. Delay the resubmission of the urb to avoid high cpu usage
> > 3. Do nothing, since it is just a rare problem.
> >
> > We've never seen this problem in our products and we do not dare to change
> anything.
>
> Drivers are not consistent in the way they handle these errors, as you have seen. A
> few try to take active measures, such as retries with increasing timeouts. Many
> drivers just ignore them, which is not a very good idea.
>
> The general feeling among kernel USB developers is that a -EPROTO, -EILSEQ, or
> -ETIME error should be regarded as fatal, much the same as an unplug event. The
> driver should avoid resubmitting URBs and just wait to be unbound from the device.

Thanks for your assessment. I agree with the general feeling. I counted about a hundred
specific USB drivers, so wouldn't it be better to fix the problem in some of the host drivers (e.g. urb.c)?
We could return an error when calling usb_submit_urb() on an erroneous pipe.
I cannot estimate the side effects, and we would need to check again how all drivers deal with the
error situation. Maybe there are some special drivers that need specialized error handling.
In this case these drivers could reset the (new?) error flag to allow calling usb_submit_urb()
again without error. This could work, couldn't it?
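
To make the idea concrete, here is a small user-space model of such a flag. Nothing in it exists in the kernel: struct model_endpoint, model_complete_urb(), model_submit_urb() and model_clear_error() are hypothetical, and the choice of -EPROTO as the error returned on submission is only an assumption of this sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-endpoint state kept by the core. */
struct model_endpoint {
	bool error_latched;	/* set after a fatal completion status */
};

/* Called by the core when an URB completes on this endpoint. */
static void model_complete_urb(struct model_endpoint *ep, int status)
{
	if (status == -EPROTO || status == -EILSEQ || status == -ETIME)
		ep->error_latched = true;
}

/* Submission is refused while the error flag is set. */
static int model_submit_urb(struct model_endpoint *ep)
{
	if (ep->error_latched)
		return -EPROTO;	/* assumed error code for a refused submit */
	return 0;
}

/* Drivers with specialized error handling could opt back in. */
static void model_clear_error(struct model_endpoint *ep)
{
	ep->error_latched = false;
}
```

A driver that blindly resubmits would then see the submit fail once and stop, while a driver with its own recovery could call the clear function before retrying.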

> If you would like to audit drivers and fix them up to behave this way, that would be
> great.

Currently not. I cannot pull the USB cable in home office :-), but I will keep an eye on it.
When I'm more involved in the next USB driver issue, then I will test bad cables and
maybe get more ideas about how we could test and fix this rare error.

> (FYI, by far the most common causes of these errors are: The user has unplugged
> the USB cable, or the device's firmware has crashed. It is quite rare for the cause to
> be intermittent, although not entirely unheard of -- for example, someone once
> reported errors resulting from EM or power-line interference caused by flickering
> fluorescent lights or something of that sort. It's pretty safe to ignore this possibility.)

I fear I may not use the 75 kW TV transmitter to interfere with the USB cable :-)

-Guido
Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Wed, May 05, 2021 at 10:22:24PM +0000, Guido Kiener wrote:
> > -----Original Message-----
> > From: Alan Stern <stern@rowland.harvard.edu>
> > Sent: Tuesday, May 4, 2021 5:14 PM
> > To: Kiener Guido 14DS1
> > Subject: Re: Re: [syzbot] INFO: rcu detected stall in tx
> >
> > On Mon, May 03, 2021 at 09:56:05PM +0000, Guido Kiener wrote:
> > > Hi all,
> > >
> > > Dave and I discussed the "self-detected stall on CPU" caused by the usbtmc
> > driver.
> > >
> > > What happened?
> > > The callback handler usbtmc_interrupt(struct urb *urb) for the INT pipe receives
> > an erroneous urb with status -EPROTO (-71).
> > > See
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > > e/drivers/usb/class/usbtmc.c?h=v5.12#n2340
> > > -EPROTO does not abort/shutdown the pipe and the urb is resubmitted to receive
> > the next packet. However the callback handler usbtmc_interrupt is called again with
> > the same erroneous status -EPROTO and this seems to result in an endless loop.
> > > According to
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > > e/Documentation/driver-api/usb/error-codes.rst?h=v5.12#n177
> > > the error -EPROTO indicates a hardware problem or a bad cable.
> > >
> > > Most usb drivers do not react in a specific way on this hardware problems and
> > resubmit the urb. We assume these drivers will run into the same endless loop.
> > Some other driver samples are:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > > e/drivers/usb/class/cdc-acm.c?h=v5.12#n379
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > > e/drivers/hid/usbhid/usbmouse.c?h=v5.12#n65
> > >
> > > Possible solutions:
> > > Hardware defects or bad cables seems to be a common problem for most usb
> > drivers and I assume we do not want to fix this problem in all class specific drivers,
> > but in lower level host drivers, e.g:
> > > 1. Use a counter and close the pipe after a number of detected errors
> > > 2. Delay the resubmission of the urb to avoid high cpu usage
> > > 3. Do nothing, since it is just a rare problem.
> > >
> > > We've never seen this problem in our products and we do not dare to change
> > anything.
> >
> > Drivers are not consistent in the way they handle these errors, as you have seen. A
> > few try to take active measures, such as retries with increasing timeouts. Many
> > drivers just ignore them, which is not a very good idea.
> >
> > The general feeling among kernel USB developers is that a -EPROTO, -EILSEQ, or
> > -ETIME error should be regarded as fatal, much the same as an unplug event. The
> > driver should avoid resubmitting URBs and just wait to be unbound from the device.
>
> Thanks for your assessment. I agree with the general feeling. I counted about a hundred
> specific USB drivers, so wouldn't it be better to fix the problem in some of the host drivers (e.g. urb.c)?
> We could return an error when calling usb_submit_urb() on an erroneous pipe.
> I cannot estimate the side effects, and we would need to check again how all drivers deal with the
> error situation. Maybe there are some special drivers that need specialized error handling.
> In this case these drivers could reset the (new?) error flag to allow calling usb_submit_urb()
> again without error. This could work, couldn't it?

That is feasible, although it would be an awkward approach. As you
said, the side effects aren't clear. But it might work.

> > If you would like to audit drivers and fix them up to behave this way, that would be
> > great.
>
> Currently not. I cannot pull the USB cable in home office :-), but I will keep an eye on it.
> When I'm more involved in the next USB driver issue, then I will test bad cables and
> maybe get more ideas about how we could test and fix this rare error.

Will you be able to test patches?

Alan Stern
RE: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
> -----Original Message-----
> From: Alan Stern
> Sent: Thursday, May 6, 2021 3:49 PM
> To: Kiener Guido 14DS1 <Guido.Kiener@rohde-schwarz.com>
>
> On Wed, May 05, 2021 at 10:22:24PM +0000, Guido Kiener wrote:
> > > Drivers are not consistent in the way they handle these errors, as
> > > you have seen. A few try to take active measures, such as retries
> > > with increasing timeouts. Many drivers just ignore them, which is not a very
> good idea.
> > >
> > > The general feeling among kernel USB developers is that a -EPROTO,
> > > -EILSEQ, or -ETIME error should be regarded as fatal, much the same
> > > as an unplug event. The driver should avoid resubmitting URBs and just wait to
> be unbound from the device.
> >
> > Thanks for your assessment. I agree with the general feeling. I
> > counted about hundred specific usb drivers, so wouldn't it be better to fix the
> problem in some of the host drivers (e.g. urb.c)?
> > We could return an error when calling usb_submit_urb() on an erroneous pipe.
> > I cannot estimate the side effects and we need to check all drivers
> > again how they deal with the error situation. Maybe there are some special driver
> that need a specialized error handling.
> > In this case these drivers could reset the (new?) error flag to allow
> > calling usb_submit_urb() again without error. This could work, isn't it?
>
> That is feasible, although it would be an awkward approach. As you said, the side
> effects aren't clear. But it might work.

Otherwise I see only the other approach: change a hundred drivers and add the
cases EPROTO, EILSEQ and ETIME in each callback handler. The usbtmc driver
already respects EILSEQ and ETIME, and only EPROTO is missing.
The rest should be more of a management task.
BTW do you assume it is only a problem for INT pipes or is it also a problem
for isochronous and bulk transfers?

> > > If you would like to audit drivers and fix them up to behave this
> > > way, that would be great.
> >
> > Currently not. I cannot pull the USB cable in home office :-), but I will keep an eye
> on it.
> > When I'm more involved in the next USB driver issue than I will test
> > bad cables and maybe get more ideas how we could test and fix this rare error.
>
> Will you be able to test patches?

I can only test the USBTMC function on a few different PCs. I do not have automated
regression tests for USB drivers or Linux kernels.
Maybe there is a company that could do that.

-Guido
Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Thu, May 06, 2021 at 05:44:55PM +0000, Guido Kiener wrote:
> > -----Original Message-----
> > From: Alan Stern
> > Sent: Thursday, May 6, 2021 3:49 PM
> > To: Kiener Guido 14DS1 <Guido.Kiener@rohde-schwarz.com>
> > >
> > > Thanks for your assessment. I agree with the general feeling. I
> > > counted about hundred specific usb drivers, so wouldn't it be better to fix the
> > problem in some of the host drivers (e.g. urb.c)?
> > > We could return an error when calling usb_submit_urb() on an erroneous pipe.
> > > I cannot estimate the side effects and we need to check all drivers
> > > again how they deal with the error situation. Maybe there are some special driver
> > that need a specialized error handling.
> > > In this case these drivers could reset the (new?) error flag to allow
> > > calling usb_submit_urb() again without error. This could work, isn't it?
> >
> > That is feasible, although it would be an awkward approach. As you said, the side
> > effects aren't clear. But it might work.
>
> Otherwise I see only the other approach: change a hundred drivers and add the
> cases EPROTO, EILSEQ and ETIME in each callback handler. The usbtmc driver
> already respects EILSEQ and ETIME, and only EPROTO is missing.
> The rest should be more of a management task.
> BTW do you assume it is only a problem for INT pipes or is it also a problem
> for isochronous and bulk transfers?

All of them. Control too.

> > Will you be able to test patches?
>
> I can only test the USBTMC function on a few different PCs. I do not have automated
> regression tests for USB drivers or Linux kernels.
> Maybe there is a company that could do that.

Well then, if I do find time to write a patch, I'll ask you to try it
out with the usbtmc driver.

Alan Stern
RE: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
> -----Original Message-----
> From: Alan Stern
> Sent: Thursday, May 6, 2021 8:32 PM
> To: Kiener Guido 14DS1
>
> On Thu, May 06, 2021 at 05:44:55PM +0000, Guido Kiener wrote:
> > > -----Original Message-----
> > > From: Alan Stern
> > > Sent: Thursday, May 6, 2021 3:49 PM
> > > To: Kiener Guido 14DS1 <Guido.Kiener@rohde-schwarz.com>
> > > >
> > > > Thanks for your assessment. I agree with the general feeling. I
> > > > counted about hundred specific usb drivers, so wouldn't it be
> > > > better to fix the
> > > problem in some of the host drivers (e.g. urb.c)?
> > > > We could return an error when calling usb_submit_urb() on an erroneous
> pipe.
> > > > I cannot estimate the side effects and we need to check all
> > > > drivers again how they deal with the error situation. Maybe there
> > > > are some special driver
> > > that need a specialized error handling.
> > > > In this case these drivers could reset the (new?) error flag to
> > > > allow calling usb_submit_urb() again without error. This could work, isn't it?
> > >
> > > That is feasible, although it would be an awkward approach. As you
> > > said, the side effects aren't clear. But it might work.
> >
> > Otherwise I see only the other approach to change hundred drivers and
> > add the cases EPROTO, EILSEQ and ETIME in each callback handler. The
> > usbtmc driver already respects the EILSEQ and ETIME, and only EPROTO is
> missing.
> > The rest should be more a management task.
> > BTW do you assume it is only a problem for INT pipes or is it also a
> > problem for isochronous and bulk transfers?
>
> All of them. Control too.
>
> > > Will you be able to test patches?
> >
> > I only can test the USBTMC function in some different PCs. I do not
> > have automated regression tests for USB drivers or Linux kernels.
> > Maybe there is company who could do that.
>
> Well then, if I do find time to write a patch, I'll ask you to try it out with the usbtmc
> driver.

You mean that you will do a patch in urb.c or a host driver? Or just add a line in usbtmc.c?
Anyhow there is no hurry. On May 20 I will send you a mail if I'm able to
provoke one of these hardware errors EPROTO, EILSEQ, or ETIME. Otherwise
it doesn't make sense to test it.

-Guido
Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Thu, 6 May 2021 at 22:31, Guido Kiener
<Guido.Kiener@rohde-schwarz.com> wrote:
>
> > -----Original Message-----
> > From: Alan Stern
> > Sent: Thursday, May 6, 2021 8:32 PM
> > To: Kiener Guido 14DS1
> >
> > On Thu, May 06, 2021 at 05:44:55PM +0000, Guido Kiener wrote:
> > > > -----Original Message-----
> > > > From: Alan Stern
> > > > Sent: Thursday, May 6, 2021 3:49 PM
> > > > To: Kiener Guido 14DS1 <Guido.Kiener@rohde-schwarz.com>
> > > > >
> > > > > Thanks for your assessment. I agree with the general feeling. I
> > > > > counted about hundred specific usb drivers, so wouldn't it be
> > > > > better to fix the
> > > > problem in some of the host drivers (e.g. urb.c)?
> > > > > We could return an error when calling usb_submit_urb() on an erroneous
> > pipe.
> > > > > I cannot estimate the side effects and we need to check all
> > > > > drivers again how they deal with the error situation. Maybe there
> > > > > are some special driver
> > > > that need a specialized error handling.
> > > > > In this case these drivers could reset the (new?) error flag to
> > > > > allow calling usb_submit_urb() again without error. This could work, isn't it?
> > > >
> > > > That is feasible, although it would be an awkward approach. As you
> > > > said, the side effects aren't clear. But it might work.
> > >
> > > Otherwise I see only the other approach to change hundred drivers and
> > > add the cases EPROTO, EILSEQ and ETIME in each callback handler. The
> > > usbtmc driver already respects the EILSEQ and ETIME, and only EPROTO is
> > missing.
> > > The rest should be more a management task.
> > > BTW do you assume it is only a problem for INT pipes or is it also a
> > > problem for isochronous and bulk transfers?
> >
> > All of them. Control too.
> >
> > > > Will you be able to test patches?
> > >
> > > I only can test the USBTMC function in some different PCs. I do not
> > > have automated regression tests for USB drivers or Linux kernels.
> > > Maybe there is company who could do that.
> >
> > Well then, if I do find time to write a patch, I'll ask you to try it out with the usbtmc
> > driver.
>
> You mean that you will do a patch in urb.c or a host driver? Or just add a line in usbtmc.c?
> Anyhow there is no hurry. On May 20 I will send you a mail if I'm able to
> provoke one of these hardware errors EPROTO, EILSEQ, or ETIME. Otherwise
> it doesn't make sense to test it.
>
> -Guido

EPROTO is a link level issue and needs to be handled by the host driver.
When the host driver detects a protocol error while processing an URB
it completes the URB with EPROTO status and marks the endpoint as
halted.
When the class driver resubmits the URB and the host driver
finds the endpoint still marked as halted it should return EPIPE
status on the resubmitted URB.
When the class driver and usbtmc in particular receives an URB with
EPIPE status it cleans up and does not resubmit.
Can someone from syzbot land please confirm whether usbtmc running on
the xhci host driver causes an RCU stall to be detected ?
-Dave
Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> On Thu, 6 May 2021 at 22:31, Guido Kiener
> <Guido.Kiener@rohde-schwarz.com> wrote:
> >
> > > -----Original Message-----
> > > From: Alan Stern
> > > Sent: Thursday, May 6, 2021 8:32 PM
> > > To: Kiener Guido 14DS1
> > >
> > > On Thu, May 06, 2021 at 05:44:55PM +0000, Guido Kiener wrote:
> > > > > -----Original Message-----
> > > > > From: Alan Stern
> > > > > Sent: Thursday, May 6, 2021 3:49 PM
> > > > > To: Kiener Guido 14DS1 <Guido.Kiener@rohde-schwarz.com>
> > > > > >
> > > > > > Thanks for your assessment. I agree with the general feeling. I
> > > > > > counted about hundred specific usb drivers, so wouldn't it be
> > > > > > better to fix the
> > > > > problem in some of the host drivers (e.g. urb.c)?
> > > > > > We could return an error when calling usb_submit_urb() on an erroneous
> > > pipe.
> > > > > > I cannot estimate the side effects and we need to check all
> > > > > > drivers again how they deal with the error situation. Maybe there
> > > > > > are some special driver
> > > > > that need a specialized error handling.
> > > > > > In this case these drivers could reset the (new?) error flag to
> > > > > > allow calling usb_submit_urb() again without error. This could work, isn't it?
> > > > >
> > > > > That is feasible, although it would be an awkward approach. As you
> > > > > said, the side effects aren't clear. But it might work.
> > > >
> > > > Otherwise I see only the other approach to change hundred drivers and
> > > > add the cases EPROTO, EILSEQ and ETIME in each callback handler. The
> > > > usbtmc driver already respects the EILSEQ and ETIME, and only EPROTO is
> > > missing.
> > > > The rest should be more a management task.
> > > > BTW do you assume it is only a problem for INT pipes or is it also a
> > > > problem for isochronous and bulk transfers?
> > >
> > > All of them. Control too.
> > >
> > > > > Will you be able to test patches?
> > > >
> > > > I only can test the USBTMC function in some different PCs. I do not
> > > > have automated regression tests for USB drivers or Linux kernels.
> > > > Maybe there is company who could do that.
> > >
> > > Well then, if I do find time to write a patch, I'll ask you to try it out with the usbtmc
> > > driver.
> >
> > You mean that you will do a patch in urb.c or a host driver? Or just add a line in usbtmc.c?
> > Anyhow there is no hurry. On May 20 I will send you a mail if I'm able to
> > provoke one of these hardware errors EPROTO, EILSEQ, or ETIME. Otherwise
> > it doesn't make sense to test it.
> >
> > -Guido
>
> EPROTO is a link level issue and needs to be handled by the host driver.

Are you referring to the host controller driver, or to the class device
driver running on the host? The host controller driver is responsible
for creating the -EPROTO error code in the first place. The class
device driver is responsible for taking an appropriate action in
response.

> When the host driver detects a protocol error while processing an URB
> it completes the URB with EPROTO status and marks the endpoint as
> halted.

Not true. It does not mark the endpoint as halted, not unless it
receives a STALL handshake from the device. A STALL is not a protocol
error.
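
The distinction can be written down as a small user-space model. It is only a sketch of the behaviour described here, not host controller driver code: enum txn_result, struct ep_model and complete_transaction() are hypothetical names, and the errno values are the Linux ones.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of one transaction as seen by the host controller. */
enum txn_result {
	TXN_OK,
	TXN_STALL,		/* device answered with a STALL handshake */
	TXN_PROTOCOL_ERROR,	/* e.g. no valid response from the device */
};

struct ep_model {
	bool halted;
};

/* Translate a transaction outcome into an URB status. */
static int complete_transaction(struct ep_model *ep, enum txn_result res)
{
	switch (res) {
	case TXN_STALL:
		ep->halted = true;	/* only a STALL halts the endpoint */
		return -EPIPE;
	case TXN_PROTOCOL_ERROR:
		return -EPROTO;		/* halted state is left untouched */
	case TXN_OK:
	default:
		return 0;
	}
}
```

In this model a resubmitted URB after -EPROTO finds the endpoint not halted, so nothing turns the repeated protocol error into -EPIPE.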

> When the class driver resubmits the URB and the host driver
> finds the endpoint still marked as halted it should return EPIPE
> status on the resubmitted URB.

Irrelevant.

> When the class driver and usbtmc in particular receives an URB with
> EPIPE status it cleans up and does not resubmit.
> Can someone from syzbot land please confirm whether usbtmc running on
> the xhci host driver causes an RCU stall to be detected ?

That is not an easy thing to test, and syzbot is not capable of testing
it. You would need a USB device which could deliberately be set to
create a protocol error; I don't know of any devices like that.

Alan Stern
Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > On Thu, 6 May 2021 at 22:31, Guido Kiener
> > <Guido.Kiener@rohde-schwarz.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Alan Stern
> > > > Sent: Thursday, May 6, 2021 8:32 PM
> > > > To: Kiener Guido 14DS1
> > > >
> > > > On Thu, May 06, 2021 at 05:44:55PM +0000, Guido Kiener wrote:
> > > > > > -----Original Message-----
> > > > > > From: Alan Stern
> > > > > > Sent: Thursday, May 6, 2021 3:49 PM
> > > > > > To: Kiener Guido 14DS1 <Guido.Kiener@rohde-schwarz.com>
> > > > > > >
> > > > > > > Thanks for your assessment. I agree with the general feeling. I
> > > > > > > counted about hundred specific usb drivers, so wouldn't it be
> > > > > > > better to fix the
> > > > > > problem in some of the host drivers (e.g. urb.c)?
> > > > > > > We could return an error when calling usb_submit_urb() on an erroneous
> > > > pipe.
> > > > > > > I cannot estimate the side effects and we need to check all
> > > > > > > drivers again how they deal with the error situation. Maybe there
> > > > > > > are some special driver
> > > > > > that need a specialized error handling.
> > > > > > > In this case these drivers could reset the (new?) error flag to
> > > > > > > allow calling usb_submit_urb() again without error. This could work, isn't it?
> > > > > >
> > > > > > That is feasible, although it would be an awkward approach. As you
> > > > > > said, the side effects aren't clear. But it might work.
> > > > >
> > > > > Otherwise I see only the other approach to change hundred drivers and
> > > > > add the cases EPROTO, EILSEQ and ETIME in each callback handler. The
> > > > > usbtmc driver already respects the EILSEQ and ETIME, and only EPROTO is
> > > > missing.
> > > > > The rest should be more a management task.
> > > > > BTW do you assume it is only a problem for INT pipes or is it also a
> > > > > problem for isochronous and bulk transfers?
> > > >
> > > > All of them. Control too.
> > > >
> > > > > > Will you be able to test patches?
> > > > >
> > > > > I only can test the USBTMC function in some different PCs. I do not
> > > > > have automated regression tests for USB drivers or Linux kernels.
> > > > > Maybe there is company who could do that.
> > > >
> > > > Well then, if I do find time to write a patch, I'll ask you to try it out with the usbtmc
> > > > driver.
> > >
> > > You mean that you will do a patch in urb.c or a host driver? Or just add a line in usbtmc.c?
> > > Anyhow there is no hurry. On May 20 I will send you a mail if I'm able to
> > > provoke one of these hardware errors EPROTO, EILSEQ, or ETIME. Otherwise
> > > it doesn't make sense to test it.
> > >
> > > -Guido
> >
> > EPROTO is a link level issue and needs to be handled by the host driver.
>
> Are you referring to the host controller driver, or to the class device
> driver running on the host? The host controller driver is responsible
> for creating the -EPROTO error code in the first place. The class
> device driver is responsible for taking an appropriate action in
> response.
I am referring to the host controller driver.
>
> > When the host driver detects a protocol error while processing an URB
> > it completes the URB with EPROTO status and marks the endpoint as
> > halted.
>
> Not true. It does not mark the endpoint as halted, not unless it
> receives a STALL handshake from the device. A STALL is not a protocol
> error.
>
> > When the class driver resubmits the URB and the if the host driver
> > finds the endpoint still marked as halted it should return EPIPE
> > status on the resubmitted URB
>
> Irrelevant.
Not at all. The point is that when an application is talking to an
instrument over the usbtmc driver, the underlying host controller and
its driver will detect and silence a babbling endpoint.
Hence no EPROTO loop will ensue in this case and therefore no changes
are needed in usbtmc.
>
> > When the class driver and usbtmc in particular receives an URB with
> > EPIPE status it cleans up and does not resubmit.
> > Can someone from syzbot land please confirm whether usbtmc running on
> > the xhci host driver causes an RCU stall to be detected ?
>
> That is not an easy thing to test, and syzbot is not capable of testing
> it. You would need a USB device which could deliberately be set to
> create a protocol error; I don't know of any devices like that.
>
> Alan Stern
Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > When the host driver detects a protocol error while processing an URB
> > > it completes the URB with EPROTO status and marks the endpoint as
> > > halted.
> >
> > Not true. It does not mark the endpoint as halted, not unless it
> > receives a STALL handshake from the device. A STALL is not a protocol
> > error.
> >
> > > When the class driver resubmits the URB and the if the host driver
> > > finds the endpoint still marked as halted it should return EPIPE
> > > status on the resubmitted URB
> >
> > Irrelevant.
> Not at all. The point is that when an application is talking to an
> instrument over the usbtmc driver, the underlying host controller and
> its driver will detect and silence a babbling endpoint.

No, they won't. That is, they will detect a babble error and return an
error status, but they won't silence the endpoint. What makes you think
they will?

> Hence no EPROTO loop will ensue in this case and therefore no changes
> are needed in usbtmc.

Since this conclusion relies on the incorrect assumption above, it also
is incorrect.

Alan Stern
RE: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > >
> > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > When the host driver detects a protocol error while processing an
> > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > as halted.
> > >
> > > Not true. It does not mark the endpoint as halted, not unless it
> > > receives a STALL handshake from the device. A STALL is not a
> > > protocol error.
> > >
> > > > When the class driver resubmits the URB and the if the host driver
> > > > finds the endpoint still marked as halted it should return EPIPE
> > > > status on the resubmitted URB
> > >
> > > Irrelevant.
> > Not at all. The point is that when an application is talking to an
> > instrument over the usbtmc driver, the underlying host controller and
> > its driver will detect and silence a babbling endpoint.
>
> No, they won't. That is, they will detect a babble error and return an error status, but
> they won't silence the endpoint. What makes you think they will?

Maybe there is a misunderstanding. I guess that Dave wanted to propose:
"EPROTO is a link level issue and needs to be handled by the host driver.
When the host driver detects a protocol error while processing an
URB it SHOULD complete the URB with EPROTO status and SHOULD mark the endpoint
as halted."
Is this a realistic fix for all host drivers?

-Guido
Re: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote:
> > On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > >
> > > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > > When the host driver detects a protocol error while processing an
> > > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > > as halted.
> > > >
> > > > Not true. It does not mark the endpoint as halted, not unless it
> > > > receives a STALL handshake from the device. A STALL is not a
> > > > protocol error.
> > > >
> > > > > When the class driver resubmits the URB and the if the host driver
> > > > > finds the endpoint still marked as halted it should return EPIPE
> > > > > status on the resubmitted URB
> > > >
> > > > Irrelevant.
> > > Not at all. The point is that when an application is talking to an
> > > instrument over the usbtmc driver, the underlying host controller and
> > > its driver will detect and silence a babbling endpoint.
> >
> > No, they won't. That is, they will detect a babble error and return an error status, but
> > they won't silence the endpoint. What makes you think they will?
>
> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
> "EPROTO is a link level issue and needs to be handled by the host driver.
> When the host driver detects a protocol error while processing an
> URB it SHOULD complete the URB with EPROTO status

The host controller drivers _do_ complete URBs with -EPROTO (or similar)
status when a link-level error occurs...

> and SHOULD mark the endpoint
> as halted."

but they don't mark the endpoint as halted. Even if they did, it
wouldn't fix anything because the kernel allows URBs to be submitted to
halted endpoints. In fact, it doesn't even keep track of which
endpoints are or are not halted.

> Is this a realistic fix for all host drivers?

No, it isn't.

An endpoint shouldn't be marked as halted unless it really is halted.
Otherwise a driver might be tempted to clear the Halt feature, and
some devices do not like to receive a Clear-Halt request for an endpoint
that isn't halted.

What we could do is what you suggested earlier: Note the fact that the
endpoint is in some sort of fault condition and disallow further
communication with the endpoint until the fault condition has been
cleared. (It isn't entirely obvious exactly what actions should clear
such a fault... I guess resetting or re-enabling the endpoint, or
resetting the entire device.)

Alan Stern
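
The fault-condition idea in the message above can be sketched as a small user-space model. This is only an illustration under stated assumptions: `struct ep_model`, `model_submit()`, `model_complete()`, `model_reset()` and the `MODEL_ERR_FAULTED` value are all hypothetical names invented for the sketch, not kernel APIs, and the thread does not say which error code a blocked submission would return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder error for "endpoint is in a fault condition" (an assumption). */
#define MODEL_ERR_FAULTED (-EIO)

/* Hypothetical host-side model of one endpoint; not a kernel structure. */
struct ep_model {
    bool faulted;   /* latched after a link-level error */
};

/* Hypothetical submit path: refuse new transfers while the fault is latched. */
static int model_submit(struct ep_model *ep)
{
    assert(ep != NULL);
    if (ep->faulted)
        return MODEL_ERR_FAULTED;
    return 0;
}

/* Completion path: latch the fault on the link-level errors named in the thread. */
static void model_complete(struct ep_model *ep, int status)
{
    assert(ep != NULL);
    if (status == -EPROTO || status == -EILSEQ || status == -ETIME)
        ep->faulted = true;
}

/* Stand-in for resetting or re-enabling the endpoint, or resetting the device. */
static void model_reset(struct ep_model *ep)
{
    assert(ep != NULL);
    ep->faulted = false;
}
```

In this model a STALL (-EPIPE) deliberately does not latch the fault flag, which keeps the fault condition separate from a real halt, in line with the concern about Clear-Halt requests sent to endpoints that are not halted.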
Re: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Wed, 19 May 2021, Guido Kiener wrote:

> > On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > >
> > > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > > When the host driver detects a protocol error while processing an
> > > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > > as halted.
> > > >
> > > > Not true. It does not mark the endpoint as halted, not unless it
> > > > receives a STALL handshake from the device. A STALL is not a
> > > > protocol error.
> > > >
> > > > > When the class driver resubmits the URB and the if the host driver
> > > > > finds the endpoint still marked as halted it should return EPIPE
> > > > > status on the resubmitted URB
> > > >
> > > > Irrelevant.
> > > Not at all. The point is that when an application is talking to an
> > > instrument over the usbtmc driver, the underlying host controller and
> > > its driver will detect and silence a babbling endpoint.
> >
> > No, they won't. That is, they will detect a babble error and return an error status, but
> > they won't silence the endpoint. What makes you think they will?
>
> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
> "EPROTO is a link level issue and needs to be handled by the host driver.
> When the host driver detects a protocol error while processing an
> URB it SHOULD complete the URB with EPROTO status and SHOULD mark the endpoint
> as halted."
> Is this a realistic fix for all host drivers?
>
> -Guido

Guido, would you mind taking a look at your mailer settings please? I
now have >=7 threads running through my inbox with the same subject.
For some reason your mailer is insisting on creating a new one for
each of your replies.

It's also adding odd "re: re: re: ..." prefixes.

TIA

--
Lee Jones
Senior Technical Lead - Developer Services
Linaro.org | Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog
Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
Alan Stern wrote:
> On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote:
>>> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
>>>> On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>
>>>>> On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
>>>>>> When the host driver detects a protocol error while processing an
>>>>>> URB it completes the URB with EPROTO status and marks the endpoint
>>>>>> as halted.
>>>>>
>>>>> Not true. It does not mark the endpoint as halted, not unless it
>>>>> receives a STALL handshake from the device. A STALL is not a
>>>>> protocol error.
>>>>>
>>>>>> When the class driver resubmits the URB and the if the host driver
>>>>>> finds the endpoint still marked as halted it should return EPIPE
>>>>>> status on the resubmitted URB
>>>>>
>>>>> Irrelevant.
>>>> Not at all. The point is that when an application is talking to an
>>>> instrument over the usbtmc driver, the underlying host controller and
>>>> its driver will detect and silence a babbling endpoint.
>>>
>>> No, they won't. That is, they will detect a babble error and return an error status, but
>>> they won't silence the endpoint. What makes you think they will?
>>
>> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
>> "EPROTO is a link level issue and needs to be handled by the host driver.
>> When the host driver detects a protocol error while processing an
>> URB it SHOULD complete the URB with EPROTO status
>
> The host controller drivers _do_ complete URBs with -EPROTO (or similar)
> status when a link-level error occurs...
>
>> and SHOULD mark the endpoint
>> as halted."
>
> but they don't mark the endpoint as halted. Even if they did, it
> wouldn't fix anything because the kernel allows URBs to be submitted to
> halted endpoints. In fact, it doesn't even keep track of which
> endpoints are or are not halted.
>
>> Is this a realistic fix for all host drivers?
>
> No, it isn't.
>
> An endpoint shouldn't be marked as halted unless it really is halted.
> Otherwise a driver might be tempted to clear the Halt feature, and
> some devices do not like to receive a Clear-Halt request for an endpoint
> that isn't halted.
>
> What we could do is what you suggested earlier: Note the fact that the
> endpoint is in some sort of fault condition and disallow further
> communication with the endpoint until the fault condition has been
> cleared. (It isn't entirely obvious exactly what actions should clear
> such a fault... I guess resetting or re-enabling the endpoint, or
> resetting the entire device.)
>
> Alan Stern
>

Hi Alan,

Sorry if this diverges from the thread, but I've been wondering whether
to add a change for this also.

For xHCI hosts, after transaction errors, the endpoint will enter
halted state. The driver will attempt a few soft-retries before giving
up. According to the xHCI spec (section 4.6.8), a host may send a
ClearFeature(endpoint_halt) to recover and restart the transfer (see
"reset a pipe" in xhci spec), and the class driver can handle this after
receiving something like -EPROTO from xhci.

However, as you've pointed out, some devices don't like
ClearFeature(ep_halt) and may not properly synchronize with the host on
where it should restart.

Some OSes (such as Windows) do this. Not sure if we also want this?
Currently the recovery is just a timeout and a port reset from the class
driver, but the timeout usually defaults to a long time (e.g. 30
seconds for the storage class driver).

Thanks,
Thinh
Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote:
> Hi Alan,
>
> Sorry if this diverges from the thread, but I've been wondering whether
> to add a change for this also.
>
> For xHCI hosts, after transactions errors, the endpoint will enter
> halted state.

No. You are misreading the xHCI spec. Section 4.6.8 says:

... the state of the associated Endpoint Context is set to
Halted...

Note this carefully. It says "Endpoint Context", not "endpoint".

The endpoint is part of the device, whereas the endpoint context is part
of the host controller. The device doesn't know when a transaction
error has occurred; consequently such errors do not affect the endpoint.
The host controller does know, and consequently such errors do affect
the endpoint context.

> The driver will attempt a few soft-retries before giving
> up. According to the xHCI spec (section 4.6.8), a host may send a
> ClearFeature(endpoint_halt) to recover and restart the transfer (see

Not quite. The section of the spec you're talking about says:

Software shall execute the following sequence to “reset a
pipe”.... Issue a ClearFeature(ENDPOINT_HALT) request to
device.

It does not say the host controller will do this; it says that software
will do it.

> "reset a pipe" in xhci spec), and the class driver can handle this after
> receiving something like -EPROTO from xhci.
>
> However, as you've pointed out, some devices don't like
> ClearFeature(ep_halt) and may not properly synchronize with the host on
> where it should restart.
>
> Some OS (such as Windows) do this. Not sure if we also want this?

In general we should do the same thing as Windows does, because most
hardware designers test their equipment on Windows systems but
relatively few test on Linux systems.

> Currently the recovery is just a timeout and a port reset from the class

This depends on the driver. Some perform no recovery at all.

> driver, but the timeout is usually defaulted to a long time (e.g. 30
> seconds for storage class driver).

That 30-second timeout in the mass-storage driver applies in situations
where a command fails to complete, not in situations where it completes
quickly but with a -EPROTO or -EPIPE error.

The fact is that only a small percentage of -EPROTO errors are
recoverable. Some of them can be handled by a port reset, which can be
pretty awkward to perform but does occasionally work. A lot of them
occur because the USB cable has been unplugged; obviously there's no way
to recover from that. With only a few exceptions, the best and simplest
approach is not to try to recover at all.

For the case in question (the syzbot bug report that started this
thread), the class driver doesn't try to perform any recovery. It just
resubmits the URB, getting into a tight retry loop which consumes too
much CPU time. Simply giving up would be preferable.

Alan Stern
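
The "simply give up" policy from the message above can be illustrated with a user-space sketch. It assumes a hypothetical helper `may_resubmit()` that a completion handler would consult, and a hypothetical `count_submissions()` that simulates a device failing every transfer with the same status. Neither is taken from usbtmc or any other driver, and treating every unknown status as fatal is a choice made only for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: may the completion handler resubmit the URB? */
static bool may_resubmit(int status)
{
    switch (status) {
    case 0:             /* transfer succeeded */
        return true;
    case -EPROTO:       /* link-level errors named in the thread */
    case -EILSEQ:
    case -ETIME:
    case -EPIPE:        /* STALL */
        return false;
    default:            /* unknown status: also give up in this sketch */
        return false;
    }
}

/*
 * Simulate a device whose transfers always complete with the given status
 * and return how many times the URB was submitted.
 */
static int count_submissions(int status, int max_iterations)
{
    int submissions = 0;

    assert(max_iterations >= 0);
    for (int i = 0; i < max_iterations; i++) {
        submissions++;                  /* "submit" the URB */
        if (!may_resubmit(status))      /* completion handler's decision */
            break;
    }
    return submissions;
}
```

With this policy a transfer that keeps failing with -EPROTO is submitted once and then dropped, instead of being resubmitted in a tight loop until something else stops it.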
Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
+Mathias

Alan Stern wrote:
> On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote:
>> Hi Alan,
>>
>> Sorry if this diverges from the thread, but I've been wondering whether
>> to add a change for this also.
>>
>> For xHCI hosts, after transactions errors, the endpoint will enter
>> halted state.
>
> No. You are misreading the xHCI spec. Section 4.6.8 says:
>
> ... the state of the associated Endpoint Context is set to
> Halted...
>
> Note this carefully. It says "Endpoint Context", not "endpoint".
>
> The endpoint is part of the device, whereas the endpoint context is part
> of the host controller. The device doesn't know when a transaction
> error has occurred; consequently such errors do not affect the endpoint.
> The host controller does know, and consequently such errors do affect
> the endpoint context.
>

You're right, my mistake here.

>> The driver will attempt a few soft-retries before giving
>> up. According to the xHCI spec (section 4.6.8), a host may send a
>> ClearFeature(endpoint_halt) to recover and restart the transfer (see
>
> Not quite. The section of the spec you're talking about says:
>
> Software shall execute the following sequence to “reset a
> pipe”.... Issue a ClearFeature(ENDPOINT_HALT) request to
> device.
>
> It does not say the host controller will do this; it says that software
> will do it.

Sorry for being unclear. I meant from the class driver, see my next
sentence.

>
>> "reset a pipe" in xhci spec), and the class driver can handle this after
>> receiving something like -EPROTO from xhci.
>>
>> However, as you've pointed out, some devices don't like
>> ClearFeature(ep_halt) and may not properly synchronize with the host on
>> where it should restart.
>>
>> Some OS (such as Windows) do this. Not sure if we also want this?
>
> In general we should do the same thing as Windows does, because most
> hardware designers test their equipment on Windows systems but
> relatively few test on Linux systems.
>
>> Currently the recovery is just a timeout and a port reset from the class
>
> This depends on the driver. Some perform no recovery at all.
>
>> driver, but the timeout is usually defaulted to a long time (e.g. 30
>> seconds for storage class driver).
>
> That 30-second timeout in the mass-storage driver applies in situations
> where a command fails to complete, not in situations where it completes
> quickly but with a -EPROTO or -EPIPE error.

Hm... looks like we have a couple of issues in the uas storage class
driver and the xhci driver.

We may need to fix that in the uas storage driver because it doesn't
seem to handle it. (check uas_data_cmplt() in uas.c).

As for the xhci driver, there may be a case where the stream URB never
gets to complete because the transaction err_count is not properly
updated. The err_count for transaction errors is stored in ep_ring, but
the xhci driver may not be able to look up the correct ep_ring based on
the TRB address for streams. There are cases for streams where the event
TRBs have their TRB pointer field cleared to '0' (xhci spec section
4.12.2). If the xhci driver doesn't find an ep_ring for a transaction
error, it automatically does a soft-retry. We saw this in one of our
tests: the driver repeatedly did a soft-retry until the class driver
timed out.

Hi Mathias, maybe you have some comment on this? Thanks.

>
> The fact is that only a small percentage of -EPROTO errors are
> recoverable. Some of them can be handled by a port reset, which can be
> pretty awkward to perform but does occasionally work. A lot of them
> occur because the USB cable has been unplugged; obviously there's no way
> to recover from that. With only a few exceptions, the best and simplest
> approach is not to try to recover at all.

If the cable is unplugged, then we should get a connection change event
and the driver can handle it properly.

Yes, it's probably simplest to do a port reset and let the transfer be
incomplete/corrupted. However, I think we should give
ClearFeature(ep_halt) some more thought, as I think it can be a recovery
mechanism for the storage class driver, even though it may not be
foolproof.

>
> For the case in question (the syzbot bug report that started this
> thread), the class driver doesn't try to perform any recovery. It just
> resubmits the URB, getting into a tight retry loop which consumes too
> much CPU time. Simply giving up would be preferable.
>
> Alan Stern
>

I see. By giving up, you mean doing a port reset, right? Otherwise it needs
some other mechanism to synchronize with the device side.

Thanks,
Thinh
Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On 20.5.2021 23.30, Thinh Nguyen wrote:
> +Mathias
>
...

> Hm... looks like we have a couple of issues in the uas storage class
> driver and the xhci driver.
>
> We may need to fix that in the uas storage driver because it doesn't
> seem to handle it. (check uas_data_cmplt() in uas.c).
>
> As for the xhci driver, there maybe a case where the stream URB never
> gets to complete because the transaction err_count is not properly
> updated. The err_count for transaction error is stored in ep_ring, but
> the xhci driver may not be able to lookup the correct ep_ring based on
> TRB address for streams. There are cases for streams where the event
> TRBs have their TRB pointer field cleared to '0' (xhci spec section
> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
> it automatically does a soft-retry. This is seen from one of our
> testings that the driver was repeatedly doing soft-retry until the class
> driver timed out.
>
> Hi Mathias, maybe you have some comment on this? Thanks.

This is true: if the TRB pointer is 0, then there is no retry limit for soft retry.
We should add one and prevent a loop. After a few soft resets we can end with a
hard reset to clear the host-side endpoint halt.

We don't know the URB that was being transferred during the error, and can't
give it back with a proper error code.
In that sense we still end up waiting for a timeout and someone to cancel
the urb.

-Mathias
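
The retry limit proposed above can be sketched as follows. This is a user-space model, not xhci driver code. It assumes the error counter is kept in a hypothetical per-endpoint structure (`struct model_ep`) so that it is still reachable when the event's TRB pointer is 0 and no ep_ring can be looked up. The limit `MODEL_MAX_SOFT_RETRY` and the action names are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SOFT_RETRY 3   /* hypothetical limit, chosen for the sketch */

enum model_action {
    MODEL_SOFT_RETRY,   /* retry the transfer without touching the device */
    MODEL_HARD_RESET    /* give up on retries and reset the host-side state */
};

/* Hypothetical per-endpoint state; the counter does not depend on a ring. */
struct model_ep {
    unsigned int err_count;
};

/* Decide what to do after a transaction error on this endpoint. */
static enum model_action model_on_transaction_error(struct model_ep *ep)
{
    assert(ep != NULL);
    if (ep->err_count < MODEL_MAX_SOFT_RETRY) {
        ep->err_count++;
        return MODEL_SOFT_RETRY;
    }
    ep->err_count = 0;          /* start fresh after the hard reset */
    return MODEL_HARD_RESET;
}
```

The counter here bounds the number of soft retries regardless of which stream the error belongs to. It does not address the second problem raised above, namely which URB to give back once the limit is reached.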
Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
> On 20.5.2021 23.30, Thinh Nguyen wrote:
> > As for the xhci driver, there maybe a case where the stream URB never
> > gets to complete because the transaction err_count is not properly
> > updated. The err_count for transaction error is stored in ep_ring, but
> > the xhci driver may not be able to lookup the correct ep_ring based on
> > TRB address for streams. There are cases for streams where the event
> > TRBs have their TRB pointer field cleared to '0' (xhci spec section
> > 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
> > it automatically does a soft-retry. This is seen from one of our
> > testings that the driver was repeatedly doing soft-retry until the class
> > driver timed out.
> >
> > Hi Mathias, maybe you have some comment on this? Thanks.
>
> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
> We should add one and prevent a loop. after e few soft resets we can end with a
> hard reset to clear the host side endpoint halt.
>
> We don't know the URB that was being tansferred during the error, and can't
> give it back with a proper error code.
> In that sense we still end up waiting for a timeout and someone to cancel
> the urb.

That's not good. There may not be a timeout; drivers expect transfers
to complete with a failure, not to be retried indefinitely.

However, if you do know which endpoint/stream the error is connected to,
you should be able to get the URB. It will be the first one queued for
that endpoint/stream.

Alan Stern
Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
Alan Stern wrote:
> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
>> On 20.5.2021 23.30, Thinh Nguyen wrote:
>>> As for the xhci driver, there maybe a case where the stream URB never
>>> gets to complete because the transaction err_count is not properly
>>> updated. The err_count for transaction error is stored in ep_ring, but
>>> the xhci driver may not be able to lookup the correct ep_ring based on
>>> TRB address for streams. There are cases for streams where the event
>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section
>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
>>> it automatically does a soft-retry. This is seen from one of our
>>> testings that the driver was repeatedly doing soft-retry until the class
>>> driver timed out.
>>>
>>> Hi Mathias, maybe you have some comment on this? Thanks.
>>
>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
>> We should add one and prevent a loop. after e few soft resets we can end with a
>> hard reset to clear the host side endpoint halt.
>>
>> We don't know the URB that was being tansferred during the error, and can't
>> give it back with a proper error code.
>> In that sense we still end up waiting for a timeout and someone to cancel
>> the urb.
>
> That's not good. There may not be a timeout; drivers expect transfers
> to complete with a failure, not to be retried indefinitely.
>
> However, if you do know which endpoint/stream the error is connected to,
> you should be able to get the URB. It will be the first one queued for
> that endpoint/stream.
>

When the xhci can't recover a transfer with soft-retry, no outstanding
transfer can proceed/complete for the endpoint. If the TRB pointer is 0,
we just don't know which stream or endpoint ring it's for, but we know
all the outstanding URBs of an endpoint. We may as well return an
error status for all of them after a limited number of soft-retries.

BR,
Thinh
Re: [syzbot] INFO: rcu detected stall in tx [ In reply to ]
On 24.5.2021 22.23, Thinh Nguyen wrote:
> Alan Stern wrote:
>> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
>>> On 20.5.2021 23.30, Thinh Nguyen wrote:
>>>> As for the xhci driver, there maybe a case where the stream URB never
>>>> gets to complete because the transaction err_count is not properly
>>>> updated. The err_count for transaction error is stored in ep_ring, but
>>>> the xhci driver may not be able to lookup the correct ep_ring based on
>>>> TRB address for streams. There are cases for streams where the event
>>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section
>>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
>>>> it automatically does a soft-retry. This is seen from one of our
>>>> testings that the driver was repeatedly doing soft-retry until the class
>>>> driver timed out.
>>>>
>>>> Hi Mathias, maybe you have some comment on this? Thanks.
>>>
>>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
>>> We should add one and prevent a loop. after e few soft resets we can end with a
>>> hard reset to clear the host side endpoint halt.
>>>
>>> We don't know the URB that was being tansferred during the error, and can't
>>> give it back with a proper error code.
>>> In that sense we still end up waiting for a timeout and someone to cancel
>>> the urb.
>>
>> That's not good. There may not be a timeout; drivers expect transfers
>> to complete with a failure, not to be retried indefinitely.
>>
>> However, if you do know which endpoint/stream the error is connected to,
>> you should be able to get the URB. It will be the first one queued for
>> that endpoint/stream.
>>
>
> When the xhci can't recover a transfer with soft-retry, no outstanding
> transfer can proceed/complete for the endpoint. If the TRB pointer is 0,
> we just don't know which stream or endpoint ring it's for, but we know
> all the outstanding URBs of an endpoint. Let's may as well return an
> error status for all of them after a limited number of soft-retries.

We get the endpoint, but not the stream.

I guess we could walk through each stream of this endpoint, and return the
first URB of every stream that has a pending URB.
The xHCI spec claims to support 65533 streams per endpoint, but in real life
UAS probably only uses a few per endpoint?

-Mathias
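
Walking the streams of an endpoint and returning the first pending URB of each one could look roughly like the sketch below. It is a user-space model under stated assumptions: `struct model_urb`, `struct model_stream`, `MODEL_QUEUE_DEPTH` and `model_fail_first_urbs()` are hypothetical and do not correspond to xhci driver data structures, and the oldest URB of a stream is simply taken to be entry 0 of a small array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_QUEUE_DEPTH 4      /* hypothetical per-stream queue size */

struct model_urb {
    int status;        /* completion status, valid once given_back is set */
    int given_back;    /* nonzero after the URB was returned to its driver */
};

struct model_stream {
    struct model_urb *pending[MODEL_QUEUE_DEPTH]; /* pending[0] is the oldest */
    size_t num_pending;
};

/*
 * For every stream that has a pending URB, give back the oldest one with
 * the supplied error status. Returns the number of URBs given back.
 */
static size_t model_fail_first_urbs(struct model_stream *streams,
                                    size_t num_streams, int status)
{
    size_t returned = 0;

    assert(streams != NULL);
    for (size_t i = 0; i < num_streams; i++) {
        struct model_stream *s = &streams[i];

        if (s->num_pending == 0)
            continue;               /* nothing queued on this stream */

        struct model_urb *urb = s->pending[0];
        urb->status = status;
        urb->given_back = 1;

        /* Drop the oldest entry and shift the remaining ones forward. */
        for (size_t j = 1; j < s->num_pending; j++)
            s->pending[j - 1] = s->pending[j];
        s->num_pending--;
        returned++;
    }
    return returned;
}
```

The cost of the walk grows with the number of streams, which is why the number of streams actually used per endpoint matters for this approach.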
