Add critsec to rp2040 xfer, check endpoint status#2474
Add critsec to rp2040 xfer, check endpoint status#2474gemarcano wants to merge 1 commit intohathach:masterfrom
Conversation
ae077b0 to
bbf9ea0
Compare
bbf9ea0 to
f2f0d25
Compare
f2f0d25 to
5ee7ae9
Compare
|
This did not work so well for me as already mentioned there: #1764 (comment) |
5ee7ae9 to
dcf472c
Compare
|
A lot has change in my project since I first came across this issue and now somehow even with the workarounds I had built-in I found a use case where I would always get So I eventually decided to checkout your branch gemarcano:rp2040_irq_lock_fix and sure enough it worked 🥳 Then I decided to remove the workaround I had in place to take care of the worst cases. It's basically filtering out some HID reports causing multiple USB remounts. Without it still did not panic but the USB connection seems to be broken somehow, not responding. I'll have to leave that workaround for now but it looks like I'll keep using your patch for the time being. |
|
As I was trying to rebase that on the tip of master I realised the tip of master works just fine too. Same symptom as described above. So I'm afraid I can't tell if that patch helps in any way after all. What's sure is that it works just as well as the tip of master apparently. I could rebase it with no conflict too. |
|
Tip of master still gave me that Panic after all, so I've started trying that patch again. Sure enough, I have not seen that Panic since. A couple of crashes and reboots maybe but possibly unrelated. |
5441fb3 to
2318575
Compare
2318575 to
1eb62ba
Compare
1eb62ba to
f5b2459
Compare
|
There were some recent changes in the rp2040 code adding ISO endpoints, so I had to tweak this PR a little bit to carry over the null-pointer dereference checks. The main change is that Thinking about it more, I should probably split the locking implementation from the null-pointer deref checks/fixes. Thoughts? |
f5b2459 to
ca0d53e
Compare
|
@gemarcano The last commit ca0d53e Is working well for me. |
ca0d53e to
056520e
Compare
|
I have been working on building CircuitPython for the Olimex RP2350pc and have modified the CircuitPython core to allow an rp2350 build to use the native usb pins for usb host instead of the normal CircuitPython device usage. I believe this is the first board attempting to use tinyUSB/RP2040 host with CircuitPython (non-PIO). I seem to be running into the issue this PR is attempting to address as my initial build immediately would panic in rp2040_usb.c with "ep xx was already available". When I applied the changes from this PR the panic moved to hcd_rp2040.c with "Seq Error: [0] = 0xnnnn [1] = 0xnnnn". Which makes me think that there is still an unguarded transfer being initiated somewhere. I don't know if it's any help but I also discovered that simply commenting out the original panic statement in rp2040_usb.c ( I was also able to largely mitigate the original panic by introducing "wait" loops in the CircuitPython core which effectively turned the usb (mouse) reads blocking. CircuitPython was using a short timeout period with the read transactions to provide a non-blocking read. This results in lots of transactions being started and then aborted. Adding wait loops to the core avoids the panic, however for reasons I'm not entirely sure (but I suspect it has to do with a callback routine not be used for transfer aborts) the wait ends up blocking any other USB transfers effectively disabling USB keyboard input when mouse reads are being performed. I'll continue to poke around and see if I can identify any unguarded transfers but the USB protocol is all a little out of my depth. I do suspect/hope that this PR is on the right path though. Edit: After 2-3 hours of exercising (playing minesweeper) the firmware that had the panic statement commented out finally crashed into safe mode, as I probably should have expected..... |
056520e to
71cb582
Compare
- Implemented a critical section for the rp2040 port, which now prevents an IRQ from clearing the endpoint data structures while a transfer was in progress, which would then lead to a null pointer derefence.
71cb582 to
48ae658
Compare
|
From what I can tell, latest changes in the rp2040 code fix the null pointer problems, so in theory I can now just focus on the actual issue, the race condition between the endpoint structs being cleared. I'm not sure what's the best way to keep track if the structs have all been initialized, so I just pulled out the Right now all I can currently do is make sure this compiles. The hardware I had been using for this is currently packed away. It would be good for people using |
Describe the PR
hw_endpoint_lock_updateis unimplemented for the rp2040 port. This PR takes a stab at implementing it. Without it, if the USB port is disconnected, say withtud_disconnect, a race condition can happen where a transfer is in progress and the USB IRQ can fire, callingreset_non_control_endpointsand leaving the ongoing transfer in a bad state, as it tries to access endpoint data structures that are now zero'd.While debugging that main issue, I also found some cases of the rp2040 port assuming that an endpoint data structure exists for all endpoints, which is not the case as endpoint 0 has none. I've also included a few simple fixes for these.
Additional context
I've documented a lot of my debugging attempts in this discussion: #1764
I'm not super familiar with the tinyusb coding style, and I'm not sure my lock implementation is right (very naive, likely bad reference counting, might need atomic_int instead of int?).