seq: improve seq_queue implementation and tests #8

Open
ralphlange wants to merge 1 commit into epics-modules:main from ralphlange:fix-queue

Conversation

@ralphlange
Member

Fix a possible race condition in seq_queue,
where internal state was checked outside of the mutex.
Fixes the unit test failure in macOS builds.

Co-authored-by: google-labs-jules[bot]

@anderslindho
Contributor

Tested locally on macOS/arm64. Without the patch, I reproduced the failure in 2/2 runs of queueTest.t (Bail out! 4589<=4587). With the patch, the full runtests passed 1/1 and queueTest.t specifically passed 3/3.

@anderslindho
Contributor

One follow-up comment: the queue status helpers do not take q->mutex. Most uses seem read-only/status-only, but seqQueueIsEmpty() is used by seq_pvGetQ. Should we lock seqQueueIsEmpty() too, or all of seqQueueFree(), seqQueueUsed(), seqQueueIsEmpty(), and seqQueueIsFull() for consistency?

@simon-ess

I see the same behaviour as @anderslindho, and agree with his assessment of adding mutex locks.

There's also the other question: why does this fail on macOS but not on the various Linuxes?

@ralphlange
Member Author

There's also the other question: why does this fail on macOS but not on the various Linuxes?

Timing, I would think. These multi-core ARM machines are just that much faster.

@ralphlange
Member Author

... agree with his assessment of adding mutex locks.

Let me check a bit further in the commit history.

The original author(s?) were obviously very concerned about lock contention - the extra gymnastics that this PR removes tell a story... In principle, read-only/status-only access could be fine (being off by one is OK in those cases), unless a race would produce crap data.

I agree that locking would increase consistency and beauty - just wondering about the price tag.

@ralphlange
Member Author

Commit ede72542

seq: complete redesign of broken queue adt

  • use size_t for all indices and sizes
  • state and ensure invariants
  • an additional boolean is needed to indicate overflow
  • added a mutex (used only when queue is overflowing)
  • added generic (unsafe) access methods

@ralphlange
Member Author

ralphlange commented Apr 27, 2026

So the idea was to only lock in overflow - interesting.

Nice article on the topic (not on locks, though):
https://www.snellman.net/blog/archive/2016-12-13-ring-buffers

@anderslindho
Contributor

Indeed a nice article!

I just noticed that there is a header comment that will need updating if we accept this fix (seq_queue.h):

The implementation allows one reader and one writer to access the queue
without taking a mutex, except where unavoidable, i.e. when the queue is
full.

Re costs, haven't we taken the brunt of them already with the change in this PR?

@simon-ess

I am sort of surprised at this. There is a mutex to block a concurrent put/get when full, which... fine? But why not concurrent gets or puts? Isn't that also a race condition? I.e., two puts at the same time would write to the same spot in memory, with one of them being overwritten, no?

@ralphlange
Member Author

i.e. two puts at the same time would write to the same spot in memory with one of them being overwritten, no?

See the comment that Anders mentioned: It's an attempt to implement a lock-free "SPSC" (single producer, single consumer) queue.

@simon-ess

Ah, I think I missed that it was a single producer and consumer. This does make more sense then.

@simon-ess

simon-ess commented Apr 29, 2026

As an aside, the test bail out message is wrong, which is sort of funny:

        if (j<i) {
            testAbort("%d<=%d", i, j);
        }

The i/j is backwards in the testAbort message.

(Actually, I guess what they're trying to say is "i should be less than or equal to j, but it is not", but it doesn't look that way in the log at all.)

@ralphlange
Member Author

:-) Right.

Stay tuned - I will have an updated PR in a sec or two (or 14400).

- use epicsAtomic and memory barriers to allow concurrent get/put
  without a mutex when the queue is neither full nor empty
- ensure all accesses to shared indices (rd, wr) are atomic
  to prevent data races and ensure consistency (fix MacOS issues)
- extend queueTest with a high-iteration concurrent test
  (verify data integrity using bit negation and sequence tracking)

Co-authored-by: google-labs-jules[bot]
@ralphlange
Member Author

@simon-ess, @anderslindho, @anjohnson:
Here is - cleaned-up - what Jules eventually came up with. Not bad at all.

@ralphlange ralphlange changed the title from "seq: fix possible race in seq_queue" to "seq: improve seq_queue implementation and tests" Apr 29, 2026
@ralphlange ralphlange removed the "help wanted" (Extra attention is needed) label Apr 29, 2026
@ralphlange ralphlange linked an issue Apr 29, 2026 that may be closed by this pull request
@anderslindho
Contributor

Tested on arm64 macOS, and all 252 queueTest assertions pass.

One question: is the queue still SPSC, or are we changing the contract to MPSC? The Put comment says "to support Multi-Producer" but the Get comment says "Single-Consumer", and the header still describes a one-reader-one-writer design.

Honestly not sure this is better than your first revision... I'm not sure what we're gaining (or introducing) over just always locking.

@ralphlange
Member Author

The AI insisted that there are multiple producers, and I don't understand the sequencer code well enough to counterargue. What do you think?

I agree that the main gain still is pulling the lock in front of the index check (https://github.com/epics-modules/sequencer/pull/8/changes#diff-2d908f295da82c41ba4e921917956f1fe83821391cca5703230173a1f4f96a3bR136).
However, I like the use of epicsAtomic for things that need to be atomic (not just by chance). Also, as far as I understand, the memory barriers are actually needed for a lock-free get, and we might just have been lucky that no serious code reordering has happened yet.

@anderslindho
Contributor

The AI insisted that there are multiple producers, and I don't understand the sequencer code well enough to counterargue. What do you think?

We can probably instruct another AI to argue against yours, if that's what you want. :)

TBH I don't feel confident speaking about single or multiple producers for this code base as it stands, BUT seq_ca.c has an explicit comment suggesting that we needn't worry: "no need to lock against other writers, because named and anonymous PVs are disjoint". I would practice some extra caution when it comes to the confidence of an AI's statements; I will generally get a second (AI) opinion.

I agree that the main gain is on pulling the lock in front of the index check (but I also think this failure on macOS+arm64 might not even matter), and I agree that using atomics where atomics are wanted always is a good idea.

@simon-ess

TBH, if the sequencer code is not understood well enough by us, I'm not comfortable trusting an LLM to suggest changes like this.

If we have to make a choice I would prefer the mutex in front of the index check. If the issue is the additional cost of having the mutex, is this something we can run some benchmarks against?

@simon-ess

As an example of this: If you remove all the epicsAtomics except the memory barriers the tests pass for me. Removing those as well makes it fail. So do we actually need all of this change?

I think that without one of us clearly understanding the issue and the correct fix, this seems unwise to merge in.

@ralphlange
Member Author

ralphlange commented May 6, 2026

Well, there are two separate issues:

  1. On ARM, code is reordered much more aggressively at execution time than on x86 processors. The memory barriers are correct, necessary, and common for lock-free queue designs. (This is what broke the macOS builds.)
  2. The original code (checking the index, then taking the lock if deemed necessary) only works for single-producer queues. Gemini spotted the access to the index outside the lock, which is usually a problem, "usually" meaning: when there are multiple producers. The code comment claims this queue is single-producer. (And I tend to believe that, as there would have been problems much earlier if there were multiple producers.)

The indices in question are of data types that happen to be accessed atomically on the usual processor architectures. For lock-free use, however, the accesses must be guaranteed atomic, so using epicsAtomic is more explicit and, IMHO, better.

I'm leaning towards taking epicsAtomic for accessing indices and for the memory barriers, and leaving the lock as it was.



Development

Successfully merging this pull request may close these issues.

Unit test queueTest fails consistently on MacOS

3 participants