.. _seed-0109:

===========================
0109: Communication Buffers
===========================
.. seed::
   :number: 109
   :name: Communication Buffers
   :status: Accepted
   :proposal_date: 2023-08-28
   :cl: 168357

-------
Summary
-------
This SEED proposes that Pigweed adopt a standard buffer type for network-style
communications. This buffer type will be used in the new sockets API
(see the recently-accepted `Communications SEED 107
<https://pigweed.dev/seed/0107-communications.html>`_ for more details), and
will be carried throughout the communications stack as appropriate.

------------------------
Top-Level Usage Examples
------------------------

Sending a Proto Into a Socket
=============================
.. code-block:: cpp

   Allocator& msg_allocator = socket.Allocator();
   size_t size = EncodedProtoSize(some_proto);
   std::optional<pw::MultiBuf> buffer = msg_allocator.Allocate(size);
   if (!buffer) { return; }
   EncodeProto(some_proto, *buffer);
   socket.Send(std::move(*buffer));

``Socket`` s provide an allocator which should be used to create a
``pw::MultiBuf``. The data to be sent is then filled into this buffer before
being sent. ``Socket`` s must accept ``pw::MultiBuf`` s which were not created
by their own ``Allocator``, but this may incur a performance penalty.

Zero-Copy Fragmentation
=======================

The example above has a hidden feature: zero-copy fragmentation! The socket
can provide a ``pw::MultiBuf`` which is divided up into separate MTU-sized
chunks, each of which has reserved space for headers and/or footers:

.. code-block:: cpp

   std::optional<pw::MultiBuf> Allocate(size_t size) {
     if (size == 0) { return pw::MultiBuf(); }
     size_t data_per_chunk = mtu_size_ - header_size_;
     size_t num_chunks = 1 + ((size - 1) / data_per_chunk);
     std::optional<pw::MultiBuf> buffer = pw::MultiBuf::WithCapacity(num_chunks);
     if (!buffer) { return std::nullopt; }
     for (size_t i = 0; i < num_chunks; i++) {
       // Note: we could allocate smaller than `mtu_size_` for the final
       // chunk. This is omitted here for brevity.
       std::optional<pw::MultiBuf::Chunk> chunk = internal_alloc_.Allocate(mtu_size_);
       if (!chunk) { return std::nullopt; }
       // Reserve the header size by discarding the front of each chunk.
       chunk->DiscardFront(header_size_);
       buffer->Chunks()[i] = std::move(*chunk);
     }
     return buffer;
   }

This code reserves ``header_size_`` bytes at the beginning of each ``Chunk``.
When the caller writes into this memory and then passes it back to the socket,
these bytes can be claimed for the header using the ``ClaimPrefix`` function
(see the sketch at the end of this section).

One use case that seems to demand the ability to fragment like this is
breaking up ``SOCK_SEQPACKET`` messages, which (at least on Unix / Linux) may
be much larger than the MTU size: up to ``SO_SNDBUF`` (see `this man page
<https://man7.org/linux/man-pages/man7/socket.7.html>`_).

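The claiming step on the socket side might look roughly like the following.
This is a minimal sketch: ``WriteHeader``, ``SendChunk``, and the
``header_size_`` member are hypothetical, and only ``Chunks()``,
``ClaimPrefix``, and ``span()`` come from the API proposed below.

.. code-block:: cpp

   // Hypothetical socket send path: reclaim the reserved prefix of each chunk
   // and write the link-layer header into it before handing it to the wire.
   pw::Status Send(pw::MultiBuf buffer) {
     for (auto& chunk : buffer.Chunks()) {
       // Re-expand the chunk into the space reserved via DiscardFront above.
       if (!chunk.ClaimPrefix(header_size_)) {
         return pw::Status::ResourceExhausted();
       }
       // Write the header into the newly-claimed prefix, then send the
       // now-contiguous chunk.
       WriteHeader(chunk.span().first(header_size_));
       SendChunk(chunk);
     }
     return pw::OkStatus();
   }
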
Multiplexing Incoming Data
==========================

.. code-block:: cpp

   [[nodiscard]] bool SplitAndSend(pw::MultiBuf buffer) {
     std::optional<std::array<pw::MultiBuf, 2>> buffers =
       std::move(buffer).Split(split_index);
     if (!buffers) { return false; }
     socket_1_.Send(std::move(buffers->at(0)));
     socket_2_.Send(std::move(buffers->at(1)));
     return true;
   }

Incoming buffers can be split without copying, and the results can be forwarded
to multiple different ``Socket`` s, RPC services, or clients.

----------
Motivation
----------
Today, a Pigweed communications stack typically involves a number of different
buffers.

``pw_rpc`` users, for example, frequently use direct memory access (DMA) or
other system primitives to read data into a buffer, apply a link-layer
protocol such as HDLC which copies the data into a second buffer, and then
pass the result into ``pw_rpc``, which parses it into its own buffer. Multiple
sets of buffers are present on the output side as well. Between DMA in and DMA
out, it's easy for data to pass through six or more different buffers.

These independent buffer systems introduce both time and space overhead. Aside
from the additional CPU time required to move the data around, users need to
ensure that every buffer pool along the way has enough space reserved to
contain the entire message. Where caching is present, moving the data between
locations may add further delay by thrashing between memory regions.

--------
Proposal
--------
``pw::buffers::MultiBuf`` is a handle to a buffer optimized for use within
Pigweed's communications stack. It provides efficient, low-overhead buffer
management, and serves as a standard type for passing data between drivers,
TCP/IP implementations, RPC 2.0, and transfer 2.0.

A single ``MultiBuf`` is a list of ``Chunk`` s of data. Each ``Chunk``
points to an exclusively-owned portion of a reference-counted allocation.
``MultiBuf`` s can be easily split, joined, prefixed, postfixed, or infixed
without needing to copy the underlying data.

The memory pointed to by ``Chunk`` s is typically allocated from a pool
provided by a ``Socket``. This allows the ``Socket`` to provide backpressure to
callers, and to ensure that memory is placed in DMA or shared memory regions
as necessary.

In-Memory Layout
================

This diagram shows an example in-memory relationship between two buffers:

- ``Buffer1`` references one chunk from region A.
- ``Buffer2`` references two chunks, one each from regions A and B.

.. mermaid::

   graph TB;
     Buffer1 --> Chunk1A
     Chunk1A -- "[0..64]" --> MemoryRegionA(Memory Region A)
     Chunk1A --> ChunkRegionTrackerA
     Buffer2 --> Chunk2A & Chunk2B
     Chunk2A --> ChunkRegionTrackerA
     Chunk2A -- "[64..128]" --> MemoryRegionA(Memory Region A)
     Chunk2B -- "[0..128]" --> MemoryRegionB
     Chunk2B --> ChunkRegionTrackerB

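For illustration, the layout above could be produced with the API defined in
the next section. This is a rough sketch: ``ExampleRegionTracker`` is a
hypothetical concrete subclass of ``ChunkRegionTracker``, and ``region_a`` and
``region_b`` are assumed to be ``ByteSpan`` s over two 128-byte allocations;
only the ``ChunkRegionTracker``, ``Chunk``, and ``MultiBuf`` operations come
from the proposal.

.. code-block:: cpp

   // One tracker per region; splitting region A's chunk yields two
   // non-overlapping Chunks backed by the same reference-counted region.
   ExampleRegionTracker tracker_a(region_a);
   ExampleRegionTracker tracker_b(region_b);
   std::array<Chunk, 2> pieces = tracker_a.ChunkForRegion().Split(64);

   // Buffer1 references bytes [0..64] of region A.
   std::optional<MultiBuf> buffer1 = MultiBuf::FromChunk(std::move(pieces[0]));

   // Buffer2 references bytes [64..128] of region A plus all of region B.
   std::optional<MultiBuf> buffer2 = MultiBuf::FromChunk(std::move(pieces[1]));
   std::optional<MultiBuf> from_b =
       MultiBuf::FromChunk(tracker_b.ChunkForRegion());
   if (buffer1 && buffer2 && from_b && buffer2->Append(std::move(*from_b))) {
     // buffer2 now spans both regions without copying any data.
   }
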
API
===

The primary API is as follows:

.. code-block:: cpp

   /// An object that manages a single allocated region which is referenced
   /// by one or more chunks.
   class ChunkRegionTracker {
    public:
     ChunkRegionTracker(ByteSpan);

     /// Creates the first ``Chunk`` referencing a whole region of memory.
     /// This must only be called once per ``ChunkRegionTracker``.
     Chunk ChunkForRegion();

    protected:
     pw::Mutex lock();

     /// Destroys the `ChunkRegionTracker`.
     ///
     /// Typical implementations will call `std::destroy_at(this)` and then
     /// free the memory associated with the tracker.
     virtual void Destroy();

     /// Defines the entire span of the region being managed. ``Chunk`` s
     /// referencing this tracker may not expand beyond this region
     /// (or into one another).
     ///
     /// This region must not change for the lifetime of this
     /// ``ChunkRegionTracker``.
     virtual ByteSpan region();

    private:
     /// Used to manage the internal linked-list of ``Chunk`` s that allows
     /// chunks to know whether or not they can expand to fill neighboring
     /// regions of memory.
     pw::Mutex lock_;
   };

   /// A handle to a contiguous refcounted slice of data.
   ///
   /// Note: this Chunk may acquire a ``pw::sync::Mutex`` during destruction,
   /// and so must not be destroyed within ISR context.
   class Chunk {
    public:
     Chunk();
     Chunk(ChunkRegionTracker&);
     Chunk(Chunk&& other) noexcept;
     Chunk& operator=(Chunk&& other);
     ~Chunk();
     std::byte* data();
     const std::byte* data() const;
     ByteSpan span();
     ConstByteSpan span() const;
     size_t size() const;

     std::byte& operator[](size_t index);
     std::byte operator[](size_t index) const;

     /// Decrements the reference count on the underlying chunk of data and
     /// empties this handle so that `span()` now returns an empty
     /// (zero-sized) span.
     ///
     /// Does not modify the underlying data, but may cause it to be
     /// released if this was the only remaining ``Chunk`` referring to it.
     /// This is equivalent to ``{ Chunk _unused = std::move(chunk_ref); }``
     void Release();

     /// Attempts to add `prefix_bytes` bytes to the front of this buffer by
     /// advancing its range backwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes preceding it. For example, a `Chunk`
     /// which has been shrunk using `DiscardFront` can then be re-expanded
     /// using `ClaimPrefix`.
     ///
     /// Note that this will fail if a preceding `Chunk` has claimed this
     /// memory using `ClaimSuffix`.
     [[nodiscard]] bool ClaimPrefix(size_t prefix_bytes);

     /// Attempts to add `suffix_bytes` bytes to the back of this buffer by
     /// advancing its range forwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes following it. For example, a `Chunk`
     /// which has been shrunk using `Truncate` can then be re-expanded using
     /// `ClaimSuffix`.
     ///
     /// Note that this will fail if a following `Chunk` has claimed this
     /// memory using `ClaimPrefix`.
     [[nodiscard]] bool ClaimSuffix(size_t suffix_bytes);

     /// Shrinks this handle to refer to the data beginning at offset
     /// ``new_start``.
     ///
     /// Does not modify the underlying data.
     void DiscardFront(size_t new_start);

     /// Shrinks this handle to refer to data in the range ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     void Slice(size_t begin, size_t end);

     /// Shrinks this handle to refer to only the first ``len`` bytes.
     ///
     /// Does not modify the underlying data.
     void Truncate(size_t len);

     /// Splits this chunk in two, with the second chunk starting at
     /// `split_point`.
     std::array<Chunk, 2> Split(size_t split_point) &&;
   };

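   // Illustrative usage sketch, not part of the proposed API: the shrink and
   // claim operations above compose to reserve header space and reclaim it
   // later. Assuming a 128-byte region and an 8-byte header:
   //
   //   Chunk chunk = tracker.ChunkForRegion();  // chunk covers bytes [0, 128)
   //   chunk.DiscardFront(8);                   // chunk now covers [8, 128)
   //   // ... caller fills the remaining 120 bytes with payload ...
   //   bool claimed = chunk.ClaimPrefix(8);     // back to [0, 128); succeeds
   //                                            // because no other Chunk has
   //                                            // claimed those bytes
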
   /// A handle to a sequence of potentially non-contiguous refcounted slices
   /// of data.
   class MultiBuf {
    public:
     struct Index {
       size_t chunk_index;
       size_t byte_index;
     };

     MultiBuf();

     /// Creates a ``MultiBuf`` pointing to a single, contiguous chunk of data.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of
     /// memory.
     static std::optional<MultiBuf> FromChunk(Chunk chunk);

     /// Splits the ``MultiBuf`` into two separate buffers at ``split_point``.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of
     /// memory.
     std::optional<std::array<MultiBuf, 2>> Split(Index split_point) &&;
     std::optional<std::array<MultiBuf, 2>> Split(size_t split_point) &&;

     /// Appends the contents of ``suffix`` to this ``MultiBuf`` without
     /// copying data.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Append(MultiBuf suffix);

     /// Discards the first elements of ``MultiBuf`` up to (but not including)
     /// ``new_start``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardFront(Index new_start);
     [[nodiscard]] bool DiscardFront(size_t new_start);

     /// Shifts and truncates this handle to refer to data in the range
     /// ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Slice(size_t begin, size_t end);

     /// Discards the tail of this ``MultiBuf`` starting with
     /// ``first_index_to_drop``.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(Index first_index_to_drop);
     /// Discards the tail of this ``MultiBuf`` so that only ``len`` elements
     /// remain.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(size_t len);

     /// Discards the elements beginning at ``cut_start`` and resuming at
     /// ``resume_point``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardSegment(Index cut_start, Index resume_point);

     /// Returns an iterable over the ``Chunk`` s of memory within this
     /// ``MultiBuf``.
     auto Chunks();
     auto Chunks() const;

     /// Returns a ``BidirectionalIterator`` over the bytes in this
     /// ``MultiBuf``.
     ///
     /// Note that use of this iterator type may be less efficient than
     /// performing chunk-wise operations due to the noncontiguous nature of
     /// the iterator elements.
     auto begin();
     auto end();

     /// Counts the total number of ``Chunk`` s.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateNumChunks() const;

     /// Counts the total size in bytes of all ``Chunk`` s combined.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateTotalSize() const;

     /// Returns an ``Index`` which can be used to provide constant-time
     /// access to the element at ``position``.
     ///
     /// Any mutation of this ``MultiBuf`` (e.g. ``Append``, ``DiscardFront``,
     /// ``Slice``, or ``Truncate``) may invalidate this ``Index``.
     Index IndexAt(size_t position) const;
   };

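   // Illustrative usage sketch, not part of the proposed API: stripping an
   // already-parsed link-layer header from a received MultiBuf without
   // copying. `packet` and `header_len` are hypothetical.
   //
   //   MultiBuf::Index payload_start = packet.IndexAt(header_len);
   //   if (!packet.DiscardFront(payload_start)) {
   //     // The ChunkList allocator was out of memory.
   //   }
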
   class MultiBufAllocationFuture {
    public:
     Poll<std::optional<MultiBuf>> Poll(Context&);
   };

   class MultiBufChunkAllocationFuture {
    public:
     Poll<std::optional<MultiBuf::Chunk>> Poll(Context&);
   };

   class MultiBufAllocator {
    public:
     std::optional<MultiBuf> Allocate(size_t size);
     std::optional<MultiBuf> Allocate(size_t min_size, size_t desired_size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t min_size, size_t desired_size);

     MultiBufAllocationFuture AllocateAsync(size_t size);
     MultiBufAllocationFuture AllocateAsync(size_t min_size, size_t desired_size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t min_size, size_t desired_size);

    protected:
     virtual std::optional<MultiBuf> DoAllocate(size_t min_size, size_t desired_size);
     virtual std::optional<MultiBuf::Chunk> DoAllocateContiguous(size_t min_size, size_t desired_size);

     /// Invoked by the ``MultiBufAllocator`` to signal waiting futures that
     /// buffers of the provided sizes may be available for allocation.
     void AllocationAvailable(size_t size_available, size_t contiguous_size_available);
   };

The ``Chunk`` s in the prototype are stored in fallible dynamically-allocated
arrays, but they could be stored in vectors of a fixed maximum size. The
``Chunk`` s cannot be stored as an intrusively-linked list because this would
require per-``Chunk`` overhead in the underlying buffer, and there may be any
number of ``Chunk`` s within the same buffer.

Multiple ``Chunk`` s may not refer to the same memory, but they may refer to
non-overlapping sections of memory within the same region. When one ``Chunk``
within a region is deallocated, a neighboring chunk may expand to use its
space.

--------------------
Vectorized Structure
--------------------
The most significant design choice made in this API is supporting vectorized
IO via a list of ``Chunk`` s. While this does carry an additional overhead, it
is strongly motivated by the desire to carry data throughout the communications
stack without needing a copy. By carrying a list of ``Chunk`` s, ``MultiBuf``
allows data to be prefixed, suffixed, infixed, or split without incurring the
overhead of a separate allocation and copy.

---------------------------------------------------------------------------
Managing Allocations with ``MultiBufAllocator`` and ``ChunkRegionTracker``
---------------------------------------------------------------------------
``pw::MultiBuf`` is not associated with a concrete allocator implementation.
Instead, it delegates the creation of buffers to implementations of
the ``MultiBufAllocator`` base class. This allows different allocator
implementations to vend out ``pw::MultiBuf`` s that are optimized for use with
a particular communications stack.

For example, a communications stack which runs off of shared memory or
specific DMA'able regions might choose to allocate memory out of those regions
to allow for zero-copy writes.

Additionally, custom allocator implementations can reserve headers, footers,
or even split a ``pw::MultiBuf`` into multiple chunks in order to provide
zero-copy fragmentation.

Deallocation of these regions is managed through the ``ChunkRegionTracker``.
When no more ``Chunk`` s associated with a ``ChunkRegionTracker`` exist,
the ``ChunkRegionTracker`` receives a ``Destroy`` call to release both the
region and the ``ChunkRegionTracker``.

The ``MultiBufAllocator`` can place the ``ChunkRegionTracker`` wherever it
wishes, including as a header or footer for the underlying region allocation.
This is not required, however, as memory in regions like shared or DMA'able
memory might be limited, in which case the ``ChunkRegionTracker`` can be
stored elsewhere. One possible arrangement is sketched below.

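As one concrete but purely illustrative arrangement, the tracker can be
constructed in-place at the front of the allocation it manages, so that a
single ``Destroy`` releases both. The ``HeaderChunkRegionTracker`` name and
its use of ``malloc``/``free`` below are assumptions made for this sketch;
only the ``ChunkRegionTracker`` interface comes from the proposal.

.. code-block:: cpp

   #include <cstddef>  // std::byte
   #include <cstdlib>  // std::malloc, std::free
   #include <memory>   // std::destroy_at
   #include <new>      // placement new

   // Illustrative tracker stored as a header at the front of its own region.
   class HeaderChunkRegionTracker : public ChunkRegionTracker {
    public:
     // Allocates space for the tracker plus `region_size` data bytes and
     // constructs the tracker in-place at the front of that allocation.
     static HeaderChunkRegionTracker* AllocateRegion(size_t region_size) {
       auto* memory = static_cast<std::byte*>(
           std::malloc(sizeof(HeaderChunkRegionTracker) + region_size));
       if (memory == nullptr) {
         return nullptr;
       }
       return new (memory) HeaderChunkRegionTracker(
           ByteSpan(memory + sizeof(HeaderChunkRegionTracker), region_size));
     }

    protected:
     void Destroy() override {
       // The tracker sits at the start of the allocation, so destroying it
       // and freeing `this` releases the tracker and the region together.
       std::destroy_at(this);
       std::free(this);
     }

     ByteSpan region() override { return region_; }

    private:
     explicit HeaderChunkRegionTracker(ByteSpan region)
         : ChunkRegionTracker(region), region_(region) {}

     ByteSpan region_;
   };
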
-----------------------------------------------------
Compatibility With Existing Communications Interfaces
-----------------------------------------------------

lwIP
====
`Lightweight IP stack (lwIP)
<https://www.nongnu.org/lwip>`_
is a TCP/IP implementation designed for use on small embedded systems. Some
Pigweed platforms may choose to use lwIP as the backend for their ``Socket``
implementations, so it's important that ``pw::MultiBuf`` interoperate
efficiently with lwIP's native buffer type.

lwIP has its own buffer type, `pbuf
<https://www.nongnu.org/lwip/2_1_x/structpbuf.html>`_, optimized for use in
`zero-copy applications
<https://www.nongnu.org/lwip/2_1_x/zerocopyrx.html>`_.
``pbuf`` represents buffers as linked lists of reference-counted ``pbuf`` s,
each of which has a pointer to its payload, a length, and some metadata. While
this does not precisely match the representation of ``pw::MultiBuf``, it is
possible to construct a ``pbuf`` list which points at the various chunks of a
``pw::MultiBuf`` without needing to perform a copy of the data.

Similarly, a ``pw::MultiBuf`` can refer to a ``pbuf`` by using each ``pbuf`` as
a "``Chunk`` region", holding a reference count on the region's ``pbuf`` so
long as any ``Chunk`` continues to reference the data referenced by that
buffer.

------------------
Existing Solutions
------------------

Linux's ``sk_buff``
===================
Linux has a similar buffer structure called `sk_buff
<https://docs.kernel.org/networking/skbuff.html#struct-sk-buff>`_.
It differs in a few significant ways:

It provides separate ``head``, ``data``, ``tail``, and ``end`` pointers.
Other scatter-gather fragments are supplied using the ``frags[]`` structure.

Separately, it has a ``frag_list`` of IP fragments which is created via calls
to ``ip_push_pending_frames``. Fragments are supplied by the ``frags[]``
payload in addition to the ``skb_shared_info.frag_list`` pointing to the tail.

``sk_buff`` reference-counts not only the underlying chunks of memory, but
also the ``sk_buff`` structure itself. This allows for clones of an
``sk_buff`` without manipulating the reference counts of the individual
chunks. We anticipate that cloning a whole ``pw::buffers::MultiBuf`` will be
uncommon enough that it is better to keep these structures single-owner (and
mutable) rather than sharing and reference-counting them.

FreeBSD's ``mbuf``
==================
FreeBSD uses a design called `mbuf
<https://man.freebsd.org/cgi/man.cgi?query=mbuf>`_
which, interestingly, allows data within the middle of a buffer to be given a
specified alignment, so that data can be prepended within the same buffer.
This is similar to the structure of ``Chunk``, which may reference data in the
middle of an allocated buffer, allowing prepending without a copy.