.. _seed-0109:

===========================
0109: Communication Buffers
===========================
.. seed::
   :number: 109
   :name: Communication Buffers
   :status: Accepted
   :proposal_date: 2023-08-28
   :cl: 168357

-------
Summary
-------
This SEED proposes that Pigweed adopt a standard buffer type for network-style
communications. This buffer type will be used in the new sockets API
(see the recently-accepted `Communications SEED 107
<https://pigweed.dev/seed/0107-communications.html>`_ for more details), and
will be carried throughout the communications stack as appropriate.

------------------------
Top-Level Usage Examples
------------------------

Sending a Proto Into a Socket
=============================
.. code-block:: cpp

   Allocator& msg_allocator = socket.Allocator();
   size_t size = EncodedProtoSize(some_proto);
   std::optional<pw::MultiBuf> buffer = msg_allocator.Allocate(size);
   if (!buffer) { return; }
   EncodeProto(some_proto, *buffer);
   socket.Send(std::move(*buffer));

``Socket`` s provide an allocator which should be used to create a
``pw::MultiBuf``. The data to be sent is then filled into this buffer before
being sent. ``Socket`` s must accept ``pw::MultiBuf`` s which were not created
by their own ``Allocator``, but this may incur a performance penalty.

Zero-Copy Fragmentation
=======================

The example above has a hidden feature: zero-copy fragmentation! The socket
can provide a ``pw::MultiBuf`` which is divided up into separate MTU-sized
chunks, each of which has reserved space for headers and/or footers:

.. code-block:: cpp

   std::optional<pw::MultiBuf> Allocate(size_t size) {
     if (size == 0) { return pw::MultiBuf(); }
     size_t data_per_chunk = mtu_size_ - header_size_;
     size_t num_chunks = 1 + ((size - 1) / data_per_chunk);
     std::optional<pw::MultiBuf> buffer = pw::MultiBuf::WithCapacity(num_chunks);
     if (!buffer) { return std::nullopt; }
     for (size_t i = 0; i < num_chunks; i++) {
       // Note: we could allocate smaller than `mtu_size_` for the final
       // chunk. This is omitted here for brevity.
       std::optional<pw::MultiBuf::Chunk> chunk = internal_alloc_.Allocate(mtu_size_);
       if (!chunk) { return std::nullopt; }
       // Reserve the header size by discarding the front of each chunk.
       chunk->DiscardFront(header_size_);
       buffer->Chunks()[i] = std::move(*chunk);
     }
     return buffer;
   }

This code reserves ``header_size_`` bytes at the beginning of each ``Chunk``.
When the caller writes into this memory and then passes it back to the socket,
these bytes can be claimed for the header using the ``ClaimPrefix`` function
(see the sketch at the end of this section).

One use case that seems to demand the ability to fragment like this is
breaking up ``SOCK_SEQPACKET`` messages, which (at least on Unix / Linux) may
be much larger than the MTU size: up to ``SO_SNDBUF`` (see `this man page
<https://man7.org/linux/man-pages/man7/socket.7.html>`_).

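The claiming step on the socket side might look roughly like the following.
This is a minimal sketch: ``WriteHeader``, ``SendChunk``, and the
``header_size_`` member are hypothetical, and only ``Chunks()``,
``ClaimPrefix``, and ``span()`` come from the API proposed below.

.. code-block:: cpp

   // Hypothetical socket send path: reclaim the reserved prefix of each chunk
   // and write the link-layer header into it before handing it to the wire.
   pw::Status Send(pw::MultiBuf buffer) {
     for (auto& chunk : buffer.Chunks()) {
       // Re-expand the chunk into the space reserved via DiscardFront above.
       if (!chunk.ClaimPrefix(header_size_)) {
         return pw::Status::ResourceExhausted();
       }
       // Write the header into the newly-claimed prefix, then send the
       // now-contiguous chunk.
       WriteHeader(chunk.span().first(header_size_));
       SendChunk(chunk);
     }
     return pw::OkStatus();
   }
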
Multiplexing Incoming Data
==========================

.. code-block:: cpp

   [[nodiscard]] bool SplitAndSend(pw::MultiBuf buffer) {
     std::optional<std::array<pw::MultiBuf, 2>> buffers =
       std::move(buffer).Split(split_index);
     if (!buffers) { return false; }
     socket_1_.Send(std::move(buffers->at(0)));
     socket_2_.Send(std::move(buffers->at(1)));
     return true;
   }

Incoming buffers can be split without copying, and the results can be forwarded
to multiple different ``Socket`` s, RPC services, or clients.

----------
Motivation
----------
Today, a Pigweed communications stack typically involves a number of different
buffers.

``pw_rpc`` users, for example, frequently use direct memory access (DMA) or
other system primitives to read data into a buffer, apply a link-layer
protocol such as HDLC which copies the data into a second buffer, and then
pass the result into ``pw_rpc``, which parses it into its own buffer. Multiple
sets of buffers are present on the output side as well. Between DMA in and DMA
out, it's easy for data to pass through six or more different buffers.

These independent buffer systems introduce both time and space overhead. Aside
from the additional CPU time required to move the data around, users need to
ensure that every buffer pool along the way has enough space reserved to
contain the entire message. Where caching is present, moving the data between
locations may add further delay by thrashing between memory regions.

--------
Proposal
--------
``pw::buffers::MultiBuf`` is a handle to a buffer optimized for use within
Pigweed's communications stack. It provides efficient, low-overhead buffer
management, and serves as a standard type for passing data between drivers,
TCP/IP implementations, RPC 2.0, and transfer 2.0.

A single ``MultiBuf`` is a list of ``Chunk`` s of data. Each ``Chunk``
points to an exclusively-owned portion of a reference-counted allocation.
``MultiBuf`` s can be easily split, joined, prefixed, postfixed, or infixed
without needing to copy the underlying data.

The memory pointed to by ``Chunk`` s is typically allocated from a pool
provided by a ``Socket``. This allows the ``Socket`` to provide backpressure to
callers, and to ensure that memory is placed in DMA or shared memory regions
as necessary.

In-Memory Layout
================

This diagram shows an example in-memory relationship between two buffers:

- ``Buffer1`` references one chunk from region A.
- ``Buffer2`` references two chunks, one each from regions A and B.

.. mermaid::

   graph TB;
     Buffer1 --> Chunk1A
     Chunk1A -- "[0..64]" --> MemoryRegionA(Memory Region A)
     Chunk1A --> ChunkRegionTrackerA
     Buffer2 --> Chunk2A & Chunk2B
     Chunk2A --> ChunkRegionTrackerA
     Chunk2A -- "[64..128]" --> MemoryRegionA(Memory Region A)
     Chunk2B -- "[0..128]" --> MemoryRegionB
     Chunk2B --> ChunkRegionTrackerB

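For illustration, the layout above could be produced with the API defined in
the next section. This is a rough sketch: ``ExampleRegionTracker`` is a
hypothetical concrete subclass of ``ChunkRegionTracker``, and ``region_a`` and
``region_b`` are assumed to be ``ByteSpan`` s over two 128-byte allocations;
only the ``ChunkRegionTracker``, ``Chunk``, and ``MultiBuf`` operations come
from the proposal.

.. code-block:: cpp

   // One tracker per region; splitting region A's chunk yields two
   // non-overlapping Chunks backed by the same reference-counted region.
   ExampleRegionTracker tracker_a(region_a);
   ExampleRegionTracker tracker_b(region_b);
   std::array<Chunk, 2> pieces = tracker_a.ChunkForRegion().Split(64);

   // Buffer1 references bytes [0..64] of region A.
   std::optional<MultiBuf> buffer1 = MultiBuf::FromChunk(std::move(pieces[0]));

   // Buffer2 references bytes [64..128] of region A plus all of region B.
   std::optional<MultiBuf> buffer2 = MultiBuf::FromChunk(std::move(pieces[1]));
   std::optional<MultiBuf> from_b =
       MultiBuf::FromChunk(tracker_b.ChunkForRegion());
   if (buffer1 && buffer2 && from_b && buffer2->Append(std::move(*from_b))) {
     // buffer2 now spans both regions without copying any data.
   }
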
API
===

The primary API is as follows:

.. code-block:: cpp

   /// An object that manages a single allocated region which is referenced
   /// by one or more chunks.
   class ChunkRegionTracker {
    public:
     ChunkRegionTracker(ByteSpan);

     /// Creates the first ``Chunk`` referencing a whole region of memory.
     /// This must only be called once per ``ChunkRegionTracker``.
     Chunk ChunkForRegion();

    protected:
     pw::Mutex lock();

     /// Destroys the `ChunkRegionTracker`.
     ///
     /// Typical implementations will call `std::destroy_at(this)` and then
     /// free the memory associated with the tracker.
     virtual void Destroy();

     /// Defines the entire span of the region being managed. ``Chunk`` s
     /// referencing this tracker may not expand beyond this region
     /// (or into one another).
     ///
     /// This region must not change for the lifetime of this
     /// ``ChunkRegionTracker``.
     virtual ByteSpan region();

    private:
     /// Used to manage the internal linked-list of ``Chunk`` s that allows
     /// chunks to know whether or not they can expand to fill neighboring
     /// regions of memory.
     pw::Mutex lock_;
   };

   /// A handle to a contiguous refcounted slice of data.
   ///
   /// Note: this Chunk may acquire a ``pw::sync::Mutex`` during destruction,
   /// and so must not be destroyed within ISR context.
   class Chunk {
    public:
     Chunk();
     Chunk(ChunkRegionTracker&);
     Chunk(Chunk&& other) noexcept;
     Chunk& operator=(Chunk&& other);
     ~Chunk();
     std::byte* data();
     const std::byte* data() const;
     ByteSpan span();
     ConstByteSpan span() const;
     size_t size() const;

     std::byte& operator[](size_t index);
     std::byte operator[](size_t index) const;

     /// Decrements the reference count on the underlying chunk of data and
     /// empties this handle so that `span()` now returns an empty
     /// (zero-sized) span.
     ///
     /// Does not modify the underlying data, but may cause it to be
     /// released if this was the only remaining ``Chunk`` referring to it.
     /// This is equivalent to ``{ Chunk _unused = std::move(chunk_ref); }``
     void Release();

     /// Attempts to add `prefix_bytes` bytes to the front of this buffer by
     /// advancing its range backwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes preceding it. For example, a `Chunk`
     /// which has been shrunk using `DiscardFront` can then be re-expanded
     /// using `ClaimPrefix`.
     ///
     /// Note that this will fail if a preceding `Chunk` has claimed this
     /// memory using `ClaimSuffix`.
     [[nodiscard]] bool ClaimPrefix(size_t prefix_bytes);

     /// Attempts to add `suffix_bytes` bytes to the back of this buffer by
     /// advancing its range forwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes following it. For example, a `Chunk`
     /// which has been shrunk using `Truncate` can then be re-expanded using
     /// `ClaimSuffix`.
     ///
     /// Note that this will fail if a following `Chunk` has claimed this
     /// memory using `ClaimPrefix`.
     [[nodiscard]] bool ClaimSuffix(size_t suffix_bytes);

     /// Shrinks this handle to refer to the data beginning at offset
     /// ``new_start``.
     ///
     /// Does not modify the underlying data.
     void DiscardFront(size_t new_start);

     /// Shrinks this handle to refer to data in the range ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     void Slice(size_t begin, size_t end);

     /// Shrinks this handle to refer to only the first ``len`` bytes.
     ///
     /// Does not modify the underlying data.
     void Truncate(size_t len);

     /// Splits this chunk in two, with the second chunk starting at
     /// `split_point`.
     std::array<Chunk, 2> Split(size_t split_point) &&;
   };

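   // Illustrative usage sketch, not part of the proposed API: the shrink and
   // claim operations above compose to reserve header space and reclaim it
   // later. Assuming a 128-byte region and an 8-byte header:
   //
   //   Chunk chunk = tracker.ChunkForRegion();  // chunk covers bytes [0, 128)
   //   chunk.DiscardFront(8);                   // chunk now covers [8, 128)
   //   // ... caller fills the remaining 120 bytes with payload ...
   //   bool claimed = chunk.ClaimPrefix(8);     // back to [0, 128); succeeds
   //                                            // because no other Chunk has
   //                                            // claimed those bytes
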
   /// A handle to a sequence of potentially non-contiguous refcounted slices
   /// of data.
   class MultiBuf {
    public:
     struct Index {
       size_t chunk_index;
       size_t byte_index;
     };

     MultiBuf();

     /// Creates a ``MultiBuf`` pointing to a single, contiguous chunk of data.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of
     /// memory.
     static std::optional<MultiBuf> FromChunk(Chunk chunk);

     /// Splits the ``MultiBuf`` into two separate buffers at ``split_point``.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of
     /// memory.
     std::optional<std::array<MultiBuf, 2>> Split(Index split_point) &&;
     std::optional<std::array<MultiBuf, 2>> Split(size_t split_point) &&;

     /// Appends the contents of ``suffix`` to this ``MultiBuf`` without
     /// copying data.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Append(MultiBuf suffix);

     /// Discards the first elements of ``MultiBuf`` up to (but not including)
     /// ``new_start``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardFront(Index new_start);
     [[nodiscard]] bool DiscardFront(size_t new_start);

     /// Shifts and truncates this handle to refer to data in the range
     /// ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Slice(size_t begin, size_t end);

     /// Discards the tail of this ``MultiBuf`` starting with
     /// ``first_index_to_drop``.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(Index first_index_to_drop);
     /// Discards the tail of this ``MultiBuf`` so that only ``len`` elements
     /// remain.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(size_t len);

     /// Discards the elements beginning at ``cut_start`` and resuming at
     /// ``resume_point``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardSegment(Index cut_start, Index resume_point);

     /// Returns an iterable over the ``Chunk`` s of memory within this
     /// ``MultiBuf``.
     auto Chunks();
     auto Chunks() const;

     /// Returns a ``BidirectionalIterator`` over the bytes in this
     /// ``MultiBuf``.
     ///
     /// Note that use of this iterator type may be less efficient than
     /// performing chunk-wise operations due to the noncontiguous nature of
     /// the iterator elements.
     auto begin();
     auto end();

     /// Counts the total number of ``Chunk`` s.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateNumChunks() const;

     /// Counts the total size in bytes of all ``Chunk`` s combined.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateTotalSize() const;

     /// Returns an ``Index`` which can be used to provide constant-time
     /// access to the element at ``position``.
     ///
     /// Any mutation of this ``MultiBuf`` (e.g. ``Append``, ``DiscardFront``,
     /// ``Slice``, or ``Truncate``) may invalidate this ``Index``.
     Index IndexAt(size_t position) const;
   };

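   // Illustrative usage sketch, not part of the proposed API: stripping an
   // already-parsed link-layer header from a received MultiBuf without
   // copying. `packet` and `header_len` are hypothetical.
   //
   //   MultiBuf::Index payload_start = packet.IndexAt(header_len);
   //   if (!packet.DiscardFront(payload_start)) {
   //     // The ChunkList allocator was out of memory.
   //   }
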
   class MultiBufAllocationFuture {
    public:
     Poll<std::optional<MultiBuf>> Poll(Context&);
   };

   class MultiBufChunkAllocationFuture {
    public:
     Poll<std::optional<MultiBuf::Chunk>> Poll(Context&);
   };

   class MultiBufAllocator {
    public:
     std::optional<MultiBuf> Allocate(size_t size);
     std::optional<MultiBuf> Allocate(size_t min_size, size_t desired_size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t min_size, size_t desired_size);

     MultiBufAllocationFuture AllocateAsync(size_t size);
     MultiBufAllocationFuture AllocateAsync(size_t min_size, size_t desired_size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t min_size, size_t desired_size);

    protected:
     virtual std::optional<MultiBuf> DoAllocate(size_t min_size, size_t desired_size);
     virtual std::optional<MultiBuf::Chunk> DoAllocateContiguous(size_t min_size, size_t desired_size);

     /// Invoked by the ``MultiBufAllocator`` to signal waiting futures that
     /// buffers of the provided sizes may be available for allocation.
     void AllocationAvailable(size_t size_available, size_t contiguous_size_available);
   };

The ``Chunk`` s in the prototype are stored in fallible dynamically-allocated
arrays, but they could be stored in vectors of a fixed maximum size. The
``Chunk`` s cannot be stored as an intrusively-linked list because this would
require per-``Chunk`` overhead in the underlying buffer, and there may be any
number of ``Chunk`` s within the same buffer.

Multiple ``Chunk`` s may not refer to the same memory, but they may refer to
non-overlapping sections of memory within the same region. When one ``Chunk``
within a region is deallocated, a neighboring chunk may expand to use its
space.

--------------------
Vectorized Structure
--------------------
The most significant design choice made in this API is supporting vectorized
IO via a list of ``Chunk`` s. While this does carry an additional overhead, it
is strongly motivated by the desire to carry data throughout the communications
stack without needing a copy. By carrying a list of ``Chunk`` s, ``MultiBuf``
allows data to be prefixed, suffixed, infixed, or split without incurring the
overhead of a separate allocation and copy.

---------------------------------------------------------------------------
Managing Allocations with ``MultiBufAllocator`` and ``ChunkRegionTracker``
---------------------------------------------------------------------------
``pw::MultiBuf`` is not associated with a concrete allocator implementation.
Instead, it delegates the creation of buffers to implementations of
the ``MultiBufAllocator`` base class. This allows different allocator
implementations to vend out ``pw::MultiBuf`` s that are optimized for use with
a particular communications stack.

For example, a communications stack which runs off of shared memory or
specific DMA'able regions might choose to allocate memory out of those regions
to allow for zero-copy writes.

Additionally, custom allocator implementations can reserve headers, footers,
or even split a ``pw::MultiBuf`` into multiple chunks in order to provide
zero-copy fragmentation.

Deallocation of these regions is managed through the ``ChunkRegionTracker``.
When no more ``Chunk`` s associated with a ``ChunkRegionTracker`` exist,
the ``ChunkRegionTracker`` receives a ``Destroy`` call to release both the
region and the ``ChunkRegionTracker``.

The ``MultiBufAllocator`` can place the ``ChunkRegionTracker`` wherever it
wishes, including as a header or footer for the underlying region allocation.
This is not required, however, as memory in regions like shared or DMA'able
memory might be limited, in which case the ``ChunkRegionTracker`` can be
stored elsewhere. One possible arrangement is sketched below.

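As one concrete but purely illustrative arrangement, the tracker can be
constructed in-place at the front of the allocation it manages, so that a
single ``Destroy`` releases both. The ``HeaderChunkRegionTracker`` name and
its use of ``malloc``/``free`` below are assumptions made for this sketch;
only the ``ChunkRegionTracker`` interface comes from the proposal.

.. code-block:: cpp

   #include <cstddef>  // std::byte
   #include <cstdlib>  // std::malloc, std::free
   #include <memory>   // std::destroy_at
   #include <new>      // placement new

   // Illustrative tracker stored as a header at the front of its own region.
   class HeaderChunkRegionTracker : public ChunkRegionTracker {
    public:
     // Allocates space for the tracker plus `region_size` data bytes and
     // constructs the tracker in-place at the front of that allocation.
     static HeaderChunkRegionTracker* AllocateRegion(size_t region_size) {
       auto* memory = static_cast<std::byte*>(
           std::malloc(sizeof(HeaderChunkRegionTracker) + region_size));
       if (memory == nullptr) {
         return nullptr;
       }
       return new (memory) HeaderChunkRegionTracker(
           ByteSpan(memory + sizeof(HeaderChunkRegionTracker), region_size));
     }

    protected:
     void Destroy() override {
       // The tracker sits at the start of the allocation, so destroying it
       // and freeing `this` releases the tracker and the region together.
       std::destroy_at(this);
       std::free(this);
     }

     ByteSpan region() override { return region_; }

    private:
     explicit HeaderChunkRegionTracker(ByteSpan region)
         : ChunkRegionTracker(region), region_(region) {}

     ByteSpan region_;
   };
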
-----------------------------------------------------
Compatibility With Existing Communications Interfaces
-----------------------------------------------------

lwIP
====
`Lightweight IP stack (lwIP)
<https://www.nongnu.org/lwip>`_
is a TCP/IP implementation designed for use on small embedded systems. Some
Pigweed platforms may choose to use lwIP as the backend for their ``Socket``
implementations, so it's important that ``pw::MultiBuf`` interoperate
efficiently with lwIP's native buffer type.

lwIP has its own buffer type, `pbuf
<https://www.nongnu.org/lwip/2_1_x/structpbuf.html>`_, optimized for use in
`zero-copy applications
<https://www.nongnu.org/lwip/2_1_x/zerocopyrx.html>`_.
``pbuf`` represents buffers as linked lists of reference-counted ``pbuf`` s,
each of which has a pointer to its payload, a length, and some metadata. While
this does not precisely match the representation of ``pw::MultiBuf``, it is
possible to construct a ``pbuf`` list which points at the various chunks of a
``pw::MultiBuf`` without needing to perform a copy of the data.

Similarly, a ``pw::MultiBuf`` can refer to a ``pbuf`` by using each ``pbuf`` as
a "``Chunk`` region", holding a reference count on the region's ``pbuf`` so
long as any ``Chunk`` continues to reference the data referenced by that
buffer.

------------------
Existing Solutions
------------------

Linux's ``sk_buff``
===================
Linux has a similar buffer structure called `sk_buff
<https://docs.kernel.org/networking/skbuff.html#struct-sk-buff>`_.
It differs in a few significant ways:

It provides separate ``head``, ``data``, ``tail``, and ``end`` pointers.
Other scatter-gather fragments are supplied using the ``frags[]`` structure.

Separately, it has a ``frag_list`` of IP fragments which is created via calls
to ``ip_push_pending_frames``. Fragments are supplied by the ``frags[]``
payload in addition to the ``skb_shared_info.frag_list`` pointing to the tail.

``sk_buff`` reference-counts not only the underlying chunks of memory, but
also the ``sk_buff`` structure itself. This allows for clones of an
``sk_buff`` without manipulating the reference counts of the individual
chunks. We anticipate that cloning a whole ``pw::buffers::MultiBuf`` will be
uncommon enough that it is better to keep these structures single-owner (and
mutable) rather than sharing and reference-counting them.

FreeBSD's ``mbuf``
==================
FreeBSD uses a design called `mbuf
<https://man.freebsd.org/cgi/man.cgi?query=mbuf>`_
which, interestingly, allows data within the middle of a buffer to be given a
specified alignment, so that data can be prepended within the same buffer.
This is similar to the structure of ``Chunk``, which may reference data in the
middle of an allocated buffer, allowing prepending without a copy.