Diffstat (limited to 'man/io_uring_enter.2')
-rw-r--r-- man/io_uring_enter.2 | 410
1 files changed, 372 insertions, 38 deletions
diff --git a/man/io_uring_enter.2 b/man/io_uring_enter.2
index 909cc9b..3c04541 100644
--- a/man/io_uring_enter.2
+++ b/man/io_uring_enter.2
@@ -55,6 +55,52 @@ application can no longer get a free SQE entry to submit, without knowing
when it one becomes available as the SQ kernel thread consumes them. If
the system call is used with this flag set, then it will wait until at least
one entry is free in the SQ ring.
+.TP
+.B IORING_ENTER_EXT_ARG
+Since kernel 5.11, the system call's arguments have been modified to look like
+the following:
+
+.nf
+.BI "int io_uring_enter(unsigned int " fd ", unsigned int " to_submit ,
+.BI " unsigned int " min_complete ", unsigned int " flags ,
+.BI " const void *" arg ", size_t " argsz );
+.fi
+
+which behaves just like the original definition by default. However, if
+.B IORING_ENTER_EXT_ARG
+is set, then instead of a
+.I sigset_t
+being passed in, a pointer to a
+.I struct io_uring_getevents_arg
+is used instead and
+.I argsz
+must be set to the size of this structure. The definition is as follows:
+
+.nf
+.BI "struct io_uring_getevents_arg {
+.BI " __u64 sigmask;
+.BI " __u32 sigmask_sz;
+.BI " __u32 pad;
+.BI " __u64 ts;
+.BI "};
+.fi
+
+which allows passing in both a signal mask as well as a pointer to a
+.I struct __kernel_timespec
+timeout value. If
+.I ts
+is set to a valid pointer, then this time value indicates the timeout for
+waiting on events. If an application is waiting on events and wishes to
+stop waiting after a specified amount of time, then this can be accomplished
+directly in version 5.11 and newer by using this feature.
+.TP
+.B IORING_ENTER_REGISTERED_RING
+If the ring file descriptor has been registered through use of
+.B IORING_REGISTER_RING_FDS,
+then setting this flag will tell the kernel that the
+.I ring_fd
+passed in is the registered ring offset rather than a normal file descriptor.
+
.PP
.PP
If the io_uring instance was configured for polling, by specifying
@@ -159,22 +205,28 @@ struct io_uring_sqe {
__u32 statx_flags;
__u32 fadvise_advice;
__u32 splice_flags;
+ __u32 rename_flags;
+ __u32 unlink_flags;
+ __u32 hardlink_flags;
};
__u64 user_data; /* data to be passed back at completion time */
union {
- struct {
- /* index into fixed buffers, if used */
+ struct {
+ /* index into fixed buffers, if used */
union {
/* index into fixed buffers, if used */
__u16 buf_index;
/* for grouped buffer selection */
__u16 buf_group;
}
- /* personality to use, if used */
- __u16 personality;
+ /* personality to use, if used */
+ __u16 personality;
+ union {
__s32 splice_fd_in;
+ __u32 file_index;
};
- __u64 __pad2[3];
+ };
+ __u64 __pad2[3];
};
};
.EE
@@ -228,11 +280,55 @@ specified in the
.I poll_events
field. Unlike poll or epoll without
.BR EPOLLONESHOT ,
-this interface always works in one shot mode. That is, once the poll
-operation is completed, it will have to be resubmitted. This command works like
+by default this interface works in one shot mode. That is, once the poll
+operation is completed, it will have to be resubmitted.
+
+If
+.B IORING_POLL_ADD_MULTI
+is set in the SQE
+.I len
+field, then the poll will work in multi shot mode instead. That means it'll
+repeatedly trigger when the requested event becomes true, and hence multiple
+CQEs can be generated from this single SQE. The CQE
+.I flags
+field will have
+.B IORING_CQE_F_MORE
+set on completion if the application should expect further CQE entries from
+the original request. If this flag isn't set on completion, then the poll
+request has been terminated and no further events will be generated. This mode
+is available since 5.13.
+
+If
+.B IORING_POLL_UPDATE_EVENTS
+is set in the SQE
+.I len
+field, then the request will update an existing poll request with the mask of
+events passed in with this request. The lookup is based on the
+.I user_data
+field of the original SQE submitted, and this value is passed in the
+.I addr
+field of the SQE. This mode is available since 5.13.
+
+If
+.B IORING_POLL_UPDATE_USER_DATA
+is set in the SQE
+.I len
+field, then the request will update the
+.I user_data
+of an existing poll request based on the value passed in the
+.I off
+field. This mode is available since 5.13.
+
+This command works like
an async
.BR poll(2)
-and the completion event result is the returned mask of events.
+and the completion event result is the returned mask of events. For the
+variants that update
+.I user_data
+or
+.I events
+, the completion result will be similar to
+.B IORING_OP_POLL_REMOVE.
.TP
.B IORING_OP_POLL_REMOVE
@@ -243,7 +339,10 @@ field of the
will contain 0. If not found,
.I res
will contain
-.B -ENOENT.
+.B -ENOENT,
+or
+.B -EALREADY
+if the poll request was in the process of completing already.
.TP
.B IORING_OP_EPOLL_CTL
@@ -342,10 +441,32 @@ clock source. The request will complete with
if the timeout got completed through expiration of the timer, or
.I 0
if the timeout got completed through requests completing on their own. If
-the timeout was cancelled before it expired, the request will complete with
+the timeout was canceled before it expired, the request will complete with
.I -ECANCELED.
Available since 5.4.
+Since 5.15, this command also supports the following modifiers in
+.I timeout_flags:
+
+.PP
+.in +12
+.B IORING_TIMEOUT_BOOTTIME
+If set, then the clocksource used is
+.I CLOCK_BOOTTIME
+instead of
+.I CLOCK_MONOTONIC.
+This clocksource differs in that it includes time elapsed if the system was
+suspended while having a timeout request in-flight.
+
+.B IORING_TIMEOUT_REALTIME
+If set, then the clocksource used is
+.I CLOCK_REALTIME
+instead of
+.I CLOCK_MONOTONIC.
+.EE
+.in
+.PP
+
.TP
.B IORING_OP_TIMEOUT_REMOVE
If
@@ -355,7 +476,7 @@ operation.
must contain the
.I user_data
field of the previously issued timeout operation. If the specified timeout
-request is found and cancelled successfully, this request will terminate
+request is found and canceled successfully, this request will terminate
with a result value of
.I 0
If the timeout request was found but expiration was already in progress,
@@ -370,13 +491,14 @@ If
.I timeout_flags
contain
.I IORING_TIMEOUT_UPDATE,
-instead of removing an existing operation it updates it.
+instead of removing an existing operation, it updates it.
.I addr
and return values are same as before.
.I addr2
field must contain a pointer to a struct timespec64 structure.
.I timeout_flags
-may also contain IORING_TIMEOUT_ABS.
+may also contain IORING_TIMEOUT_ABS, in which case the value given is an
+absolute one, not a relative one.
Available since 5.11.
.TP
@@ -389,26 +511,47 @@ must be set to the socket file descriptor,
.I addr
must contain the pointer to the sockaddr structure, and
.I addr2
-must contain a pointer to the socklen_t addrlen field. See also
+must contain a pointer to the socklen_t addrlen field. Flags can be passed using
+the
+.I accept_flags
+field. See also
.BR accept4(2)
for the general description of the related system call. Available since 5.5.
+If the
+.I file_index
+field is set to a positive number, the file won't be installed into the
+normal file table as usual but will be placed into the fixed file table at index
+.I file_index - 1.
+In this case, instead of returning a file descriptor, the result will contain
+either 0 on success or an error. If the index points to a valid empty slot, the
+installation is guaranteed to not fail. If there is already a file in the slot,
+it will be replaced, similar to
+.B IORING_OP_FILES_UPDATE.
+Please note that only io_uring has access to such files and no other syscall
+can use them. See
+.B IOSQE_FIXED_FILE
+and
+.B IORING_REGISTER_FILES.
+
+Available since 5.15.
+
.TP
.B IORING_OP_ASYNC_CANCEL
Attempt to cancel an already issued request.
.I addr
must contain the
.I user_data
-field of the request that should be cancelled. The cancellation request will
+field of the request that should be canceled. The cancellation request will
complete with one of the following results codes. If found, the
.I res
field of the cqe will contain 0. If not found,
.I res
-will contain -ENOENT. If found and attempted cancelled, the
+will contain -ENOENT. If found and attempted canceled, the
.I res
field will contain -EALREADY. In this case, the request may or may not
terminate. In general, requests that are interruptible (like socket IO) will
-get cancelled, while disk IO requests cannot be cancelled if already started.
+get canceled, while disk IO requests cannot be canceled if already started.
Available since 5.5.
.TP
@@ -426,9 +569,9 @@ If used, the timeout specified in the command will cancel the linked command,
unless the linked command completes before the timeout. The timeout will
complete with
.I -ETIME
-if the timer expired and the linked request was attempted cancelled, or
+if the timer expired and the linked request was attempted canceled, or
.I -ECANCELED
-if the timer got cancelled because of completion of the linked request. Like
+if the timer got canceled because of completion of the linked request. Like
.B IORING_OP_TIMEOUT
the clock source used is
.B CLOCK_MONOTONIC
@@ -516,6 +659,24 @@ is access mode of the file. See also
.BR openat(2)
for the general description of the related system call. Available since 5.6.
+If the
+.I file_index
+field is set to a positive number, the file won't be installed into the
+normal file table as usual but will be placed into the fixed file table at index
+.I file_index - 1.
+In this case, instead of returning a file descriptor, the result will contain
+either 0 on success or an error. If the index points to a valid empty slot, the
+installation is guaranteed to not fail. If there is already a file in the slot,
+it will be replaced, similar to
+.B IORING_OP_FILES_UPDATE.
+Please note that only io_uring has access to such files and no other syscall
+can use them. See
+.B IOSQE_FIXED_FILE
+and
+.B IORING_REGISTER_FILES.
+
+Available since 5.15.
+
.TP
.B IORING_OP_OPENAT2
Issue the equivalent of a
@@ -536,6 +697,24 @@ should be set to the address of the open_how structure. See also
.BR openat2(2)
for the general description of the related system call. Available since 5.6.
+If the
+.I file_index
+field is set to a positive number, the file won't be installed into the
+normal file table as usual but will be placed into the fixed file table at index
+.I file_index - 1.
+In this case, instead of returning a file descriptor, the result will contain
+either 0 on success or an error. If the index points to a valid empty slot, the
+installation is guaranteed to not fail. If there is already a file in the slot,
+it will be replaced, similar to
+.B IORING_OP_FILES_UPDATE.
+Please note that only io_uring has access to such files and no other syscall
+can use them. See
+.B IOSQE_FIXED_FILE
+and
+.B IORING_REGISTER_FILES.
+
+Available since 5.15.
+
.TP
.B IORING_OP_CLOSE
Issue the equivalent of a
@@ -545,6 +724,18 @@ system call.
is the file descriptor to be closed. See also
.BR close(2)
for the general description of the related system call. Available since 5.6.
+If the
+.I file_index
+field is set to a positive number, this command can be used to close files
+that were direct opened through
+.B IORING_OP_OPENAT
+,
+.B IORING_OP_OPENAT2
+, or
+.B IORING_OP_ACCEPT
+using the io_uring specific direct descriptors. Note that only one of the
+descriptor fields may be set. The direct close feature is available since
+the 5.15 kernel, where direct descriptors were introduced.
.TP
.B IORING_OP_STATX
@@ -596,7 +787,9 @@ does not refer to a seekable file,
.I off
must be set to zero. If
.I offs
-is set to -1, the offset will use (and advance) the file position, like the
+is set to
+.B -1
+, the offset will use (and advance) the file position, like the
.BR read(2)
and
.BR write(2)
@@ -622,8 +815,9 @@ is an offset to read from,
.I fd
is the file descriptor to write to,
.I off
-is an offset from which to start writing to. A sentinel value of -1 is used
-to pass the equivalent of a NULL for the offsets to
+is an offset from which to start writing to. A sentinel value of
+.B -1
+is used to pass the equivalent of a NULL for the offsets to
.BR splice(2).
.I len
contains the number of bytes to copy.
@@ -724,8 +918,11 @@ Issue the equivalent of a
.BR shutdown(2)
system call.
.I fd
-is the file descriptor to the socket being shutdown, no other fields should
-be set. Available since 5.11.
+is the file descriptor to the socket being shutdown, and
+.I len
+must be set to the
+.I how
+argument. No other fields should be set. Available since 5.11.
.TP
.B IORING_OP_RENAMEAT
@@ -774,6 +971,90 @@ being passed in to
.BR unlinkat(2).
Available since 5.11.
+.TP
+.B IORING_OP_MKDIRAT
+Issue the equivalent of a
+.BR mkdirat(2)
+system call.
+.I fd
+should be set to the
+.I dirfd,
+.I addr
+should be set to the
+.I pathname,
+and
+.I len
+should be set to the
+.I mode
+being passed in to
+.BR mkdirat(2).
+Available since 5.15.
+
+.TP
+.B IORING_OP_SYMLINKAT
+Issue the equivalent of a
+.BR symlinkat(2)
+system call.
+.I fd
+should be set to the
+.I newdirfd,
+.I addr
+should be set to the
+.I target
+and
+.I addr2
+should be set to the
+.I linkpath
+being passed in to
+.BR symlinkat(2).
+Available since 5.15.
+
+.TP
+.B IORING_OP_LINKAT
+Issue the equivalent of a
+.BR linkat(2)
+system call.
+.I fd
+should be set to the
+.I olddirfd,
+.I addr
+should be set to the
+.I oldpath,
+.I len
+should be set to the
+.I newdirfd,
+.I addr2
+should be set to the
+.I newpath,
+and
+.I hardlink_flags
+should be set to the
+.I flags
+being passed in to
+.BR linkat(2).
+Available since 5.15.
+
+.TP
+.B IORING_OP_MSG_RING
+Send a message to an io_uring.
+.I fd
+must be set to a file descriptor of a ring that the application has access to,
+.I len
+can be set to any 32-bit value that the application wishes to pass on, and
+.I off
+should be set to any 64-bit value that the application wishes to send. On the
+target ring, a CQE will be posted with the
+.I res
+field matching the
+.I len
+set, and a
+.I user_data
+field matching the
+.I off
+value being passed in. This request type can be used to either just wake or
+interrupt anyone waiting for completions on the target ring, or it can be used
+to pass messages via the two fields. Available since 5.18.
+
.PP
The
.I flags
@@ -786,7 +1067,10 @@ is an index into the files array registered with the io_uring instance (see the
.B IORING_REGISTER_FILES
section of the
.BR io_uring_register (2)
-man page). Available since 5.1.
+man page). Note that this isn't always available for all commands. If used on
+a command that doesn't support fixed files, the SQE will error with
+.B -EBADF.
+Available since 5.1.
.TP
.B IOSQE_IO_DRAIN
When this flag is specified, the SQE will not be started before previously
@@ -794,12 +1078,14 @@ submitted SQEs have completed, and new SQEs will not be started before this
one completes. Available since 5.2.
.TP
.B IOSQE_IO_LINK
-When this flag is specified, it forms a link with the next SQE in the
-submission ring. That next SQE will not be started before this one completes.
-This, in effect, forms a chain of SQEs, which can be arbitrarily long. The tail
-of the chain is denoted by the first SQE that does not have this flag set.
-This flag has no effect on previous SQE submissions, nor does it impact SQEs
-that are outside of the chain tail. This means that multiple chains can be
+When this flag is specified, the SQE forms a link with the next SQE in the
+submission ring. That next SQE will not be started before the previous request
+completes. This, in effect, forms a chain of SQEs, which can be arbitrarily
+long. The tail of the chain is denoted by the first SQE that does not have this
+flag set. Chains are not supported across submission boundaries. Even if the
+last SQE in a submission has this flag set, it will still terminate the current
+chain. This flag has no effect on previous SQE submissions, nor does it impact
+SQEs that are outside of the chain tail. This means that multiple chains can be
executing in parallel, or chains and individual SQEs. Only members inside the
chain are serialized. A chain of SQEs will be broken, if any request in that
chain ends in error. io_uring considers any unexpected result an error. This
@@ -829,7 +1115,7 @@ Used in conjunction with the
command, which registers a pool of buffers to be used by commands that read
or receive data. When buffers are registered for this use case, and this
flag is set in the command, io_uring will grab a buffer from this pool when
-the request is ready to receive or read data. If succesful, the resulting CQE
+the request is ready to receive or read data. If successful, the resulting CQE
will have
.B IORING_CQE_F_BUFFER
set in the flags part of the struct, and the upper
@@ -841,6 +1127,37 @@ are available and this flag is set, then the request will fail with
as the error code. Once a buffer has been used, it is no longer available in
the kernel pool. The application must re-register the given buffer again when
it is ready to recycle it (eg has completed using it). Available since 5.7.
+.TP
+.B IOSQE_CQE_SKIP_SUCCESS
+Don't generate a CQE if the request completes successfully. If the request
+fails, an appropriate CQE will be posted as usual and if there is no
+.B IOSQE_IO_HARDLINK,
+CQEs for all linked requests will be omitted. The notion of failure/success is
+opcode specific and is the same as with breaking chains of
+.B IOSQE_IO_LINK.
+One special case is when the request has a linked timeout, then the CQE
+generation for the linked timeout is decided solely by whether it has
+.B IOSQE_CQE_SKIP_SUCCESS
+set, regardless of whether it timed out or was canceled. In other words, if a
+linked timeout has the flag set, it's guaranteed to not post a CQE.
+
+The semantics are chosen to accommodate several use cases. First, when all but
+the last request of a normal link without linked timeouts are marked with the
+flag, only one CQE per link is posted. Additionally, it enables suppression of
+CQEs in cases where the side effects of a successfully executed operation are
+enough for userspace to know the state of the system. One such example would
+be writing to a synchronisation file.
+
+This flag is incompatible with
+.B IOSQE_IO_DRAIN.
+Using both of them in a single ring is undefined behavior, even when they are
+not used together in a single request. Currently, after the first request with
+.B IOSQE_CQE_SKIP_SUCCESS,
+all subsequent requests marked with drain will be failed at submission time.
+Note that the error reporting is best effort only, and restrictions may change
+in the future.
+
+Available since 5.17.
.PP
.I ioprio
@@ -933,7 +1250,13 @@ is copied from the field of the same name in the submission queue
entry. The primary use case is to store data that the application
will need to access upon completion of this particular I/O. The
.I flags
-is reserved for future use.
+is used for certain commands, like
+.B IORING_OP_POLL_ADD
+or in conjunction with
+.B IOSQE_BUFFER_SELECT
+or
+.B IORING_OP_MSG_RING,
+see those entries for details.
.I res
is the operation-specific result, but io_uring-specific errors
(e.g. flags or opcode invalid) are returned through this field.
@@ -941,13 +1264,22 @@ They are described in section
.B CQE ERRORS.
.PP
For read and write opcodes, the
-return values match those documented in the
+return values match
+.I errno
+values documented in the
.BR preadv2 (2)
and
.BR pwritev2 (2)
-man pages.
-Return codes for the io_uring-specific opcodes are documented in the
-description of the opcodes above.
+man pages, with
+.I
+res
+holding the equivalent of
+.I -errno
+for error cases, or the transferred number of bytes in case the operation
+is successful. Hence both error and success return can be found in that
+field in the CQE. For other request types, the return values are documented
+in the matching man page for that type, or in the opcodes section above for
+io_uring-specific opcodes.
.PP
.SH RETURN VALUE
.BR io_uring_enter ()
@@ -967,7 +1299,9 @@ completion queue entry (see section
rather than through the system call itself.
Errors that occur not on behalf of a submission queue entry are returned via the
-system call directly. On such an error, -1 is returned and
+system call directly. On such an error,
+.B -1
+is returned and
.I errno
is set appropriately.
.PP