<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" updates="" obsoletes="" category="std" docName="draft-ietf-ipsecme-multi-sa-performance-09" number="9611" consensus="true" submissionType="IETF" tocInclude="true" tocDepth="4" symRefs="true" sortRefs="true" version="3">

  <front>
    <title abbrev="IKEv2 Support for Per-Resource Child SAs">Internet Key Exchange Protocol Version 2 (IKEv2) Support for Per&nbhy;Resource Child Security Associations (SAs)</title>
    <seriesInfo name="RFC" value="9611"/>
    <author fullname="Antony Antony" initials="A." surname="Antony">
      <organization abbrev="secunet">secunet Security Networks AG</organization>
      <address>
        <email>antony.antony@secunet.com</email>
      </address>
    </author>
    <author initials="T." surname="Brunner" fullname="Tobias Brunner">
      <organization abbrev="codelabs">codelabs GmbH</organization>
      <address>
        <email>tobias@codelabs.ch</email>
      </address>
    </author>
    <author fullname="Steffen Klassert" initials="S." surname="Klassert">
      <organization abbrev="secunet">secunet Security Networks AG</organization>
      <address>
        <email>steffen.klassert@secunet.com</email>
      </address>
    </author>
    <author initials="P." surname="Wouters" fullname="Paul Wouters">
      <organization>Aiven</organization>
      <address>
        <email>paul.wouters@aiven.io</email>
      </address>
    </author>
    <date month="July" year="2024"/>
    <area>SEC</area>
    <workgroup>ipsecme</workgroup>
    <keyword>IKEv2</keyword>
    <keyword>IPsec</keyword>
    <abstract>

      <t>
   In order to increase the bandwidth of IPsec traffic between peers,
   this document defines one Notify Message Status Types and one Notify
   Message Error Types payload for the Internet Key Exchange Protocol
   Version 2 (IKEv2) to support the negotiation of multiple Child
   Security Associations (SAs) with the same Traffic Selectors used on
   different resources, such as CPUs.
      </t>
      <t>
       The SA_RESOURCE_INFO notification is used to convey information that the
       negotiated Child SA and subsequent new Child SAs with the same Traffic Selectors
       are a logical group of Child SAs where most or all of the Child SAs are
       bound to a specific resource, such as a specific CPU. The TS_MAX_QUEUE
       notify conveys that the peer is unwilling to create more additional Child
       SAs for this particular negotiated Traffic Selector combination.
      </t>
      <t>
       Using multiple Child SAs with the same Traffic Selectors has the benefit
       that each resource holding the Child SA has its own Sequence Number Counter,
       ensuring that CPUs don't have to synchronize their cryptographic state or disable
       their packet replay protection.
      </t>
    </abstract>
  </front>
  <middle>
    <section numbered="true" toc="default">
      <name>Introduction</name>
      <t>
       Most IPsec implementations are currently limited to using one
       hardware queue or a single CPU resource for a Child SA. 

Running
       packet stream encryption in parallel can be done, but there is a bottleneck of
       different parts of the hardware locking or waiting to get their
       sequence number assigned for the packet being encrypted. The
       result is that a machine with many such resources is limited to
       using only one of these resources per Child SA. This severely
       limits the throughput that can be attained. For example, at the
       time of writing, an unencrypted link of 10 Gbps or more is commonly
       reduced to 2-5 Gbps when IPsec is used to encrypt the link using
       AES-GCM. By using the implementation specified in this document,
       aggregate throughput increased from 5Gbps using 1 CPU to 40-60
       Gbps using 25-30 CPUs.
      </t>
      <t>
       While this could be (partially) mitigated by setting up multiple
       narrowed Child SAs (for example, using Populate From Packet (PFP)
       as specified in IPsec architecture <xref target="RFC4301" format="default"/>), this
       IPsec feature would cause too many Child SAs (one per network flow)
       or too few Child SAs (one network flow used on multiple CPUs). PFP is
       also not widely implemented.
      </t>
      <t>
       To make better use of multiple network queues and CPUs, it can
       be beneficial to negotiate and install multiple Child
       SAs with identical Traffic Selectors. IKEv2 <xref target="RFC7296" format="default"/>
       already allows installing multiple Child SAs with identical Traffic
       Selectors, but it offers no method to indicate that the additional Child
       SA is being requested for performance increase reasons and is restricted
       to some resource (queue or CPU).
      </t>
      <t>
       When an IKEv2 peer is receiving more additional Child SAs for a single
       set of Traffic Selectors than it is willing to create, it can return an
       error notify of TS_MAX_QUEUE.

      </t>
      <section numbered="true" toc="default">
        <name>Requirements Language</name>
        <t>
    The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
    "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL NOT</bcp14>",
    "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>",
    "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
    "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be
    interpreted as described in BCP&nbsp;14 <xref target="RFC2119"/> <xref
    target="RFC8174"/> when, and only when, they appear in all capitals, as
    shown here.
        </t>
      </section>
      <section numbered="true" toc="default">
        <name>Terminology</name>
        <t>This document uses the following terms defined in IKEv2 <xref target="RFC7296" format="default"/>:
          Notification Data, Traffic Selector (TS), Traffic Selector initiator (TSi), Traffic Selector responder (TSr), Child SA,
          Configuration Payload (CP), IKE SA, CREATE_CHILD_SA, and NO_ADDITIONAL_SAS.
        </t>
        <t>This document also uses the following terms defined in <xref target="RFC4301" format="default"/>:
          Security Policy Database (SPD), SA. 
        </t>
      </section>
    </section>
    <section anchor="performance" numbered="true" toc="default">
      <name>Performance Bottlenecks</name>
      <t>
      There are several pragmatic reasons why most implementations must restrict a
      Child Security Association (SA) to a single specific hardware resource. A
      primary limitation arises from the challenges associated with sharing
      cryptographic states, counters, and sequence numbers among multiple CPUs. When
      these CPUs attempt to simultaneously utilize shared states, it becomes
      impractical to do so without incurring a significant performance penalty. It is
      necessary to negotiate and establish multiple Child SAs
      with identical Traffic Selector initiator (TSi) and Traffic Selector responder
      (TSr) on a per-resource basis.
      </t>
    </section>
    <section anchor="neg_cpu" numbered="true" toc="default">
      <name>Negotiation of Resource-Specific Child SAs</name>
      <t>
       An initial IKEv2 exchange is used to set up an IKE SA and the
       initial Child SA. If multiple Child SAs with the same Traffic
       Selectors that are bound to a single resource are desired, the
       initiator will add the SA_RESOURCE_INFO notify payload to the
       Exchange negotiating the Child SA (e.g., IKE_AUTH or CREATE_CHILD_SA).
       If this initial Child SA will be tied to a specific resource, it
       <bcp14>MAY</bcp14> indicate this by including an identifier in the Notification
       Data. 
<!--[rfed] Should "Notify Data" be "Notification Data" here?
There are zero instance of "Notify Data" in RFCs, and
"Notification Data" is used in the preceding sentence.
Also, may the article "a" be removed?

Original:
   A responder that
   is willing to have multiple Child SAs for the same Traffic Selectors
   will respond by also adding the SA_RESOURCE_INFO notify payload in
   which it MAY add a non-zero Notify Data.

Perhaps:
   A responder that
   is willing to have multiple Child SAs for the same Traffic Selectors
   will respond by also adding the SA_RESOURCE_INFO notify payload in
   which it MAY add non-zero Notification Data.

Similarly, should "notify data" be updated in the Security Considerations section?

Original: The notify data SHOULD NOT be an identifier ...
Perhaps:  The Notification Data SHOULD NOT be an identifier ...
-->
A responder that is willing to have multiple Child SAs
       for the same Traffic Selectors will respond by also adding the
       SA_RESOURCE_INFO notify payload in which it <bcp14>MAY</bcp14> add a non-zero
       Notification Data.
      </t>
      <t>
       Additional resource-specific Child SAs are negotiated as regular Child
       SAs using the CREATE_CHILD_SA exchange and are similarly identified by an
       accompanying SA_RESOURCE_INFO notification.</t>
      <t>
       Upon installation, each resource-specific Child SA is
       associated with an additional local selector, such as the CPU.
       These resource-specific Child SAs <bcp14>MUST</bcp14> be negotiated with identical
       Child SA properties that were negotiated for the initial Child
       SA. This includes cryptographic algorithms, Traffic Selectors,
       Mode (e.g., transport mode), compression usage, etc. However, each
       Child SA does have its own keying material that is individually derived
       according to the regular IKEv2 process. The SA_RESOURCE_INFO
       notify payload <bcp14>MAY</bcp14> be empty or <bcp14>MAY</bcp14> contain some identifying data.
       This identifying data <bcp14>SHOULD</bcp14> be a unique identifier within all
       the Child SAs with the same TS payloads, and the peer <bcp14>MUST</bcp14> only
       use it for debugging purposes.
      </t>
      <t>
       Additional Child SAs can be started on demand or can be started
       all at once. Peers may also delete specific per-resource Child
       SAs if they deem the associated resource to be idle.
      </t>
      <t>
       During the CREATE_CHILD_SA rekey for the Child SA, the
       SA_RESOURCE_INFO notification <bcp14>MAY</bcp14> be included, but regardless of
       whether or not it is included, the rekeyed Child SA should be bound
       to the same resource(s) as the Child SA that is being rekeyed.
      </t>
    </section>
    <section anchor="impl_consider" numbered="true" toc="default">
      <name>Implementation Considerations</name>
      <t>
       There are various considerations that an implementation can
       use to determine the best procedure to install multiple Child SAs.
      </t>
      <t>
       A simple procedure could be to install one additional Child SA
       on each CPU. An implementation can ensure that one Child SA can be
       used by all CPUs, so that while negotiating a new per-CPU Child SA,
       which typically takes 1 RTT delay, the CPU with no CPU-specific
       Child SA can still encrypt its packets using the Child SA that is
       available for all CPUs. Alternatively, if an implementation finds
       it needs to encrypt a packet but the current CPU does not have
       the resources to encrypt this packet, it can relay that packet
       to a specific CPU that does have the capability to encrypt the
       packet, although this will come with a performance penalty.
      </t>
      <t>
       Performing per-CPU Child SA negotiations can result in both peers
       initiating additional Child SAs simultaneously. This is especially likely
       if per-CPU Child SAs are triggered by individual SADB_ACQUIRE messages
       <xref target="RFC2367" format="default"/>. Responders should install the
       additional Child SA on a CPU with the least amount of additional
       Child SAs for this TSi/TSr pair.
      </t>
      <t>
       When the number of queue or CPU resources are different between the
       peers, the peer with the least amount of resources may decide to
       not install a second outbound Child SA for the same resource, as
       it will never use it to send traffic. However, it must install
       all inbound Child SAs because it has committed to receiving traffic
       on these negotiated Child SAs.
      </t>
      <t>
       If per-CPU packet trigger (e.g., SADB_ACQUIRE) messages are implemented
       (see <xref target="Operations" format="default"/>),
       the Traffic Selector (TSi) entry containing the information of the
       trigger packet should be included in the TS set similarly to
       regular Child SAs as specified in IKEv2 <xref target="RFC7296" format="default" section="2.9"      
sectionFormat="comma"/>.
       Based on the trigger TSi entry, an implementation can select the most
       optimal target CPU to install the additional Child SA on. For example,
       if the trigger packet was for a TCP destination to port 25 (SMTP), it
       might be able to install the Child SA on the CPU that is also running
       the mail server process. Trigger packet Traffic Selectors are
       documented in IKEv2 <xref target="RFC7296" format="default" section="2.9" sectionFormat="comma"/>.
      </t>
      <t>
       As per IKEv2, rekeying a Child SA <bcp14>SHOULD</bcp14> use the same (or
       wider) Traffic Selectors to ensure that the new Child SA covers
       everything that the rekeyed Child SA covers. 
       This includes
       Traffic Selectors negotiated via Configuration Payloads 
       such as INTERNAL_IP4_ADDRESS, which may use the original wide TS
       set or use the narrowed TS set.
      </t>
    </section>
    <section anchor="payload_formats" numbered="true" toc="default">
      <name>Payload Format</name>
      <t>
	The Notify Payload format is defined in IKEv2 <xref target="RFC7296" format="default" section="3.10" sectionFormat="comma"/>, and is copied here for convenience.
      </t>
      <t>
      All multi-octet fields representing integers are laid out in big
      endian order (also known as "most significant byte first", or
      "network byte order").
      </t>
      <section anchor="payload_info_cpu" numbered="true" toc="default">
        <name>SA_RESOURCE_INFO Notify Message Status Type Payload</name>
        <artwork align="left" name="" type="" alt=""><![CDATA[
                    1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-------------------------------+-------------------------------+
| Next Payload  |C|  RESERVED   |         Payload Length        |
+---------------+---------------+-------------------------------+
|  Protocol ID  |   SPI Size    |      Notify Message Type      |
+---------------+---------------+-------------------------------+
|                                                               |
~               Resource Identifier (optional)                  ~
|                                                               |
+-------------------------------+-------------------------------+
            ]]></artwork>

        <dl>
	  <dt>(C)ritical flag -</dt><dd><bcp14>MUST</bcp14> be 0.</dd>
          <dt>Protocol ID (1 octet) -</dt><dd><bcp14>MUST</bcp14> be 0. <bcp14>MUST</bcp14> be ignored if not 0.</dd>
          <dt>SPI Size (1 octet) -</dt><dd><bcp14>MUST</bcp14> be 0. <bcp14>MUST</bcp14> be ignored if not 0.</dd>
          <dt>Notify Status Message Type value (2 octets) -</dt><dd>set to 16444.</dd>
          <dt>Resource Identifier (optional) -</dt><dd>This opaque data may be set to convey the local identity of the resource.</dd>
        </dl>
      </section>
      <section anchor="payload_max_q" numbered="true" toc="default">
        <name>TS_MAX_QUEUE Notify Message Error Type Payload</name>
        <artwork align="left" name="" type="" alt=""><![CDATA[
                    1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------------------------------+
| Next Payload  |C|  RESERVED   |         Payload Length        |
+---------------+---------------+-------------------------------+
|  Protocol ID  |   SPI Size    |      Notify Message Type      |
+---------------+---------------+-------------------------------+
            ]]></artwork>
        <dl>
	  <dt>(C)ritical flag -</dt><dd><bcp14>MUST</bcp14> be 0.</dd>
          <dt>Protocol ID (1 octet) -</dt><dd><bcp14>MUST</bcp14> be 0. <bcp14>MUST</bcp14> be ignored if not 0.</dd>
          <dt>SPI Size (1 octet) -</dt><dd><bcp14>MUST</bcp14> be 0. <bcp14>MUST</bcp14> be ignored if not 0.</dd>
          <dt>Notify Message Error Type (2 octets) -</dt><dd>set to 48.</dd>
	</dl>

            <t>There is no data associated with this Notify type.</t>

      </section>
    </section>
    <section anchor="Operations" numbered="true" toc="default">
      <name>Operational Considerations</name>
      <t>
      Implementations supporting per-CPU SAs <bcp14>SHOULD</bcp14> extend their local
      SPD selector, and the mechanism of on-demand negotiation that is
      triggered by traffic to include a CPU (or queue) identifier in
      their packet trigger (e.g., SADB_ACQUIRE) message from the SPD to
      the IKE daemon. An implementation that does not support
      receiving per-CPU packet trigger messages <bcp14>MAY</bcp14> initiate all its Child
      SAs immediately upon receiving the (only) packet trigger message it
      will receive from the IPsec stack. Such an implementation also needs
      to be careful when receiving a Delete Notify request for a per-CPU
      Child SA, as it has no method to detect when it should bring up such
      a per-CPU Child SA again later. Also, bringing the deleted per-CPU
      Child SA up again immediately after receiving the Delete Notify
      might cause an infinite loop between the peers. Another issue with
      not bringing up all its per-CPU Child SAs is that if the peer acts
      similarly, the two peers might end up with only the first Child
      SA without ever activating any per-CPU Child SAs. It is therefore
      <bcp14>RECOMMENDED</bcp14> to implement per-CPU packet trigger messages.
      </t>
      <t>
      Peers <bcp14>SHOULD</bcp14> be flexible with the maximum number of Child SAs they
      allow for a given TSi/TSr combination in order to account for corner cases. For
      example, during Child SA rekeying, there might be a large number
      of additional Child SAs created before the old Child SAs are torn
      down. Similarly, when using on-demand Child SAs, both ends could
      trigger multiple Child SA requests as the initial packet causing
      the Child SA negotiation might have been transported to the peer
      via the first Child SA, where its reply packet might also trigger an
      on-demand Child SA negotiation to start. As additional Child SAs
      consume little additional resources, allowing at the very least double
      the number of available CPUs is <bcp14>RECOMMENDED</bcp14>. An implementation <bcp14>MAY</bcp14> allow
      unlimited additional Child SAs and only limit this number based on its
      generic resource protection strategies that are used to require COOKIES
      or refuse new IKE or Child SA negotiations. Although having a very large
      number (e.g., hundreds or thousands) of SAs may slow down per-packet SAD lookup.
      </t>
      <t>
     Implementations might support dynamically moving a per-CPU Child
     SA from one CPU to another CPU. If this method is supported,
     implementations must be careful to move both the inbound and outbound
     SAs. If the IPsec endpoint is a gateway, it can move the inbound SA
     and outbound SA independently of each other. It is likely that
     for a gateway, IPsec traffic would be asymmetric.  If the IPsec
     endpoint is the same host responsible for generating the traffic,
     the inbound and outbound SAs <bcp14>SHOULD</bcp14> remain as a pair on the same CPU.
     If a host previously skipped installing an outbound SA because it
     would be an unused duplicate outbound SA, it will have to create
     and add the previously skipped outbound SA to the SAD with the new
     CPU ID. The inbound SA may not have a CPU ID in the SAD.  Adding the
     outbound SA to the SAD requires access to the key material, whereas
     updating the CPU selector on an existing outbound SAs might not require access
     to key material.  To support this, the IKE
     software might have to hold on to the key material longer than it
     normally would, as it might actively attempt to destroy key material
     from memory that the IKE daemon no longer needs access to.
      </t>
      <t>
   An implementation that does not accept any further resource-specific
   Child SAs <bcp14>MUST NOT</bcp14> return the NO_ADDITIONAL_SAS error because it
   could be misinterpreted by the peer to mean that no other Child SA with
   a different TSi and/or TSr is allowed either. Instead, it <bcp14>MUST</bcp14> return TS_MAX_QUEUE.
      </t>
    </section>
    <section anchor="Security" numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>
      Similar to how an implementation should limit the number of
      half-open SAs to limit the impact of a denial-of-service attack,
      it is <bcp14>RECOMMENDED</bcp14> that an implementation limits the maximum number of additional
      Child SAs allowed per unique TSi/TSr.
      </t>
      <t>
      Using multiple resource-specific child SAs makes sense for
      high-volume IPsec connections on IPsec gateway machines where the
      administrator has a trust relationship with the peer's
      administrator and abuse is unlikely and easily escalated to resolve.
      </t>
      <t>
      This trust relationship is usually not present for the deployments of remote
      access VPNs, and allowing per-CPU Child SAs
      is <bcp14>NOT RECOMMENDED</bcp14> in these scenarios. Therefore, it is also <bcp14>NOT
      RECOMMENDED</bcp14> to allow per-CPU Child SAs by default.
      </t>
      <t>
      The SA_RESOURCE_INFO notify contains an optional data payload that
      can be used by the peer to identify the Child SA belonging to a
      specific resource.  Notification data <bcp14>SHOULD NOT</bcp14> be an identifier that
      can be used to gain information about the hardware. For example,
      using the CPU number itself as the identifier might give an attacker
      knowledge of which packets are handled by which CPU ID, and it might
      optimize a brute-force attack against the system.
      </t>
    </section>
    <section anchor="IANA" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>
        IANA has registered one new value in the "IKEv2 Notify Message Status Types" registry.
      </t>
      <table anchor="iana_requests_i">
	<thead>
	  <tr>
	    <th>Value</th>
	    <th>Notify Message Status Type</th>
	    <th>Reference</th>
	  </tr>
	</thead>
	<tbody>
	  <tr>
	    <td>16444</td>
	    <td>SA_RESOURCE_INFO</td>
	    <td>RFC 9611</td>
	  </tr>
	</tbody>
      </table>
      
      <t>
        IANA has registered one new value in the "IKEv2 Notify Message Error Types" registry.
      </t>
      <table anchor="iana_requests_e">
	<thead>
	  <tr>
	    <th>Value</th>
	    <th>Notify Message Error Type</th>
	    <th>Reference</th>
	  </tr>
	</thead>
	<tbody>
	  <tr>
	    <td>48</td>
	    <td>TS_MAX_QUEUE</td>
	    <td>RFC 9611</td>
	  </tr>
	</tbody>
      </table>
    </section>
  </middle>
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7296.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
      </references>
      <references>
        <name>Informative References</name>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2367.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4301.xml"/>
      </references>
    </references>
    <section anchor="acks" numbered="false" toc="default">
      <name>Acknowledgements</name>
      <t>The following people provided reviews and valuable feedback:
      <contact fullname="Roman Danyliw" />, <contact fullname="Warren Kumari" />,
      <contact fullname="Tero Kivinen" />, <contact fullname="Murray Kucherawy" />,
      <contact fullname="John Scudder" />, <contact fullname="Valery Smyslov" />,
      <contact fullname="Gunter van de Velde" />, and <contact fullname="Éric Vyncke" />.
      </t>
    </section>

  </back>
</rfc>
