NFSV4                                                          F. Liu
Internet Draft                                                W. Wang
Intended status: Standards Track                               R. Liu
Expires: August 2024                                              H3C
                                                                Y. Mu
                                                               K. Yao
                                                         China Mobile
                                                    February 28, 2024

          RoCEv2-based Collective Communication Offloading
                    draft-liu-nfsv4-rocev2-00.txt

Abstract

This draft proposes the design scheme of RoCEv2-based collective communication offloading. Through establishing RDMA connections between client and switch, collective operations can be implemented on network nodes, thus improving the overall efficiency of collective communication.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

This document may not be modified, and derivative works of it may not be created, and it may not be published except as an Internet-Draft.

This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English.

This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.

Liu, et al.             Expires August 28, 2024                [Page 1]
Internet-Draft                 RoCEv2 CCO                 February 2024

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on August 28, 2024.

Copyright Notice

Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
2. Terminology and Definitions
3. Architecture
   3.1. In-network Computing Aggregation Manager
   3.2. In-network Computing Switch
   3.3. In-network Computing Client
4. Deployment
5. Interaction Process
   5.1. Control plane
   5.2. Forwarding plane
6. Packet encapsulation
7. Transport layer requirements
8. Security Considerations
9. IANA Considerations
10. References
   10.1. Normative References
   10.2. Informative References
11. Acknowledgments

1. Introduction

Collective communication means that multiple computers or devices within a network communicate through shared resources and cooperation to achieve more efficient and secure data transmission and information exchange. Detailed use cases and problems are described in [I-D.yao-tsvwg-cco-problem-statement-and-usecases].

Various collective communication operations are used in both artificial intelligence (AI) and high performance computing (HPC) workloads, including:

1. Broadcast - spread data from one member to all other members.

2. AllGather - collect data from all members and spread it to all members.

3. AllToAll - distribute different data from all members to all other members.

4. Scatter - distribute different data from one member to all other members.

5. Gather - collect data from all members and send it to one member.

6. Reduce - merge the data of all members and send the result to one member.

7. AllReduce - merge the data of all members and spread the result to all members.

8. ReduceScatter - merge different parts of the data of all members, and distribute the results to all members.

9. Barrier - synchronize among all members.

In-network computing enables network devices to participate in collective communication by offloading the collective communication operations frequently used by HPC and AI to network nodes. Accelerating collective communication through in-network computing brings benefits in the following respects. The requirements and analysis are described in [I-D.draft-yao-tsvwg-cco-requirement-and-analysis].

o From the application point of view, in-network computing can significantly reduce communication traffic, thus improving overall computing efficiency and application performance.

o From the point of view of resource utilization, the network shares part of the processors' computational tasks, which accelerates computation and improves overall resource utilization.

o From the network point of view, the data flow in the network is reduced and congestion is relieved, so that network utilization is improved.

2. Terminology and Definitions

The following terms are used in this document:

Aggregation
   The act of collecting and reducing input data from one or more group members.

Collective
   Collective operation: an operation done by a group of ranks.

Collective Group
   A set of ranks that participate in a collective operation.

INC
   In-network computing.

INC-switch
   Switch with the capability to support INC.

3. Architecture

Figure 1 illustrates a conceptual architecture of in-network computing.

                  +------------------------+
                  |                        |
                  |       INC Switch       |
                  |                        |
    +-------------+     +------------+     |
    |             |     | Switch Chip|     |
    |             |     +------------+     |
    |             |                        |
+---+----+        +------/----------\------+
| INC AM |             //            \\
+---+----+            /               \\
    |                /   RDMA-RoCEv2    \
    |              //                   \\
    |             /                      \\
    | +----------/----------+   +---------\-----------+
    | |                     |   |                     |
    +-|     INC Client      |   |     INC Client      |
      |                     |   |                     |
      |   +-------------+   |   |   +-------------+   |
      |   |     GPU     |   |   |   |     GPU     |   |
      |   +-------------+   |   |   +-------------+   |
      |                     |   |                     |
      +---------------------+   +---------------------+

                        Figure 1 Architecture

In order to offload collective communication, the in-network computing architecture is composed of three main parts: the In-network Computing Aggregation Manager, the In-network Computing Switch, and the In-network Computing Client.

o In-network Computing Aggregation Manager (INC AM): It is the controller of the entire in-network computing system. It is mainly responsible for generating and managing the aggregation tree, issuing in-network computing flow tables to the switches, and monitoring the status of in-network computing tasks in real time.

o In-network Computing Switch (INC Switch): It is the core that offloads collective communication to network devices. It performs specific collective communication operations on the data and operation methods it receives from the in-network computing clients, and finally sends the results to the in-network computing clients. It also provides related operation and maintenance data, such as in-network computing task and message statistics.

o In-network Computing Client (INC client): It is the data source that needs to perform collective communication in in-network computing.
It is deployed in the computing nodes and integrates with the MPI (Message Passing Interface) library and NCCL (NVIDIA Collective Communications Library) to send collective communication data to the in-network computing switch.

3.1. In-network Computing Aggregation Manager

The main function of the in-network computing aggregation manager is to coordinate the establishment and dismantling of the collective communication group. It also tracks and manages the lifecycle of the collective communication group, and monitors the in-network computing switches and in-network computing clients through heartbeat detection.

The in-network computing aggregation manager must be deployed in a location from which it can reach the in-network computing switches and in-network computing clients. It connects to the in-network computing clients via gRPC and to the in-network computing switches via NETCONF [RFC6241].

o Topology Information. The in-network computing aggregation manager must be able to obtain the network topology and the capabilities of the in-network computing switches, and display all in-network computing clients and in-network computing switches as well as their connection relationships. This document focuses specifically on the tree topology and does not discuss other topologies.

o Establishment of Collective Communication Group. When offloading mode is used in collective communication, the in-network computing aggregation manager needs to determine which in-network computing switches have the required capability and resources, and establish an aggregation tree. All unsupported devices are excluded from the aggregation tree.

   1. Select a root switch. The root is generally a spine switch, so that all leaf switches can communicate directly with the root.

   2. Select the communication links between the root and leaf switches.

   3. The in-network computing aggregation manager configures the in-network computing switches via NETCONF, obtains information such as capabilities and RDMA information from the in-network computing switches, and sends it to the in-network computing clients.

   4. If there is any change in the topology during the lifecycle, the in-network computing aggregation manager needs to dismantle the collective communication group or establish a new aggregation tree.

o Dismantling of Collective Communication Group. The conditions for dismantling the collective communication group include:

   1. In-network computing clients leaving the collective communication group.

   2. Failure of heartbeat detection for in-network computing clients.

   3. Link failure.

   4. Manual dismantling.

o Resource Allocation. Resource allocation and distribution are required in in-network computing. The main functions include:

   1. Allocate identity identifiers for in-network computing: assign an identity identifier to each in-network computing switch in the aggregation tree, and map the identity identifiers of in-network computing clients to the identity identifiers of in-network computing switches.

   2. Establish the QPs (Queue Pairs) used by the RDMA protocol.

   3. Distribute the in-network computing forwarding table: the in-network computing aggregation manager generates the forwarding table based on the aggregation tree and distributes it to the in-network computing clients and in-network computing switches.

   4. Monitor the status of in-network computing tasks: the in-network computing aggregation manager monitors the running status of in-network computing clients and in-network computing switches, including task status and statistics.

3.2. In-network Computing Switch

The in-network computing switch offloads the collective communication of in-network computing clients. It is directly or indirectly connected to the in-network computing clients and serves as the core for offloading collective communication to network devices. It performs specific collective communication operations on the data and instructions it receives from in-network computing clients, and ultimately sends the results back to the client or clients. The interfaces and functions between the in-network computing switch and the in-network computing aggregation manager include:

o In-network computing configuration processing, specifically: configuring in-network computing management addresses, querying in-network computing aggregation trees, querying in-network computing statistics, and providing the corresponding NETCONF interfaces.

o In-network computing packet parsing and encapsulation: parsing in-network computing packets sent from in-network computing clients, performing in-network computing processing, and then re-encapsulating the in-network computing packets to send to in-network computing clients or to in-network computing root and leaf switches.

o Performing in-network computing processing based on the in-network computing forwarding table: supporting collective communication operations such as AllReduce, Broadcast, Barrier, etc.

o Providing in-network computing statistics: including packet statistics based on identity and packet statistics based on QP (Queue Pair).

3.3. In-network Computing Client

The in-network computing client needs to integrate with the collective communication library. OpenMPI and NCCL each define a collective communication interface (the MPI collectives and the NCCL API, respectively), but allow third parties to provide their own implementations.
An INC client that implements the collective communication interfaces of OpenMPI and NCCL, and realizes the collective communication algorithms through in-network computing, can be integrated into the communication library as a plugin or embedded directly. When the application calls an AllReduce interface such as MPI_Allreduce in OpenMPI or ncclAllReduce in NCCL, the call is handled directly by the INC client. The INC client sends the collective communication data to the in-network computing switch in the in-network computing encapsulation format. The INC client is also responsible for receiving the in-network computing response packets from the in-network computing switch and returning the results to the upper-layer application.

The in-network computing client needs to have the following functions:

o Deployment within the computing node and integration with the MPI library and the NCCL library: it needs to provide plugins for integration with OpenMPI and NCCL respectively, as well as an INC client library. The INC client starts when the MPI process starts and stops when the MPI process stops.

o Sending and receiving in-network computing packets: the in-network computing client sends in-network computing packets to the in-network computing switch based on the forwarding table issued by the in-network computing aggregation manager (including identity, task identification, QP, switch IP, etc.).

o An interface for querying in-network computing task information: mainly the in-network computing task status, data block size, identity, task identification, QP, and message statistics.

o An INC client log.

4. Deployment

Considering that the scale of networking can vary according to the size of AI training, in-network computing needs to support single-level aggregation and multi-level aggregation.
In general, the single-level aggregation method can meet the requirements of in-network computing. If the aggregation capacity of the in-network computing switch is insufficient, or in order to save bandwidth between switches, a multi-level aggregation method can be adopted. The networking diagram for single-level aggregation is as follows:

              +-------------------------------------+
              |             INC Switch              |
              |                                     |
              |         +-------------+             |
    +---------+         |    Leaf1    |             |
    |         |         |             |             |
    |         |         |  AllReduce  |             |
    |         |         +-/-+-------+-+             |
+---+----+    +---------//--+-------+-\\------------+
| INC AM |             /    |       |   \\
+---+----+            /     |       |     \\
    |               //      |       |       \\
    |              /        |       |         \\
    |             /         |       |           \\
    | +---------//----------+-------+-------------\\--------+
    | | +------/---+ +------+---+ +-+--------+ +----\-----+ |
    +-| |INC Client| |INC Client| |INC Client| |INC Client| |
      | |          | |          | |          | |          | |
      | | Worker1  | | Worker2  | | Worker3  | | Worker4  | |
      | +----------+ +----------+ +----------+ +----------+ |
      +-----------------------------------------------------+

               Figure 2 Single-level Aggregation Network

In a single-level aggregation network environment, the following operations need to be implemented:

o The in-network computing aggregation manager generates aggregation trees and assigns Tree IDs for different computing tasks, and then sends the aggregation tree information to the switch.

o Upon receiving packets from the in-network computing clients, the in-network computing switch performs local aggregation based on the aggregation tree information.

o The in-network computing switch broadcasts the local aggregation result to the in-network computing clients.

              +----------------------------------------+
              |               INC Switch               |
              |    +-------------+  +-------------+    |
              |    |   Spine1    |  |   Spine2    |    |
              |    |             |  |  AllReduce  |    |
    +---------+    +-+---------\\+  +/----------+-+    |
    |         |      |           \\//           |      |
    |         |      |           //\\           |      |
    |         |+-----+-------+ //    \\ +-------+-----+|
+---+----+    ||    Leaf1    |/        \|    Leaf2    ||
| INC AM |    ||  AllReduce  |          |  AllReduce  ||
+---+----+    |++--------+---+          +-+----------++|
    |         +-+--------+----------------+----------+-+
    |           |        |                |          |
    |           |        |                |          |
    | +---------+--------+----------------+----------+------+
    | | +-------+--+ +---+------+ +-------+--+ +-----+----+ |
    +-| |INC Client| |INC Client| |INC Client| |INC Client| |
      | |          | |          | |          | |          | |
      | | Worker1  | | Worker2  | | Worker3  | | Worker4  | |
      | +----------+ +----------+ +----------+ +----------+ |
      +-----------------------------------------------------+

               Figure 3 Multi-level Aggregation Network

In a multi-level aggregation network environment, the following operations need to be implemented:

o The in-network computing aggregation manager generates aggregation trees and assigns Tree IDs for different computing tasks, then sends the aggregation tree information to each switch and informs the switch of its role: leaf or root.

o Upon receiving data packets from lower-level nodes, the in-network computing switch first performs local aggregation based on the aggregation tree information.

o If the switch is the root, the aggregation is complete, and the switch broadcasts the aggregation result to all members.

o If the switch is not the root, multi-level aggregation is needed, and the switch sends the local aggregation result to the upper-level in-network computing switch for further aggregation.

o When a leaf in-network computing switch receives the aggregation result from the upper-level in-network computing switch, it broadcasts the aggregation result to the members at the lower level.

5. Interaction Process

The interaction process of in-network computing mainly consists of two parts: the control plane and the forwarding plane. The control plane is responsible for establishing and dismantling in-network computing communication groups and for allocating and releasing their resources. The forwarding plane is responsible for executing the data processing tasks of specific in-network computing communication groups.

5.1. Control plane

The control plane process starts when an in-network computing client joins the collective communication group. The in-network computing aggregation manager allocates the corresponding in-network computing resources by establishing the collective communication group. The in-network computing aggregation manager needs to be deployed where it can reach both the in-network computing switches and the in-network computing clients. It then needs to register the capabilities of the in-network computing switches, discover the topology between the in-network computing switches and the in-network computing clients, and allocate or release the resources of the in-network computing switches according to the clients' requirements for the collective communication group. Communication between the in-network computing clients and the in-network computing switches, and between in-network computing switches, is done through the RDMA protocol, so before RDMA communication, it is necessary to apply for a QPN and create a QP.
These resources can be allocated in three ways: through the Socket API, through CM (Communication Management), or through the in-network computing aggregation manager.

(1) Building a connection between RDMA QPs based on the Socket API requires establishing a TCP/IP connection between the two nodes through the Socket API, and then using this connection to exchange information about both QPs. The application calls the Socket API to establish the TCP connection (three-way handshake), exchanges information such as the QPN over it, and then closes the connection (four-way handshake).

(2) CM is a mechanism specifically used in RDMA technology to establish connections between QPs. It has its own set of message formats, interaction processes, and user interfaces. The CM protocol establishes connections through multiple round-trip messages, and it also specifies how to disconnect. Users control the sending and receiving of CM protocol messages through the CM programming interface, completing the exchange of GID, QPN, and other information.

(3) Considering the complexity of implementation, it is also possible for the in-network computing aggregation manager to allocate the QPN and QP for the in-network computing switches. The in-network computing client allocates its own QPN and QP and synchronizes the allocated information to the in-network computing aggregation manager for unified pairing and management.

5.2. Forwarding plane

The forwarding plane process of in-network computing starts with the client initiating a data packet. The in-network computing switches receive the data from the in-network computing clients according to the generated topology and process it. The result is then broadcast to all member clients.
Considering the complexity of multi-level aggregation, the overall process is divided into an upstream process and a downstream process. Assume that Worker1 and Worker2 are attached to the Leaf1 switch, Worker3 and Worker4 are attached to the Leaf2 switch, and the Leaf1 and Leaf2 switches are attached to the root spine switch.

The upstream process is as follows:

o Worker1 and Worker2 send their in-network computing messages to Leaf1 in the RoCEv2 message format, carrying the corresponding information such as QP and tree.

o Leaf1 receives the messages from Worker1 and Worker2, aggregates them locally, and then sends the result to the spine switch, carrying information such as QP and tree.

o Worker3 and Worker4 send their in-network computing messages to Leaf2 in the RoCEv2 message format, carrying the corresponding information such as QP and tree.

o Leaf2 receives the messages from Worker3 and Worker4, aggregates them locally, and then sends the result to the spine switch, carrying information such as QP and tree.

o The spine switch receives the messages from Leaf1 and Leaf2 and completes the aggregation.

The downstream process is as follows:

o After the spine switch completes the aggregation, it locally replicates the aggregation result and sends it to the Leaf1 and Leaf2 switches respectively, carrying QP, tree, and other information.

o Leaf1 receives the aggregation result from the spine, completes the local replication, and then sends it to Worker1 and Worker2, carrying QP, tree, and other information.

o Leaf2 receives the aggregation result from the spine, completes the local replication, and then sends it to Worker3 and Worker4, carrying QP, tree, and other information.

The in-network computing switch is crucial for completing the aggregation operation. The handling of aggregation on the switch is described below.
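
Before turning to the switch internals, the upstream and downstream processes above can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions: the Sum operation is assumed, the names TOPOLOGY, elementwise_sum, and multi_level_allreduce are hypothetical, and no RoCEv2 transport, QP, or tree information is modeled.

```python
from typing import Dict, List

# Topology from the example above: two leaf switches under one spine (root).
TOPOLOGY: Dict[str, List[str]] = {
    "Leaf1": ["Worker1", "Worker2"],
    "Leaf2": ["Worker3", "Worker4"],
}


def elementwise_sum(vectors: List[List[float]]) -> List[float]:
    """Reduce equally sized vectors with Sum, position by position."""
    return [sum(column) for column in zip(*vectors)]


def multi_level_allreduce(
    worker_data: Dict[str, List[float]]
) -> Dict[str, List[float]]:
    """Simulate one AllReduce(Sum); returns what each worker receives."""
    # Upstream, step 1: every leaf aggregates the data of its own workers.
    leaf_partials = {
        leaf: elementwise_sum([worker_data[w] for w in workers])
        for leaf, workers in TOPOLOGY.items()
    }
    # Upstream, step 2: the spine aggregates the partial results of the leaves.
    spine_result = elementwise_sum(list(leaf_partials.values()))
    # Downstream: the spine replicates the result to each leaf, and each leaf
    # replicates it to its workers, so every worker gets the same vector.
    return {
        worker: list(spine_result)
        for workers in TOPOLOGY.values()
        for worker in workers
    }
```

For example, if Worker1 to Worker4 contribute [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], and [7.0, 8.0], Leaf1 forwards [4.0, 6.0], Leaf2 forwards [12.0, 14.0], and every worker receives [16.0, 20.0]. Only one vector per leaf crosses the leaf-to-spine links, which is the bandwidth saving that motivates multi-level aggregation.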

+-------------------------------------------------------------------+
|                             Tree ID=1                             |
|                  slot0                            slot255         |
|      +-------+------+-----+------+   +-------+------+-----+------+|
|sum   | msg_0 |sum_1 | ... |sum_k |...|msg_255|sum_1 | ... |sum_k ||
|      +-------+------+-----+------+   +-------+------+-----+------+|
|                                                                   |
|      +-------+------+-----+------+   +-------+------+-----+------+|
|rank0 | msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
|      +-------+------+-----+------+   +-------+------+-----+------+|
|                                                                   |
|      +-------+------+-----+------+   +-------+------+-----+------+|
|rank1 | msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
|      +-------+------+-----+------+   +-------+------+-----+------+|
|                                                                   |
|      +-------+------+-----+------+   +-------+------+-----+------+|
|rank63| msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
|      +-------+------+-----+------+   +-------+------+-----+------+|
+-------------------------------------------------------------------+

                  Figure 4 The aggregation operation

As shown in the figure above, assume that:

o For a certain tree ID, the in-network computing switch needs to process 64 workers (represented by rank0-rank63).

o Each rank sends 256 messages at a time (represented by message0-message255).

o The in-network computing switch creates 256 aggregators (corresponding to slot0-slot255) for this tree ID, with each slot responsible for aggregating one column of messages.

For each slot, it is necessary to check the arrival status of each rank's data under that slot. For example, for slot0, when the messages sent by rank0-rank63 have all been received and their tree ID and message ID have been checked to be the same, the aggregation operation is performed on each data element (from data1 to datak) in these messages:

o Aggregate rank0 data1, rank1 data1, and so on, up to rank63 data1.

o Aggregate rank0 data2, rank1 data2, and so on, up to rank63 data2.

o Continue until rank0 datak, rank1 datak, and so on, up to rank63 datak.

After all the data has been aggregated, the switch performs different operations depending on whether its role is root or leaf:

o If it is the root, it sends the aggregated result to the leaf switches, clears the data under the slot, and updates the expected message ID.

o If it is a leaf, it sends the aggregated result of the slot to the root switch, and waits to receive the final aggregated result from the root before clearing the data under the slot and updating the expected message ID.

The slots run independently and do not interfere with each other. When a slot completes processing, it can start processing its next message ID on its own.

6. Packet encapsulation

Communication between in-network computing switches and in-network computing clients is done through RDMA. Because RDMA communication requires a lossless network environment, in an Ethernet environment the data messages for in-network computing are carried over RoCEv2.

RDMA generally uses RC (Reliable Connection) mode and UC (Unreliable Connection) mode. RC mode supports message acknowledgment and timeout retransmission. If a message times out without being acknowledged, that message and all subsequent messages are retransmitted. In UC mode, a connection needs to be established in advance; messages do not need to carry address information, are neither acknowledged nor retransmitted, and are not guaranteed to be received correctly by the other end.
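
Returning briefly to the slot-based aggregation of Section 5.2, the per-slot bookkeeping can be sketched as follows. This is a minimal sketch for the root role with the Sum operation, not a definitive implementation: the class and method names (Slot, TreeAggregator, on_packet) are hypothetical, mapping a message ID to a slot by taking it modulo the slot count and advancing the expected message ID by the slot count are assumptions that the text does not specify, and a leaf switch would additionally hold the slot until the final result returns from the root.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slot:
    expected_msg_id: int
    # rank -> data1..datak received for the expected message id
    pending: Dict[int, List[float]] = field(default_factory=dict)


class TreeAggregator:
    """Hypothetical per-tree aggregator state on a root INC switch."""

    def __init__(self, tree_id: int, num_ranks: int, num_slots: int) -> None:
        self.tree_id = tree_id
        self.num_ranks = num_ranks
        self.num_slots = num_slots
        self.slots = [Slot(expected_msg_id=i) for i in range(num_slots)]

    def on_packet(
        self, tree_id: int, msg_id: int, rank: int, data: List[float]
    ) -> Optional[List[float]]:
        """Record one packet; return the aggregate once all ranks arrived."""
        if tree_id != self.tree_id:
            return None  # packet belongs to another aggregation tree
        slot = self.slots[msg_id % self.num_slots]  # assumed slot mapping
        if msg_id != slot.expected_msg_id:
            return None  # not the message this slot is waiting for
        # Keyed by rank, so a repeated packet overwrites its earlier copy
        # instead of being aggregated twice.
        slot.pending[rank] = data
        if len(slot.pending) < self.num_ranks:
            return None  # still waiting for other ranks
        # Aggregate data1 across all ranks, then data2, ... up to datak.
        result = [sum(column) for column in zip(*slot.pending.values())]
        slot.pending.clear()
        slot.expected_msg_id += self.num_slots  # assumed update rule
        return result
```

Because each Slot object holds its own expected message ID and pending data, one slot can complete and move on to its next message while the other slots are still waiting, matching the independence of slots described in Section 5.2.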

In-network computing packets are encapsulated as follows:

o The standard Ethernet/IP/UDP message format is used, in which UDP destination port 4791 identifies RoCEv2 messages.

o The Base Transport Header (BTH) contains the fields that are always present in all IBA transport services.

o The RDMA Extended Transport Header (RETH), 16 bytes long, contains the additional transport fields for RDMA operations.

o The Immediate Data Extended Transport Header (ImmDt), 4 bytes long, is followed by the data related to in-network computing.

The in-network computing part of the message mainly contains the key information needed to execute the operation, which includes the following:

(1) Aggregation Tree ID: identifies the aggregation tree of the collective communication.

(2) Collective communication type: the specific operation to be performed, such as AllReduce, Broadcast, Barrier, etc.

(3) Data type: the specific data type to be operated on, such as 16-, 32-, or 64-bit IEEE 754 floating point.

(4) Operation type: the specific operation the in-network computing switch performs after receiving the collective communication message, such as Sum (add the data together), Min (find the minimum value), Max (find the maximum value), etc.

(5) Payload: the data that is transferred through RDMA in in-network computing.

7. Transport layer requirements

Data packets may be lost due to link quality, switch buffer overflow, or other abnormal conditions. If packet loss occurs, the in-network computing client is responsible for retransmission. If RC mode is used, all retransmissions are guaranteed by the RDMA transport layer. If UC mode is used, the retransmission process for in-network computing is as follows:

(1) The in-network computing client sends a packet with MessageID = n and starts the packet retransmission timer.

(2) If the corresponding response packet with MessageID = n is received before the retransmission timer times out, the packet with the next MessageID is sent and the packet retransmission timer is reset.

(3) If the packet retransmission timer times out, the packet with MessageID = n is retransmitted until it is successfully sent.

(4) A threshold N can be set so that if N timeouts occur without successful transmission, the aggregation manager is notified for error handling.

The in-network computing switch processes the data packets passively. To determine whether a received packet is a retransmission and to prevent packets from being aggregated twice, the switch needs to record whether the packet with the corresponding MessageID has already been received.

8. Security Considerations

The in-network computing scheme may introduce some security and privacy concerns. Offloading collective operations may introduce new risks to the network. The information exchanged among the INC aggregation manager, INC switches, and INC clients may be topologically sensitive. It may disclose the location of computing resources hosted in the network and at service sites, and attackers can use this information to identify vulnerable points in the network. For example, an attacker may tamper with network topology information to interrupt customer service delivery, or even to direct traffic elsewhere. The solution should support authentication and integrity protection mechanisms to enhance security.

9. IANA Considerations

TBD

10. References

10.1. Normative References

[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol Specification Version 2", RFC 5531, May 2009.

[RFC6241] Enns, R., Bjorklund, M., Schoenwaelder, J., and A. Bierman, "Network Configuration Protocol (NETCONF)", RFC 6241, June 2011.

10.2. Informative References

[I-D.yao-tsvwg-cco-problem-statement-and-usecases] Yao, K., Xu, S., Li, Y., Huang, H., and D. Kutscher, "Collective Communication Optimization: Problem Statement and Use cases", Work in Progress, Internet-Draft, draft-yao-tsvwg-cco-problem-statement-and-usecases-00, 23 October 2023.

[I-D.draft-yao-tsvwg-cco-requirement-and-analysis] Yao, K., Xu, S., Li, Y., Huang, H., Wang, W., and D. Kutscher, "Collective Communication Optimizations: Requirement and Analysis", Work in Progress, Internet-Draft, draft-yao-tsvwg-cco-requirement-and-analysis-01, 5 February 2024.

11. Acknowledgments

TBD

Authors' Addresses

Feng Liu
New H3C Technologies Co., Ltd
Hangzhou, China
Email: 11957147@qq.com

Weifeng Wang
New H3C Technologies Co., Ltd
Beijing, China
Email: wangweifeng@h3c.com

Rubing Liu
New H3C Technologies Co., Ltd
Hangzhou, China
Email: liurubing@h3c.com

Yan Mu
China Mobile
Beijing, China
Email: muyan@chinamobile.com

Kehan Yao
China Mobile
Beijing, China
Email: yaokehan@chinamobile.com