<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-wang-ffd-framework-02" ipr="trust200902">
  <front>
    <title abbrev="Framework of FFD for IP-based Network">Framework of Fast
    Fault Detection for IP-based Network</title>

    <author fullname="Haibo Wang" initials="H." surname="Wang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>rainsword.wang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Fengwei Qin" initials="F." surname="Qin">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>qinfengwei@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Lily Zhao" initials="L." surname="Zhao">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 3 Shangdi Information Road</street>

          <city>Beijing</city>

          <region/>

          <code>100085</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>Lily.zhao@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Shuanglong Chen" initials="S." surname="Chen">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>chenshuanglong@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>hongyi.huang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="01" month="March" year="2024"/>

    <abstract>
      <t>IP-based distributed systems and application-layer software often
      use heartbeats to track the status of the network topology. However,
      heartbeat timers are typically set to long values, which prolongs fault
      detection. IP-based storage networks are a typical example of this
      scenario. When a fault occurs in an IP-based storage network, NVMe
      connections need to be switched over. Currently, no effective method of
      fast fault detection is available, so switchover is triggered only by
      keepalive timeout, which results in poor performance.</t>

      <t>This document defines a basic framework in which the network assists
      host devices in quickly detecting application connection failures
      caused by network faults.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Today, distributed systems based on network communication are widely
      used. Heartbeats are a common technique for ensuring that both ends of
      a distributed system can perceive faults. However, relying on
      heartbeats to detect whether the peer is faulty is challenging: if the
      heartbeat interval is too short, network disturbances may be mistaken
      for faults; if it is too long, a fault may go undetected for a long
      time.</t>

      <t>IP-based NVMe, distributed storage, and cluster computing are
      typical application scenarios for such technologies.</t>

      <t><xref target="I-D.guo-ffd-requirement"/> describes the problems of
      current IP-based NVMe solutions. In an IP-based storage area network,
      if the access link of a storage device fails, hosts cannot access the
      storage device. Because a host cannot detect the fault directly, it has
      to wait for the keepalive (KA) timeout. To speed up fault detection,
      hosts and storage devices can implement fast KA or BFD. However, this
      approach introduces additional cost on hosts and storage devices and is
      hard to deploy in large-scale IP-based storage area networks. In fact,
      the IP network can detect these faults directly, so it can be used to
      help the access endpoints perceive faults quickly and thus speed up
      service recovery.</t>
    </section>

    <section title="Terminology">
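      <t>BFD: Bidirectional Forwarding Detection</t>

      <t>ECMP: Equal-Cost Multipath</t>

      <t>FC: Fibre Channel</t>

      <t>KA: Keepalive</t>

      <t>LLDP: Link Layer Discovery Protocol</t>

      <t>NoF: NVMe over Fabrics</t>

      <t>NVMe: Non-Volatile Memory Express</t>

      <t>OSPF: Open Shortest Path First</t>

      <t>SAN: Storage Area Network</t>

      <t>TOR: Top of Rack</t>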
    </section>

    <section title="Reference Models">
      <t>The framework described here applies to systems in which terminals
      are directly connected to the IP network.</t>

      <t><figure>
          <artwork align="center"><![CDATA[ +--------+    +-----------+     +-----------+    +--------+
 |Terminal|----| IP Network|-----| IP Network|----|Terminal|
 | Device |    |   Device  |     |   Device  |    | Device |
 +--------+    +-----------+     +-----------+    +--------+
             Figure 1 : Basic framework
]]></artwork>
        </figure>Terminals are connected to the IP network and establish IP
      connections using the reachability that the IP network provides. When
      the connection path fails, the terminals cannot detect the failure
      quickly: they detect it only after the keepalive timeout expires, and
      only then can they perform service protection, which may take a
      relatively long time. Therefore, the network needs to notify terminal
      devices of failures within it, such as access port failures and
      internal network failures, that cause IP connection failures between
      terminals, so that the terminal devices can respond quickly and perform
      the corresponding service processing.</t>

      <t>As described in the Introduction, typical scenarios include IP-based
      NVMe, distributed storage, and cluster computing. This document uses
      IP-based NVMe as the example scenario; the processing behavior in the
      other scenarios is similar.</t>

      <t>An IP-based storage area network mainly includes three types of
      roles:</t>

      <t>o Initiator: a terminal device, also called the host.</t>

      <t>o Switch: a network device that provides access for terminal
      devices.</t>

      <t>o Target: also a terminal device, also called the storage
      device.</t>

      <t>Hosts and storage devices use Ethernet-based NVMe protocols to
      transmit data over the IP network, providing high-performance storage
      services.</t>

      <section title="Small-scale SAN">
        <t><figure align="center">
            <artwork><![CDATA[               +--+       +--+
    Host       |H1|       |H2|
 (Initiator)   +-,+       +_.+
                | `',   _-` |
                |    _-`    |
                | _-`   `', |
      IP     +----+       +----+
   Network   | SW1|       | SW2|
             +---,+       +_.--+
                | `',   _-` |
                |    `',    |
                | _-`   `', |
   Storage     +-`+       +`'+
   (Target)    |S1|       |S2|
               +--+       +--+
     Figure 2 : Small-scale SAN
]]></artwork>
          </figure></t>

        <t>This is the basic model for small-scale storage access networks.
        Hosts and storage devices are dual-homed to different switches.</t>

        <t>When the access link of a storage device fails, the host needs to
        detect the fault quickly so that the NVMe connection can be switched
        to the standby path.</t>
      </section>

      <section title="Large-scale SAN">
        <t><figure>
            <artwork align="center"><![CDATA[               +--+      +--+      +--+      +--+
   Host        |H1|      |H2|      |H3|      |H4|
(Initiator)    +/-+      +-,+      +.-+      +/-+
                |         | '.   ,-`|         |
                |         |   `',   |         |
                |         | ,-`  '. |         |
              +-\--+    +--`-+    +`'--+    +-\--+
              | SW1|    | SW2|    | SW3|    | SW4|
              +--,-+    +---,,    +,.--+    +-.--+
                  `.          `'.,`         .`
                    `.   _,-'`    ``'.,   .`
    IP              +--'`+            +`-`-+
  Network           | SW5|            | SW6|
                    +--,,+            +,.,-+
                    .`   `'.,     ,.-``   ',
                  .`         _,-'`          `.
              +--`-+    +--'`+    `'---+    +-`'-+
              | SW7|    | SW8|    | SW9|    |SW10|
              +-.,-+    +-..-+    +-.,-+    +-_.-+
                | '.   ,-` |        | `.,   .' |
                |   `',    |        |    '.`   |
                | ,-`  '.  |        | ,-`  `', |
  Storage      +-`+      `'\+      +-`+      +`'+
  (Target)     |S1|      |S2|      |S3|      |S4|
               +--+      +--+      +--+      +--+
               Figure 3 : Large-scale SAN
]]></artwork>
          </figure></t>

        <t>This model applies to large-scale storage access networks, in
        which hosts and storage devices are interconnected through multiple
        tiers of switches.</t>

        <t>When the access link of a storage device fails, the host needs to
        detect the fault quickly so that the NVMe connection can be switched
        to the standby path.</t>
      </section>
    </section>

    <section title="Functional Components">
      <t>An IP-based NVMe SAN consists of storage devices, hosts, and
      switches. Hosts and storage devices need to obtain the required fault
      information from the IP network. Switches need to synchronize locally
      detected fault information across the IP network so that other
      switches can learn of the faults and notify the hosts or storage
      devices that require the fault information.</t>

      <section title="Storage Device">
        <t>As the server side, storage devices provide storage access services
        for hosts. If a storage device connected to the IP network is
        interested in the status of other devices, it can send a subscription
        request to its access switch to receive status notifications about
        those devices.</t>

        <t>To reduce the complexity of storage devices, it is suggested that
        LLDP be extended to support subscription from storage devices to
        switches, and that the switch use this new L2-based protocol to
        notify the storage device of status changes.</t>

        <t><figure align="center">
            <artwork><![CDATA[  +-------+                  +------+
  |Storage|                  |Switch|
  +-------+                  +------+
      |      Subscribe Msg      |
      | ----------------------->|
      |                         |
      |     Notification Msg    |
      | <-----------------------|
      |                         |
      |                         |
      Figure 4 : Storage Device
]]></artwork>
          </figure></t>
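
        <t>As an illustration only, the following Python sketch shows how a
        Subscribe Msg might be carried in an LLDP Organizationally Specific
        TLV (TLV type 127 in IEEE 802.1AB). The OUI, the subtype values, and
        the payload format (a list of IPv4 addresses of interest) are
        hypothetical assumptions; this framework does not define an
        encoding.</t>

        <t><figure>
            <artwork><![CDATA[
```python
import ipaddress
import struct

# IEEE 802.1AB: Organizationally Specific TLVs use TLV type 127 and
# carry a 3-octet OUI, a 1-octet subtype, and up to 507 octets of
# information string.
ORG_SPECIFIC_TLV_TYPE = 127
MAX_ORG_INFO_LEN = 507

# Hypothetical values: this framework does not assign an OUI,
# subtypes, or a payload format.
EXAMPLE_OUI = bytes.fromhex("000000")
SUBTYPE_SUBSCRIBE = 1
SUBTYPE_NOTIFY = 2


def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    """Encode an LLDP TLV: 7-bit type, 9-bit length, then value."""
    if not 0 <= tlv_type <= 127:
        raise ValueError("TLV type must fit in 7 bits")
    if len(value) > 511:
        raise ValueError("TLV value length must fit in 9 bits")
    header = (tlv_type << 9) | len(value)
    return struct.pack("!H", header) + value


def decode_tlv(data: bytes) -> tuple:
    """Return (type, value) for the TLV at the start of data."""
    (header,) = struct.unpack("!H", data[:2])
    length = header & 0x1FF
    return header >> 9, data[2:2 + length]


def build_subscribe(target_ips: list) -> bytes:
    """Hypothetical Subscribe Msg: IPv4 addresses of interest."""
    info = b"".join(ipaddress.IPv4Address(ip).packed
                    for ip in target_ips)
    if len(info) > MAX_ORG_INFO_LEN:
        raise ValueError("too many addresses for one TLV")
    value = EXAMPLE_OUI + bytes([SUBTYPE_SUBSCRIBE]) + info
    return encode_tlv(ORG_SPECIFIC_TLV_TYPE, value)
```
]]></artwork>
          </figure></t>

        <t>A Notification Msg could reuse the same TLV layout with a
        different, equally hypothetical, subtype such as SUBTYPE_NOTIFY.</t>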
      </section>

      <section title="Host">
        <t>The host is the client of the storage device. As the client side,
        a host needs to learn quickly about the service status of the storage
        devices that serve it. When the host receives a notification message
        from the switch indicating that a storage device is faulty, the host
        quickly disconnects from that storage device and switches to a
        redundant one.</t>

        <t>The recommended protocol on the host side is the same as that on
        the storage device side, that is, an LLDP extension.</t>

        <t><figure>
            <artwork><![CDATA[+-------+                  +------+
|  HOST |                  |Switch|
+-------+                  +------+
    |      Subscribe Msg      |
    | ----------------------->|
    |                         |
    |     Notification Msg    |
    | <-----------------------|
    |                         |
    |                         |
     Figure 5 : Host Device
]]></artwork>
          </figure></t>
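
        <t>The host behavior described above can be sketched as follows. The
        sketch assumes that a Notification Msg carries a list of faulty
        storage IP addresses and that the host tracks one active path and one
        or more standby paths per storage service; the class and function
        names are hypothetical.</t>

        <t><figure>
            <artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Path:
    """One NVMe connection from the host to a storage target IP."""
    target_ip: str
    up: bool = True


@dataclass
class Subsystem:
    """Hypothetical host view of one storage service and its paths."""
    name: str
    paths: list
    active: int = 0


def handle_notification(subsystems, faulty_ips):
    """Apply a Notification Msg listing faulty storage IP addresses.

    Paths to those addresses are marked down; any subsystem whose
    active path went down fails over to its first healthy path.
    Returns the names of subsystems that switched paths.
    """
    faulty = set(faulty_ips)
    switched = []
    for sub in subsystems:
        for path in sub.paths:
            if path.target_ip in faulty:
                path.up = False
        if sub.paths[sub.active].up:
            continue
        for index, path in enumerate(sub.paths):
            if path.up:
                sub.active = index
                switched.append(sub.name)
                break
    return switched
```
]]></artwork>
          </figure></t>

        <t>Tearing down the NVMe association on the failed path and resuming
        I/O on the new path are outside the scope of this sketch.</t>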
      </section>

      <section title="Network Device">
        <t>Switches can detect local faults quickly and synchronize them to
        other switches in the IP network. After detecting a fault, a switch
        needs to notify the hosts or storage devices that require the fault
        information.</t>

        <t><figure>
            <artwork><![CDATA[+------+                  +------+
|Switch|                  |Switch|
+------+                  +------+
   |    Information Sync     |
   | ----------------------->|
   |                         |
   |                         |
   |                         |
    Figure 6 : Network Device
]]></artwork>
          </figure></t>
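
        <t>Because Information Sync between switches runs over a reliable
        byte-stream transport such as TCP, the receiver has to recover message
        boundaries itself. The following sketch uses a hypothetical
        length-prefixed message layout; neither the layout nor the names are
        defined by this framework.</t>

        <t><figure>
            <artwork><![CDATA[
```python
import ipaddress
import struct

# Hypothetical Information Sync layout (not defined by this
# framework): 2-octet total length, 1-octet message type,
# 1-octet status, then zero or more 4-octet IPv4 addresses.
HEADER = struct.Struct("!HBB")
MSG_FAULT = 1
STATUS_DOWN, STATUS_UP = 0, 1


def encode_sync(status: int, ips: list) -> bytes:
    body = b"".join(ipaddress.IPv4Address(ip).packed for ip in ips)
    return HEADER.pack(HEADER.size + len(body), MSG_FAULT,
                       status) + body


class SyncReader:
    """Reassembles Information Sync messages from a byte stream.

    TCP delivers bytes reliably and in order but does not preserve
    message boundaries, so the receiver buffers data until a whole
    length-prefixed message has arrived.
    """

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Add received bytes; return complete (status, ips) pairs."""
        self._buf += data
        messages = []
        while len(self._buf) >= HEADER.size:
            length, _msg_type, status = HEADER.unpack_from(self._buf)
            if length < HEADER.size or (length - HEADER.size) % 4:
                raise ValueError("malformed length field")
            if len(self._buf) < length:
                break  # wait for the rest of this message
            body = bytes(self._buf[HEADER.size:length])
            del self._buf[:length]
            ips = [str(ipaddress.IPv4Address(body[i:i + 4]))
                   for i in range(0, len(body), 4)]
            messages.append((status, ips))
        return messages
```
]]></artwork>
          </figure></t>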
      </section>
    </section>

    <section title="Procedures">

      <section title="Network Deployment">
        <t>The IP-based SAN uses standard Ethernet technology, and network
        deployments typically use existing IP technologies. For example, OSPF
        is commonly deployed as the underlay protocol.</t>
      </section>

      <section title="Storage and Host Access">
        <t>Hosts and storage devices are connected to the Ethernet network,
        and the administrator assigns access IP addresses to them. In most
        scenarios, the routes to these addresses can be advertised through the
        underlay protocol. In addition, after hosts and storage devices come
        online, they need to send subscription requests to the switch to
        obtain the status information of their target devices.</t>

        <t>So that hosts and storage devices do not need to be configured with
        an additional IP address for this exchange, it is recommended that the
        subscription message be implemented using LLDP.</t>
      </section>

      <section title="Status Information Sync and Notification">
        <t>When hosts and storage devices come online, the access switch can
        compute the initial state of these devices and synchronize that state
        across the IP network.</t>

        <t>After detecting a local fault, the switch needs to notify the other
        attached devices that need the fault information. In addition, the
        switch needs to synchronize the fault information to the other
        switches in the network. To ensure that synchronization messages are
        reliably delivered to the other switches, a reliable transport
        protocol, such as TCP or QUIC, must be used. For large-scale IP
        networks, hierarchical synchronization can be used to reduce the
        number of sessions between switches.</t>
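
        <t>A rough comparison of session counts illustrates why hierarchical
        synchronization helps. The sketch assumes a two-level design in which
        each leaf switch keeps one reliable session to every hub switch (for
        example, a spine acting as a synchronization relay) and the hubs are
        fully meshed; this design is an assumption for illustration, not one
        specified by the framework.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def full_mesh_sessions(switches: int) -> int:
    """Every switch peers with every other switch: n(n-1)/2."""
    return switches * (switches - 1) // 2


def hierarchical_sessions(leaves: int, hubs: int) -> int:
    """Assumed two-level design: each leaf switch peers with every
    hub switch, and the hub switches form a full mesh."""
    return leaves * hubs + full_mesh_sessions(hubs)


if __name__ == "__main__":
    print(full_mesh_sessions(100))       # 4950
    print(hierarchical_sessions(96, 4))  # 390
```
]]></artwork>
          </figure></t>

        <t>With 100 switches, a full mesh needs 4950 sessions, whereas 96
        leaves and 4 hubs need 390. The full-mesh count grows quadratically
        with the number of switches, while the hierarchical count grows
        linearly with the number of leaves.</t>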

        <t>The synchronized status information about hosts and storage
        devices is application-layer information.</t>

        <t><figure align="center">
            <artwork><![CDATA[+-------+           +----+      +------+      +----+         +-------+
|  HOST |-----------|TOR1|------|Spine1|------|TOR3|---------|Storage|
+---/---+           +-/--+      +--/---+      +-/--+         +---/---+
    |---------------->|  Info Sync |  Info Sync |<---------------|
    |  SubscribeMsg   |----------->|<-----------|  Subscribe Msg |
    |                 |<-----------|----------->|                |
    |<----------------|  Info Sync |  Info Sync |                |
    |Notification Msg |            |            |                |
    |                 |            |            |                |
            Figure 7 : Information Advertisement
]]></artwork>
          </figure></t>

        <section title="Access Link Failure">
          <t>When an access link fails, the access switch detects the fault
          and, based on the failed link, determines which device IP addresses
          are affected. The access switch advertises the affected IP addresses
          on its other access links and synchronizes them across the IP
          network. After receiving the synchronized fault information, the
          other switches notify their attached hosts or storage devices of
          the fault.</t>
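
          <t>The access-link procedure above can be sketched as follows,
          assuming that the access switch knows which IP addresses are
          attached to each access port and which addresses each attached
          device has subscribed to. How this state is learned, and the
          hypothetical class and method names, are assumptions of the sketch
          rather than part of the framework.</t>

          <t><figure>
              <artwork><![CDATA[
```python
from collections import defaultdict


class AccessSwitch:
    """Hypothetical fault-handling state of one access switch.

    attached: port -> IP addresses of the device on that access port
        (learning them, e.g. from ARP/ND, is assumed here).
    subscriptions: port -> IP addresses that the device on that port
        subscribed to.
    """

    def __init__(self) -> None:
        self.attached = defaultdict(set)
        self.subscriptions = defaultdict(set)

    def notifications_for(self, faulty, skip_port=None):
        """Map each subscribed port to the faulty IPs it asked for."""
        result = {}
        for port, wanted in self.subscriptions.items():
            if port == skip_port:
                continue
            hit = wanted & faulty
            if hit:
                result[port] = sorted(hit)
        return result

    def on_link_down(self, port):
        """Return (local notifications, IP addresses to sync)."""
        faulty = set(self.attached.get(port, ()))
        local = self.notifications_for(faulty, skip_port=port)
        return local, sorted(faulty)

    def on_sync_received(self, faulty_ips):
        """Faults learned from another switch: notify subscribers."""
        return self.notifications_for(set(faulty_ips))
```
]]></artwork>
            </figure></t>

          <t>on_link_down() returns both the notifications for devices on the
          switch's other access links and the address list to synchronize to
          peer switches; on_sync_received() models a remote switch notifying
          its own attached devices.</t>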
        </section>

        <section title="Network Link or Device Failure">
          <t>ECMP or redundant link protection is usually deployed so that a
          network link or device failure does not interrupt the IP
          connections between terminals.</t>
        </section>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>To limit the scope over which this information is communicated and
      to reduce the negative impact of possible information flooding, it is
      suggested that the Subscribe Msg and Notification Msg in this framework
      be implemented as an L2 protocol extension, so that these messages are
      sent and received only between terminal devices and their access
      network devices within the domain. In addition, network devices are
      not allowed to forward these messages; they are only allowed to send
      or receive them as needed.</t>

      <t>The communication protocol between network devices can be secured
      with commonly used mechanisms, including but not limited to TCP-AO,
      which provides authentication and integrity protection, and TLS, which
      additionally provides encryption.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document makes no request of IANA.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.I-D.guo-ffd-requirement"?>
    </references>
  </back>
</rfc>
