<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std"
     docName="draft-chen-idr-ctr-availability-08"
     ipr="trust200902">
  <front>
    <title abbrev="BGP for Network HA">BGP for Network High Availability</title>
     <author initials="H" surname="Chen" fullname="Huaimo Chen">
      <organization>Futurewei</organization>
      <address>
        <postal>
          <street></street>
          <city>Boston, MA</city>
          <region></region>
          <code></code>
          <country>USA</country>
        </postal>
        <email>hchen.ietf@gmail.com</email>
      </address>
    </author>

   <author initials="Y" fullname="Yanhe Fan" 
            surname="Fan">
      <organization>Casa Systems</organization>
      <address>
        <postal>
          <street></street>
          <city></city>
          <region></region>
          <code></code>
          <country>USA</country>
        </postal>
        <email>yfan@casa-systems.com</email>
      </address>
    </author>

     <author initials="A" fullname="Aijun Wang" 
            surname="Wang">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>Beiqijia Town, Changping District</street>
          <city>Beijing</city>
          <region> </region>
          <code>102209</code>
          <country>China</country>
        </postal>
        <email>wangaj3@chinatelecom.cn</email>
      </address>
    </author>

   <author initials="L" fullname="Lei Liu" 
            surname="Liu">
      <organization>Fujitsu</organization>
      <address>
        <postal>
          <street> </street>
          <city> </city>
          <region></region>
          <code></code>
          <country>USA</country>
        </postal>
        <email>liulei.kddi@gmail.com</email>
      </address>
    </author>

   <author initials="X" fullname="Xufeng Liu" 
            surname="Liu">
      <organization>IBM Corporation</organization>
      <address>
        <postal>
          <street> </street>
          <city> </city>
          <region> </region>
          <code></code>
          <country>USA</country>
        </postal>
        <email>xufeng.liu.ietf@gmail.com</email>
      </address>
    </author>

    <date year="2024"/>

    <abstract>
      <t>This document describes protocol extensions to BGP
      for improving the reliability or availability of a network 
      controlled by a controller cluster.</t>

    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">	
     <t>More and more networks are controlled by 
     central controllers or controller clusters.
     Externally, a controller cluster appears as a single controller.
     Internally, it normally consists of two or more controllers
     working together to control 
     a network, i.e., every network element (NE) in the network. 
     The reliability or availability of a network depends heavily 
     on its controller cluster. 
     Issues or failures in the controller cluster may greatly impact 
     the reliability or availability of the network.</t>

     <t>For a controller cluster comprising two or more controllers 
     (i.e., primary controller, secondary controller, and so on), 
     failures in the cluster may split the cluster into a few 
     separated controller groups. These groups are not aware of each other 
     and may be out of synchronization. 
     Two or more groups may each be elected as the primary group and control 
     the network at the same time, which may result in conflicting 
     control of the network.</t>

     <t>This document proposes procedures and extensions to BGP
     that allow the separated controllers or controller groups to learn 
     about each other and thus correctly elect one new primary controller 
     or controller group 
     when the cluster is split because of failures in the cluster.</t>  
    </section> <!-- Introduction -->


    <section title="Terminology">
    <t>The following terms are used in this document.
     <list style="hanging">
       <t hangText="BGP:">Border Gateway Protocol</t>
       <t hangText="NE:">Network Element</t>
       <t hangText="CE:">Customer Edge</t>
       <t hangText="PE:">Provider Edge</t>
      </list>
     </t>
    </section> <!-- Terminologies -->


    <section title="BGP for Controller Cluster Reliability">
    <t>This section outlines the mechanism for controller cluster 
    reliability or availability using BGP and illustrates 
    some details through a simple example.</t>

    <section title="Overview of Mechanism">
    <t>When a cluster of controllers is split into a few separated
    groups because of failures in the cluster, 
    the live controllers are still connected to the 
    network (i.e., network elements). 
    Through some of these connections, each group can obtain 
    the information about the other groups. 
    Based on this information, a new primary controller or controller 
    group is correctly elected to control the network.</t>

    <t>Each controller has a BGP session with each of a given number of 
    the same NEs in the network. The session is established and 
    maintained over an IP path between the controller and the NE, 
    and it runs BGP with the extensions defined in this document.</t>

    <t>In one example configuration, 
    the given set of NEs consists of the single NE with the highest BGP ID.
    Suppose that node PE2 is the NE with the highest BGP ID. 
    The session between the primary controller (e.g., A) and 
    this NE (e.g., PE2) is a session of BGP with extensions.
    Each of the non-primary controllers (e.g., B, C, ...) also creates 
    and maintains a BGP session with this NE (e.g., PE2).</t>

    <t>In normal operations, all the controllers in the cluster are connected.
    They are the primary controller, which controls the network, the secondary 
    controller, and so on, 
    with current positions 1, 2, and so on respectively.
    The primary controller advertises the information about the controllers 
    via its BGP sessions to the given set of NEs.</t>

    <t>For example, it sends the information in a BGP message to
    the NE (e.g., PE2), which forwards the information to each of 
    the other controllers over its BGP sessions with them.</t>

    <t>When the cluster is split into a few separated groups of controllers, 
    each group elects an intent primary controller, 
    secondary controller and so on from the group, 
    which have intent position 1, 2, and so on respectively.
    The intent primary controller in each group advertises the information 
    about the controllers in its group.</t>

    <t>The information advertised by the (intent) primary controller 
    includes its current (intent) position, its old position, 
    its priority to become a primary controller, 
    the number of controllers in its group or cluster,
    and the IDs of the controllers, ordered by their
    (intent) positions. In addition, a flag C is included, which 
    indicates whether the advertising controller is currently 
    Controlling the network (i.e., it is the current primary 
    controller rather than an intent primary controller that has 
    not yet been confirmed).</t>
    </section> <!-- Overview -->

    <section title="Example">
    <t><xref target = "cluster-2-controllers"/> 
    shows a controller cluster comprising two controllers: 
    the primary controller and the secondary controller. 
    Each controller has a BGP session with the same NE, 
    which is NE4.  

<figure anchor="cluster-2-controllers" 
 title="Controller Cluster of 2 Controllers">
  <artwork> <![CDATA[
   +---------------------------------------------------+
   | Controller Cluster                                |
   |                                                   |
   |    +------------+               +------------+    |
   |    |Controller A|  Synchronize  |Controller B|    |
   |    |(Primary)   +---------------+(Secondary) |    |
   |    +------------+               +-----------++    |
   |           ^                                 |     |
   |           |_______________                  |     |
   |                          |                  |     |
   |                          v                  |     |
   +-----------------Channels to Network---------|-----+
                         /       \               |
      Session   ---->   /         \____          |
      between          /           \   \____     | <--Session
      A and NEi       /\  .---. .---+       \    |    between
      (i=1,2,..)     |  \(     '    |'.---. |    |    B and NE4
                     |---\  Network |      '+.   |
                    (o NE1\         |       | ) /
                     (     |        |       o) /
                      (    |        |       ) NE4
                       (   o NE2    o NE3.-'
                        '               )
                         '---._.-.     )
                                  '---']]></artwork>
</figure>
 
    The primary controller (i.e., A) has a BGP session with each NE 
    in the network, including NE4.
    The secondary controller (i.e., B) has a BGP session with 
    NE4, which is established and 
    maintained over an IP path between B and NE4.</t> 

    <t>In normal operations, controller A (Primary)
    sends NE4 a BGP message containing 
    the information about the controllers connected to it.
    NE4 transfers the information to controller B (Secondary).  
    The information includes:</t>

    <t>C = 1, A's current Position = 1, A's OldPosition = 1, 
    A's Priority, NoControllers = 2, A's ID, B's ID</t>

    <t>When failures happen in the cluster, the live controllers act as follows:</t>

    <t>If the secondary controller (e.g., B) is alive 
    and the primary controller is dead,
    the secondary controller promotes itself to be the new primary controller. 
    If the primary controller is alive but separated from the secondary controller, 
    the secondary controller does not promote itself to be a new primary controller.</t>

    <t>If the primary controller (e.g., A) is alive, 
    it continues to be the primary controller.</t>

    <t>With the extensions to BGP, the secondary controller can determine 
    the status of the primary controller from  
    the information it receives about the primary controller. 
    The case in which the primary controller is alive but separated from 
    the secondary controller comprises two conditions: 
    condition a, the connection between the primary 
    controller and the secondary controller in the cluster has failed; 
    and condition b, both controllers are alive. 
    The secondary controller determines these conditions as follows:</t>

    <t>For condition a, when the heartbeat from the primary stops, 
    the secondary knows that the connection between the primary and 
    secondary controller failed.</t>

    <t>For condition b, the secondary checks whether the information about 
    the primary controller, received via the NE, has been updated within 
    a given time. 
    If so, the primary controller is alive; otherwise, it is dead.</t>
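
    <t>The two checks above can be sketched as follows. This is an 
    illustrative sketch only, not part of the protocol: the names 
    PrimaryStatus, classify_primary and should_promote are hypothetical, 
    and the value of the "given time" is an assumed local parameter.</t>

<figure>
<artwork><![CDATA[
```python
from enum import Enum


class PrimaryStatus(Enum):
    CONNECTED = "connected"   # heartbeat from the primary is received
    SEPARATED = "separated"   # conditions a and b both hold
    DEAD = "dead"             # condition a holds, condition b does not


def classify_primary(heartbeat_ok, update_age, given_time):
    """Classify the primary as seen by the secondary controller.

    heartbeat_ok: True while the in-cluster heartbeat is received.
    update_age:   seconds since the information about the primary,
                  received via the NE, was last updated.
    given_time:   the "given time" of condition b (assumed value).
    """
    if heartbeat_ok:
        return PrimaryStatus.CONNECTED
    if update_age <= given_time:
        return PrimaryStatus.SEPARATED
    return PrimaryStatus.DEAD


def should_promote(status):
    """Only a secondary whose primary is dead promotes itself."""
    return status is PrimaryStatus.DEAD
```
]]></artwork>
</figure>

    <t>In this sketch, a secondary that has lost the heartbeat but still 
    sees fresh information about the primary via the NE is classified 
    as separated and does not promote itself.</t>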
	    
    </section> <!-- Example -->  
    </section> <!-- BGP for Controller Cluster Reliability --> 


    <section title="Extensions to BGP">
      <t>This section describes extensions to BGP.</t>

    <section title="Capability">
      <t>During BGP session establishment, BGP speakers advertise 
      their support for the BGP extensions for network reliability, 
      specifically the High Availability of Controller cluster (HAC). 

      A new Controller HA Support Capability Triple is defined for HAC below. 
      A BGP speaker that supports HAC indicates this by including 
      the triple in the Capabilities Optional Parameter 
      <xref target="RFC5492"/> in its OPEN message.

<figure anchor="controller-ha-cap-triple" 
        title="Controller HA Support Capability Triple">
<artwork> <![CDATA[  
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |Cap Code (TBD1)|     Length    |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                Flags                                        |C|
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+]]></artwork>
</figure>
</t>
<t>
     <list style="hanging">
       <t hangText="Cap Code (8 bits):">TBD1 is to be assigned by IANA.</t>
       <t hangText="Length (8 bits):">It indicates the length of the Capability value 
          portion in octets, which is 4.</t>
       <t hangText="Flags (32 bits):">One flag bit, the C-bit, is defined. 
          When it is set to one, it indicates that the BGP speaker supports 
          the high availability of controller cluster as a controller. 
          When it is set to zero, it indicates that the BGP speaker supports 
          the high availability of controller cluster as a network element (NE).</t>
      </list>
</t>
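
<t>As a rough illustration of the layout above, the triple can be 
built and parsed as sketched below. The sketch assumes the C-bit is 
the last (least significant) bit of the Flags field, as drawn in the 
figure. The value of CAP_CODE_HAC is a hypothetical placeholder for 
TBD1, which has not been assigned, and the function names are 
illustrative.</t>

<figure>
<artwork><![CDATA[
```python
import struct

# Hypothetical placeholder for TBD1; not an assigned code point.
CAP_CODE_HAC = 239
# C-bit: last bit of the 32-bit Flags field, as drawn in the figure.
C_BIT = 0x00000001


def encode_hac_capability(is_controller):
    """Build Cap Code (1 octet), Length (1 octet), Flags (4 octets)."""
    flags = C_BIT if is_controller else 0
    return struct.pack("!BBI", CAP_CODE_HAC, 4, flags)


def decode_hac_capability(data):
    """Return True if the sender is a controller, False if an NE."""
    if len(data) != 6:
        raise ValueError("capability triple must be 6 octets")
    code, length, flags = struct.unpack("!BBI", data)
    if code != CAP_CODE_HAC or length != 4:
        raise ValueError("not a Controller HA Support Capability")
    return bool(flags & C_BIT)
```
]]></artwork>
</figure>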

    <t>When two BGP speakers establish a BGP session between them, 
    each speaker that supports HAC indicates this by including a 
    Controller HA Support Capability Triple in the Capabilities Optional Parameter 
    in its OPEN message.</t> 

    <t>If a BGP speaker supporting HAC receives the 
    Controller HA Support Capability Triple in the OPEN message from 
    the other BGP speaker over the BGP session, it records that the other 
    BGP speaker (i.e., the remote end of the session) supports HAC; 
    otherwise, it records that the other speaker does not. 
    Thus, for each of its BGP sessions, it knows whether the remote 
    end BGP speaker supports HAC. If the C-bit in the Triple is set to one, 
    the remote BGP speaker is a controller; otherwise, it is an NE.</t>

    <t>A BGP speaker acting as a controller and supporting HAC handles 
    the information about 
    the controllers in its cluster or group as follows:</t>

    <t>It sends the information in a BGP UPDATE message to each of 
    a given set of NEs that run BGP with HAC support 
    whenever the information changes.
    The given set of NEs may be the one NE with the highest BGP ID.</t>

    <t>It adjusts the positions of the controllers accordingly 
    whenever there is a change in the information about the controllers 
    received from an NE supporting HAC.</t>

    <t>An NE running BGP with HAC support receives the information about
    the controllers from a BGP speaker acting as a controller and supporting HAC,
    and sends the information to every other BGP speaker that acts as a 
    controller, supports HAC, and has a BGP session with the NE, 
    that is, to every such controller except the one 
    from which the information was received.</t>
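
    <t>The forwarding rule of the NE can be sketched as follows. This is 
    an illustrative sketch only: the function name relay_targets and the 
    shape of the sessions table are assumptions, with the table standing 
    for what the NE recorded from each peer's OPEN message.</t>

<figure>
<artwork><![CDATA[
```python
def relay_targets(sessions, sender):
    """Return the peers to which an NE forwards controller information.

    sessions: dict mapping a peer ID to (supports_hac, is_controller),
              as recorded from each peer's OPEN message.
    sender:   peer ID of the controller the information came from.
    """
    return sorted(
        peer
        for peer, (supports_hac, is_controller) in sessions.items()
        if supports_hac and is_controller and peer != sender
    )
```
]]></artwork>
</figure>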

    </section> <!-- Capability -->


    <section title="Controllers NLRI">
      <t>A new Address Family Identifier (AFI) and Subsequent Address Family 
      Identifier (SAFI) <xref target="RFC4760"/>, called Controllers AFI 
      and SAFI, are defined 
      to carry the information about controllers as Network Layer 
      Reachability Information (NLRI). Under the AFI and SAFI, a new NLRI, 
      called Controllers NLRI, is defined to contain the information. 
      A controller in a cluster may advertise the information
      in a BGP UPDATE message containing a Controllers NLRI of the 
      following format.

<figure anchor="controllers-NLRI" 
        title="Controllers NLRI">
<artwork> <![CDATA[  
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |              Type             |             Length            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |   Flags     |C|    Position   |  OldPosition  |   Priority    |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                 Reserved                      | NoControllers |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                  Connected Controller 1 ID                    |
 :                              :                                :
 |                  Connected Controller n ID                    |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+]]></artwork>
</figure>
</t>
<t>
     <list style="hanging">
       <t hangText="Type (16 bits):">TBD2 is to be assigned by IANA.</t>
       <t hangText="Length (16 bits):">It indicates the length of the value 
          portion in octets.</t>
       <t hangText="Flags (8 bits):">One flag bit, the C-bit, is defined. When set, 
          it indicates that the advertising controller is the current active primary 
          controller controlling the network. In this case, C = 1 and 
          Position = 1.</t>
       <t hangText="Position (8 bits):">It indicates the current/intent position 
          of the controller in the controller cluster or group. 
          1: primary (first) controller, 2: secondary controller, 3: third controller,
          and so on (i.e., a Position of value n indicates the n-th controller 
          in the cluster or group).</t>
       <t hangText="OldPosition (8 bits):">It indicates the old position of 
          the controller in the controller cluster before the cluster was split.</t>
       <t hangText="Priority (8 bits):">It indicates the priority of the 
          controller to be elected as a primary controller.</t>
       <t hangText="Reserved (24 bits):">Reserved field. It MUST be set to zero on 
          transmission and ignored on reception.</t>
       <t hangText="NoControllers (8 bits):">It indicates the number of controllers 
          in the cluster or group of the controller advertising the NLRI, 
          i.e., the number of controller IDs that follow.</t>
       <t hangText="Connected Controller i ID (32 bits):">It represents the identifier (ID)
          of controller i at position i (i = 1, ..., n) in the cluster or group.</t>
      </list>
</t>
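
<t>A rough sketch of encoding and decoding the Controllers NLRI in the 
format above is given below. It assumes the C-bit is the last (least 
significant) bit of the Flags octet, as drawn in the figure, and that 
the Length covers everything after the Type and Length fields. The 
value of NLRI_TYPE_CONTROLLERS is a hypothetical placeholder for TBD2, 
and the function names are illustrative.</t>

<figure>
<artwork><![CDATA[
```python
import struct

# Hypothetical placeholder for TBD2; not an assigned value.
NLRI_TYPE_CONTROLLERS = 1
C_BIT = 0x01  # last bit of the 8-bit Flags field


def encode_controllers_nlri(c, position, old_position, priority,
                            controller_ids):
    """Build a Controllers NLRI; controller_ids are 32-bit integers
    ordered by position."""
    flags = C_BIT if c else 0
    value = struct.pack("!BBBB", flags, position, old_position,
                        priority)
    # 24 reserved bits set to zero, then NoControllers (8 bits).
    value += struct.pack("!3sB", b"\x00\x00\x00", len(controller_ids))
    for controller_id in controller_ids:
        value += struct.pack("!I", controller_id)
    header = struct.pack("!HH", NLRI_TYPE_CONTROLLERS, len(value))
    return header + value


def decode_controllers_nlri(data):
    """Parse a Controllers NLRI into a dict of its fields."""
    if len(data) < 12:
        raise ValueError("Controllers NLRI is too short")
    nlri_type, length = struct.unpack_from("!HH", data, 0)
    if nlri_type != NLRI_TYPE_CONTROLLERS:
        raise ValueError("not a Controllers NLRI")
    flags, position, old_position, priority = struct.unpack_from(
        "!BBBB", data, 4)
    count = data[11]  # the reserved octets 8..10 are ignored
    if length != len(data) - 4 or length != 8 + 4 * count:
        raise ValueError("inconsistent Length or NoControllers")
    ids = list(struct.unpack_from("!%dI" % count, data, 12))
    return {
        "C": bool(flags & C_BIT),
        "Position": position,
        "OldPosition": old_position,
        "Priority": priority,
        "ControllerIDs": ids,
    }
```
]]></artwork>
</figure>

<t>With this layout, the value portion is 8 octets plus 4 octets per 
controller ID, so an NLRI listing two controllers has a Length of 16.</t>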
    </section> <!-- EController NLRI -->
    </section>  <!-- Extensions to BGP -->


    <section title="Recovery Procedure">
    <t>This section describes the recovery procedure for 
    a controller cluster of n (n > 2) controllers, which are 
    the primary controller A, the secondary controller B, ..., 
    the n-th controller N.</t> 

    <t>When failures happen in the cluster, it may be split 
    into a few separated groups of controllers. 
    Under one policy, the group with the maximum number of controllers 
    is responsible for controlling the network as the primary group of 
    the cluster, and the new primary controller, secondary controller, 
    and so on are elected within that group.</t>

    <t>For each separated group of controllers,
    the intent primary controller, secondary controller, and so on are elected.
    The intent primary controller of the group advertises the information 
    about its group. 
    The information includes its intent position, 
    its old position,
    its priority to become a primary controller, 
    the number of controllers in the group, and 
    the identifiers of the controllers in the group. 
    The identifiers of the controllers are ordered according to their positions: 
    the identifier of the intent primary controller, which has position 1, 
    is the first one; 
    the identifier of the intent secondary controller, which has position 2, 
    is the second one; and so on. 
    Thus every separated group obtains the information about the other groups and 
    can determine which group has the maximum number of controllers. </t>

    <t>In the case of a tie (i.e., two or more groups have the same maximum number 
    of controllers), 
    one policy is that the group whose controller has the highest old position 
    (i.e., the smallest OldPosition value, such as the old primary controller 
    with OldPosition = 1) wins. 
    Another policy is that the group with the highest priority controller wins.</t>
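
    <t>The group election and the two tie-break policies can be sketched 
    as follows. This is an illustrative sketch under stated assumptions: 
    the names GroupAdvert and elect_primary_group are hypothetical, the 
    old position compared is the OldPosition advertised by each intent 
    primary controller, and a larger Priority value is assumed to mean 
    a higher priority. Ties that remain after the tie-break are not 
    resolved by the sketch.</t>

<figure>
<artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupAdvert:
    primary_id: str        # ID of the intent primary controller
    old_position: int      # its OldPosition (1 = old primary)
    priority: int          # its Priority (assumed: larger is higher)
    controller_ids: tuple  # IDs of all controllers in the group


def elect_primary_group(adverts, tie_break="old_position"):
    """Pick the group with the most controllers, then tie-break."""
    if tie_break not in ("old_position", "priority"):
        raise ValueError("unknown tie-break policy: " + tie_break)

    def rank(advert):
        size = len(advert.controller_ids)
        if tie_break == "old_position":
            # A smaller OldPosition value is a higher old position.
            return (size, -advert.old_position)
        return (size, advert.priority)

    return max(adverts, key=rank)
```
]]></artwork>
</figure>

    <t>Because every group receives the same set of advertisements and 
    applies the same policy, each group is expected to reach the same 
    result independently.</t>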

    <t>The recovery procedure 
    in the current or intent primary controller  
    of a controller cluster or group is as follows.</t>

    <t>In normal operations, the primary controller advertises the 
    information about controllers containing:</t>
    <t>C = 1, Position = 1, OldPosition = 1, 
    the primary controller's Priority, NoControllers = n, the primary controller's ID, 
    the secondary controller's ID, ..., and the n-th controller's ID.</t>

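    <t>When failures cause the cluster to split, the old primary controller, 
    acting as the intent primary controller of its group, advertises 
    the information about controllers containing:</t>
    <t>C = 0, Position = 1, OldPosition = 1, the intent primary controller's Priority, 
    NoControllers = m (m is the number of controllers in the group 
    that the primary controller belongs to after the failures), 
    the intent primary controller's ID, and the IDs of the other controllers 
    in the group. The intent primary controller of any other group advertises 
    the same kind of information with its own OldPosition.</t>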
 
    <t>After a given time, it checks whether its group has been elected as the primary 
    group. If so, it advertises the information about controllers containing:</t>
 
    <t>C = 1, Position = 1, OldPosition = 1, its Priority, NoControllers = m, 
    and the IDs of the controllers in the group.</t>
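
    <t>The sequence of advertisements above can be sketched as follows. 
    This is an illustrative sketch only: the function names and the 
    dictionary representation are assumptions, and the wait for the 
    "given time" and the election itself are outside the sketch.</t>

<figure>
<artwork><![CDATA[
```python
def build_advert(controller_ids, old_position, priority, controlling):
    """Fields advertised by a primary or intent primary controller.

    controller_ids are ordered by position; the first one is the
    advertising controller itself.
    """
    return {
        "C": 1 if controlling else 0,
        "Position": 1,
        "OldPosition": old_position,
        "Priority": priority,
        "NoControllers": len(controller_ids),
        "ControllerIDs": list(controller_ids),
    }


def advert_after_wait(split_advert, group_elected):
    """After the given time, set C = 1 only if the group was elected."""
    if not group_elected:
        return split_advert
    updated = dict(split_advert)
    updated["C"] = 1
    return updated
```
]]></artwork>
</figure>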

    <t>As an example, suppose that failures split the cluster into two separated groups:
    group 1 comprising A and C, and group 2 comprising B and N. 
    Each group elects its intent primary controller, secondary controller, 
    and so on. 
    Suppose that controllers A and C are elected as the intent primary and 
    secondary controllers respectively in group 1, and 
    controllers B and N are elected as the intent primary and secondary 
    controllers respectively in group 2.</t>

    <t>Each of the intent primary controllers A and B advertises 
    the information about the controllers in its group. 
    The information advertised by A includes: </t>
    <t>C = 0, Position = 1, OldPosition = 1, 
    A's Priority, NoControllers = 2, A's ID, C's ID.</t>

    <t>The information advertised by B includes:</t> 
    <t>C = 0, Position = 1, OldPosition = 2, 
    B's Priority, NoControllers = 2, B's ID, N's ID.</t>

    <t>Groups 1 and 2 have the same number of controllers, which is 2. 
    However, the old position advertised for group 1 (OldPosition = 1) is 
    higher than that advertised for group 2 (OldPosition = 2). 
    Group 1 is therefore elected as the primary group, and 
    the intent primary controller A in the primary group is determined 
    to be the current primary controller. 
    After the determination, the information about the controllers 
    in group 1 (i.e., the primary group) is changed. 
    The updated information advertised by A includes:</t>

    <t>C = 1, Position = 1, OldPosition = 1, 
    A's Priority, NoControllers = 2, A's ID, C's ID.</t>

    </section> <!-- Recovery Procedures -->

 
    <section anchor="IANA" title="IANA Considerations">
      <t>TBD</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>TBD</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
      <?rfc include="reference.RFC.4271"?>
      <?rfc include="reference.RFC.4760"?>
      <?rfc include="reference.RFC.5492"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.RFC.8283"?>
    </references>

  </back>

</rfc>
