Internet Draft                                          S. Bandyopadhyay
draft-shyam-hn-ipv6-00.txt                                 January, 2008
Intended status: Proposed Standard
Expires: July, 2008


                    Hierarchical Networking and IPv6

Status of this memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


Abstract

   This document proposes an approach for reorganizing the entire
   network within a large address space. It describes how the address
   space can be distributed among regions, and among sub regions inside
   each region, by establishing a mesh structured hierarchy. It
   addresses issues relevant to this architecture in the context of
   IPv6. It also proposes an approach by which an IP switch based
   network can perform as well as an ATM network when processing real
   time traffic.


Bandyopadhyay              Expires July, 2008                   [Page 1]

Internet Draft      Hierarchical Networking and IPv6       January, 2008


1. Introduction

   The transition from IPv4 to IPv6 is in progress. Work has been done
   to upgrade individual nodes (workstations) from IPv4 to IPv6. There
   are also established documents describing how routers/switches can
   support IPv4 and IPv6 packets at the same time in order to make the
   transition possible [12]. So far, no published document describes how
   hierarchy can be established within the network. There are
   documented concerns that BGP table entries are becoming too numerous
   in the existing system [11]. There have also been proposals to
   extend the autonomous system number from 16 bits to 32 bits to meet
   demand. The challenge lies in making the transition smooth with the
   fewest changes. An ATM network performs faster than a network of IP
   switches, and the difference becomes more prominent for real time
   applications. ATM networks, however, use bandwidth less efficiently
   than IP-switch based networks. This document proposes approaches
   that allow an IP-switch based network to process real-time
   applications as fast as an ATM network, together with a mesh
   structured hierarchical network for routing convenience.

2. A Three-tier mesh structured hierarchical network

   The existing system works with an autonomous system (AS) layer and an
   inter-AS layer using the CIDR approach [2]. If the same approach is
   continued with a larger network ID, the load on the switches will be
   too high. If hierarchy can be established within the network-ID
   portion, routing can be made simpler. If the network is designed
   with a fixed prefix length for the autonomous system everywhere,
   routing information for the rest of the network is confined to the
   remaining part of the network prefix. The entire network can then be
   viewed as a network of inter-AS layer nodes. Each node in the
   inter-AS layer can act as one of the following: only a router in the
   inter-AS layer; a router in the inter-AS layer with an autonomous
   system attached to it like a tree; or an autonomous system with
   multiple area border routers (ABRs), appearing like a mesh. A mesh
   structured hierarchy is thus established between the AS layer and
   the inter-AS layer, with each AS having a fixed prefix length.

   In a similar manner, a mesh structured hierarchy may be established
   within the inter-AS layer. The inter-AS layer may be split into
   inter-AS-top and inter-AS-bottom. To maintain this hierarchy, each
   node of inter-AS-top needs to have multiple inter-AS-ABRs, in the
   same way that an autonomous system maintains ABRs. The entire network
   will then appear as a network of nodes of the inter-AS-top layer.
   Each node of the inter-AS-top will have a fixed prefix length, i.e.
   each node of the inter-AS-top will contain a fixed number of
   inter-AS-bottom nodes.


   With a three-tier mesh structured hierarchy in the network layer,
   the network ID can be viewed as A.B.C. If pA, pB and pC are the
   prefix lengths of the inter-AS-top, inter-AS-bottom and AS layers
   respectively, there will be 2^pA nodes at the inter-AS-top layer,
   2^pB inter-AS-bottom nodes inside each inter-AS-top node, and 2^pC
   nodes at the AS layer inside each inter-AS-bottom node. The entire
   space is thus divided into a fixed number of regions, and each region
   is divided into a fixed number of sub regions. This division is
   expected to be based on geography, population density, demand and
   related factors. For example, if pA=pB=16, there will be 2^16
   (65536) nodes at the inter-AS-top layer. If each state in the USA is
   assigned a top-level node, each of them will have 2^16
   inter-AS-bottom nodes, and each inter-AS-bottom node will have an
   autonomous system with prefix length pC. Introducing a mesh
   structured hierarchy at the inter-AS layer has several advantages:

        Load at each router will be reduced substantially.
        The CIDR style approach, and the complexity related to
           prefix reduction, will not be needed.
        Protection against failure will become more stable.
        A full mesh hierarchy will distribute traffic evenly.
        Physical cable connections can be optimized.
        Administrative issues will become easier.
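
   As an illustration only, the A.B.C view of the network ID can be
   sketched as a bit-field split. The sketch assumes pA=pB=16 and
   borrows pC=8 with a 24-bit user-ID from the values suggested later
   in this document; the names HierAddress and split_address are
   hypothetical and not part of any specification.

```python
from typing import NamedTuple


class HierAddress(NamedTuple):
    a: int     # inter-AS-top node (region)
    b: int     # inter-AS-bottom node (sub region)
    c: int     # node inside the autonomous system
    user: int  # user-ID (subnet-ID + user-ID at the edge router)


def split_address(addr: int, pa: int = 16, pb: int = 16,
                  pc: int = 8, user_bits: int = 24) -> HierAddress:
    """Split an integer address into A.B.C and user-ID fields.

    The default widths are assumptions chosen for illustration.
    """
    total = pa + pb + pc + user_bits
    if not 0 <= addr < (1 << total):
        raise ValueError("address does not fit in %d bits" % total)
    user = addr & ((1 << user_bits) - 1)
    c = (addr >> user_bits) & ((1 << pc) - 1)
    b = (addr >> (user_bits + pc)) & ((1 << pb) - 1)
    a = addr >> (user_bits + pc + pb)
    return HierAddress(a, b, c, user)


# Region 1, sub region 2, AS node 3, user 4 in a 64-bit address.
example = split_address(0x0001000203000004)
```

   Because every field has a fixed width, a router can extract A, B or
   C with a shift and a mask, without any longest-prefix matching.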

2.1. Route propagation

   With the hierarchy established, routing information that is
   established inside one node of inter-AS-top does not need to be
   propagated to another node of inter-AS-top. The entire routing
   information of the inter-AS-top layer needs to be propagated to the
   inter-AS-bottom layer. Each router of the inter-AS layer will
   therefore have two tables of information, one for the inter-AS-top
   layer and another for the inter-AS-bottom layer of the inter-AS-top
   node it belongs to. The same BGP protocol will work with one rule
   applied at the inter-AS-ABRs: an inter-AS-ABR will not propagate the
   routing information of the inter-AS-bottom layer of its domain to an
   inter-AS-ABR of another domain. That is, an inter-AS-ABR of one
   top-level node will propagate only the routing information of the
   inter-AS-top layer to an inter-AS-ABR of another node. Inside a node
   of inter-AS-top, routing information of both inter-AS-top and
   inter-AS-bottom needs to be propagated from one inter-AS router to
   each neighboring inter-AS router.
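
   The propagation rule above can be sketched as an export filter. This
   is a minimal illustration, not BGP itself: the Route record and the
   routes_to_advertise function are hypothetical names, and a real
   implementation would express the rule as routing policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    layer: str     # "top" (inter-AS-top) or "bottom" (inter-AS-bottom)
    dest: int      # destination node number within that layer
    next_hop: str  # identifier of the next-hop router


def routes_to_advertise(routes, peer_in_same_top_node):
    """Return the routes an inter-AS router may send to a peer.

    Inside one inter-AS-top node both layers are propagated; across
    the boundary only inter-AS-top routes are propagated.
    """
    if peer_in_same_top_node:
        return list(routes)
    return [r for r in routes if r.layer == "top"]


table = [Route("top", 7, "abr-1"), Route("bottom", 42, "r-9")]
to_foreign_abr = routes_to_advertise(table, peer_in_same_top_node=False)
to_local_router = routes_to_advertise(table, peer_in_same_top_node=True)
```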

   For a network with pA=pB=16, the network will be viewed as a network
   of inter-AS-top nodes identified by 16-bit prefixes. As the number of
   nodes is not too high, each node can have an entry for every other
   node in the routing table. There will therefore be a maximum of 2^16
   entries in the routing table of the inter-AS-top layer. Inside each
   node of the inter-AS-top layer there is a smaller network of
   inter-AS-bottom nodes. With a 16-bit prefix for the inter-AS-bottom
   layer, there will be a routing table of 2^16 entries for forwarding
   packets from one inter-AS-bottom node to another. BGP tables will
   therefore have a maximum of 2*65536 (i.e. 131072) entries.
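
   The table size arithmetic can be checked with a short calculation.
   The comparison against a flat table with one entry per
   inter-AS-bottom node in the whole network is an assumption added for
   illustration; the function names are hypothetical.

```python
def hierarchical_entries(pa: int, pb: int) -> int:
    """Maximum entries with one table per layer (top + bottom)."""
    return 2 ** pa + 2 ** pb


def flat_entries(pa: int, pb: int) -> int:
    """Entries if every inter-AS-bottom node in the network were
    listed in a single table (assumed worst case, no hierarchy)."""
    return 2 ** (pa + pb)


hier = hierarchical_entries(16, 16)
flat = flat_entries(16, 16)
```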

   Similarly, each node of the AS layer will have three tables of
   routing entries: one for the inter-AS-top layer, one for the
   inter-AS-bottom layer and one for the routing information inside the
   autonomous system itself.
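
   A node holding these tables can pick the table to consult by
   comparing the A and B parts of the destination with its own. The
   following is a sketch under that assumption; select_table and the
   table names are hypothetical.

```python
def select_table(local, dest):
    """Choose which routing table to consult.

    local and dest are (a, b, c) tuples: inter-AS-top node,
    inter-AS-bottom node and node inside the autonomous system.
    """
    if dest[0] != local[0]:
        return "inter-AS-top"      # different region
    if dest[1] != local[1]:
        return "inter-AS-bottom"   # same region, different sub region
    return "AS"                    # same autonomous system


here = (1, 2, 3)
choices = [select_table(here, d) for d in [(9, 2, 3), (1, 8, 3), (1, 2, 7)]]
```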

   With this architecture, each network acts as a leaf node, i.e. a
   network will not act as a transit. In practice, edge routers are
   treated as leaf nodes to which multiple small networks of different
   sizes are connected. The user-ID space therefore needs to be divided
   into a subnet-ID and a user-ID, and a VLSM (variable length subnet
   mask) type of approach has to be adopted at the edge routers. Each
   edge router will thus act as the root of a tree whose leaves are
   independent small networks acting as stubs.

   An autonomous system is a network of edge routers confined to a sub
   region within a region of the entire global space. This definition of
   AS may differ from the usual definition of an autonomous system, so
   these terms may be replaced by more suitable names for the sake of
   clarity.

2.2. Default route

   If a default route is maintained in the AS layer, routing information
   of the inter-AS layer need not be propagated down to the AS layer,
   so the load in the AS layer is reduced. An autonomous system with
   only one ABR is the best candidate for a default route. With
   multiple ABRs inside an AS, a default route can be introduced by
   having each node use its nearest area border router as its default
   router. In an AS that contains multiple ABRs, a default route
   introduces a delay for most packets as they leave the AS. The delay
   could be as large as log(n) hops (where n is the number of nodes
   inside the AS). This delay will also reduce the bandwidth usage
   inside the autonomous system to some extent.
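
   Choosing the nearest ABR as the default router can be sketched as a
   breadth-first search over the AS topology. The adjacency-list
   representation, the hop-count metric and the name nearest_abr are
   assumptions made for this example; a deployed network would learn
   this from its interior routing protocol.

```python
from collections import deque


def nearest_abr(graph, start, abrs):
    """Return (abr, hops) for the ABR closest to start, or None.

    graph maps a node name to a list of neighbor names; abrs is the
    set of node names that are area border routers.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node in abrs:
            return node, hops
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


topology = {
    "r1": ["r2", "r3"],
    "r2": ["r1", "abr1"],
    "r3": ["r1", "r4"],
    "r4": ["r3", "abr2"],
}
default_router = nearest_abr(topology, "r1", {"abr1", "abr2"})
```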

2.3. Propagation of a packet

   To reduce complexity, packets are sent from an inter-AS-ABR of one
   node to the neighboring inter-AS-ABR of another node just by looking
   up the routing table entries of the inter-AS-top layer, i.e. no label
   switching is performed within the inter-AS-top layer. This also
   reduces the maximum label stack depth by one (the maximum label stack
   depth may become a criterion for processing real time packets, as
   discussed in the next section). Label switching is done at the
   inter-AS-bottom layer as well as at the AS layer. When a packet is
   transported from one inter-AS-bottom node to another through an
   intermediate node, and the intermediate node contains multiple ABRs,
   a label needs to be pushed onto the packet's label stack at the
   intermediate node so that the packet can be tunneled through it.
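
   A bounded label stack of this kind can be sketched as follows. The
   limit of 4 is only an assumed value (it matches the n=4 used in the
   calculations of the next section), and push_label and pop_label are
   hypothetical helper names, not the API of any MPLS implementation.

```python
MAX_LABEL_DEPTH = 4    # assumed limit; this document does not fix one
MAX_LABEL = 2 ** 20    # an MPLS label value is 20 bits wide


def push_label(stack, label, max_depth=MAX_LABEL_DEPTH):
    """Return a new stack with label on top (index 0 is the top)."""
    if not 0 <= label < MAX_LABEL:
        raise ValueError("label out of range")
    if len(stack) >= max_depth:
        raise OverflowError("label stack depth limit reached")
    return [label] + list(stack)


def pop_label(stack):
    """Return (top label, remaining stack)."""
    if not stack:
        raise IndexError("label stack is empty")
    return stack[0], list(stack[1:])


# An intermediate node tunnels a packet that already carries label 100.
tunneled = push_label([100], 200)
top, rest = pop_label(tunneled)
```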

3. Processing of real time packets (QoS issue)

   This section attempts a solution that lets Gigabit Ethernet switches
   (in full duplex mode) transport data traffic (IP) as well as real
   time (RT) traffic (as RTP [5] packets) efficiently in the existing
   32-bit system.

   In IP routing/switching, the entire packet is collected at each
   intermediate router/switch and forwarded based on the forwarding
   table. Inside the switch/router, the variable length IP packet is
   fragmented into smaller frames at the ingress side. The frames are
   transported through the switching fabric with a suitable priority
   mechanism (to support QoS), then reassembled at the egress side and
   passed to the media for the next hop.

   In ATM, packets are fragmented into small cells at the ingress edge
   device. The entire packet is transported as a stream of cells and
   is reassembled at the egress edge device. ATM is faster than IP
   routing because latency is reduced: the entire packet is not
   collected, fragmented and reassembled at the intermediate nodes.
   So, in the case of Gigabit Ethernet, better performance can be
   expected if RT packets can pass through the switch without being
   fragmented, i.e. one RT packet needs to fit inside one internal
   frame of the switch fabric. Additionally, for this approach to
   succeed, a maximum size of the MPLS label stack has to be defined.
   Inside the switch, all IP packets will be assumed to carry the same
   number of MPLS labels, whether they actually carry one label or the
   maximum. To reduce overhead, this limit should be the minimum number
   of labels needed to support all features of MPLS, i.e. label
   stacking of depth n (without limit) needs modification.

   If the frame size is chosen just large enough to fit one small RTP
   packet, the overhead becomes too high because of the large packet
   header (40 bytes: 20 bytes IP + 8 bytes UDP + 12 bytes RTP). On the
   other hand, if a large frame size is used, the fragmentation loss
   becomes too high for small packets (say, 40-byte IP packets). A
   compromise is therefore needed, based on the IP packet size
   distribution. The frame size is selected to minimize the sum of the
   fragmentation loss of data packets and the header overhead of RT
   packets.

   Studies show that IP data packets of three sizes are the most
   common:
          ~50% packets of size 40 bytes (TCP ACK),
          ~20% packets of size 576 bytes (path MTU set by X.25) and
          ~30% packets of size 1500 bytes (path MTU set by Ethernet).
   Packets of other sizes are few compared to these three categories and
   are almost evenly distributed. For simplicity of calculation, only
   traffic of these three categories is considered. The payload of data
   traffic is the actual IP packet size, whereas the payload of RT
   traffic is the payload inside the RTP packet.

   Let totBytes be the number of bytes to be transported across the
   internet, and let dataPcnt be the percentage of data traffic, so
   that (100-dataPcnt)% is RT traffic.
   i.e. totBytes*dataPcnt/100 = data traffic and
         (100-dataPcnt)*totBytes/100 = RT traffic;

   Of the data packets, 50% are 40 bytes long, 20% are 576 bytes long
   and 30% are 1500 bytes long.
   Let totDataPkt = total data packets;
   i.e. totDataPkt*50/100 pkt   40 bytes length
                                        = 40*50*totDataPkt/100 bytes;
   i.e. totDataPkt*20/100 pkt  576 bytes length
                                        = 576*20*totDataPkt/100 bytes;
   i.e. totDataPkt*30/100 pkt 1500 bytes length
                                        = 1500*30*totDataPkt/100 bytes;
   ---------------------------------------------------------------------
                                  total = 58520*totDataPkt/100 bytes;
                                        = totBytes*dataPcnt/100 bytes;
   i.e. totDataPkt*58520/100 = totBytes*dataPcnt/100
   i.e. totDataPkt = totBytes*dataPcnt/58520;

   Let totBytes (for the sake of calculation) = 58520*100;
   i.e.  totDataPkt = dataPcnt*100;
      40 bytes packets = 50*totDataPkt/100 i.e. 50*dataPcnt
      576 bytes packets = 20*totDataPkt/100 i.e. 20*dataPcnt
      1500 bytes packets = 30*totDataPkt/100 i.e. 30*dataPcnt

   If n is considered to be the depth of MPLS label stack,
   inside the switch, actual size of
           40 bytes packet = 40+4*n bytes,
           576 bytes packet = 576 + 4*n bytes &
           1500 bytes packet = 1500 + 4*n bytes

   Let frameSize be the size of a frame (in bytes) inside the switch.
   If an RT packet fits inside frameSize,
           RT packet payload = (frameSize - 40 - 4*n) bytes;

   Total overhead = packet header overhead + fragmentation overhead;


   Overhead of

      40 bytes packets (in bytes) = 50*dataPcnt*(4*n + (frameSize -
         (((40+4*n)%frameSize==0)?frameSize:(40+4*n)%frameSize)));
      576 bytes packets = 20*dataPcnt*(4*n + (frameSize -
         (((576+4*n)%frameSize==0)?frameSize:(576+4*n)%frameSize)));
      and 1500 bytes packets = 30*dataPcnt*(4*n + (frameSize -
         (((1500+4*n)%frameSize==0)?frameSize:(1500+4*n)%frameSize)));

   Overhead of RT packets (in bytes) =
      (((100-dataPcnt)*58520)/(frameSize-40-4*n))*(40+4*n);

   Total overhead (in bytes) = 100*dataPcnt*4*n +
       50*dataPcnt *(frameSize -
          (((40+4*n)%frameSize==0)?frameSize:(40+4*n)%frameSize)) +
       20*dataPcnt *(frameSize -
          (((576+4*n)%frameSize==0)?frameSize:(576+4*n)%frameSize)) +
       30*dataPcnt *(frameSize -
          (((1500+4*n)%frameSize==0)?frameSize:(1500+4*n)%frameSize)) +
       (((100-dataPcnt)*58520)/(frameSize-40-4*n))*(40+4*n)

   If the total overhead is plotted for (frameSize = 40+4*n+1;
   frameSize < 1500+4*n; frameSize++) for different values of dataPcnt
   (from 80 to 100), minimum values are found at frameSize = 85, 102,
   119, 127 and 152 for n==4.
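
   The overhead expressions can be evaluated with a short program such
   as the sketch below. It assumes the three-size packet mix and
   totBytes = 58520*100 used above, and computes the RT term by
   dividing the RT bytes by the RT payload per frame,
   frameSize - 40 - 4*n. The function names are hypothetical, and the
   sketch is not claimed to reproduce the frame sizes quoted above,
   which depend on the exact traffic assumptions used for the plot.

```python
# (IP packet size in bytes, share of data packets in percent)
PACKET_MIX = ((40, 50), (576, 20), (1500, 30))
RT_HEADER = 40          # 20 bytes IP + 8 bytes UDP + 12 bytes RTP
BYTES_PER_PCNT = 58520  # totBytes / 100 for the chosen totBytes


def padding(size, frame_size):
    """Bytes wasted in the last frame of a packet of the given size."""
    rem = size % frame_size
    return 0 if rem == 0 else frame_size - rem


def total_overhead(frame_size, n, data_pcnt):
    """Label + fragmentation overhead of data, plus RT header overhead."""
    labels = 4 * n
    if frame_size <= RT_HEADER + labels:
        raise ValueError("frame too small to carry any RT payload")
    data = sum(share * data_pcnt *
               (labels + padding(size + labels, frame_size))
               for size, share in PACKET_MIX)
    rt_bytes = (100 - data_pcnt) * BYTES_PER_PCNT
    rt_packets = rt_bytes / (frame_size - RT_HEADER - labels)
    return data + rt_packets * (RT_HEADER + labels)


def best_frame_size(n, data_pcnt):
    """Scan the same frameSize range as the plot and return the minimum."""
    candidates = range(RT_HEADER + 4 * n + 1, 1500 + 4 * n)
    return min(candidates, key=lambda f: total_overhead(f, n, data_pcnt))


best = best_frame_size(n=4, data_pcnt=90)
```

   Replacing PACKET_MIX with measured traffic data is the natural way
   to adapt the calculation to a real network.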

   Actual IP traffic data has to be collected to get the best result.
   As dataPcnt increases, the minimum is found at a lower frameSize;
   for lower dataPcnt, a frameSize in the higher range gives the better
   result. With an average IP packet size of 585 bytes, switches will
   incur a loss of 4*(n-1) bytes for each packet that needs only one
   label.

   For this scheme to work, a standard for the maximum label stack size
   has to be defined. The RTP packet size also has to be standardized.

3.1. Dual mode operation

   Ingress service cards need to operate in dual mode to process RT
   packets and non-RT packets, i.e. RT packets should follow a direct
   path that needs no fragmentation and its related complexities,
   whereas other packets follow a different path for fragmentation.
   This prevents an RT packet from being blocked by the fragmentation
   of non-RT packets that arrived at the service card before it. Merely
   matching the RT packet size to the frameSize of the switch fabric
   will therefore not, by itself, achieve the speed of ATM switches.
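
   The dual mode behaviour can be modelled in software as two separate
   queues, with the RT queue served first. This is only a conceptual
   sketch: real service cards do this in hardware, and the IngressCard
   class, the zero padding and the strict priority rule are assumptions
   made for the example.

```python
from collections import deque


def fragment(packet, frame_size):
    """Cut a packet into fixed-size frames, zero-padding the last."""
    return [packet[i:i + frame_size].ljust(frame_size, b"\x00")
            for i in range(0, len(packet), frame_size)]


class IngressCard:
    def __init__(self, frame_size):
        self.frame_size = frame_size
        self.rt_path = deque()    # direct path, no fragmentation
        self.data_path = deque()  # fragmentation path

    def receive(self, packet, is_rt):
        if is_rt:
            if len(packet) > self.frame_size:
                raise ValueError("RT packet must fit in one frame")
            self.rt_path.append(packet.ljust(self.frame_size, b"\x00"))
        else:
            self.data_path.extend(fragment(packet, self.frame_size))

    def next_frame(self):
        """Hand the fabric an RT frame if one is waiting."""
        if self.rt_path:
            return self.rt_path.popleft()
        if self.data_path:
            return self.data_path.popleft()
        return None


card = IngressCard(frame_size=127)
card.receive(b"d" * 300, is_rt=False)  # arrives first, needs 3 frames
card.receive(b"r" * 100, is_rt=True)   # arrives later, sent first
first = card.next_frame()
```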


4. Refinements over existing IPv6 specification

   As IPv6 was envisioned long before some newer technologies, e.g.
   MPLS, came into the picture, some refinements can be made to the
   existing specification. These considerations relate to bandwidth
   usage and performance inside switches. The previous section shows
   that a smaller packet size gives a better result for the processing
   of RT packets, so it is desirable to keep the IP packet header as
   small as possible.

   The values of pA, pB and pC and the length of the user-ID need to be
   analyzed properly by experts. pA=pB=16 may be an obvious choice, so
   pC and the length of the user-ID remain to be investigated. If pC is
   restricted to 8 bits, a 24-bit user-ID along with it will give a
   64-bit address, which is convenient to read. A choice of pC=8 with a
   24-bit user-ID may satisfy present day requirements for a long time.
   The IP header can be reduced substantially with a proper choice of
   IP address length.

   The Flow Label field of the IPv6 packet header may not be of any
   use. ATM used to have 4 priority classes. The first specification of
   IPv6 [4] used a 4-bit Priority field along with a 24-bit Flow Label
   field. These were changed to an 8-bit Traffic Class field and a
   20-bit Flow Label field in the current specification [7]. Too many
   priority classes may increase processing complexity inside switches.
   If the Traffic Class field of the IPv6 header is reduced to 4 bits,
   as the Priority field was in [4], and the Flow Label field is
   removed, another three bytes can be saved from the IPv6 header.
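
   The header savings can be tallied field by field. The first table
   below lists the fixed IPv6 header fields of RFC 2460; the reduced
   layout is only an assumed illustration that combines 64-bit
   addresses with a 4-bit class field and no Flow Label, leaving the
   other fields unchanged, which this document does not specify.

```python
IPV6_FIELDS = {            # field widths in bits, per RFC 2460
    "version": 4,
    "traffic_class": 8,
    "flow_label": 20,
    "payload_length": 16,
    "next_header": 8,
    "hop_limit": 8,
    "source_address": 128,
    "destination_address": 128,
}

# Assumed reduced layout, for illustration only.
REDUCED_FIELDS = dict(IPV6_FIELDS,
                      traffic_class=4,
                      flow_label=0,
                      source_address=64,
                      destination_address=64)


def header_bytes(fields):
    """Total header size in bytes; widths must sum to whole bytes."""
    bits = sum(fields.values())
    if bits % 8:
        raise ValueError("header is not a whole number of bytes")
    return bits // 8


current = header_bytes(IPV6_FIELDS)
reduced = header_bytes(REDUCED_FIELDS)
saved = current - reduced   # 16 bytes of address + 3 bytes of class/flow
```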

   The 'Hop Limit' field has an 8-bit value in the existing
   specification. The role of this field needs to be discussed. For a
   long route-ID (say 48 bits), it needs to be discussed whether 8 bits
   would be sufficient and how this field would be processed.

   The three-tier mesh structured approach expects each node to have an
   IP address, whereas the IPv6 specification expects each interface to
   have an IP address.

4.1. Distributed processing and Multicasting

   With the hierarchy inherent in this architecture, distributed
   applications can also be structured in a suitable manner. For
   example, a commonly used web based application would have a master
   level server at every top-level node. Any change to the application
   has to be synchronized among these master level servers first.
   There may also be servers at the middle layer (inside each
   inter-AS-bottom node) within each top-level node. Once a change is
   reflected at the master level server, all the servers at the middle
   layer need to update themselves from their master level server. This
   will reduce network traffic substantially. Multicasting can be
   treated in a similar manner. Work on these issues can proceed only
   after this architecture is approved.

5. Prospective Issues

   IP packets of size 576 bytes mostly come from TCP layers that do not
   perform path-MTU discovery and instead use the default value that
   was set in the days of X.25. The 576 factor can be corrected very
   easily by setting the path-MTU to 1500. Given that a label switched
   path between two arbitrary network points does not change very
   frequently for any particular type of packet, most applications are
   expected to become UDP based with negative ACKs. TCP in turn might
   go through changes. Once this comes into effect, the number of
   40-byte packets will come down drastically. The switch fabric frame
   size needs to be determined keeping these two factors in mind, along
   with changes in the IP packet header. With the existing 32-bit
   system, frame sizes of 152 and 127 are the most viable solutions for
   n=4.

6. IANA Considerations

   This is a first-level draft for a proposed standard. IANA actions,
   if needed, would come into play at a later stage.

7. Security Considerations

   This document does not address any security related issues.


8. References

8.1 Normative References

   [1]  Postel, J., "Internet Protocol", STD 5, RFC 791,
        September 1981.

   [2]  Fuller, V., Li, T., Yu, J., and K. Varadhan, "Classless
        Inter-Domain Routing (CIDR): an Address Assignment and
        Aggregation Strategy", RFC 1519, September 1993.

   [3]  Rekhter, Y. and T. Li, "A Border Gateway Protocol 4 (BGP-4)",
        RFC 1771, March 1995.

   [4]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
        Specification", RFC 1883, December 1995.


   [5]  Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson,
        "RTP: A Transport Protocol for Real-Time Applications", RFC
        1889, January 1996.

   [6]  Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

   [7]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
        Specification", RFC 2460, December 1998.

   [8]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", March 1999.

   [9]  Srisuresh, P. and K. Egevang, "Traditional IP Network Address
        Translator (Traditional NAT)", RFC 3022, January 2001.

   [10] Rosen, E., Viswanathan, A. and R. Callon, "Multiprotocol
        Label Switching Architecture", RFC 3031, January 2001.

   [11] Huston, G., "Commentary on Inter-Domain Routing in the
        Internet", RFC 3221, December 2001.

   [12] Nordmark, E. and R. Gilligan, "Basic Transition Mechanisms for
        IPv6 Hosts and Routers", RFC 4213, October 2005.


9. Author's Address

Shyam Bandyopadhyay
HL No 205/157/7, Inda
Kharagpur 721305
India

Phone: +91 3222 225137
e-mail: shyamb66@gmail.com


Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

