Internet Draft                                        S. Bandyopadhyay
draft-shyam-hn-ipv6-00.txt                               January, 2008
Intended status: Proposed Standard
Expires: July, 2008

                   Hierarchical Networking and IPv6

Status of this memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document describes an approach for reorganizing the entire network within a large address space. It describes how the entire address space can be distributed among regions, and among sub regions inside each region, by establishing a mesh structured hierarchy. It addresses issues relevant to this architecture in the context of IPv6. It also proposes an approach by which an IP switch based network can perform as well as an ATM network when processing real time traffic.

Bandyopadhyay             Expires July, 2008                  [Page 1]

Internet Draft     Hierarchical Networking and IPv6      January, 2008

1. Introduction

The transition from IPv4 to IPv6 is in progress. Work has been done to upgrade individual nodes (workstations) from IPv4 to IPv6.
Also, there are established documents describing how routers/switches can support IPv4 and IPv6 packets at the same time in order to make the transition possible [12]. So far there is no published document on how hierarchy can be established within the network. There is concern that BGP tables are becoming too large in the existing system [11]. At the same time, there have been proposals to extend the autonomous system number from 16 bits to 32 bits to meet demand. The challenge lies in making the transition smooth with the fewest changes.

An ATM network performs faster than a network of IP switches. The difference becomes more prominent for real time applications. ATM networks, however, are at a disadvantage in bandwidth usage compared to IP-switch based networks. This document describes approaches that allow an IP-switch based network to process real-time applications as fast as an ATM network, and a mesh structured hierarchical network for routing convenience.

2. A Three-tier mesh structured hierarchical network

The existing system works with an autonomous system (AS) layer and an inter-AS layer, using the CIDR approach [2]. If the same approach is continued with a larger network ID, the load on the switches will be too high. If hierarchy can be established within the network-ID portion, routing can be made simpler. If the network is designed with a fixed prefix length for the autonomous system everywhere, routing information for the rest of the network is confined to the remaining part of the network prefix. The entire network can then be viewed as a network of inter-AS layer nodes. Each node in the inter-AS layer can act as one of the following: a router in the inter-AS layer only; a router in the inter-AS layer with an autonomous system attached to it like a tree; or an autonomous system with multiple area border routers (ABRs), appearing like a mesh. A mesh structured hierarchy is thus established between the AS layer and the inter-AS layer, with each AS having a fixed prefix length.
In a similar manner, a mesh structured hierarchy may be established within the inter-AS layer. The inter-AS layer may be split into inter-AS-top and inter-AS-bottom. To maintain this hierarchy, each node of inter-AS-top needs to have multiple inter-AS-ABRs, in the same way that an autonomous system maintains ABRs. The entire network will then appear as a network of nodes of the inter-AS-top layer. Each node of the inter-AS-top layer will have a fixed prefix length, i.e. each node of the inter-AS-top layer will contain a fixed number of inter-AS-bottom nodes.

With a three-tier mesh structured hierarchy in the network layer, the network ID can be viewed as A.B.C. If pA, pB and pC are the prefix lengths of the inter-AS-top, inter-AS-bottom and AS layers respectively, there will be 2^pA nodes at the topmost layer, 2^pB nodes at the inter-AS-bottom layer and 2^pC nodes at the AS layer. The entire space is thus divided into a fixed number of regions, and each region is divided into a fixed number of sub regions. This division is supposed to be based on geography, population density, demand and related factors. For example, if pA=pB=16, there will be 2^16 (65536) nodes at the inter-AS-top layer. If each state in the USA is assigned a top-level node, each of them will have 2^16 inter-AS-bottom nodes, and each inter-AS-bottom node will have an autonomous system with prefix length pC.

Introducing a mesh structured hierarchy at the inter-AS layer has several advantages:

   Load at each router will be reduced substantially.
   The CIDR style approach, and the complexity related to prefix reduction, will not be needed.
   Protection against failure will become more stable.
   A full mesh hierarchy will distribute traffic evenly.
   Physical cable connections can be optimized.
   Administrative issues will become easier.

2.1. Route propagation

With the hierarchy established, routing information that is established inside a node of inter-AS-top does not need to be propagated to another node of inter-AS-top. The entire routing information of the inter-AS-top layer needs to be propagated to the inter-AS-bottom layer. So, each router of the inter-AS layer will have two tables of information: one for inter-AS-top and another for the inter-AS-bottom layer of the inter-AS-top node it belongs to. The same BGP protocol will work well with a small change applied at the inter-AS-ABRs: an inter-AS-ABR will not propagate the routing information of the inter-AS-bottom layer of its domain to an inter-AS-ABR of another domain, i.e. an inter-AS-ABR of one top level node will propagate only inter-AS-top routing information to an inter-AS-ABR of another node. Inside a node of inter-AS-top, routing information of both inter-AS-top and inter-AS-bottom needs to be propagated from one inter-AS router to its neighboring inter-AS routers.

For a network with pA=pB=16, the network will be viewed as a network of inter-AS-top nodes with 16-bit prefixes. As the number of nodes is not too high, each node can have an entry for every other node in its routing table. So, there will be a maximum of 2^16 entries in the routing table of the inter-AS-top layer. Inside each node of the inter-AS-top layer there is a smaller network of inter-AS-bottom nodes. With a 16-bit prefix for inter-AS-bottom, there will be a routing table of 2^16 entries for propagation of packets from one inter-AS-bottom node to another. So, BGP tables will have a maximum of 2*65536 (i.e. 131072) entries.

Similarly, each node of the AS layer will have three tables of routing entries: one for inter-AS-top, one for inter-AS-bottom and another for the routing information inside the autonomous system itself.
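
The address split and the table-size arithmetic of this section can be sketched in Python. This is a minimal illustration, not part of the proposal: the names Layout, split_address, inter_as_table_entries and exportable_routes are hypothetical, pA=pB=16 is taken from the text, and the values pC=8 and a 24-bit user-ID are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Field widths in bits, most significant field first."""
    pa: int = 16    # inter-AS-top prefix length (value used in the text)
    pb: int = 16    # inter-AS-bottom prefix length (value used in the text)
    pc: int = 8     # AS prefix length (assumed for illustration)
    user: int = 24  # user-ID length (assumed for illustration)

    @property
    def total_bits(self) -> int:
        return self.pa + self.pb + self.pc + self.user


def split_address(addr: int, layout: Layout = Layout()):
    """Split an integer address into its (A, B, C, user) fields."""
    if not 0 <= addr < (1 << layout.total_bits):
        raise ValueError("address does not fit the layout")
    user = addr & ((1 << layout.user) - 1)
    addr >>= layout.user
    c = addr & ((1 << layout.pc) - 1)
    addr >>= layout.pc
    b = addr & ((1 << layout.pb) - 1)
    a = addr >> layout.pb
    return a, b, c, user


def inter_as_table_entries(layout: Layout = Layout()) -> int:
    """Upper bound on entries held by one inter-AS router: one table for
    all inter-AS-top nodes plus one table for the inter-AS-bottom nodes
    of its own top-level node."""
    return (1 << layout.pa) + (1 << layout.pb)


def exportable_routes(routes, same_top_node: bool):
    """Routes an inter-AS-ABR may advertise to a peer.

    `routes` is a list of (layer, prefix) pairs with layer "top" or
    "bottom". Bottom-layer routes are withheld from peers that belong
    to another top-level node.
    """
    if same_top_node:
        return list(routes)
    return [r for r in routes if r[0] == "top"]


if __name__ == "__main__":
    addr = (1 << 48) | (2 << 32) | (3 << 24) | 4
    print(split_address(addr))        # fields A, B, C, user
    print(inter_as_table_entries())   # bound for pA=pB=16
```

The filter in exportable_routes is the only behavioral change the text asks of an inter-AS-ABR; everything else is ordinary bit slicing on fixed-width fields.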

With this architecture, each network acts as a leaf node, i.e. a network will not act as a transit. Realistically, edge routers are considered to be leaf nodes to which multiple small networks of different sizes are connected. So, the user-ID space needs to be divided into a subnet-ID and a user-ID. In effect, a VLSM (variable length subnet mask) type of approach has to be adopted at the edge routers. Each edge router will then act as the root of a tree whose leaves are independent small networks that act as stubs. An autonomous system is a network of edge routers confined to a sub region within a region of the entire global space. This definition of an AS may differ from the usual definition of an autonomous system, so these terms can be replaced by more suitable ones for the sake of clarity.

2.2. Default route

If a default route is maintained in the AS layer, routing information of the inter-AS layer need not be propagated down to the AS layer, so the load in the AS layer is reduced. An autonomous system with only one ABR is the best candidate for a default route. With multiple ABRs inside an AS, a default route can be introduced by having each router use the nearest area border router as its default router.

In an AS that contains multiple ABRs, a default route in the AS layer introduces a delay for most packets on their way out of the AS. The delay could be as large as log(n) hops (where n is the number of nodes inside the AS). This delay will also reduce bandwidth utilization inside the autonomous system to some extent.

2.3. Propagation of a packet

To reduce complexity, packets are sent from an inter-AS-ABR of one node to the neighboring inter-AS-ABR of another node just by looking at the routing table entries of the inter-AS-top layer, i.e. there is no label switching within the inter-AS-top layer. This also reduces the maximum label stack depth by one (the maximum label stack depth may become a criterion for processing real time packets, as discussed in the next section).
Label switching is done at the inter-AS-bottom layer as well as at the AS layer. While a packet is transported from one inter-AS-bottom node to another through a node in the middle, the packet needs to be label stacked at the middle node for tunneling purposes if that node contains multiple ABRs.

3. Processing of real time packets (QoS issue)

This section attempts to find a way for Gigabit Ethernet switches (in full duplex mode) to transport data traffic (IP) as well as real time (RT) traffic (as RTP [5] packets) efficiently in the existing 32-bit system.

In IP routing/switching, the entire packet is collected at the intermediate router/switch and forwarded based on the forwarding table. Inside the switch/router, the variable length IP packet is fragmented into smaller frames at the ingress side. The frames are transported through the switching fabric with a proper priority mechanism (to support QoS), then reassembled at the egress side and passed to the media for the next hop.

In ATM, packets are fragmented into small cells at the ingress edge device. The entire packet is transported as a stream of cells and collected at the egress edge device. ATM is faster than IP routing because latency is reduced: the entire packet is not collected, fragmented and reassembled at the intermediate nodes.

So, with Gigabit Ethernet, better performance can be expected if RT packets can pass through the switch without being fragmented, i.e. one RT packet needs to fit inside one internal frame of the switch fabric. Additionally, to make this approach successful, the maximum size of the MPLS label stack has to be defined.
Inside the switch, all IP packets will be assumed to carry the same number of MPLS labels, whether they actually carry one label or the maximum. To reduce overhead, this limit should be the minimum number of labels needed to support all the features of MPLS, i.e. label stacking of depth n (without limit) needs modification.

If the minimum frame size is selected to fit one RTP packet, overhead becomes too high because of the very large packet header (40 bytes: 20 bytes IP + 8 bytes UDP + 12 bytes RTP). On the other hand, if a large frame size is used, fragmentation loss becomes too high for small packets (say, 40-byte IP packets). So, a compromise is needed that gives a better result based on the IP packet size distribution. The frame size is selected to minimize the sum of the fragmentation loss of data packets and the header overhead of RT packets.

Studies show that IP data packets of three different sizes are the most common: ~50% of packets are 40 bytes (TCP ACK), ~20% are 576 bytes (path MTU set by X.25) and ~30% are 1500 bytes (path MTU set by Ethernet). Packets of other sizes are few compared to these three categories and are almost evenly distributed. For simplicity of calculation, only traffic of these three categories is considered. The payload of data traffic is the actual IP packet size, whereas the payload of RT traffic is the payload inside the RTP packet.

Let totBytes be the number of bytes to be transported across the internet. Let dataPcnt be the percentage of data traffic; (100-dataPcnt)% is RT traffic, i.e.

   data traffic = totBytes*dataPcnt/100
   RT traffic   = (100-dataPcnt)*totBytes/100

Of the data traffic, 50% of the packets are 40 bytes long, 20% are 576 bytes long and 30% are 1500 bytes long.

Let totDataPkt be the total number of data packets, i.e.

   totDataPkt*50/100 packets of 40 bytes   =   40*50*totDataPkt/100 bytes
   totDataPkt*20/100 packets of 576 bytes  =  576*20*totDataPkt/100 bytes
   totDataPkt*30/100 packets of 1500 bytes = 1500*30*totDataPkt/100 bytes
   ---------------------------------------------------------------------
   total                                   = 58520*totDataPkt/100 bytes
                                           = totBytes*dataPcnt/100 bytes

   i.e. totDataPkt*58520/100 = totBytes*dataPcnt/100
   i.e. totDataPkt = totBytes*dataPcnt/58520

Let totBytes (for the sake of calculation) = 58520*100, i.e. totDataPkt = dataPcnt*100. Then:

   number of 40 bytes packets   = 50*totDataPkt/100, i.e. 50*dataPcnt
   number of 576 bytes packets  = 20*totDataPkt/100, i.e. 20*dataPcnt
   number of 1500 bytes packets = 30*totDataPkt/100, i.e. 30*dataPcnt

If n is the depth of the MPLS label stack, then inside the switch the actual size of a 40 bytes packet is 40 + 4*n bytes, of a 576 bytes packet 576 + 4*n bytes, and of a 1500 bytes packet 1500 + 4*n bytes.

Let frameSize be the size of a frame (in bytes) inside the switch. If an RT packet fits inside frameSize,

   RT packet payload = (frameSize - 40 - 4*n) bytes

   Total overhead = packet header overhead + fragmentation overhead

Overhead of data packets (in bytes):

   40 bytes packets =
      50*dataPcnt*(4*n + (frameSize -
         (((40+4*n)%frameSize==0)?frameSize:(40+4*n)%frameSize)))

   576 bytes packets =
      20*dataPcnt*(4*n + (frameSize -
         (((576+4*n)%frameSize==0)?frameSize:(576+4*n)%frameSize)))

   1500 bytes packets =
      30*dataPcnt*(4*n + (frameSize -
         (((1500+4*n)%frameSize==0)?frameSize:(1500+4*n)%frameSize)))

Overhead of RT packets (in bytes):

   (((100-dataPcnt)*58520)/(frameSize-40-4*n))*(40+4*n)

Total overhead (in bytes):

   100*dataPcnt*4*n
   + 50*dataPcnt*(frameSize -
        (((40+4*n)%frameSize==0)?frameSize:(40+4*n)%frameSize))
   + 20*dataPcnt*(frameSize -
        (((576+4*n)%frameSize==0)?frameSize:(576+4*n)%frameSize))
   + 30*dataPcnt*(frameSize -
        (((1500+4*n)%frameSize==0)?frameSize:(1500+4*n)%frameSize))
   + (((100-dataPcnt)*58520)/(frameSize-40-4*n))*(40+4*n)

If the total overhead is plotted for (frameSize = 40+4*n+1; frameSize < 1500+4*n; frameSize++) for different values of dataPcnt (from 80 to 100), minimum values are found at frameSize = 85, 102, 119, 127 and 152 for n=4. Actual IP traffic data has to be collected to get the best result. As dataPcnt increases, the minima occur at lower frameSize values; for lower dataPcnt, the higher frameSize values give better results.

With an average IP packet size of 585 bytes, switches will incur a loss of 4*(n-1) bytes for packets that need only one label.

In order to make this scheme work, a standard for the maximum label stack size has to be defined. The RTP packet size also has to be standardized.

3.1. Dual mode operation

Ingress service cards need to act in dual mode to process RT packets and non-RT packets, i.e. RT packets should follow a direct path that needs no fragmentation and related complexities, whereas other packets need to follow a different path for fragmentation operations. This prevents an RT packet from being blocked by the fragmentation of non-RT packets that arrived at the service card before it. Merely matching the RT packet size to the frameSize of the switch fabric is therefore not enough to achieve the speed of ATM switches.

4. Refinements over existing IPv6 specification

As IPv6 was envisioned long before some of the newer technologies, e.g. MPLS, came into the picture, some refinements can be made to the existing specification. These considerations are related to bandwidth usage and performance inside switches.

The previous section shows that a smaller packet size gives a better result for processing of RT packets. So, it is desirable for the IP packet header to be as small as possible.

The values of pA, pB, pC and the length of the user-ID need to be analyzed properly by experts.
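
The frame-size analysis of Section 3, on which the header-size argument above relies, can be sketched in Python. This is an illustrative sketch under stated assumptions, not the author's program: the function names are hypothetical, the traffic mix is the 50/20/30 split given in Section 3, and the RT term divides by the RT payload (frameSize - 40 - 4*n), as in the standalone RT overhead expression. It makes no claim to reproduce the minima quoted in the text.

```python
# (IP packet size in bytes, percentage of data packets), from Section 3.
DATA_MIX = ((40, 50), (576, 20), (1500, 30))

# Bytes carried by 100 data packets with that mix: 58520.
BYTES_PER_100_PKTS = sum(size * pct for size, pct in DATA_MIX)

RT_HEADER = 40     # 20 bytes IP + 8 bytes UDP + 12 bytes RTP
LABEL_BYTES = 4    # one MPLS label stack entry


def frag_loss(size: int, frame_size: int) -> int:
    """Unused bytes in the last frame when `size` bytes are cut into frames."""
    rem = size % frame_size
    return 0 if rem == 0 else frame_size - rem


def total_overhead(frame_size: int, data_pcnt: int, n: int) -> float:
    """Total overhead in bytes for totBytes = 58520*100.

    data_pcnt is the percentage of data traffic, n the label stack depth
    every packet is assumed to carry inside the switch.
    """
    labels = LABEL_BYTES * n
    rt_payload = frame_size - RT_HEADER - labels
    if rt_payload <= 0:
        raise ValueError("frame too small to carry any RT payload")
    # Data packets: label bytes plus fragmentation loss, per packet.
    data = sum(
        pct * data_pcnt * (labels + frag_loss(size + labels, frame_size))
        for size, pct in DATA_MIX
    )
    # RT packets: one header (plus labels) per frame-sized packet.
    rt = (100 - data_pcnt) * BYTES_PER_100_PKTS / rt_payload * (RT_HEADER + labels)
    return data + rt


def best_frame_size(data_pcnt: int, n: int) -> int:
    """Frame size with the lowest total overhead over the range in the text."""
    lo = RT_HEADER + LABEL_BYTES * n + 1
    hi = 1500 + LABEL_BYTES * n
    return min(range(lo, hi), key=lambda f: total_overhead(f, data_pcnt, n))


if __name__ == "__main__":
    for pcnt in (80, 90, 100):
        print(pcnt, best_frame_size(pcnt, n=4))
```

Running the search against measured traffic only requires replacing DATA_MIX; the rest of the model is independent of the packet-size distribution.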
pA=pB=16 may be an obvious choice, so pC and the length of the user-ID have to be investigated. If pC is restricted to 8 bits, a 24-bit user-ID gives a 64-bit address in total, which is a convenient size. pC=8 with a 24-bit user-ID may satisfy present-day requirements for a long time. The IP header can be reduced substantially with a proper choice of IP address length.

The flow label field of the IPv6 packet header may not be of any use. ATM used to have 4 priority classes. The first specification of IPv6 [RFC-1883] used a 4-bit Priority field along with a 24-bit Flow Label field. These were changed to an 8-bit Traffic Class field and a 20-bit Flow Label field in the current specification [RFC-2460]. Too many priority classes may increase processing complexity inside switches. If the Traffic Class field of the IPv6 header is reduced to 4 bits, as in [RFC-1883], and the Flow Label field is removed, another three bytes can be removed from the IPv6 header.

The 'Hop Limit' field has an 8-bit value in the existing specification. The role of this field needs to be discussed. For a long route-ID (say 48 bits), whether 8 bits would be sufficient, and how this field will be processed, needs to be discussed.

The three-tier mesh structured approach expects each node to have an IP address, whereas the IPv6 specification expects each interface to have an IP address.

4.1. Distributed processing and Multicasting

With the inherent hierarchy of this architecture, distributed applications can also be structured in a suitable manner. Say, for a commonly used web based application, there will be a master level server at every top level node. Any change to the application has to be synchronized among these master level servers first. There might be servers at the middle layer (inside each inter-AS-bottom node) inside each top level node.
Once the changes are reflected at the master node, all the servers at the middle layer need to update themselves from their master level node. This will reduce network traffic substantially. Multicasting can be looked at in a similar manner. Work on these issues can progress only after this architecture is approved.

5. Prospective Issues

IP packets of size 576 bytes mostly come from TCP layers that do not perform path-MTU processing and use the default value that was set for X.25. The 576 factor can be corrected very easily by setting the path-MTU to 1500.

Considering that the label switched path between two arbitrary network points does not change very frequently for any particular type of packet, most applications are expected to become UDP based with negative ACK. TCP in turn might go through changes. Once this comes into effect, the number of 40 bytes packets will come down drastically.

The switch fabric frame size needs to be determined keeping these two factors in mind, along with changes in the IP packet header. With the existing 32-bit system, frame sizes of 152 and 127 are the most viable solutions for n=4.

6. IANA Considerations

This is a first level draft for a proposed standard. Hence, IANA actions should come into play at a later stage, if needed.

7. Security Considerations

This document does not include any security related issues.

8. References

8.1. Normative References

[1]  Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

[2]  Fuller, V., Li, T., Yu, J., and K. Varadhan, "Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy", RFC 1519, September 1993.

[3]  Rekhter, Y. and T. Li, "A Border Gateway Protocol 4 (BGP-4)", RFC 1771, March 1995.

[4]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 1883, December 1995.

[5]  Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996.

[6]  Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

[7]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, December 1998.

[8]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", March 1999.

[9]  Srisuresh, P. and K. Egevang, "Traditional IP Network Address Translator (Traditional NAT)", RFC 3022, January 2001.

[10] Rosen, E., Viswanathan, A. and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.

[11] Huston, G., "Commentary on Inter-Domain Routing in the Internet", RFC 3221, December 2001.

[12] Nordmark, E. and R. Gilligan, "Basic Transition Mechanisms for IPv6 Hosts and Routers", RFC 4213, October 2005.

9. Author's Address

Shyam Bandyopadhyay
HL No 205/157/7, Inda
Kharagpur 721305
India
Phone: +91 3222 225137
e-mail: shyamb66@gmail.com

Full Copyright Statement

Copyright (C) The IETF Trust (2008).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.