Network Working Group
Internet Draft                                            M. Napierala
Document: draft-mnapierala-mvpn-rev-03.txt                        AT&T
Expires: May 2008                                        November 2007

                   Segmented Multicast MPLS/BGP VPNs

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This document describes inter-site signaling procedures in MPLS/BGP IP VPNs that allow the same multicast stream to flow simultaneously on multiple inter-PE paths without duplicates being sent to receivers. Those procedures are independent of the multicast tunnel technology used in the service provider network as well as of the protocol used to exchange multicast signaling among PE's. The document specifies the necessary information elements and their exchange process for the desired MVPN operation.

Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [i].

Napierala                  Expires - May 2008                  [Page 1]

Table of Contents

   1. Introduction
   2. Terminology
   3. Overview of the Solution
      3.1 Overview of Supporting PIM-SM
   4. Preserving PIM-SM Traffic Patterns in MVPN
   5. PE-to-PE Signaling Information Elements
   6. Inter-Site Signaling Procedures for PIM-SM
      6.1 Path Between C-S and C-RP Across Provider's Network
      6.2 Path Between C-S and C-RP Outside of Provider's Network
      6.3 Group-Only S-PMSI Auto-discovery Route
      6.4 Handling Initial Packets Sent on C-Shared Tree
      6.5 Using P2MP LSP's as P-Tunnels for C-Shared Trees
   7. Supporting C-Shared Trees
   8. Support of Anycast C-RP
   9. Inter-Site Signaling Procedures for PIM-SSM
      9.1 C-Receiver Pruning
      9.2 P-tunnel Withdrawal for C-S
      9.3 Using P2MP LSP's as P-Tunnels for PIM-SSM C-Trees
   10. Inter-Site Signaling Procedures for PIM-Bidir
      10.1 Preventing C-Bidir Packet Loops in MVPN
      10.2 Active Group P-tunnel Announcement in C-Bidir
      10.3 Bidir C-Group Becomes Inactive
      10.4 P-tunnels for C-Bidir Traffic
      10.5 DF-PE Redundancy with Fast Convergence
      10.6 Using MP2MP LSP's as P-Tunnels for C-Bidir
   11. IANA Considerations
   12. Security Considerations
   13. References
   14. Acknowledgments
   15. Author's Addresses
   16. Intellectual Property Statement
   17. Copyright Notice

1. Introduction

Multicast VPN (cf. [ii]) extends MPLS/BGP VPN services (cf. [iii]) by enabling customers to run native IP multicast within their IP VPN's. From the VPN customer's perspective there is no change in the multicast operational model. Multicast distribution trees are built in the service provider network to carry VPN multicast traffic. Those trees are essentially point-to-multipoint (P2MP) or multipoint-to-multipoint (MP2MP) tunnels that encapsulate IP VPN multicast packets for transport across the provider's network. Throughout this document whenever we refer to a VPN we mean MPLS/BGP IP VPN and whenever we refer to an MVPN we mean MPLS/BGP Multicast IP VPN.

This document defines procedures for exchanging multicast VPN routing that allow the same multicast stream to traverse multiple inter-PE paths without duplicate packets being sent to receivers. As a consequence, inter-PE C-multicast traffic can flow on multiple tunnels and simultaneously utilize multiple paths in a redundant topology. Different downstream PE's or even different multicast VRF's are allowed to choose different upstream PE's to a customer RP or a customer source.

The lack of support of parallel paths for multicast traffic would remove one of the main benefits of IP VPN's, namely, that different multicast VRF's of the same VPN might have different routing policies and choose different paths to reach the RP or the source. It would also break Anycast RP [iv][v] operation in MVPN by not allowing multiple RP's to send traffic in parallel to their closest receivers.

The proposed duplicate-free operation of Multicast VPN's is independent of the multicast tunnel technology used by the service provider as well as of the protocol used to exchange multicast signaling among PE's.
The proposed inter-PE multicast signaling does not impose any restrictions on the customer's multicast routing or requirements on the multicast service offering, e.g., it does not require the customer to outsource its RP functionality to the service provider. The inter-site signaling specified in this document does not change multicast traffic patterns in the customer network and it is transparent to PIM signaling in the customer domain. A direct consequence of the proposed procedures is simplification of inter-PE multicast routing.

The procedures defined in this document include the support of PIM-SM [vii], PIM-SSM [vi], and PIM-Bidir [x] based C-trees.

2. Terminology

In this document we use the "C-" prefix to refer to MVPN customer multicast addresses and multicast trees. We prefix MVPN customer multicast trees, sources, groups, Rendezvous Points, and PIM routes with "C-", as in: C-tree, C-S, C-G, C-RP, (C-*, C-G), (C-S, C-G). We use the "P-" prefix to refer to the provider's multicast addresses and multicast trees/tunnels.

We assume familiarity with the PIM protocol [vii][vi][x] and the terminology used in [ii].

3. Overview of the Solution

In order to support multiple inter-PE paths to the same C-S or C-RP without duplicate packets sent to receivers, the inter-PE signaling procedures defined in this document do not use overlay broadcast networks (i.e., MI-PMSI's defined in [ii]) for multicast data traffic. Broadcast networks require that any C-multicast flow has only one entry into the provider's network, prohibiting parallel active paths for the same flow. Instead, we segment a multicast VPN into sets of multicast VRF's (mVRF's) such that each set is served by a specific P2MP or MP2MP P-tunnel. Each such P-tunnel is rooted at a unique PE that, for a given set of mVRF's, is the best next-hop to C-RP, C-Source, or C-RP Address (C-RPA).
This allows the same C-group or the same C-source traffic to enter the provider's network at multiple PE's without creating duplicates to C-receivers.

In case of PIM-SM the proposed signaling procedure supports Anycast C-RP's by partitioning the MVPN by C-RP location, i.e. by the upstream PE attached to C-RP. In case of PIM-SM and PIM-SSM the proposed procedure supports partitioning the MVPN by C-source location, i.e. by the upstream PE attached to C-S. This allows C-multicast traffic to be simultaneously sent from each C-source location to a different set of C-receiver locations. In case of PIM-Bidir the proposed signaling procedure supports partitioning the MVPN by C-RPA location, i.e. by the upstream PE attached to C-RPA. In PIM-Bidir the partitioning of MVPN by C-RPA location avoids multicast packet loops during routing convergence.

In order to build P2MP or MP2MP P-tunnels rooted at PE's attached to C-RP, C-RPA, or C-source, the active C-groups and C-sources have to be discovered in the provider's network. This is straightforward when C-trees are built with PIM-SSM and PIM-Bidir. In PIM-SSM an active C-S is discovered when a PE attached to C-S receives a customer initiated (C-S, C-G) Join. In PIM-Bidir a group C-G is discovered when a PE attached to C-RPA receives a customer initiated (C-*, C-G) Join. When a PE attached to C-S (in PIM-SSM) or C-RPA (in PIM-Bidir) receives, respectively, a (C-S, C-G) Join or a (C-*, C-G) Join, it announces a P-tunnel for C-S or C-G traffic rooted at this PE.

3.1 Overview of Supporting PIM-SM

Supporting multiple inter-PE paths for the same C-multicast flow is more complex in PIM-SM. In native PIM-SM mode the same multicast traffic does not necessarily flow over a single tree but it can simultaneously flow on both shared and shortest path trees. If MVPN is modeled as an overlay broadcast network then PIM-SM based C-multicast will create duplicate packets sent to receivers unless LAN-based or LAN-like multicast procedures are used.
Such procedures choose a single forwarder (single upstream PE) for any C-multicast flow entering the provider's network, whether on a shared tree or a source tree.

In order to support multiple inter-PE paths for the same PIM-SM C-multicast stream without duplicates sent to receivers, we decompose PIM-SM C-multicast into two types of topologies: (1) when the C-SPT to C-RP is across the provider's network and (2) when the C-SPT to C-RP is outside of the provider's network. This decomposition of PIM-SM routing is explained in detail in section 4. In topology (1) we carry C-multicast traffic only on C-SPT's across the provider's network. In topology (2) we carry C-multicast traffic only on C-shared trees across the provider's network. We define signaling procedures that discover topology (1) and initiate inter-site C-RPT to C-SPT switchover at the egress PE's (i.e., PE's attached to sites with receivers). If a C-multicast topology is not discovered to be a type (1) topology, it is by default treated as a type (2) topology and C-multicast traffic is carried between PE's on C-shared trees only.

Such decomposition of PIM-SM C-multicast allows for multiple entry points for C-G or C-S traffic into the provider's network without duplicate packets being sent to C-receivers. This is because, according to the specified procedures, multicast traffic from a customer source or from a customer RP is never sent to a downstream multicast VRF over a tunnel that is not rooted at this mVRF's best next-hop PE towards the source or the RP.

We observe further that it is not necessary to perform customer initiated RPT-to-SPT switchover across the provider's network. In the MVPN context, the state created along the SPT to the C-RP can be used by PE's to discover customer sources. An active C-source can be discovered by a PE attached to C-RP when it receives a (C-S, C-G) Join initiated by C-RP and destined to C-S.
Such a C-source discovery mechanism does not depend on receiving SPT Joins from sites attached to receivers and thus avoids customer-initiated inter-PE RPT-to-SPT switchover. The traffic for C-sources that PE's discovered can be carried over inter-PE C-SPT's; all other C-source traffic will always flow on C-shared trees across the provider's network. The C-sources that could not be discovered are those that communicate with C-RP outside of the provider's network.

According to the procedure defined in this document, PE-to-PE C-multicast traffic is sent either only on SPT's or only on shared trees, regardless of whether or not it was switched to SPT's in the customer domain. This avoids significant shifts of traffic in the provider's network and leads to simplification of PE-to-PE multicast routing. The following PIM messages are eliminated between PE's: (C-S, C-G, rpt) Prunes and customer initiated (C-S, C-G) Joins associated with C-RPT to C-SPT switchover. The latter elimination has only one exception, associated with dually homed receiver sites where C-RPT and C-SPT diverge (defined in section 6.1.1).

4. Preserving PIM-SM Traffic Patterns in MVPN

Our goal is to allow multicast traffic to flow simultaneously, but without duplicates, on multiple C-trees across the provider's network while preserving the customer's multicast traffic patterns. In this section we decompose PIM-SM C-multicast into two routing topologies which will provide the foundation for achieving this goal.

PIM-SM has the capability for last-hop routers (i.e. routers with directly connected receivers) to switch to the shortest-path tree and bypass the RP if the traffic rate is above a configured threshold called the "SPT-threshold". The default value of the SPT-threshold is typically zero.
This means that the default behavior for PIM-SM last-hop routers attached to receivers is to immediately join the SPT to the source as soon as the first packet arrives via the (*,G) shared tree. By switching to the SPT, the optimal path is used to deliver the multicast traffic. Depending on the location of the source in relation to the RP, switching to the SPT can significantly reduce network latency. However, in networks with large numbers of senders, SPT's can increase the amount of state that must be kept in the routers.

A VPN customer might set the SPT-threshold to a value higher than zero in order to switch to SPT's only for sources that exceed a certain traffic rate. This is done in order to relieve the RP of carrying too much traffic while at the same time controlling the number of (S, G) states created in the network. If an SPT-threshold of "infinity" is specified for a group, the sources will not be switched to SPT and will always remain on the shared tree.

We decompose PIM-SM C-multicast into two scenarios where: (1) the path between C-S and C-RP is via the service provider network, and (2) the path between C-S and C-RP is outside of the service provider network.

          C-S1        C-RP   C-S2
           |           |     /
          CE1         CE2  CE2'
           |           |   /
           |           |  /
          PE1          PE2
            \          /
         Provider's Network
                 |
                PE3
                 |
                 |
                CE3
                 |
                C-R

   Figure 1: Scenario (1) - Path between C-Si and C-RP via provider's network

In scenario (1), shown in Figure 1, we assume that C-RP communicates with source C-S, e.g., C-S1 and C-S2, over the provider's network. The (C-S, C-G) state is created within the site with C-S by a PIM Join issued by C-RP towards the C-S. Hence, switching at the egress PE's to SPT will not introduce new multicast states or change multicast traffic patterns within the site with C-S (or any other VPN site). In this scenario, immediate switching to SPT's at the egress PE's is transparent to the customer.
As a consequence, in scenario (1), PIM-SM C-trees can by default be automatically triggered as SPT's by all egress PE's with no inter-PE RPT-to-SPT switchover initiated by C-routers. Regardless of whether or not the traffic in the customer's network switched to SPT's, inter-PE MVPN traffic is sent only on SPT's.

Even if from the provider's network perspective C-S and C-RP are reachable via different PE's (as C-S1 and C-RP in Figure 1a) or via different interfaces on the same PE (as C-S2 and C-RP in Figure 1a), a better path between the C-S and the C-RP could be engineered by a customer to be outside of the provider's network.

          C-S1 ======= C-RP ==== C-S2
           |            |        /
          CE1          CE2    CE2'
           |            |      /
           |            |     /
          PE1           PE2
            \           /
          Provider's Network
                  |
                 PE3
                  |
                  |
                 CE3
                  |
                 C-R

   Figure 1a: Scenario (1a) - Path between C-S and C-RP engineered to be outside of provider's network

Figure 1a depicts this scenario. From the provider's network perspective CE1 is reachable via PE1 and C-RP is reachable via PE2. Hence, from the provider's perspective the path between C-S and C-RP is via the provider's network. Yet, the best path between C-S and C-RP has been engineered by the VPN customer to be outside of the provider's network (which is depicted by a double line between C-S1 and C-RP in Figure 1a). Similarly, from the provider's network perspective CE2 and C-RP are both reachable via PE2. Hence, from the provider's perspective the path between C-S2 and C-RP is via the provider's PE router. Yet, the best path between C-S and C-RP has been engineered by the VPN customer to be outside of the provider's network (which is depicted by a double line between C-RP and C-S2 in Figure 1a). Handling such topologies would complicate inter-PE C-multicast routing because it requires full C-RPT to C-SPT switching between PE's. Such scenarios are unusual and could be a result of unintentional or incomplete route advertisement by the customer.
To avoid full RPT-to-SPT switching, in the scenarios depicted in Figure 1a, the C-S traffic will be kept on inter-PE C-shared trees.

In scenario (2) the customer source and the customer RP are located at the same site. In this scenario, the optimal path from C-S to C-RP might not overlap with the optimal path from CE towards C-RP. Figure 2 depicts an example of such a scenario. In this topology, if PE3 unconditionally switches to C-SPT, (C-S, C-G) state is created on CE1 which would not otherwise be created. If, in the customer network, switching from RPT to SPT is based on a non-zero SPT-threshold then traffic from a specific source C-S might never be switched to SPT if the C-S rate does not reach the configured threshold. Hence, under scenario (2), to preserve PIM-SM multicast states in the customer network, C-RPT to C-SPT switching cannot be initiated by the provider's network.

          C-S     C-RP
            \     /
             \   /
              R-1
               |
              CE1
               |
               |
              PE1
               |
               |
        Provider's Network
               |
               |
              PE3
               |
               |
              CE3
               |
              C-R

   Figure 2: Scenario (2) - Path between C-S and C-RP outside of provider's network

In scenario (2) there is no advantage in switching inter-PE traffic from C-RPT to C-SPT. Moreover, it is beneficial to the customer not to switch to SPT's at all because the customer's multicast traffic is already on the shortest path across the provider's network. In addition, in scenario (2), if the customer initiates switching to SPT for C-S traffic at a remote site (e.g., CE3 in Figure 2), this would not change the C-S traffic pattern within the site with C-S. This is because at this site the path from C-RP to C-S intersects with the path from the provider's network towards C-S. Hence, staying on the inter-PE shared tree for C-S will not change the C-S traffic pattern even if the customer switched to SPT for C-S at a remote site.
Based on these observations, the C-G traffic from any source C-S that is located at the same site as C-RP will be kept on the inter-PE C-shared tree, regardless of whether or not the customer network initiated the switching to SPT's.

There could be a scenario, with C-S and C-RP located at the same site, where RPT-to-SPT switchover is initiated by the customer to relieve the C-RP of carrying too much traffic. An example of such a scenario is depicted in Figure 2a. In Figure 2a it is assumed that the best path from source C-S to C-RP is directly via CR1 only and not via CE1. When a remote CE3 switches to SPT, C-S traffic does not need to flow through the C-RP. However, this requires (C-S, C-G) state to be created on CE1. In scenario (2a) a path from CE1 to C-RP does not intersect with the SPT from C-RP to C-S. Hence, when staying on the shared tree the C-S traffic cannot be "picked off" as it flows along the SPT to the C-RP. In Figure 2a, if the best path from C-RP to C-S were via CE1, the benefit of switching to SPT would be eliminated because the C-S traffic would not flow via C-RP while on the shared tree. Another benefit is that (C-S, C-G) state would not be created on CE1.

              CR1---C-S
              /      |
          C-RP       |
              \      |
               \     |
                 CE1
                  |
                 PE1
                  |
          Provider's Network
                  |
                 PE3
                  |
                 CE3
                  |
                 C-R

   Figure 2a: Scenario (2a) - Path between C-S and C-RP outside of provider's network

It is beneficial to a VPN customer to assure that the best path from the C-RP to C-S (when they are located at the same site) intersects with the path from the provider's network towards C-S. Such a topology gains all the benefits of staying on the shared tree because C-S traffic can be "picked off" and sent towards the provider's network as it flows along the SPT to the C-RP.
We assume that staying on C-shared trees in topologies exemplified by Figure 2a has a minimal impact on the customer or that this impact can be easily eliminated by a straightforward routing or topology adjustment in the customer network. In addition, such an adjustment is beneficial to the customer because it results in fewer multicast states on customer routers.

5. PE-to-PE Signaling Information Elements

The following information elements are required in support of the multicast signaling procedures defined in this document:

   - active C-source announcements
   - P-tunnel announcements and withdrawals for (C-*, C-G) traffic
   - P-tunnel announcements and withdrawals for (C-S, C-G) traffic.

When BGP is used as an auto-discovery mechanism in MVPN, the BGP NLRI (MCAST-VPN) already defined in [viii] handles the different route types in MVPN. For active C-source announcements, the Source Active auto-discovery route defined in [viii] can be used. The P-tunnel announcements and withdrawals for (C-S, C-G) traffic can use the S-PMSI auto-discovery route also defined in [viii]. The S-PMSI auto-discovery route for P-tunnel announcements and withdrawals for (C-*, C-G) traffic is defined in section 6.3 of this document.

Optionally, there can be an additional route type defined for active C-group announcements. This route type and its purpose are defined in section 10.2.1 of this document.

6. Inter-Site Signaling Procedures for PIM-SM

As described in section 4, a VPN source C-S and its C-RP could communicate either across the provider's network or outside of the provider's network. In either topology, a PE attached to C-RP, upon receiving a (C-*, C-G) PIM Join from another PE or from a locally attached site, will send a (C-*, C-G) Join towards the C-RP. This PE will also announce a P-tunnel for the group C-G to all PE's in a given MVPN and it will add the P-tunnel interface to the (C-*, C-G) outgoing interface list (olist).

There could be more than one PE to which the same C-RP is attached.
This could be because the C-RP is multi-homed or because it is an Anycast-RP. Each PE that is attached to the C-RP and receives a (C-*, C-G) Join will announce a distinct P-tunnel for C-G. This allows the same C-G traffic to enter the provider's network at multiple ingress points. Different PE's attached to receivers of C-G may receive C-G traffic on different P-tunnels without duplicate packets sent to receivers.

A PE, or more precisely an mVRF of a given MVPN attached to receiver(s) of C-G (i.e., an egress PE), will "join" only the C-G tunnel announced by its best next-hop PE to C-RP. If there is more than one best next-hop PE to C-RP in the mVRF, the egress PE will choose as the next-hop the PE with the highest IP address, or it may utilize a multicast multipath load splitting algorithm when there are multiple C-RP's behind the same PE's.

All PE's in the given MVPN will store the C-G's P-tunnel information until they receive the P-tunnel withdrawal message for C-G. A PE that does not have any interested receivers for C-G when it receives the C-G P-tunnel announcement message will still store this information so that it can join the P-tunnel for late C-G receivers. The conditions for C-G P-tunnel withdrawal are defined in section 6.1.3.

In the meantime, a VPN source C-S might have sent a PIM Register message to C-RP with multicast data encapsulated in it. The C-RP extracts the multicast data packet from the Register message and sends it to the MVPN over the P-tunnel for group C-G. If the P-tunnel is not built yet, which is very unlikely because the P-tunnel creation was triggered upon receiving the first (C-*, C-G) Join, the initial data packet(s) to be sent across the provider's network will be dropped. We describe the probability of dropping the initial C-multicast traffic in section 6.4.
From this point on, depending on whether the C-RP and C-S communicate via the provider's network or outside of the provider's network, the inter-PE procedures differ. They are defined in sections 6.1 and 6.2, respectively.

6.1 Path Between C-S and C-RP Across Provider's Network

A PE attached to C-RP, as PE2 in Figure 1, upon receiving a (C-S, C-G) PIM Join from the CE attached to C-RP (CE2 in Figure 1), will create (C-S, C-G) state and will add the CE-PE interface to its olist. The olist of the (C-S, C-G) entry is also populated with a copy of the olist from the (C-*, C-G) entry except the P-tunnel used for C-G traffic. This is to avoid duplicate traffic, i.e. traffic being sent on both the shortest-path tree and the shared tree across the provider's network. The PE attached to C-RP will propagate the (C-S, C-G) Join toward C-S.

When a site with C-S and a site with C-RP are attached to the same PE (as C-S2 and C-RP in Figure 1), this PE, upon receiving the first C-S packet on (C-S, C-G) state, will start sending (C-S, C-G, rpt) Prunes towards the C-RP. This is to stop receiving C-S traffic over the C-shared tree, i.e., to stop receiving packets decapsulated from Register messages. The traffic arriving on the C-RPT will eventually stop flowing when the Register Stop message from C-RP is received by the C-S. This will result in no more (C-S, C-G, rpt) Prunes being sent to the C-RP. To further optimize the traffic flow, the PE attached to C-RP should use so-called "turnaround rules" to prevent multicast traffic from unnecessarily reaching the C-RP if there are no interested receivers behind it.

In case a site with C-S and a site with C-RP are attached to the same PE, this PE will not announce a new P-tunnel for (C-S, C-G) traffic and it will send the C-S traffic over the already announced P-tunnel for C-G.
In case the C-S is not attached to the same PE as C-RP (as C-S1 in Figure 1), the PE attached to C-RP will announce the active source C-S of C-G to all PE's in a given MVPN. Upon receiving the active source C-S announcement message, a PE that is the next-hop to source C-S (as PE1 in Figure 1) will send a P-tunnel announcement for (C-S, C-G) traffic to all PE's in the MVPN. The PE's will store the C-S P-tunnel information until they receive the P-tunnel withdrawal message for (C-S, C-G). A PE that does not have any interested receivers for C-G when it receives the (C-S, C-G) P-tunnel announcement message will still store this information so that it can join this P-tunnel for late receivers. The conditions for (C-S, C-G) P-tunnel withdrawal are defined in sections 6.1.3 and 6.1.4. If C-S is dually connected to two different PE's, both of those PE's will announce their distinct P-tunnels for C-S traffic.

The PE's attached to receivers of C-G, upon receiving the P-tunnel announcement for (C-S, C-G) traffic, will initiate (C-S, C-G) Joins based on (C-*, C-G) PIM Joins received from locally attached CE's. Each such egress PE will send a (C-S, C-G) Join to the best next-hop PE towards C-S in an mVRF of the specified MVPN. The egress PE will also connect to the P-tunnel announced by the best next-hop PE to C-S in the mVRF. Egress PE's will continue participating in the C-shared tree to receive traffic from all other C-sources sending to C-G.

If there is more than one best next-hop to C-S in the mVRF (i.e., there are multiple equal cost paths), the egress PE will choose as the next-hop the PE with the highest IP address. A PE might utilize a multicast multipath load splitting algorithm if there are multiple C-sources behind the same PE's. All PE's have to use the same load splitting algorithm in order to choose the same upstream PE for the same C-S.
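As an illustration only, the upstream PE choice described above can be sketched in Python. The function names, the (address, cost) candidate representation, and the use of a SHA-256 hash for load splitting are assumptions made for this sketch; this document does not specify a load splitting algorithm, only that all PE's must use the same one. All candidate addresses are assumed to belong to the same address family.

```python
import hashlib
import ipaddress


def select_upstream_pe(candidates):
    """Pick the upstream PE among (pe_address, cost) candidates (hypothetical format).

    The lowest-cost candidates are the "best next-hops"; ties are broken by
    choosing the numerically highest IP address, as described in the text.
    """
    if not candidates:
        return None
    best_cost = min(cost for _, cost in candidates)
    best = [ipaddress.ip_address(addr) for addr, cost in candidates if cost == best_cost]
    return str(max(best))


def split_upstream_pe(equal_cost_pes, c_source, c_group):
    """Deterministic load splitting over equal-cost PE's (assumed hash scheme).

    Sorting the candidates first makes the result independent of the order in
    which a PE learned them, so every PE picks the same upstream PE for the
    same (C-S, C-G).
    """
    ordered = sorted(equal_cost_pes, key=ipaddress.ip_address)
    digest = hashlib.sha256(f"{c_source},{c_group}".encode()).digest()
    return ordered[int.from_bytes(digest[:4], "big") % len(ordered)]
```

Comparing parsed addresses rather than strings matters here: as text, "10.0.0.9" sorts above "10.0.0.10", which would give a different answer from the numeric comparison.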
The P-tunnel announced for (C-S, C-G) traffic is also joined by the PE attached to C-RP that has (C-S, C-G) state with the interface towards C-RP in its olist (as PE2 in Figure 1). This is in order for C-RP to receive C-S traffic natively on the C-SPT. When the first C-S packet arrives over the C-S P-tunnel at the PE attached to C-RP (PE2 in Figure 1), this PE will start sending (C-S, C-G, rpt) Prunes towards the C-RP. This is in order to stop receiving C-S traffic over the C-shared tree, i.e., to stop receiving packets decapsulated from Register messages. The traffic arriving on the C-RPT will eventually stop flowing when the Register Stop message, sent by C-RP, is received by the C-S, and no more (C-S, C-G, rpt) Prunes will be sent to the C-RP. To further optimize the traffic flow, the PE attached to C-RP should use so-called "turnaround rules" to prevent multicast traffic from unnecessarily reaching the C-RP if there are no interested receivers behind it.

Upon receiving packets directly from a source C-S, customer last-hop routers might switch to SPT and send (C-S, C-G) Joins towards the C-S. When the SPT between C-RP and C-S is built across the provider's network, regardless of whether C-RP and C-S are attached to the same PE or different PE's, egress PE's do not need to propagate the (C-S, C-G) Join towards C-S. More precisely, when C-RP and C-S are attached to different PE's, an egress PE does not need to propagate a (C-S, C-G) Join received from a locally attached CE because in this scenario egress PE's have already switched to SPT when the P-tunnel for C-S was announced. When C-RP and C-S are attached to the same ingress PE, an egress PE does not need to propagate a (C-S, C-G) Join received from a locally attached CE because in this scenario the ingress PE has already joined the source C-S and pruned C-S traffic from the C-shared tree.
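The ingress-side signaling of this section can be summarized in a minimal Python sketch. The function name, the action labels, and the tuple layout are hypothetical and exist only for illustration; the sketch covers the announcements and the P-tunnel join by the PE attached to C-RP, and deliberately leaves out the (C-S, C-G, rpt) Prunes and the egress PE behavior.

```python
def ingress_actions(rp_pe, source_pe, c_source, c_group):
    """List the signaling steps once the PE attached to C-RP has received
    a (C-S, C-G) Join from the CE attached to C-RP.

    Returns (acting_pe, action, flow) tuples; all labels are illustrative.
    """
    flow = (c_source, c_group)
    if source_pe == rp_pe:
        # C-S and C-RP sites share a PE: no new P-tunnel is announced and
        # C-S traffic uses the P-tunnel already announced for C-G.
        return [(rp_pe, "send_on_existing_cg_ptunnel", flow)]
    return [
        # The PE attached to C-RP announces the active source to all PE's.
        (rp_pe, "announce_active_source", flow),
        # The PE that is the next-hop to C-S announces a (C-S, C-G) P-tunnel.
        (source_pe, "announce_sg_ptunnel", flow),
        # The PE attached to C-RP joins that P-tunnel so that C-RP receives
        # C-S traffic natively on the C-SPT.
        (rp_pe, "join_sg_ptunnel", flow),
    ]
```

With the labels of Figure 1, `ingress_actions("PE2", "PE1", "C-S1", "C-G")` yields the three-step sequence, while `ingress_actions("PE2", "PE2", "C-S2", "C-G")` yields the single reuse step.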
6.1.1 Dually Connected C-Receivers

In this section we describe a scenario where a dually homed VPN site with receiver(s) chooses a different next-hop PE depending on whether a shared (C-*, C-G) tree or a source (C-S, C-G) tree is joined. This means that shared and source trees diverge at this site.

                  C-S             C-RP
                   |               |
                  CE1             CE2
                 /   \             |
                /     \            |
              PE1     PE2         PE3
               |       |           |
                Provider's Network
                  |           |
                 PE4         PE5
              ^     \       /     ^
    (C-*,C-G) |      \     /      | (C-S,C-G)
         Join |        CE3        | Join
                        |
                        |
                       C-R

   Figure 3: Dually connected C-Receiver

Figure 3 depicts an example of such a scenario. Customer receiver C-R is dually connected to the provider's network via PE4 and PE5. Let's assume that C-RPT and C-SPT diverge at CE3 and that PE4 is on the C-RPT and PE5 is on the C-SPT for (C-S, C-G). Let's also assume that PE1 is the best next-hop PE to C-S on PE4 and that PE2 is the best next-hop PE to C-S on PE5.

When a dually connected VPN receiver site switches from the shared to the shortest path tree, the egress PE on the C-SPT (PE5 in Figure 3) will receive a (C-S, C-G) Join from this site. The egress PE will create (C-S, C-G) state, if it does not exist yet, and will add the interface on which it received the (C-S, C-G) Join to its olist. If there is already (C-*, C-G) state in the same multicast VRF, the olist of the (C-*, C-G) entry is copied into the olist of the new (C-S, C-G) entry. This is a standard PIM procedure to allow C-S traffic to flow to (C-*, C-G) receivers.

If C-S and C-RP are not attached to the same PE and if the egress PE received a P-tunnel announcement for (C-S, C-G) traffic from the best next-hop PE to C-S in the specified mVRF (the latter condition guarantees that C-RP and C-S communicate across the provider's network), the egress PE will propagate the (C-S, C-G) Join towards C-S. This is to cover the case when C-S is dually connected and the egress PE on the C-RPT (as PE4 in Figure 3) chooses a different upstream PE to C-S than the egress PE on the C-SPT (as PE5 in Figure 3).
The egress PE on C-SPT will join the P-tunnel for either C-G or C-S of
C-G if it has not joined it yet. A PE always joins the most specific
P-tunnel that was announced for (C-S, C-G) traffic, i.e., it will only
join a P-tunnel that was announced for the C-G if there was no
P-tunnel announcement for the C-S of the C-G.

Once a multicast packet is received on the C-SPT at a dually connected
site, the PE which is on the C-RPT will receive a (C-S, C-G, rpt)
Prune message from that site to prune C-S traffic off the C-shared
tree. The PE on the C-RPT (as PE4 in Figure 3) does not need to
propagate the (C-S, C-G, rpt) Prune message to C-RP, regardless of
whether C-RP and C-S are attached to the same or different PE's. This
is because C-S has already been pruned off the C-shared tree. The PE
on the C-RPT might also stop joining the P-tunnel for (C-S, C-G) if
there are no other receivers for (C-S, C-G) attached to it (i.e., if
C-S traffic was pruned off on all (C-*, C-G) outgoing interfaces).

6.1.2 C-Shared Tree Switchback

If a site attached to an egress PE switches back from C-SPT to C-RPT
because the C-S traffic rate fell below the SPT-threshold, the PE on
C-RPT will receive a (C-*, C-G) Join to rejoin the shared tree. Since
this (C-*, C-G) Join is sent without a (C-S, C-G, rpt) Prune, it will
cause the (C-S, C-G) Prune state along C-RPT to be deleted, which in
turn will permit (C-S, C-G) traffic to begin flowing down the C-RPT
again. If the egress PE stopped participating in the P-tunnel for C-S,
it has to rejoin this tunnel to receive the C-S traffic.

When a customer site switches back from C-SPT to C-RPT, the PE on the
C-SPT attached to this site will receive a (C-S, C-G) Prune message.
In general, the egress PE does not need to propagate the (C-S, C-G)
Prune message to a PE attached to C-S, even if C-S and C-RP are not
attached to the same PE. This is because in this scenario, inter-PE
C-trees are always SPT's.
However, there is one exception, namely when SPT and RPT diverge at a
dually connected site, as described in section 6.1.1. In this
scenario, given that C-S and C-RP are attached to different PE's, when
the egress PE receives a (C-S, C-G) Prune message it will remove the
interface on which it received the Prune from the olist for
(C-S, C-G). If the olist for (C-S, C-G) is empty, the egress PE on
C-SPT will send a (C-S, C-G) Prune message up the C-SPT. It will also
stop joining the P-tunnel for (C-S, C-G) traffic. This is to cover the
case when C-S is dually connected and the egress PE on C-SPT (as PE5
in Figure 3) chooses a different upstream PE to C-S than the egress PE
on C-RPT (as PE4 in Figure 3).

6.1.3 C-Receiver Pruning and P-tunnel Withdrawal

An egress PE will send a (C-*, C-G) Prune message towards C-RP when
the olist for (C-*, C-G) in an mVRF of a given MVPN becomes empty. The
C-RP could be locally attached to this PE or it can be attached to a
different PE. In the latter case, the mVRF with the empty olist for
(C-*, C-G) will stop joining the C-G P-tunnel announced by its best
next-hop to C-RP. The egress PE will keep the C-G P-tunnel information
in case it receives a new (C-*, C-G) Join from a locally attached
site. This PE will also send (C-S, C-G) Prunes for all C-sources for
which it triggered SPT's in the specified mVRF. The mVRF will also
stop participating in P-tunnels announced for those C-sources, but the
P-tunnel information will be kept on the egress PE until it receives
C-S tunnel withdrawals.

The state (C-*, C-G) is removed from a PE, or more specifically from
an mVRF, attached to C-RP when its olist for (C-*, C-G) becomes empty.
This means that the P-tunnel for the C-tree rooted at this PE is no
longer needed. Upon (C-*, C-G) state removal the PE attached to C-RP
will send the P-tunnel withdrawal message for C-G. It will also stop
joining the P-tunnels for (C-S, C-G) that it previously joined and it
will remove their P-tunnel information.
Upon receiving the C-G tunnel withdrawal message, all PE's in a given
MVPN will remove the C-G tunnel information. Every egress PE that
previously joined this C-G tunnel in any of its mVRF's will also
remove information about any P-tunnel for C-S of C-G associated with
those mVRF's.

6.1.4 C-Source Becomes Inactive

The state (C-S, C-G) expires or is removed on a PE attached to C-S
when C-S stops sending traffic and/or the state (C-S, C-G) was pruned
by the PE because there were no receivers for this traffic (the latter
condition was described in section 6.1.3). When (C-S, C-G) state
expires on a PE attached to C-S because C-S becomes inactive, this PE
will send a P-tunnel withdrawal message for (C-S, C-G) to all PE's in
a given MVPN. Upon receiving the C-S P-tunnel withdrawal message, PE's
attached to receivers of C-G (including the PE attached to C-RP) will
stop joining this P-tunnel and will remove this P-tunnel information.
After C-S stops sending traffic, the (C-S, C-G) state will also expire
on PE's attached to receivers of (C-S, C-G).

Upon receiving the C-S P-tunnel withdrawal message, the PE attached to
the C-RP of C-G will, if applicable, stop sending periodic
(C-S, C-G, rpt) Prune messages towards the C-RP.

6.2 Path Between C-S and C-RP Outside of Provider's Network

When a path between C-RP and C-S is outside of provider's network, in
order not to change traffic patterns in the customer network, the
C-shared tree will be preserved between PE's. Moreover, the inter-PE
signaling is simplified by not switching to C-SPT's at the egress PE's
at all. Hence, C-trees will be the shared trees from egress PE's to
C-RP's, regardless of whether customer last-hop routers switched to
SPT's.

When a path between a source C-S of C-G and the C-RP of C-G is outside
of provider's network, traffic from each such source is kept on the
same P-tunnel, regardless of whether it is flowing on a shared tree or
a source tree in the customer network.
This is the P-tunnel that was announced for the group C-G.

Upon receiving packets directly from source C-S, customer last-hop
routers might switch to SPT's and send (C-S, C-G) Joins. However, the
egress PE that received a (C-S, C-G) Join from a locally attached CE
will not propagate it to C-S since in this scenario, namely, when C-RP
and C-S communicate outside of provider's network, egress PE's do not
switch to C-SPT's. This includes the topologies where the PE attached
to C-S is either the same as or different from the PE attached to
C-RP. In addition, when C-RP and C-S are attached to the same PE,
there is no switching to C-SPT's regardless of whether C-RP and C-S
are behind the same or different CE's.

6.2.1 Dually Connected C-Receivers

There is one scenario that needs to be separately addressed, namely a
dually homed VPN receiver site with shared and source trees diverging.

                  C-S    C-RP
                    \    /
                     \  /
                     R-1
                      |
                     CE1
                     / \
                    /   \
                  PE1   PE2
                   |     |
                   |     |
                Provider's Network
                   |     |
                   |     |
                  PE3   PE4
             ^      \   /      ^
   (C-*,C-G) |       \ /       | (C-S,C-G)
        Join |       CE2       | Join
                      |
                     C-R

           Figure 4: Dually connected C-Receiver

Figure 4 depicts an example of such a scenario. Customer receiver C-R
is dually connected to provider's network via PE3 and PE4. Let's
assume that C-RPT and C-SPT diverge at CE2 and that PE3 is on C-RPT
and PE4 is on C-SPT for (C-S, C-G). Let's also assume that PE1 is the
best next-hop PE to C-RP on PE3 and that PE2 is the best next-hop PE
to C-RP on PE4.

When such a dually connected site switches from shared to shortest
path tree, the egress PE on C-SPT (PE4 in Figure 4) will receive a
(C-S, C-G) Join message from this site. The egress PE on C-SPT will
create (C-S, C-G) state in the relevant mVRF, if it does not exist
yet, and it will add the site's interface to the (C-S, C-G) olist. If
there is already (C-*, C-G) state in the same multicast VRF, the olist
of the (C-*, C-G) entry is copied into the olist of the new (C-S, C-G)
entry.
This is a standard PIM procedure to allow C-S traffic to flow to
(C-*, C-G) receivers. The egress PE on C-SPT will not propagate the
(C-S, C-G) Join towards C-S because there is no C-RPT to C-SPT
switching across provider's network. The egress PE on C-SPT will
convert (C-S, C-G) Joins to (C-*, C-G) Joins and will send them to its
upstream PE towards the C-RP. This is necessary because the best
next-hop to C-RP on the egress PE on C-SPT (as PE4 in Figure 4) might
be different than the best next-hop to C-RP on the egress PE on C-RPT
(as PE3 in Figure 4). The egress PE will join the P-tunnel announced
for C-G by the best next-hop PE to C-RP in the relevant mVRF, if it
did not join it yet.

Once multicast traffic is received on the C-SPT at a dually connected
site, the PE which is on the C-RPT will start receiving
(C-S, C-G, rpt) Prune messages to prune C-S traffic off the C-shared
tree. The egress PE will not propagate the (C-S, C-G, rpt) Prune
towards C-RP because the C-RPT will not be switched to C-SPT across
provider's network.

6.2.2 C-Shared Tree Switchback

If a site attached to an egress PE switches back from C-SPT to C-RPT
because the C-S traffic rate fell below the SPT-threshold, the PE on
C-RPT will receive a (C-*, C-G) Join from a customer site to rejoin
the shared tree. Since the (C-*, C-G) Join will be sent without a
(C-S, C-G, rpt) Prune, this will cause the (C-S, C-G) Prune state
along C-RPT to be deleted, which will permit (C-S, C-G) traffic to
begin flowing down the C-RPT again.

In case a receiver site is dually connected and it receives the C-S
traffic on C-RPT, it will send a (C-S, C-G) Prune message to the PE on
C-SPT. The PE on C-SPT will prune the interface on which it received
the (C-S, C-G) Prune message off the C-SPT.
If its olist for (C-S, C-G) is empty and there is no (C-*, C-G) state
or the olist for (C-*, C-G) becomes empty, the egress PE on C-SPT will
stop sending (C-*, C-G) Joins towards C-RP and it will also stop
joining the P-tunnel for C-G traffic. This is to stop unneeded traffic
from being sent to the egress PE.

6.2.3 C-Receiver Pruning and P-tunnel Withdrawal

An egress PE will send a (C-*, C-G) Prune message towards C-RP when
the olist for (C-*, C-G) becomes empty in an mVRF. The C-RP could be
locally attached to this PE or it can be attached to a different PE.
The mVRF on the egress PE with an empty (C-*, C-G) olist will stop
participating in the P-tunnel for C-G that it previously joined.

The state (C-*, C-G) is removed on the PE attached to C-RP when its
olist for (C-*, C-G) becomes empty. This means that the C-G tunnel
rooted at this PE is no longer needed. Upon (C-*, C-G) state removal
the PE attached to C-RP will send the P-tunnel withdrawal message for
C-G to all PE's in a given MVPN. Upon receiving the C-G tunnel
withdrawal message, all PE's in the MVPN will remove the C-G tunnel
information.

6.3 Group-Only S-PMSI Auto-discovery Route

When BGP is used as the auto-discovery mechanism in MVPN, a BGP NLRI
(MCAST-VPN) is already defined in [viii] to handle different route
types in MVPN. To support the procedures defined in sections 6.1 and
6.2, the MCAST-VPN NLRI definition has to be extended to include a new
Route Type called the Group-Only S-PMSI auto-discovery route. The
Group-Only S-PMSI auto-discovery route is an announcement of an active
VPN C-group and the P-tunnel to be used for its traffic. The P-tunnel
information is carried in a BGP attribute called the PMSI P-tunnel
attribute, already defined in [viii].
The Group-Only S-PMSI auto-discovery route type will be assigned Route
Type value 6 of the MCAST-VPN NLRI and will consist of the following:

      +-----------------------------------+
      |           RD (8 octets)           |
      +-----------------------------------+
      | Multicast Group Length (1 octet)  |
      +-----------------------------------+
      |    Multicast Group (Variable)     |
      +-----------------------------------+
      |   Originating Router's IP Addr    |
      +-----------------------------------+

The RD is encoded as described in [iii].

The Multicast Group field contains the C-G address or C-Generic LSP
Identifier Value. If the Multicast Group field contains an IPv4
address or a C-Generic LSP Identifier Value, then the value of the
Multicast Group Length field is 32. If the Multicast Group field
contains an IPv6 address, then the value of the Multicast Group Length
field is 128.

The Originating Router's IP Address field MUST be set to the IP
address that the PE places in the Global Administrator field of the
VRF Route Import extended community of the VPN-IP routes advertised by
the PE.

6.4 Handling Initial Packets Sent on C-Shared Tree

According to the procedures described in sections 6.1 and 6.2, the
initial C-G multicast packets sent over the C-shared tree could be
dropped by the PE attached to C-RP until a P-tunnel for C-G traffic is
built. Since the C-G tunnel is announced when the first (C-*, C-G) PIM
Join is received by the PE attached to the C-RP of C-G, this P-tunnel
should be built in time to carry the initial C-S packets.

In PIM-SM there are two scenarios to consider: (A) the source
registers first, before there are any interested receivers, or (B)
receivers join the group first, waiting for traffic on the shared
tree. We will analyze these two scenarios in the MVPN context based on
the inter-PE PIM-SM procedures defined in this document.
In scenario (A), whether in plain PIM or MVPN context, the initial
source packets are discarded because there are no receivers on the
shared tree. According to the PIM-SM procedure, when there are no
receivers on the shared tree, the C-RP sends a (C-S, C-G)
"Register-Stop" message to the 1st-hop router to stop it from sending
Register messages. The Register process will restart in 3 minutes (at
the earliest, depending on whether C-S is still active). If in the
meantime the C-receivers join the group C-G, there is plenty of time
for the C-G P-tunnel to be announced and created.

In scenario (B), there exists a short window of time during which the
initial C-source packets could be dropped, namely when the first
active C-S registers with C-RP immediately after the first C-receiver
joined the C-G, not giving enough time for the C-G P-tunnel to be
built. This is the only scenario under which there could be packet
discards in MVPN while there are no similar drops in plain PIM-SM
multicast. However, even in plain PIM-SM there could be packet drops,
especially with bursty sources, since only a bounded amount of traffic
can be encapsulated in PIM Register or MSDP SA messages. In addition,
since PIM is not a reliable protocol, losing the first (*, G) Join
message sent by a last-hop router attached to a receiver leads to loss
of packets by this receiver.

6.5 Using P2MP LSP's as P-Tunnels for C-Shared Trees

If P-tunnels are built with receiver-driven P2MP MPLS LSP's [ix], the
P-tunnel for C-G can be algorithmically and uniquely chosen by the
egress PE's. An egress PE selects the "root" PE of the P-tunnel, which
is its best next-hop PE towards C-RP, and builds the P-tunnel towards
this root PE. Different PE's may choose different upstream (i.e.,
root) PE's to reach C-RP in the same MVPN. This might happen if C-RP
is dually connected or if Anycast C-RP is used. When the address of
the root PE is used in the tunnel identification algorithm, a distinct
P2MP LSP per root can be built.
Hence, multiple P-tunnels can be simultaneously used to carry the same
C-G traffic without creating duplicates at the C-receivers. The P2MP
LSP is triggered by the egress PE when a (C-*, C-G) Join is received
from a locally attached receiver.

This technique allows for further aggregation of traffic without
generating duplicates. Instead of one P2MP LSP per root PE per C-G,
one P2MP LSP per root could be used for all C-groups for which the
C-RP is the active RP. In this case, the C-group address has to be
ignored in the P2MP LSP identifier; instead the C-RP address should be
used. Such aggregation may cause loss of bandwidth optimality but it
will not generate duplicate traffic to C-receivers.

In the most typical MVPN network topology, a data center or a hub
location is where one-to-many multicast applications are being
sourced. Typically, the customer's Rendezvous Points are also located
at the data centers/hubs. In this topology there is no advantage to
switching from shared to source trees since multicast VPN traffic is
already on the shortest path in provider's network. Moreover, it is
beneficial to the MVPN customer to stay on shared trees because no
unnecessary multicast states are created. If it is known that a C-tree
never switches to SPT, then a P2MP LSP with inbound signaling is
sufficient to support such C-trees.

7. Supporting C-Shared Trees

A multicast VPN might never switch traffic to SPT's for certain
multicast C-groups if an SPT-threshold of "infinity" is specified for
those groups. The procedures defined in section 6 of this document
preserve C-shared trees, regardless of whether a path between C-RP and
C-S is outside or across provider's network.

The procedures defined in section 6.2 of this document preserve
C-shared trees in case a path between C-RP and C-S is outside of
provider's network. This is in order to preserve the multicast states
and traffic patterns in the MVPN customer network.
According to the procedures in section 6.1, inter-PE traffic is
automatically switched to source trees for those C-sources whose path
to C-RP is across provider's network. However, in this scenario it is
transparent to the VPN customer whether multicast traffic is sent on
shared or source trees across provider's network. In other words, from
the customer network perspective multicast traffic is still on shared
trees.

8. Support of Anycast C-RP

The expected Anycast C-RP behavior is that different egress PE's could
choose different upstream PE's as the next-hops to the C-RP. Support
of multiple upstream PE's for Anycast C-RP is required.

There are two ways to support Anycast C-RP: based on provider's
network IGP cost or based on VPN customer routing. If there are
multiple next-hops to a static C-RP installed in an mVRF, the closest
PE, based on provider's network IGP cost, should be chosen as the best
next-hop to C-RP, and only as a tie breaker the PE with the highest IP
address. IGP cost-based next-hop selection provides PIM-like support
of Anycast C-RP's, i.e., C-receivers join the closest Anycast C-RP
across provider's network.

Another option is to always use the highest IP address as a tie
breaker for RPF neighbor selection and leave it to MVPN routing policy
to reach different Anycast-RP's. This allows the MVPN customer to
define its own Anycast C-RP selection, based on criteria other than
the closest distance.

Both Anycast C-RP options described above should be supported by the
MVPN implementation.

9. Inter-Site Signaling Procedures for PIM-SSM

With PIM-SSM an active C-source is discovered when a PE attached to
the C-source receives the first (C-S, C-G) Join, either from a
directly connected CE or from another PE in the MVPN. When a PE
attached to C-S receives the first (C-S, C-G) Join from another PE,
this PE will announce the P-tunnel to be used for (C-S, C-G) traffic
to all other PE's in the MVPN.
In PIM-SSM the source discovery and the P-tunnel announcement are one
and the same message. The PE's will store the C-S P-tunnel information
until they receive the P-tunnel withdrawal message for (C-S, C-G). A
PE that does not have any interested receivers for (C-S, C-G) when it
receives the P-tunnel announcement message will store this information
so it can join this P-tunnel for late (C-S, C-G) receivers. The
conditions for (C-S, C-G) P-tunnel withdrawal are defined in section
9.2.

Each PE attached to C-S, when it receives a (C-S, C-G) Join, will
announce its distinct P-tunnel for (C-S, C-G) traffic. An egress PE,
or more precisely an egress mVRF with receiver(s) of (C-S, C-G), will
"join" the P-tunnel announced for (C-S, C-G) only if the PE that sent
this announcement is the best next-hop to C-S in this mVRF. If there
is more than one best next-hop to C-S in the mVRF, the PE will choose
as the next hop the PE with the highest IP address, or the PE may
utilize a multicast multipath load splitting algorithm.

PIM-SSM allows the source to continuously send traffic even if there
are no receivers for this traffic. (The drawback of this behavior is
waste of sender resources and of the first-hop router/link bandwidth.)
If the C-S is already active when the (C-S, C-G) Join reaches the
C-router attached to C-S, the C-S traffic starts flowing immediately
on the C-source tree towards the PE. If the P-tunnel for (C-S, C-G)
has not yet been built up to the PE attached to C-S, a few initial
packets arriving from C-S will be dropped. It is rather unlikely that
there are PIM-SSM applications where the sender can be active without
receivers and yet no initial packet drop can be tolerated. In
addition, plain PIM-SSM is not a reliable protocol, and losing the
first (S, G) Join message sent by a last-hop router attached to a
receiver leads to loss of packets by this receiver.
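The rule by which an egress mVRF decides whether to join an announced
(C-S, C-G) P-tunnel can be sketched as follows. This is a minimal,
non-normative sketch: the function names are hypothetical, PE's are
identified by their IP addresses given as strings of a single address
family, and only the highest-IP-address tie breaker is shown (the
multipath load splitting alternative is not modeled).

```python
import ipaddress


def choose_upstream_pe(best_next_hops):
    """Pick one PE among equally good next-hops to C-S.

    The tie breaker is the numerically highest IP address.
    Returns None if there is no candidate.
    """
    if not best_next_hops:
        return None
    return max(best_next_hops, key=ipaddress.ip_address)


def should_join(announcing_pe, best_next_hops):
    """Join the P-tunnel only if it was announced by the chosen upstream PE."""
    chosen = choose_upstream_pe(best_next_hops)
    if chosen is None:
        return False
    return ipaddress.ip_address(announcing_pe) == ipaddress.ip_address(chosen)


# Two equally good next-hops to C-S; the comparison is numeric, not textual.
candidates = ["10.0.0.9", "10.0.0.10"]
chosen = choose_upstream_pe(candidates)
```

Comparing parsed addresses rather than strings matters here: a textual
comparison would rank "10.0.0.9" above "10.0.0.10".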
9.1 C-Receiver Pruning

An egress PE will send a (C-S, C-G) Prune message towards C-S when its
olist for (C-S, C-G) in an mVRF becomes empty. The egress PE will also
remove the (C-S, C-G) state from the mVRF. Upon (C-S, C-G) state
removal the mVRF will stop joining the P-tunnel announced for
(C-S, C-G) traffic.

9.2 P-tunnel Withdrawal for C-S

The state (C-S, C-G) will be removed by the PE attached to C-S after
the olist for (C-S, C-G) becomes empty. Upon (C-S, C-G) state removal,
the PE attached to C-S will send a P-tunnel withdrawal message for
(C-S, C-G). The egress PE's in a given MVPN, upon receiving the
(C-S, C-G) P-tunnel withdrawal message, will remove the P-tunnel
information.

9.3 Using P2MP LSP's as P-Tunnels for PIM-SSM C-Trees

If P-tunnels are built with receiver-driven P2MP MPLS LSP's [ix], the
P-tunnel for (C-S, C-G) can be algorithmically and uniquely chosen by
the egress PE's. An egress PE selects the "root" PE of the P-tunnel,
which is the best next-hop PE towards C-S in the mVRF, and builds the
P2MP LSP towards this root PE. Different PE's and different mVRF's may
choose different upstream PE's to reach C-S in the same MVPN. If the
address of the root PE is used in the LSP identification algorithm, a
distinct P2MP LSP per root is built. Hence, there could be multiple
entry points for C-S traffic into provider's network without
duplicates at the C-receivers. The LSP is triggered by the egress PE
when a (C-S, C-G) Join is received from a locally attached receiver.

The advantage of using P2MP LSP's for PIM-SSM C-trees is that no
out-of-band signaling is required. However, without out-of-band
signaling the aggregation of P2MP LSP's is not possible because it
could result in duplicate traffic being sent to the customer.

10. Inter-Site Signaling Procedures for PIM-Bidir

Some multicast applications use a many-to-many model where each
participant is a receiver as well as a sender. Using PIM-SM for such
applications results in increased memory and protocol overhead.
Bi-directional PIM [x] eliminates both Register message encapsulation
and source-specific states by allowing packets to be natively
forwarded from a source to the Rendezvous Point using shared tree
state only. This ensures that only (*,G) entries will appear in
multicast forwarding tables and that the path taken by packets flowing
from the source and/or receiver to the Rendezvous Point Address (RPA)
and vice versa will be the same.

Membership in a Bidir group is signaled via explicit (*, G) Join
messages. Traffic from sources is unconditionally sent up the shared
tree toward the RPA and passed down the tree toward the receivers on
each branch of the tree. This is in contrast with PIM-SM, where
traffic flows are unidirectional.

The olist of a (*, G) entry for Bidir group G includes all the
interfaces on which (*, G) Joins were received. If a router is located
on a sender-only branch, a Bidir implementation might also create
(*, G) state but the olist will not include any interfaces.

Traffic in a Bidir group is always forwarded to the RPA of that group.
If no receivers are along the way to the RPA, the traffic will be
dropped only at the RPA. Traffic will be forwarded to the RPA even if
there are no receivers at all.

10.1 Preventing C-Bidir Packet Loops in MVPN

IP Bi-directional PIM chooses a single Designated Forwarder (DF) for
upstream packets (away from the source) on every network segment and
point-to-point link. The DF procedure selects one router as the DF for
every RPA of bidirectional groups. The DF is responsible for
forwarding multicast packets upstream to the RPA as well as sending
(*,G) Join/Prune messages towards the RPA. To avoid packet loops, the
DF election procedure eliminates parallel downstream paths from any
RPA. It enforces a consistent view of the DF on all routers on a
network segment, and during periods of ambiguity or routing
convergence the traffic forwarding is suspended.
To avoid loops, customized routing in downstream routers does not
affect the choice of DF. In Bidir the path from a source/receiver to
the DF is always the best metric unicast path.

In the MVPN context a Designated Forwarder for a Bidir C-RPA is a PE
attached to the C-RPA. Different mVRF's in a given MVPN might have
different next-hop PE's to C-RPA due to different routing policies, or
they might have temporarily different next-hop PE's to C-RPA due to
routing transients. The MVPN solution for C-Bidir cannot rely on all
mVRF's in a given MVPN either having a common routing view to C-RPA or
reaching a common routing view to C-RPA in time to prevent packet
looping. Rather, a VPN has to be treated as a collection of sets of
multicast VRF's, each set having the same reachability towards C-RPA,
distinct from that of the other sets.

Resolving C-Bidir packet loops in MVPN inevitably results in the
ability to partition an MVPN into disjoint sets of mVRF's, served by
disjoint P-tunnels. Each such set would have a distinct view of the
converged network, i.e., it would have the same upstream PE as the
best next-hop towards the C-RPA. If there is more than one best
next-hop PE to C-RPA in a set, the tie breaker will be the upstream PE
with the highest IP address.

As an option, the MVPN implementation of C-Bidir should allow a
specific multicast routing policy in an mVRF to be ignored, and
instead make all PE's in a given MVPN choose the same next-hop PE to
C-RPA. Among all candidate next-hop PE's, the single chosen upstream
PE to C-RPA could be the PE with the highest IP address. This approach
to C-Bidir might be desirable to customers that do not want a
permanent splitting of their MVPN's into disjoint C-Bidir trees.

Note that the unicast routing policy in a VPN cannot influence VPN
multicast routing from a multi-homed site.
It is in the nature of Bidir that the path from a source/receiver site
towards the C-RPA is always the best metric unicast path and that the
choice is made locally at the VPN site.

10.2 Active Group P-tunnel Announcement in C-Bidir

The (C-*, C-G) state is first created on a PE attached to C-RPA (i.e.,
on a DF-PE) by a (C-*, C-G) Join from a locally connected or remote
C-receiver. Once (C-*, C-G) state is created, a DF-PE announces a
P-tunnel for the active group C-G to all PE's in a given MVPN. If BGP
is used as the P-tunnel announcement delivery mechanism, the P-tunnel
for the active C-Bidir group is announced via the Group-Only S-PMSI
auto-discovery route, defined in section 6.3. A PE that does not have
(C-*, C-G) state when it receives a C-G P-tunnel announcement message
will store this information so it can join the P-tunnel for late group
members.

PIM-Bidir supports source-only branches, i.e., branches that do not
lead to any receivers but that are used to forward packets traveling
upstream from sources towards the RPA. In plain IP PIM-Bidir it is up
to the implementation whether to maintain group state for source-only
branches [x]. However, the procedures defined in this document require
that in the MVPN context PE's on C-source-only branches maintain
(C-*, C-G) state. The existence of this state indicates that a PE is
on the C-Bidir tree and has to join a P-tunnel used for its traffic.
If (C-*, C-G) state was not maintained for source-only sites, a PE
would not know whether or not it is on C-G's Bidir tree. The
consequence of this would be that in order to deliver source-only site
traffic across provider's network, all PE's in a given MVPN would have
to join the P-tunnel announced for C-G.

If C-S traffic starts unconditionally flowing from a VPN site towards
a PE before a single (C-*, C-G) Join was received from any VPN site,
this traffic will be dropped at the PE. This is because no inter-PE
P-tunnel has been built yet for C-G traffic.
Since there are no receivers yet for this traffic, dropping it
optimizes the inter-PE behavior of C-Bidir. No C-G traffic is
unnecessarily sent across the MVPN until there is at least a single
receiver for C-G. This approach also has positive security
implications for service providers because it prevents a coordinated
attack of unconditional traffic from C-Bidir sources with no receivers
for this traffic.

10.2.1 Active C-Group Announcement in C-Bidir

Announcing a P-tunnel for C-Bidir traffic only when at least one
receiver already exists for this traffic might introduce a potential
delay in receiving traffic from C-Bidir sources by the upcoming
receivers. Namely, when one or more C-Bidir sources start
unconditionally sending traffic to a C-G group with no active
membership and the receivers subsequently join the C-G, the inter-PE
P-tunnel first has to be announced and built before the source traffic
can be delivered to the receivers. This can be easily remedied by
announcing an active C-Bidir group upon receiving unconditional source
traffic with no active membership.

A PE, upon receiving unconditional source traffic for C-G with empty
membership (i.e., the PE's olist for (C-*, C-G) is empty), will
announce the active group C-G to its DF-PE. If the olist for
(C-*, C-G) is non-empty on this PE or this PE has already received a
P-tunnel announcement for C-G, the PE will not announce that C-G is
active because this fact is already known in the MVPN.

When BGP is used as the delivery mechanism, a new route type has to be
defined for active C-group announcements.
A new route type, a Group Active auto-discovery route, is defined as
follows:

      +-----------------------------------+
      |           RD (8 octets)           |
      +-----------------------------------+
      | Multicast Group Length (1 octet)  |
      +-----------------------------------+
      |    Multicast Group (Variable)     |
      +-----------------------------------+

The RD is encoded as described in [iii].

The Multicast Group field contains the C-G address or C-Generic LSP
Identifier Value. If the Multicast Group field contains an IPv4
address or a C-Generic LSP Identifier Value, then the value of the
Multicast Group Length field is 32. If the Multicast Group field
contains an IPv6 address, then the value of the Multicast Group Length
field is 128.

The new Group Active auto-discovery route type will be assigned Route
Type value 7 of the MCAST-VPN NLRI defined in [viii].

If BGP is used as the P-tunnel announcement delivery mechanism, once
the DF-PE receives a Group Active auto-discovery route for C-G, it
will announce the P-tunnel to be used for C-G via the Group-Only
S-PMSI auto-discovery route, defined in section 6.3.

The procedure defined in this section should not be the default
behavior for handling C-Bidir traffic, but it should be implemented as
an option to be turned on or off per C-G in provider's network.

10.3 Bidir C-Group Becomes Inactive

The state (C-*, C-G) is removed on a DF-PE after its olist becomes
empty. Upon (C-*, C-G) state removal the DF-PE will send the P-tunnel
withdrawal message for C-G. This is the P-tunnel the DF-PE announced
on active C-G discovery. PE's attached to the participants of C-G,
upon receiving the C-G P-tunnel withdrawal message, will remove the
P-tunnel information.

10.4 P-tunnels for C-Bidir Traffic

The procedure defined in this document requires that C-Bidir traffic
is carried over MP2MP P-tunnels across provider's network, which can
be built with PIM-Bidir or with MP2MP LSP's.
This is because in this procedure only one P-tunnel is announced by and rooted at a DF-PE for a C-Bidir group. In fact, using MP2MP P-tunnels in the provider's network is the only scalable approach to C-Bidir.

During routing convergence or when different routing policies for C-Bidir are supported, PE's in a given MVPN might choose different upstream PE's as the best next-hops to C-RPA. Each PE attached to C-RPA announces a distinct MP2MP P-tunnel. At any given time, a PE in the MVPN joins only one P-tunnel, the one announced by its chosen DF-PE. Once the MVPN converges, each set of mVRF's with the same multicast routing policy will have a single DF-PE for a C-RPA. When the option for ignoring specific multicast VRF routing policies is turned on, all PE's in the MVPN will choose the same next-hop PE to C-RPA. A PE that joined a P-tunnel announced by a "transient" DF-PE has to join the P-tunnel announced by the converged DF-PE, and stop sending and accepting traffic on the tunnel announced by the transient DF-PE.

10.5 DF-PE Redundancy with Fast Convergence

To speed up C-Bidir convergence, certain optimizations could be added to C-Bidir support. When the Bidir C-RPA is redundantly connected, a PIM join could be sent to all PE's connected to the C-RPA site, not only to the DF-PE with the highest IP address. Each such candidate DF-PE would announce its own P-tunnel for C-G traffic. All those P-tunnels could be joined by the PE's on the C-Bidir tree, but each such PE will send and/or receive C-G traffic only over the P-tunnel announced by its current best DF-PE for C-G.

This procedure introduces a notion of primary and backup P-tunnels. A P-tunnel announced by the currently active DF-PE will be referred to as a primary P-tunnel. P-tunnels announced by non-active candidate DF-PE's will be referred to as backup P-tunnels.
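The primary/backup distinction can be illustrated with the following non-normative sketch. The function names, the dictionary layout, and the tunnel labels are hypothetical; the sketch assumes only that a PE knows which P-tunnel each candidate DF-PE announced for C-G and which candidate is its current best DF-PE.

```python
def classify_ptunnels(announced: dict, active_df_pe: str):
    """Split the joined P-tunnels for one C-G into (primary, backups).

    `announced` maps each candidate DF-PE address to the P-tunnel it
    announced; `active_df_pe` is this PE's current best DF-PE.
    """
    if active_df_pe not in announced:
        raise ValueError("active DF-PE has not announced a P-tunnel")
    primary = announced[active_df_pe]
    backups = sorted(t for pe, t in announced.items() if pe != active_df_pe)
    return primary, backups


def accept_packet(arrival_tunnel: str, primary: str) -> bool:
    """C-G traffic is sent and accepted only on the primary P-tunnel."""
    return arrival_tunnel == primary
```

A PE may have joined every tunnel in `announced`, but traffic arriving on a backup P-tunnel is not accepted while the primary is in use, which is what prevents duplicates.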
In case of a failure of the current DF-PE, upon detecting the failure, the new DF-PE could immediately send a PIM join across the redundant connection, and all PE's with participants in C-G will stop sending/receiving C-G traffic over the primary P-tunnel and will start sending/receiving traffic over the backup P-tunnel. Since this alternate P-tunnel already exists, the data loss is minimized. This is a trade-off between fast convergence and increased backbone bandwidth usage. The procedure defined in this section should be implemented as an option available to the service provider.

10.6 Using MP2MP LSP's as P-Tunnels for C-Bidir

The C-Bidir signaling procedure defined so far is based on P-tunnel announcements by DF-PE's. Announcing the MP2MP tunnel by a DF-PE allows for P-tunnel aggregation based on congruency of multicast flows. If C-Bidir were to be supported without aggregation, or with an aggregation not based on congruency of flows, then a different solution for C-Bidir is possible. Instead of DF-PE's announcing MP2MP tunnels, such tunnels could be algorithmically derived from the C-group and DF-PE addresses. This is possible when P-tunnels are MP2MP LSP's [ix]. This is the same technique as described in sections 6.5 and 9.3 except that MP2MP rather than P2MP LSP's are being used.

An egress PE selects the "root" PE of the P-tunnel, which is its best next-hop PE towards C-RPA, and builds the MP2MP LSP towards this root/DF-PE. Different PE's may choose different upstream PE's to reach C-RPA in the same MVPN. Since the address of the root PE is also used in the MP2MP LSP identification algorithm, a distinct MP2MP LSP per root is built. At any given time, a PE sends and receives C-G traffic only on one MP2MP LSP, the one rooted at the DF-PE chosen by this PE. Hence, multiple MP2MP LSP's can simultaneously carry the same C-RPA traffic without duplication and looping of packets.

This technique allows for further aggregation of traffic without causing traffic loops.
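The per-root derivation described in this section can be sketched as follows. This is a non-normative illustration: the function name is hypothetical, and the truncated SHA-256 hash is only a stand-in for the identification algorithm of sections 6.5 and 9.3, which is not reproduced here. The point of the sketch is that the root (DF-PE) address is part of the LSP key, so PE's that choose different roots derive distinct MP2MP LSP's, while PE's that choose the same root derive the same one without any announcement.

```python
import hashlib
import ipaddress


def derive_mp2mp_lsp_id(root_pe: str, c_group: str) -> tuple:
    """Derive an LSP key (root address, 32-bit opaque value) locally.

    ASSUMPTION: hashing the C-group address is a placeholder for the
    document's own identifier algorithm.
    """
    root = ipaddress.ip_address(root_pe)    # validates the DF-PE address
    group = ipaddress.ip_address(c_group)   # validates the C-G address
    digest = hashlib.sha256(group.packed).digest()
    opaque = int.from_bytes(digest[:4], "big")
    return (str(root), opaque)
```

Because the key is a (root, opaque value) pair, two roots can never share a key even when the opaque values are equal.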
Instead of generating one MP2MP LSP per C-G, one MP2MP LSP per DF-PE could be used for all C-groups for which the C-RPA is an active RP. In this case, the C-group address should not be used when generating the MP2MP LSP identifier; the C-RPA address should be used instead. Such aggregation may cause loss of bandwidth optimality but it will not generate loops in the MVPN.

11. IANA Considerations

To be supplied.

12. Security Considerations

To be supplied.

13. References

[i] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[ii] Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast. Work in progress.

[iii] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006.

[iv] Kim, D., Meyer, D., Kilmer, H., and D. Farinacci, "Anycast Rendevous Point (RP) mechanism using Protocol Independent Multicast (PIM) and Multicast Source Discovery Protocol (MSDP)", RFC 3446, January 2003.

[v] Farinacci, D. and Y. Cai, "Anycast-RP Using Protocol Independent Multicast (PIM)", RFC 4610, August 2006.

[vi] Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, August 2006.

[vii] Fenner, B., et al., "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, August 2006.

[viii] Aggarwal, R., Rosen, E., et al., "BGP Encoding for Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp. Work in progress.

[ix] Minei, I., Wijnands, I., et al., "Label Distribution Protocol Extensions for Point-to-Multipoint and Multipoint-to-Multipoint Label Switched Paths", draft-ietf-mpls-ldp-p2mp. Work in progress.

[x] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bi-directional Protocol Independent Multicast (Bidir-PIM)", draft-ietf-pim-bidir-09. Work in progress.
14. Acknowledgments

The author thanks Yakov Rekhter, Bill Fenner, Toerless Eckert, Ice Wijnands, and Lee Breslau for their comments and insights.

15. Author's Addresses

Maria Napierala
AT&T Labs
200 Laurel Avenue, Middletown, NJ 07748
Email: mnapierala@att.com

16. Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

17. Copyright Notice

Copyright (C) The IETF Trust (2007).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.