Network Working Group
Internet Draft                                          Maria Napierala
Document: draft-mnapierala-mvpn-rev-04.txt                         AT&T
Expires: August 24 2008                                February 24 2008

                   Segmented Multicast MPLS/BGP VPNs

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This document describes inter-site signaling procedures in MPLS/BGP IP VPNs that allow the same multicast stream to flow simultaneously on multiple inter-PE paths without duplicates being sent to receivers. Those procedures are independent of multicast tunnel technology used in service provider network as well as of the protocol used to exchange multicast signaling among PE's. The document specifies necessary information elements and their exchange process for the desired MVPN operation.

Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [i].

Napierala Expires - August 2008 [Page 1]
Segmented Multicast MPLS/BGP VPNs

Table of Contents

1.
Introduction
2. Terminology
3. Overview of the Solution
   3.1 Overview of Inter-PE Procedures for PIM-SM
4. PE-to-PE Signaling Information Elements
5. Inter-Site Signaling Procedures for PIM-SM
   5.1 C-Sources Discovered by PIM Control Messages
   5.2 C-Sources Not Discovered by PIM Control Messages
   5.3 Group-Only S-PMSI Auto-discovery Route
   5.4 Handling Initial Packets Sent on C-Shared Tree
   5.5 Using P2MP LSP's as P-Tunnels for C-Shared Trees
6. Supporting C-Shared Trees
7. Support of Anycast C-RP
8. Inter-Site Signaling Procedures for PIM-SSM
   8.1 C-Receiver Pruning
   8.2 P-tunnel Withdrawal for C-S
   8.3 Using P2MP LSP's as P-Tunnels for PIM-SSM C-Trees
9. Inter-Site Signaling Procedures for PIM-Bidir
   9.1 Preventing C-Bidir Packet Loops in MVPN
   9.2 Active Group P-tunnel Announcement in C-Bidir
   9.3 Bidir C-Group Becomes Inactive
   9.4 P-tunnels for C-Bidir Traffic
   9.5 DF-PE Redundancy with Fast Convergence
   9.6 Using MP2MP LSP's as P-Tunnels for C-Bidir
10. C-Multicast Traffic Aggregation
11. Supporting Source-Specific Host Reports in PIM-SM
12. IANA Considerations
13. Security Considerations
14. APPENDIX: Preserving C-Multicast Traffic Patterns in MVPN
15. References
16. Acknowledgments
17.
Author's Addresses
18. Intellectual Property Statement
19. Copyright Notice

1. Introduction

Multicast VPN (cf. [ii]) extends MPLS/BGP VPN services (cf. [iii]) by enabling customers to run native IP multicast within their IP VPN's. From the VPN customer's perspective there is no change in the multicast operational model. Multicast distribution trees are built in the service provider network to carry VPN multicast traffic. Those trees are essentially point-to-multipoint (P2MP) or multipoint-to-multipoint (MP2MP) tunnels that encapsulate IP VPN multicast packets for transport across the provider's network. Throughout this document, whenever we refer to a VPN we mean an MPLS/BGP IP VPN, and whenever we refer to an MVPN we mean an MPLS/BGP Multicast IP VPN.

This document defines procedures for exchanging multicast VPN routing that allow the same multicast stream to traverse multiple inter-PE paths without duplicate packets being sent to egress mVRF's. As a consequence, inter-PE C-multicast traffic can flow on multiple tunnels and simultaneously utilize multiple paths in a redundant topology. Different downstream PE's or even different multicast VRF's are allowed to choose different upstream PE's to a customer RP or a customer source. Only a single copy of any C-multicast stream is delivered to any egress mVRF in a converged network. This includes PIM-SM C-streams that are flowing on either a C-shared tree or a C-shortest-path tree. According to the procedures defined in this document, an egress PE receives a PIM-SM C-stream either from the C-RP or directly from the C-source but never from both.

The lack of support of parallel paths for multicast traffic would prevent different multicast VRF's of the same VPN from having different routing policies and choosing different paths to reach the C-RP or the C-source.
As a result it would break any kind of "anycast" sourcing of a multicast stream in an IP VPN, including Anycast RP [iv][v] operation, by not allowing multiple RP's to send traffic in parallel to their closest receivers.

The proposed duplicate-free operation of Multicast VPN's is independent of the multicast tunnel technology used by the service provider as well as of the protocol used to exchange multicast signaling among PE's. The proposed inter-PE multicast signaling does not impose any restrictions on the customer's multicast routing or requirements on the multicast service offering, e.g., it does not require the customer to outsource its RP functionality to the service provider or the service provider to participate in the customer's RP protocol by running MSDP with the customer.

The procedures defined in this document include the support of PIM-SM [vii], PIM-SSM [vi], and PIM-Bidir [x] based C-trees.

2. Terminology

In this document we use the "C-" prefix when we refer to the MVPN customer multicast addresses and multicast trees. We will prefix MVPN customer multicast trees, sources, groups, Rendezvous Points, and PIM routes with "C-", as in: C-tree, C-S, C-G, C-RP, (C-*, C-G), (C-S, C-G). We use the "P-" prefix when we refer to the provider's multicast addresses and multicast trees/tunnels.

We assume familiarity with the PIM protocol [vii][vi][x] and the terminology used in [ii].

3. Overview of the Solution

In order to support multiple inter-PE trees carrying the same C-multicast traffic without duplicate packets at the egress mVRF's, we segment a multicast VPN into sets of multicast VRF's such that each set has the same best route to C-S or C-RP. Each set is served by a different P-tunnel to deliver C-S or C-RP traffic. Each such P-tunnel is rooted at a unique PE that, for a given set of mVRF's, is the best next-hop to C-RP, C-Source, or C-RP Address.
This allows for the same C-group or the same C-source traffic to enter the provider's network at multiple PE's without creating duplicates to C-receivers.

In case of PIM-SM the proposed signaling procedure supports Anycast C-RP's by partitioning the MVPN by C-RP location, i.e., by the upstream PE attached to C-RP. In case of PIM-SM and PIM-SSM the proposed procedure supports partitioning the MVPN by C-source location, i.e., by the upstream PE attached to C-S. This allows C-multicast traffic to be simultaneously sent from each C-source location to a different set of C-receiver locations. In case of PIM-Bidir the proposed signaling procedure supports partitioning the MVPN by C-RPA location, i.e., by the upstream PE attached to C-RPA. In PIM-Bidir the partitioning of the MVPN by C-RPA location avoids multicast packet loops during routing convergence.

In order to trigger a P-tunnel rooted at a PE attached to C-RP/A or C-source to carry their traffic, the active C-groups and C-sources have to be discovered in the provider's network. This is straightforward when C-trees are built with PIM-SSM and PIM-Bidir. In PIM-SSM an active C-S is discovered when a PE attached to C-S receives a customer initiated (C-S, C-G) Join. In PIM-Bidir a group C-G is discovered when a PE attached to C-RPA receives a customer initiated (C-*, C-G) Join. When a PE attached to C-S (in PIM-SSM) or C-RPA (in PIM-Bidir) receives, respectively, a (C-S, C-G) Join or a (C-*, C-G) Join, it announces a P-tunnel for C-S or C-G traffic rooted at this PE. Those procedures are defined in detail in sections 8 and 9, respectively.

Supporting multiple inter-PE paths for the same C-multicast flow is more complex in PIM-SM. In sparse mode, assigning the (C-S, C-G) streams to an S-PMSI presupposes that there is a way of discovering the C-sources. Plain PIM-SM does this by examining the data plane to see who is sourcing the (*, G) traffic.
This document proposes instead to discover the C-sources by using the control plane, but without requiring a customer to outsource its RP functionality to the service provider and without relying on running MSDP with the customer.

In the MVPN context, the state created along the SPT from C-RP to C-S can be used by PE's to discover customer sources. An active C-source can be discovered by a PE attached to C-RP when it receives a (C-S, C-G) Join initiated by C-RP and destined to C-S. Receiving a (C-S, C-G) Join on a PE attached to C-RP triggers a Source Active advertisement, which, when received by a PE attached to C-S, causes that PE to announce a P-tunnel for (C-S, C-G) traffic. An egress mVRF (i.e., an mVRF with receivers of C-G) will join only the P-tunnel for (C-S, C-G) that was announced by its best next hop to C-S.

3.1 Overview of Inter-PE Procedures for PIM-SM

In native PIM-SM mode the same multicast traffic does not necessarily flow over a single tree but it can simultaneously flow on both shared and shortest path trees, without duplicates being sent to receivers. According to the inter-PE signaling procedures defined in this document a PIM-SM C-stream is never delivered to an egress PE from both the C-RP and directly from the C-source. In order to support this duplicate-free operation of PIM-SM in MVPN, the specified procedures assure that if a (C-S, C-G) stream is carried in an S-PMSI, and for the same C-G, the (C-*, C-G) stream is carried in an S-PMSI, then the (C-S, C-G) traffic must not be carried in the (C-*, C-G)'s S-PMSI. In order to assign a (C-S, C-G) stream to an S-PMSI the C-S has to be discovered in MVPN.

In order to discover PIM-SM C-sources based on PIM control messages, we decompose PIM-SM C-multicast into two types of topologies: (1) when the C-SPT from C-source to C-RP is across the provider's network and (2) when the C-SPT from C-source to C-RP is outside of the provider's network.
In topology (1) the C-sources can be discovered based on (C-S, C-G) Joins received by a PE on the interface towards the C-RP's. The traffic from such C-sources is carried only on the (C-S, C-G) S-PMSI, i.e., it is carried on C-SPT's across the provider's network. In topology (2) the C-sources cannot be discovered from control messages and the traffic from such sources is carried only on the (C-*, C-G) S-PMSI, i.e., it stays on the C-shared tree across the provider's network. In other words, the (C-*, C-G) S-PMSI is only used for those data packets whose source has not been learned from PIM control messages. This decomposition of PIM-SM routing is explained in detail in Appendix A.

Moreover, the PIM-SM C-multicast signaling defined in this document allows for multiple entry points for C-G or C-S traffic into the provider's network without duplicate packets being sent to egress mVRF's. This is because, according to the specified procedures, multicast traffic from a customer source or from a customer RP is never sent to a downstream multicast VRF over a tunnel that is not rooted at this mVRF's best next-hop PE towards the source or the RP.

We observe further that it is not necessary to perform customer initiated RPT-to-SPT switchover across the provider's network. The procedures defined in this document discover customer sources by observing the (C-S, C-G) Join messages from the C-RP. Such a C-source discovery mechanism does not depend on receiving SPT Joins from sites attached to receivers and thus avoids customer-initiated inter-PE RPT-to-SPT switchover. According to the procedure defined in this document, inter-PE C-multicast traffic is sent either only on SPT's or only on shared trees, regardless of whether it was or wasn't switched to SPT's in the customer domain. This avoids significant shifts of traffic in the provider's network and leads to simplification of PE-to-PE multicast routing.
The following PIM messages are eliminated between PE's: (C-S, C-G, rpt) Prunes and customer initiated (C-S, C-G) Joins associated with C-RPT to C-SPT switchover. The latter elimination has only one exception associated with dually homed receiver sites where C-RPT and C-SPT diverge (defined in section 5.1.1).

4. PE-to-PE Signaling Information Elements

The following information elements are required in support of the multicast signaling procedures defined in this document:

- active C-source announcements
- P-tunnel announcements and withdrawals for (C-*, C-G) traffic
- P-tunnel announcements and withdrawals for (C-S, C-G) traffic.

When BGP is used as an auto-discovery mechanism in MVPN, a new BGP NLRI (MCAST-VPN) is already defined in [viii] to handle different route types in MVPN. For active C-source announcements, the Source Active auto-discovery route defined in [viii] can be used. The P-tunnel announcements and withdrawals for (C-S, C-G) traffic can use the S-PMSI auto-discovery route also defined in [viii]. The S-PMSI auto-discovery route for P-tunnel announcements and withdrawals for (C-*, C-G) traffic is defined in section 5.3 of this document.

Optionally, there can be an additional route type defined for active C-group announcements. This route type and its purpose are defined in section 9.2.2 of the document.

5. Inter-Site Signaling Procedures for PIM-SM

An MVPN source C-S and its C-RP could communicate either across the provider's network or outside of the provider's network. In either topology, a PE attached to C-RP, upon receiving a (C-*, C-G) PIM Join from another PE or from a locally attached site, will send a (C-*, C-G) Join towards the C-RP. This PE will also announce a P-tunnel for the group C-G to all PE's in a given MVPN and it will add the P-tunnel interface to the (C-*, C-G) outgoing interface list (olist).

There could be more than one PE to which the same C-RP is attached. This could be because the C-RP is multi-homed or because it is an Anycast-RP.
Each PE that is attached to the C-RP and receives a (C-*, C-G) Join will announce a distinct P-tunnel for C-G. This allows for the same C-G traffic to enter the provider's network at multiple ingress points. Different PE's attached to receivers of C-G may receive C-G traffic on different P-tunnels without duplicate packets sent to receivers.

An egress PE, or more precisely an mVRF of a given MVPN attached to receiver(s) of C-G, will "join" or participate in only that C-G tunnel which was announced by the mVRF's best next-hop PE to C-RP. If there is more than one best next-hop PE to C-RP in the mVRF, the egress PE will choose as the next-hop the PE with the highest IP address, or it may utilize a multicast multipath load splitting algorithm when there are multiple C-RP's behind the same PE's. All PE's in the given MVPN will store the C-G's P-tunnel information until they receive the P-tunnel withdrawal message for C-G. The conditions for C-G P-tunnel withdrawal are defined in section 5.1.3.

In the meantime, a VPN source C-S might have sent a PIM Register message to C-RP with multicast data encapsulated in it. The C-RP extracts the multicast data packet from the Register message and sends it to the MVPN over the P-tunnel for group C-G. If the P-tunnel is not built yet, which is very unlikely because the P-tunnel creation was triggered upon receiving the first (C-*, C-G) Join, the initial data packet(s) to be sent across the provider's network will be dropped. We describe the probability of dropping the initial C-multicast traffic in section 5.4.

From this point on, depending on whether the SPT from C-S to C-RP is built across the provider's network or outside of the provider's network, the inter-PE procedures differ. They are defined in sections 5.1 and 5.2, respectively.
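The upstream PE selection described in this section can be sketched as follows. This is only an illustration under stated assumptions, not part of the specified procedures: the Route record, its cost field (lower is better), the mapping of root PE to tunnel identifier, and both function names are hypothetical and introduced here for the example. The optional multipath load splitting alternative is not shown.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One candidate unicast route towards C-RP inside an mVRF (hypothetical record)."""
    next_hop_pe: str  # IP address of the candidate upstream PE
    cost: int         # assumed route metric; lower is better


def select_upstream_pe(routes):
    """Pick the best next-hop PE; break ties with the highest IP address."""
    if not routes:
        return None
    best_cost = min(r.cost for r in routes)
    candidates = [r.next_hop_pe for r in routes if r.cost == best_cost]
    # Compare as addresses, not strings, so 10.0.0.10 ranks above 10.0.0.2.
    return max(candidates, key=ipaddress.ip_address)


def tunnel_to_join(announcements, routes):
    """announcements maps root PE address -> P-tunnel id announced for C-G.

    Only the tunnel rooted at the selected upstream PE is joined; tunnels
    announced by other PE's are ignored by this mVRF.
    """
    return announcements.get(select_upstream_pe(routes))
```

Because every tunnel other than the one rooted at the selected upstream PE is ignored, an mVRF receives at most one copy of the C-G traffic even when several PE's announce a P-tunnel for the same group.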
5.1 C-Sources Discovered by PIM Control Messages

A PE with an attached C-RP site, as PE2 in Figure 1 in Appendix A, upon receiving a (C-S, C-G) PIM Join from the CE attached to C-RP (CE2 in Figure 1), will create (C-S, C-G) state and will add the CE-PE interface to its olist. The olist of the (C-S, C-G) entry is also populated with a copy of the olist from the (C-*, C-G) entry except the P-tunnel used for C-G traffic. This is to avoid duplicate traffic, i.e., the same C-S traffic being sent on both the shortest-path tree and the shared tree across the provider's network. The PE attached to C-RP will propagate the (C-S, C-G) Join toward C-S.

(NOTE that if there is a receiver C-R of C-G at the C-RP site, it might happen that the first (C-S, C-G) Join that arrives at the PE attached to this site is from the C-R rather than from C-RP. This does not change the outcome and is transparent to the proposed procedure.)

When a site with C-S and a site with C-RP are attached to the same PE (as C-S2 and C-RP in Figure 1), this PE, upon receiving the first C-S packet on (C-S, C-G) state, will start sending (C-S, C-G, rpt) Prunes towards the C-RP. This is to stop receiving C-S traffic over the C-shared tree, i.e., to stop receiving packets decapsulated from Register messages. The traffic arriving on the C-RPT tree will eventually stop flowing when the Register Stop message from C-RP is received by the C-S. This will result in no more (C-S, C-G, rpt) Prunes being sent to the C-RP. To further optimize the traffic flow, the PE attached to C-RP should use so-called "turnaround rules" to prevent multicast traffic from unnecessarily reaching the C-RP if there are no interested receivers behind it.

In case a site with C-S and a site with C-RP are attached to the same PE, this PE will not announce a new P-tunnel for (C-S, C-G) traffic and it will send the C-S traffic over the already announced P-tunnel for C-G.
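The olist rule stated at the start of this section, and the choice between reusing the C-G P-tunnel and announcing the active source, can be sketched as follows. This is a minimal illustration, assuming that interfaces and P-tunnels are represented by plain string identifiers; the function names and the two action constants are hypothetical and are not defined by this document.

```python
# Hypothetical action labels used only in this sketch.
REUSE_CG_TUNNEL = "send C-S traffic on the already announced C-G P-tunnel"
ANNOUNCE_SOURCE_ACTIVE = "announce active source C-S of C-G to all PE's"


def build_sg_olist(ce_interface, star_g_olist, cg_p_tunnel):
    """Build the (C-S, C-G) olist on the PE attached to C-RP.

    The CE-PE interface on which the Join arrived is added, followed by a
    copy of the (C-*, C-G) olist with the C-G P-tunnel left out, so that
    C-S traffic is not sent on both the shared tree and the source tree.
    """
    olist = {ce_interface}
    olist.update(ifc for ifc in star_g_olist if ifc != cg_p_tunnel)
    return olist


def next_step(pe_attached_to_source, this_pe):
    """Decide what the PE attached to C-RP does once (C-S, C-G) state exists."""
    if pe_attached_to_source == this_pe:
        # C-S and C-RP sites share this PE: no new P-tunnel is announced.
        return REUSE_CG_TUNNEL
    return ANNOUNCE_SOURCE_ACTIVE
```

For example, with a (C-*, C-G) olist of {"ce5", "ptun-G"} and a Join arriving on "ce2", the resulting (C-S, C-G) olist is {"ce2", "ce5"}.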
In case the C-S is not attached to the same PE as C-RP (as C-S1 in Figure 1), the PE attached to C-RP will announce the active source C-S of C-G to all PE's in a given MVPN. Upon receiving the active source C-S announcement message, a PE that is the next-hop to source C-S (as PE1 in Figure 1) will send a P-tunnel announcement for (C-S, C-G) traffic to all PE's in the MVPN. The PE's will store the C-S P-tunnel information until they receive the P-tunnel withdrawal message for (C-S, C-G). A PE that does not have any interested receivers for C-G when it receives the (C-S, C-G) P-tunnel announcement message will store this information so it can join this P-tunnel for late receivers. The conditions for (C-S, C-G) P-tunnel withdrawal are defined in sections 5.1.3 and 5.1.4.

If C-S is dually connected to two different PE's, both of those PE's will announce their distinct P-tunnels for C-S traffic.

The PE's attached to receivers of C-G, upon receiving the P-tunnel announcement for (C-S, C-G) traffic, will initiate (C-S, C-G) Joins based on (C-*, C-G) PIM Joins received from locally attached CE's. Each such egress PE will send a (C-S, C-G) Join to the best next-hop PE towards C-S in an mVRF of the specified MVPN. The egress PE will also connect to the P-tunnel announced by the best next-hop PE to C-S in the mVRF. Egress PE's will continue participating in the C-shared tree to receive traffic from all other C-sources sending to C-G.

If there is more than one best next-hop to C-S in the mVRF (i.e., there are multiple equal cost paths), the egress PE will choose as the next-hop the PE with the highest IP address. A PE might utilize a multicast multipath load splitting algorithm if there are multiple C-sources behind the same PE's. All PE's have to use the same load splitting algorithm in order to choose the same upstream PE for the same C-S.
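This document does not specify the load splitting algorithm itself. The sketch below shows one way the consistency requirement could be met, assuming a hash of the C-S address taken over a sorted list of equal-cost upstream PE's. The function name and the use of SHA-256 are choices made for this example only. A process-independent hash is used because Python's built-in hash() of strings varies between processes, which would let different PE's pick different upstream PE's.

```python
import hashlib
import ipaddress


def pick_upstream_for_source(c_source, equal_cost_pes):
    """Map a C-S address onto one of several equal-cost upstream PE's.

    The candidate list is sorted first, so every PE that sees the same
    set of candidates computes the same answer regardless of the order
    in which it learned them.
    """
    if not equal_cost_pes:
        raise ValueError("no candidate upstream PE")
    ordered = sorted(equal_cost_pes, key=ipaddress.ip_address)
    digest = hashlib.sha256(c_source.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]
```

Two PE's that call pick_upstream_for_source with the same C-S and the same candidate set, in any order, obtain the same upstream PE, which is the property the text above requires.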
The P-tunnel announced for (C-S, C-G) traffic is also joined by the PE attached to C-RP that has (C-S, C-G) state with the interface towards C-RP in its olist (as PE2 in Figure 1). This is in order for C-RP to receive C-S traffic natively on the C-SPT. When the first C-S packet arrives over the C-S P-tunnel at the PE attached to C-RP (PE2 in Figure 1), this PE will start sending (C-S, C-G, rpt) Prunes towards the C-RP. This is in order to stop receiving C-S traffic over the C-shared tree, i.e., to stop receiving packets decapsulated from Register messages. The traffic arriving on the C-RPT tree will eventually stop flowing when the Register Stop message, sent by C-RP, is received by the C-S, and no more (C-S, C-G, rpt) Prunes will be sent to the C-RP. To further optimize the traffic flow, the PE attached to C-RP should use so-called "turnaround rules" to prevent multicast traffic from unnecessarily reaching the C-RP if there are no interested receivers behind it.

Upon receiving packets directly from a source C-S, customer last-hop routers might switch to SPT and send (C-S, C-G) Joins towards the C-S. When the SPT between C-RP and C-S is built across the provider's network, regardless of whether C-RP and C-S are attached to the same PE or different PE's, egress PE's do not need to propagate the (C-S, C-G) Join towards C-S. More precisely, when C-RP and C-S are attached to different PE's, an egress PE does not need to propagate a (C-S, C-G) Join received from a locally attached CE because in this scenario egress PE's have already switched to SPT when the P-tunnel for C-S was announced. When C-RP and C-S are attached to the same ingress PE, an egress PE does not need to propagate a (C-S, C-G) Join received from a locally attached CE because in this scenario the ingress PE has already joined the source C-S and pruned C-S traffic from the C-shared tree.
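The egress-side handling of a (C-S, C-G) P-tunnel announcement described earlier in this section can be sketched as follows. This is an illustrative model only: the EgressVrf record, its field names, and the function name are assumptions made for the example, and the best next-hop PE towards each C-S is assumed to be available from a lookup table rather than computed.

```python
from dataclasses import dataclass, field


@dataclass
class EgressVrf:
    """Minimal per-mVRF state on an egress PE (hypothetical model)."""
    best_next_hop_to_source: dict  # C-S -> best next-hop PE in this mVRF
    star_g_receivers: set          # C-G's with local (C-*, C-G) Join state
    stored_tunnels: dict = field(default_factory=dict)  # (C-S, C-G, root PE) -> tunnel id
    joined_tunnels: set = field(default_factory=set)
    joins_sent: list = field(default_factory=list)      # (C-S, C-G, upstream PE)


def on_sg_tunnel_announcement(vrf, c_s, c_g, root_pe, tunnel_id):
    """Process a P-tunnel announcement for (C-S, C-G); return True if joined."""
    # Always store the announcement so that late receivers can use it.
    vrf.stored_tunnels[(c_s, c_g, root_pe)] = tunnel_id
    if c_g not in vrf.star_g_receivers:
        return False  # no interested receivers yet
    if vrf.best_next_hop_to_source.get(c_s) != root_pe:
        return False  # tunnel is not rooted at this mVRF's best next hop
    vrf.joined_tunnels.add(tunnel_id)
    vrf.joins_sent.append((c_s, c_g, root_pe))
    return True
```

When C-S is dually connected and two PE's announce distinct P-tunnels, the mVRF stores both announcements but joins only the one rooted at its own best next-hop PE to C-S.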
5.1.1 Dually Connected C-Receivers

In this section we describe a scenario where a dually homed VPN site with receiver(s) chooses a different next-hop PE depending on whether a shared (C-*, C-G) tree or source (C-S, C-G) tree is joined. This means that shared and source trees diverge at this site.

                    C-S            C-RP
                     |               |
                    CE1             CE2
                   /   \             |
                  /     \            |
                PE1     PE2         PE3
                 |       |           |
                    Provider's Network
                   |               |
                  PE4             PE5
               ^     \           /     ^
     (C-*,C-G) |      \         /      | (C-S,C-G)
          Join |         CE3           | Join
                          |
                          |
                         C-R

             Figure 3: Dually connected C-Receiver

Figure 3 depicts an example of such a scenario. Customer receiver C-R is dually connected to the provider's network via PE4 and PE5. Let's assume that C-RPT and C-SPT diverge at CE3 and that PE4 is on C-RPT and PE5 is on C-SPT for (C-S, C-G). Let's also assume that PE1 is the best next-hop PE to C-S on PE4 and that PE2 is the best next-hop PE to C-S on PE5.

When a dually connected VPN receiver site switches from shared to shortest path tree, the egress PE on C-SPT (PE5 in Figure 3) will receive a (C-S, C-G) Join from this site, while it never received a (C-*, C-G) Join from it before. The egress PE will create (C-S, C-G) state, if it does not exist yet, and will add the interface on which it received the (C-S, C-G) Join to its olist. If there is already (C-*, C-G) state in the same multicast VRF, the olist of the (C-*, C-G) entry is copied into the olist of the new (C-S, C-G) entry. This is a standard PIM procedure to allow C-S traffic to flow to (C-*, C-G) receivers. If C-S and C-RP are not attached to the same PE and if the egress PE received a P-tunnel announcement for (C-S, C-G) traffic from the best next-hop PE to C-S in the specified mVRF (the latter condition guarantees that C-RP and C-S communicate across the provider's network), the egress PE will propagate the (C-S, C-G) Join towards C-S.
This is to cover the case when C-S is dually connected and the egress PE on C-RPT (as PE4 in Figure 3) chooses a different upstream PE to C-S than the egress PE on C-SPT (as PE5 in Figure 3). The egress PE on C-SPT will join the P-tunnel for either C-G or C-S of C-G if it was not joined yet. A PE always joins the most specific P-tunnel that was announced for (C-S, C-G) traffic, i.e., it will only join a P-tunnel that was announced for the C-G if there was no P-tunnel announcement for the C-S of the C-G.

Once a multicast packet is received on the C-SPT at a dually connected site, the PE which is on the C-RPT will receive a (C-S, C-G, rpt) Prune message from that site to prune C-S traffic off the C-shared tree. The PE on the C-RPT (as PE4 in Figure 3) does not need to propagate the (C-S, C-G, rpt) Prune message to C-RP, regardless of whether C-RP and C-S are attached to the same or different PE's. This is because C-S has already been pruned off the C-shared tree. The PE on the C-RPT might also stop joining the P-tunnel for (C-S, C-G) if there are no other receivers for (C-S, C-G) attached to it (i.e., if C-S traffic was pruned off on all (C-*, C-G) outgoing interfaces).

5.1.2 C-Shared Tree Switchback

If a site attached to an egress PE switches back from C-SPT to C-RPT because the C-S traffic rate fell below the SPT-threshold, the PE on C-RPT will receive a (C-*, C-G) Join to rejoin the shared tree. Since this (C-*, C-G) Join is sent without a (C-S, C-G, rpt) Prune it will cause the (C-S, C-G) Prune state along C-RPT to be deleted, which in turn will permit (C-S, C-G) traffic to begin flowing down the C-RPT again. If the egress PE stopped participating in the P-tunnel for C-S it has to rejoin this tunnel to receive the C-S traffic.

When a customer site switches back from C-SPT to C-RPT, the PE on the C-SPT attached to this site will receive a (C-S, C-G) Prune message.
In general, the egress PE does not need to propagate the (C-S, C-G) Prune message to a PE attached to C-S, even if C-S and C-RP are not attached to the same PE. This is because in this scenario, inter-PE C-trees are always SPT's. However, there is one exception, namely when SPT and RPT diverge at a dually connected site, as described in section 5.1.1. In this scenario, given that C-S and C-RP are attached to different PE's, when the egress PE receives a (C-S, C-G) Prune message it will remove the interface on which it received the Prune from the olist for (C-S, C-G). If the olist for (C-S, C-G) is empty, the egress PE on C-SPT will send a (C-S, C-G) Prune message up the C-SPT. It will also stop joining the P-tunnel for (C-S, C-G) traffic. This is to cover the case when C-S is dually connected and the egress PE on C-SPT (as PE5 in Figure 3) chooses a different upstream PE to C-S than the egress PE on C-RPT (as PE4 in Figure 3).

5.1.3 C-Receiver Pruning and P-tunnel Withdrawal

An egress PE will send a (C-*, C-G) Prune message towards C-RP when the olist for (C-*, C-G) in an mVRF of a given MVPN becomes empty. The C-RP could be locally attached to this PE or it can be attached to a different PE. In the latter case, the mVRF with an empty olist for (C-*, C-G) will stop joining the C-G P-tunnel announced by its best next-hop to C-RP. The egress PE will keep the C-G P-tunnel information in case it receives a new (C-*, C-G) Join from a locally attached site. This PE will also send (C-S, C-G) Prunes for all C-sources for which it triggered SPT's in the specified mVRF. The mVRF will also stop participating in P-tunnels announced for those C-sources but the P-tunnel information will be kept on the egress PE until it receives C-S tunnel withdrawals.

The state (C-*, C-G) is removed from a PE, or more specifically from an mVRF, attached to C-RP when its olist for (C-*, C-G) becomes empty. This means that the P-tunnel for the C-tree rooted at this PE is no longer needed.
Upon (C-*, C-G) state removal the PE attached to C-RP will send the P-tunnel withdrawal message for C-G. It will also stop joining the P-tunnels for (C-S, C-G) that it previously joined and it will remove their P-tunnel information.

Upon receiving the C-G tunnel withdrawal message, all PE's in the given MVPN will remove the C-G tunnel information. Every egress PE that previously joined this C-G tunnel in any of its mVRF's will also remove information about any P-tunnel for C-S of C-G associated with those mVRF's.

5.1.4 C-Source Becomes Inactive

The state (C-S, C-G) expires or is removed on a PE attached to C-S when C-S stops sending traffic and/or the state (C-S, C-G) was pruned by the PE because there were no receivers for this traffic (the latter condition was described in section 5.1.3).

When (C-S, C-G) state expires on the PE attached to C-S because C-S becomes inactive, this PE will send a P-tunnel withdrawal message for (C-S, C-G) to all PE's in a given MVPN. Upon receiving the C-S P-tunnel withdrawal message, PE's attached to receivers of C-G (including the PE attached to C-RP) will stop joining this P-tunnel and will remove this P-tunnel information. After C-S stops sending traffic, the (C-S, C-G) state will also expire on PE's attached to receivers of (C-S, C-G).

Upon receiving the C-S P-tunnel withdrawal message, the PE attached to C-RP of C-G will, if applicable, stop sending periodic (C-S, C-G, rpt) Prune messages towards the C-RP's.

5.2 C-Sources Not Discovered by PIM Control Messages

Even if from the provider's network perspective C-S and C-RP are reachable via different PE's or via different interfaces on the same PE, the SPT between the C-S and the C-RP could be engineered by a customer to be outside of the provider's network. See Figure 1a in Appendix A. When the SPT from C-S to C-RP is built outside of the provider's network, the C-S cannot be discovered via control messages.
In this scenario, the C-S traffic will be carried over the C-shared tree between PE's. Moreover, the inter-PE signaling is simplified by not switching to C-SPT's at the egress PE's at all. Hence, C-trees will be the shared trees from egress PE's to C-RP's, regardless of whether customer last-hop routers switched to SPT's.

(NOTE that if there is a receiver C-R of C-G at the C-RP site and the source-tree from C-R to C-S is across the provider's network while the source-tree from C-RP to C-S is engineered to be outside the provider's network, then the PE attached to this site will receive a (C-S, C-G) Join. In this scenario the C-S will be discovered and announced in MVPN, following the procedure defined in section 5.1.)

Traffic from all C-sources that can't be discovered in MVPN is kept on the same P-tunnel, regardless of whether it is flowing on the shared tree or a source tree in the customer network. This is the P-tunnel that was announced for the group C-G by the PE attached to C-RP. In fact, this procedure allows for further aggregation of traffic without generating duplicates. Namely, the traffic for all C-G's for which the C-RP is the active RP could be aggregated onto the same P-tunnel. Such aggregation may cause loss of bandwidth optimality by delivering traffic to PE's that don't need it, but it will not generate duplicate traffic to C-receivers.

Upon receiving packets directly from source C-S, customer last-hop routers might switch to SPT's and send (C-S, C-G) Joins. However, the egress PE that received a (C-S, C-G) Join from a locally attached CE will not propagate it to C-S and the egress PE will not switch to C-SPT's. This includes the topologies where the PE attached to C-S is either the same as or different from the PE attached to C-RP. In addition, when C-RP and C-S are attached to the same PE, there is no switching to C-SPT's regardless of whether C-RP and C-S are behind the same or different CE's.
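The rule that discovered sources ride a (C-S, C-G) P-tunnel while all undiscovered sources stay on the C-G P-tunnel can be sketched as a most-specific lookup. This is a minimal illustration, assuming two hypothetical tables on the PE that sends the traffic: one keyed by (C-S, C-G) and one keyed by C-G. The names are introduced here for the example only.

```python
def p_tunnel_for_packet(c_s, c_g, sg_tunnels, star_g_tunnels):
    """Return the P-tunnel that carries a packet from C-S to C-G.

    sg_tunnels:     {(C-S, C-G): tunnel id} for sources discovered via
                    PIM control messages and given their own P-tunnel.
    star_g_tunnels: {C-G: tunnel id} announced by the PE attached to C-RP.

    A (C-S, C-G) tunnel, when present, always wins, so the same C-S
    traffic is never also placed on the (C-*, C-G) tunnel. Returns None
    when no P-tunnel has been announced for the group.
    """
    if (c_s, c_g) in sg_tunnels:
        return sg_tunnels[(c_s, c_g)]
    return star_g_tunnels.get(c_g)
```

A source attached to the same PE as C-RP has no entry in sg_tunnels, since no new P-tunnel is announced for it, and therefore falls through to the C-G P-tunnel, which matches the behavior described in section 5.1.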
5.2.1 Dually Connected C-Receivers

There is one scenario that needs to be separately addressed, namely a dually homed VPN receiver site with shared and source trees diverging.

              C-S     C-RP
                \     /
                 \   /
                  R-1
                   |
                  CE1
                 /   \
                /     \
              PE1     PE2
               |       |
               |       |
             Provider's Network
               |       |
               |       |
              PE3     PE4
            ^   \     /   ^
  (C-*,C-G) |    \   /    | (C-S,C-G)
       Join |     CE2     | Join
                   |
                  C-R

         Figure 4: Dually connected C-Receiver

Figure 4 depicts an example of such a scenario. Customer receiver C-R is dually connected to the provider's network via PE3 and PE4. Let's assume that C-RPT and C-SPT diverge at CE2 and that PE3 is on C-RPT and PE4 is on C-SPT for (C-S, C-G). Let's also assume that PE1 is the best next-hop PE to C-RP on PE3 and that PE2 is the best next-hop PE to C-RP on PE4.

When such a dually connected site switches from the shared tree to the shortest path tree, the egress PE on C-SPT (PE4 in Figure 4) will receive a (C-S, C-G) Join message from this site. The egress PE on C-SPT will create (C-S, C-G) state in the relevant mVRF, if it does not exist yet, and it will add the site's interface to the (C-S, C-G) olist. If there is already (C-*, C-G) state in the same multicast VRF, the olist of the (C-*, C-G) entry is copied into the olist of the new (C-S, C-G) entry. This is a standard PIM procedure to allow C-S traffic to flow to (C-*, C-G) receivers.

The egress PE on C-SPT will not propagate the (C-S, C-G) Join towards C-S because there is no C-RPT to C-SPT switching across the provider's network. The egress PE on C-SPT will convert (C-S, C-G) Joins to (C-*, C-G) Joins and will send them to its upstream PE towards the C-RP. This is necessary because the best next-hop to C-RP on the egress PE on C-SPT (as PE4 in Figure 4) might be different from the best next-hop to C-RP on the egress PE on C-RPT (as PE3 in Figure 4). The egress PE will join the P-tunnel announced for C-G by the best next-hop PE to C-RP in the relevant mVRF, if it did not join it yet.
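
The handling of a (C-S, C-G) Join by the egress PE on C-SPT, as described above, can be illustrated with the following minimal Python sketch. It is not part of the specification: the MVrf class, the on_sg_join function and the action tuples are hypothetical names introduced only for illustration, and the sketch assumes that the best next-hop PE towards C-RP has already been determined.

```python
from dataclasses import dataclass, field


@dataclass
class MVrf:
    """Hypothetical per-VPN multicast state kept by an egress PE."""
    star_g_olist: dict = field(default_factory=dict)  # C-G -> set of interfaces
    sg_olist: dict = field(default_factory=dict)      # (C-S, C-G) -> set of interfaces
    joined_tunnels: set = field(default_factory=set)  # (root PE, C-G) already joined


def on_sg_join(mvrf, c_s, c_g, iface, best_pe_to_rp):
    """Process a (C-S, C-G) Join received from a locally attached site.

    Returns the list of inter-PE actions the egress PE would take.
    """
    actions = []
    key = (c_s, c_g)
    if key not in mvrf.sg_olist:
        # A new (C-S, C-G) entry inherits a copy of the (C-*, C-G) olist, if any.
        mvrf.sg_olist[key] = set(mvrf.star_g_olist.get(c_g, set()))
    mvrf.sg_olist[key].add(iface)

    # The Join is not propagated towards C-S; it is converted to a
    # (C-*, C-G) Join sent to the upstream PE towards C-RP.
    actions.append(("send_star_g_join", best_pe_to_rp, c_g))

    # Join the C-G P-tunnel announced by that upstream PE, if not joined yet.
    tunnel = (best_pe_to_rp, c_g)
    if tunnel not in mvrf.joined_tunnels:
        mvrf.joined_tunnels.add(tunnel)
        actions.append(("join_p_tunnel", best_pe_to_rp, c_g))
    return actions
```

In terms of Figure 4, PE4 would call on_sg_join with PE2 as best_pe_to_rp; a second Join for the same C-G only refreshes the (C-*, C-G) Join because the P-tunnel has already been joined.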
Once multicast traffic is received on the C-SPT at the dually connected site, the PE which is on the C-RPT tree will start receiving (C-S, C-G, rpt) Prune messages to prune C-S traffic off the C-shared tree. The egress PE will not propagate the (C-S, C-G, rpt) Prune towards C-RP because the C-RPT will not be switched to C-SPT across the provider's network.

5.2.2 C-Shared Tree Switchback

If a site attached to an egress PE switches back from C-SPT to C-RPT because the C-S traffic rate fell below the SPT-threshold, the PE on C-RPT will receive a (C-*, C-G) Join from a customer site to rejoin the shared tree. Since the (C-*, C-G) Join will be sent without a (C-S, C-G, rpt) Prune, this will cause the (C-S, C-G) Prune state along C-RPT to be deleted, which will permit (C-S, C-G) traffic to begin flowing down the C-RPT again.

In case a receiver site is dually connected and it receives the C-S traffic on C-RPT, it will send a (C-S, C-G) Prune message to the PE on C-SPT. The PE on C-SPT will prune the interface on which it received the (C-S, C-G) Prune message off the C-SPT. If its olist for (C-S, C-G) is empty and there is no (C-*, C-G) state or the olist for (C-*, C-G) becomes empty, the egress PE on C-SPT will stop sending (C-*, C-G) Joins towards C-RP and it will also stop joining the P-tunnel for C-G traffic. This prevents unneeded traffic from being sent to the egress PE.

5.2.3 C-Receiver Pruning and P-tunnel Withdrawal

An egress PE will send a (C-*, C-G) Prune message towards C-RP when the olist for (C-*, C-G) becomes empty in an mVRF. The C-RP could be locally attached to this PE or it can be attached to a different PE. The mVRF on the egress PE with an empty (C-*, C-G) olist will stop participating in the P-tunnel for C-G that it previously joined.

The state (C-*, C-G) is removed on the PE attached to C-RP when its olist for (C-*, C-G) becomes empty. This means that the C-G tunnel rooted at this PE is no longer needed.
Upon (C-*, C-G) state removal the PE attached to C-RP will send the P-tunnel withdrawal message for C-G to all PE's in a given MVPN. Upon receiving the C-G tunnel withdrawal message, all PE's in the MVPN will remove the C-G tunnel information.

5.3 Group-Only S-PMSI Auto-discovery Route

When BGP is used as the auto-discovery mechanism in MVPN, a new BGP NLRI (MCAST-VPN) is already defined in [viii] to handle different route types in MVPN. According to the procedures defined in sections 5.1 and 5.2, the MCAST-VPN NLRI definition has to be extended to include a new Route Type called Group-Only S-PMSI auto-discovery route.

The Group-Only S-PMSI auto-discovery route is an announcement of an active VPN C-group and the P-tunnel to be used for its traffic. The P-tunnel information is carried in a BGP attribute called PMSI P-tunnel attribute, already defined in [viii].

The Group-Only S-PMSI auto-discovery route type will be assigned Route Type value of 6 of the MCAST-VPN NLRI and will consist of the following:

   +-----------------------------------+
   |           RD (8 octets)           |
   +-----------------------------------+
   | Multicast Group Length (1 octet)  |
   +-----------------------------------+
   |    Multicast Group (Variable)     |
   +-----------------------------------+
   |   Originating Router's IP Addr    |
   +-----------------------------------+

The RD is encoded as described in [iii].

The Multicast Group field contains the C-G address or C-Generic LSP Identifier Value. If the Multicast Group field contains an IPv4 address or a C-Generic LSP Identifier Value, then the value of the Multicast Group Length field is 32. If the Multicast Group field contains an IPv6 address, then the value of the Multicast Group Length field is 128.

The Originating Router's IP Address field MUST be set to the IP address that the PE places in the Global Administrator field of the VRF Route Import extended community of the VPN-IP routes advertised by the PE.
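
As an illustration of the field layout above, the following Python sketch serializes a Group-Only S-PMSI auto-discovery route. It is a sketch under stated assumptions, not a normative encoding: the function name is hypothetical, the RD is taken as 8 already-encoded octets, and the leading one-octet Route Type and one-octet Length fields are assumed to follow the general MCAST-VPN NLRI layout of [viii]. The C-Generic LSP Identifier case is not covered.

```python
import ipaddress
import struct

# Route Type value proposed for the Group-Only S-PMSI A-D route in this section.
GROUP_ONLY_SPMSI_ROUTE_TYPE = 6


def encode_group_only_spmsi(rd: bytes, group: str, originator: str) -> bytes:
    """Serialize the route: type, length, RD, group length, group, originator."""
    if len(rd) != 8:
        raise ValueError("RD must be exactly 8 octets")
    grp = ipaddress.ip_address(group)
    orig = ipaddress.ip_address(originator)
    # Multicast Group Length is expressed in bits: 32 for IPv4, 128 for IPv6.
    body = rd + struct.pack("!B", grp.max_prefixlen) + grp.packed + orig.packed
    # Assumed outer header: Route Type (1 octet) and Length (1 octet).
    return struct.pack("!BB", GROUP_ONLY_SPMSI_ROUTE_TYPE, len(body)) + body
```

For an IPv4 C-G and an IPv4 originating router address the route-type specific part is 17 octets long (8 + 1 + 4 + 4).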
5.4 Handling Initial Packets Sent on C-Shared Tree

According to the procedures described in sections 5.1 and 5.2, the initial C-G multicast packets sent over the C-shared tree could be dropped by the PE attached to C-RP until a P-tunnel for C-G traffic is built. Since the C-G tunnel is announced when the first (C-*, C-G) PIM Join is received by the PE attached to C-RP of C-G, this P-tunnel should be built in time to carry the initial C-S packets.

In PIM-SM there are two scenarios to consider: (A) the source registers first, before there are any interested receivers, or (B) receivers join the group first, waiting for traffic on the shared tree. We will analyze these two scenarios based on the inter-PE PIM-SM procedures defined in this document.

In scenario (A), whether in plain PIM or in MVPN context, the initial source packets are discarded because there are no receivers on the shared tree. According to the PIM-SM procedure, when there are no receivers on the shared tree, the C-RP sends a (C-S, C-G) "Register-Stop" message to the 1st-hop router to stop it from sending Register messages. The Register process will restart in 3 minutes (at the earliest, depending on whether C-S is still active). If in the meantime the C-receivers join the group C-G, there is plenty of time for the C-G P-tunnel to be announced and created.

In scenario (B), there exists a short window of time during which the initial C-source packets could be dropped, namely when the first active C-S registers with C-RP immediately after the first C-receiver joined the C-G, not giving enough time for the C-G P-tunnel to be built. This is the only scenario under which there could be packet discards in MVPN while there are no similar drops in plain PIM-SM multicast. However, even in plain PIM-SM there could be packet drops, especially with bursty sources, since only a bounded amount of traffic can be encapsulated in PIM Register or MSDP SA messages.
5.5 Using P2MP LSP's as P-Tunnels for C-Shared Trees

If P-tunnels are built with receiver-driven P2MP MPLS LSP's [ix], the P-tunnel for C-G can be algorithmically and uniquely chosen by the egress PE's. An egress PE selects the "root" PE of the P-tunnel, which is its best next-hop PE towards C-RP, and builds the P-tunnel towards this root PE. Different PE's may choose different upstream (i.e., root) PE's to reach C-RP in the same MVPN. This might happen if C-RP is dually connected or if Anycast C-RP is used. When the address of the root PE is used in the tunnel identification algorithm, a distinct P2MP LSP per root can be built. Hence, multiple P-tunnels can be simultaneously used to carry the same C-G traffic without creating duplicates at the C-receivers. The P2MP LSP is triggered by the egress PE when a (C-*, C-G) Join is received from a locally attached receiver.

This technique allows for further aggregation of traffic without generating duplicates. Instead of one P2MP LSP per root PE per C-G, one P2MP LSP per root could be used for all C-groups for which the C-RP is the active RP. In this case, the C-group address has to be ignored in the P2MP LSP identifier; instead the C-RP address should be used. Such aggregation may cause loss of bandwidth optimality but it will not generate duplicate traffic to C-receivers.

In the most typical MVPN network topology, a data center or a hub location is where one-to-many multicast applications are being sourced. Typically, the customer's Rendezvous Points are also located at the data centers/hubs. In this topology there is no advantage in switching from shared to source trees since multicast VPN traffic is already on the shortest path in the provider's network. Moreover, it is beneficial to the MVPN customer to stay on shared trees because no unnecessary multicast states are created.
If it is known that a C-tree never switches to SPT then a P2MP LSP with inbound signaling is sufficient to support such C-trees.

6. Supporting C-Shared Trees

The last hop customer routers might never switch traffic to SPT's for certain multicast C-groups if an SPT-threshold of "infinity" is specified for those groups. The procedures defined in section 5 of this document preserve C-shared trees, regardless of whether a path between C-RP and C-S is outside or across the provider's network.

The procedures defined in section 5.2 of this document preserve C-shared trees in case a path between C-RP and C-S is outside of the provider's network. This is in order to preserve the multicast states and traffic patterns in the MVPN customer network. According to the procedures in section 5.1, inter-PE traffic is automatically switched to source trees for those C-sources whose path to C-RP is across the provider's network. However, in this scenario it is transparent to the VPN customer whether multicast traffic is sent on shared or source trees across the provider's network. In other words, from the customer network perspective multicast traffic is still on shared trees.

7. Support of Anycast C-RP

The expected Anycast C-RP behavior is that different egress PE's could choose different upstream PE's as the next-hops to the C-RP. Support of multiple upstream PE's for Anycast C-RP is required.

There are two ways to support Anycast C-RP: based on the provider's network IGP cost or based on VPN customer routing. If there are multiple next-hops to a static C-RP installed in an mVRF, the closest PE, based on the provider's network IGP cost, should be chosen as the best next-hop to C-RP, with the highest PE IP address used only as a tie breaker. IGP cost-based next-hop selection provides PIM-like support of Anycast C-RP's, i.e., C-receivers join the closest Anycast C-RP across the provider's network.
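
The IGP cost-based option above can be sketched as follows. This is only an illustration, assuming that the mVRF already holds the candidate next-hop PE's to the C-RP together with their IGP costs; the function name and the input dictionary are hypothetical.

```python
import ipaddress


def best_next_hop_to_crp(candidates):
    """Pick the upstream PE towards an Anycast C-RP.

    candidates maps a PE address (string) to the provider IGP cost of
    reaching that PE. The lowest cost wins; the highest IP address is
    used only to break ties between equal-cost PE's.
    """
    if not candidates:
        return None
    return min(
        candidates,
        # Negating the numeric address makes min() prefer the highest address.
        key=lambda pe: (candidates[pe], -int(ipaddress.ip_address(pe))),
    )
```

Addresses are compared numerically rather than as strings, so that, for example, 10.0.0.10 is treated as higher than 10.0.0.9.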
Another option is to always use the highest IP address as a tie breaker for RPF neighbor selection and leave it to MVPN routing policy to reach different Anycast-RP's. This allows the MVPN customer to define its own Anycast C-RP selection, based on criteria other than the closest distance.

Both Anycast C-RP options described above should be supported by the MVPN implementation.

8. Inter-Site Signaling Procedures for PIM-SSM

With PIM-SSM an active C-source is discovered when a PE attached to the C-source receives the first (C-S, C-G) Join, either from a directly connected CE or from another PE in the MVPN. When a PE attached to C-S receives the first (C-S, C-G) Join from another PE, this PE will announce the P-tunnel to be used for (C-S, C-G) traffic to all other PE's in the MVPN. In PIM-SSM the source discovery and the P-tunnel announcement are one and the same message.

The PE's will store the C-S P-tunnel information until they receive the P-tunnel withdrawal message for (C-S, C-G). A PE that does not have any interested receivers for (C-S, C-G) when it receives the P-tunnel announcement message will store this information so it can join this P-tunnel for late (C-S, C-G) receivers. The conditions for (C-S, C-G) P-tunnel withdrawal are defined in section 8.2.

Each PE attached to C-S, when it receives a (C-S, C-G) Join, will announce its distinct P-tunnel for (C-S, C-G) traffic.

An egress PE, or more precisely an egress mVRF with receiver(s) of (C-S, C-G), will "join" the P-tunnel announced for (C-S, C-G) only if the PE that sent this announcement is the best next-hop to C-S in this mVRF. If there is more than one best next-hop to C-S in the mVRF, the PE will choose as the next hop the PE with the highest IP address, or the PE may utilize a multicast multipath load splitting algorithm.

PIM-SSM allows the source to continuously send traffic even if there are no receivers for this traffic.
(The drawback of this behavior is the waste of sender resources and of the first-hop router/link bandwidth.) If the C-S is already active when the (C-S, C-G) Join reaches the C-router attached to C-S, the C-S traffic starts immediately flowing on the C-source tree towards the PE. If the P-tunnel for (C-S, C-G) has not yet been built up to the PE attached to C-S, a few initial packets arriving from C-S will be dropped. It is rather unlikely that there are PIM-SSM applications where the sender can be active without receivers and yet no initial packet drop can be tolerated.

8.1 C-Receiver Pruning

An egress PE will send a (C-S, C-G) Prune message towards C-S when its olist for (C-S, C-G) in an mVRF becomes empty. The egress PE will also remove the (C-S, C-G) state from the mVRF. Upon (C-S, C-G) state removal the mVRF will stop joining the P-tunnel announced for (C-S, C-G) traffic.

8.2 P-tunnel Withdrawal for C-S

The state (C-S, C-G) will be removed by the PE attached to C-S after the olist for (C-S, C-G) becomes empty. Upon (C-S, C-G) state removal, the PE attached to C-S will send a P-tunnel withdrawal message for (C-S, C-G). The egress PE's in a given MVPN, upon receiving the (C-S, C-G) P-tunnel withdrawal message, will remove the P-tunnel information.

8.3 Using P2MP LSP's as P-Tunnels for PIM-SSM C-Trees

If P-tunnels are built with receiver-driven P2MP MPLS LSP's [ix], the P-tunnel for (C-S, C-G) can be algorithmically and uniquely chosen by the egress PE's. The egress PE selects the "root" PE of the P-tunnel, which is the best next-hop PE towards C-S in the mVRF, and builds the P2MP LSP towards this root PE. Different PE's and different mVRF's may choose different upstream PE's to reach C-S in the same MVPN. If the address of the root PE is used in the LSP identification algorithm, a distinct P2MP LSP per root is built. Hence, there could be multiple entry points for C-S traffic into the provider's network without duplicates at the C-receivers.
The LSP is triggered by the egress PE when a (C-S, C-G) Join is received from a locally attached receiver. The advantage of using P2MP LSP's for PIM-SSM C-trees is that no out-of-band signaling is required. However, without out-of-band signaling the aggregation of P2MP LSP's is not possible because it could result in duplicate traffic being sent to the customer.

9. Inter-Site Signaling Procedures for PIM-Bidir

Some multicast applications use a many-to-many model where each participant is a receiver as well as a sender. Using PIM-SM for such applications results in increased memory and protocol overhead. Bi-directional PIM [x] eliminates both Register message encapsulation and source-specific states by allowing packets to be natively forwarded from a source to the Rendezvous Point using shared tree state only. This ensures that only (*,G) entries will appear in multicast forwarding tables and that the path taken by packets flowing from the source and/or receiver to the Rendezvous Point Address (RPA) and vice versa will be the same.

Membership of a Bidir group is signaled via explicit (*, G) join messages. Traffic from sources is unconditionally sent up the shared tree toward the RPA and passed down the tree toward the receivers on each branch of the tree. This is in contrast with PIM-SM where traffic flows are unidirectional.

The olist of a (*, G) entry for Bidir group G includes all the interfaces on which (*, G) Joins were received. If a router is located on a sender-only branch, a Bidir implementation might also create (*, G) state but the olist will not include any interfaces.

Traffic in a Bidir group is always forwarded to the RPA of that group. If no receivers are along the way to the RPA, the traffic will be dropped only at the RPA. Traffic will be forwarded to the RPA even if there are no receivers at all.
9.1 Preventing C-Bidir Packet Loops in MVPN

IP Bi-directional PIM chooses a single Designated Forwarder (DF) for upstream packets (away from the source) on every network segment and point-to-point link. The DF procedure selects one router as the DF for every RPA of bidirectional groups. The DF is responsible for forwarding multicast packets upstream to the RPA as well as sending (*,G) Join/Prune messages towards the RPA.

To avoid packet loops the DF election procedure eliminates parallel downstream paths from any RPA. It enforces a consistent view of the DF on all routers on a network segment, and during periods of ambiguity or routing convergence the traffic forwarding is suspended. To avoid loops, customized routing in downstream routers does not affect the choice of DF. In Bidir the path from a source/receiver to the DF is always the best metric unicast path.

In the MVPN context a Designated Forwarder for a Bidir C-RPA is a PE attached to the C-RPA. Different mVRF's in a given MVPN might have different next-hop PE's to C-RPA due to different routing policies or they might have temporarily different next-hop PE's to C-RPA due to routing transients. The MVPN solution for C-Bidir cannot rely on all mVRF's in a given MVPN either to have a common routing view to C-RPA or to reach a common routing view to C-RPA in time to prevent packet looping. Rather, a VPN has to be treated as a collection of sets of multicast VRF's, where all mVRF's in a set share the same reachability towards C-RPA and that reachability is distinct from that of the other sets.

Resolving C-Bidir packet loops in MVPN inevitably results in the ability to partition an MVPN into disjoint sets of mVRF's, served by disjoint P-tunnels. Each such set would have a distinct view of the converged network, i.e., it would have the same upstream PE as the best next-hop towards the C-RPA. If there is more than one best next-hop PE to C-RPA in a set, the tie breaker will be the upstream PE with the highest IP address.
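
The partitioning described above can be pictured with a short Python sketch. It is illustrative only and assumes that each mVRF has already computed its set of equally best next-hop PE's towards the C-RPA; the function name and the mVRF identifiers are hypothetical.

```python
from collections import defaultdict
import ipaddress


def partition_by_df_pe(mvrf_next_hops):
    """Group mVRF's of one MVPN by their chosen upstream PE (DF-PE).

    mvrf_next_hops maps an mVRF identifier to the non-empty list of its
    equally best next-hop PE addresses towards the C-RPA. Ties are
    broken by the highest IP address. Each resulting set would be
    served by its own P-tunnel rooted at that DF-PE.
    """
    sets = defaultdict(list)
    for mvrf, pes in mvrf_next_hops.items():
        df_pe = max(pes, key=lambda addr: int(ipaddress.ip_address(addr)))
        sets[df_pe].append(mvrf)
    return {pe: sorted(members) for pe, members in sets.items()}
```

Because every mVRF lands in exactly one set, the sets are disjoint by construction, which is the property the loop-prevention argument relies on.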
As an option, the MVPN implementation of C-Bidir should allow specific multicast routing policy in an mVRF to be ignored, and instead make all PE's in a given MVPN choose the same next-hop PE to C-RPA. Among all candidate next-hop PE's, the single chosen upstream PE to C-RPA could be the PE with the highest IP address. This approach to C-Bidir might be desirable to customers that do not want a permanent splitting of their MVPN's into disjoint C-Bidir trees.

Note that the unicast routing policy in a VPN cannot influence VPN multicast routing from a multi-homed site. It is the nature of Bidir that the path from a source/receiver site towards the C-RPA is always the best metric unicast path and that the choice is made locally at the VPN site.

9.2 Active Group P-tunnel Announcement in C-Bidir

The (C-*, C-G) state is first created on a PE attached to C-RPA (i.e., on a DF-PE) by a (C-*, C-G) Join from a locally connected or remote C-receiver. Once (C-*, C-G) state is created, a DF-PE announces a P-tunnel for the active group C-G to all PE's in a given MVPN. If BGP is used as the P-tunnel announcement delivery mechanism, the P-tunnel for the active C-Bidir group is announced via the Group-Only S-PMSI auto-discovery route, defined in section 5.3. A PE that does not have (C-*, C-G) state when it receives a C-G P-tunnel announcement message will store this information so it can join the P-tunnel for late group members.

This procedure allows for further aggregation of C-Bidir traffic without causing traffic loops. Instead of generating one P-tunnel per C-G, one P-tunnel per DF-PE could be used for all C-groups for which the C-RPA is the active RP. Such aggregation may cause loss of bandwidth optimality by delivering the traffic to PE's that don't need it, but it will not generate loops in MVPN.
If C-S traffic starts unconditionally flowing from a VPN site towards a PE before a single (C-*, C-G) Join was received from any VPN site, this traffic will be dropped at the PE. This is because no inter-PE P-tunnel has been built yet for C-G traffic. Since there are no receivers yet for this traffic, dropping it optimizes the inter-PE behavior of C-Bidir. No C-G traffic is unnecessarily sent across the MVPN until there is at least a single receiver for C-G. This approach also has positive security implications for service providers because it prevents a coordinated attack of unconditional traffic from C-Bidir sources with no receivers for this traffic.

9.2.1 Supporting Source-Only C-Branches

PIM-Bidir supports source-only branches, i.e., branches that do not lead to any receivers, but that are used to forward packets traveling upstream from sources towards the RPA. In plain IP PIM-Bidir it is up to the implementation whether to maintain group state for source-only branches [x]. However, the procedures defined in this document require that in the MVPN context PE's on C-source-only branches maintain (C-*, C-G) state. The existence of this state indicates that a PE is on the C-Bidir tree and has to join a P-tunnel used for its traffic.

If (C-*, C-G) state was not maintained for source-only sites, a PE would not know whether or not it is on C-G's Bidir tree. The consequence of this would be that in order to deliver source-only site traffic across the provider's network, all PE's in a given MVPN would have to join the P-tunnel announced for C-G.

9.2.2 Active C-Group Announcement in C-Bidir

Announcing a P-tunnel for C-Bidir traffic only when at least one receiver already exists for this traffic might introduce a potential delay in receiving traffic from C-Bidir sources by the upcoming receivers.
Namely, when one or more C-Bidir sources start unconditionally sending traffic to a C-G group with no active membership and the receivers subsequently join the C-G, the inter-PE P-tunnel has first to be announced and built before the source traffic can be delivered to the receivers. This can be easily remedied by announcing an active C-Bidir group upon receiving unconditional source traffic with no active membership.

A PE, upon receiving unconditional source traffic for C-G with empty membership (i.e., the PE's olist for (C-*, C-G) is empty), will announce the active group C-G to its DF-PE. If the olist for (C-*, C-G) is non-empty on this PE or this PE has already received a P-tunnel announcement for C-G, the PE will not announce that C-G is active because this fact is already known in the MVPN.

When BGP is used as the delivery mechanism, a new route type has to be defined for active C-group announcements. A new route type, a Group Active auto-discovery route, is defined as follows:

   +-----------------------------------+
   |           RD (8 octets)           |
   +-----------------------------------+
   | Multicast Group Length (1 octet)  |
   +-----------------------------------+
   |    Multicast Group (Variable)     |
   +-----------------------------------+

The RD is encoded as described in [iii].

The Multicast Group field contains the C-G address or C-Generic LSP Identifier Value. If the Multicast Group field contains an IPv4 address or a C-Generic LSP Identifier Value, then the value of the Multicast Group Length field is 32. If the Multicast Group field contains an IPv6 address, then the value of the Multicast Group Length field is 128.

The new Group Active auto-discovery route type will be assigned Route Type value of 7 of the MCAST-VPN NLRI defined in [viii].
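
The rule given earlier in this section for deciding whether a PE announces an active C-group can be summarized by the following Python sketch. It is an illustration only; the function name and both arguments are hypothetical representations of state that an implementation would keep per mVRF.

```python
def should_announce_group_active(star_g_olist, groups_with_known_tunnel, c_g):
    """Decide whether to announce C-G as active to the DF-PE.

    Called when unconditional C-Bidir source traffic for c_g arrives.
    star_g_olist is the PE's olist for (C-*, C-G); groups_with_known_tunnel
    is the set of C-groups for which a P-tunnel announcement was received.
    """
    if star_g_olist:
        # Local membership exists, so the group is already known to be active.
        return False
    if c_g in groups_with_known_tunnel:
        # A P-tunnel has already been announced for this group.
        return False
    return True
```

Only the combination of an empty olist and no previously received P-tunnel announcement triggers the announcement, which keeps redundant Group Active routes out of the MVPN.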
If BGP is used as the P-tunnel announcement delivery mechanism, once the DF-PE receives a Group Active auto-discovery route for C-G, it will announce the P-tunnel to be used for C-G via the Group-Only S-PMSI auto-discovery route, defined in section 5.3.

The procedure defined in this section should not be the default behavior for handling C-Bidir traffic but it should be implemented as an option to be turned on or off per C-G in the provider's network.

9.3 Bidir C-Group Becomes Inactive

The state (C-*, C-G) is removed on a DF-PE after its olist becomes empty. Upon (C-*, C-G) state removal the DF-PE will send the P-tunnel withdrawal message for C-G. This is the P-tunnel the DF-PE announced on active C-G discovery. PE's attached to the participants of C-G, upon receiving the C-G P-tunnel withdrawal message, will remove the P-tunnel information.

9.4 P-tunnels for C-Bidir Traffic

The procedure defined in this document requires that C-Bidir traffic be carried over MP2MP P-tunnels across the provider's network, which can be built with PIM-Bidir or with MP2MP LSP's. This is because in this procedure only one P-tunnel is announced by and rooted at a DF-PE for a C-Bidir group. In fact, using MP2MP P-tunnels in the provider's network is the only scalable approach to C-Bidir.

During routing convergence or when different routing policies for C-Bidir are supported, PE's in a given MVPN might choose different upstream PE's as the best next-hops to C-RPA. Each PE attached to C-RPA announces a distinct MP2MP P-tunnel. At any given time, a PE in the MVPN joins only one P-tunnel, the one that was announced by its chosen DF-PE.

Once the MVPN converges, each set of mVRF's with the same multicast routing policy will have a single DF-PE for a C-RPA. When the option for ignoring specific multicast VRF routing policies is turned on, all PE's in the MVPN will choose the same next-hop PE to C-RPA.
A PE that joined a P-tunnel announced by a "transient" DF-PE has to join the P-tunnel announced by the converged DF-PE, and stop sending and accepting traffic on the tunnel announced by the transient DF-PE.

9.5 DF-PE Redundancy with Fast Convergence

To speed up C-Bidir convergence certain optimizations could be added to C-Bidir support. When the Bidir C-RPA is redundantly connected, a PIM join could be sent to all PE's connected to the C-RPA site, not only to the DF-PE with the best route or with the highest IP address. Each such candidate DF-PE would announce its own P-tunnel for C-G traffic. All those P-tunnels could be joined by the PE's on the C-Bidir tree, but each such PE will send and/or receive C-G traffic only over the P-tunnel announced by its current best DF-PE (or the one with the highest IP address if they are equal cost) for C-G.

This procedure introduces a notion of primary and backup P-tunnels. A P-tunnel announced by the currently active DF-PE is a primary P-tunnel. P-tunnels announced by non-active candidate DF-PE's are backup P-tunnels. In case of failure of the current DF-PE, upon the failure detection, all mVRF's with participants in C-G whose primary tunnel was the one announced by the failed DF-PE will stop sending/receiving C-G traffic over the primary P-tunnel and will start sending/receiving traffic over the backup P-tunnel. Since this alternate P-tunnel already exists, the data loss is minimized.

This is a trade-off between fast convergence and increased backbone bandwidth usage. The procedure defined in this section should be implemented as an option available to the service provider.

9.6 Using MP2MP LSP's as P-Tunnels for C-Bidir

The C-Bidir signaling procedure defined so far is based on P-tunnel announcements by DF-PE's. Announcing the MP2MP tunnel by a DF-PE allows for P-tunnel aggregation based on congruency of multicast flows.
If C-Bidir were to be supported without aggregation or with an aggregation not based on congruency of flows then a different solution for C-Bidir is possible. Instead of announcing MP2MP tunnels by the DF-PE's, such tunnels could be algorithmically derived based on C-group and DF-PE addresses. This is possible when P-tunnels are MP2MP LSP's [ix]. This is the same technique as described in sections 5.5 and 8.3 except that MP2MP rather than P2MP LSP's are being used.

The egress mVRF selects the "root" PE of the P-tunnel, which is its best next-hop PE towards C-RPA, and builds the MP2MP LSP towards this root/DF-PE. Different PE's or mVRF's may choose different upstream PE's to reach C-RPA in the same MVPN. Since the address of the root PE is also used in the MP2MP LSP identification algorithm, a distinct MP2MP LSP per root is built. At any given time, a PE sends and receives C-G traffic only on one MP2MP LSP that is rooted at the DF-PE chosen by this PE/mVRF. Hence, multiple MP2MP LSP's can simultaneously carry the same C-RPA traffic without duplication and looping of packets.

This technique allows for further aggregation of traffic without causing traffic loops. Instead of generating one MP2MP LSP per C-G, one MP2MP LSP per DF-PE could be used for all C-groups for which the C-RPA is an active RP. In this case, the C-group address should not be used when generating the MP2MP LSP identifier; the C-RPA address should be used instead. Such aggregation may cause loss of bandwidth optimality but it will not generate loops in MVPN.

10. C-Multicast Traffic Aggregation

The basic technique for providing scalability is to aggregate a number of customer multicast flows onto a single multicast distribution tree (P-tunnel) through the P routers.
The inter-PE multicast procedures defined in this document support, by definition, the following aggregation of C-multicast flows into a single P-tunnel per root PE:

- traffic from all PIM-SM C-sources discovered in an MVPN that attach to the same PE as their C-RP (root PE is each PE attached to the C-RP)

- traffic from all undiscovered PIM-SM C-sources in an MVPN (root PE is each PE attached to one or more C-RP's)

- all PIM-Bidir traffic in an MVPN (root PE is each PE attached to the C-RPA).

Such aggregation may cause loss of bandwidth optimality by delivering the traffic to PE's that don't need it, but it will not deliver duplicates to egress PE's.

The aggregation of PIM-SM traffic from C-sources that are discovered by the procedures defined in this document and that are attached to a different PE than their C-RP requires "explicit tracking" of receiver mVRF's. Explicit tracking means that the transmitting PE has to know which mVRF's need to receive which multicast streams. To assure that no duplicates are sent to receivers, a root PE can only aggregate traffic from those C-sources (attached to it) such that exactly the same mVRF's want to receive this C-source traffic from the root PE. Since the set of receiver mVRF's can dynamically change (e.g., a new mVRF can be added and "break" the congruency of existing aggregation), the aggregation of C-source traffic might need to be dynamically adjusted. However, if the identity of the transmitting PE is known and is supported by the forwarding plane, the egress mVRF can discard those packets that came from the "wrong" PE, i.e., a PE that is not the mVRF's best next-hop to the source of those packets. The ingress PE information is provided by all P2MP tunnel encapsulation techniques defined in [ii] or it can be provided by the so-called "PE label" in case of MP2MP LSPs [ii].
Knowing the identity of the root PE relaxes the requirement for perfect congruency of receivers for the discovered C-sources; however, it requires the support of upstream assigned PE labels.

Aggregating C-multicast traffic belonging to different MVPN's requires that the MVPN implementation support the upstream assigned demultiplexing label defined in [ii]. The demultiplexing label allows the egress PE's to determine the MVPN to which the packet belongs. With such aggregation, in order to avoid duplicates to receivers, the PE label identifying the transmitting PE also has to be used.

11. Supporting Source-Specific Host Reports in PIM-SM

PIM-SM [vii] permits "a receiver to join a group and specify that it only wants to receive traffic for a group if that traffic comes from a particular source. If a receiver does this, and no other receiver on the LAN requires all the traffic for the group, then the DR may omit performing a (*,G) join to set up the shared tree, and instead issue a source-specific (S,G) join only."

Such a behavior of end systems in PIM-SM means that any PE can receive Join (C-S, C-G) for a sparse mode group even if no PE has ever received Join (C-*, C-G). It also means that (as in PIM-SSM) source trees might be triggered even for sources that are not active. In the MVPN we want to prevent useless S-PMSI creation for inactive C-sources in sparse-mode groups. The procedures for this case are specified below:

- If a PE, which is not attached to C-RP, receives a (C-S, C-G) Join without a previous (C-*, C-G) Join on the same interface, and the PE previously received a P-tunnel announcement for (C-S, C-G) traffic or a P-tunnel announcement for C-G traffic, it will treat the (C-S, C-G) Join as if it were a join initiated as a result of C-RPT to C-SPT switching. This procedure has been already specified in section 5.1.1 of this document.
- If a PE receives a (C-S, C-G) Join without a previous (C-*, C-G) Join on the same interface and the PE has no P-tunnel information for C-S or C-G traffic, it will treat the (C-S, C-G) Join as if it were a (C-*, C-G) Join, provided that the interface in question does not have the C-RP for C-G behind it. The procedure for handling (C-*, C-G) Joins is already specified in section 5 of this document. This scenario implies that there was no previous (C-*, C-G) Join in the entire MVPN and that the first join in sparse group C-G received by a PE in this MVPN is a source-specific join.
- If a PE, which is not attached to C-RP, receives a (C-S, C-G) Join on an interface on which it previously received (C-*, C-G) Join, the PE ignores the (C-S, C-G) Join as already specified by the procedures defined in section 5.1 of this document.
- If a PE receives a (C-S, C-G) Join on an interface which is the PE's next hop to the C-RP, it will announce source C-S to all PE's in an MVPN. This is according to procedures already defined in this document. If there is a C-receiver behind the same interface as the C-RP, it might be the case that the (C-S, C-G) Join was requested by the C-receiver and not by the C-RP (more precisely, the (C-S, C-G) Join was sent based on the receiver's source-specific report). If this C-receiver has never requested to join (C-*, C-G) then there is no guarantee that the source C-S is active and transmitting packets. Hence, before the PE attached to C-S announces an S-PMSI for C-S, it has to make sure that C-S is active. A PE attached to a site with C-S, upon receiving the C-S announcement message from the PE attached to the C-RP, will not immediately announce the S-PMSI for (C-S, C-G) traffic. It will announce it only when the first packet is received on the (C-S, C-G) state, which is indicated by setting the (C-S, C-G) "SPTbit". This assures that the S-PMSI for (C-S, C-G) traffic is announced only if C-S is transmitting.

12. IANA Considerations

To be supplied.
13. Security Considerations

To be supplied.

14. APPENDIX: Preserving C-Multicast Traffic Patterns in MVPN

This Appendix describes the routing topologies of PIM-SM C-multicast from the provider's network view. It provides detailed analysis of multicast routing scenarios to show that the mechanisms defined in this document work correctly and do not trigger unexpected multicast states in customer's network. We decompose PIM-SM C-multicast into two scenarios where:

(1) the shortest path tree (SPT) between C-S and C-RP is via service provider network, and
(2) the shortest path tree (SPT) between C-S and C-RP is outside of service provider network.

      C-S1      C-RP    C-S2
        |         |     /
       CE1       CE2  CE2'
        |         |   /
        |         |  /
       PE1       PE2
         \       /
     Provider's Network
            |
           PE3
            |
            |
           CE3
            |
           C-R

Figure 1: Scenario (1) - Path between C-Si and C-RP via provider's network

In scenario (1), shown in Figure 1, we assume that C-RP communicates with source C-S, e.g., C-S1 and C-S2, over provider's network. As a consequence, the SPT from C-S to C-RP is built across the network. The (C-S, C-G) state is created within the site with C-S by a PIM Join issued by C-RP towards the C-S. Hence, switching at the egress PE's to SPT will not introduce new multicast states or change multicast traffic patterns within the site with C-S (or any other VPN site). In this scenario, immediate switching to SPT's at the egress PE's is transparent to the customer.

As a consequence, in scenario (1), PIM-SM C-trees can be by default automatically triggered as SPT's by all egress PE's with no inter-PE RPT-to-SPT switchover initiated by C-routers. Regardless of whether or not the traffic in customer's network switched to SPT's, inter-PE MVPN traffic is sent only on SPT's.

Note that if there is a receiver C-R of C-G at the C-RP site, it might happen that the first (C-S, C-G) Join that arrives at the PE attached to this site is from the C-R rather than from C-RP.
This does not change the C-multicast traffic flows described above.

Even if, from the provider's network perspective, C-S and C-RP are reachable via different PE's (as C-S1 and C-RP in Figure 1a) or via different interfaces on the same PE (as C-S2 and C-RP in Figure 1a), a better multicast path between the C-S and the C-RP could be engineered by a customer to be outside of provider's network.

      C-S1 ======= C-RP ==== C-S2
        |            |       /
       CE1          CE2   CE2'
        |            |     /
        |            |    /
       PE1          PE2
         \          /
      Provider's Network
             |
            PE3
             |
             |
            CE3
             |
            C-R

Figure 1a: Scenario (1a) - Path between C-S and C-RP engineered to be outside of provider's network

Figure 1a depicts this scenario. From provider's network perspective CE1 is reachable via PE1 and C-RP is reachable via PE2. Hence, from provider's perspective the reachability between C-S and C-RP is via provider's network. Yet, the SPT between C-S and C-RP has been engineered by VPN customer to be outside of provider's network (which is depicted by a double line between C-S1 and C-RP in Figure 1a). Similarly, from provider's network perspective CE2' and C-RP are both reachable via PE2. Hence, from provider's perspective the path between C-S2 and C-RP is via provider's PE router. Yet, the best path between C-S and C-RP has been engineered by VPN customer to be outside of provider's network (which is depicted by a double line between C-RP and C-S2 in Figure 1a).

Handling such topologies would complicate inter-PE C-multicast routing because it requires full C-RPT to C-SPT switching between PE's. Such scenarios are unusual and could be a result of unintentional or incomplete route advertisement by the customer. To avoid full RPT-to-SPT switching, in the scenarios depicted in Figure 1a, the C-S traffic will be kept on inter-PE C-shared trees.
Note that if there is a receiver C-R of C-G at the C-RP site and the source-tree from C-R to C-S is across provider's network while the source-tree from C-RP to C-S is engineered to be outside provider's network, then the PE attached to this site will receive (C-S, C-G) Join. In this scenario the C-S will be discovered and announced in MVPN, following the procedure described under scenario (1).

In scenario (2) customer source and customer RP are located at the same site. In this scenario, the optimal path from C-S to C-RP might not overlap with the optimal path from CE towards C-RP. Figure 2 depicts an example of such scenario. In this topology, if PE3 unconditionally switches to C-SPT, (C-S, C-G) state is created on CE1 which would not be otherwise created. If, in customer network, switching from RPT to SPT is based on a non-zero SPT-threshold then a specific source C-S traffic might never be switched to SPT if C-S rate does not reach the configured threshold. Hence, under scenario (2), to preserve PIM-SM multicast states in customer network, C-RPT to C-SPT switching cannot be initiated by provider's network.

       C-S   C-RP
         \   /
          \ /
          R-1
           |
          CE1
           |
           |
          PE1
           |
           |
     Provider's Network
           |
           |
          PE3
           |
           |
          CE3
           |
          C-R

Figure 2: Scenario (2) - Path between C-S and C-RP outside of provider's network

In scenario (2) there is no advantage to switch inter-PE traffic from C-RPT to C-SPT. Even more, it is beneficial to the customer not to switch to SPT's at all because customer's multicast traffic is already on the shortest path across provider's network. In addition, in scenario (2), if customer initiates switching to SPT for C-S traffic at a remote site (e.g., CE3 in Figure 2), this would not change the C-S traffic pattern within the site with C-S. This is because at this site the path from C-RP to C-S intersects with the path from provider's network towards C-S.
Hence, staying on inter-PE shared tree for C-S will not change the C-S traffic pattern even if customer switched to SPT for C-S at a remote site.

Based on these observations, the C-G traffic from any source C-S that is located at the same site as C-RP will be kept on the inter-PE C-shared tree, regardless of whether or not the customer network initiated the switching to SPT's.

There could be a scenario, with C-S and C-RP located at the same site, where RPT-to-SPT switchover is initiated by the customer to relieve the C-RP from carrying too much traffic. An example of such a scenario is depicted in Figure 2a. In Figure 2a it is assumed that the best path from source C-S to C-RP is directly via CR1 only and not via CE1. When a remote CE3 switches to SPT, C-S traffic does not need to flow through the C-RP. However, this requires (C-S, C-G) state to be created on CE1. In scenario (2a) a path from CE1 to C-RP does not intersect with the SPT from C-RP to C-S. Hence, when staying on the shared tree the C-S traffic cannot be "picked off" as it flows along the SPT to the C-RP. In Figure 2a, if the best path from C-RP to C-S were via CE1, the benefit of switching to SPT would be eliminated because the C-S traffic would not flow via C-RP while on the shared tree. Another benefit is that (C-S, C-G) state would not be created on CE1.

          CR1---C-S
         /      |
     C-RP       |
         \      |
           \    |
              CE1
               |
              PE1
               |
       Provider's Network
               |
              PE3
               |
              CE3
               |
              C-R

Figure 2a: Scenario (2a) - Path between C-S and C-RP outside of provider's network

It is beneficial to a VPN customer to assure that the best path from the C-RP to C-S (when they are located at the same site) intersects with the path from the provider's network towards C-S. Such topology gains all the benefits of staying on the shared-tree because C-S traffic can be "picked off" and sent towards provider's network as it flows along the SPT to the C-RP.
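The intersection condition can be made concrete with a small sketch that compares two hop lists. This is only an illustration under stated assumptions: the function name is hypothetical, and the hop lists are one reading of Figures 2 and 2a (in Figure 2a, CE1 is assumed to have a direct link to the C-RP), not data taken from a real network.

```python
from typing import List, Set


def pick_off_routers(spt_to_rp: List[str], site_exit_to_rp: List[str]) -> Set[str]:
    """Routers shared by the SPT from C-S to C-RP and the shared-tree path
    from the site's CE to C-RP, excluding the C-RP itself.

    Both arguments are hop lists that end at the C-RP. A non-empty result
    means C-S traffic can be picked off towards the provider's network
    before it reaches the C-RP.
    """
    c_rp = spt_to_rp[-1]
    if site_exit_to_rp[-1] != c_rp:
        raise ValueError("both paths must end at the same C-RP")
    return (set(spt_to_rp) & set(site_exit_to_rp)) - {c_rp}


# Figure 2 (assumed hops): C-S -> R-1 -> C-RP and CE1 -> R-1 -> C-RP.
figure_2 = pick_off_routers(["C-S", "R-1", "C-RP"], ["CE1", "R-1", "C-RP"])

# Figure 2a (assumed hops): C-S -> CR1 -> C-RP and CE1 -> C-RP.
figure_2a = pick_off_routers(["C-S", "CR1", "C-RP"], ["CE1", "C-RP"])

print(figure_2, figure_2a)
```

With these hop lists the Figure 2 topology shares R-1, so traffic can leave the site without passing through the C-RP, whereas the Figure 2a topology shares only the C-RP itself, so shared-tree traffic has to flow through the C-RP.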
We assume that staying on C-shared trees in topologies exemplified by Figure 2a has a minimal impact on the customer or that this impact can be easily eliminated by a straightforward routing or topology adjustment in the customer network. In addition, such adjustment is beneficial to the customer because it results in fewer multicast states on customer routers.

15. References

[i] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.

[ii] Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast. Work in progress.

[iii] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006.

[iv] Kim, D., Meyer, D., Kilmer, H., and D. Farinacci, "Anycast Rendevous Point (RP) mechanism using Protocol Independent Multicast (PIM) and Multicast Source Discovery Protocol (MSDP)", RFC 3446, January 2003.

[v] Farinacci, D. and Y. Cai, "Anycast-RP Using Protocol Independent Multicast (PIM)", RFC 4610, August 2006.

[vi] Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, August 2006.

[vii] Fenner, B., et al., "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, August 2006.

[viii] Aggarwal, R., Rosen, E., et al., "BGP Encoding for Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp. Work in progress.

[ix] Minei, I., Wijnands, I., et al., "Label Distribution Protocol Extensions for Point-to-Multipoint and Multipoint-to-Multipoint Label Switched Paths", draft-ietf-mpls-ldp-p2mp. Work in progress.

[x] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bi-directional Protocol Independent Multicast (Bidir-PIM)", draft-ietf-pim-bidir-09. Work in progress.

16. Acknowledgments

The author thanks Yakov Rekhter, Eric Rosen, Bill Fenner, Toerless Eckert, Ice Wijnands, and Lee Breslau for their comments and insights.

17. Author's Addresses

Maria Napierala
AT&T Labs
200 Laurel Avenue, Middletown, NJ 07748
Email: mnapierala@att.com

18. Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

19. Copyright Notice

Copyright (C) The IETF Trust (2008).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.