The last 40 years have witnessed phenomenal progress in the field of IP packet networks. Despite great advances in function and performance, the profitability and operational integrity of IP bearer networks still rest on the construction of highly reliable 'IP routes'. Unreliable networks, on the other hand, can impinge heavily on a carrier's profits. For example, an SLA penalty clause can fine an offending carrier one day's income for a 1-hour communication interruption.
Protection Switchover, the Quicker the Better?
While it is theoretically correct to assume that a quicker protection switchover restores service sooner, this does not hold when an IP bearer network is combined with an optical network: over-emphasizing the IP bearer network's switchover speed increases the number of switchovers, thus reducing network reliability.
BT, AT&T and NTT were among 7 carriers who participated in Telemark's survey addressing carriers' requirements. Telemark concluded that: "The 3 elements which carriers are most concerned about when deploying communication services are network reliability, network usability and network fault processing capabilities. The top 3 elements all belong to the reliability category." Network fault processing is the most direct and effective means of improving network service reliability since it covers rapid fault detection and service protection switchover capabilities.
Carrier-class IP bearer networks have specific requirements for network protection switchover duration. When operating delay-sensitive VoIP services, for instance, the network should ensure uninterrupted calls and avoid packet loss during protection switchover. Operational experience indicates that a protection switchover in a bearer network that lasts less than a second achieves this. Packet loss is minor and communication remains relatively unaffected.
However, in most deployments a multi-service IP bearer network is combined with optical transmission equipment, and the latter completes a protection switchover within 50ms. If the IP bearer network's protection switchover is faster than the optical transmission's, the number of route convergences and switchovers in the IP network doubles, which affects communication stability.
When a network link below the IP layer fails, both the optical transmission equipment and the IP bearer network can detect the fault. If the IP bearer network has switched to the second-best path before the optical transmission equipment completes its switchover, the IP bearer network will make a second switchover, because the original best path becomes usable again once the optical switchover completes. The IP bearer network therefore needs a slower switchover than the optical transmission network; that is, its switchover duration should be longer than 50ms.
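The double-switchover effect can be illustrated with a small model. This is only a sketch: the function name, the `ip_holdoff_ms` parameter and the assumption that the IP layer always reverts to the original best path once it is restored are illustrative and do not describe any particular router's behavior.

```python
OPTICAL_SWITCHOVER_MS = 50  # optical protection switchover time quoted in the text


def ip_switchover_count(ip_holdoff_ms: float, optical_restores: bool = True) -> int:
    """Count IP-layer path changes caused by one fault below the IP layer.

    ip_holdoff_ms is a hypothetical delay between fault detection and the
    IP layer's decision to reroute.
    """
    if optical_restores and ip_holdoff_ms > OPTICAL_SWITCHOVER_MS:
        # The optical layer repaired the link before the IP layer acted.
        return 0
    if optical_restores:
        # IP moved to the second-best path, then moved back to the restored best path.
        return 2
    # No optical protection available: IP reroutes once and stays there.
    return 1
```

With a 20ms hold-off the model reports two IP switchovers for a single fault, while a 100ms hold-off lets the optical layer absorb the fault without any IP reroute.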
If NGN is viewed in terms of operation and billing, the IP bearer network's protection switchover should be coordinated with the TDM system. Consider a scenario in which a PSTN local network provides telephone services and converts TDM voice signals to VoIP signals via the media gateway (MGW). The signals are transmitted to remote PSTN networks over a multi-service IP bearer network. If the IP bearer network fails and cannot implement protection switchover, the heartbeat detection between the local and remote MGWs and the softswitch is interrupted. The MGWs inform the connected PSTNs to disconnect calls and cease billing. Hence the IP bearer network must complete its switchover before the NGN system reacts; otherwise the NGN system will interrupt call connections because it assumes the IP network has failed and cannot implement switchover.
At present, mainstream carriers charge by the second. To prevent the NGN system from interrupting services before switchover is completed in the IP bearer network, the protection switchover duration in the IP bearer network should not exceed 500ms.
Based on bearing, operation and billing service requirements and on cooperation with other network equipment, the ideal protection switchover duration in multi-service IP bearer networks is between 50 and 500ms.
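As a quick illustration, the two bounds can be written as a simple check. The function name and the labels it returns are hypothetical, and the treatment of the exact boundary values (50ms counted as too fast, 500ms as acceptable) is an assumption drawn from the wording "longer than 50ms" and "not over 500ms".

```python
OPTICAL_SWITCHOVER_MS = 50   # lower bound: optical protection switchover time
NGN_TOLERANCE_MS = 500       # upper bound: limit before the NGN system releases calls


def classify_switchover(duration_ms: float) -> str:
    """Classify an IP bearer network protection switchover duration."""
    if duration_ms <= OPTICAL_SWITCHOVER_MS:
        return "too_fast"   # risks a second switchover once the optical layer recovers
    if duration_ms > NGN_TOLERANCE_MS:
        return "too_slow"   # risks the NGN system disconnecting calls
    return "ok"
```

For example, `classify_switchover(200)` returns `"ok"`, while durations of 30ms and 800ms fall outside the window on either side.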
Must-have Reliability Guarantee Procedures
Carrier-class IP operations are characterized by broad bandwidth, low delay, low packet loss rate and high reliability. Procedures to ensure this must exist at the equipment node, partial network and entire network levels.
Equipment node reliability
The reliability of multi-service IP bearer networks depends on the reliability of their constituent basic nodes, i.e., the network equipment.
Hot standby redundancy is used for most key components of mainstream network equipment, such as the main processing unit, switching units, power supply and cooling systems. It is a fundamental requirement for ensuring carrier-class IP bearer network reliability.
Equally, fast fault detection and switchover functions in interfaces and line cards are important for guaranteeing network performance. Traditional Ethernet interfaces do not possess special fault detection technology, as they normally bear pure data services that are insensitive to time delay. This results in a fault detection time of 1 second, which cannot meet the requirements of real-time telecom services such as VoIP. Consequently, rapid detection mechanisms such as BFD and OAM have been introduced; by interacting with the line card control part, they give each interface or link a fault detection duration of less than 50ms. At present, these mechanisms are still under standardization, and some technical details are being optimized.
During network operation, a switchover of the main processing unit might lead to service interruption even if the main processing unit uses redundant backup. This is because adjacent network equipment tears down existing connections during the switchover, which interrupts the forwarding of data packets. NSF-GR and NSR technologies were consequently introduced, since they avoid interruptions by maintaining neighbor connections and packet forwarding during the main processing unit's switchover.
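To show how a sub-50ms detection time can be reached, the sketch below uses the usual BFD relationship in which a session is declared down when no control packet arrives within the negotiated interval multiplied by the detect multiplier. The 10ms interval and multiplier of 3 are illustrative values, not figures from any specific deployment.

```python
DETECTION_BUDGET_MS = 50  # detection target quoted in the text


def bfd_detection_time_ms(interval_ms: float, detect_multiplier: int) -> float:
    """Detection time = negotiated packet interval x detect multiplier."""
    return interval_ms * detect_multiplier


# Illustrative settings: one control packet every 10ms, three misses tolerated.
detection_ms = bfd_detection_time_ms(10, 3)
meets_budget = detection_ms < DETECTION_BUDGET_MS
```

Under these assumed settings the fault is detected after 30ms, which leaves headroom inside the 50ms target.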
Partial network reliability
A multi-service IP bearer network consists of the access, convergence and core layers. By implementing partial reliability strategies in different layers, network reliability is guaranteed section by section. Unlike node-level measures, this approach can greatly improve service reliability without demanding large-scale upgrades to network equipment.
At the access layer, either a redundancy backup or a load sharing access strategy is recommended to enable dual-homing access of service system equipment, such as MGWs and CEs, to 2 PE (Provider Edge) devices. Some supporting technologies, e.g. VRRP and RSTP, can be used to enable fast protection switchover. If conditions are restricted and it is not possible to set up two PE devices for redundant access, the service systems should at least be connected to two interface boards in the same PE.
At the convergence and core layers, a dual-node redundancy backup strategy is recommended, as a backup node can guarantee non-stop service forwarding should an in-use node fail. Link congestion can also be reduced by the link redundancy strategy, which depends on port bundling technology. This increases the overall bandwidth of each bundled link and allows the ports in each bundled group to back each other up.
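The effect of port bundling on bandwidth and availability can be sketched as follows. The port names and the 1 Gbit/s member speed are hypothetical, and the model ignores how traffic is actually distributed across members.

```python
def bundle_capacity_gbps(members: dict) -> float:
    """Sum the speed of all member ports that are currently up.

    members maps a port name to a (speed_gbps, is_up) tuple.
    """
    return sum(speed for speed, is_up in members.values() if is_up)


def bundle_is_up(members: dict) -> bool:
    """The bundle keeps forwarding as long as at least one member is up."""
    return any(is_up for _, is_up in members.values())


bundle = {"ge1": (1, True), "ge2": (1, True), "ge3": (1, True)}
full_capacity = bundle_capacity_gbps(bundle)       # all three members carry traffic

bundle["ge2"] = (1, False)                         # one member port fails
degraded_capacity = bundle_capacity_gbps(bundle)   # remaining members back it up
```

In this sketch a single port failure reduces capacity from 3 to 2 Gbit/s but leaves the bundled link in service.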
For link connections at the core layer, POS interfaces can be used for full mesh connections. Each POS interface offers a fast fault detection mechanism similar to SDH, while the full mesh connection ensures that no more than one network hop is added due to traffic bypass should a link fail. Moreover, this mode can prevent core layer traffic from bypassing through the convergence or access layers, protecting the former from congestion or failure due to excessive amounts of traffic.
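The one-extra-hop property of a full mesh can be checked with a short breadth-first search. The node numbering and helper names are illustrative, and the sketch assumes a single link failure in a mesh of at least three nodes.

```python
from collections import deque
from itertools import combinations


def full_mesh(n: int) -> set:
    """Return the set of links (as frozensets of two nodes) in an n-node full mesh."""
    return {frozenset(pair) for pair in combinations(range(n), 2)}


def hops(links: set, src: int, dst: int):
    """Shortest hop count from src to dst, or None if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for link in links:
            if node in link:
                (other,) = link - {node}
                if other not in seen:
                    seen.add(other)
                    queue.append((other, dist + 1))
    return None


core = full_mesh(4)
direct = hops(core, 0, 1)                          # direct link between the two nodes
detour = hops(core - {frozenset((0, 1))}, 0, 1)    # same pair after that link fails
```

With four core nodes the direct path is one hop and the detour after the link failure is two hops, i.e. exactly one hop is added.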
In terms of the MPLS bearer mode, a popular solution is to adopt MPLS FRR to protect the nodes and links on the active LSP of the MPLS TE via the pre-duplicated MPLS tunnel. Based on the fast fault detection capability enabled by the BFD, the protection switchover duration of the MPLS FRR will not exceed 50ms.
Entire network reliability
Reliability technologies for equipment nodes and for partial networks are both designed to reduce the influence of network failures on services and to decrease the probability of service unavailability by enhancing partial reliability. Whole-network reliability, however, is ensured by end-to-end service protection, which guarantees the reliability of the whole range of services in the multi-service IP bearer network.
ECMP is typically adopted for service protection when IP packets are used to bear services. However, a problem exists with traditional IGP ECMP. Connection faults are detected by the IGP protocol, whose convergence takes seconds, which is substandard for carrier-class services. IGP FC enhancement technology was introduced to solve this, as it limits fault switchover duration to within several hundred milliseconds. In addition, various policy-based routing technologies are available to increase network-level protection switchover capability.
MPLS-borne services allow end-to-end MPLS TE and MPLS FRR technologies to be adopted for service protection. MPLS TE makes use of explicit routes, and is able to control service forwarding paths across the network according to network topology and service distribution. MPLS FRR uses a pre-set tunnel to protect the active LSP. If the active LSP fails, segmented BFD quickly detects the fault and triggers the protection tunnel of the corresponding node or link. Protection duration is under 200ms, the LSP is not deleted, and services are not interrupted. Although this solution ensures effective protection, its manual configuration and poor scalability confine it to small-scale networks.
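The way ECMP protects traffic can be sketched as a per-flow hash over the set of usable equal-cost next hops. The hash function (CRC32), the flow tuple layout and the next-hop names are assumptions for illustration; real routers use their own hashing schemes.

```python
import zlib


def ecmp_next_hop(flow: tuple, next_hops: list) -> str:
    """Pick one equal-cost next hop for a flow by hashing its 5-tuple."""
    if not next_hops:
        raise ValueError("no usable next hop")
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


# Hypothetical flow: (src IP, dst IP, protocol, src port, dst port)
flow = ("10.0.0.1", "10.0.1.1", 17, 5004, 5004)
paths = ["P1", "P2", "P3"]
chosen = ecmp_next_hop(flow, paths)

# After the fault on the chosen path is detected, it is removed from the set
# and the same flow is rehashed onto one of the surviving paths.
survivors = [p for p in paths if p != chosen]
rerouted = ecmp_next_hop(flow, survivors)
```

The sketch makes the dependency on fault detection visible: the flow only moves once the failed next hop has been removed from the set, which is why detection and convergence time dominate the switchover duration.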
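The local repair performed by MPLS FRR can be sketched as splicing a pre-set bypass tunnel into the active LSP at the failed link. The node names, the dictionary of bypass tunnels and the function name are hypothetical, and only link protection is modeled.

```python
def frr_path(primary: list, bypass: dict, failed_link=None) -> list:
    """Return the path traffic follows on the active LSP.

    primary is the ordered list of nodes on the active LSP.
    bypass maps a protected link (a, b) to a pre-set detour from a to b.
    The LSP itself is kept; only the failed link is replaced by its detour.
    """
    path = []
    for a, b in zip(primary, primary[1:]):
        if (a, b) == failed_link:
            detour = bypass.get((a, b))
            if detour is None:
                raise ValueError(f"link {a}-{b} has no protection tunnel")
            path.extend(detour[:-1])   # detour ends at b, which is added next
        else:
            path.append(a)
    path.append(primary[-1])
    return path


lsp = ["PE1", "P1", "P2", "PE2"]
tunnels = {("P1", "P2"): ["P1", "P3", "P2"]}
repaired = frr_path(lsp, tunnels, failed_link=("P1", "P2"))
```

Because every protected link needs its own pre-set detour in `tunnels`, the sketch also hints at why the approach becomes laborious as the network grows.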
Huawei's Reliability Systems: Holistic, Highly-Effective, Highly-Efficient
Huawei has long been active in researching multi-service IP bearer network reliability and has released a series of customized solutions. By establishing a range of reliability system platforms, Huawei offers high reliability guarantees for carrier-class services in the network range.
Service access reliability system
For service system equipment, Huawei recommends a layer-2 access mode and enhanced VRRP technology in order to enable service access in a single network with dual nodes.
Figure 1 MGWs accessed in the dual-plane mode
Figure 1 illustrates the access configuration of a typical MGW, whereby 2 MGWs access 2 PEs in the same equipment room by way of dual-homing. The TSR with layer-2 bridge function is used as a PE, which allows the MGWs to access the PEs in layer-2 bridge mode. As a result, PEs adopt this mode at the user access side and implement layer-3 forwarding at the network uplink side. Faults at the access side will not propagate into the network, so faults on each partial access are isolated from the whole network.
The VRRP protocol is adopted between 2 PEs to implement automatic protection switchover on layer-2 access services. As the switchover duration enabled by VRRP exceeds 3 seconds, Huawei recommends the enhanced VRRP (E-VRRP) protocol so that either BFD or OAM fast fault detection technology can cooperate with the VRRP switchover function, decreasing access system protection switchover to less than 500ms.
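The figure of more than 3 seconds follows from the standard VRRP timers, as the sketch below shows. It uses the VRRPv2 formula for the master-down interval with the protocol's default advertisement interval of 1 second and default priority of 100; the function name is illustrative.

```python
ACCESS_SWITCHOVER_TARGET_S = 0.5  # E-VRRP target quoted in the text


def vrrp_master_down_interval_s(advert_interval_s: float = 1.0,
                                priority: int = 100) -> float:
    """Time a backup router waits before taking over from a silent master.

    Master_Down_Interval = 3 x Advertisement_Interval + Skew_Time,
    where Skew_Time = (256 - Priority) / 256 seconds.
    """
    skew_time_s = (256 - priority) / 256
    return 3 * advert_interval_s + skew_time_s


plain_vrrp_s = vrrp_master_down_interval_s()
needs_fast_detection = plain_vrrp_s > ACCESS_SWITCHOVER_TARGET_S
```

With default timers the backup waits about 3.6 seconds, which is why a faster external trigger such as BFD or OAM is needed to bring the switchover under 500ms.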
As proven by commercial application, this service access mode is appropriate for carrier-class services in IP bearer networks. Access flexibility and service bearing quality are ensured, since this mode can map specific traffic to specific logical layers, optimize access according to traffic features and provide a redundancy backup function. Load sharing access is permitted for overall traffic, further enhancing operational efficiency.
In most cases, services that run between two nodes will not interfere with each other due to clear network architecture and service bearer systems. In extreme conditions, traffic can bypass via the other plane, ensuring non-stop service forwarding. Furthermore, network failures can be isolated by the layer-2 access mode, decreasing the demands placed on service system equipment such as MGW.
VPN reliability system
It is unusual that IP packets are directly used to bear services in a multi-service IP bearer network. Safe and controllable service bearing is generally the responsibility of the MPLS/BGP VPN, which encapsulates different users and services to different VPN tunnels.
Current mainstream network protection switchover schemes focus on core layer nodes and links, but not on the PE equipment located at the VPN head node. The MPLS VPN method defined in RFC2547bis specifies the use of the BGP KeepAlive packet (which has a detection time of 3+ seconds) to detect faults on PE nodes. Subsequently, the network recovers services via end-to-end route and LSP convergence. The service convergence duration is closely related to the number of internal routes on the MPLS VPN and the number of hops in the bearer network. In typical networks, the switchover duration is about 5 seconds.
To solve this problem, Huawei proposed its patented VPN FRR technology, which, as illustrated in Figure 2, improves on traditional technologies. The PE1 node selects the appropriate VPNv4 route according to the matching strategy, and both the best and second-best route information is saved in the forwarding table. If the PE2 node fails, PE1 uses multi-hop BFD and MPLS OAM technologies to sense that the outer tunnel between PE1 and PE2 is unusable. (It should be noted that the end-to-end fault detection duration in typical networks is less than 200ms.) The 'unusable' flag is then set in the corresponding LSP tunnel state table and delivered to the forwarding engine, which in turn marks the affected packets with the inner-layer label allocated by PE3 and forwards them using the information contained in the second-best route. The packets are switched to PE3 along the outer LSP tunnel between PE1 and PE3 and forwarded to MGW2, restoring service between MGW1 and MGW2.
The solution therefore enables rapid end-to-end service convergence in the event of PE2 failure. Deploying VPN FRR on the PEs at both sides of the network guarantees highly effective reliability protection for bi-directional VPN services.
Figure 2 Basic principle of the VPN FRR
Furthermore, VPN FRR ensures that the convergence duration during PE node failure is only determined by the remote PE fault detection duration and the time involved in modifying the public PSTN tunnel status in the corresponding forwarding engine. Convergence duration in this case is not dependent on the number of VPN routes. Although VPN FRR supports end-to-end PE node protection, it is only deployed on local PE nodes. The technology is completely transparent to other network equipment and will not influence the interoperability of different vendors' equipment.
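A minimal sketch of this forwarding behavior is given below. The class names, prefixes and label values are hypothetical and only illustrate the idea of keeping a best and a second-best next hop per VPN route, with the choice driven by a per-tunnel state flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NextHop:
    remote_pe: str
    inner_label: int  # inner-layer (VPN) label allocated by the remote PE


class Pe1ForwardingEngine:
    """Holds best and second-best next hops for each VPN route on PE1."""

    def __init__(self):
        self.routes = {}      # prefix -> (primary NextHop, backup NextHop)
        self.tunnel_up = {}   # remote PE -> outer tunnel state

    def add_route(self, prefix: str, primary: NextHop, backup: NextHop) -> None:
        self.routes[prefix] = (primary, backup)
        self.tunnel_up.setdefault(primary.remote_pe, True)
        self.tunnel_up.setdefault(backup.remote_pe, True)

    def mark_tunnel_down(self, remote_pe: str) -> None:
        # One state change, independent of how many routes use the tunnel.
        self.tunnel_up[remote_pe] = False

    def lookup(self, prefix: str) -> NextHop:
        primary, backup = self.routes[prefix]
        return primary if self.tunnel_up[primary.remote_pe] else backup


fib = Pe1ForwardingEngine()
for i in range(1000):
    fib.add_route(f"10.1.{i // 250}.{i % 250}/32",
                  NextHop("PE2", 1000 + i), NextHop("PE3", 2000 + i))

before = fib.lookup("10.1.0.7/32")
fib.mark_tunnel_down("PE2")          # triggered by the remote PE fault detection
after = fib.lookup("10.1.0.7/32")
```

All 1000 routes in the example move to PE3 after a single tunnel state update, which mirrors the claim that convergence does not depend on the number of VPN routes.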
End-to-end full-path reliability system
MPLS FRR is poor in scalability and is difficult to deploy. Nevertheless, other technologies cannot effectively implement end-to-end full path protection over IP bearer networks. Given this, Huawei introduced MPLS OAM tunnel protection group technology, which, by utilizing MPLS OAM packets, provides a full range of fast fault detection mechanisms for the LSP logic interface. Cooperation occurs with the pre-duplicated LSP to implement protection switchover. As the LSP is established on the service-based end-to-end bearer path, the MPLS OAM tunnel protection group technology realizes service-oriented end-to-end bearing and offers full path protection. IP bearer networks can therefore bid farewell to the time when only partial protection was enabled.
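The cooperation between OAM-based detection and a pre-established backup LSP can be sketched as a protection group that switches after a number of consecutive missed detection packets. The class name, the threshold of three misses and the non-revertive behavior are assumptions made for illustration, not details taken from Huawei's implementation.

```python
class TunnelProtectionGroup:
    """Working LSP plus a pre-established protection LSP for one service path."""

    def __init__(self, working: str, protection: str, miss_threshold: int = 3):
        self.working = working
        self.protection = protection
        self.miss_threshold = miss_threshold  # hypothetical defect threshold
        self.missed = 0
        self.active = working

    def on_probe_interval(self, probe_received: bool) -> str:
        """Call once per OAM detection interval; returns the LSP now in use."""
        if probe_received:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.miss_threshold:
                # Non-revertive in this sketch: traffic stays on the protection LSP.
                self.active = self.protection
        return self.active


group = TunnelProtectionGroup("lsp-working", "lsp-protection")
history = [group.on_probe_interval(ok) for ok in (True, False, False, False)]
```

In this sketch the switchover time is roughly the detection interval multiplied by the threshold, so the choice of OAM packet interval determines how quickly the end-to-end path is protected.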
Commercial Applications Worldwide
China Mobile, Vodafone and Etisalat are just some of the mainstream carriers who have cooperated with Huawei in the construction of highly effective multi-service IP bearer networks capable of deploying a multitude of telecom services.
NGN bearer network for China Mobile
China Mobile selected Huawei's NetEngine series high-end routers in November 2004 to construct the world's largest NGN bearer network, which covers 31 provinces, municipalities and autonomous regions in China and offers services to over 200 million subscribers. On the eve of Chinese New Year, 2006, the network handled 471,000 erlangs of traffic within one hour (19:00-20:00), while more than 40 million subscribers made long-distance toll calls via China Mobile's IP bearer network.
In tests with the protection switchover function of the optical transmission equipment disabled, China Mobile's NGN bearer network achieved an average fault restoration duration of 22.7ms, far below the designed value of 50ms. During more than 18 months of operation, the network has encountered several bursts of intermittent transmission bit errors that did not affect service operations, thus demonstrating the network's proven ability to meet the operational requirements of voice services.
CPN for Vodafone Romania
Vodafone Romania selected the NetEngine series high-end routers to construct their carrier-class multi-service IP bearer network, CPN (Common Packet Network). The CPN utilizes many cutting edge technologies such as VPN FRR, E-VRRP and the MPLS OAM tunnel protection group and has given a confirmed protection switchover duration of 50-200ms.
NGN bearer network for Etisalat
Etisalat has cooperated with Huawei to construct an NGN bearer network that integrates services such as VoIP, IPTV and enterprise VPN. On a unified IP bearer network platform, the NGN bearer network smoothly accesses equipment from various vendors. When bearing multiple types of services, the network's confirmed protection switchover duration is 50-500ms, thus meeting requirements of the various services.