The first sign of strain usually comes as a CPU overload alarm on various mobile nodes. The Network Operations Center comes alive and starts to pray that the burst is short-lived and does not exceed the maximum peak-rate capacity. If it does, all consumers are denied service access. They also pray because the hunt for the culprit is often arduous and involves applications completely out of their control, and because the problem can't easily be resolved without solid network analytics to engage app and device developers. This is why signaling was such an integral part of the Alcatel-Lucent Mobile Apps Rankings report and why LTE World 2014 devotes an entire pre-conference day to the topic.

There are three kinds of "signalasis": microbursts that can be measured in seconds, extended bursts that can last minutes to hours, and sudden sustained signaling growth, where signaling jumps significantly at one point in time and continues to increase over weeks and months. Facebook's 60% jump in November 2012 is an example of the last kind. Below are signaling burst examples - observed with our Wireless Network Guardian (WNG) - demonstrating the impact and resolution of signaling jumps ranging from 36% to 92%.

Samsung, Google and a pre-loaded app - microbursts


Figure 1: SGW signaling up by 44%, six times per day due to pre-loaded app

Then one day, one spike proved too much for the SGW: some of its blades were brought down by the overload, causing a signaling disruption and a partial service outage. After the incident, the traffic was diverted to a higher-capacity backup SGW as a temporary measure until the issue could be isolated and resolved.

Analysis with the WNG isolated the problem to Samsung S4 devices running Android versions 4.2 and 4.3, with the traffic originating from devices trying to reach Google.com. Equipped with that information, Samsung could determine that one of the pre-loaded apps on that device generated the spikes. The app connected to a Google API to determine the user's location in order to deliver local news to the consumer. It had already been removed in Android version 4.4. To address the signaling spikes, the operator thought they could simply remove the offending app on devices running Android versions 4.2 and 4.3. However, pre-loaded apps can't easily be removed. Instead, multiple updates have been pushed to test devices, trying to iteratively eliminate the behavior.


Figure 2: SGW signaling disruption due to pre-loaded app

Viber outage - extended bursts

On April 29th, CPU overload alarms reported that RNCs (Radio Network Controllers) were inundated with requests. The signaling spike could be matched to Viber: flows showed that Viber servers were no longer responding. Viber was down. But why would this app outage have such a significant impact on signaling resources? The answer is in Viber's call failure handling: the app would retry repeatedly to connect with the server, creating a larger signaling wave as more users tried unsuccessfully and repeatedly to connect.
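To illustrate why retry handling matters for signaling load, here is a minimal sketch of a commonly used alternative to immediate, repeated retries: capped exponential backoff with random jitter. It is not Viber's code or its eventual fix. The function name `connect_with_backoff`, the `try_connect` callback, and the default values for `base`, `cap`, and `max_attempts` are all assumptions chosen for illustration.

```python
import random
import time


def connect_with_backoff(try_connect, max_attempts=6, base=1.0, cap=60.0,
                         sleep=time.sleep, rng=None):
    """Call try_connect() until it succeeds or max_attempts is reached.

    Hypothetical helper: between failed attempts it waits a random time
    between 0 and min(cap, base * 2**attempt) seconds, so that many
    devices failing at the same moment do not all retry in lockstep.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        if try_connect():
            return True
        # No point in waiting after the final failed attempt.
        if attempt < max_attempts - 1:
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0.0, ceiling))
    return False


if __name__ == "__main__":
    # Simulated server that is down: every attempt fails.
    waits = []
    ok = connect_with_backoff(lambda: False, sleep=waits.append,
                              rng=random.Random(0))
    print("connected:", ok)
    print("number of waits between attempts:", len(waits))
```

With a design like this, each device spaces its attempts further apart as an outage continues, and the jitter spreads the attempts of different devices over time instead of concentrating them into one signaling wave.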


Figure 3: GGSN experiencing 92% jump in signaling for 4 ½ hours during Viber outage

The impact of Viber's outage on operator networks varied. Operators with few consumers using the Viber app would not have noticed. Others, with a high proportion of Viber users, experienced a spike. Whether their network tolerated this spike depended on whether they had enough peak-hour signaling capacity. Differences can also be attributed to the timing of the Viber outage in different geographies: an outage during peak time (of the network and of Viber usage) would be more significant.

Microsoft Exchange and iOS - microbursts

This case exemplifies other short-term outages where the signaling spike exceeded the signaling capacity on a daily basis. A 36% signaling jump occurred every day at midnight, but the reason for the spike remained mysterious. The WNG narrowed it down: the signaling was initiated by devices trying to reach the Microsoft Exchange server. This interaction lasted less than 1 second. It only involved iPhone devices and was more predominant with iOS version 6.1. Equipped with this information, operators could contact Apple and identify the root cause. A fix was issued in a subsequent iOS version update.


Figure 4: SGWs subjected to 36% signaling microburst at midnight due to iOS-based Microsoft Exchange

These signaling spike cases are proof that app signaling design is an important aspect of customer experience. They bring home three important points: the need for a robust and well-dimensioned signaling plane that can absorb sudden spikes; the need for our device and app ecosystem to design their products with optimized network interactions in mind; and the need for strong network analytics that can track the signaling of each app, detect signaling anomalies, and identify root causes quickly.
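As a rough illustration of the third point, per-app anomaly detection can be sketched as a comparison of each app's current signaling count against its own recent baseline. This is not how the WNG works internally. The function names, the use of a median baseline, and the 30% threshold are assumptions made for the example, and the sample counts are invented.

```python
from statistics import median


def percent_jump(baseline, observed):
    """Return the increase of observed over baseline, in percent."""
    return (observed - baseline) / baseline * 100.0


def flag_signaling_jumps(history, current, threshold_pct=30.0):
    """Flag apps whose signaling count jumped above their own baseline.

    history: dict mapping app name -> list of counts from earlier intervals
    current: dict mapping app name -> count in the interval being checked
    Returns a dict mapping app name -> jump in percent (rounded to 0.1).
    """
    flagged = {}
    for app, count in current.items():
        past = history.get(app)
        if not past:
            continue  # no baseline yet for this app
        baseline = median(past)
        if baseline <= 0:
            continue  # avoid dividing by zero on idle apps
        jump = percent_jump(baseline, count)
        if jump >= threshold_pct:
            flagged[app] = round(jump, 1)
    return flagged


if __name__ == "__main__":
    # Invented sample data: signaling events per interval, per app.
    history = {"mail-sync": [100, 100, 100], "news": [200, 210, 190]}
    current = {"mail-sync": 136, "news": 205}
    print(flag_signaling_jumps(history, current))
```

A median baseline is used here so that a single earlier spike does not inflate the reference level; a production system would also need to account for time-of-day patterns, which this sketch ignores.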

Our Analytics Beat studies examine a representative cross-section of mobile data customers using the Alcatel-Lucent Wireless Network Guardian and are made possible by the voluntary participation of our customers. Collectively, these customers provide mobile service to millions of subscribers worldwide.
