Resolving Inefficiencies in Call Logging for a Leading Telco

A leading telco client struggled with triaging issues across cross-functional microservices teams and with ineffective log analytics. We engineered a solution on Splunk that streamlined triaging and improved visibility into service latency. The change cut issue resolution time from multiple days to minutes and significantly improved operational efficiency across teams.

  • Data & AI

Business Issue

The client’s enterprise landscape comprised multiple upstream and downstream systems, each participating in the fulfillment of a business function. Triaging issues across cross-functional microservices teams was ineffective. Because logs were not standardized and log analytics tools such as Splunk were used poorly, the client’s teams spent significant time analyzing an issue before it could be assigned to the right team. This delay often impacted critical business activities.

Additionally, there was no unified way to track and visualize a request from the source service to the leaf node in the hierarchy (L0 to L5), and no visibility into significant latency issues in the microservices.

Solution

To address these challenges, we built triaging dashboards in Splunk, using its built-in indexing and analytics features on logs captured through a logging framework. With these dashboards, we identified high-latency downstream APIs and resolved latency issues by tracking the contribution of each service to the latency of its hierarchy.
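The idea of tracking each service's contribution to its hierarchy can be illustrated with a minimal sketch. This is not the client's implementation: the record fields (`request_id`, `service`, `parent`, `duration_ms`), the service names, and the timings are all hypothetical, and the sketch assumes each service calls its downstream services sequentially, so a service's own time is its total duration minus the time spent in its direct children.

```python
from collections import defaultdict

# Hypothetical log records; field names and values are assumptions for illustration.
SPANS = [
    {"request_id": "r1", "service": "order-api", "parent": None, "duration_ms": 900},
    {"request_id": "r1", "service": "billing", "parent": "order-api", "duration_ms": 600},
    {"request_id": "r1", "service": "inventory", "parent": "order-api", "duration_ms": 200},
    {"request_id": "r1", "service": "tax-engine", "parent": "billing", "duration_ms": 450},
]


def self_time_by_service(spans):
    """Own time per service: total duration minus time spent in direct children.

    Assumes downstream calls are sequential (no overlapping child calls).
    """
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]
    return {
        span["service"]: span["duration_ms"] - child_total[span["service"]]
        for span in spans
    }


def contribution_share(spans):
    """Percentage of the root request's latency attributable to each service."""
    root = next(span for span in spans if span["parent"] is None)
    own = self_time_by_service(spans)
    return {svc: round(100 * ms / root["duration_ms"], 1) for svc, ms in own.items()}


if __name__ == "__main__":
    for service, share in contribution_share(SPANS).items():
        print(f"{service}: {share}%")
```

In this made-up request, the leaf service accounts for half of the end-to-end latency even though it sits two levels below the entry point, which is the kind of bottleneck a per-service breakdown is meant to surface.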

We also built a dynamic sequence-of-events view, which reduced the need to maintain sequence diagrams for system design and architectural improvements.
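As a rough illustration of how a sequence of events can be derived from logs rather than maintained by hand, the sketch below orders the events of one request by timestamp and renders them as caller-to-callee steps. The event fields (`request_id`, `ts`, `source`, `target`) and the service names are hypothetical. The approach assumes every log line carries a request identifier shared across services.

```python
# Hypothetical log events; field names and values are assumptions for illustration.
EVENTS = [
    {"request_id": "r1", "ts": 3, "source": "billing", "target": "tax-engine"},
    {"request_id": "r1", "ts": 1, "source": "client", "target": "order-api"},
    {"request_id": "r2", "ts": 2, "source": "client", "target": "order-api"},
    {"request_id": "r1", "ts": 2, "source": "order-api", "target": "billing"},
]


def sequence_for(events, request_id):
    """Return the ordered caller -> callee steps recorded for one request."""
    selected = sorted(
        (event for event in events if event["request_id"] == request_id),
        key=lambda event: event["ts"],
    )
    return [f'{event["source"]} -> {event["target"]}' for event in selected]


if __name__ == "__main__":
    for step in sequence_for(EVENTS, "r1"):
        print(step)
```

Because the steps are rebuilt from live logs on every query, the rendered sequence follows the system as it actually behaves, which is what makes a hand-drawn diagram less necessary.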

Outcomes 

The work delivered drill-down capability across all the logs. Cross-functional, geographically distributed teams found it very effective and were able to identify the team responsible for resolving each issue.

Turnaround time for issue resolution fell from multiple days to minutes, and the dynamic performance metrics helped identify and resolve bottlenecks with no effort spent on identifying them.