Toronto Metropolitan University  

Project 1046 - Conversion and retention of the users

Running from 2017 to present

Cloud computing is ubiquitous: more and more companies are moving their workloads into the Cloud. However, this rise in popularity challenges Cloud service providers, as they need to monitor the quality of their ever-growing offerings effectively. The high velocity and sheer volume of Cloud operational logs make it challenging to store and analyze them for health monitoring of Cloud components. Keeping track of thousands of Cloud components and discovering anomalies is a formidable challenge, and identifying the problematic components is even harder.

For the IBM Cloud Platform, we have designed and deployed an automated monitoring system that collects, transforms, and stores component logs using Cloud-based technologies. This monitoring system uses deep neural networks to detect anomalies in multiple Platform components simultaneously and in near-real-time, and it alerts the DevOps team via Slack. The solution has been monitoring the production environment of the Console of the IBM Cloud Platform for the past year. It can detect complex anomalies in multi-dimensional time series up to 20 minutes earlier than the previous monitoring solution and has significantly lowered the false alarm rate. We further aim to help identify problematic zones of the cloud infrastructure by visualizing IBM Cloud data as heatmaps. The heatmap visualizations show how workloads differ across data centers. By playing the animation and watching how the colours vary over time, the operations team can easily find data centers with heavier workloads or identify hot zones in real-time. This helps them detect, at an early stage, the components that are most likely to behave abnormally. As a result, our solution saves the operations team time when monitoring the health of its cloud infrastructure and identifying anomalous behaviour.

Public Impact Statement:

Motivation. The Cloud computing market size is expected to reach USD 1251.1 billion in 2028, up from USD 274.8 billion in 2020, at a compound annual growth rate of 19.1% [grand2021cloud]. A Cloud service consists of a huge number of hardware and software components. If any of these components fails, the resulting outage could affect customers using the Cloud platform. Health monitoring of these components is therefore crucial.

Goal. We focused on two failure-related aspects to help DevOps teams and site reliability engineering (SRE) teams build and maintain large, robust Cloud solutions. The aspects are as follows:

  1. Detect such a failure before it affects a significant fraction of a customer base and
  2. Prevent tampering with logs to identify transparently who (provider or client) is at fault, to prevent future failures, and to identify breaches in service level agreements.

Our solutions and outcomes. To address the first item, we developed an anomaly detector based on deep neural networks that detects a system outage before it can affect a significant fraction of a customer base [pourmajidi2017challenges, pourmajidi2021challenging].

We validated our solutions on the IBM Cloud Console, a large-scale system that orchestrates all IBM Cloud services. It resides in 10 data centers distributed worldwide, with each instance of the platform executing approximately 100 interrelated components/microservices [islam2021anomaly]. Our solution processes data in near-real-time (less than two minutes from raw data arrival) and detects anomalies (i.e., signs of potential failures) up to 20 minutes earlier than the prior testing approaches, with fewer false alarms and missed problems: F1 ≈ 0.85 [islam2021arxiv, Appendix A]. We reported our lessons learned in [islam2021anomaly] so that other researchers and practitioners can benefit from our experience when implementing their solutions.
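The F1 score quoted above combines both kinds of error: false alarms lower precision, and missed problems lower recall. The following is a minimal sketch of that standard calculation. The function name and the counts are hypothetical and chosen only for illustration; they are not the evaluation data behind the reported result.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score, the harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct detections: precision and recall are both zero (or undefined).
        return 0.0
    # Precision drops as false alarms (false positives) increase.
    precision = true_positives / (true_positives + false_positives)
    # Recall drops as missed problems (false negatives) increase.
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 85 correct alerts, 15 false alarms, 15 missed anomalies.
score = f1_score(true_positives=85, false_positives=15, false_negatives=15)
print(f"F1 = {score:.2f}")
```

Because F1 is a harmonic mean, a detector cannot score well by suppressing false alarms at the cost of missing real problems, or the other way around.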

If an anomaly is spotted, SRE and DevOps teams can fix the problematic components before they affect other parts of the Console and clients. Early detection is vital for meeting Service Level Agreements (SLAs). For example, relatively high availability of 99.9% translates into 10 minutes of downtime per week, which negatively impacts the customer experience.
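The downtime figure above follows from simple arithmetic: a week has 10,080 minutes, and 0.1% of that is about 10 minutes. A small sketch of the conversion is shown here; the function name and the extra availability levels are illustrative assumptions, not values taken from any specific SLA.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a week


def weekly_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per week permitted by a given availability level."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_WEEK * unavailable_fraction


# Compare a few illustrative availability levels.
for level in (99.0, 99.9, 99.99):
    print(f"{level}% availability -> {weekly_downtime_minutes(level):.2f} min/week")
```

For 99.9% availability this gives roughly 10.08 minutes per week, which matches the figure quoted above.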

We addressed the second item by ensuring the transparency of Cloud services through the use of blockchain-based technology. We developed a solution, Logchain, that cryptographically secures service logs and stores them in a ledger [pourmajidi2018logchain]. Logs secured in this way provide a record of services' activities that is immutable. We extended Logchain to an 'as a Service' solution built on top of public [pourmajidi2019immutable, pourmajidi2021tsc] and private blockchains [pourmajidi2021tsc]. We also evaluated Logchain's scalability and financial feasibility [pourmajidi2021tsc] so that practitioners can use it. Lastly, we tested the prototype within IBM, and we plan to incorporate it into the Console's health monitoring solution.
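To illustrate the general idea of tamper-evident logs, here is a minimal hash-chain sketch. It is not the Logchain implementation: the block layout, the function names, and the sample log lines are assumptions made for this example. The text above states only that logs are cryptographically secured and stored in a ledger. In the sketch, each block stores the hash of the previous block, so changing any earlier entry invalidates the chain.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # Placeholder "previous hash" for the first block.


def block_hash(prev_hash: str, entry: str) -> str:
    """Hash a log entry together with the hash of the preceding block."""
    payload = json.dumps({"prev": prev_hash, "entry": entry}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(entries):
    """Link log entries into a list of blocks, each pointing at its predecessor."""
    chain = []
    prev = GENESIS_HASH
    for entry in entries:
        digest = block_hash(prev, entry)
        chain.append({"entry": entry, "prev": prev, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain) -> bool:
    """Recompute every hash; return False if any block or link was altered."""
    prev = GENESIS_HASH
    for block in chain:
        if block["prev"] != prev or block["hash"] != block_hash(prev, block["entry"]):
            return False
        prev = block["hash"]
    return True


# Hypothetical log lines, used only to exercise the sketch.
logs = ["service A started", "service A latency high", "service A restarted"]
chain = build_chain(logs)
print(verify_chain(chain))  # The untouched chain verifies.

chain[1]["entry"] = "service A latency normal"  # Simulate tampering with a log entry.
print(verify_chain(chain))  # Verification now fails.
```

A chain like this only detects tampering if the latest hash is held somewhere the tampering party cannot rewrite, which is the role a blockchain ledger plays in the approach described above.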

Logchain is the first scalable generic log tracking solution of its kind. It has inspired other researchers to build similar solutions for auditing transactions in areas ranging from health to legal, and our publications have received a total of 61 citations.

To conclude, we have developed two novel components for monitoring the infrastructure of IBM Cloud services using artificial neural networks and blockchain technologies. These components enhance IBM Cloud services' robustness and transparency, which in turn improves customer satisfaction.

Learn More about the Research Team.  

Explore the product that harvests these research results  

Research team:

  • PI: Prof. Andriy Miranskyy, Toronto Metropolitan University
  • Student: Lei Zhang, Toronto Metropolitan University
  • Student: Mohammad Islam, Toronto Metropolitan University
  • Student: Sarah Sohana, Toronto Metropolitan University
  • Student: William Pourmajidi, Toronto Metropolitan University
  • IBM Project Lead (RCL): John Steinbacher, IBM
  • IBM Manager (RCM): John Steinbacher, IBM
  • IBM Sponsor (RCS): John Steinbacher, IBM
  • IBM Contributor (RCC): Anthony W. Erwin, IBM

Institution:

Toronto Metropolitan University   
