Nutanix Prism dashboard

Dashboard design

Nutanix Prism: Troubleshooting Hyper-converged Infrastructure

Dashboard design for troubleshooting and monitoring server clusters in a large-scale enterprise.

Duration: 2 months

Project: Internship

Timeline: May '22 - July '22

What's the problem?

Enhancing IT monitoring with Nutanix Prism.

The goal of my internship was to enhance Nutanix’s Prism platform to provide IT administrators with a unified monitoring experience for complex, hybrid IT environments.

This involved addressing the diverse roles in IT, bridging infrastructure monitoring with business goals, and extending Prism’s capabilities to monitor all workloads and dependencies within one interface.

Deliverables included an analysis report highlighting gaps and opportunities, UX concepts for data visualization and predictive analytics, and iterative validation with internal teams, culminating in a final presentation of proposed enhancements.

The timeline

Understanding Nutanix Prism Analytics

Simplifying system administration.

Prism Analytics, internally referred to as ‘xAnalysis,’ aggregates key metrics like memory, CPU, and disk usage to provide IT administrators with a centralized view of resource utilization.

Admins can monitor performance, set up alerts, and quickly identify bottlenecks, ensuring optimized resource allocation and infrastructure health.

This feature streamlines root cause analysis by pinpointing rogue systems, enabling faster resolution of performance issues.

Typical user journey for a Root Cause Analysis

A day in the life of an IT administrator.

To understand how administrators interact with the tool, it was essential to map out a typical journey for identifying the root cause of an issue.

Take Priya Sharma, a Senior IT Administrator at a large enterprise. Her day begins by logging into Nutanix Prism to assess the health of her company’s infrastructure.

Priya’s primary goal is to proactively detect and resolve any potential issues before they escalate and disrupt business operations.

Challenges faced during this journey

Alert Fatigue

The current system generates a high volume of alerts, which can overwhelm users and create 'noise,' potentially leading to important issues being overlooked.

Limited Application Insights

xAnalysis currently lacks deeper visibility into specific applications, making it difficult for Priya to pinpoint app-level issues without involving additional team members.

Complex Workflow

For admins new to the system, the detailed customization options in xAnalysis may require additional time to learn, affecting the speed of issue resolution.

Lack of Forecasting

Without forecasting to predict when a resource, such as memory or storage, will run out, proactive management is hindered and the likelihood of unexpected outages increases.
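To make the forecasting gap concrete, here is a minimal sketch of the kind of estimate such a feature could surface. It is illustrative only and not part of xAnalysis: the function name `hours_until_full`, the hourly sampling interval, and the linear-trend assumption are all hypothetical.

```python
from typing import Optional, Sequence


def hours_until_full(usage_pct: Sequence[float],
                     capacity_pct: float = 100.0) -> Optional[float]:
    """Estimate hours until usage reaches capacity.

    Assumes one sample per hour and a linear trend (both are
    simplifying assumptions). Returns None when usage is flat or falling.
    """
    n = len(usage_pct)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage_pct) / n
    # Ordinary least-squares slope: change in usage per hour.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Solve capacity = intercept + slope * t, then measure from the latest sample.
    t_full = (capacity_pct - intercept) / slope
    return max(0.0, t_full - (n - 1))


# Hypothetical memory readings, one per hour.
estimate = hours_until_full([60, 62, 64, 66, 68])
```

A real forecasting feature would likely need seasonality and confidence intervals; a straight-line fit is only the simplest way to show the idea.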

Competitor study

Comparing the big players.

To gain deeper insights, I conducted a thorough competitor analysis of the top full-stack monitoring tools, including Datadog, Dynatrace, AppDynamics, and Sematext.

The study focused on evaluating critical features such as machine learning-based alerts, code-level visibility, and real-user monitoring to identify gaps and opportunities for improvement.

Feature | Nutanix Prism (xAnalysis) | Datadog | Dynatrace | AppDynamics | Sematext
Infrastructure Monitoring
Application Performance Monitoring (APM)
Log Management
Real User Monitoring (RUM)
Synthetic Monitoring
Machine Learning-Based Alerts
Code-Level Visibility
Container Monitoring
Predictive Analytics
Topology Mapping

Key observations

Where xAnalysis stands and the road ahead.

Nutanix Prism’s xAnalysis focuses on infrastructure monitoring but lacks advanced features such as real user monitoring (RUM), synthetic monitoring, and machine learning-based alerts. Its limited integration ecosystem further restricts its compatibility with diverse tools.

Competitors like Datadog, Dynatrace, AppDynamics, and Sematext offer more robust monitoring solutions. Datadog leads with over 400 integrations and comprehensive capabilities across APM, RUM, synthetic monitoring, and machine learning alerts. Dynatrace shines with AI-driven insights, predictive analytics, and topology mapping, ideal for complex IT environments. AppDynamics combines application monitoring and business-oriented insights, while Sematext provides flexible pricing with a balanced feature set.

For xAnalysis to compete effectively, incorporating APM, RUM, synthetic monitoring, and machine learning-based alerts is crucial. Expanding its integration capabilities would further position Nutanix Prism as a holistic, user-centered monitoring solution.

Converging avenues

Narrowing the focus to make an impact.

As my internship progressed through its first month, I realized the need to prioritize and converge on a specific solution for enhancing Nutanix Prism.

While there were multiple problems and opportunities identified, it became critical to define a focused scope to ensure meaningful progress within the limited timeframe.

Advanced Monitoring Capabilities

Customizable APM and RUM dashboards with modular components for metric selection, thresholds, and widgets could enhance usability by providing flexibility and control over performance monitoring.

Visual Representations with Topology Mapping

Topology views can offer admins a clear understanding of infrastructure dependencies, aiding in multi-point issue identification and impact assessment.

Mobile Integration for On-the-Go Monitoring

A mobile app or responsive interface for xAnalysis would enable remote monitoring, real-time alerts, and incident response, ensuring seamless infrastructure visibility and control.

Forecasting Dashboard

Forecasting capabilities with intuitive visualizations can predict resource trends, optimize performance, and aid in proactive capacity management, enhancing resource allocation efficiency.

Improved Alert Consolidation Interface

A redesigned alert interface with grouping, filtering, and customization options can streamline incident resolution, reduce noise, and prioritize critical issues effectively.

Converging on Improved Alert Consolidation

A well-received direction.

Following our mid-internship presentation, the design team praised the thorough analysis, user-centered approach, and practical recommendations. This validation strengthened our confidence in the solutions proposed.

The Alert Consolidation Interface garnered particular interest, as improved alert management was not only a frequently requested feature by customers but also aligned closely with the team’s roadmap for xAnalysis. This alignment highlighted its strategic value and potential impact.

Encouraged by the team’s feedback, we prioritized this feature for the final phase of the internship, focusing on creating a streamlined, efficient interface that could significantly enhance user experience and operational efficiency.

The Final Designs

Introducing the Problems Page.

The final designs featured a dedicated Problems page to tackle alert consolidation effectively. This page provided a centralized hub for admins to manage alerts, reducing noise and prioritizing critical issues.

Key enhancements included grouping and filtering functionalities, customizable priority settings, and an intuitive layout that helped admins quickly identify and resolve pressing problems.

The Problems page seamlessly integrated with Nutanix’s design system, ensuring consistency and aligning with the platform’s overall user experience. This design was a pivotal step in improving alert management and operational efficiency for IT administrators.

The problems page

The Problems page consolidates multiple related issues under one collective 'problem'. This reduces the number of alerts and, in turn, analysis paralysis. Each problem lists exactly which entities are affected and shows its severity. A recommendation based on the multiple underlying alerts gives the admin a solution-oriented approach to each problem. Admins can also click 'Add to Analyse' to run a deeper RCA, view charts for the clusters causing the issue, and finally mark the problem as 'Resolved' once it is fixed.
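The consolidation idea can be sketched in a few lines. This is a hypothetical model, not Nutanix's implementation: the `Alert` and `Problem` types, the severity names, and the choice to group alerts by shared metric are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed severity scale; higher rank means more severe.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    entity: str    # affected VM, host, or cluster
    metric: str    # e.g. "memory_usage"
    severity: str  # one of SEVERITY_RANK's keys


@dataclass
class Problem:
    metric: str
    severity: str = "info"
    entities: list = field(default_factory=list)
    alert_count: int = 0
    resolved: bool = False


def consolidate(alerts):
    """Group alerts that share a metric into one Problem each."""
    problems = {}
    for alert in alerts:
        problem = problems.setdefault(alert.metric, Problem(metric=alert.metric))
        problem.alert_count += 1
        if alert.entity not in problem.entities:
            problem.entities.append(alert.entity)
        # A problem is as severe as its worst alert.
        if SEVERITY_RANK[alert.severity] > SEVERITY_RANK[problem.severity]:
            problem.severity = alert.severity
    # Most severe problems first, so critical issues surface at the top.
    return sorted(problems.values(),
                  key=lambda p: SEVERITY_RANK[p.severity],
                  reverse=True)


alerts = [
    Alert("vm-1", "memory_usage", "warning"),
    Alert("vm-2", "memory_usage", "critical"),
    Alert("vm-1", "disk_latency", "warning"),
    Alert("vm-1", "memory_usage", "warning"),
]
problems = consolidate(alerts)  # four alerts collapse into two problems
```

Grouping by metric is the simplest possible key; a production system might instead correlate alerts by time window or by topology.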

Quick-add

Among the minor features I proposed, the quick-add feature for adding charts to the Analysis page was one of the best received. It suggests charts based on the 'Problems' that are currently active or were recently marked as 'Resolved'. It can also take cues from the name or description of the Analysis session.
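One way such suggestions could be ranked is shown below. This is a sketch under stated assumptions: the function `suggest_charts`, the chart identifiers, and the scoring weights are hypothetical and do not describe the shipped design.

```python
def suggest_charts(problem_metrics, session_text, available_charts):
    """Rank charts for quick-add.

    Charts tied to active or recently resolved problems score highest;
    charts whose name appears in the session name or description score lower.
    Charts with no signal are not suggested.
    """
    text = session_text.lower()
    scored = []
    for chart in available_charts:
        score = 0
        if chart in problem_metrics:
            score += 2  # linked to a current or recent problem
        if chart.replace("_", " ") in text:
            score += 1  # mentioned in the session name or description
        if score:
            scored.append((score, chart))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chart for _, chart in scored]


suggestions = suggest_charts(
    problem_metrics={"memory_usage"},
    session_text="Investigating disk latency on cluster A",
    available_charts=["cpu_usage", "disk_latency", "memory_usage"],
)
```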

Adding granularity

The current design opened a modal any time the user wanted to add a chart, which pulled focus away from the existing Analysis session. The new design lets the user add charts by simply selecting from a dropdown, making the process quicker and easier.

Critical alerts

Every alert can show the Trend or Metric that is causing the issue. One of the concepts I presented also added an 'Abnormal trend detected' warning, which tells users that the Metric does not usually take this form. This makes it easier to track what is going wrong on a Problem analysis page.
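A simple way to decide when a warning like this should appear is to compare the latest reading against the metric's own history. The sketch below uses a z-score check; the function name `is_abnormal`, the threshold of 3 standard deviations, and the sample values are assumptions for illustration, not the detection logic used in the product.

```python
from statistics import mean, stdev


def is_abnormal(history, latest, threshold=3.0):
    """Flag a reading that deviates strongly from its historical baseline.

    Uses a z-score: how many standard deviations `latest` sits from
    the mean of `history`.
    """
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is unusual.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


# Hypothetical CPU usage history (percent) for one cluster.
cpu_history = [40, 42, 41, 39, 40, 41]
spike_flagged = is_abnormal(cpu_history, 75)
normal_flagged = is_abnormal(cpu_history, 42)
```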

What did I learn from this?

Key takeaways from my first industry experience.

This internship gave me a deep understanding of enterprise IT complexities, including hybrid infrastructure monitoring and team dynamics within large organizations. Working across global time zones honed my skills in asynchronous communication and expectation management.

I also faced setbacks, which were crucial in teaching me resilience and showed me that failure is a step toward improvement and innovation.