Customer Environment Monitoring

Overview

At Pricefx, we monitor various technical aspects of each customer’s environment to ensure system reliability and performance. Due to the complexity and customization of many implementations, it’s important to clarify what is and isn’t monitored, what customers should report, and how we handle detected issues.

What is Actively Monitored vs. What Isn’t?

Actively Monitored

  • System Health: Uptime of core services, node and database connectivity.

  • Job Failures: Critical or recurring job failures with wide impact.

  • Resource Usage: CPU, memory, disk space, and pod count (threshold-based).

  • API Errors: Spikes in HTTP 5xx errors from integrations.

  • Scheduled Down/Up Times: Hibernation and scheduled maintenance windows.

  • Login/Auth Failures: Unexpected login issues or spikes in failure rate.

Not Actively Monitored

  • Customer-specific workflows/business rules (e.g., missing price lists, invalid uploads).

  • Delayed or partially completed jobs that don’t technically fail.

  • Performance degradation below alerting thresholds.

  • Incorrect configurations or poorly optimized logic (e.g., formulas, interface behavior).

Detection of these issues usually requires a ticket submission or proactive customer observation.

Note: In some cases, tailored monitoring can be implemented.

What Triggers a Flag or Alert?

Alerts are triggered based on defined thresholds and error patterns:

Alerting Conditions

  • Immediate Alerts: CPU/memory usage exceeding 85–90%, key service outages, DB replication failures.

  • Error-based Alerts: High failed background job count, API 5xx error spikes, login/auth issues.
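The immediate-alert conditions above can be sketched as a simple threshold check. This is an illustrative sketch only: the 90% CPU/memory thresholds come from the ranges stated above, while the function name, data shapes, and the 60-second replication-lag limit are assumptions for the example, not Pricefx internals.

```python
# Hypothetical sketch of threshold-based "immediate" alerting.
# Thresholds and field names are illustrative assumptions.

def evaluate_immediate_alerts(metrics: dict) -> list[str]:
    """Return alert messages for metrics breaching hard thresholds."""
    alerts = []
    if metrics.get("cpu_percent", 0) >= 90:
        alerts.append(f"CPU usage critical: {metrics['cpu_percent']}%")
    if metrics.get("memory_percent", 0) >= 90:
        alerts.append(f"Memory usage critical: {metrics['memory_percent']}%")
    for service, up in metrics.get("services", {}).items():
        if not up:
            alerts.append(f"Key service down: {service}")
    if metrics.get("db_replication_lag_s", 0) > 60:  # illustrative limit
        alerts.append("DB replication failing or lagging")
    return alerts

sample = {
    "cpu_percent": 93,
    "memory_percent": 70,
    "services": {"price-engine": True, "api-gateway": False},
    "db_replication_lag_s": 5,
}
print(evaluate_immediate_alerts(sample))
# → ['CPU usage critical: 93%', 'Key service down: api-gateway']
```

In a real deployment these checks would be expressed as Prometheus alerting rules rather than application code; the sketch only shows the evaluation logic.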

Non-alerting Issues

  • Long-running jobs that stay within SLA.

  • Subtle performance degradation.

Intermittent user-interface slowness generally does not auto-trigger alerts unless it coincides with infrastructure-level anomalies.
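The distinction between an error-based alert and a non-alerting blip comes down to comparing the current error rate against a recent baseline. The following is a minimal sketch of that idea, assuming a simple rolling-window comparison; the window, multiplier, and floor values are chosen for the example and are not documented Pricefx thresholds.

```python
# Illustrative spike detection for HTTP 5xx errors: alert only when the
# current count clearly exceeds the recent baseline. Parameter values
# are assumptions for the example.

from statistics import mean

def is_5xx_spike(history: list[int], current: int,
                 multiplier: float = 3.0, floor: int = 10) -> bool:
    """True if the current 5xx count is a spike relative to history.

    The absolute `floor` suppresses noise: a jump from 0 to 2 errors
    is not worth paging anyone for.
    """
    baseline = mean(history) if history else 0
    return current >= floor and current > baseline * multiplier

# Steady low error rate stays silent; a surge well above baseline alerts.
print(is_5xx_spike([4, 5, 3, 6], 7))    # → False
print(is_5xx_spike([4, 5, 3, 6], 40))   # → True
```

The noise-suppression behavior described later in this document (watching early-stage degradation silently until a pattern emerges) is exactly what the `floor` and `multiplier` knobs approximate here.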

Internal Process When Something Is Detected

  1. Incident Creation:
    An internal ticket is initiated by Support or TechOps, and a tiered severity is assigned.

  2. Customer Notification:

    • Yes: For likely or actual business-impacting issues.

    • No: For minor issues, issues mitigated internally before any impact, or issues still under analysis. We try to balance noise vs. value in communication.

    • Case-by-case: Early-stage communication is decided together with the SLM when customer action is needed.

What We Expect Customers to Detect & Report

Customers are responsible for raising tickets for:

  • Functional issues related to pricing logic, workflows, UI anomalies, and data discrepancies.

  • Delayed or failed jobs that are not associated with infrastructure-level alerts.

  • Unexpected behavior in Agreements & Promotions, Quoting, Rebates, etc.

  • Changes in integration behavior (if not accompanied by API-level alerts).

Why Are Some Issues Not Flagged Immediately?

Common Causes:

  • Performance drops may evolve slowly and not cross alert thresholds.

  • Complex job chain failures may go unnoticed if only a sub-component breaks (e.g., ProductPricing workflows).

  • Noise suppression: To avoid false positives, some early-stage degradation signs are monitored silently until patterns emerge.

  • Reactive process: Some issues are only triaged after customer ticket submission, especially if no alerts were triggered.

Tracking is primarily internal and includes:

  • Grafana + Prometheus: Monitoring system health, resource usage, and performance patterns.

  • SRE Dashboards: Internal detailed stats on nodes, DB activity, and usage metrics.

  • Capacity Planning Reports: Manual assessments of environment growth, pod scaling, and memory changes.

  • Salesforce + Logs: Tracking infrastructure upgrades, DNC incidents, or environment resizing.

We are expanding customer access to this data via PlatformManager (except infrastructure monitoring), though access is still limited today.