Optimizing SQL Join Queries in Pricefx for Distributed Data Sources and Datamarts

Analytics Query Optimization

In Pricefx, performance issues can arise when performing complex joins across different data structures in a distributed database environment. This article explains how to optimize SQL Join Queries between Datamarts and Data Sources to ensure efficient execution.

Distributed Query Context

Pricefx uses a distributed architecture (Citus/PostgreSQL) to handle large datasets. Performance is highest when a query can be fully distributed across all nodes. A common performance bottleneck occurs when a Datamart is distributed by a specific key (e.g., "sku") but the joined Data Source is not distributed or is distributed differently.

The Join Problem

When you use DatamartContext.newSqlQuery() and executeSqlQuery() to join a distributed Datamart with a non-distributed Data Source, the database engine may struggle to coordinate the data across nodes. This often results in a suboptimal query plan where the system cannot effectively parallelize the workload.

Optimization Strategies

To resolve performance issues in distributed environments, consider the following architectural adjustments:

Align Distribution Keys

Changing both data entities to be distributed ensures that the database can co-locate related data on the same physical nodes. When a Data Source and Datamart share the same distribution key, the join operation happens locally on each node, significantly reducing query time.

Filter Placement and Subselects

Query performance is heavily influenced by where filters are applied in the SQL structure.

Subselects – Conditions should be placed in subselects whenever possible. This allows the database to filter the dataset before performing the join, reducing the volume of data processed in memory.
Source Level Filtering – Avoid filtering at the aggregation level if the same filter would work at the source level. Filtering early in the execution pipeline prevents the system from processing records that will eventually be discarded.

Trade-offs and Considerations

While distributing all data entities improves query speed, it introduces complexity in data management:

Data Loading – Distributed Data Sources may require more specific Data Load configurations to maintain alignment.
Query Complexity – Moving conditions into subselects requires more advanced SQL construction within your Calculation Logics.

A query that appears fast on small datasets may fail to scale if the distribution setup and/or query logic are not optimized. For queries executed repeatedly on datasets that are expected to grow, it is worthwhile reaching out to the Performance Team to check whether the query plan is optimal.

If you encounter a wrongly placed filter, investigate the query definition to see if the filter can be moved closer to the data source.