Correctness issue for queries using SELECT DISTINCT
06 Feb 2024
Description
The impacted releases contain a correctness issue for queries using the SELECT DISTINCT clause. The queries may return incorrect results when the following conditions are true.
- BatchNestedLoopJoin is enabled
- JOIN clause is present in the query
- A DISTINCT clause is present in the query
- The table has range indexes
The issue does not apply to hash or merge joins.
Mitigation
Upgrade to a release with the fix.
Details
- The issue is based on how the query planner handles the optimization of DISTINCT operations in the presence of join conditions involving columns from the inner relation.
- The issue concerns a query like
SELECT DISTINCT t2.k FROM t1 JOIN t2 ON t1.k = t2.k
; where a distinct index scan is chosen on table t2 (inner relation). - Previously, the planner erroneously treated
t2.k
as a constant during the distinct index scan because the conditiont2.k = <some terms not related to t2>
made it appear constant. However, this assumption is incorrect in batch nested loop joins because a single scan on t2 can return multiple values fort2.k
matching different values fromt1.k
. - The correct behavior is to include
t2.k
in the distinct prefix (the columns used to eliminate duplicates) unless the condition is of the formt2.k = C
, whereC
is a true constant value. - In the case of a nested loop join with
foreach $1 in t1.k do; distinct index scan on t2 where t2.k = $1
, it is correct to treatt2.k
as a constant and exclude it from the distinct prefix. But for a batch nested loop join withforeach ($1, ..., $1024) in t1.k do; distinct index scan on t2 where t2.k in ($1, ..., $1024)
, it is incorrect to treatt2.k
as a constant because the scan can return multiple values fort2.k
. - The issue does not apply to hash or merge joins.