Skip to content

fix(sql): fix bug where queries using IN clauses with INT/LONG columns containing -1 return incorrect results.#6439

Merged
bluestreak01 merged 13 commits intomasterfrom
victor_longhashset_fix
Nov 25, 2025
Merged

fix(sql): fix bug where queries using IN clauses with INT/LONG columns containing -1 return incorrect results.#6439
bluestreak01 merged 13 commits intomasterfrom
victor_longhashset_fix

Conversation

@kafka1991
Copy link
Collaborator

@kafka1991 kafka1991 commented Nov 24, 2025

Fixes a bug where queries using IN clauses with INT/LONG columns containing -1 could return incorrect results.

Example

 SELECT * FROM table WHERE int_column IN (-1, 1);

Root cause: The internal HashSet implementation used -1 as a special marker value, which conflicted with actual -1 values in the data, causing incorrect query results.

close #6438

@coderabbitai
Copy link

coderabbitai bot commented Nov 24, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes introduce a new DirectLongHashSet off-heap hash set implementation for long keys with configurable memory management and load factors, replace the usage of LongHashSet with DirectLongHashSet in InLongFunctionFactory, and add comprehensive tests including a test case that reproduces the original bug involving mixed positive and negative values in IN clauses.

Changes

Cohort / File(s) Summary
New DirectLongHashSet Data Structure
core/src/main/java/io/questdb/std/DirectLongHashSet.java
Added new off-heap hash set implementation for long keys with support for zero-key handling via dedicated flag, configurable load factor and capacity, insertion/lookup/removal/clear semantics, linear probing collision handling, rehashing on capacity overflow, iteration-friendly serialization via toSink, and lifecycle methods (close, clear). Implements Closeable, Mutable, and Sinkable interfaces.
InLongFunctionFactory Refactoring
core/src/main/java/io/questdb/griffin/engine/functions/bool/InLongFunctionFactory.java
Replaced LongHashSet with DirectLongHashSet for in-value collections, introduced memory tagging and explicit RSS memory management, added try-catch around DirectLongHashSet population for proper deallocation on failure, updated helper methods to accept DirectLongHashSet, extended InLongConstFunction and InLongRuntimeConstFunction with close() and init() methods for resource management.
DirectLongHashSet Unit Tests
core/src/test/java/io/questdb/test/std/DirectLongHashSetTest.java
Added comprehensive test suite covering basic add/contains behavior, zero-key handling, size tracking, clear operations, excludes behavior, collision handling, mixed positive/negative values, rehashing/scaling, removal operations, sentinel values (Long.MIN_VALUE, Long.MAX_VALUE), and serialization via toSink.
IN Predicate Tests
core/src/test/java/io/questdb/test/griffin/InTest.java
Added testIntInWithSentinelValues test method validating IN predicates with sentinel values including normal values, null, and extreme values with parameterized binding.
Regex Replace Function Test Utility
core/src/test/java/io/questdb/test/griffin/engine/functions/regex/RegexpReplaceStrFunctionFactoryTest.java
Replaced hard-coded long SQL string in testStackOverFlowError with dynamically constructed SQL using StringSink and loop to generate 2000 "address" tokens, preserving failure expectation validation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • DirectLongHashSet memory management: Off-heap allocation, deallocation paths, rehashing logic with linear probing, and zero-key special handling require careful verification
  • InLongFunctionFactory resource lifecycle: Integration of DirectLongHashSet with factory pattern, close() and init() methods, and try-catch error handling paths
  • Collision and edge case handling: Verify rehashing triggers, boundary conditions with zero keys, mixed positive/negative values, and sentinel values
  • Test coverage validation: Ensure test cases comprehensively cover the bug scenario (#6438) involving mixed positive and negative values in IN clauses

Suggested reviewers

  • bluestreak01

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Out of Scope Changes check ❓ Inconclusive Changes to RegexpReplaceStrFunctionFactoryTest and modifications to InTest appear unrelated to the sentinel value fix for InLongFunction and DirectLongHashSet implementation. Clarify whether test modifications in RegexpReplaceStrFunctionFactoryTest and InTest are necessary for this PR or should be in separate commits.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The PR fully addresses issue #6438 by replacing LongHashSet with DirectLongHashSet using Long.MAX_VALUE as sentinel, enabling IN lists with negative values to return correct results.
Title check ✅ Passed The title accurately describes the main fix: resolving a bug in IN clause queries where -1 values in INT/LONG columns were incorrectly handled due to sentinel value conflicts.
Description check ✅ Passed The description clearly explains the bug, provides a concrete example, identifies the root cause (sentinel value conflict with -1), and references the related issue.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@kafka1991 kafka1991 changed the title fix(sql): fix LongHashSet's sentinel for InLongFunction fix(sql): fix LongHashSet's sentinel value for InLongFunction Nov 24, 2025
@kafka1991 kafka1991 force-pushed the victor_longhashset_fix branch from 543916e to fa2ef16 Compare November 24, 2025 10:02
@kafka1991
Copy link
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Nov 24, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
core/src/main/java/io/questdb/griffin/engine/functions/bool/InLongFunctionFactory.java (1)

120-121: Avoid double load‑factor scaling when constructing DirectLongHashSet

DirectLongHashSet’s ctor already scales its capacity argument by 1 / loadFactor:

this.capacity = Math.max(Numbers.ceilPow2((int) (capacity / loadFactor)), MIN_CAPACITY);

But callers here pre-scale as well:

DirectLongHashSet inVals =
    new DirectLongHashSet((int) (argCount / DirectLongHashSet.DEFAULT_LOAD_FACTOR),
                          DirectLongHashSet.DEFAULT_LOAD_FACTOR,
                          MemoryTag.NATIVE_FUNC_RSS);

this.inSet =
    new DirectLongHashSet((int) ((valueFunctions.size() - 1) / DirectLongHashSet.DEFAULT_LOAD_FACTOR),
                          DirectLongHashSet.DEFAULT_LOAD_FACTOR,
                          MemoryTag.NATIVE_FUNC_RSS);

This effectively applies the 1 / loadFactor factor twice, significantly over-allocating off-heap memory for large IN-lists.

If the ctor’s capacity parameter is intended to be “expected element count” (which matches the name), callers should pass the raw element count and let the ctor do the scaling:

- DirectLongHashSet inVals = new DirectLongHashSet(
-         (int) (argCount / DirectLongHashSet.DEFAULT_LOAD_FACTOR),
-         DirectLongHashSet.DEFAULT_LOAD_FACTOR,
-         MemoryTag.NATIVE_FUNC_RSS);
+ DirectLongHashSet inVals = new DirectLongHashSet(
+         argCount,
+         DirectLongHashSet.DEFAULT_LOAD_FACTOR,
+         MemoryTag.NATIVE_FUNC_RSS);
...
- this.inSet = new DirectLongHashSet(
-         (int) ((valueFunctions.size() - 1) / DirectLongHashSet.DEFAULT_LOAD_FACTOR),
-         DirectLongHashSet.DEFAULT_LOAD_FACTOR,
-         MemoryTag.NATIVE_FUNC_RSS);
+ this.inSet = new DirectLongHashSet(
+         valueFunctions.size() - 1,
+         DirectLongHashSet.DEFAULT_LOAD_FACTOR,
+         MemoryTag.NATIVE_FUNC_RSS);

This keeps the load factor behavior while avoiding unnecessary off-heap over-allocation.

Also applies to: 230-231

core/src/test/java/io/questdb/test/std/DirectLongHashSetTest.java (1)

220-231: Minor: avoid fully‑qualified StringSink where it’s already imported

Here you can rely on the existing import instead of the fully qualified name:

-            io.questdb.std.str.StringSink sink = new io.questdb.std.str.StringSink();
+            StringSink sink = new StringSink();

This keeps the test code a bit cleaner and consistent with the other tests using StringSink.

core/src/main/java/io/questdb/std/DirectLongHashSet.java (1)

37-70: DirectLongHashSet implementation looks correct; only minor API/robustness nits

Algorithmically this is solid:

  • Open addressing with linear probing and backward‑shift deletion (removeAt + canMoveBack) is implemented correctly.
  • Zero is handled via hasZero so it doesn’t conflict with the internal 0‑as‑empty sentinel.
  • size(), clear(), rehash(), and toSink() all behave consistently with that design, including coverage for sentinel values and ordering (as validated in the tests).

Two small nits to consider:

  1. Clarify/lock in ctor semantics

    The ctor currently interprets its capacity argument as an expected element count and scales it by 1 / loadFactor internally. It would be good to document this explicitly in the Javadoc (and keep callers from pre‑scaling), e.g., “capacity is the expected number of elements”.

  2. Optional: guard against use after close()

    Methods like add, contains, remove, and keyIndex assume memStart != 0. If someone accidentally calls them after close(), they’ll operate on addresses derived from 0. Adding a lightweight guard (e.g., if (memStart == 0) throw new IllegalStateException("closed")) would make misuse fail fast instead of doing undefined off‑heap access. This is optional but can help future refactors.

Also applies to: 83-103, 115-188, 190-257, 259-285

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 555f6d4 and fa2ef16.

📒 Files selected for processing (5)
  • core/src/main/java/io/questdb/griffin/engine/functions/bool/InLongFunctionFactory.java (6 hunks)
  • core/src/main/java/io/questdb/std/DirectLongHashSet.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/InTest.java (2 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/regex/RegexpReplaceStrFunctionFactoryTest.java (2 hunks)
  • core/src/test/java/io/questdb/test/std/DirectLongHashSetTest.java (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-19T12:21:00.062Z
Learnt from: jerrinot
Repo: questdb/questdb PR: 6413
File: core/src/test/java/io/questdb/test/cutlass/pgwire/PGJobContextTest.java:11982-12002
Timestamp: 2025-11-19T12:21:00.062Z
Learning: QuestDB Java tests use a deterministic random seed. The test utilities (e.g., io.questdb.test.tools.TestUtils and io.questdb.std.Rnd) produce reproducible sequences, so rnd_* functions (including rnd_uuid4) yield deterministic outputs across runs. Do not flag tests in core/src/test/** that assert against values produced by rnd_* as flaky due to randomness.

Applied to files:

  • core/src/test/java/io/questdb/test/griffin/InTest.java
  • core/src/test/java/io/questdb/test/std/DirectLongHashSetTest.java
📚 Learning: 2025-07-31T08:21:17.390Z
Learnt from: mtopolnik
Repo: questdb/questdb PR: 5997
File: core/src/main/java/io/questdb/griffin/engine/functions/array/DoubleArrayAccessFunctionFactory.java:127-131
Timestamp: 2025-07-31T08:21:17.390Z
Learning: In DoubleArrayAccessFunctionFactory.java, overflow validation for array index casting from long to int is performed in the validateIndexArgs() method rather than at each individual cast site. The validateIndexArgs() method is called early in newInstance() to validate all index arguments, making subsequent casts safe.

Applied to files:

  • core/src/main/java/io/questdb/griffin/engine/functions/bool/InLongFunctionFactory.java
🧬 Code graph analysis (4)
core/src/test/java/io/questdb/test/griffin/InTest.java (1)
core/src/main/java/io/questdb/std/Numbers.java (1)
  • Numbers (44-3523)
core/src/main/java/io/questdb/std/DirectLongHashSet.java (5)
core/src/main/java/io/questdb/std/MemoryTag.java (1)
  • MemoryTag (27-184)
core/src/main/java/io/questdb/std/Numbers.java (1)
  • Numbers (44-3523)
core/src/main/java/io/questdb/std/Unsafe.java (1)
  • Unsafe (40-542)
core/src/main/java/io/questdb/std/Vect.java (1)
  • Vect (27-501)
core/src/main/java/io/questdb/std/Hash.java (1)
  • Hash (30-229)
core/src/test/java/io/questdb/test/griffin/engine/functions/regex/RegexpReplaceStrFunctionFactoryTest.java (1)
core/src/main/java/io/questdb/std/str/StringSink.java (1)
  • StringSink (31-194)
core/src/test/java/io/questdb/test/std/DirectLongHashSetTest.java (2)
core/src/main/java/io/questdb/std/DirectLongHashSet.java (1)
  • DirectLongHashSet (36-286)
core/src/main/java/io/questdb/std/Rnd.java (1)
  • Rnd (39-481)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: New pull request (Check Changes Check changes)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (4)
core/src/test/java/io/questdb/test/griffin/engine/functions/regex/RegexpReplaceStrFunctionFactoryTest.java (1)

90-100: Dynamic SQL construction via StringSink is fine

The looped construction preserves the original SQL structure while avoiding a massive literal, and the comma/infix logic (i > 0) looks correct. Initial capacity 8192 is just a hint given StringSink’s growth semantics, so no risk there.

core/src/test/java/io/questdb/test/griffin/InTest.java (1)

27-29: Sentinel/null IN regression test is well-targeted

This test covers INT IN-lists with negative values, zero, and both literal null and Numbers.INT_NULL via bind variables, plus asserts the execution plan shows a sorted, null-aware in-set. That’s exactly the scenario that previously failed and should guard the new DirectLongHashSet-based implementation going forward. (Deterministic rnd_* usage is fine here.)

Also applies to: 246-305

core/src/main/java/io/questdb/griffin/engine/functions/bool/InLongFunctionFactory.java (1)

38-42: Off-heap set usage and lifecycle wiring look correct

Using DirectLongHashSet removes the sentinel collision issue and the ownership model is clear:

  • Const case: the set is built once, freed on failure in newInstance, and later freed in InLongConstFunction.close() via Misc.free(inSet).
  • Runtime-const case: the set is allocated in the ctor, repopulated in init() via clear() + parseToLong, and freed in close().

getBool() and toPlan() simply delegate to inSet.contains() / inSet.toSink(), so INT/LONG/null value semantics are now driven by the set rather than magic sentinels.

No correctness issues spotted in these paths.

Also applies to: 120-127, 183-217, 220-255

core/src/test/java/io/questdb/test/std/DirectLongHashSetTest.java (1)

35-272: Good behavioral and edge‑case coverage for DirectLongHashSet

These tests exercise all the important aspects of the new set: duplicates, zero handling, positive/negative/sentinel values, rehashing, removal (including collision cases), and toSink ordering. The use of try‑with‑resources keeps off‑heap memory safe.

@puzpuzpuz puzpuzpuz added Bug Incorrect or unexpected behavior SQL Issues or changes relating to SQL execution labels Nov 24, 2025
@bluestreak01
Copy link
Member

hey @kafka1991 could you update PR description to make it use-facing. "fix" will be included in public release notes

@UtkarshDashora
Copy link

✅ Passed checks (3 passed)

@kafka1991 kafka1991 changed the title fix(sql): fix LongHashSet's sentinel value for InLongFunction fix(sql): fix bug where queries using IN clauses with INT columns containing -1 return incorrect results. Nov 25, 2025
@kafka1991 kafka1991 changed the title fix(sql): fix bug where queries using IN clauses with INT columns containing -1 return incorrect results. fix(sql): fix bug where queries using IN clauses with INT/LONG columns containing -1 return incorrect results. Nov 25, 2025
@kafka1991
Copy link
Collaborator Author

hey @kafka1991 could you update PR description to make it use-facing. "fix" will be included in public release notes

fixed

@glasstiger
Copy link
Contributor

[PR Coverage check]

😍 pass : 115 / 122 (94.26%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/functions/bool/InLongFunctionFactory.java 11 14 78.57%
🔵 io/questdb/std/DirectLongHashSet.java 104 108 96.30%

@bluestreak01 bluestreak01 merged commit 9968b7c into master Nov 25, 2025
41 checks passed
@bluestreak01 bluestreak01 deleted the victor_longhashset_fix branch November 25, 2025 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Incorrect or unexpected behavior SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Query returns incorrect results when INT column IN list contains mixed positive and negative numbers

5 participants