This release introduces automatic object store WAL cleanup on the primary node, TLS certificate expiration metrics for Prometheus alerting, and new array_sort() and array_reverse() functions. Performance improvements include faster ASOF and WINDOW JOINs through binary search–based frame positioning, along with a configurable WAL writer madvise mode. Several critical bug fixes address crashes in LATEST BY ALL queries, Parquet reads with missing statistics, and backup restore edge cases, while the ACL permission system has been expanded to support up to 256 permissions.
New Features
This feature introduces a WAL cleaner that runs on the primary node and automatically deletes replicated WAL data from object storage once it is no longer needed by any replica or backup. It determines what is safe to delete by consulting two sources of cleanup history (enterprise backup manifests and checkpoint history records) and always retains enough data to support the most recent N backups or checkpoints. The cleaner is conservative by default: it won't delete anything until sufficient history exists, and it picks the most conservative boundary when multiple sources or cluster nodes are involved. Key components include a checkpoint history tracker that records per-table transaction state to the shared replication object store on each CHECKPOINT RELEASE, a backup instance name registry for coordinating cleanup boundaries across multiple nodes, rate limiting and throttling for object store delete operations with auto-tuned defaults per cloud provider (S3, GCS, Azure Blob, R2, etc.), and crash recovery with periodic progress persistence so cleanup resumes where it left off after a restart. Dropped tables are cleaned up after a cooloff period (default 1h) to guard against clock skew. Key configuration properties include replication.primary.cleaner.enabled (default true), replication.primary.cleaner.interval (default 10m), replication.primary.cleaner.backup.window.count (default 5), replication.primary.cleaner.delete.concurrency (auto-tuned 4–12), replication.primary.cleaner.max.requests.per.second (service-dependent), and checkpoint.history.enabled (default true when replication is enabled).

This feature adds Prometheus gauge metrics for TLS certificate time-to-live (TTL) across all four TLS-enabled endpoints: questdb_tls_cert_ttl_seconds_http, questdb_tls_cert_ttl_seconds_http_min, questdb_tls_cert_ttl_seconds_line, and questdb_tls_cert_ttl_seconds_pg. Each gauge reports seconds until the active certificate expires. Values greater than 0 indicate seconds remaining, 0 means expired, and -1 means the certificate has not been loaded or could not be parsed. Gauges are only registered for endpoints where TLS is enabled. The TTL is computed from the certificate's notAfter field, which is extracted via a JNI call into a minimal DER/X.509 parser on the Rust side. The expiration epoch is cached and updated on reload_tls(), so the metric always reflects the active in-memory certificate, not the one on disk.

This feature adds array_sort(DOUBLE[]) and array_reverse(DOUBLE[]) scalar functions that operate on double arrays of any dimensionality. array_sort() sorts each innermost-dimension slice independently, preserving the array's shape, and accepts optional boolean arguments for descending order and nulls-first placement. array_reverse() reverses each innermost-dimension slice. Both functions handle NULL arrays, empty arrays, NaN values, and multidimensional inputs. They support both contiguous unit-stride and non-vanilla array layouts via separate code paths. The sort buffer grows on demand and stays at peak size for the cursor's lifetime to avoid allocation churn on the hot path.

This feature adds minTimestamp and maxTimestamp TIMESTAMP columns to sys.telemetry_wal to capture the data timestamp range per WAL transaction event. WAL telemetry is now enabled by default regardless of the main telemetry setting, reducing the dependency on logs to investigate data writing shape. The WalWriter commit log message has been downgraded from info to debug level unless the commit has a replace range. Schema migration support has been added so that when the column count mismatches the expected schema, the table is dropped and recreated, which is safe given the 1-week TTL.
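To make the array_sort() semantics described earlier in this section concrete, here is a minimal Java sketch of sorting each innermost slice of a two-dimensional array independently while keeping its shape. It is not the QuestDB implementation: the class and method names are hypothetical, the descending and nulls-first flags are modeled as plain booleans, and treating NaN as the null element is an assumption made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the QuestDB implementation.
final class ArraySortSketch {

    // Sorts every innermost slice (row) independently, so the outer shape is unchanged.
    static double[][] sortInnermost(double[][] input, boolean descending, boolean nullsFirst) {
        double[][] out = new double[input.length][];
        for (int i = 0; i < input.length; i++) {
            out[i] = sortSlice(input[i], descending, nullsFirst);
        }
        return out;
    }

    // Sorts one slice. Assumption: NaN stands in for a null element and is
    // placed at the front or the back depending on nullsFirst.
    static double[] sortSlice(double[] slice, boolean descending, boolean nullsFirst) {
        double[] values = new double[slice.length];
        int n = 0;
        for (double v : slice) {
            if (!Double.isNaN(v)) {
                values[n++] = v; // collect the non-null values
            }
        }
        Arrays.sort(values, 0, n); // ascending sort of the non-null prefix
        if (descending) {
            reverse(values, 0, n);
        }
        double[] result = new double[slice.length];
        Arrays.fill(result, Double.NaN); // slots not overwritten below stay "null"
        System.arraycopy(values, 0, result, nullsFirst ? slice.length - n : 0, n);
        return result;
    }

    // Reverses a[lo..hi) in place.
    private static void reverse(double[] a, int lo, int hi) {
        for (int i = lo, j = hi - 1; i < j; i++, j--) {
            double t = a[i];
            a[i] = a[j];
            a[j] = t;
        }
    }
}
```

Under these assumptions, each row of the input is ordered on its own and an empty slice comes back empty. Higher-dimensional inputs would apply the same per-slice step to every innermost slice.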
Improvements
This improvement mirrors the optimization already present in HORIZON JOIN. Without it, the first lookup linearly scans through all slave time frames preceding the master's first timestamp, which is O(N) in the number of frames. With the seekEstimate() optimization, the initial positioning is O(log P) where P is the number of partitions. Specifically, AbstractAsOfJoinFastRecordCursor.openSlaveFrame() now calls seekEstimate() on the first slave frame lookup to binary-search directly to the target partition instead of linearly scanning all preceding frames, benefiting all ASOF JOIN and LT JOIN fast-path factories. WindowJoinTimeFrameHelper.findRowLo() and findRowLoWithPrevailing() also now call seekEstimate() on the first lookup with the same partition-skipping behavior, benefiting both sync and async WINDOW JOIN factories.

WalWriter previously hardcoded POSIX_MADV_RANDOM for memory-mapped column files, which hurts most workloads. This improvement makes the madvise hint opt-in via a new configuration property cairo.wal.writer.madvise.mode with valid values: none (default, no hint), sequential, and random. The random mode is beneficial when ingesting into many tables with many columns, as it prevents the OS from speculatively reading adjacent pages under memory pressure. Example configuration:

cairo.wal.writer.madvise.mode=random

This improvement refactors the ACL permission system to support more than 64 permissions by migrating from 64-bit bitmasks to an exponent-based representation with 256-bit aggregate masks. Permission constants changed from long bitmasks to int exponents, and a new PermissionMask class provides 256-bit storage (4 longs) for aggregate permission sets. The permission column type in the database schema changed from long to short (storing exponents instead of bitmasks), reducing storage overhead while supporting up to 256 distinct permissions. PermissionMask.ZERO is now immutable and throws on mutation attempts, and the sentinel value handling for ALL_PERMISSIONS is properly supported across has, set, and clear operations.
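The exponent-based representation can be sketched as follows: a permission is an int exponent in the range 0 to 255, and an aggregate set is four longs addressed by word index and bit offset. This is an illustrative sketch only. PermissionMaskSketch is a hypothetical class, not the actual PermissionMask, and it leaves out the immutable ZERO instance and the ALL_PERMISSIONS sentinel handling mentioned above.

```java
// Hypothetical illustration of a 256-bit permission set; not QuestDB's PermissionMask.
final class PermissionMaskSketch {
    private static final int MAX_PERMISSIONS = 256;

    // 4 x 64 bits = 256 bits of storage.
    private final long[] words = new long[4];

    // Grants the permission identified by its exponent (bit position).
    void set(int exponent) {
        check(exponent);
        words[exponent >>> 6] |= 1L << (exponent & 63);
    }

    // Revokes the permission identified by its exponent.
    void clear(int exponent) {
        check(exponent);
        words[exponent >>> 6] &= ~(1L << (exponent & 63));
    }

    // Returns true when the permission bit is set.
    boolean has(int exponent) {
        check(exponent);
        return (words[exponent >>> 6] & (1L << (exponent & 63))) != 0;
    }

    private static void check(int exponent) {
        if (exponent < 0 || exponent >= MAX_PERMISSIONS) {
            throw new IllegalArgumentException("permission exponent out of range: " + exponent);
        }
    }
}
```

Storing the exponent rather than the bitmask is what lets a single permission fit in a short column while the aggregate set still covers all 256 positions.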
Bug Fixes
This fix corrects Numbers.MAX_SAFE_INT_POW_2 from 1L << 31 to 1L << 30. The old value (2^31) does not fit in a signed 32-bit int, so the rehash overflow guard let exactly 2^31 through. The subsequent (int) cast produced Integer.MIN_VALUE, and clear() fed approximately 18 EB to native memset, causing a SIGSEGV. The crash chain occurred when a LATEST BY ALL query on a large table filled an OrderedMap until keyCapacity reached 2^30, then rehash() doubled to newKeyCapacity = 1L << 31, which truncated to a negative value and passed an enormous size to native memset. The fix makes the guard reject newKeyCapacity = 2^31, throwing a clean CairoException("map capacity overflow") instead of crashing the JVM. The constant was also deduplicated from Unordered4Map, Unordered8Map, and UnorderedVarcharMap, each of which had a private copy with the same bug.

The cairo.partition.encoder.parquet.statistics.enabled configuration allows users to disable Parquet statistics, but the read path (ParquetTimestampFinder, TableWriter) and the O3 merge path (O3PartitionJob.processParquetPartition) hard-depended on timestamp column statistics. When statistics were absent, getMinValueLong would hit an assertion crash with -ea or read garbage memory causing silent data corruption without -ea. This fix removes that hard dependency by adding rowGroupMinTimestamp and rowGroupMaxTimestamp methods to PartitionDecoder that try Parquet column statistics first at zero cost, then fall back to decoding the first/last row from actual data pages when statistics are absent. The findRowGroupByTimestamp method also falls back to decoding instead of reading garbage memory, and O3PartitionJob, ParquetTimestampFinder, and TableWriter have been migrated to use the new methods.

During backup, partitions with row_count=0 may not produce meta.msgpack. The restore process previously always attempted to download this file when hash verification was enabled, causing restores to fail with "no partition metadata found" for empty partitions. This fix skips downloading meta.msgpack for empty partitions during restore and skips hash verification in that case, while still requiring metadata for non-empty partitions.

A primary instance restored from a backup could encounter a race condition between the WalPurgeJob and dropped table request processing, where the uploader could not open the _txnlog because it had already been deleted. This fix ensures that the WalPurgeJob does not delete state that the uploader still needs. The bug was unique to backups since backups do not restore .pending files used to control this workflow. At start-up, the adjusted WalUploader's replication logic now ensures that the appropriate .pending file is recreated when missing, which also patches instances that were already restored with older versions of QuestDB Enterprise.

This fix introduces three new ACL permissions: ALTER SYMBOL CAPACITY (column-level), SET REFRESH LIMIT (table-level), and SET REFRESH TYPE (table-level). ALTER SYMBOL CAPACITY replaces the incorrect reuse of ALTER COLUMN TYPE for symbol capacity changes. A startup migration automatically grants ALTER SYMBOL CAPACITY to every entity that previously had ALTER COLUMN TYPE, at the same scope (database, table, or column level), preserving grant options. SET REFRESH LIMIT and SET REFRESH TYPE gate the previously unprotected ALTER MATERIALIZED VIEW ... SET REFRESH LIMIT/IMMEDIATE/MANUAL/EVERY/PERIOD operations. This fix also wires in the previously commented-out authorization check for ALTER TABLE SET PARAM and adds a retry loop with recompile in access list reloading to handle table reference out-of-date exceptions during ACL reload.
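The integer overflow behind the MAX_SAFE_INT_POW_2 fix at the start of this section can be reproduced in a few lines of Java. The sketch below is an assumption-laden illustration rather than the actual map code: the class and method names are hypothetical, the guard is assumed to be a simple greater-than comparison against the limit, and IllegalStateException stands in for CairoException so the example is self-contained.

```java
// Hypothetical illustration of the rehash overflow guard; not the actual OrderedMap code.
final class RehashOverflowSketch {
    static final long OLD_MAX_SAFE_INT_POW_2 = 1L << 31; // does not fit in a signed int
    static final long NEW_MAX_SAFE_INT_POW_2 = 1L << 30; // largest power of two that does

    // Doubles the capacity and narrows it to int, guarded by the given limit.
    // Assumption: the guard rejects only values strictly greater than the limit.
    static int nextKeyCapacity(int keyCapacity, long maxSafePow2) {
        long newKeyCapacity = (long) keyCapacity << 1;
        if (newKeyCapacity > maxSafePow2) {
            throw new IllegalStateException("map capacity overflow");
        }
        // With the old limit, 2^31 reaches this cast and wraps to Integer.MIN_VALUE.
        return (int) newKeyCapacity;
    }
}
```

With the old limit, doubling a capacity of 2^30 passes the guard and the narrowing cast yields Integer.MIN_VALUE, the negative size that eventually reached native memset. With the corrected limit, the same call throws before any cast takes place.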