zlacker

[parent] [thread] 2 comments
1. ruhith+(OP)[view] [source] 2026-02-04 04:50:35
The DuckDB-as-a-storage-engine approach is clever because it lets you keep your existing MySQL connections, tooling, and replication topology while routing analytical queries to a columnar engine underneath. That's a much easier sell operationally than standing up a separate analytics database and building a sync pipeline. The real question is how they handle consistency between the InnoDB and DuckDB copies of the same data, that's where every hybrid OLTP/OLAP system either shines or quietly loses rows.
replies(2): >>baotia+cg >>Captai+1j
2. baotia+cg[view] [source] 2026-02-04 07:25:33
>>ruhith+(OP)
On this page, we introduce how to implement a read-only Columnar Store (DuckDB) node leveraging the MySQL binlog mechanism. https://github.com/alibaba/AliSQL/blob/master/wiki/duckdb/du... In this implementation, we have performed extensive optimizations for binlog batch transmission, write operations, and more.
3. Captai+1j[view] [source] 2026-02-04 07:49:21
>>ruhith+(OP)
Nice question! We did spend a lot of time considering the issue of data consistency.

In the MySQL replication, GTID is crucial for ensuring that no transaction is missed or replayed repeatedly. We handle this in two scenarios (depending on whether binlog is enabled):

    - log_bin is OFF: We ensure that transaction in DuckDB are committed before the GTID is written to disk (in the mysql.gtid_executed table). Furthermore, after a crash recovery, we perform idempotent writes to DuckDB for a period of time (the principle is similar to upsert or delete+insert). Therefore, at any given moment after a crash recovery, we can guarantee that the data in DuckDB is consistent with the primary database.
    - log_bin is ON: Unlike the previous scenario, we no longer rely on the `mysql.gtid_executed` table; we directly use the Binlog for GTID persistence. However, a new problem arises: Binlog persistence occurs before the Storage Engine commits. Therefore, we created a table in DuckDB to record the valid Binlog position. If the DuckDB transaction fails to commit, the Binlog will be truncated to the last valid position. This ensures that the data in DuckDB is consistent with the contents of the Binlog.
Therefore, if the `gtid_executed` on the replica server matches that of the primary database, then the data in DuckDB will also be consistent with the primary database.
[go to top]