zlacker

[parent] [thread] 2 comments
1. willva+(OP)[view] [source] 2024-08-27 06:31:17
Something I anticipate is smarter storage that can do some filtering via pushed-down predicates. There's compute on the storage nodes that is being wasted today.
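The idea can be sketched as a storage node that evaluates the filter locally and ships only matching rows, instead of returning whole blocks for the compute tier to filter. A minimal illustration (the `StorageNode`/`scan` names are made up, not any real system's API):

```python
# Hypothetical sketch of predicate pushdown: the storage node applies the
# filter before any bytes cross the network, so the compute tier only
# receives matching rows. StorageNode and scan are illustrative names.
from typing import Callable, Iterator, Optional

Row = dict

class StorageNode:
    """Holds rows locally; can evaluate a pushed-down predicate itself."""

    def __init__(self, rows: list[Row]):
        self.rows = rows

    def scan(self, predicate: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
        # Without pushdown, every row would be shipped to the caller.
        # With pushdown, filtering happens here, next to the data.
        for row in self.rows:
            if predicate is None or predicate(row):
                yield row

node = StorageNode([
    {"user": "a", "bytes": 120},
    {"user": "b", "bytes": 5000},
    {"user": "c", "bytes": 90},
])

# Pushed-down predicate: only rows with bytes < 1000 leave the storage node.
small = list(node.scan(lambda r: r["bytes"] < 1000))
```

The win is proportional to selectivity: the less selective the filter, the less there is to gain from pushing it down.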

I was kinda expecting BigQuery to do this under the hood, but it seems like they don't, which is a shame. BigQuery isn't faster than, say, Trino on GCS, even though Google could do some major optimisations here.

replies(2): >>okr+i7 >>levent+6Ua
2. okr+i7[view] [source] 2024-08-27 08:13:29
>>willva+(OP)
I also wonder if Athena does this on AWS. Parquet supports pushdown. But I would suspect pushdown predicates mean the file storage unit has to have some logic to execute custom code, bringing the code back to the data. The promise of Spark, once. It would be a huge win, definitely. Hmmm.

But it also opens up a threat vector. And you have competing users running their predicates, so one has to think about queues and pipelining and so on. Probably solvable too, just like on any multiuser system.

Interesting.

3. levent+6Ua[view] [source] 2024-08-31 06:32:58
>>willva+(OP)
BigQuery Storage Read API claims to support filters and simple projections pushed down to the storage: https://cloud.google.com/bigquery/docs/reference/storage. See also this recent paper: https://research.google/pubs/biglake-bigquerys-evolution-tow...
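For reference, the pushdown surface in the Storage Read API is the read session's `TableReadOptions`, which carries a `row_restriction` (a SQL-like filter evaluated at the storage layer) and `selected_fields` (a column projection). A sketch of the request shape, using a plain dict rather than the `google-cloud-bigquery-storage` client so it stays self-contained; the table path and filter values are made-up examples:

```python
# Sketch of the pushdown-relevant part of a BigQuery Storage Read API
# read session. The field names (row_restriction, selected_fields) come
# from the documented TableReadOptions message; the table path and the
# filter string below are invented for illustration. Real usage goes
# through the google-cloud-bigquery-storage client's create_read_session.
def build_read_session(table: str, columns: list, filter_sql: str) -> dict:
    """Assemble a request body that pushes projection + filter to storage."""
    return {
        "table": table,
        "data_format": "ARROW",
        "read_options": {
            # Projection pushdown: storage only materialises these columns.
            "selected_fields": columns,
            # Filter pushdown: storage evaluates this before streaming rows.
            "row_restriction": filter_sql,
        },
    }

session = build_read_session(
    "projects/my-project/datasets/my_dataset/tables/events",
    ["user_id", "event_ts"],
    'country = "NL" AND event_ts > "2024-01-01"',
)
```

So at least through this API, both the filter and the projection are applied before rows are streamed back, which is exactly the kind of storage-side compute the parent comment is asking about.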

I've also recently proposed a Table Read protocol that should be a "non-vendor-controlled" equivalent of BigQuery Storage APIs: https://engineeringideas.substack.com/p/table-transfer-proto...
