
[return to "Pipe Syntax in SQL"]
1. mocamo+WB 2024-08-24 20:27:19
>>legran+(OP)
Question for people writing highly complex SQL queries.

Why not write simple SQL queries and use another language to do the transformations?

Are SQL engines really more efficient at filtering, matching, and aggregating data in complex queries? Doesn't working without reusable blocks, tests, or logs make development harder?

Syntax is one thing, but isn't actual performance (and safety and maintainability) another matter?
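
To make the question concrete, here is a minimal sketch of the two approaches using Python's built-in sqlite3 module. The `orders` table, its columns, and the function names are assumptions made up for illustration; nothing here comes from the thread. Both functions return the same totals, but the first lets the engine filter and aggregate, while the second pulls every row into the application first.

```python
import sqlite3
from collections import defaultdict


def build_db():
    # Hypothetical in-memory table, invented for this example.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("eu", 10), ("us", 25), ("eu", 5), ("us", 40), ("apac", 7)],
    )
    return conn


def totals_in_sql(conn, min_amount):
    # The engine filters and aggregates; only one row per region comes back.
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM orders "
        "WHERE amount >= ? GROUP BY region",
        (min_amount,),
    )
    return dict(cursor.fetchall())


def totals_in_python(conn, min_amount):
    # A "simple" query: every row is transferred, then filtered and
    # aggregated in application code.
    totals = defaultdict(int)
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        if amount >= min_amount:
            totals[region] += amount
    return dict(totals)


conn = build_db()
print(totals_in_sql(conn, 10))
print(totals_in_python(conn, 10))
```

On five rows the difference is invisible. The trade-off the question asks about only shows up when the table is large or lives on another machine, because the second version has to move every row before it can discard any.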

◧◩
2. sagarm+ZF 2024-08-24 21:03:16
>>mocamo+WB
I've worked on a few SQL systems used for analytics and ETL.

My users fell into (for the purposes of this discussion) three categories:

1. Analysts who prefer sheets

2. Data scientists who prefer pandas

3. Engineers who prefer C++/Java/JavaScript/Python

I'm fairly sure SQL isn't the first choice for any of them, but in all three cases a modern vectorized SQL engine will be the fastest option for expressing and executing many analysis and ETL tasks, especially when the datasets don't fit on a single machine. It's also easier to provide a shared pool of compute to run SQL than arbitrary code, especially with low latency.

Even as a query engine developer, I would prefer using a SQL engine. Implementing by hand even the basic optimizations a modern engine performs -- columnar execution, predicate pushdown, pre-aggregation for shuffles, etc. -- would be at least a week of work for me. A bit less if I built up a large library to assist.
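
Two of the optimizations named above, predicate pushdown and pre-aggregation for shuffles, can be sketched in plain Python. This is a toy model under stated assumptions: "partitions" are just lists of (key, value) pairs, the "shuffle" is a list concatenation, and the function names are hypothetical. It is not how any particular engine implements them.

```python
from collections import Counter


def naive_plan(partitions, predicate):
    # Ship every row across the "shuffle", then filter and aggregate centrally.
    shuffled = [row for part in partitions for row in part]
    totals = Counter()
    for key, value in shuffled:
        if predicate(value):
            totals[key] += value
    return dict(totals), len(shuffled)


def optimized_plan(partitions, predicate):
    # Predicate pushdown: filter inside each partition, before any data moves.
    # Pre-aggregation: sum per key locally, so each partition ships at most
    # one partial result per key.
    partials = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            if predicate(value):
                local[key] += value
        partials.append(local)
    shipped = sum(len(p) for p in partials)

    # Final merge of the partial sums.
    totals = Counter()
    for p in partials:
        totals.update(p)
    return dict(totals), shipped


partitions = [
    [("a", 1), ("a", 5), ("b", 7)],
    [("a", 3), ("b", 0), ("b", 9)],
]
print(naive_plan(partitions, lambda v: v >= 3))
print(optimized_plan(partitions, lambda v: v >= 3))
```

Both plans produce the same totals; the second number in each result is how many records crossed the simulated shuffle. Getting this right for arbitrary queries, data types, and failure modes is the part that takes the week.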

◧◩◪
3. xpe+8n1 2024-08-25 03:37:12
>>sagarm+ZF
Re #2: I prefer https://pola.rs over Pandas
◧◩◪◨
4. sagarm+sx1 2024-08-25 06:03:24
>>xpe+8n1
I've heard great things about Polars' performance. To get there, it uses lazy evaluation so it can see more of the computation at once, which lets it implement optimizations similar to those in a SQL engine.
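
The idea of lazy evaluation can be sketched with a toy class. `LazyTable` and its methods are hypothetical names invented here; this is not the Polars API. Each call only records a step, and `collect()` runs the whole recorded plan in a single pass over the rows. A real optimizer would go further and reorder or push down steps once it has the full plan.

```python
class LazyTable:
    """Toy lazy table (hypothetical, for illustration only)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded steps; nothing has executed yet

    def filter(self, column, predicate):
        # Record the filter instead of running it.
        step = ("filter", column, predicate)
        return LazyTable(self._rows, self._plan + (step,))

    def select(self, *columns):
        # Record the projection instead of running it.
        step = ("select", columns)
        return LazyTable(self._rows, self._plan + (step,))

    def collect(self):
        # Execute the whole plan in one pass, with no intermediate tables.
        out = []
        for row in self._rows:
            current = row
            keep = True
            for step in self._plan:
                if step[0] == "filter":
                    _, column, predicate = step
                    if not predicate(current[column]):
                        keep = False
                        break
                else:
                    _, columns = step
                    current = {c: current[c] for c in columns}
            if keep:
                out.append(current)
        return out


rows = [
    {"name": "a", "score": 1, "team": "x"},
    {"name": "b", "score": 9, "team": "y"},
    {"name": "c", "score": 7, "team": "x"},
]
query = LazyTable(rows).filter("score", lambda s: s >= 5).select("name")
print(query.collect())
```

Building `query` does no work at all; the rows are only touched when `collect()` is called, which is the point at which an engine gets to look at the full computation.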
◧◩◪◨⬒
5. xpe+ti7 2024-08-27 12:44:10
>>sagarm+sx1
In the early days, even as I appreciated what Pandas could do, I never found its API sane. Pandas has too many special cases and foot-guns. It is a notorious case of poor design.

My opinion is hardly uncommon. If you read over https://www.reddit.com/r/datascience/comments/c3lr9n/am_i_th... you will find many in agreement. Of those who "like" Pandas, it is often only a relative comparison to something worse.

The problems of the Pandas API were neither intrinsic nor unavoidable. They were poor design choices, probably caused by short-term thinking or a lack of experience.

Polars is a tremendous improvement.

◧◩◪◨⬒⬓
6. sagarm+8o9 2024-08-28 00:55:41
>>xpe+ti7
Hey, I agree with you.

On eager vs. lazy evaluation -- PyTorch defaulting to eager execution seems to have been part of the reason it became popular. Adding optional lazy evaluation later to improve performance seems to have worked for them.
