zlacker

[parent] [thread] 23 comments
1. andy80+(OP)[view] [source] 2024-08-24 16:29:10
The multiple uses of WHERE with different meanings is problematic for me. The second WHERE, filtering an aggregate, would be HAVING in standard SQL.

Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.

replies(4): >>singro+H >>wvenab+a7 >>yen223+Qs >>crypto+0m1
2. singro+H[view] [source] 2024-08-24 16:34:25
>>andy80+(OP)
In the query plan, filtering before or after an aggregation is the same, so it's a strange quirk that SQL requires a different word.
replies(2): >>0cf861+I2 >>andy80+93
◧◩
3. 0cf861+I2[view] [source] [discussion] 2024-08-24 16:50:46
>>singro+H
Indeed. Just as I think git’s N different ways to refer to the same operation was a blunder.
replies(1): >>andy80+r3
◧◩
4. andy80+93[view] [source] [discussion] 2024-08-24 16:55:38
>>singro+H
I was not there at the original design decisions of the language, but I imagine it was there specifically to help the person writing/editing the query easily recognize and interpret filtering before or after an aggregation. The explicitness makes debugging a query much easier and ensures it fails earlier. I don't see much reason to stop distinguishing one use case from the other, I'm not sure how that helps anything.
replies(3): >>0cf861+F4 >>yen223+xQ >>singro+Sl1
◧◩◪
5. andy80+r3[view] [source] [discussion] 2024-08-24 16:57:29
>>0cf861+I2
But pre- and post- aggregation filtering is not really "the same" operation.
replies(1): >>0cf861+O3
◧◩◪◨
6. 0cf861+O3[view] [source] [discussion] 2024-08-24 17:01:29
>>andy80+r3
If I use a CTE and filter the aggregate, feels the same to me.
replies(1): >>andy80+Vd
◧◩◪
7. 0cf861+F4[view] [source] [discussion] 2024-08-24 17:10:04
>>andy80+93
I think this stems from the non-linear approach to reading a SQL statement. If it were top-to-bottom linear, like PRQL, then the distinction does not seem merited. It would then always be filtering from what you have collected up to this line.
8. wvenab+a7[view] [source] 2024-08-24 17:29:49
>>andy80+(OP)
> The second WHERE, filtering an aggregate, would be HAVING in standard SQL.

Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.

replies(1): >>andy80+Bi
◧◩◪◨⬒
9. andy80+Vd[view] [source] [discussion] 2024-08-24 18:20:20
>>0cf861+O3
If you perform an aggregation query in a CTE, then filter on that in a subsequent query, that is different, because you have also added another SELECT and FROM. You would use WHERE in that case whether using a CTE or just an outer query on an inner subquery. HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.
replies(1): >>RaftPe+CA
◧◩
10. andy80+Bi[view] [source] [discussion] 2024-08-24 18:53:44
>>wvenab+a7
Perhaps, however then you eliminate the use of WHERE/HAVING sum(r.e) > 3, so in case you forgot what the alias s means, you have to figure that part out before proceeding. Maybe I'm just used to the existing style but as stated earlier, seems this is reducing explicitness which IMO tends to lead to more bugs.
replies(1): >>wvenab+Cj
◧◩◪
11. wvenab+Cj[view] [source] [discussion] 2024-08-24 19:01:58
>>andy80+Bi
A lot of SQL engines don't support aliases in the HAVING clause and that can require duplication of potentially complex expressions which I find very bug-inducing. Removing duplication and using proper naming I think would be much better.

I will already use subqueries to avoid issues with HAVING.

replies(1): >>magica+bA
12. yen223+Qs[view] [source] 2024-08-24 20:12:29
>>andy80+(OP)
Should we introduce a SUBSELECT keyword to distinguish between a top-level select and a subquery?

To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.

◧◩◪◨
13. magica+bA[view] [source] [discussion] 2024-08-24 21:18:15
>>wvenab+Cj
> A lot of SQL engines don't support aliases in the HAVING clause

We're moving from SQLAnywhere to MSSQL, and boy, we're adding 2-5 levels of subqueries to most non-trivial queries due to issues like that. Super annoying.

I had one which went from 2 levels deep to 9... not pleasant. CTEs had some issues so couldn't use those either.

replies(2): >>wvenab+KI >>JoelJa+qM1
◧◩◪◨⬒⬓
14. RaftPe+CA[view] [source] [discussion] 2024-08-24 21:21:58
>>andy80+Vd
> HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.

Personally I rarely use HAVING and instead use WHERE with subqueries for the following reasons:

1-I don't like repeating/duplicating a bunch of complex calcs, easier to just do WHERE in outer query on result

2-I typically have outer queries anyway for multiple reasons: break logic into reasonable chunks for humans, also for join+performance reasons (to give the optimizer a better chance at not getting confused)

replies(1): >>sgarla+J42
◧◩◪◨⬒
15. wvenab+KI[view] [source] [discussion] 2024-08-24 22:09:58
>>magica+bA
I'm surprised you had issues with CTEs -- MS SQL has one of the better CTE implementations. But I could see how it might take more than just trivial transformations to make efficient use of them.
replies(1): >>magica+HP
◧◩◪◨⬒⬓
16. magica+HP[view] [source] [discussion] 2024-08-24 23:02:47
>>wvenab+KI
I don't recall all off the top of my head.

One issue, that I mentioned in a different comment, is that we have a lot of queries which are used transparently as sub-queries at runtime to get count first, in order to limit rows fetched. The code doing the "transparent" wrapping doesn't have a full SQL parser, so can't hoist the CTEs out.

One performance issue I do recall was that a lateral join of a CTE was much, much slower than just doing 5-6 sub-queries of the same table, selecting different columns or aggregates for each. Think selecting sum packages, sum net weight, sum gross weight, sum value for all items on an invoice.

There were other issues using plain joins, but I can't recall them right now.

replies(1): >>RaftPe+9c2
◧◩◪
17. yen223+xQ[view] [source] [discussion] 2024-08-24 23:09:37
>>andy80+93
I think the original sin here is not making aggregation an explicitly separate thing, even though it should be. Adding a count(*) fundamentally changes what the query does, and what it returns, and what restrictions apply.
◧◩◪
18. singro+Sl1[view] [source] [discussion] 2024-08-25 05:06:31
>>andy80+93
I also wasn't there, but I think this actually wasn't to help authors and instead was a workaround for the warts of SQL. It's a pain to write

    SELECT * FROM (SELECT * FROM ... GROUP BY ...) t WHERE ...
and they decided this was common enough that they would introduce a HAVING clause for this case

    SELECT * FROM ... GROUP BY ... HAVING ...
But the real issue is that in order to make operations in certain orders, SQL requires you to use subselects, which require restating a projection for no reason and a lot of syntactical ceremony. E.g. you must give the FROM item a name (t), but it's not required for disambiguation.

Another common case is projecting before the filter. E.g. you want to reuse a complicated expression in the SELECT and WHERE clauses. Standard SQL requires you to repeat it or use a subselect since the WHERE clause is evaluated first.

19. crypto+0m1[view] [source] 2024-08-25 05:08:45
>>andy80+(OP)
You can always turn a HAVING in SQL into a WHERE by wrapping the SELECT that has the GROUP BY in another SELECT that has the WHERE that would have been the HAVING if you hadn't bothered.

You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.

Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.

But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).

So all in all I agree with you: keep HAVING.

◧◩◪◨⬒
20. JoelJa+qM1[view] [source] [discussion] 2024-08-25 10:27:52
>>magica+bA
Can you please share the SQL queries? If tables/columns are sensitive, maybe it can be anonymized replacing tables with t1,t2,t3 and columns c1,c2,c3.
◧◩◪◨⬒⬓⬔
21. sgarla+J42[view] [source] [discussion] 2024-08-25 13:54:17
>>RaftPe+CA
The main (only?) task I routinely use HAVING for is finding duplicates.
◧◩◪◨⬒⬓⬔
22. RaftPe+9c2[view] [source] [discussion] 2024-08-25 15:02:29
>>magica+HP
CTE's (at least in MS SQL land) are a syntax level operation, meaning CTE's get expanded to be as if you wrote the same subquery at each place a CTE was, which frequently impacts the optimizer and performance.

I like the idea of CTE's, but I typically use temp tables instead of CTE's to avoid optimizer issues.

replies(1): >>wvenab+jx2
◧◩◪◨⬒⬓⬔⧯
23. wvenab+jx2[view] [source] [discussion] 2024-08-25 17:12:48
>>RaftPe+9c2
If you use temp tables you're subverting the optimizer. Sometimes that's what you want but often it's not.
replies(1): >>RaftPe+Q63
◧◩◪◨⬒⬓⬔⧯▣
24. RaftPe+Q63[view] [source] [discussion] 2024-08-25 20:51:47
>>wvenab+jx2
I use them on purpose to "help" the optimizer by reducing the search space for query plan ((knowing that query plan optimization is a combinatorial problem and the optimizer frequently can't evaluate enough plans in a reasonable amount of time).
[go to top]