
Here is one example from the PDF:

    FROM r JOIN s USING (id)
    |> WHERE r.c < 15
    |> AGGREGATE sum(r.e) AS s GROUP BY r.d
    |> WHERE s > 3
    |> ORDER BY d
    |> SELECT d, s, rank() OVER (order by d)

Can we call this SQL anymore after this? This re-ordering of things has been done by others too, like PRQL, but they didn't call it SQL. I do think it makes things more readable.

replies(18): >>ibash+h >>extr+j >>random+Q >>pajeet+61 >>andy80+D1 >>richbe+O1 >>jmull+s3 >>tmoert+O4 >>Glypto+U5 >>raceca+P6 >>hyperm+Qa >>thiht+di >>specia+Cq >>setr+Sq >>sklivv+dt >>iblain+Uu >>nextac+aN1 >>gajus+Yv5

>>Cianti+(OP)
Yes we can call it sql.

Language syntax changes all the time. Their point is that sql syntax is a mess and can be cleaned up.

>>Cianti+(OP)
Not bad, very similar to dplyr syntax. Personally I’m too used to classic SQL though, and this would be more readable as CTEs. In particular, how would this syntax fare if it were much more complicated, with 4-5 tables and joins?

>>Cianti+(OP)
> Can we call this SQL anymore after this?

Maybe not, just as we don't call "rank() OVER" SQL. We call it SQL:2003. Seems we're calling this GoogleSQL. But perhaps, in both cases, we can use SQL for short?

replies(2): >>esafak+y5 >>eurode+W5

>>Cianti+(OP)
Yes, having |> isn't breaking SQL but rather enhancing it.

I really like this idea of piping SQL queries rather than trying to create the perfect syntax from the get go.

+1 for readability too.

replies(1): >>oxym0r+ec

>>Cianti+(OP)
The multiple uses of WHERE with different meanings is problematic for me. The second WHERE, filtering an aggregate, would be HAVING in standard SQL.

Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.

replies(4): >>singro+k2 >>wvenab+N8 >>yen223+tu >>crypto+Dn1

>>Cianti+(OP)
IMO having SELECT before FROM is one of SQL's biggest mistakes. I would gladly welcome a new syntax that rectifies this. (Also https://duckdb.org/2022/05/04/friendlier-sql.html)

I don't love the multiple WHEREs.

replies(4): >>abelch+Lt >>virapt+nZ >>RaftPe+1i2 >>mr_toa+Dvc

>>andy80+D1
In the query plan, filtering before or after an aggregation is the same, so it's a strange quirk that SQL requires a different word.

replies(2): >>0cf861+l4 >>andy80+M4

>>Cianti+(OP)
The proposal here adds pipe syntax to SQL.

So it would be reasonable to call it SQL, if it gets traction. You want to see some of the big dogs adopting it.

That should at least be possible since it looks like it could be added to an existing implementation without significant disruption/extra complexity.

replies(1): >>crypto+Pn1

>>singro+k2
Indeed. Just as I think git’s N different ways to refer to the same operation was a blunder.

replies(1): >>andy80+45

>>singro+k2
I was not there at the original design decisions of the language, but I imagine it was there specifically to help the person writing/editing the query easily recognize and interpret filtering before or after an aggregation. The explicitness makes debugging a query much easier and ensures it fails earlier. I don't see much reason to stop distinguishing one use case from the other, I'm not sure how that helps anything.

replies(3): >>0cf861+i6 >>yen223+aS >>singro+vn1

>>Cianti+(OP)
The point of SQL pipe syntax is that there is no reordering. You read the query as a sequence of operations, and that's exactly how it's executed. (Semantically. Of course, the query engine is free to optimize the execution plan as long as the semantics are preserved.)

The pipe operator is a semantic execution barrier: everything before the `|>` is assumed to have executed and returned a table before what follows begins.

From the paper:

> Each pipe operator is a unary relational operation that takes one table as input and produces one table as output.

Vanilla SQL is actually more complex in this respect because you have, for example, at least 3 different keywords for filtering (WHERE, HAVING, QUALIFY) and everyone who reads your query needs to understand what each keyword implies regarding execution scheduling. (WHERE is before grouping, HAVING is after aggregates, and QUALIFY is after analytic window functions.)

replies(8): >>quietb+Aq >>mattas+Zq >>aidos+Mt >>thwart+GG >>_a_a_a+YW >>crypto+5n1 >>vendid+du1 >>camgun+pB1

>>0cf861+l4
But pre- and post- aggregation filtering is not really "the same" operation.

replies(1): >>0cf861+r5

>>andy80+45
If I use a CTE and filter the aggregate, feels the same to me.

replies(1): >>andy80+yf

>>random+Q
You show a good example. Many people would call that SQL, and if pipes become popular, they too might simply be called SQL one day.

>>Cianti+(OP)
In the example, would there be a difference between `|> where s > 3` and `|> having s > 3`?

Edit: nope, just that you don't need having to exist with the pipe syntax.

>>random+Q
> GoogleSQL

EssGyooGell: A Modest Proposal

>>andy80+M4
I think this stems from the non-linear approach to reading a SQL statement. If it were top-to-bottom linear, like PRQL, then the distinction does not seem merited. It would then always be filtering from what you have collected up to this line.

>>Cianti+(OP)
In that example, "s" has two meanings: 1. A table being joined. 2. A column being summed.

For clarity, they should have assigned #2 to a different variable letter.

>>andy80+D1
> The second WHERE, filtering an aggregate, would be HAVING in standard SQL.

Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.

replies(1): >>andy80+ek

>>Cianti+(OP)
This is an extension on top of all existing SQL. The pipe functions more or less as a unix pipe. There is no reordering, but the user selects the order. The core syntax is simply:

  query |> operator

Which results in a new query that can be piped again. So e.g. this would be valid too:

  SELECT id,a,b FROM table WHERE id>1
  |> WHERE id < 10

Personally, I can see this fixing so much SQL pain.
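
For engines without pipe support, the chaining described above can be approximated by wrapping the previous query in a subquery at each step. A rough sketch using Python's stdlib sqlite3; `pipe_where` and the table `t` are hypothetical names made up for illustration, not part of the proposal or of any library:

```python
import sqlite3


def pipe_where(query: str, condition: str) -> str:
    """Hypothetical helper: emulate `query |> WHERE condition` with a subquery."""
    return f"SELECT * FROM ({query}) AS prev WHERE {condition}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(i, f"a{i}", f"b{i}") for i in range(15)],
)

# Step 1 is an ordinary query; step 2 filters its result, like a second pipe stage.
base = "SELECT id, a, b FROM t WHERE id > 1"
piped = pipe_where(base, "id < 10")

ids = [row[0] for row in conn.execute(piped + " ORDER BY id")]
print(ids)
```

The condition is pasted in as raw SQL, so this is only meant to show the shape of the rewrite, not something to feed untrusted input into.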

replies(1): >>larodi+Fg

>>pajeet+61
Honestly, it seems like a band-aid on legacy query language.

replies(1): >>lpapez+Df

>>0cf861+r5
If you perform an aggregation query in a CTE, then filter on that in a subsequent query, that is different, because you have also added another SELECT and FROM. You would use WHERE in that case whether using a CTE or just an outer query on an inner subquery. HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.
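
To make the two shapes concrete, here is a small sketch run through Python's stdlib sqlite3. The `orders` table and its data are invented for illustration; the point is only that HAVING filters in the same query block, while the CTE version needs a second SELECT/FROM and then uses WHERE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ann', 5), ('ann', 7), ('bob', 2), ('cat', 10), ('cat', 1);
""")

# Filter after aggregation inside the same query block.
with_having = """
    SELECT customer, sum(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING sum(amount) > 10
    ORDER BY customer
"""

# Same filter, but expressed as WHERE on an outer query over the aggregate.
with_cte = """
    WITH totals AS (
        SELECT customer, sum(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total
    FROM totals
    WHERE total > 10
    ORDER BY customer
"""

having_rows = conn.execute(with_having).fetchall()
cte_rows = conn.execute(with_cte).fetchall()
print(having_rows, cte_rows)
```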

replies(1): >>RaftPe+fC

>>oxym0r+ec
SQL a legacy query language?

In order for a thing to be considered legacy, there needs to be a widespread successor available.

SQL might have been invented in the 70s but it's still going strong as no real alternative has been widely adopted so far - I'd wager that you will find SQL at most software companies today.

Calling it legacy is not realistic IMO.

replies(1): >>Spivak+mz

>>hyperm+Qa
okay, now I can see why this reminds me so much of CTEs

>>Cianti+(OP)
Honestly SQL screwed things up from the very beginning. "SELECT FROM" makes no sense at all. The projection being before the selection is dumb as hell. This is why we can’t get proper tooling for writing SQL, even autocompletion can’t work sanely. You write "SELECT", what’s it gonna autocomplete?

PRQL gives me hope that we might finally get something nice some day

replies(4): >>akira2+ii >>gfody+Hk >>parpfi+Kt >>scrlk+Du

>>thiht+di

    SELECT 1+2;

FROM clauses aren't required, and using multiple tables in FROM doesn't seem to work out too well when that syntax is listed first.

replies(3): >>yen223+ks >>Electr+8m1 >>thiht+XF1

>>wvenab+N8
Perhaps, however then you eliminate the use of WHERE/HAVING sum(r.e) > 3, so in case you forgot what the alias s means, you have to figure that part out before proceeding. Maybe I'm just used to the existing style but as stated earlier, seems this is reducing explicitness which IMO tends to lead to more bugs.

replies(1): >>wvenab+fl

>>thiht+di
> what’s it gonna autocomplete?

otoh if you selected something the from clause and potentially some joins could autocomplete

replies(1): >>thiht+cG1

>>andy80+ek
A lot of SQL engines don't support aliases in the HAVING clause and that can require duplication of potentially complex expressions which I find very bug-inducing. Removing duplication and using proper naming I think would be much better.

I will already use subqueries to avoid issues with HAVING.

replies(1): >>magica+OB

>>tmoert+O4
> The point of SQL pipe syntax is that there is no reordering.

If you're referring to this in the comment you're replying to:

> Can we call this SQL anymore after this? This re-ordering of things ...

Then they're clearly just saying that this is a reordering compared to SQL, which is undeniably true (and the whole point).

replies(1): >>tmoert+rs

>>Cianti+(OP)
My initial reaction is that the pipes are redundant (syntactic vinegar). Syntactic order is sufficient.

The changes to my SQL grammar to accommodate this proposal are minor. Move the 'from' rule to the front. Add a star '*' around a new filters rule (i.e. zero-or-more, in any order), removing the misc dialect-specific alts, simplifying my grammar a bit.

Drop the pipes and this would be terrific.

>>Cianti+(OP)
this is consistent, non-pseudo-english, reusable, and generic. The SQL standard largely defines the aesthetic of the language, and is in complete opposition to these qualities. I think it would be fundamentally incorrect to call it SQL

Perhaps if they used a keyword PIPE and used a separate grammar definition for the expressions that follow the pipe, such that it is almost what you’d expect but randomly missing things or changes up some keywords

>>tmoert+O4
> Vanilla SQL [...] QUALIFY is after analytic window functions

Isn't that FILTER (WHERE), as in SELECT avg(...) FILTER (WHERE ...) FROM ...?

>>akira2+ii
WITH clauses are optional and appear before SELECT. No reason why the FROM clause couldn't behave the same

replies(1): >>akira2+Rx

>>quietb+Aq
The post I was referring to said that this new pipe syntax was a big reordering compared to the vanilla syntax, which it is. But my point is that if you're going to understand the vanilla syntax, you already have to do this reordering in your head because the order in which the vanilla syntax executes (inside out) is the order in which pipe syntax reads. So it's just easier all around to adopt the pipe syntax so that reading and execution are the same.

>>Cianti+(OP)
We can call it "Linq2SQL" and what a disaster it was...

>>thiht+di
I also hate having SELECT before FROM because I want to think of the query as a transformation that can be read from top to bottom to understand the flow.

But I assume that that’s part of why they didn’t set it up that way — it’s just a little thing to make the query feel more declarative and less imperative

>>richbe+O1
duckDB is what sql should be in 2024

https://duckdbsnippets.com/

replies(1): >>xigoi+6J1

>>tmoert+O4
Golly, QUALIFY, a new SQL operator I didn’t know existed. I tend not to do much with window functions and I would have reached for a CTE instead but it’s always nice to be humbled by finding something new in a language you thought you knew well.

replies(1): >>mrbung+PW

>>andy80+D1
Should we introduce a SUBSELECT keyword to distinguish between a top-level select and a subquery?

To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.

>>thiht+di
The initial version of SQL was called "Structured English Query Language".

If the designers intended to create a query language that resembled an English sentence, it makes sense why they chose "SELECT FROM".

"Select the jar from the shelf" vs. "From the shelf, select the jar".

replies(1): >>xigoi+SI1

>>Cianti+(OP)
Looking at this reminds me of Apache Pig. That’s not a compliment.

>>yen223+ks
Isn't that strictly for CTEs? In which case, you are SELECTing from the CTE.

>>lpapez+Df
I mean kinda? It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."

We would invent the underlying engines for sure but not the language on top of it. It doesn't map at all to how it's actually used by programmers. SQL is the JS to WebAssembly, being able to write the query plan directly via whatever language or mechanism you prefer would be goated.

It has to be my biggest pain point dealing with SQL, having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit! is unbelievably frustrating.

replies(2): >>wvenab+zB >>lpapez+4E

>>Spivak+mz
By that definition JavaScript is also legacy.

> having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit!

That's not in the domain of SQL. If you're not getting the most optimized query plan, there is something wrong with the DBMS engine or statistics -- SQL, the language, isn't supposed to care about those details.

replies(1): >>Spivak+y82

>>wvenab+fl
> A lot of SQL engines don't support aliases in the HAVING clause

We're moving from SQLAnywhere to MSSQL, and boy, we're adding 2-5 levels of subqueries to most non-trivial queries due to issues like that. Super annoying.

I had one which went from 2 levels deep to 9... not pleasant. CTEs had some issues so couldn't use those either.

replies(2): >>wvenab+nK >>JoelJa+3O1

>>andy80+yf
> HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.

Personally I rarely use HAVING and instead use WHERE with subqueries for the following reasons:

1-I don't like repeating/duplicating a bunch of complex calcs, easier to just do WHERE in outer query on result

2-I typically have outer queries anyway for multiple reasons: break logic into reasonable chunks for humans, also for join+performance reasons (to give the optimizer a better chance at not getting confused)

replies(1): >>sgarla+m62

>>Spivak+mz
> It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."

I don't think that definition of legacy is useful because so many things which hardly anyone calls "legacy" fit the definition - for example: Javascript as the web standard, cars in cities and bipartisan democracy.

I think many of us would say that none of these is an ideal solution for the problem being solved, but it's what we are stuck with, and I don't think anyone could call them "legacy systems" until a viable successor is widespread.

>>tmoert+O4
> The pipe operator is a semantic execution barrier: everything before the `|>` is assumed to have executed and returned a table before what follows begins

I already think about SQL like this (as operation on lists/sets), however thinking of it like that, and having previous operations feed into the next, which is conceptually nice, seems to make it hard to do, and think about:

> *(the query engine is free to optimize the execution plan as long as the semantics are preserved)

since logically each part between the pipes doesn't know about the others, so global optimizations, such as use of indexes to restrict the result of a join based on the where clause can't be done/is more difficult.

>>magica+OB
I'm surprised you had issues with CTEs -- MS SQL has one of the better CTE implementations. But I could see how it might take more than just trivial transformations to make efficient use of them.

replies(1): >>magica+kR

>>wvenab+nK
I don't recall all off the top of my head.

One issue, that I mentioned in a different comment, is that we have a lot of queries which are used transparently as sub-queries at runtime to get count first, in order to limit rows fetched. The code doing the "transparent" wrapping doesn't have a full SQL parser, so can't hoist the CTEs out.

One performance issue I do recall was that a lateral join of a CTE was much, much slower than just doing 5-6 sub-queries of the same table, selecting different columns or aggregates for each. Think selecting sum packages, sum net weight, sum gross weight, sum value for all items on an invoice.

There were other issues using plain joins, but I can't recall them right now.

replies(1): >>RaftPe+Md2

>>andy80+M4
I think the original sin here is not making aggregation an explicitly separate thing, even though it should be. Adding a count(*) fundamentally changes what the query does, and what it returns, and what restrictions apply.

>>aidos+Mt
It's not common at all; it's a non-ANSI SQL clause that afaik was created by Teradata: syntactic sugar for filtering on window functions directly without CTEs or temp tables, especially useful for dedup. At least in most cases: for example, you can't use QUALIFY in a query that is aggregating data, just as you can't use a window function when aggregating.

Other engines that implement it are direct competitors in that space: Snowflake, Databricks SQL, BigQuery, ClickHouse, and DuckDB (the only OSS implementation I know). Point is: if you want to compete with Teradata and be a possible migration target, you want to implement QUALIFY.

Anecdote: I went from a company that had Teradata to another where I had to implement all the data stack in GCP. I shed tears of joy when I knew BQ also had QUALIFY. And the intent was clear, as they also offered various Teradata migration services.
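
For anyone who hasn't seen the dedup pattern mentioned above, here is a sketch of the longhand that QUALIFY is meant to shorten, run through Python's stdlib sqlite3 (which has no QUALIFY, and needs SQLite 3.25+ for window functions). The `events` table is invented for illustration, and the QUALIFY form in the comment is an approximation of what supporting engines accept, not tested here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, ts INTEGER, status TEXT);
    INSERT INTO events VALUES
        (1, 100, 'new'), (1, 200, 'paid'),
        (2, 150, 'new'),
        (3, 120, 'new'), (3, 130, 'paid'), (3, 140, 'shipped');
""")

# Keep only the latest row per user. In an engine with QUALIFY this would be
# roughly (assumption, not run here):
#   SELECT * FROM events
#   QUALIFY row_number() OVER (PARTITION BY user_id ORDER BY ts DESC) = 1
dedup_sql = """
    WITH ranked AS (
        SELECT user_id, ts, status,
               row_number() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM events
    )
    SELECT user_id, ts, status
    FROM ranked
    WHERE rn = 1
    ORDER BY user_id
"""

latest = conn.execute(dedup_sql).fetchall()
print(latest)
```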

>>tmoert+O4
'qualify' is now standard? Thought it was a vendor extension currently.

>>richbe+O1
Duckdb also supports prql with an extension https://github.com/ywelsch/duckdb-prql

>>akira2+ii
Beginning with Oracle Database Release 23 [released May 2, 2024], it is now optional to select expressions using the FROM DUAL clause.

>>tmoert+O4
> The point of SQL pipe syntax is that there is no reordering.

But this thing resembles other FROM-clause-first variants of SQL, thus GP's point about this being just a reordering. GP is right: the FROM clause gets re-ordered to be first, so it's a reordering.

>>andy80+M4
I also wasn't there, but I think this actually wasn't to help authors and instead was a workaround for the warts of SQL. It's a pain to write

    SELECT * FROM (SELECT * FROM ... GROUP BY ...) t WHERE ...

and they decided this was common enough that they would introduce a HAVING clause for this case

    SELECT * FROM ... GROUP BY ... HAVING ...

But the real issue is that in order to make operations in certain orders, SQL requires you to use subselects, which require restating a projection for no reason and a lot of syntactical ceremony. E.g. you must give the FROM item a name (t), but it's not required for disambiguation.

Another common case is projecting before the filter. E.g. you want to reuse a complicated expression in the SELECT and WHERE clauses. Standard SQL requires you to repeat it or use a subselect since the WHERE clause is evaluated first.

>>andy80+D1
You can always turn a HAVING in SQL into a WHERE by wrapping the SELECT that has the GROUP BY in another SELECT that has the WHERE that would have been the HAVING if you hadn't bothered.

You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.

Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.

But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).

So all in all I agree with you: keep HAVING.

>>jmull+s3
There may be trademark issues, but even if not, doing sufficient violence to the original thing argues for using a new name for the new thing.

>>tmoert+O4
This is an interesting point.

All these years I've been doing that reordering and didn't even realize!

>>tmoert+O4
This kind of implies there's better or worse ordering. AFAIK that's pretty subjective. If the idea was to expose how the DB is ordering things, or even make things easier for autocomplete OK, but this just feels like a "I have a personal aesthetic problem with SQL and I think we should spend thousands of engineering hours and bifurcate SQL projects forever to fix it" kind of thing.

>>akira2+ii
Doesn’t change anything, you can still have the select at the end, and optional from and joins at the beginning. In your example, the select could be at the end, it’s just that there’s nothing before.

>>gfody+Hk
Not reliably, especially if you alias tables. Realistically, you need to know what you’re selecting from before knowing what you’re selecting.

>>scrlk+Du
“Go to the shelf and select the jar”. You’re describing a process, so it’s natural to formulate it in chronological order.

replies(1): >>mr_toa+ewc

>>abelch+Lt
The very first example on that page is vulnerable to injection.

replies(1): >>richbe+Pn2

>>Cianti+(OP)
At this point I think that vanilla SQL should just support optionally putting the from before the select. It's useful for enabling autocompletion, among other things.

replies(1): >>RaftPe+tg2

>>magica+OB
Can you please share the SQL queries? If tables/columns are sensitive, maybe it can be anonymized replacing tables with t1,t2,t3 and columns c1,c2,c3.

>>RaftPe+fC
The main (only?) task I routinely use HAVING for is finding duplicates.
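
A minimal sketch of that duplicate-finding pattern, run through Python's stdlib sqlite3; the `users` table and its rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, email TEXT);
    INSERT INTO users VALUES
        (1, 'a@example.com'), (2, 'b@example.com'), (3, 'a@example.com'),
        (4, 'c@example.com'), (5, 'a@example.com'), (6, 'c@example.com');
""")

# Group by the column that should be unique and keep groups with more than one row.
dupes = conn.execute("""
    SELECT email, count(*) AS n
    FROM users
    GROUP BY email
    HAVING count(*) > 1
    ORDER BY email
""").fetchall()
print(dupes)
```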

>>wvenab+zB
> That's not in the domain of SQL.

That's my point, I think we've reached the point where SQL the language can be more of a hindrance than help because in a lot of cases we're writing directly to the engine but with oven mitts on. If I could build the query from the tree with scan, filter, index scan, cond, merge join as my primitives it would be so nice.

replies(1): >>mr_toa+gvc

>>magica+kR
CTE's (at least in MS SQL land) are a syntax level operation, meaning CTE's get expanded to be as if you wrote the same subquery at each place a CTE was, which frequently impacts the optimizer and performance.

I like the idea of CTE's, but I typically use temp tables instead of CTE's to avoid optimizer issues.

replies(1): >>wvenab+Wy2

>>nextac+aN1
And a simple keyword that does a GROUP BY on all columns in select that aren't aggregates, just a syntax level macro-ish type of thing.

>>richbe+O1
That's a great list of friendlier sql in DuckDB. For most of that list I either run into it regularly or have wanted the exact fix they have.

>>xigoi+6J1
Which one?

replies(1): >>xigoi+Zp2

>>richbe+Pn2

  #!/bin/bash
  function csv_to_parquet() {
      file_path="$1"
      duckdb -c "COPY (SELECT * FROM read_csv_auto('$file_path')) TO '${file_path%.*}.parquet' (FORMAT PARQUET);"
  }
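
The problem is that a file path containing a single quote ends the SQL string literal early. One possible mitigation, sketched in Python rather than bash, is to double embedded single quotes before building the statement. `sql_string_literal` and `build_copy_statement` are hypothetical helpers; the COPY statement is copied from the snippet, and the assumption is that the target engine follows the standard quote-doubling rule for string literals. The round trip is checked against SQLite only:

```python
import sqlite3


def sql_string_literal(value: str) -> str:
    """Quote value as a SQL string literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_copy_statement(file_path: str) -> str:
    """Build the COPY statement from the snippet with both paths quoted."""
    # Mirrors ${file_path%.*}: drop everything from the last dot onward.
    stem = file_path.rsplit(".", 1)[0]
    return (
        f"COPY (SELECT * FROM read_csv_auto({sql_string_literal(file_path)})) "
        f"TO {sql_string_literal(stem + '.parquet')} (FORMAT PARQUET);"
    )


hostile_path = "data'); SELECT 'oops.csv"
statement = build_copy_statement(hostile_path)
print(statement)

# Sanity check: the quoted literal parses back to the original string.
conn = sqlite3.connect(":memory:")
round_tripped = conn.execute(
    "SELECT " + sql_string_literal(hostile_path)
).fetchone()[0]
print(round_tripped == hostile_path)
```

Where the client API offers bound parameters, those are usually a better choice than manual quoting.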

replies(1): >>richbe+aF3

>>RaftPe+Md2
If you use temp tables you're subverting the optimizer. Sometimes that's what you want but often it's not.

replies(1): >>RaftPe+t83

>>wvenab+Wy2
I use them on purpose to "help" the optimizer by reducing the search space for the query plan (knowing that query plan optimization is a combinatorial problem and the optimizer frequently can't evaluate enough plans in a reasonable amount of time).

>>xigoi+Zp2
Eh, in the context of the site and other snippets that seems pedantic.

Could it be run on untrusted user input? Sure. Does it actually pose a threat? It's improbable.

>>Cianti+(OP)
What's the SQL equivalent of this?
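
One plausible classic-SQL reading of the pipe query at the top of the thread, sketched with Python's stdlib sqlite3 (SQLite 3.25+ for the window function). The table contents and the `rnk` and `t` aliases are invented for illustration, and this is my own translation rather than the one given in the paper: the second WHERE becomes HAVING, and the final SELECT becomes an outer query over the aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (id INTEGER, c INTEGER, d TEXT, e INTEGER);
    CREATE TABLE s (id INTEGER);
    INSERT INTO r VALUES
        (1, 10, 'a', 2), (2, 12, 'a', 3), (3, 20, 'a', 100),
        (4, 5, 'b', 1), (5, 7, 'c', 4), (6, 8, 'c', 9);
    INSERT INTO s VALUES (1), (2), (3), (4), (5);
""")

classic_sql = """
    SELECT d, s, rank() OVER (ORDER BY d) AS rnk
    FROM (
        SELECT r.d AS d, sum(r.e) AS s
        FROM r JOIN s USING (id)
        WHERE r.c < 15           -- first pipe WHERE: filters rows
        GROUP BY r.d
        HAVING sum(r.e) > 3      -- second pipe WHERE: filters groups
    ) AS t
    ORDER BY d
"""

rows = conn.execute(classic_sql).fetchall()
print(rows)
```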

>>Spivak+y82
Sounds like you don’t want SQL at all. Some sort of non-SQL, or not-SQL, never-SQL. Something along those lines.

replies(1): >>Spivak+lpd

>>richbe+O1
SQL was supposed to follow English grammar. Having FROM before SELECT is like having “Begun” before “these clone wars have.”

>>xigoi+SI1
SQL is a declarative language not a procedural one. You tell the query planner what you want, not how to do it.

>>mr_toa+gvc
That's the thing though, I still want my data to be relational so NoSQL databases don't fit the bill. I want to interact with a relational database via something other than the SQL language and given that this language already exists (Postgres compiles your SQL into an IR that uses these primitives) I don't think it's a crazy ask.