What’s your issue there?
Plus I can't use web tools, like "Read this page" in Mobile Safari.
And copying and pasting is harder.
And I can't link to individual sections.
I'm honestly baffled by people who prefer PDFs for this kind of information. Are they printing them out on paper and going at them with a highlighter or something?
https://static.simonwillison.net/static/2024/Pipe-Syntax-In-...
FROM r JOIN s USING (id)
|> WHERE r.c < 15
|> AGGREGATE sum(r.e) AS s GROUP BY r.d
|> WHERE s > 3
|> ORDER BY d
|> SELECT d, s, rank() OVER (order by d)
Can we call this SQL anymore after this? This re-ordering of things has been done by others too, like PRQL, but they didn't call it SQL. I do think it makes things more readable... okay, if I dial back my feelings of resignation to mediocrity, then I'll admit that Google probably does have enough clout to make this go somewhere - but they'd need to add this to all their database offerings (BigQuery, Spanner, Firebase's SQL mode) and contribute patches to Postgres and MySQL/Maria - and maybe a decade later, after Microsoft relents and adds it to MSSQL, we'll start to see Oracle's people refer to it vaguely as a nice-to-have they'll implement only after they start losing more blue-chip customers[1].
Also, it's giving me M (Excel PowerQuery) vibes too.
-------
[1]For context, Oracle's DB lacked a `bit`/`bool` column type for the past 40 years until last year. People had to use `char(1)` columns with CHECK constraints to store '0'/'1' - or worse: 'T'/'F' or 'Y'/'N' (see https://stackoverflow.com/a/3726846/159145 )
Language syntax changes all the time. Their point is that SQL syntax is a mess and can be cleaned up.
I do find common table expressions (eg the WITH clause) let me express what I want way more easily.
Probably because it makes it easy to express a bunch of operators that are then pipelined…
Maybe not, just as we don't call "rank() OVER" SQL. We call it SQL:2003. Seems we're calling this GoogleSQL. But perhaps, in both cases, we can use SQL for short?
I really like this idea of piping SQL queries rather than trying to create the perfect syntax from the get go.
+1 for readability too.
But piping is both intuitive and very easy to understand without having to get everything right from the beginning.
I really hope Postgres 16 can implement pipes. It might be interesting to be able to use triggers and functions to pipe to and from; actually, that might simplify that aspect of it too.
Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.
But it is worth stating nonetheless. No one confesses when they search StackOverflow for something they've solved 100 times before, or when they ask Google a question a dozen times a day. Asking ChatGPT for some insight should equally not be considered sinful :)
I don't love the multiple WHEREs.
Also, why the need to match semantic evaluation order when there are far more important things happening under the hood that affect the execution plan (indexes, etc.)?
> Side-effects at a distance...The same columns are cross-referenced in SELECT, GROUP BY and ORDER BY, and corresponding edits are often required in three places
Can't this be avoided by using aliases?
I hope SQL pipes become commonplace so that I can use them anywhere I have to write SQL queries.
So it would be reasonable to call it SQL, if it gets traction. You want to see some of the big dogs adopting it.
That should at least be possible since it looks like it could be added to an existing implementation without significant disruption/extra complexity.
If you are truly blessed you get to see all of these in a single database. Also “Yes”/“No” with that specific casing.
Given it’s adding the pipe operator, you can always add it to only the part of the query where it makes sense.
A nice SQL linter for legibility could be great for this.
The pipe operator is a semantic execution barrier: everything before the `|>` is assumed to have executed and returned a table before what follows begins.
From the paper:
> Each pipe operator is a unary relational operation that takes one table as input and produces one table as output.
Vanilla SQL is actually more complex in this respect because you have, for example, at least 3 different keywords for filtering (WHERE, HAVING, QUALIFY) and everyone who reads your query needs to understand what each keyword implies regarding execution scheduling. (WHERE is before grouping, HAVING is after aggregates, and QUALIFY is after analytic window functions.)
Edit: nope, just that you don't need HAVING to exist with the pipe syntax.
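For a concrete sketch of the difference, with made-up table and column names (QUALIFY as in BigQuery/Snowflake/Teradata):

SELECT customer, SUM(amount) AS total,
       RANK() OVER (ORDER BY SUM(amount) DESC) AS rnk
FROM orders
WHERE status = 'shipped'   -- row filter, before grouping
GROUP BY customer
HAVING SUM(amount) > 100   -- group filter, after aggregation
QUALIFY rnk <= 10;         -- window filter, after analytic functions

In pipe syntax the same thing should need only WHERE, placed at the right points in the pipeline (assuming the proposal's EXTEND operator for adding the window column):

FROM orders
|> WHERE status = 'shipped'
|> AGGREGATE SUM(amount) AS total GROUP BY customer
|> WHERE total > 100
|> EXTEND RANK() OVER (ORDER BY total DESC) AS rnk
|> WHERE rnk <= 10;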
For clarity, they should have assigned #2 to a different variable letter.
For example, I recently wrote an article about taking random samples using SQL. Even though I was writing it for my blog, which is HTML, I proofread the article by rendering it as a PDF doc, printing it out, and reviewing it with a blue pen in hand.
What surprised me is that I also found it easier to review the article on the screen when it was in PDF format. TeX just does a way better job of putting words on a page than does a web browser.
Actually, if you want to do the comparison yourself, I'll put both versions online:
HTML: https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....
PDF: https://blog.moertel.com/images/public_html/blog/pix-2024060...
I don't think either version is hard to read, but if I had my choice, I'd read the PDF version. But maybe that's just me.
Let me know which you prefer.
There have been so many attempts to alleviate the difficulty of writing SQL -- from ORMs, to alternate syntaxes, to alternate databases -- and none have been successful. The authors identify many reasons for this. The approach the authors have taken is incremental and makes only slight demands on people already familiar with elementary SQL -- and even people familiar with advanced SQL will likely find this approach easier to use for advanced queries.
That simple change would've largely solved the academic paper problem decades ago. It's bizarre that it still isn't a feature.
Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.
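Roughly, the equivalence looks like this (my sketch, hypothetical table and columns):

FROM t
|> AGGREGATE SUM(x) AS s GROUP BY g
|> WHERE s > 3

behaves like the plain-SQL subquery form:

SELECT * FROM (SELECT g, SUM(x) AS s FROM t GROUP BY g) sub WHERE s > 3;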
You can guess how many foreign keys that database had.
Piped SQL fits in perfectly with the overall SQL pot-luck buffet! I for one welcome Google to the table, enjoy the language that works everywhere and nowhere but is the best there is.
query | operator
Which results in a new query that can be piped again. So e.g. this would be valid too:
SELECT id,a,b FROM table WHERE id>1
|WHERE id < 10
Personally, I can see this fixing so much SQL pain.
* Joins can be modelled as a "cross-referencing" operation that consumes two (or more) data streams and produces a single data stream
* CTEs can be modelled as producing multiple data streams
* Recursive CTEs can be modelled as cycles in the execution graph
In order for a thing to be considered legacy, there needs to be a widespread successor available.
SQL might have been invented in the 70s but it's still going strong as no real alternative has been widely adopted so far - I'd wager that you will find SQL at most software companies today.
Calling it legacy is not realistic IMO.
But they do sort of acknowledge it in the paper. Eg on the first page it says:
> …we present a solution – adding pipe-structured data flow syntax to SQL. This makes SQL more flexible, extensible and easy to use. This paradigm works well in other languages like Kusto’s KQL[5]
Strange typo though, to say “Kusto’s KQL” instead of “Microsoft’s KQL”
Kusto is allegedly named after (sort of in reference to) Jacques Cousteau, so “Kusto’s” doesn’t make sense.
QQ: Do the math formulas render properly in reader mode for you? (On my test with Chrome, the answer seems to be no.)
PRQL gives me hope that we might finally get something nice some day
SELECT 1+2;
FROM clauses aren't required, and using multiple tables in FROM doesn't seem to work out too well when that syntax is listed first. It feels like Google is only concerned with large query engines like BigQuery, where you're relying on partitioning and sort order within partitions more than on indexes or complicated multi-table reference semantics.
> Can't this be avoided by using aliases?
In any language where variables are hard to declare and use, pipes always feel like a good option; however, just adding some variables in is probably the better solution in most cases.
What syntax do you know of that can represent a DAG in text?
In the browser (iOS Safari) I use an extension (dark reader) to give it a dark theme, and the formulas render just fine there.
OTOH, if you selected something, the FROM clause and potentially some joins could autocomplete.
I will already use subqueries to avoid issues with HAVING.
When I last used KQL, it was infuriating that I couldn't separate my `let` clauses into chunks with whitespace, because a blank line would be considered a terminated statement (and Kusto would say "Hey, where's your expression?!"). This meant every Kusto file was a sea of text with no clear differentiation between subsequent clauses. I ended up using 3 blank comment lines as a "fake empty line" just to maintain my sanity.
Again, hope they fixed that by now...
Pipe syntax, i.e. data processing using pipe-structured data flow syntax, is not very novel at all. Splunk's SPL and similar languages have been a longstanding industry practice.
I wish the paper had provided more extensive references to the existing body of work in this domain, as it would have added depth and context to the discussion.
For SQL to have a good/improved syntax for insert/update/delete/merge would be very nice (in theory).
If you're referring to this in the comment you're replying to:
> Can we call this SQL anymore after this? This re-ordering of things ...
Then they're clearly just saying that this is a reordering compared to SQL, which is undeniably true (and the whole point).
The changes to my SQL grammar to accommodate this proposal are minor. Move the 'from' rule to the front. Add a star '*' around a new filters rule (eg zero-or-more, in any order), removing the misc dialect-specific alts, simplifying my grammar a bit.
Drop the pipes and this would be terrific.
Perhaps if they used a keyword PIPE and a separate grammar definition for the expressions that follow the pipe, such that it is almost what you’d expect but randomly missing things or changing up some keywords.
Isn't that FILTER (WHERE), as in SELECT avg(...) FILTER (WHERE ...) FROM ...?
Some developers have problems because they don't understand its abstraction. Periodically they try to re-implement features to "fix" SQL. Every time it's an absolute disaster.
I realize this is how the world works, but it's just a waste of time.
But I assume that that’s part of why they didn’t set it up that way — it’s just a little thing to make the query feel more declarative and less imperative
These days, we have tooling that can generate an IR for SQL and transpile it to any dialect (see: sqlglot).
Doesn’t coupling SQL syntax to dialect-specific SQL semantics throw a wrench in this?
I think the point of this new syntax is to lower the cost of understanding the abstraction. You keep the familiar clause syntax but let clauses be chained into pipelines in which the actual (semantic) execution order is the same as the written order.
To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.
If the designers intended to create a query language that resembled an English sentence, it makes sense why they chose "SELECT FROM".
"Select the jar from the shelf" vs. "From the shelf, select the jar".
The big advantage is that the SQL engine can optimize the how if you focus on the what -- especially after so much research has been done building query compilers.
If you want to focus on the "how" then perhaps a "bare bones" DB like Redis is more in line with this kind of thinking.
When I was doing data science, all the other DS folks would be perfectly content to read and write queries that were hundreds of lines long. There were plenty of minor nits to pick, but it was a great lingua franca for describing data processing.
But engineers hate SQL because they generally only need a tiny little subset of the features to enable transactional data updates. So they write an ORM to do the subset of SQL they need and never get the opportunity to be indoctrinated into the SQuLt
Why not write simple SQL queries and use another language to do the transformations?
Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries? Doesn't working without reusable blocks / tests / logs make development harder?
Syntax is one thing, but actual performance (and safety/maintenance) is another deal?
> Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?
Yes. With knowledge of the data (like indices) and statistics of the data it is usually very much more efficient than piping the data to another process to handle the same job.
Of course that requires you to write a good query and know how to tell what makes a good query, but if your data is in a relational database it is usually more efficient to do your filtering/matching/aggregating there.
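As a sketch of what that looks like in practice (hypothetical table; EXPLAIN output and details vary by engine):

CREATE INDEX idx_orders_status ON orders (status);

EXPLAIN
SELECT customer_id, SUM(amount)
FROM orders
WHERE status = 'shipped'
GROUP BY customer_id;
-- with the index and table statistics, the planner can touch only matching
-- rows, instead of streaming the whole table out to another process to filter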
I think this syntax could accidentally give the impression that this is what's happening, but it's still an abstraction thanks to the optimizer. The "how" a pipe syntax describes roughly aligns with how a human would naturally want to construct a mechanized process to sort through a file cabinet. Which may have been why it's been such a big hit for Unix and functional programming, despite poorly mapping to the machine model before the optimizations of modern OSes and functional compilers came along.
Any choice whatsoever might give a false impression of the machine model. Maybe SQL is doing the right thing by essentially having its syntax elements in random order, thwarting any attempts by developers to relate what they are typing to what is actually happening, but I think the authors of this paper are right in their desire to make it more ergonomic.
I have tests. I have reusable blocks (SQL functions, WITH blocks and views). I don't have logging though.
I can put the result in a non-materialized view and have it update in real time as the data changes. Or I can toggle it to materialized view and now it's snapshotted data.
Finally, views that depend on views that depend on views get automatically optimized by the query planner. You need a lot of very tricky custom code to start approaching that.
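A minimal Postgres-flavored sketch of that view stacking (names are made up):

CREATE VIEW recent_orders AS
SELECT * FROM orders
WHERE created_at > now() - interval '30 days';

CREATE VIEW recent_totals AS
SELECT customer_id, SUM(amount) AS total
FROM recent_orders
GROUP BY customer_id;
-- each query against recent_totals is planned across the whole view stack

-- toggle to a snapshot:
CREATE MATERIALIZED VIEW recent_totals_snap AS
SELECT * FROM recent_totals;
REFRESH MATERIALIZED VIEW recent_totals_snap;  -- re-snapshot on demand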
Before, you would write a Java/C++ class that would do the map/reduce job for you, distributed over 100s of CPUs. And you would feel like you were on the bleeding edge doing innovative stuff.
Turns out that SQL is a perfect API for map/reduce.
Everything you write in SQL can be transformed into a massively parallel job. And you don't even know about it.
This is the secret behind BigQuery and Trino/Presto/Athena.
If I need to join a million rows to another table and then calculate an aggregate value, and do it all in a single query, it's fast. It might only take 0.01 seconds.
Whereas if I make separate queries to retrieve a million rows, and a million rows from another table, then it's incredibly slow just due to the data transfer. No matter how conceptually easier another language may be. So it might take 30 seconds for repeated sequential serialization and network and deserialization.
And even if you're looking up individual rows (not a million), with lots of joins that would be lots of round-trips to the database which multiplies latency and load -- so even if you can still get your final result quickly (e.g. 0.01 seconds rather than 0.001 seconds), the load you can handle drops by the same multiple (e.g. you can now only serve 10% as many users).
The general rule of thumb with databases is that they can be super-fast when everything is done on the database machine in a single query and your final result is a small amount of data (assuming everything is indexed properly and the query is written properly). But they become super-slow when you're doing a bunch of queries in a row, and where those intermediate queries can return massive amounts of data, or (even worse) need to send back massive amounts of data.
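That is, keep it in one round trip (hypothetical schema):

SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region;
-- the million-row join and the aggregation stay on the database server;
-- only a handful of summary rows cross the network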
We would invent the underlying engines for sure but not the language on top of it. It doesn't map at all to how it's actually used by programmers. SQL is the JS to WebAssembly, being able to write the query plan directly via whatever language or mechanism you prefer would be goated.
It has to be my biggest pain point dealing with SQL: having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want (dammit!) is unbelievably frustrating.
WITH a AS (select * from tbla),
b AS (select * from tblb)
select * from a join b

My users fell into (for the purposes of this discussion) three categories:
1. Analysts who prefer sheets
2. Data scientists that prefer pandas
3. Engineers who prefer C++/Java/JavaScript/Python
I'm fairly sure SQL isn't the first choice for any of them, but in all three cases a modern vectorized SQL engine will be the fastest option for expressing and executing many analysis and ETL tasks, especially when the datasets don't fit on a single machine. It's also easier to provide a shared pool of compute to run SQL than arbitrary code, especially with low latency.
Even as a query engine developer, I would prefer using a SQL engine. Performing even the basic optimizations a modern engine would perform -- columnar execution, predicate pushdown, pre-aggregation for shuffles, etc -- would be at least a week of work for me. A bit less if I built up a large library to assist.
Has anybody figured out whether they are proposing this pipeline syntax to be mixed with regular compositional forms like CTEs and subqueries? Or is it another limited, top-level syntax, similar to how some DB engines do not allow nesting of CTE syntax?
The only slightly annoying thing with CTEs is that you always have to name them. You might consider this an overhead if the flow is strictly linear. But when the flow is not linear (eg when doing two different aggregations over the same data, to join them later) you need a way to refer to the intended CTE, so always requiring a name does add some consistency and doesn’t seem a big deal overall.
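For instance, the two-aggregations-then-join case needs both names (sketch with made-up tables):

WITH
by_region AS (
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region
),
by_rep AS (
  SELECT rep, region, SUM(amount) AS total
  FROM sales
  GROUP BY rep, region
)
SELECT r.rep, r.total, br.total AS region_total
FROM by_rep r
JOIN by_region br ON br.region = r.region;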
> having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit!
That's not in the domain of SQL. If you're not getting the most optimized query plan, there is something wrong with the DBMS engine or statistics -- SQL, the language, isn't supposed to care about those details.
It's a great format for the problem it solves, but if browsers supported offline-only files the container format wouldn't (and shouldn't) need to be that complicated.
We're moving from SQLAnywhere to MSSQL, and boy, we're adding 2-5 levels of subqueries to most non-trivial queries due to issues like that. Super annoying.
I had one which went from 2 levels deep to 9... not pleasant. CTEs had some issues so couldn't use those either.
Personally I rarely use HAVING and instead use WHERE with subqueries for the following reasons:
1-I don't like repeating/duplicating a bunch of complex calcs; it's easier to just do WHERE in the outer query on the result (see the sketch after this list)
2-I typically have outer queries anyway for multiple reasons: break logic into reasonable chunks for humans, also for join+performance reasons (to give the optimizer a better chance at not getting confused)
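A sketch of reason 1, with a made-up calculation:

-- HAVING forces the calc to be written twice:
SELECT region, SUM(net * fx_rate) AS revenue
FROM sales
GROUP BY region
HAVING SUM(net * fx_rate) > 1000;

-- outer-query WHERE filters on the alias instead:
SELECT * FROM (
  SELECT region, SUM(net * fx_rate) AS revenue
  FROM sales
  GROUP BY region
) t
WHERE t.revenue > 1000;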
Depends on DB engines I suppose. I've come across that certain operations were not allowed in CTEs, and they can be an optimization barrier.
However if your query is dynamically modified at runtime, then CTEs can be a no-go. For example, we have a grid component which first does a count and then only selects the visible rows. This is great if you have expensive subselects as columns in a large table. However to do the counting it turns the main query into a sub-query, and it doesn't handle CTEs.
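Roughly this shape (sketch; whether WITH is allowed inside a subquery varies by engine):

-- the grid's count step wraps the main query:
SELECT COUNT(*) FROM (
  SELECT id, amount FROM invoices WHERE amount > 0
) AS sub;

-- but if the main query starts with a CTE, the naive wrapper produces
-- something the wrapping code (and some engines) can't handle:
SELECT COUNT(*) FROM (
  WITH big AS (SELECT id, amount FROM invoices WHERE amount > 0)
  SELECT * FROM big
) AS sub;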
As others have said, yes. In most cases, the more complex the query the better the result.
> Doesn't working without reusable blocks / tests / logs make development harder?
SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want, and then the RDBMS goes off and finds the most performant way to retrieve the data that matches that shape. It doesn't compare well with procedural programming.
The closest "languages" comparable to SQL are HTML and CSS. However you manage HTML and CSS is also how you can manage SQL.
I don't think that definition of legacy is useful because so many things which hardly anyone calls "legacy" fit the definition - for example: JavaScript as the web standard, cars in cities and two-party democracy.
I think many of us would say that none of these is an ideal solution for the problem being solved, but it's what we are stuck with, and I don't think anyone could call them "legacy systems" until a viable successor is widespread.
How about we get a SQL successor with algebraic data types, true boolean logic (as opposed to SQL's ternary logic), or functional composition? Null values are the bane of any query writer's existence, and we should have a reasonable solution by now...we've already done it with other programming languages.
I already think about SQL like this (as operations on lists/sets). However, thinking of it like that, with previous operations feeding into the next, which is conceptually nice, seems to make some things harder to do and to think about:
> *(the query engine is free to optimize the execution plan as long as the semantics are preserved)
since logically each part between the pipes doesn't know about the others, so global optimizations, such as use of indexes to restrict the result of a join based on the where clause can't be done/is more difficult.
> SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want...
Exactly this. Generally speaking, your SQL queries won't have "bugs" the way that you can create bugs when writing a function.
The challenging parts of building a complex query are usually 1) getting it to function at all (just being a valid SQL statement for your tables that gives you your desired output fields), and 2) making sure it runs performantly on realistically large table sizes (usually in milliseconds as opposed to seconds), which may involve rewriting things like joins vs. subqueries and/or adding indexes.
A lot of bugs in functions come from edge cases or different combinations of paths through code or unexpected combinations of parameter values or math formula errors or whatever... but a SQL query won't usually really have any of those things. It's just a single transformation that basically either works or doesn't.
One issue, that I mentioned in a different comment, is that we have a lot of queries which are used transparently as sub-queries at runtime to get count first, in order to limit rows fetched. The code doing the "transparent" wrapping doesn't have a full SQL parser, so can't hoist the CTEs out.
One performance issue I do recall was that a lateral join of a CTE was much, much slower than just doing 5-6 sub-queries of the same table, selecting different columns or aggregates for each. Think selecting sum packages, sum net weight, sum gross weight, sum value for all items on an invoice.
There were other issues using plain joins, but I can't recall them right now.
https://github.com/google/zetasql/blob/2024.08.2/docs/pipe-s...
FROM r JOIN s on r.Id = s.Id ORDER BY r.Id SELECT *
But the thing is, such changes will break pretty much all existing code, so the author added |> to distinguish, but why not use | instead? Don't make people type one more character, please.
Other engines that implement it are direct competitors in that space: Snowflake, Databricks SQL, BigQuery, Clickhouse, and DuckDB (the only OSS implementation I know). Point is: if you want to compete with Teradata and be a possible migration target, you want to implement QUALIFY.
Anecdote: I went from a company that had Teradata to another where I had to implement all the data stack in GCP. I shed tears of joy when I knew BQ also had QUALIFY. And the intent was clear, as they also offered various Teradata migration services.
I don't know that I'd go so far as to say I 'prefer' this, but there are a lot of PDFs out there, this works fine, and it's a nice change of pace given how much time I spend in front of a monitor / laptop screen.
I disagree that the paper not mentioning ‘OVER’ implies that the paper authors secretly think pipe syntax is a bad idea. They probably just wanted to keep the paper concise, or forgot about that one less-used bit of syntax.
Do you think that ‘OVER’ keyword implies something fundamentally wrong about pipe syntax? If so, how?
It definitely makes things easier to follow, but only for linear, i.e. single-table, transformations. The moment joins of multiple tables come into the picture, things become hairy quickly, and then you actually start to appreciate plain old SQL, which accounts for exactly this and allows you to specify column aliases in the entire CTE clause. With this piping you lose scope of the table aliases, and then you have to use weird hacks like mangling the names of the joined-in table, as in Polars.
For single-table processing the pipes are nice though. Especially eliminating the need for multiple different keywords for filtering based on the order of execution (where, having, qualify (and the pre-join filter, which is missing)).
A missed opportunity here is the redundant [AGGREGATE sum(x) GROUP BY y]. Unless you need to specify rollups, [AGGREGATE y, sum(x)] is a sufficient syntax for group-bys, and the DuckDB folks got it right in the relational API.
It's difficult to learn and use? Compared to what? The "difficult to extend" might be accurate, but I really question the initial premise here.
Why are our programming languages, which have rich ADTs, Boolean logic, etc. serialising queries into an English syntax written for ad-hoc business intelligence tasks? Why not have a binary query interface to the database that provides a programmatic, rather than human-readable, API?
The first time I got a "too many bind variables" error I was flabbergasted. All I wanted to do was insert a ton of rows into a table. But the database expects me to construct an English sentence containing a placeholder for each value of each row?
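That is, the multi-row prepared INSERT, where placeholders grow as rows × columns (sketch; the exact cap on bind variables is driver/engine specific):

INSERT INTO t (a, b, c)
VALUES (?, ?, ?),
       (?, ?, ?),
       (?, ?, ?);
-- 3 columns x N rows = 3N bind variables; past the cap you have to
-- batch the statement or switch to a bulk-load path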
But this thing resembles other FROM-clause-first variants of SQL, thus GP's point about this being just a reordering. GP is right: the FROM clause gets re-ordered to be first, so it's a reordering.
SELECT * FROM (SELECT * FROM ... GROUP BY ...) t WHERE ...
and they decided this was common enough that they would introduce a HAVING clause for this case:
SELECT * FROM ... GROUP BY ... HAVING ...
But the real issue is that in order to make operations in certain orders, SQL requires you to use subselects, which require restating a projection for no reason and a lot of syntactical ceremony. E.g. you must give the FROM item a name (t), but it's not required for disambiguation.

Another common case is projecting before the filter. E.g. you want to reuse a complicated expression in the SELECT and WHERE clauses. Standard SQL requires you to repeat it or use a subselect since the WHERE clause is evaluated first.
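A sketch of that last case (made-up expression; the pipe version assumes the proposal's EXTEND operator):

-- standard SQL: repeat the expression...
SELECT (a*b + c) / d AS score FROM t WHERE (a*b + c) / d > 10;

-- ...or pay the subselect ceremony:
SELECT score FROM (SELECT (a*b + c) / d AS score FROM t) u WHERE score > 10;

-- pipe syntax: project first, then filter on the alias
FROM t
|> EXTEND (a*b + c) / d AS score
|> WHERE score > 10
|> SELECT score;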
You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.
Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.
But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).
So all in all I agree with you: keep HAVING.
Doesn't look compelling for the lockin.
All these years I've been doing that reordering and didn't even realize!
I vaguely remember someone telling me that it has mathematical underpinnings.
(Disclaimer: I'm an engineer at Microsoft, and I use Kusto basically every day)
This seems to me to be a deliberate design choice. Microsoft doesn't want engineers mutating the databases by hand. There are mechanisms to do that (mostly outside of Kusto, and usually to resolve privacy incidents), but the common case of querying is not supposed to allow for arbitrary changes to the database.
Yes, it's an extension (available by default), which means you can freely mix with regular SQL and use pipes for just parts of your query.
https://github.com/google/zetasql/blob/2024.08.2/docs/pipe-s...
> Pipe syntax can be mixed with standard syntax in the same query. For example, subqueries can use different syntax from the parent query.
FROM r JOIN s on r.Id = s.Id ORDER BY r.Id SELECT *
would indeed be pretty much that (linebreaks optional):
FROM r JOIN s on r.Id = s.Id
|> ORDER BY r.Id
|> SELECT *
The question about “typing one more character” is answered in the paper's section 4.1.5 “Why use ‘|>’ for the pipe character?” (page 6): “The most natural and obvious choice would be to use ‘|’. Unfortunately…” — they don't want to break existing queries that use | for bitwise OR.

(I wonder if one day, if the majority of users are using pipe syntax and don't mind updating their old queries or adding a special directive to enable single pipe…)
That's my point. I think we've reached the point where SQL the language can be more of a hindrance than a help, because in a lot of cases we're writing directly to the engine but with oven mitts on. If I could build the query from the tree with scan, filter, index scan, cond, merge join as my primitives, it would be so nice.
I like the idea of CTE's, but I typically use temp tables instead of CTE's to avoid optimizer issues.
Improved UX at query development time is nice, but data teams need so much more than this. Stuff like lineage, linting, cross-dialect interop - these are no longer niche concerns, they are table stakes, and they come from having a shared syntax for data transformation.
At least in SQL Server CTE's are syntax level, so multiple uses of a CTE in a query causes it to get expanded in each of those places, which typically increases the complexity of the query and can cause issues with the optimizer and performance.
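E.g. (sketch):

WITH expensive AS (
  SELECT customer_id, SUM(amount) AS total
  FROM orders
  GROUP BY customer_id
)
SELECT a.customer_id
FROM expensive a
JOIN expensive b ON b.total > a.total;
-- SQL Server inlines `expensive` at both references, so the aggregation
-- can run twice; a temp table would compute it once and reuse it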
#!/bin/bash
function csv_to_parquet() {
file_path="$1"
duckdb -c "COPY (SELECT * FROM read_csv_auto('$file_path')) TO '${file_path%.*}.parquet' (FORMAT PARQUET);"
}

Could it be run on untrusted user input? Sure. Does it actually pose a threat? It's improbable.
Section 2.1.4 of the paper lists the benefits of the pipe syntax over CTEs, and they are all based on ergonomics. As someone who has never had issues with the ergonomics of CTEs, I must say I am not convinced that the proposed syntax is better. It may be that I've been doing SQL for so long that I don't see its warts. Overall SQL feels like a very well designed and consistent language to me. The new pipe syntax appears to bolt an imperative construct onto an otherwise purely functional language.
My opinion is hardly uncommon. If you read over https://www.reddit.com/r/datascience/comments/c3lr9n/am_i_th... you will find many in agreement. Of those who "like" Pandas, it is often only a relative comparison to something worse.
The problems of the Pandas API were not intrinsic nor unavoidable. They were poor design choices probably caused by short-term thinking or a lack of experience.
Polars is a tremendous improvement.
On eager vs lazy evaluation -- pytorch defaulting to eager seemed to be part of the reason it was popular. Adding optional lazy evaluation to improve performance later seems to have worked for them.
It's not imperative. The pipe symbol is a relational operator that takes one table as input and produces one as output. It's still purely functional, but it has the advantage of making the execution order obvious. That is, the order is a linear top-down flow, not the inside-out flow implicit in vanilla SQL. Further, when your wanted flow doesn't match vanilla SQL's implicit ordering, you don't have to invent CTEs to wire up your flow. You just express it directly.
As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items. You must create a CTE to hold the top items and then wire it into a second SELECT statement to compute the stats. That's busywork. With pipe syntax, there's no need to invent that intermediate CTE.
Using an AI model that ate up half the internet to produce an efficient autocomplete is one path... Improving the language is another...
Maybe I used the wrong term. In my mental model, the query planner decides the order in which the query is evaluated based on what table stats predict is the most efficient query plan, and I actually don't really want to think about the order too much. For example, if I create a CTE, I don't necessarily want it to be executed in that order. Maybe a condition on the later query can be pushed back into the earlier CTE so that less data needs to be scanned.
I will admit that technically there should be no difference in how a query planner handles either. But to me the pipe syntax does not hint as much at these non-linear optimizations than CTEs do. I called the CTE syntax more functional as it implies less to me.
> but it has the advantage of making the execution order obvious.
So we're back to ergonomics which I just never had an issue with...
> As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items.
Could I not compute the stats over all values, then order and limit them, and depend on the query planner to not do the stat calculation for items outside the limit? If the order/limit does not depend on a computed statistic that should be possible? Or does that not happen in practice?
I think it's unfair to expect the research team to invest additional hours in learning how to make good websites, so to solve your problem would require hiring additional talent whose only job is to translate academic PDFs into accessible web pages. I don't think that's a bad idea, and certainly Google has the funds to do something like that, but I don't imagine they'd find it to be a good use of money. Accessibility is an afterthought for most major companies these days.
As a concrete example, consider computing the average sales volume by category for the top 100 items. Here's the vanilla SQL for it:
WITH
TopItems AS (
SELECT category, sales_volume
FROM Items
ORDER BY sales_volume DESC
LIMIT 100
)
SELECT category, AVG(sales_volume) AS avg_sales_volume
FROM TopItems
GROUP BY category;
Because ORDER/LIMIT processing is implicitly last in vanilla SQL, if you need to do anything after that processing, you must do it in a new SELECT statement. Thus you must capture the ORDER/LIMIT results (e.g., as a CTE or, heaven forbid, as a nested SELECT statement) and then wire those results into that new SELECT statement via its FROM clause.

In contrast, with SQL pipes you can express any ordering you want, so you can feed the ORDER/LIMIT results directly into the statistical computations:
FROM Items
|> ORDER BY sales_volume DESC
|> LIMIT 100
|> AGGREGATE AVG(sales_volume) AS avg_sales_volume
GROUP BY category
That's way simpler and the data flows just as it reads: from top to bottom.

let A = select * from tbla
let B = select * from tblb
let C = select * from A join B