(pdf) https://www.redshiftresearchproject.org/white_papers/downloa...
(html) https://www.redshiftresearchproject.org/white_papers/downloa...
I've been told, very kindly, by a couple of people that it's the best explanation they've ever seen. I'd like to get more eyes on it, to pick up any mistakes, and it might be useful in and of itself to readers anyway, as I believe MVCC on Redshift is the same as MVCC was on Postgres before snapshot isolation was introduced.
https://aws.amazon.com/about-aws/whats-new/2022/05/amazon-re...
For concurrency scalability, AWS now configures SNAPSHOT ISOLATION by default if you use Redshift Serverless, but non-serverless still defaults to SERIALIZABLE ISOLATION.
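If I remember the syntax right, on a provisioned cluster you can opt in per database with ALTER DATABASE (worth verifying against the current Redshift docs; `mydb` is a placeholder name):

    -- Switch an existing Redshift database to snapshot isolation
    -- (the change may need exclusive access; check the ALTER DATABASE docs).
    ALTER DATABASE mydb ISOLATION LEVEL SNAPSHOT;

    -- And back, if you want full serializability:
    ALTER DATABASE mydb ISOLATION LEVEL SERIALIZABLE;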
https://aws.amazon.com/blogs/database/manage-long-running-re...
I think this is the best video on that topic: https://www.youtube.com/watch?v=b2F-DItXtZs
I had a similar personal experience. In my previous job we used Postgres to implement a task queuing system, and it created a major bottleneck, resulting in tons of concurrency failures and bloat.
And most dangerously, the system failed catastrophically under load. As the load increased, most transactions ended in concurrency failures, so very little actual work got committed. That grew the backlog of outstanding tasks, resulting in an even higher rate of concurrency failures.
And this can happen suddenly: one moment the system behaves well, with tasks being processed at a good rate, and the next moment the queue blows up and nothing works.
I re-implemented this system using pessimistic locking, and it turned out to work much better. Even under very high load, the system could still make forward progress.
The downside was having to make sure that no deadlocks could happen.
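For anyone curious, the standard Postgres idiom for this is SELECT ... FOR UPDATE SKIP LOCKED. A minimal sketch, assuming a hypothetical `tasks` table (our real schema was different):

    -- Hypothetical queue table.
    CREATE TABLE tasks (
        id         bigserial PRIMARY KEY,
        payload    jsonb NOT NULL,
        claimed_at timestamptz
    );

    -- Each worker claims one unclaimed task under a row lock. SKIP LOCKED
    -- (Postgres 9.5+) skips rows locked by other workers instead of
    -- blocking or aborting, so the queue keeps moving under load.
    UPDATE tasks
    SET claimed_at = now()
    WHERE id = (
        SELECT id
        FROM tasks
        WHERE claimed_at IS NULL
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, payload;

Deadlocks stay manageable with this pattern because each worker only ever locks one row at a time.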
We have been working on automatic database optimization using AI/ML for a decade at Carnegie Mellon University [1][2]. This is not a gimmick. Furthermore, as you can see from the many comments here, the problem is not overhyped.
I admit that's not precisely how I described it in the GP comment but it never crossed my mind that anyone would care. Commenter objections never fail to surprise!
Edit: I think I was right that it was in their interest as well as all of ours, because earlier the thread was dominated by complaints like this:
https://news.ycombinator.com/item?id=35718321
https://news.ycombinator.com/item?id=35718172
... and after the change, it has been filling up with much more interesting on-topic comments. From my perspective that's a win-win-win, but YMMV.
> A better approach is to use an AI-powered service to automatically determine the best way to optimize PostgreSQL. This is what OtterTune does. We’ll cover more about what we can do in our next article. Or you can sign up for a free trial and try it yourself.
That was removed after the article was posted to HN, at dang's suggestion - he posted about it elsewhere in these comments.
But I can't find the OtterTune GitHub page
Is any part of Ottertune open source?
https://www.reddit.com/r/programming/comments/4vms8x/why_we_...
https://www.postgresql.org/message-id/5797D5A1.5030009%40agl...
> The Uber guy is right that InnoDB handles this better as long as you don't touch the primary key (primary key updates in InnoDB are really bad).
> This is a common problem case we don't have an answer for yet.
It's still not how I remember it.
Quote from https://www.postgresql.org/message-id/flat/579795DF.10502%40...
I still prefer Postgres by a long way as a developer experience, for the sophistication of the SQL you can write and the smarts in the optimizer. And I'd still pick MySQL for an app which expects to grow to huge quantities of data, because of the path to Vitess.
https://www.postgresql.org/docs/current/runtime-config-resou... tells you what all the parameters do, but not why and how to change them.
"If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system." Why not set it to 25% of the memory in my system by default, then?
"Sets the base maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files. If this value is specified without units, it is taken as kilobytes. The default value is four megabytes (4MB)." Yes, and? Should I set it higher? When?
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv... hasn't been updated for two years and explains only a handful of parameters.
"If you do a lot of complex sorts, and have a lot of memory, then increasing the work_mem parameter allows PostgreSQL to do larger in-memory sorts which, unsurprisingly, will be faster than disk-based equivalents." How much is a lot? Do I need to care if I'm running mostly OLTP queries?
"This is a setting where data warehouse systems, where users are submitting very large queries, can readily make use of many gigabytes of memory." Okay, so I need to set it higher if I'm running OLAP queries. But how high is too high?
https://wiki.postgresql.org/wiki/Performance_Optimization is just a collection of blog posts written by random (probably smart) people that may or may not be outdated.
So when someone complains their Postgres instance runs like ass and smug Postgres weenies tell them to git gud at tuning, they should be less smug: if your RDBMS requires extensive configuration to support nontrivial loads, you should either make that configuration the default or, if it differs significantly across load profiles, put a whole section in the manual covering day 1 and day 2 operations.
I remember people stating that transactions are useless, and maybe they are for some workloads; see the success of MongoDB years later.
The transactional engine InnoDB was added in version 3.23 [1], in 2001 [2].
[1] https://en.wikipedia.org/wiki/Comparison_of_MySQL_database_e...
Later on, VACUUM has to plow through everything and check the oldest running transaction to see whether the tuple can be "frozen" (old enough to be seen by every transaction, and not yet deleted) or the space reclaimed as usable (deleted and visible to nothing). Index tuples likewise must be pruned at this time.
In systems with an UNDO log, the tuple is mutated in place and the contents of the old version placed into a sequential structure. In the case where the transaction commits, and no existing concurrent repeatable read level transactions exist, the old version in the sequential structure can be freed, rather than forcing the system to fish around doing garbage collection at some later time to obsolete the data. This could be considered "optimized for commit" instead of the memorable "optimized for rollback."
On the read side, however, you need special code to fish around in UNDO (since the copy in the heap is uncommitted data at least momentarily) and ROLLBACK needs to apply the UNDO material back to the heap. Postgres gets to avoid all that, at the cost of VACUUM.
[1] The exception is "HOT" (heap only tuple) chains, which if you squint look a tiny bit UNDO-y. https://www.cybertec-postgresql.com/en/hot-updates-in-postgr...
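You can watch the Postgres side of this from psql with the standard statistics views (a sketch; the stats update asynchronously, so counts can lag a moment):

    -- Every UPDATE writes a new tuple version; the old one becomes "dead"
    -- and sits in the heap until VACUUM reclaims the space.
    CREATE TABLE t (id int PRIMARY KEY, v int);
    INSERT INTO t SELECT g, 0 FROM generate_series(1, 100000) g;
    UPDATE t SET v = 1;  -- heap now holds ~100k live + ~100k dead tuples

    SELECT n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 't';

    VACUUM t;  -- scans heap and indexes, marks dead-tuple space reusable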
There are multiple years available for his first DB class but that’s the one I watched. I almost called it his ‘basic’ class but there’s literally only like one or two classes on SQL before he dives into all the various layers of the internals.
There are also a few of his advanced courses. And then you'll see guest lectures from industry on just about every one of the new DB platforms you can think of.
They’re all under “CMU Database Group” on YouTube.
Highly recommend.