Clocks Are Bad, Or, Welcome to the World of Distributed Systems

>>pharkm+(OP)
Distributed systems design aside, the core of the problem is that they relied on ntp (as they probably should), and in their case ntp was not working properly.

replies(4): >>device+B6 >>scottd+hn >>donava+Up >>global+or

>>pharkm+(OP)
I can't scroll down.

replies(1): >>rossy+eb

>>pharkm+(OP)
I don't get how "clocks are bad" follows from the article.

I get that syncing clocks across systems is hard and when it goes awry, unintended consequences are incurred.

replies(1): >>RickHu+29

>>flavie+t2
And this is precisely why a thing that is not monitored is not actually a thing.

replies(2): >>olefoo+3b >>specia+ln

>>pharkm+(OP)
Is it me, or does the hand waving at the beginning of the article between "write" and "update" smell of bad spin? As a developer I consider both "creates"/"updates" as "writes".

"Riak is designed to accomodate (sp) server and network failures without ever losing committed writes, so this led to a quick response from Basho’s engineers."

As such losing a write to me when I read documentation is losing either a create, update or a delete. Any side affecting operation essentially. Anything that needs to write to disk to record a change...

replies(2): >>macint+P8 >>marshr+Xw

>>pharkm+(OP)
How do other distributed databases handle this?

replies(4): >>krilno+Mc >>hendze+7e >>jbelli+cf >>voidma+DI

>>mey+98
Thanks for catching the spelling error; as much as I pride myself on my spelling, I should let 2013-era tools do their job.

I was concerned that might be interpreted as spin, but I hoped the rest of the article would reinforce the point that there is no way to guarantee an update is preserved in a distributed system without an approach more sophisticated than blindly trusting clocks.

Writes to a new object are inherently less problematic; while it's possible to temporarily receive a negative response about the presence of an object, the data will always be there, barring catastrophic multiple server failure.

Updates can be entirely lost, and that's something that developers and operations people need to be aware of.
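
A minimal sketch of how an update can vanish, assuming a toy last-write-wins register. This is not Riak's actual code; `LWWRegister` and the 30-second skew are hypothetical, chosen only to illustrate the failure mode:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LWWRegister:
    """Toy last-write-wins register: keeps the value with the highest timestamp."""
    value: Optional[str] = None
    timestamp: float = float("-inf")

    def write(self, value: str, timestamp: float) -> bool:
        """Apply the write only if its timestamp is newer; return whether it was kept."""
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp
            return True
        return False


def demo() -> LWWRegister:
    reg = LWWRegister()
    # Node A's clock runs 30 seconds fast (hypothetical skew).
    reg.write("v1", timestamp=1000.0 + 30.0)
    # Node B updates 5 real seconds later with a correct clock.
    # Its timestamp (1005) still looks older than 1030, so the update is dropped.
    kept = reg.write("v2", timestamp=1005.0)
    assert kept is False
    return reg
```

The client on node B gets no error here; the newer value is simply never stored.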

replies(2): >>mey+y9 >>ruroun+ae

>>dustin+A6
There is, in fact, a TL;DR at the end:

> If your distributed database relies on clocks to pick a winner, you’d better have rock-solid time synchronization, and even then, it’s unlikely your business needs are served well by blindly selecting the last write that happens to arrive.

replies(1): >>ssever+rj

>>macint+P8
My apologies for being a little gruff. I am coming at this as someone who is not a Riak user and is currently exploring options for distributed processing of data, as our company's data needs have gotten a bit big. I was just expressing my concern over the complexity of the problem and our understanding of a technical term. It makes it harder to consume documentation on systems for evaluation, to get an idea of how they fail and how to adjust to failure. It may not be rational, but in my gut it causes me concern.

replies(1): >>macint+aa

>>mey+y9
I absolutely understand your concern, and I'll be more cautious in the future. I tend to write with a very casual, informal tone, and data safety is not something to be overly breezy about.

More broadly, as someone who helps write our documentation, it's very difficult to figure out how to present enough detail about the proper ways to use Riak without forcing everyone to become an expert on distributed systems. Unfortunately there are incredibly subtle tradeoffs inherently involved in running a distributed database.

>>device+B6
> "A thing that is not monitored is not actually a thing."

That should be on a cross-stitch sampler on the wall of every NOC.

>>tych0+X4
Same here. On Windows/Chrome 30 with a window size of about 900x950px, I can't scroll down. Increasing the window's width or decreasing the height makes it work again.

>>neolef+K8
Google's Spanner [1] uses something it calls TrueTime:

"The key enabler of these properties is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google’s cluster-management software provides an implementation of the TrueTime API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks)."

[1] Spanner: Google's globally-distributed database https://www.usenix.org/system/files/conference/osdi12/osdi12...
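
To make the "wait out that uncertainty" part concrete, here is a rough sketch of a commit wait. `ToyTrueTime` is a hypothetical stand-in with a fixed uncertainty bound, not Google's API; the real system derives its bound from GPS and atomic clock references:

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class ToyTrueTime:
    """Hypothetical TrueTime-like clock: now() returns an interval, not a point."""

    def __init__(self, epsilon: float, clock=time.time, sleep=time.sleep):
        self.epsilon = epsilon  # assumed worst-case clock error, in seconds
        self._clock = clock
        self._sleep = sleep

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def commit_wait(self) -> float:
        """Pick a commit timestamp, then block until it is definitely in the past."""
        commit_ts = self.now().latest
        while self.now().earliest <= commit_ts:
            self._sleep(max(self.epsilon / 10, 1e-4))
        return commit_ts
```

With a fixed bound the wait is roughly twice `epsilon`, which is why a larger uncertainty directly slows commits down.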

replies(1): >>theatr+2f

>>neolef+K8
Unfortunately, not in a particularly clever way. CP systems such as MongoDB, HBase, etc. don't have this problem since each datum has an authoritative master. As you can imagine, this can result in some operational...unpleasantness due to the lack of liveness guarantees in the presence of a network partition.

Out of the well-known open-source AP systems, Riak is probably the leader here, since it implements well-understood techniques from the literature such as CRDTs and vclocks.

EDIT: removed my statement about Cassandra since it was a bit misleading and jbellis answered above in greater detail.

>>macint+P8
Is there any reason you do not use "create" terminology instead of the possibly-confusing "write"?

I am with op in that I consider an update a write.

"create/update" are both writes

"write/update" ... eh?

>>krilno+Mc
With TrueTime you are trading some latency on concurrent operations for correctness.

Other structures such as CRDTs/lattices might be more appropriate for your use case.
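
For anyone who hasn't seen one, a grow-only counter is about the simplest CRDT. A minimal sketch (the `GCounter` class is illustrative, not taken from any particular library): each node increments only its own slot, and merging takes the per-node maximum, so no clock is needed to reconcile replicas.

```python
from typing import Dict


class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Max is commutative, associative and idempotent, so replicas
        # converge regardless of merge order or repeated merges.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
```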

replies(1): >>madhus+ql

>>neolef+K8
Cassandra offers a mix of commutative operations (sets, maps, increments), an eventlog model, and lightweight (paxos-based) transactions. Unlike a key/value database like Riak, Cassandra can update individual fields of a row or document independently, which simplifies things enormously.

http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-v...

http://www.datastax.com/dev/blog/cql3_collections

http://www.datastax.com/dev/blog/lightweight-transactions-in...

replies(1): >>ssever+yj

>>RickHu+29
In fact I would recommend GPS calibrated hardware clocks with PTP.

replies(3): >>donava+Yp >>marshr+ax >>duaneb+IA

>>jbelli+cf
Cassandra suffers from the same problem and can drop updates. The Paxos transactions were and maybe still are an absolute joke as exposed by Aphyr.

>>theatr+2f
By correctness you mean consistency? You don't have to be consistent all the time, i.e. you can trade consistency, but never correctness.

If we could have traded correctness, we could have optimized everything and gone home by now :)

>>flavie+t2
The key takeaway from the article SHOULD be: don't rely on ntp if you don't have to.

There are people who have to. They run their own atomic clocks, and worry about things such as precision delivery of nuclear ordnance.

Then there's you. You should use vector clocks, with a built-in conflict resolution mechanism based on domain knowledge.

That's the point of the article.
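
A rough sketch of the vector clock comparison, assuming clocks are plain dicts of per-node counters (the function names are made up for illustration). When the result is "concurrent", neither write saw the other, and that is where the domain-specific resolution has to step in:

```python
from typing import Dict

VClock = Dict[str, int]


def increment(clock: VClock, node: str) -> VClock:
    """Return a copy of the clock with this node's counter bumped."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: VClock, b: VClock) -> bool:
    """True if a has seen every event that b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VClock, b: VClock) -> str:
    """Classify a relative to b: 'equal', 'after', 'before' or 'concurrent'."""
    a_sees_b = descends(a, b)
    b_sees_a = descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "after"
    if b_sees_a:
        return "before"
    return "concurrent"
```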

>>device+B6
Nice. Much stronger than the "you can only manage what you measure" adage I learned from accounting.

>>pharkm+(OP)
Decades ago Albert Einstein introduced the general theory of relativity, which already tells us that timestamps are bad for synchronisation.

>>flavie+t2
Ntp is good. Assuming that time is coordinated, much less monotonically increasing, is a bad plan. Just the other week I got paged in the middle of the night because a clock moved backwards.

>>ssever+rj
As last summer's leap second fiascos demonstrated, even a trusted source isn't enough.

replies(1): >>oh_sig+Xk1

>>flavie+t2
Even if NTP had been working properly, you would not have clocks synchronised at the level of individual ticks - only to the level of time intervals. If two updates happened at roughly the same time, and fell into the same time interval, there would be no way to tell which one happened before the other. A paper by Cilia et al. on timing of composite events in distributed event-based systems using NTP deals with this issue.
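
In code, the idea looks roughly like this, assuming every clock is within `max_error` seconds of true time (the function and parameter names are hypothetical, not from the paper):

```python
from typing import Optional


def order(ts_a: float, ts_b: float, max_error: float) -> Optional[int]:
    """Return -1 if a definitely precedes b, 1 if b definitely precedes a,
    and None when the uncertainty intervals overlap and no order can be claimed."""
    if ts_a + max_error < ts_b - max_error:
        return -1
    if ts_b + max_error < ts_a - max_error:
        return 1
    return None
```

Two timestamps only give a trustworthy order when they differ by more than twice the error bound.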

replies(1): >>Dylan1+zA

>>pharkm+(OP)
How would you get around the unreliability of clocks in VMs? Seems like deploying Riak in the cloud could be problematic.

replies(1): >>hectca+Id1

>>mey+98
Possibly even more folks are familiar with the terms 'insert' and 'update'.

Also, s/side affecting/side effecting/.

>>ssever+rj
The point is not that time synchronization is inherently bad, only that it's usually not the correct thing for a distributed database to resolve update conflicts with.

replies(1): >>ssever+Z01

>>global+or
But this is not a problem in many situations, whereas successor writes failing within an entire 30-second span is a pretty big problem.

>>ssever+rj
Ok, google.

>>pharkm+(OP)
Dumb question: What breaks with the following approach?

1) set last_write_wins=true (so all updates always apply, as described in the article)

2) avoid the "partition/rejoin may cause old values to stomp on new" issue by having "rejoin detection" which refuses to rejoin if clocks are "too out of sync"
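
For concreteness, the check in step 2 might look something like the sketch below. `may_rejoin` and its parameters are hypothetical, this is not a Riak feature, and measuring a peer's clock offset over the network is itself only accurate to within the network delay:

```python
from typing import Iterable


def may_rejoin(peer_offsets: Iterable[float], max_skew: float) -> bool:
    """Allow a rejoin only if every measured peer clock offset (in seconds,
    relative to the local clock) is within max_skew."""
    return all(abs(offset) <= max_skew for offset in peer_offsets)
```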

>>pharkm+(OP)
Mars design. Assume one server is on Mars, with associated time dilation on its clocks and latency.

>>neolef+K8
FoundationDB provides real ACID transactions and external consistency, and definitely does NOT rely on clock accuracy for soundness! (Google Spanner, which we are often compared to, does use a trusted clock, but Google went to extreme measures to make it accurate, including installing atomic clocks and GPS hardware.)

As for how, it's a long story. At bottom we rely on Paxos for consistency across failures, but we only actually do Paxos when there are failures. (We use less costly synchronous techniques for replication in "happy times".)

>>marshr+ax
Yes, I completely agree. I fail to see how anyone would think that using a clock as a source of truth in a distributed system would be in any way a good idea. As for PTP, it would be too expensive to deploy at large scale, which was some of the motivation (I believe) behind TrueTime.

replies(1): >>marshr+e61

>>ssever+Z01
To be fair, time was considered to provide a pretty universal total ordering up until fairly recently, i.e., 1903.

>>joaoms+gu
You'd change the settings detailed in the "Stopping Last Write Wins" section of the original post.

>>donava+Yp
They are when you know that the leap-(nanosecond/second/minute/day) is coming up. When you know it is coming, you can "smear" the time difference over, let's say, the entire year, so when it happens, every system behaves correctly.
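
A minimal sketch of a linear smear, assuming you pick the window yourself (`smear_offset` and its parameters are illustrative, and real deployments differ in window length and shape):

```python
def smear_offset(t: float, window_start: float, window_length: float,
                 leap: float = 1.0) -> float:
    """Fraction of the leap (in seconds) already applied at time t,
    spread linearly across the smear window."""
    if t <= window_start:
        return 0.0
    if t >= window_start + window_length:
        return leap
    return leap * (t - window_start) / window_length
```

Every machine applying the same function sees time slow down slightly and identically, instead of one second repeating or disappearing.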
