zlacker

[parent] [thread] 14 comments
1. valent+(OP)[view] [source] 2025-05-22 01:18:45
Except that there are collisions...
replies(2): >>datame+d1 >>ww520+Ft1
2. datame+d1[view] [source] 2025-05-22 01:32:20
>>valent+(OP)
This might be completely naive, but could a reversible time component be incorporated to distinguish two hash calculations? Meaning, when unpacked/extrapolated it is a unique signifier, but when decomposed it folds back into the standard calculation. Is this feasible?
replies(2): >>ruined+xk >>shakna+1l
3. ruined+xk[view] [source] [discussion] 2025-05-22 05:47:36
>>datame+d1
hashes by definition are not reversible. you could store a timestamp together with a hash, and/or you could include a timestamp in the digested content, but the timestamp can’t be part of the hash.
replies(2): >>RetroT+a51 >>datame+JD6
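A minimal sketch of the two options in Python (the record layout is illustrative, not from the thread):

    import hashlib, time

    data = b"some block of data"

    # Option 1: store a timestamp alongside the hash, as metadata.
    record = {
        "sha256": hashlib.sha256(data).hexdigest(),
        "stored_at": time.time(),  # kept next to the digest, not inside it
    }

    # Option 2: include the timestamp in the digested content. Note the
    # digest now changes with the timestamp, so two copies of identical
    # data no longer hash to the same value.
    ts = str(time.time()).encode()
    record2 = {"sha256": hashlib.sha256(ts + data).hexdigest(), "stored_at": ts}
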
4. shakna+1l[view] [source] [discussion] 2025-05-22 05:54:16
>>datame+d1
Some hashes do have verification bits, used not just to verify that a hash is intact, but to distinguish one "identical" hash from another. However, they do tend to be slower hashes.
replies(1): >>grumbe+Mm
5. grumbe+Mm[view] [source] [discussion] 2025-05-22 06:15:24
>>shakna+1l
Do you have an example? That just sounds like a hash that is a few bits longer.
replies(1): >>shakna+On
6. shakna+On[view] [source] [discussion] 2025-05-22 06:27:14
>>grumbe+Mm
Mostly via GCM (Galois/Counter Mode). Usually you tag the key, but you can also tag the value to check for collisions instead.

But as I said, slow.
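
For reference, a minimal sketch of GCM's authentication tag, using the third-party cryptography package; how this maps onto the parent's tagging scheme is an assumption:

    # pip install cryptography
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=128)
    aesgcm = AESGCM(key)
    nonce = os.urandom(12)  # GCM nonces must never repeat under one key

    # encrypt() appends a 16-byte authentication tag to the ciphertext;
    # decrypt() recomputes and checks it, so a substituted "colliding"
    # value fails verification rather than passing silently.
    ct = aesgcm.encrypt(nonce, b"block of data", None)
    aesgcm.decrypt(nonce, ct, None)  # raises InvalidTag on any mismatch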

7. RetroT+a51[view] [source] [discussion] 2025-05-22 13:44:16
>>ruined+xk
> hashes by definition are not reversible.

Sure they are. You could generate every possible input, compute its hash & compare with the given one.

OK, it might take an infinite amount of compute (time/energy). But that's just a technicality, right?

replies(1): >>dagw+D51
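Made concrete over a deliberately tiny input space (a sketch; real inputs are unbounded, which is the whole problem):

    import hashlib
    from itertools import product

    target = hashlib.sha256(b"cat").hexdigest()

    # "Reverse" the hash by enumerating every 3-letter lowercase input.
    # This terminates only because the search space is artificially small.
    for candidate in product(b"abcdefghijklmnopqrstuvwxyz", repeat=3):
        guess = bytes(candidate)
        if hashlib.sha256(guess).hexdigest() == target:
            print("found a preimage:", guess)  # b'cat'
            break
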
8. dagw+D51[view] [source] [discussion] 2025-05-22 13:47:43
>>RetroT+a51
> Sure they are. You could generate every possible input

Depends entirely on what you mean by reversible. For every hash value, there are an infinite number of inputs that give that value. So while it is certainly possible to find some input that hashes to a given value, you cannot know which input I originally hashed to get that value.
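
The many-to-one point is easy to see with a truncated hash; a sketch using the first 16 bits of SHA-256 (a collision here says nothing about the full hash):

    import hashlib

    def h16(data: bytes) -> bytes:
        # First 2 bytes of SHA-256: a 16-bit "hash", only 65536 values.
        return hashlib.sha256(data).digest()[:2]

    seen = {}
    i = 0
    while True:
        msg = str(i).encode()
        d = h16(msg)
        if d in seen and seen[d] != msg:
            print(seen[d], "and", msg, "share digest", d.hex())
            break
        seen[d] = msg
        i += 1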

9. ww520+Ft1[view] [source] 2025-05-22 16:16:24
>>valent+(OP)
You can use cryptographic hashing.
replies(1): >>anonym+IC1
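E.g. (a sketch) keying blocks by a cryptographic digest instead of a fast non-cryptographic hash:

    import hashlib

    def block_key(block: bytes) -> str:
        # SHA-256 as the dedup key: unlike CRC32 or other fast hashes,
        # collisions can't feasibly be constructed on purpose.
        return hashlib.sha256(block).hexdigest()
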
10. anonym+IC1[view] [source] [discussion] 2025-05-22 17:02:11
>>ww520+Ft1
How does that get around the pigeonhole principle?

I think you'd have to compare the data value before purging, and you can only do the deduplication (purge) if the block is actually the same; otherwise you have to keep the block (you can't replace it with the hash, because the hash link in the pool points to different data).

replies(1): >>ww520+FQ2
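A minimal sketch of that compare-before-purge rule (the pool layout is illustrative):

    import hashlib

    pool = {}  # digest -> the one canonical copy of a block

    def dedup(block: bytes):
        key = hashlib.sha256(block).hexdigest()
        existing = pool.get(key)
        if existing is None:
            pool[key] = block  # first time we see this digest: keep it
            return key
        if existing == block:
            return key   # true duplicate: safe to purge this copy
        return None      # collision: same digest, different bytes, must keep
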
11. ww520+FQ2[view] [source] [discussion] 2025-05-23 01:00:15
>>anonym+IC1
The hash collision chance is extremely low.
replies(1): >>valent+ES2
12. valent+ES2[view] [source] [discussion] 2025-05-23 01:21:47
>>ww520+FQ2
For small amounts of data, yeah. As the data grows, the chance of a collision grows more than proportionally. So in the context of storage systems (like S3), that won't work unless customers actually accept the risk of a collision. For example, when storing media data (movies, photos) I could imagine that, but not for data in general.
replies(1): >>ww520+4X2
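The growth is the birthday bound: with n random blocks and a b-bit hash, P(collision) ~ 1 - exp(-n^2 / 2^(b+1)), so it grows with n^2. A quick check:

    import math

    def p_collision(n: float, bits: int) -> float:
        # Birthday approximation: 1 - exp(-n^2 / 2^(bits+1));
        # expm1 keeps precision when the probability is tiny.
        return -math.expm1(-(n * n) / 2.0 ** (bits + 1))

    for n in (1e9, 1e15, 1e18):
        print(f"n={n:.0e}  128-bit: {p_collision(n, 128):.3e}  "
              f"256-bit: {p_collision(n, 256):.3e}")
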
13. ww520+4X2[view] [source] [discussion] 2025-05-23 02:07:27
>>valent+ES2
The chance of a cryptographic hash collision is very, very small, like many times the age of the universe small. It's smaller than the chance of AWS burning down and all backups being lost, leading to data loss.
replies(1): >>valent+0a3
14. valent+0a3[view] [source] [discussion] 2025-05-23 04:59:03
>>ww520+4X2
You have a point.

If AWS S3 applied this technique with MD5 (128-bit), it would only get a handful of collisions. Using 256 bits would drive that down to a level where any collision is very unlikely.

This would be worth it if a 4 KB block is, on average, duplicated with a chance of at least 6.25% (not considering the overhead of data structures, etc.).
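
Rough numbers behind the collision claims; the block counts are assumptions for illustration, not real S3 figures:

    # Expected number of colliding pairs among n blocks with a b-bit
    # hash is roughly n^2 / 2^(b+1).
    for n in (10 ** 15, 10 ** 20):
        for bits in (128, 256):
            print(f"n={n:.0e}, {bits}-bit: "
                  f"~{n * n / 2.0 ** (bits + 1):.3e} expected collisions")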

15. datame+JD6[view] [source] [discussion] 2025-05-24 19:45:59
>>ruined+xk
Oh, of course, the timestamp could instead be metadata!