For algorithms, a little memory outweighs a lot of time

>>makira+(OP)
Lookup tables with precalculated things for the win!

In fact, I don’t think we would need processors anymore if we centrally stored the result of every operation ever done in our processors.

Now fast retrieval is another problem for another thread.

>>whatev+ti
Reminds me of when I started working on storage systems as a young man and once suggested pre-computing every 4KB block once and just using pointers to the correct block as data is written, until someone pointed out that the number of unique 4KB blocks (2^32768) far exceeds the number of atoms in the universe.
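
As a rough check of that count: a 4KB block is 4096 bytes, or 32768 bits, so there are 2^32768 possible blocks. The sketch below compares the size of that number with the commonly quoted estimate of about 10^80 atoms in the observable universe (an assumption, not a figure from this thread).

```python
import math

BLOCK_BYTES = 4096                 # 4KB block, as in the comment above
BLOCK_BITS = BLOCK_BYTES * 8       # 32768 bits per block

# Number of decimal digits in 2**BLOCK_BITS, computed via logarithms
# so the huge integer never has to be converted to a string.
digits_in_block_count = math.floor(BLOCK_BITS * math.log10(2)) + 1

ATOMS_EXPONENT = 80                # assumption: ~10**80 atoms in the observable universe

print(f"bits per block: {BLOCK_BITS}")
print(f"2**{BLOCK_BITS} has {digits_in_block_count} decimal digits")
print(f"the atom estimate has {ATOMS_EXPONENT + 1} decimal digits")
```

The block count has thousands of decimal digits, against 81 for the atom estimate.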

>>crmd+Jz
The idea is not too far off. You could compute a hash of an existing data block and store the mapping from hash to block. You can then use the hash anywhere that data block resides, i.e. all duplicate data blocks can share the same hash. That's how storage deduplication works, in a nutshell.
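
A minimal sketch of that scheme, assuming SHA-256 as the hash and an in-memory dict as the block pool. The `DedupStore` class and its method names are hypothetical, made up for illustration. This toy version trusts the hash completely and never compares block contents.

```python
import hashlib


class DedupStore:
    """Toy content-addressed block store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}   # hash digest -> block bytes, each unique block stored once
        self.layout = []   # digests in write order, describing the logical data

    def write(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        # Keep the block only the first time this digest is seen.
        self.blocks.setdefault(digest, block)
        self.layout.append(digest)
        return digest

    def read_all(self) -> bytes:
        # Rebuild the logical data by following the stored digests.
        return b"".join(self.blocks[d] for d in self.layout)


store = DedupStore()
for block in (b"A" * 4096, b"B" * 4096, b"A" * 4096):
    store.write(block)

print(len(store.layout), "blocks written,", len(store.blocks), "blocks stored")
```

Three blocks are written but only two are kept, because the first and third have the same digest.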

>>ww520+ND
Except that there are collisions...

>>valent+8E
Can use cryptographic hashing.

>>ww520+N72
How does that get around the pigeonhole principle?

I think you'd have to compare the data value before purging, and you can only do the deduplication (purge) if the block is actually the same; otherwise you have to keep the block (you can't replace it with the hash, because the hash link in the pool points to different data).
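
One way to sketch that compare-before-purge rule: keep a bucket of blocks per hash, and only reuse an existing entry on a byte-for-byte match. The function names and the `(digest, slot)` reference format are assumptions made up for this example. `weak_digest` is a deliberately bad 8-bit hash, used only to force a collision.

```python
import hashlib


def sha256_digest(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


def weak_digest(block: bytes) -> bytes:
    # Deliberately terrible 8-bit "hash" so that collisions are easy to trigger.
    return bytes([sum(block) % 256])


def write_block(pool, block, hash_fn=sha256_digest):
    """Return a (digest, slot) reference, deduplicating only on a verified match."""
    digest = hash_fn(block)
    bucket = pool.setdefault(digest, [])
    for slot, existing in enumerate(bucket):
        if existing == block:      # verified duplicate: safe to drop the new copy
            return digest, slot
    bucket.append(block)           # new data, or a hash collision: keep the block
    return digest, len(bucket) - 1


pool = {}
ref_a = write_block(pool, b"\x01\x02", weak_digest)   # byte sum 3
ref_b = write_block(pool, b"\x02\x01", weak_digest)   # byte sum 3, but different data
ref_c = write_block(pool, b"\x01\x02", weak_digest)   # true duplicate of the first block
print(ref_a, ref_b, ref_c)
```

The colliding block gets its own slot, while the true duplicate reuses the first one. The cost is an extra read and compare on every hash hit.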

>>anonym+Qg2
The hash collision chance is extremely low.

>>ww520+Nu3
For small amounts of data, yeah. As the data grows, the chance of a collision grows more than proportionally. So in the context of storage systems (like S3 or so), that won't work unless customers actually accept the risk of a collision. For example, when storing media data (movies, photos) I could imagine that, but not for data in general.
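
A rough sketch of that growth, using the standard birthday-bound approximation and assuming the hash behaves like a uniform random function. The block counts are made-up illustrative values (2^40 blocks of 4KB is about 4 PiB), not real S3 figures.

```python
import math


def collision_probability(num_blocks: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2**(bits+1))."""
    exponent = num_blocks * (num_blocks - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-exponent)


for num_blocks in (2 ** 40, 2 ** 41):   # hypothetical block counts
    for hash_bits in (128, 256):
        p = collision_probability(num_blocks, hash_bits)
        print(f"n=2^{num_blocks.bit_length() - 1}, {hash_bits}-bit hash: p ~ {p:.3e}")
```

Under this approximation, doubling the number of blocks roughly quadruples the collision probability while it stays small, which is the "more than proportional" growth.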

>>valent+Mw3
The chance of a cryptographic hash collision is very, very small: you could wait many lifetimes of the universe before seeing one. It's smaller than the chance of AWS burning down and all backups being lost, leading to data loss.

>>ww520+cB3
You have a point.

If AWS S3 applied this technique with MD5 (128-bit), it would get only a handful of collisions. Using a 256-bit hash would drive that down to a level where any collision is very unlikely.

This would be worth it if a 4KB block is, on average, duplicated with a chance of at least 6.25% (not considering the overhead of data structures etc.).
