Maybe I don't know what I'm talking about (likely the case, as I don't know much about SSD controllers), but I was just thinking about my old Windows Home Server. Maybe I'm not using all the proper terminology, but with its system backup feature and Drive Extender it would use pointers for duplicate data at the block level. So if you had fresh installs of Windows 7 Pro on three PCs, each about 8GB, backing them up wouldn't take 24GB (3x8GB), it would take not much more than 8GB because the files were all similar. It would make one copy, then generate a pointer to the same data for the other two backups.
It seems something similar could be incorporated into SSD controllers to reduce the amount of data written to the SSD, as well as to free up space.
Just an idea. Not that it matters, but it seems like something like this could work.
-
Not an expert on this either, but what you're talking about seems to be akin to data compression?
-
That would be very handy. Imagine carrying around 4x VHDs of virtual Windows Server installs at 20GB each, with just some settings different between them (e.g. testing stuff, school, ..), taking a little over 20GB instead of 80GB. That would really save me some space!
I don't know how this would be different from compression, but let's say I start changing stuff on one virtual machine: what would it do? Write the difference to some sort of file? Then it needs some sort of database to keep track of all those files? I think it would get messy real fast, like Windows-registry messy fast.
Doesn't Microsoft do something similar with RAM since Windows 8..? Where similar processes can use 'pointers', for example 2x iexplore with no open windows would theoretically have the exact same content in RAM.
~Aeny -
Compression just looks for patterns and spaces and packs them together. The pointer approach basically treats each block of data as an entity and can reconstruct data from those blocks. The controller does this somewhat already by storing where the data is held. All they would have to do is change it to examine incoming data and just point to an existing block with the same data/hash.
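Roughly what that write path could look like, as a toy sketch (hypothetical names, plain Python, nowhere near the hardware constraints of a real controller): hash each incoming 4K block, and if that hash has been seen before, just hand back a pointer to the existing block instead of writing it again.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K blocks, as discussed in this thread

class DedupStore:
    """Toy model of write-time block deduplication (not a real FTL)."""

    def __init__(self):
        self.blocks = {}         # physical block id -> data
        self.hash_to_block = {}  # content hash -> physical block id
        self.next_block = 0

    def write_block(self, data: bytes) -> int:
        """Return a block pointer; only store the data if it is new."""
        digest = hashlib.sha256(data).digest()
        if digest in self.hash_to_block:
            return self.hash_to_block[digest]  # duplicate: just point at it
        block_id = self.next_block
        self.next_block += 1
        self.blocks[block_id] = data
        self.hash_to_block[digest] = block_id
        return block_id

store = DedupStore()
first = store.write_block(b"\x00" * BLOCK_SIZE)
second = store.write_block(b"\x00" * BLOCK_SIZE)  # identical block
print(first == second, len(store.blocks))         # True 1 -> stored only once
```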
-
Your proposed method of data deduplication would be excruciatingly slow, as a good deal of the drive would need to be scanned each and every time a 4K page is written to disk in order to look for a match (the worst case is actually when no match exists, since the entire drive would need to be examined to determine that). To write a 1 GB file, you would need to read through the contents of the SSD 262,144 times.
At a typical 500 MB/s read speed, it takes 240 seconds to read through a completely filled 120 GB SSD once. Therefore, in a worst-case scenario, it could potentially take 728 days to write 1 GB.
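Spelling out that arithmetic (same assumptions as above: 4K pages, a completely filled 120 GB drive, 500 MB/s sequential reads):

```python
# Worst-case math for naive linear-scan dedup, as described above.
file_size = 1 * 1024**3           # 1 GB file to write
page_size = 4 * 1024              # 4K pages
pages = file_size // page_size    # 262,144 pages, one full-drive scan each

drive_size = 120 * 1000**3        # 120 GB SSD, completely filled
read_speed = 500 * 1000**2        # 500 MB/s sequential read
scan_time = drive_size / read_speed    # 240 seconds per full scan

total_days = pages * scan_time / 86400
print(pages, int(scan_time), round(total_days))  # 262144 240 728
```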
You could do a lot better performance-wise by having lots of DRAM, by using an O(log n) search algorithm instead of linear search, and by hashing the data, but even with these tricks and under ideal conditions it's unlikely you'd do significantly better than what Sandforce does, given hardware, price, and power constraints. After all, you can't cram a quad-core i7 and 16 GB of RAM into an SSD. -
What you're describing is similar to the block level deduplication that Sandforce drives do (in addition to real time compression). Except they use the extra space for redundancy (RAISE), so it's not visible to the user.
-
It actually sounds a lot like the Sandforce-style compression, except at a very high level. The method you describe is basically a form of deduplicating data, which is the principle behind file compression. The trick with all compression is speed: effective compression is almost always mutually exclusive with speed. Sandforce drives used really powerful ARM CPUs (for the time) to do the compression in real time and do everything you've described, but we all know what the real performance is like when the data is incompressible.
-
You don't need to do this at block level if your goal is to backup OS installs. You can do it on file system level. If two files have matching names, check the hash. When a match is found, add a symbolic link. It should be fast enough with an index.
For a software implementation of block-level dedup, ZFS already has it. (Not available on Windows out of the box, but there are 3rd-party projects.) -
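A rough sketch of that file-level approach (a hypothetical script, not how WHS or ZFS actually do it; it hashes whole files and uses hard links rather than symlinks):

```python
import hashlib
import os
import sys

def dedup_tree(root: str) -> None:
    """Replace duplicate files under `root` with hard links to the first copy seen."""
    index = {}  # sha256 digest -> path of the first file with that content
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in index:
                os.remove(path)
                os.link(index[digest], path)  # point this name at the original data
            else:
                index[digest] = path

if __name__ == "__main__":
    dedup_tree(sys.argv[1])
```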
-
What if you edit one copy of a file slightly? Should the edited file then be considered a completely different file? How about if multiple different files share some similarities (e.g. file metadata) but are otherwise different? Can no gains be made there? If that's the case, Sandforce is far more efficient.
I can't think of any real-world scenario where someone would have a legitimate reason to purposely store enough duplicate copies of a file for file-level deduplication to make a meaningful difference, especially given how far SSD owners tend to go to save just a couple of GBs of precious space. After all, SSDs aren't backup devices; you'd have to be crazy (or crazy rich) to use SSDs for backups given the cost/GB ratio.
Sorry, but this sounds like a solution to a non-existent problem. -
It has nothing to do with backups or duplication. It would save space, and probably even create crazy improvements in write performance. It's not file by file, but block by block.
No, it would be stored like it is currently in the file table. There are tons of examples. If you install a game that has tons of texture files, lots of small blocks of data will be identical. Instead of storing them dozens or hundreds of times, for those parts of the files it just points to the same block of data rather than writing it dozens or hundreds of times for each individual file. Heck, there could be repeating blocks of data within the same file. Even completely unrelated files could have certain segments of identical data. -
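One way to picture it: the controller already keeps a logical-to-physical mapping, so dedup just means letting several logical blocks point at the same physical block. A toy illustration (made-up numbers, not any real FTL layout):

```python
# Toy logical-to-physical mapping with shared physical blocks.
# Three "files" all contain the same 4K texture block.
logical_to_physical = {
    ("textures_a.pak", 0): 17,  # (file, block index) -> physical block id
    ("textures_b.pak", 0): 17,  # identical content, same physical block
    ("textures_c.pak", 5): 17,
    ("savegame.dat", 0): 42,    # unique content gets its own block
}

logical_blocks = len(logical_to_physical)
physical_blocks = len(set(logical_to_physical.values()))
print(logical_blocks, physical_blocks)  # 4 logical blocks stored in 2 physical blocks
```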
I see what you're getting at, but to play devil's advocate, a couple of points:
a) Seems like this concept could apply to any kind of random-access storage mechanism. In other words, if it were worth doing, then HDDs would have implemented it a long time ago. And since they haven't, my guess is there are valid reasons.
b) To implement this, the drive would require a "mapping" tree (which bit pattern goes to what LBA). In order to use this, you would have to create a lookup/mapping hash that told you which byte pattern was stored where on disk. Perhaps someone has tried this, and it turned out the required size and actual management (updates, deletes, balancing, etc.) cost so much to maintain that it hurt actual I/O performance and/or storage requirements.
c) The "all your eggs in one basket" issue. If you were to lose one LBA due to a bad sector / bad NAND page on disk, you could possibly lose many, many, many files. This may lead to some very upset users.
d) Calculating free space. How could a controller know how much free space is available for use if there are bit patterns that may be shared? Don't know if end users would like this aspect either.
Could this be done at the file system level rather than the hardware level? Or would it take so much kernel time on the CPU that the cost exceeds any benefit there as well? For quite some time, Windows has had the option to create a "compressed" drive. I wonder if it uses this technique? Does it compress files at the file level rather than at the file system level? -
Windows doesn't use this technique; its compression level is not that great. Regarding free space, it would know because it would track, as it's writing, whether each operation was an actual write (and how much space it used) or just a pointer. Not that difficult. The all-your-eggs-in-one-basket point makes sense, but make it a user-configurable feature to enable or disable, so you can either get more space with some risk, or regular storage space. They may need a bit more cache to store the cross-reference hash data, but as we know, that is getting cheaper by the day. For HDDs it may not have mattered so much because storage space wasn't at as much of a premium, plus controller performance likely wasn't nearly as fast as in SSDs.
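To sketch that bookkeeping (again just a toy model, not any vendor's design): give each physical block a reference count, count free space by unique physical blocks, and only reclaim a block when its last pointer goes away.

```python
# Reference-counted physical blocks: free space depends on unique blocks stored,
# not on how many logical copies point at them.
TOTAL_BLOCKS = 1000

refcount = {17: 3, 42: 1}  # physical block id -> number of logical pointers

def free_blocks() -> int:
    """Physical blocks still available, regardless of logical usage."""
    return TOTAL_BLOCKS - len(refcount)

def delete_logical(block_id: int) -> None:
    """Drop one logical pointer; reclaim the block only when nobody uses it."""
    refcount[block_id] -= 1
    if refcount[block_id] == 0:
        del refcount[block_id]  # the physical space is actually freed here

print(free_blocks())   # 998
delete_logical(17)
print(free_blocks())   # still 998: two other logical blocks share block 17
delete_logical(42)
print(free_blocks())   # 999: block 42 had a single owner, now reclaimed
```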
-
-
DOH!!! Typo - 15 to 20 MB!!!!! -
-
10MB HDD owner here.
-
-
Getting back on topic: since drive space was at such a premium in those days, any HDD manufacturer that had a way to cram more onto a drive would've sold a ton of them, probably becoming a huge, huge market leader. It's hard to imagine those engineers didn't cover this ground. But who's to say for certain.