Maybe I don't know what I'm talking about (likely the case, as I don't know much about SSD controllers), but I was just thinking about my old Windows Home Server. Maybe I'm not using all the proper terminology, but with its system backup feature and Drive Extender it would use pointers for duplicate data at the block level. So if you had fresh installs of Windows 7 Pro on three PCs, each about 8GB, backing them up wouldn't take 24GB (3x8GB), it would take not much more than 8GB because the files were all similar. It would make one copy, then generate a pointer to the same data for the other two backups.
It seems something similar could be incorporated into SSD controllers to reduce the amount of data written to the SSD, as well as to free up space.
Just an idea. Not that it matters, but it seems like something like this could work.
-
Not an expert on this either, but what you're talking about seems to be akin to data compression?
-
That would be very handy. Imagine carrying around 4x VHDs of virtual Windows Server installs at 20GB each, with just some settings different between them (e.g. testing stuff, school, ..), taking a little over 20GB instead of 80GB. That would really save me some space!
I don't know how this would be different from compression, but let's say I start changing stuff on one virtual machine: what would it do? Write the difference to some sort of file? Then it needs some sort of database to keep track of all those files? I think it would get messy real fast, like Windows-registry messy fast.
Doesn't Microsoft do something similar with RAM since Windows 8..? Where similar processes can use 'pointers', for example 2x iexplore with no open windows would theoretically have the exact same content in RAM.
~Aeny -
Compression just looks for patterns and spaces and packs them together. The pointer approach basically treats each block of data as an entity and can reconstruct data from those blocks. The controller does this somewhat already by storing where the data is held. All they would have to do is change it to examine incoming data and just point to an existing block with the same data/hash.
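Roughly what that write path could look like, as a toy sketch (hypothetical names, plain Python, nowhere near the hardware constraints of a real controller): hash each incoming 4K block, and if that hash has been seen before, just hand back a pointer to the existing block instead of writing it again.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K blocks, as discussed in this thread

class DedupStore:
    """Toy model of write-time block deduplication (not a real FTL)."""

    def __init__(self):
        self.blocks = {}         # physical block id -> data
        self.hash_to_block = {}  # content hash -> physical block id
        self.next_block = 0

    def write_block(self, data: bytes) -> int:
        """Return a block pointer; only store the data if it is new."""
        digest = hashlib.sha256(data).digest()
        if digest in self.hash_to_block:
            return self.hash_to_block[digest]  # duplicate: just point at it
        block_id = self.next_block
        self.next_block += 1
        self.blocks[block_id] = data
        self.hash_to_block[digest] = block_id
        return block_id

store = DedupStore()
first = store.write_block(b"\x00" * BLOCK_SIZE)
second = store.write_block(b"\x00" * BLOCK_SIZE)  # identical block
print(first == second, len(store.blocks))         # True 1 -> stored only once
```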
-
Your proposed method of data deduplication would be excruciatingly slow, as a good deal of the drive would need to be scanned each and every time a 4K page is written to disk in order to look for a match (the worst case is actually when no match exists, since the entire drive would need to be examined to determine that). To write a 1 GB file, you would need to read through the contents of the SSD 262,144 times.
At a typical 500 MB/s read speed, it takes 240 seconds to read through a completely filled 120 GB SSD once. Therefore, in a worst-case scenario, it could potentially take 728 days to write 1 GB.
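Spelling out that arithmetic (same assumptions as above: 4K pages, a completely filled 120 GB drive, 500 MB/s sequential reads):

```python
# Worst-case math for naive linear-scan dedup, as described above.
file_size = 1 * 1024**3           # 1 GB file to write
page_size = 4 * 1024              # 4K pages
pages = file_size // page_size    # 262,144 pages, one full-drive scan each

drive_size = 120 * 1000**3        # 120 GB SSD, completely filled
read_speed = 500 * 1000**2        # 500 MB/s sequential read
scan_time = drive_size / read_speed    # 240 seconds per full scan

total_days = pages * scan_time / 86400
print(pages, int(scan_time), round(total_days))  # 262144 240 728
```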
You could do a lot better performance-wise by having lots of DRAM, by using an O(log n) search algorithm instead of linear search, and by hashing the data, but even with these tricks and under ideal conditions it's unlikely you'd do significantly better than what Sandforce does, given hardware, price, and power constraints. After all, you can't cram a quad-core i7 and 16 GB of RAM into an SSD. -
What you're describing is similar to the block level deduplication that Sandforce drives do (in addition to real time compression). Except they use the extra space for redundancy (RAISE), so it's not visible to the user.
-
It actually sounds a lot like the Sandforce-style compression, except at a very high level. The method you describe is basically a form of deduplicating data, which is the principle behind file compression. The trick with all compression is speed: effective compression is almost always mutually exclusive with speed. Sandforce drives used really powerful ARM CPUs (for the time) to do the compression in real time and do everything you've described, but we all know what the real performance is like when the data is incompressible.
-
You don't need to do this at block level if your goal is to backup OS installs. You can do it on file system level. If two files have matching names, check the hash. When a match is found, add a symbolic link. It should be fast enough with an index.
For a software implementation of block-level dedup, ZFS already has it. (Not available on Windows out of the box, but there are 3rd-party projects.) -
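A rough sketch of that file-level approach (a hypothetical script, not how WHS or ZFS actually do it; it hashes whole files and uses hard links rather than symlinks):

```python
import hashlib
import os
import sys

def dedup_tree(root: str) -> None:
    """Replace duplicate files under `root` with hard links to the first copy seen."""
    index = {}  # sha256 digest -> path of the first file with that content
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in index:
                os.remove(path)
                os.link(index[digest], path)  # point this name at the original data
            else:
                index[digest] = path

if __name__ == "__main__":
    dedup_tree(sys.argv[1])
```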
-
What if you edit one copy of a file slightly? Should the edited file then be considered a completely different file? How about if multiple different files share some similarities (e.g. file metadata) but are otherwise different? Can no gains be made there? If that's the case, Sandforce is far more efficient.
I can't think of any real-world scenario where someone would have a legitimate reason to purposely store enough duplicate copies of a file for file-level deduplication to make a meaningful difference, especially given how far SSD owners tend to go to save just a couple of GBs of precious space. After all, SSDs aren't backup devices; you'd have to be crazy (or crazy rich) to use SSDs for backups given the cost/GB ratio.
Sorry, but this sounds like a solution to a non-existent problem. -
It has nothing to do with backups or duplication. It would save space, and probably even create crazy improvements in write performance. It's not file by file, but block by block.
No, it would be stored like it is currently in the file table. There are tons of examples. If you install a game that has tons of texture files, lots of small blocks of data will be identical. Instead of storing them dozens or hundreds of times, for those parts of the files it just points to the same block of data rather than writing it dozens or hundreds of times for each individual file. Heck, there could be repeating blocks of data within the same file. Even completely unrelated files could have certain segments of identical data. -
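One way to picture it: the controller already keeps a logical-to-physical mapping, so dedup just means letting several logical blocks point at the same physical block. A toy illustration (made-up numbers, not any real FTL layout):

```python
# Toy logical-to-physical mapping with shared physical blocks.
# Three "files" all contain the same 4K texture block.
logical_to_physical = {
    ("textures_a.pak", 0): 17,  # (file, block index) -> physical block id
    ("textures_b.pak", 0): 17,  # identical content, same physical block
    ("textures_c.pak", 5): 17,
    ("savegame.dat", 0): 42,    # unique content gets its own block
}

logical_blocks = len(logical_to_physical)
physical_blocks = len(set(logical_to_physical.values()))
print(logical_blocks, physical_blocks)  # 4 logical blocks stored in 2 physical blocks
```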
I see what you're getting at, but to play devil's advocate, a couple of points:
a) Seems like this concept could apply to any kind of random-access storage mechanism. In other words, if it were worth doing, then HDDs would have implemented it a long time ago. And since they haven't, my guess is there are valid reasons.
b) To implement this, the drive would require a "mapping" tree (which bit pattern goes to what LBA). In order to use this, you would have to create a lookup/mapping hash that told you which byte pattern was stored where on disk. Perhaps someone has tried this, and it turned out the required size and actual management (updates, deletes, balancing, etc.) cost so much to maintain that it hurt actual I/O performance and/or storage requirements.
c) The "all your eggs in one basket" issue. If you were to lose one LBA due to a bad sector / bad NAND page on disk, you could possibly lose many, many, many files. This may lead to some very upset users.
d) Calculating free space. How could a controller know how much free space is available for use if there are bit patterns that may be shared? Don't know if end users would like this aspect either.
Could this be done at the file system level rather than the hardware level? Or would it take so much kernel time on the CPU that the cost exceeds any benefit there as well? For quite some time, Windows has had the option to create a "compressed" drive. I wonder if it uses this technique? Does it compress files at the file level rather than at the file system level? -
Windows doesn't use this technique; its compression level is not that great. Regarding free space, it would know because it would track, as it's writing, whether each operation was an actual write (and how much space it used) or just a pointer. Not that difficult. The all-your-eggs-in-one-basket point makes sense, but make it a user-configurable feature to enable or disable, so you can either get more space with some risk, or regular storage space. They may need a bit more cache to store the cross-reference hash data, but as we know, that is getting cheaper by the day. For HDDs it may not have mattered so much because storage space wasn't at as much of a premium, plus controller performance likely wasn't nearly as fast as in SSDs.
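To sketch that bookkeeping (again just a toy model, not any vendor's design): give each physical block a reference count, count free space by unique physical blocks, and only reclaim a block when its last pointer goes away.

```python
# Reference-counted physical blocks: free space depends on unique blocks stored,
# not on how many logical copies point at them.
TOTAL_BLOCKS = 1000

refcount = {17: 3, 42: 1}  # physical block id -> number of logical pointers

def free_blocks() -> int:
    """Physical blocks still available, regardless of logical usage."""
    return TOTAL_BLOCKS - len(refcount)

def delete_logical(block_id: int) -> None:
    """Drop one logical pointer; reclaim the block only when nobody uses it."""
    refcount[block_id] -= 1
    if refcount[block_id] == 0:
        del refcount[block_id]  # the physical space is actually freed here

print(free_blocks())   # 998
delete_logical(17)
print(free_blocks())   # still 998: two other logical blocks share block 17
delete_logical(42)
print(free_blocks())   # 999: block 42 had a single owner, now reclaimed
```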
-
-
DOH!!! Typo - 15 to 20 MB!!!!! -
-
10MB HDD owner here.
-
-
Getting back on topic: since drive space was at such a premium in those days, any HDD manufacturer that had a way to cram more onto a drive would've sold a ton of them, probably becoming a huge, huge market leader. It's hard to imagine those engineers didn't cover this ground. But who's to say for certain.