The Notebook Review forums were hosted by TechTarget, which shut them down on January 31, 2022. This static read-only archive was pulled by NBR forum users between January 20 and January 31, 2022, in an effort to make sure that the valuable technical information that had been posted on the forums was preserved. For current discussions, many NBR forum users moved over to NotebookTalk.net after the shutdown.

    Thoughts on SSD Controllers - New Idea

    Discussion in 'Hardware Components and Aftermarket Upgrades' started by HTWingNut, Mar 18, 2013.

  1. HTWingNut

    HTWingNut Potato

    Reputations:
    21,580
    Messages:
    35,370
    Likes Received:
    9,877
    Trophy Points:
    931
    Maybe I don't know what I'm talking about, and that's likely the case since I don't know much about SSD controllers, but I was just thinking about my old Windows Home Server. Maybe I'm not using all the proper terminology, but with its system backup feature and Drive Extender it would use pointers for duplicate data at the block level. So if you had fresh installs of Windows 7 Pro on three PCs, each about 8GB, backing them all up wouldn't take 24GB (3x8GB); it would take not much more than 8GB, because the files were all nearly identical. It would make one copy, then generate a pointer to the same data for the other two backups.

    It seems something similar could be incorporated into SSD controllers to reduce the amount of data written to the SSD, as well as to free up space.

    Just an idea. Not that it matters, but it seems like something like this could work.
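
    Here's a rough sketch in Python of that single-instance idea (purely illustrative; it's not how WHS or any real controller actually implements it): chunks are keyed by their hash, and a duplicate chunk in a later backup just records a pointer to the copy already stored.

        import hashlib

        CHUNK = 4096      # illustrative 4 KB chunk size

        store = {}        # hash -> chunk data (each unique chunk stored once)
        backups = {}      # backup name -> list of chunk hashes (the "pointers")

        def backup(name, data):
            refs = []
            for i in range(0, len(data), CHUNK):
                chunk = data[i:i + CHUNK]
                h = hashlib.sha256(chunk).hexdigest()
                if h not in store:          # new data: store it once
                    store[h] = chunk
                refs.append(h)              # duplicate data: just point at it
            backups[name] = refs

        def restore(name):
            return b"".join(store[h] for h in backups[name])

        # Three "identical installs" only cost one copy of the unique chunks.
        image = b"windows7pro" * 100000
        for pc in ("pc1", "pc2", "pc3"):
            backup(pc, image)

        print(sum(len(c) for c in store.values()), "bytes of unique chunk data for 3 backups")
        assert restore("pc2") == image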
     
  2. MidnightSun

    MidnightSun Emodicon

    Reputations:
    6,668
    Messages:
    8,224
    Likes Received:
    231
    Trophy Points:
    231
    Not an expert on this either, but what you're talking about seems to be akin to data compression? :)
     
  3. Aeny

    Aeny Notebook Consultant

    Reputations:
    110
    Messages:
    169
    Likes Received:
    93
    Trophy Points:
    41
    That would be very handy. Imagine carrying around 4x VHDs of virtual Windows Server installs, 20GB each, with just some settings different between them (e.g. testing stuff, school, ...), and having them take a little over 20GB instead of 80GB. That would really save me some space!

    I don't know how this would be different from compression, but let's say I start changing stuff on one virtual machine: then what would it do? Write the difference to some sort of file? Then it needs some sort of database to keep track of all those files? I think it would get messy real fast, like Windows-registry messy.

    Doesn't Microsoft do something similar with RAM since Windows 8? Where similar processes can use 'pointers'; for example, two instances of iexplore with no open windows would theoretically have the exact same content in RAM.

    ~Aeny
     
  4. HTWingNut

    HTWingNut Potato

    Reputations:
    21,580
    Messages:
    35,370
    Likes Received:
    9,877
    Trophy Points:
    931
    No, it's not compression; it's different. If you were to compress two similar 20GB sets of files down to 10GB each, with this method you could copy both compressed files to the SSD and it wouldn't take much more than 10GB total, because the contents are nearly identical. It would just place pointers showing where to pull the data from on the SSD.

    Compression just looks for patterns and spaces and packs them together. The pointer approach basically treats each block of data as an entity and can reconstruct data from those blocks. The controller does something like this already by storing where the data is held; all they would have to do is change it to examine incoming data and point to an existing block with the same data/hash.


    Yes exactly. That's what I'm talking about.

    It'd just be a little more complex than a standard file table, using something like hash values.

    Perhaps, but not sure. But that's the idea.
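
    To make the distinction concrete, here's a toy Python sketch (my own illustration, not any real controller's logic): compression shrinks a single stream by finding patterns inside it, while dedup notices that two blocks are byte-for-byte identical and stores the second as nothing but a reference, even when the data itself is incompressible.

        import hashlib, os, zlib

        block_a = os.urandom(4096)   # random data: effectively incompressible
        block_b = block_a            # an identical block written somewhere else

        # Compression alone: each copy still costs roughly 4 KB.
        print(len(zlib.compress(block_a)) + len(zlib.compress(block_b)))

        # Dedup: the second copy becomes a pointer to the first.
        store = {}                                   # hash -> physical block
        table = []                                   # logical block -> hash (the "pointer")
        for blk in (block_a, block_b):
            h = hashlib.sha256(blk).hexdigest()
            store.setdefault(h, blk)
            table.append(h)
        print(sum(len(b) for b in store.values()))   # ~4096 bytes stored, two pointers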
     
  5. Peon

    Peon Notebook Virtuoso

    Reputations:
    406
    Messages:
    2,007
    Likes Received:
    128
    Trophy Points:
    81
    Your proposed method of data deduplication would be excruciatingly slow, as a good deal of the drive would need to be scanned each and every time a 4K page is written to disk in order to look for a match (the worst case is actually when no match exists, since the entire drive would need to be examined to determine that). To write a 1 GB file, you would need to read through the contents of the SSD 262,144 times.

    At a typical 500 MB/s read speed, it takes 240 seconds to read through a completely filled 120 GB SSD once. Therefore, in a worst-case scenario, it could potentially take 728 days to write 1 GB.

    You could do a lot better performance-wise by having lots of DRAM, by using an O(log n) search algorithm instead of linear search, and by hashing the data, but even with these tricks and under ideal conditions it's unlikely you'd do significantly better than what Sandforce does, given hardware, price, and power constraints. After all, you can't cram a quad-core i7 and 16 GB of RAM into an SSD.
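
    For reference, the arithmetic behind those numbers, plus the kind of in-memory hash index that avoids the scan entirely (a back-of-the-envelope Python sketch; the sizes and speeds are just the figures quoted above):

        import hashlib

        # Worst case with a linear scan: every 4K page written re-reads the drive.
        pages = (1 * 1024**3) // 4096            # 262,144 pages in 1 GB
        scan_seconds = 120_000 / 500             # 240 s to read 120 GB at 500 MB/s
        print(pages * scan_seconds / 86400)      # ~728 days

        # With a hash index kept in DRAM, each page is one lookup, not a scan.
        index = {}                               # hash -> LBA of an existing copy
        def write_page(page, lba):
            h = hashlib.sha256(page).digest()
            if h in index:                       # O(1) average: no disk scan needed
                return index[h]                  # point at the existing copy
            index[h] = lba
            return lba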
     
  6. R3d

    R3d Notebook Virtuoso

    Reputations:
    1,515
    Messages:
    2,382
    Likes Received:
    60
    Trophy Points:
    66
    What you're describing is similar to the block level deduplication that Sandforce drives do (in addition to real time compression). Except they use the extra space for redundancy (RAISE), so it's not visible to the user.
     
  7. Marksman30k

    Marksman30k Notebook Deity

    Reputations:
    2,080
    Messages:
    1,068
    Likes Received:
    180
    Trophy Points:
    81
    It actually sounds a lot like the SandForce-style compression, except at a very high level. The method you describe is basically a form of deduplicating data, which is the principle behind file compression. The trick with all compression is speed: effective compression is almost always mutually exclusive with speed. SandForce drives used really powerful ARM CPUs (for the time) to do the compression in real time and accomplish everything you've described, but we all know what the real performance is like when the data is incompressible.
     
  8. Mr.Koala

    Mr.Koala Notebook Virtuoso

    Reputations:
    568
    Messages:
    2,307
    Likes Received:
    566
    Trophy Points:
    131
    You don't need to do this at the block level if your goal is to back up OS installs. You can do it at the file system level. If two files have matching names, check the hash. When a match is found, add a symbolic link. It should be fast enough with an index.

    For a software implementation of block-level dedup, ZFS already has it. (It's not available on Windows out of the box, but there are 3rd-party projects.)
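
    Something along these lines, presumably (a quick Python sketch of the file-level approach; it skips the name-matching shortcut and just hashes everything under a directory, replacing byte-identical duplicates with symlinks; try it on throwaway data only):

        import hashlib, os

        def dedup_tree(root):
            seen = {}                                # content hash -> first path seen
            for dirpath, _, names in os.walk(root):
                for name in names:
                    path = os.path.join(dirpath, name)
                    if os.path.islink(path):
                        continue
                    with open(path, "rb") as f:
                        h = hashlib.sha256(f.read()).hexdigest()
                    if h in seen:                    # duplicate content: link to the original
                        os.remove(path)
                        os.symlink(os.path.abspath(seen[h]), path)
                    else:
                        seen[h] = path

        # dedup_tree("/path/to/backups")   # hypothetical directory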
     
  9. HTWingNut

    HTWingNut Potato

    Reputations:
    21,580
    Messages:
    35,370
    Likes Received:
    9,877
    Trophy Points:
    931
    It doesn't have to scan the whole disk. It'd just be like any drive's file system. Do you think a system scans the drive every time it fetches a file? No, it stores a database, more or less, of where to fetch that file. If you can compress on the fly, you can check a bunch of hashes on the fly. It wouldn't take 728 days, lol. Windows Home Server has done this since its inception on much slower drives filled with TBs of data.

    But wouldn't it be better for this to be handled in the hardware than to rely on some third-party software?
     
  10. Peon

    Peon Notebook Virtuoso

    Reputations:
    406
    Messages:
    2,007
    Likes Received:
    128
    Trophy Points:
    81
    Ahh, so you're proposing to perform the dedup at the file level.

    What if you edit one copy of a file slightly? Should the edited file then be considered a completely different file? What about multiple different files that share some similarities (e.g. file metadata) but are otherwise different? Can no gains be made there? If that's the case, SandForce is far more efficient.

    I can't think of any real-world scenario where someone would have a legitimate reason to purposely store enough duplicate copies of a file for file-level deduplication to make a meaningful difference, especially given how far SSD owners tend to go to save just a couple of GBs of precious space. After all, SSDs aren't backup devices; you'd have to be crazy (or crazy rich) to back up to SSDs given the cost/GB ratio.

    Sorry, but this sounds like a solution to a non-existent problem.
     
  11. HTWingNut

    HTWingNut Potato

    Reputations:
    21,580
    Messages:
    35,370
    Likes Received:
    9,877
    Trophy Points:
    931
    It has nothing to do with backups or duplication. It would save space, and probably even yield crazy improvements in write performance. It's not file by file, but block by block.

    No, it would be stored like it is currently, in the file table. There are tons of examples. If you install a game that has tons of texture files, lots of small blocks of data will be identical. Instead of storing them dozens or hundreds of times, for parts of those files it would just point to the same block of data rather than writing it again for each individual file. Heck, there could even be repeating blocks of data within the same file. Even completely unrelated files could have certain segments of identical data.
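
    It's easy to measure how much of that repetition actually exists; here's a small Python sketch (just an illustration, and the path is hypothetical) that walks a directory and reports what fraction of its 4 KB blocks are byte-identical repeats, within or across files:

        import hashlib, os

        def duplicate_block_ratio(root, block_size=4096):
            seen, total, dupes = set(), 0, 0
            for dirpath, _, names in os.walk(root):
                for name in names:
                    with open(os.path.join(dirpath, name), "rb") as f:
                        while True:
                            block = f.read(block_size)
                            if not block:
                                break
                            total += 1
                            h = hashlib.sha256(block).digest()
                            if h in seen:        # this 4 KB block already exists somewhere
                                dupes += 1
                            else:
                                seen.add(h)
            return dupes / total if total else 0.0

        # print(duplicate_block_ratio("/path/to/game/files"))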
     
  12. jclausius

    jclausius Notebook Virtuoso

    Reputations:
    6,160
    Messages:
    3,265
    Likes Received:
    2,573
    Trophy Points:
    231
    I see what you're getting at, but to play devil's advocate, a few points:

    a) Seems like this concept could apply to any kind of random-access storage mechanism. In other words, if it were worth doing, HDDs would have implemented it a long time ago. And since they haven't, my guess is there are valid reasons.

    b) To implement this, the drive would require a "mapping" tree (which bit pattern goes to what LBA). In order to use it, you would have to create a lookup/mapping hash that told you which byte pattern was stored where on disk. Perhaps someone has tried this, and it turned out that the required size and the ongoing management (updates, deletes, balancing, etc.) cost so much to maintain that it impeded actual I/O performance and/or storage requirements.

    c) The "all your eggs in one basket" issue. If you were to lose one LBA due to a bad sector / bad NAND page on disk, you could possibly lose many, many, many files. This may lead to some very upset users.

    d) Calculating free space. How could a controller know how much free space is available for use if there are bit patterns that may be shared? Don't know if end users would like this aspect either.


    Could this be done at the file system level rather than the hardware level? Or would it take so much kernel time on the CPU that the cost exceeds any benefit there as well? For quite some time, Windows has had the option to create a "compressed" drive. I wonder if it uses this technique? Does it compress files at the file level rather than at the file system level?
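
    Points (b) through (d) can be made concrete with a toy reference-counted mapping (a Python sketch of my own, not anything a real controller does): a shared block is only charged against free space once, and its reference count is exactly how many logical writes would be hit if that one physical block went bad.

        import hashlib

        class DedupMap:
            def __init__(self, total_blocks):
                self.total = total_blocks
                self.by_hash = {}      # bit-pattern hash -> LBA   (the lookup in point b)
                self.refcount = {}     # LBA -> number of logical blocks pointing at it
                self.next_lba = 0

            def write(self, block):
                h = hashlib.sha256(block).digest()
                if h in self.by_hash:                 # shared block: no new space consumed
                    lba = self.by_hash[h]
                    self.refcount[lba] += 1
                else:                                 # genuinely new block: allocate it
                    lba = self.next_lba
                    self.next_lba += 1
                    self.by_hash[h] = lba
                    self.refcount[lba] = 1
                return lba

            def free_blocks(self):                    # point d: "free" depends on sharing
                return self.total - len(self.refcount)

            def blast_radius(self, lba):              # point c: one bad LBA hurts this many writes
                return self.refcount.get(lba, 0)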
     
  13. HTWingNut

    HTWingNut Potato

    Reputations:
    21,580
    Messages:
    35,370
    Likes Received:
    9,877
    Trophy Points:
    931
    Windows doesn't use this technique, and its compression level is not that great. Regarding free space, it would know because it would track, as it's writing, whether it was an actual write (and how much space) or just a pointer. Not that difficult. The all-your-eggs-in-one-basket point makes sense, but make it a user-configurable feature, so you can either get more space with some risk or keep regular storage behavior. They may need a bit more cache to store the cross-reference hash data, but as we know, that is getting cheaper by the day.

    For HDDs it may not have mattered so much, because storage space wasn't at as much of a premium, plus controller performance likely wasn't nearly as fast as in SSDs.
     
  14. R3d

    R3d Notebook Virtuoso

    Reputations:
    1,515
    Messages:
    2,382
    Likes Received:
    60
    Trophy Points:
    66
    What you just described is block level deduplication.
     
  15. jclausius

    jclausius Notebook Virtuoso

    Reputations:
    6,160
    Messages:
    3,265
    Likes Received:
    2,573
    Trophy Points:
    231
    That's not what I'm getting at. When an app writes to the OS, it can do so with WRITE THROUGH or WRITE BACK. In either case, the file system, which manages the blocks on the drive, has to know which blocks it can and cannot write to, and how much space it has available to grant a request. So there's a breakdown: the file system cannot necessarily know whether a write would or would not succeed, because that depends on the actual presence of data within a block or NAND cell.

    Oh, how I wish this were true. Most users always have a need for more and more storage space. If we could get a flux capacitor, I'd take you back to my computers running "Stacker" and "DoubleSpace" on whopping 16 to 20 GB HDDs.

    DOH!!! Typo - 15 to 20 MB!!!!!
     
  16. HTWingNut

    HTWingNut Potato

    Reputations:
    21,580
    Messages:
    35,370
    Likes Received:
    9,877
    Trophy Points:
    931
    Been there. 20GB? Hell, my first hard drive was 80MB!
     
  17. Mr.Koala

    Mr.Koala Notebook Virtuoso

    Reputations:
    568
    Messages:
    2,307
    Likes Received:
    566
    Trophy Points:
    131
    10MB HDD owner here. :D
     
  18. HTWingNut

    HTWingNut Potato

    Reputations:
    21,580
    Messages:
    35,370
    Likes Received:
    9,877
    Trophy Points:
    931
    LOL. Yeah. I remember mine cost me $240 back in 1992. But considering the speed difference from booting off floppy vs hard drive, not to mention most programs were only a few hundred KB back then, it was a godsend. It was actually on my old Amiga 500 with the expansion box. I loved that thing. It got me through college.
     
  19. jclausius

    jclausius Notebook Virtuoso

    Reputations:
    6,160
    Messages:
    3,265
    Likes Received:
    2,573
    Trophy Points:
    231
    Doh!!! Typo!!! MegaBytes, MegaBytes, MegaBytes. Back when drives marked MB meant MB. My first non-parental computer ('88-ish) was an i386 SX with either a 16 or 20MB HDD. I remember later on running out of drive space trying to load things like Prince of Persia alongside my trusty Borland Pascal and C/C++ compilers. Sorry 'bout that.

    Getting back on topic: since drive space was at such a premium in those days, any HDD manufacturer that had a way to cram more onto a drive would've sold a ton of them, probably becoming a huge, huge market leader. It's hard to imagine those engineers didn't cover this ground. But who's to say for certain.