We upgraded our main data processing server with 4x120GB Indilinx SSDs a little over 2 years ago. Going from hard drives to 1000MB/s reads, 800MB/s writes and near-zero latency was an astounding leap in performance.
This week, though, we started seeing SSD degradation from too many writes. The SQL software would crash for no apparent reason. After testing every other possible factor (CPU, memory, OS, software), I finally narrowed it down to SSD burnout after examining a backup dump. Out of 5000 tables, 2 showed blocks of '0's, which sounds like what happens when memory cells can no longer accept writes: they revert to a default state of 0's.
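For anyone who wants to check their own dumps, here is a rough sketch of that kind of scan. It is not the exact procedure used here: the function name `find_zero_blocks` and the 4096-byte block size are assumptions made for illustration, and a binary dump can contain legitimate zero runs, so hits are only candidates for a closer look.

```python
def find_zero_blocks(path, block_size=4096):
    """Return (offset, length) spans of consecutive all-zero blocks in a file.

    Hypothetical helper: block_size is an assumed granularity, not a value
    taken from any particular drive or database.
    """
    spans = []
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            # A block counts as suspicious only if every byte in it is 0x00.
            if block.count(0) == len(block):
                if spans and spans[-1][0] + spans[-1][1] == offset:
                    # Extend the previous span when this block is adjacent to it.
                    start, length = spans[-1]
                    spans[-1] = (start, length + len(block))
                else:
                    spans.append((offset, len(block)))
            offset += len(block)
    return spans
```

Calling `find_zero_blocks("backup.dump")` (the file name is a placeholder) returns a list of byte offsets and lengths that can then be matched against the tables in the dump.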
On average, we did 100GB of writes per week, which works out to roughly 3TB of writes per 120GB drive over those two years. At a ratio of 3000:120, that's an average of 25 writes per location, about two orders of magnitude (200x) below the claimed 5000-write life for MLC SSDs. However, these are 1st generation drives, so my guess is that the wear leveling algorithms are not very good and writes are concentrated on a small set of cells. To back that up, ~10 bad blocks out of 350GB of data is a very low amount, about what you get from hard drives with a bad sector here and there. The problem is that the detection and remapping of bad blocks on these 1st generation SSDs is just as bad as their wear leveling. Hence having 0.01% of cells wear out due to concentrated writes is just as bad as the entire drive going bad if there's no way to avoid that 0.01%.
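The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: "a little over 2 years" is taken as 104 weeks, writes are assumed to be spread evenly over the 4 drives, and write amplification is ignored. The function name `wear_estimate` is made up for this example.

```python
def wear_estimate(weekly_writes_gb, weeks, drives, capacity_gb):
    """Return (GB written per drive, average overwrites per location).

    Assumes perfectly even wear and an even split across drives,
    and ignores write amplification inside the SSD.
    """
    per_drive_gb = weekly_writes_gb * weeks / drives
    cycles = per_drive_gb / capacity_gb
    return per_drive_gb, cycles


# Figures from the post; 104 weeks is an assumption for "a little over 2 years".
per_drive_gb, cycles = wear_estimate(100, 104, 4, 120)
print(f"{per_drive_gb:.0f} GB per drive, about {cycles:.1f} overwrites per location")

# The post rounds the per-drive total up to 3TB (3000GB).
rounded_cycles = 3000 / 120
headroom = 5000 / rounded_cycles  # how far below the claimed 5000-write life
print(f"rounded: {rounded_cycles:.0f} overwrites, {headroom:.0f}x below the rated life")
```

Either way the average comes out in the low twenties, so evenly spread writes alone would not explain worn-out cells; the numbers only make sense if a small fraction of cells took far more than the average.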
We will be replacing these SSDs with 5x SandForce SATA 6Gbps SSDs next week, so our server should soon reach 2500MB/s reads and 2000MB/s writes. SandForce controllers have built-in compression and error correction, so it's possible these drives will come much closer to the rated 5000-write life for MLC memory. If not, it's just the cost of doing business, and it's impossible to return to regular hard drives after using SSDs.
(Filed in technology)
Equipment guaranteed to break
Posted by Mossy
October 1, 2011 4:43 AM