Incoming data rates and capacity levels -- already creating havoc inside IT shops and making a mockery of their budgets -- are only getting worse. Most IT folks I talk to are seeing data almost double every year, with unstructured and semi-structured data in the lead. Granted, the cost per gigabyte keeps falling, but nowhere near fast enough to offset that growth.
The good news is that the investment in certain capacity optimization technologies over the last five years is beginning to pay off handsomely. I am truly amazed at the creativity of some of these products. And I believe these products, which I will briefly discuss here, will make a huge difference to all IT shops, big and small, regardless of industry, in the next 12 months. The way I see it, we had to find a way to keep 10 pounds of data in a one-pound storage unit. Otherwise, something would have to give. Think of it as a typical American trait. Innovation comes just in time to fill a void.
Fundamentally, there are two ways to solve this problem: compress the data somehow so it fits into a smaller space, or squeeze out all duplicate data in such a way that the remaining data fits into a smaller space. Both techniques are important. The first, exemplified by the LZ compression algorithms, has been around for at least 25 years and has been applied most often to tape drives. The problem with LZ is twofold: first, the compression is data-dependent, and second, the compression is local. That means the compression ratios are unpredictable but, more importantly, the algorithm only looks at a narrow window of data. So while, at a corporate level, one could achieve much larger compression ratios, LZ achieves only a fraction thereof due to its narrow view.
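The locality problem is easy to demonstrate. The sketch below uses Python's zlib (a DEFLATE/LZ77 implementation with a 32 KB sliding window) on synthetic data -- a toy illustration, not a model of any shipping product. Because the window is so narrow, a duplicate block that sits 100 KB away is invisible to the compressor:

```python
import os
import zlib

# A 100 KB block of incompressible (random) data, then a blob
# containing that same block twice -- think of it as the same
# attachment stored in two places.
block = os.urandom(100 * 1024)
blob = block + block

single = len(zlib.compress(block, 9))
double = len(zlib.compress(blob, 9))

# zlib's sliding window is only 32 KB. When the second copy of the
# block begins, the matching bytes in the first copy are already
# 100 KB behind -- outside the window. The duplicate goes undetected,
# and the blob compresses to roughly twice the size of one block.
print(f"one block compressed:  {single} bytes")
print(f"two blocks compressed: {double} bytes")
```

A global deduplication scheme, by contrast, would store that second copy essentially for free -- which is the whole point of the second set of techniques discussed next.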
The second technique, or rather set of techniques, is essentially to try to eliminate all duplication of data at the corporate level. Just think about it: How many documents in the company have the corporate logo on them? How many proposals or other contracts contain the same boilerplate? When you send an e-mail with an attachment to 50 colleagues in the company, how many times is that e-mail saved today? How about the attachments? Is there any reason to keep storing the same data again and again? How many full backups does a typical IT shop keep on tape? And now on disk? I think it would be hard to argue against the claim that some very large percentage of those "fulls" is duplicate data (on average, 90% or more).
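The core idea behind single-instance, content-addressed storage -- the family that products like EMC's Centera belong to -- can be sketched in a few lines. This is my own toy illustration, not any vendor's implementation: hash each object, use the hash as its address, and store the bytes only if that address is new.

```python
import hashlib

class SingleInstanceStore:
    """Toy content-addressed store: identical objects are kept once."""
    def __init__(self):
        self._objects = {}  # content address -> bytes

    def put(self, data: bytes) -> str:
        addr = hashlib.sha256(data).hexdigest()  # content address
        self._objects.setdefault(addr, data)     # store only if new
        return addr

    def get(self, addr: str) -> bytes:
        return self._objects[addr]

    def bytes_stored(self) -> int:
        return sum(len(d) for d in self._objects.values())

# The e-mail scenario from the text: one attachment sent to 50 colleagues.
store = SingleInstanceStore()
attachment = b"proposal boilerplate with the corporate logo " * 1000
addrs = [store.put(attachment) for _ in range(50)]

print(len(set(addrs)), "unique address for all 50 copies")
print(store.bytes_stored(), "bytes stored instead of", 50 * len(attachment))
```

Fifty logical copies, one physical copy -- a 50:1 reduction on that object with nothing cleverer than a hash table.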
Incremental backups, similarly, have a lot of duplication (at a sub-file level) with files that have already been saved once. The new products that fall in this category all attack this problem. How efficient they are, how they do it, where they do it, and what the implications are at a corporate level ultimately determine their overall effectiveness and their applicability to a specific IT shop.
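Sub-file deduplication can be sketched the same way. The simplified example below uses fixed-size chunks for clarity; many shipping products actually use variable, content-defined chunking, which handles inserted bytes far better. The assumption is the scenario from the text: repeated fulls plus an incremental that changes only a small piece.

```python
import hashlib
import os

CHUNK = 4096  # fixed-size chunks for simplicity; real products often
              # use variable, content-defined chunk boundaries

def dedupe(backups):
    """Store each distinct chunk once across all backup images.
    Returns (logical bytes written, physical bytes actually stored)."""
    seen = {}
    logical = 0
    for image in backups:
        logical += len(image)
        for i in range(0, len(image), CHUNK):
            chunk = image[i:i + CHUNK]
            seen.setdefault(hashlib.sha1(chunk).hexdigest(), chunk)
    physical = sum(len(c) for c in seen.values())
    return logical, physical

# Two identical "fulls" plus an incremental that rewrites one chunk.
full = os.urandom(CHUNK * 100)
incremental = os.urandom(CHUNK) + full[CHUNK:]
logical, physical = dedupe([full, full, incremental])

# Three images' worth of logical data lands in just over one image's
# worth of physical space.
print(f"dedup ratio: {logical / physical:.1f}:1")
```

Stretch that over weeks of fulls and incrementals with 90%-plus overlap, and the 20:1 kind of ratio discussed below stops looking like marketing.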
The companies that have shipping products in this area include Avamar, Data Domain, Diligent Technologies, EMC (Centera), HP (RISS) and Kashya. Sepaton will be in the game later this year. ADIC will probably apply the Rocksoft technology they just acquired to their VTL and other products. Symantec just announced a product based on their DCT acquisition that incorporates this technique and applies it to the remote office. I just wanted to let you know that help is on the way. Also, don't forget to check out a technology that achieves similar cost results without squashing the data. (You know who I am referring to… Copan Systems and their MAID technology.)
Based on what I have seen, it is not uncommon for IT shops to see a 20:1 reduction in secondary storage needs. Just imagine the possibilities. At these ratios, SATA drives start looking so cheap that you might be tempted (correctly, I might add) to consider them cheaper than tape. But remember that once you squeeze the duplication out, you could store the "squeezed" data on tape for long-term archiving. So the advantage does seep into tape. There are implications, though, that I cannot get into here. Suffice it to say, the whole industry is ready to be turned upside down by these technologies. You no longer have the choice of ignoring them. Go out and start checking them out. At a minimum, read my detailed article and then start evaluating these technologies. If you don't, I'm certain you'll be gasping for air as the incoming data submerges you sooner rather than later.
About the author: Arun Taneja is the founder and consulting analyst for the Taneja Group. Taneja writes columns and answers questions about data management and related topics.