There are two main approaches to archiving data: independent architectures for individual applications; or one architecture that consolidates all archives on a single platform.
By Stephen Foskett
Data archiving is the vampire of the storage world. It promises to rejuvenate enterprise storage systems by sucking out debris so they can work as well as they did when they were young and not burdened by millions of files. But behind the benefits of archiving lurks a hidden detail: Both the effort of getting archiving to work and the final cost of implementation can be terrifying.
But times are changing, and the benefits of archiving are becoming even more enticing. Once focused solely on the data lifecycle (moving content from expensive disk storage to cheap tape), archiving has transformed into something altogether different. For compliance and legal reasons, today's archives are increasingly serving as long-term data storehouses, and many implementations forgo the old stub-and-delete hierarchical storage management concept altogether.
Archiving has become an indispensable tool to protect an organization, not just a few pieces of data here and there. Although most archiving efforts start with a single application, demand quickly grows to include multiple data types and diverse systems. The key question is whether you should attempt to expand a single archiving system to include heterogeneous data or employ multiple, single-application archiving systems. But there are other ways to build a consolidated archive. You can leverage a single storage platform for multiple archiving systems or employ enterprise search technologies to put a unified face on a diverse set of systems.
Email often starts the process
Most archives start small, and many begin with one type of data like email (see "What to archive: Different data types," below). IT decides it needs to control growth, so it goes looking for a system that can stub out attachments or move data out of the Exchange server. Then the legal department demands a complete set of email messages for a litigation-related search. Later, records management needs to retain certain messages for compliance with Sarbanes-Oxley or other industry regulations.
|What to archive: Different data types|
Although most business information is digital today, not all systems manage information equally. Manageability requires organization and structure, the ability to search for information and meta data to categorize content. We use these elements to classify data as structured, semi-structured or unstructured.
Structured applications are inherently organized, although the identification, description and relation between data can be highly customized. In the enterprise, structured databases are often core applications with specialized administrators managing the data and archiving.
Semi-structured systems like email have some structure, but they weren't developed with information management in mind, and it shows. What structure they have is functional, designed to serve specific application needs rather than the higher goal of manageability.
Finally, there's the class of unstructured data that's so familiar in file systems. Although some basic systems are used to organize and describe these files, they can't be called truly structured as they lack information about their functional or organizational relationships.
Regardless of their original intent, the scope of these point solutions tends to expand over time. Email systems also include contact lists, calendar entries, to-do lists and notes, and these may not have been considered at first. What about attachments? Most corporate record-retention policies call for documents to be retained, but they might be duplicated on the file server or document management system. And once legal gets used to simpler search and retrieval of old email messages, they'll want to archive and search across a variety of data types beyond email, including document management systems, file servers and structured data systems.
Events like those often lead to a key turning point in the process of implementing consolidated archiving: Should the current system be expanded to include other record types or should another vertical solution be deployed for each new application?
Standardization and simplicity are often the primary reasons for expanding an existing archival system. Jason Beckham, director of IT at Payformance Corp. in Jacksonville, FL, sees the benefit in sticking to a single platform. "Our current Hitachi archiving platform is already in place and it's a known quantity," he says. "We plan to expand on the existing HCAP [Hitachi Content Archive Platform] when we add email archiving, since it's so simple to implement and will require much less management and training."
Three keys to archiving
There are many ideas about what a consolidated or unified archive should look like, and preconceptions can clash when you're considering creating such a solution. There are three key elements to any archiving system:
- Archiving software, from companies like Autonomy Zantaz, EMC Corp., Mimosa Systems Inc. and Symantec Corp., manages the location, movement and disposition of data.
- Storage hardware, from companies such as EMC, Hewlett-Packard (HP) Co., Hitachi and NetApp, receives the data to be preserved; specialized platforms also handle encryption, protection, retrieval and destruction of data.
- Management software, from vendors like Abrevity Inc., Attenex Corp., Autonomy Corp., Clearwell Systems Inc., i365 (a Seagate Company) and Kazeon Systems Inc., provides services like search, classification and e-discovery capability.
But most applications don't fall into these neat classifications. As they develop their products and the market matures, vendors continually add features like e-discovery support, search and data movement, blurring the lines among storage, archiving and management. The variety of elements and overlapping features add complexity to the once-simple world of archiving.
Navigating the archiving market starts with an evaluation of your company's objectives for its archive. If the objective is to serve business demands like in-house e-discovery or retention to comply with regulations, it makes sense to let data management features drive product selection. But if IT needs a system to control data growth or enable lifecycle management, higher-level search and e-discovery features are less relevant. Regardless of the initial objective, it's likely the archive will eventually serve both business and IT demands.
Not all archives will use all three archiving elements presented here. Some organizations send data to an archiving platform directly from a custom application, while others will use conventional storage systems rather than investing in a specialized device as their archive target.
Brian Greenberg, an independent IT strategy consultant based in Chicago, suggests a strategy is needed before expanding the archive environment. "Larger organizations, especially in regulated industries, are looking for federated search and management across data types, but smaller, less-regulated companies might be able to keep their data in silos," he says. "The key is the level of overarching management needed."
One practical solution to bring order to a diverse set of data archives is to leverage federated management software. These applications let you pick the best point solutions for email, databases, content management systems and file servers, but hide the complexity of having multiple archivers behind a unified interface. These are especially appropriate where search, rather than capacity management, is the primary reason for archiving.
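The federated approach described above can be illustrated with a minimal sketch. It assumes each point archive exposes some kind of search call; the `make_archive` and `federated_search` helpers, the record layout and the plain substring matching are all hypothetical stand-ins, not any vendor's API.

```python
from typing import Callable, Dict, List

# Assumption: every point archive can be wrapped as "term in, matching records out".
SearchFn = Callable[[str], List[dict]]


def make_archive(name: str, records: List[dict]) -> SearchFn:
    """Build a toy archive whose search does a case-insensitive substring match."""
    def search(term: str) -> List[dict]:
        needle = term.lower()
        # Tag each hit with the archive it came from so results stay traceable.
        return [dict(r, source=name) for r in records if needle in r["text"].lower()]
    return search


def federated_search(archives: Dict[str, SearchFn], term: str) -> List[dict]:
    """Send one query to every archive and merge the hits into a single list."""
    results: List[dict] = []
    for search in archives.values():
        results.extend(search(term))
    return results


archives = {
    "email": make_archive("email", [
        {"id": "m1", "text": "Re: Contract renewal terms"},
        {"id": "m2", "text": "Lunch on Friday?"},
    ]),
    "files": make_archive("files", [
        {"id": "f1", "text": "contract_2008_final.doc"},
    ]),
}

hits = federated_search(archives, "contract")
```

Here `hits` holds one message from the email archive and one document from the file archive, each labeled with its source. A real federated tool would add ranking, access control and far richer query semantics than this sketch attempts.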
Although many archiving platforms today include search and e-discovery features, none can match legal-focused data management tools. Simple Boolean text search can't hold a candle to the concept-clustering and fuzzy-search features in products like Attenex's Patterns, Clearwell's E-Discovery Platform or i365's MetaLincs. Tools like these also boast complex e-discovery features, including review and annotation, that are beyond anything found in the more IT-focused archiving applications.
Some of these tools can be used to search non-archived data as well. Autonomy and OpenText Corp. offer enterprise search platforms that can manage both production and archived data from a single interface and can be integrated into other enterprise applications for complex environments. CommVault even includes backup and remote site replication in its unified archiving and search platform, creating a one-stop data protection and management suite.
Many companies are also attempting to solve the archive consolidation puzzle from the bottom up, investigating unified storage platforms to retain data. This approach is most appropriate where capacity control, rather than search, is the key requirement, as storage unification can bring many advantages.
Consolidated archiving doesn't have to be a great technical challenge. Although specialized protocols like EMC's Centera API and the Storage Networking Industry Association's new eXtensible Access Method (XAM) specification were developed specifically with archiving in mind, Thomas Savage, senior manager, product marketing data retention at NetApp, points out that "most archiving applications support a variety of storage devices over standard CIFS or NFS." Nearly any storage device can serve as a landing spot for archive data.
But specialized archive platforms like those from EMC, Hitachi, HP and NetApp bring special capabilities for storage of archival data (see "Popular archiving platforms," below). Most offer native support for "objects" rather than files, and can manage these with custom meta data. Some can autonomously enforce retention policies, even securely deleting data once it has "expired." They may also support full-text indexing and search. Although higher-level archiving software will almost certainly duplicate some of the functionality of these devices, their presence at this lowest common layer can make it simpler to configure these features and consistently enforce policy.
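To make the idea of objects with custom meta data and platform-enforced retention concrete, here is a rough sketch. The `ArchiveObject` class and `purge_expired` function are hypothetical names invented for illustration; real platforms implement retention in their own ways, and this toy version simply drops expired objects rather than securely erasing them.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Tuple


@dataclass
class ArchiveObject:
    """An archived object: content plus custom meta data and a retention period."""
    object_id: str
    content: bytes
    stored_on: date
    retention_days: int
    metadata: dict = field(default_factory=dict)

    def expires_on(self) -> date:
        return self.stored_on + timedelta(days=self.retention_days)


def purge_expired(
    objects: Dict[str, ArchiveObject], today: date
) -> Tuple[Dict[str, ArchiveObject], List[str]]:
    """Return the objects still under retention and the IDs of those that expired."""
    kept: Dict[str, ArchiveObject] = {}
    deleted: List[str] = []
    for object_id, obj in objects.items():
        if today < obj.expires_on():
            kept[object_id] = obj
        else:
            # Assumption: a real platform would securely erase the data here.
            deleted.append(object_id)
    return kept, deleted


objects = {
    "a": ArchiveObject("a", b"old invoice", date(2001, 1, 1), 7 * 365,
                       {"department": "finance"}),
    "b": ArchiveObject("b", b"new invoice", date(2008, 6, 1), 7 * 365,
                       {"department": "finance"}),
}
kept, deleted = purge_expired(objects, today=date(2009, 1, 1))
```

With these sample dates, object "a" has outlived its roughly seven-year retention period and is purged, while "b" is kept. Enforcing the rule at the storage layer means it applies no matter which archiving application wrote the object.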
Data protection is especially critical, so IT managers might feel more comfortable with a "belts and suspenders" approach. Even if the archiving software places some data off-limits by policy, a basic storage system used as the archive target may leave it unprotected. Although careful application of traditional security and access controls found on standard NAS systems can offset this risk, the enhanced features of specialized archiving storage systems go further.
Storage platforms specifically designed for archiving often include enhanced protection against modification or deletion, sometimes called write once, read many (WORM) storage. Contrary to popular belief, there are few regulations or laws specifically calling for WORM storage. But the concept of access control is as central to long-term data management as retention schedules and classification, and legal discovery regularly demands certification that data hasn't been accessed or modified. Therefore, no legal or compliance archive should be without WORM capability. Systems may also offer authentication that uses mathematical checksums to verify that data hasn't been modified, but the use of those checksums hasn't yet been commonly presented in legal cases. Finally, make sure the system logs and reports all access attempts, as this data is critical for documenting compliance.
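The three protections in this paragraph (write-once behavior, checksum verification and access logging) can be sketched in a few lines. The `WormArchive` class below is a hypothetical in-memory illustration, with SHA-256 chosen as the checksum by assumption; it is not how any particular product works.

```python
import hashlib
from datetime import datetime, timezone


class WormArchive:
    """In-memory sketch of write-once storage with checksums and an access log."""

    def __init__(self):
        self._objects = {}    # object_id -> content
        self._digests = {}    # object_id -> SHA-256 digest recorded at write time
        self.access_log = []  # every attempt is recorded, allowed or not

    def _log(self, user, action, object_id, allowed):
        self.access_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "object_id": object_id,
            "allowed": allowed,
        })

    def write(self, user, object_id, data):
        """Store an object once; any later write to the same ID is refused."""
        if object_id in self._objects:
            self._log(user, "write", object_id, False)
            raise PermissionError(f"{object_id} is write-once and already exists")
        self._objects[object_id] = bytes(data)
        self._digests[object_id] = hashlib.sha256(data).hexdigest()
        self._log(user, "write", object_id, True)
        return self._digests[object_id]

    def read(self, user, object_id):
        """Return an object's content, logging the attempt first."""
        self._log(user, "read", object_id, object_id in self._objects)
        return self._objects[object_id]

    def verify(self, object_id):
        """Recompute the checksum and compare it with the one recorded at write time."""
        current = hashlib.sha256(self._objects[object_id]).hexdigest()
        return current == self._digests[object_id]
```

In use, a second `write` to an existing ID raises `PermissionError` and still leaves a log entry, while `verify` returns `False` if the stored bytes no longer match the digest recorded at write time.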
Many storage systems designed for archiving are now adding data-reduction technologies ranging from compression to single-instance storage (SIS) to advanced deduplication. These various technologies function similarly, using algorithms to losslessly reduce the amount of data that must be stored. Single-instance storage compares whole files or objects with existing content, storing only a single copy of duplicates. Traditional data compression encodes files or objects, creating a "dictionary" of repeating patterns to shrink them. Finally, deduplication technology searches for duplicate data both within objects and across the entire data store, a complex computing task that can result in vastly reduced storage requirements. Each approach balances storage reduction against the computing power required to accomplish it.
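The difference between the three techniques is easier to see side by side. The sketch below is a simplified illustration under stated assumptions: SHA-256 hashes stand in for content comparison, `zlib` represents dictionary-style compression, and deduplication uses fixed 4 KB chunks, whereas commercial products often use more sophisticated variable-size chunking. All function names are hypothetical.

```python
import hashlib
import zlib


def single_instance_store(objects):
    """SIS: keep one copy of each distinct whole object, keyed by content hash."""
    store = {}
    for data in objects:
        store.setdefault(hashlib.sha256(data).hexdigest(), data)
    return store


def compressed_size(objects):
    """Compression: shrink each object independently and total the results."""
    return sum(len(zlib.compress(data)) for data in objects)


def dedup_store(objects, chunk_size=4096):
    """Deduplication: keep one copy of each distinct chunk across the whole store."""
    chunks = {}   # chunk hash -> chunk bytes
    recipes = []  # per object, the ordered chunk hashes needed to rebuild it
    for data in objects:
        recipe = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            chunks.setdefault(digest, chunk)
            recipe.append(digest)
        recipes.append(recipe)
    return chunks, recipes


def restore(chunks, recipe):
    """Rebuild an object from its chunk recipe, showing the reduction is lossless."""
    return b"".join(chunks[digest] for digest in recipe)


# Two identical objects plus a third that shares its first 8 KB with them.
base = b"A" * 4096 + b"B" * 4096
objects = [base, base, base + b"C" * 4096]

raw_bytes = sum(len(o) for o in objects)
sis_bytes = sum(len(o) for o in single_instance_store(objects).values())
chunks, recipes = dedup_store(objects)
dedup_bytes = sum(len(c) for c in chunks.values())
```

For this sample data, the raw total is 28,672 bytes, SIS stores 20,480 bytes because it removes only the exact duplicate, and chunk-level deduplication stores 12,288 bytes because it also catches the shared portion of the third object. The extra hashing and bookkeeping in `dedup_store` is the computing cost that buys the greater reduction.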
These specialized archiving systems also include standard enterprise storage capabilities like scalability, high availability and data replication. Architecture varies from product to product, with some employing traditional storage array technology and others based on a cluster of redundant nodes. One differentiator is the extent to which the archive system can, or should, include non-archive data (see "Consolidated vs. unified storage," below).
|Consolidated vs. unified storage|
As archives are implemented, it's tempting to create "stovepipes" of storage on the back end, with each application using its own storage system for content and indexes. "Of course a unified storage platform gives efficiency of resources and management, but consolidating archives has other benefits," says Rob Mossi, senior marketing manager for archiving at Hitachi Data Systems. "Creating a consolidated platform for the storage of archived content is a winning strategy, especially when leveraging advanced storage system features like duplicate elimination, compression and replication as found in highly scalable, performance-enabled active archive solutions," he notes.
Although archive software can use a variety of storage platforms to store the archived content itself, there are other storage requirements for these applications. All archiving software products maintain an index of both the production and archived data, and that database is often stored on a conventional storage array. A unified, multiprotocol storage system with conventional block storage and archiving features allows both the index and content to share space, easing management and growth headaches.
Financial considerations also come into play when comparing an archiving system to plain storage capacity. Payformance's Beckham recognized the cost differential, but says it was justified based on the added meta data and WORM functionality. "We weren't buying a storage system, we were investing in the value that an intelligent archiving platform brought to our business," he says. When evaluating a storage platform for archive consolidation, consider whether the advanced features of an archiving system are required. Although an existing storage device might be acceptable for retention, these capabilities might be worth the extra money. Note that technical issues like the scalability of deduplication and the manner in which protocols like NFS handle offline files sometimes crop up. In these cases, only a specialized archiving platform will do.
Finally, since archiving software often directly integrates with these capabilities, these specialized storage systems are much more likely to be configured correctly than basic storage devices. When purchasing archiving hardware or software, look for highly integrated combinations to minimize the risk and headache of management.
The global archive
Where should one start when considering a consolidated archive? As with all IT decisions, the first consideration should be the business objectives to be served. Start by thinking about the goal: Will the archive deliver capacity control, compliance, business productivity or legal support, or a combination of these capabilities? Then compile a list of specific functional requirements: what the system must do, rather than how it must accomplish these things. Only then can products be evaluated fairly in this complicated corner of the storage market.
BIO: Stephen Foskett is director of data practice at Contoural Inc.