What you will learn: The term "big data" is used anytime an enterprise produces a set of data containing critical business information that's too large to be processed by relational databases. Determining what data is left unstructured depends on the size and scope of the enterprise’s IT infrastructure, but it's common for businesses of all sizes to have some amount of information that could be considered big data. The struggle for IT administrators and business analysts is not only how to store this data, but how to store it in a way that allows for analysis, resulting in the identification of critical business patterns and insights.
As the IT industry continues to preach the advantages of cheap storage, businesses are keeping more data than ever, which has prompted a closer look at which factors matter most when evaluating a big data infrastructure. Among the most important are capacity, latency, access, security and cost, all of which are covered in this article.
What's driving the big data movement?
Aside from the ability to keep more data than ever before, we have access to more types of data. These data sources include Internet transactions, social networking activity, automated sensors, mobile devices and scientific instrumentation, among others. In addition to static data points, transactions can create a certain “velocity” to this data growth. As an example, the extraordinary growth of social media is generating new transactions and records. But the availability of ever-expanding data sets doesn’t guarantee success in the search for business value.
Data is now a factor of production
Data has become a full-fledged factor of production, like capital, labor and raw materials, and it’s not just a requirement for organizations with obscure applications in special industries. Companies in all sectors are combining and comparing more data sets in an effort to lower costs, improve quality, increase productivity and create new products. For example, analyzing data supplied directly from products in the field can help improve designs. Or a company may be able to get a jump on competitors by analyzing its customers’ behavior more deeply and comparing it against a growing number of available market characteristics.
Storage must evolve
Big data has outgrown its own infrastructure and it’s driving the development of storage, networking and compute systems designed to handle these specific new challenges. Software requirements ultimately drive hardware functionality and, in this case, big data analytics processes are impacting the development of data storage infrastructures. This could mean an opportunity for storage and IT infrastructure companies. As data sets continue to grow with both structured and unstructured data, and analysis of that data gets more diverse, current storage system designs will be less able to meet the needs of a big data infrastructure. Storage vendors have begun to respond with block- and file-based systems designed to accommodate many of these requirements. Here’s a listing of some of the characteristics big data storage infrastructures need to incorporate to meet the challenges presented by big data.
Capacity. “Big” often translates into petabytes of data, so big data infrastructures certainly need to be able to scale. But they also need to scale easily, adding capacity in modules or arrays transparently to users, or at least without taking the system down. Scale-out storage is becoming a popular alternative for this use case. Scale-out’s clustered architecture features nodes of storage capacity with embedded processing power and connectivity that can grow seamlessly, avoiding the silos of storage that traditional systems can create.
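One common way for a clustered system to add nodes without reshuffling most of its data is consistent hashing. The article does not name a placement scheme, so the Python sketch below is only an illustration under that assumption; the `HashRing` class, the node names and the virtual-node count are hypothetical.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to a stable integer position on the hash ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (illustrative, not a vendor implementation)."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes      # virtual points per node, smooths the spread
        self._ring = []           # sorted list of (position, node name)
        self._positions = []      # positions only, kept in step for bisect
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place a node on the ring at several pseudo-random positions."""
        for i in range(self.vnodes):
            self._ring.append((ring_position(f"{node}#{i}"), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


# Demo: grow a four-node cluster by one node and count relocated objects.
ring = HashRing(["node-1", "node-2", "node-3", "node-4"])
keys = [f"object-{i}" for i in range(2000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-5")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(f"{len(moved)} of {len(keys)} objects moved to the new node")
```

In this sketch, only the objects claimed by the new node relocate; everything else stays where it was, which is the property that lets capacity grow without taking the system down.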
Big data also means a large number of files. Managing the accumulation of metadata for file systems at this level can reduce scalability and impact performance, a situation that can be a problem for traditional NAS systems. Object-based storage architectures, on the other hand, can allow big data storage systems to expand file counts into the billions without suffering the overhead problems that traditional file systems encounter. Object-based storage systems can also scale geographically, enabling large infrastructures to be spread across multiple locations.
Latency. A big data infrastructure may also have a real-time component, especially in use cases involving Web transactions or finance. For example, tailoring Web advertising to each user’s browsing history requires real-time analytics. Storage systems must be able to grow to the aforementioned proportions while maintaining performance because latency can produce “stale data.” Here, too, scale-out architectures enable the cluster of storage nodes to increase in processing power and connectivity as they grow in capacity. Object-based storage systems can parallelize data streams, further improving throughput.
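To make the idea of parallel data streams concrete, here is a minimal Python sketch in which an object is split into parts and the parts are fetched concurrently. It assumes a plain dictionary as a stand-in for the storage nodes; the function names and the tiny part size are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 4  # bytes per part; unrealistically small, for illustration only


def put_object(store, key, data, part_size=PART_SIZE):
    """Split data into fixed-size parts and store each under (key, index)."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    for index, part in enumerate(parts):
        store[(key, index)] = part
    return len(parts)


def get_object(store, key, part_count, workers=4):
    """Fetch all parts concurrently and reassemble them in order."""
    def fetch(index):
        # In a real system this would be a network read from a storage node.
        return store[(key, index)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so parts line up correctly.
        return b"".join(pool.map(fetch, range(part_count)))


store = {}
payload = b"hello big data world"
count = put_object(store, "report", payload)
print(get_object(store, "report", count) == payload)
```

With an in-memory dictionary there is no real speedup; the point is the shape of the access pattern, in which independent parts can be read from different nodes at the same time.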
Many big data environments, such as those in high-performance computing (HPC), will need to provide high IOPS performance. Server virtualization will drive high IOPS requirements, just as it does in traditional IT environments. To meet these challenges, solid-state storage devices can be implemented in many different formats, from a simple server-based cache to all-flash-based scalable storage systems.
Access. As companies get better at understanding the potential of big data analysis, the need to compare differing data sets will bring more people into the data sharing loop. In the quest to create business value, firms are looking at more ways to cross-reference different data objects from various platforms. Storage infrastructures that include global file systems can help address this issue, as they allow multiple users on multiple hosts to access files from many different back-end storage systems in multiple locations.
Security. Financial data, medical information and government intelligence carry their own security standards and requirements. While these may not be different from what current IT managers must accommodate, big data analytics may need to cross-reference data that may not have been co-mingled in the past, which may create some new security considerations.
Cost. “Big” can also mean expensive. And at the scale many organizations are operating their big data environments, cost containment will be an imperative. This means more efficiency “within the box,” as well as less expensive components. Storage deduplication has already entered the primary storage market and, depending on the data types involved, could bring some value for big data storage systems. The ability to reduce capacity consumption on the back end, even by a few percentage points, can provide a significant return on investment as data sets grow. Thin provisioning, snapshots and clones may also provide some efficiencies depending on the data types involved.
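As a rough illustration of how deduplication reduces back-end capacity, the Python sketch below stores each unique block once, keyed by its content hash. It assumes fixed-size blocks, SHA-256 fingerprints and an in-memory dictionary, none of which come from the article; the function names are hypothetical.

```python
import hashlib


def dedup_store(blocks):
    """Keep one copy of each unique block plus a recipe to rebuild the input."""
    unique = {}   # digest -> block contents, stored once
    recipe = []   # ordered digests describing the original sequence
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        unique.setdefault(digest, block)
        recipe.append(digest)
    return unique, recipe


def restore(unique, recipe):
    """Rebuild the original block sequence from the recipe."""
    return [unique[digest] for digest in recipe]


def capacity_savings(blocks, unique):
    """Fraction of logical capacity that no longer needs physical storage."""
    logical = sum(len(block) for block in blocks)
    physical = sum(len(block) for block in unique.values())
    return 1 - physical / logical if logical else 0.0


blocks = [b"aaaa", b"bbbb", b"aaaa", b"cccc", b"aaaa"]
unique, recipe = dedup_store(blocks)
print(len(unique), "unique blocks;", capacity_savings(blocks, unique))
```

How much is actually saved depends entirely on how repetitive the data is, which is why the benefit varies with the data types involved.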
Many big data storage systems will include an archive component, especially for those organizations dealing with historical trending or long-term retention requirements. Tape is still the most economical storage medium from a capacity/dollar standpoint, and archive systems that support multiterabyte cartridges are becoming the de facto standard in many of these environments.
What may have the biggest impact on cost containment is the use of commodity hardware. It’s clear that big data infrastructures won’t be able to rely on the big iron enterprises have traditionally turned to. Many of the first and largest big data users have developed their own “white-box” systems that leverage a commodity-oriented, cost-saving strategy. But more storage products are now coming out in the form of software that can be installed on existing systems or common, off-the-shelf hardware. In addition, many of these companies are selling their software technologies as commodity appliances or partnering with hardware manufacturers to produce similar offerings.
Persistence. Many big data applications involve regulatory compliance that dictates data be saved for years or decades. Medical information is often saved for the life of the patient. Financial information is typically saved for seven years. But big data users are also saving data longer because it’s part of a historical record or used for time-based analysis. This requirement for longevity means storage manufacturers need to include ongoing integrity checks and other long-term reliability features, as well as address the need for data-in-place upgrades.
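One simple form an ongoing integrity check can take is a periodic "scrub" that recomputes each object's checksum and compares it with the value recorded at write time. The Python sketch below assumes SHA-256 checksums and in-memory dictionaries standing in for the data store and its catalog; the function names are hypothetical, and real products may work quite differently.

```python
import hashlib


def checksum(data):
    """Return a SHA-256 fingerprint of the stored bytes."""
    return hashlib.sha256(data).hexdigest()


def write_object(store, catalog, key, data):
    """Store the data and record its checksum for later verification."""
    store[key] = data
    catalog[key] = checksum(data)


def scrub(store, catalog):
    """Return the keys whose contents are missing or no longer match."""
    damaged = []
    for key, expected in catalog.items():
        data = store.get(key)
        if data is None or checksum(data) != expected:
            damaged.append(key)
    return sorted(damaged)


store, catalog = {}, {}
write_object(store, catalog, "patient-record", b"original contents")
write_object(store, catalog, "ledger-entry", b"another object")
store["ledger-entry"] = b"another 0bject"  # simulate silent corruption
print(scrub(store, catalog))
```

A scrub like this only detects damage; repairing it would require a redundant copy or parity data, which this sketch does not model.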
Flexibility. Because big data storage infrastructures usually get very large, care must be taken in their design so they can grow and evolve along with the analytics component of the mission. Data migration is essentially a thing of the past in the big data world, especially since data may be in multiple locations. A big data storage infrastructure is essentially fixed once you begin to fill it, so it must be able to accommodate different use cases and data scenarios as it evolves.
Application awareness. Some of the first big data implementations involved application-specific infrastructures, such as systems developed for government projects or the white-box systems invented by large Internet services companies. Application awareness is becoming more common in mainstream storage systems as a way to improve efficiency or performance, and it’s a technology that should apply to big data environments.
Smaller users. As a business requirement, big data will trickle down to organizations that are much smaller than what some storage infrastructure marketing departments may associate with big data analytics. It’s not only for the “lunatic fringe” or oddball use cases anymore, so storage vendors playing in the big data space would do well to provide smaller configurations while focusing on the cost requirements.
BIO: Eric Slack is a senior analyst at Storage Switzerland.