IBM: Hadoop solutions and the data lakes of the future
The foundation of the AI Ladder is Information Architecture. The modern data-driven enterprise needs to leverage the right tools to collect, organize, and analyze its data before it can infuse its business with the results.
Businesses have many types of data and many ways to apply it. We must look for approaches to manage all forms of data, regardless of technique (e.g., relational, MapReduce) or use case (e.g., analytics, business intelligence, business process automation). Data must be stored securely and reliably, while minimizing costs and maximizing utility.
Object storage is the ideal place to collect, store, and manage data assets which will be used to generate business value.
Object storage started as an archive
Object storage was first conceived as a simplification: How can we remove the extraneous functions seen in file systems to make storage more scalable, reliable, and low-cost? Technologies like erasure coding massively reduced costs by allowing reliable storage to be built on cheap commodity hardware. The interface was simple: a uniform, limitless namespace and atomic data writes, all available over HTTP.
And object storage excels at data storage. For example, IBM Cloud Object Storage is designed for over ten 9's of durability, offers robust data protection and security features with flexible pricing, and is natively integrated with IBM Aspera on Cloud high-speed data transfer.
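A minimal sketch of that simple interface, using the S3-compatible API that IBM Cloud Object Storage exposes (the endpoint, bucket name, and credentials below are placeholders, not values from this article):

```python
import boto3

# Placeholder endpoint and HMAC credentials for an IBM Cloud Object Storage instance.
cos = boto3.client(
    "s3",
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
    aws_access_key_id="<ACCESS_KEY>",
    aws_secret_access_key="<SECRET_KEY>",
)

# Atomic write: the object becomes visible only once the PUT completes.
cos.put_object(
    Bucket="my-data-lake",
    Key="raw/logs/2021-01-01.json",
    Body=b'{"event": "login"}',
)

# Read it back over the same flat bucket/key namespace.
obj = cos.get_object(Bucket="my-data-lake", Key="raw/logs/2021-01-01.json")
print(obj["Body"].read())
```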
The first use cases were obvious: Object storage was a more scalable file system that could be used to store unstructured data like music, images, and video, or to store backup files, database dumps, and log files. Its single namespace and data tiering options allow it to be used for data archiving, and its HTTP interface makes it convenient for serving static website content as part of cloud-native applications.
But, beyond that, it was just a data dump.
MapReduce and the rise of Hadoop solutions
At the same time object storage was replacing file system use cases, the MapReduce programming model was emerging in data analytics. Apache Hadoop provided a software framework for processing big data workloads that traditional relational database management system (RDBMS) solutions could not effectively manage. Data scientists and analysts had to give up declarative data management and SQL queries, but gained the ability to work with exponentially larger data sets and the freedom to explore unstructured and semi-structured data.
In the beginning, Hadoop was seen by some as a step backwards. It achieved a measure of scale and cost savings but gave up much of what made RDBMS systems so powerful and easy to work with. While not requiring schemas added flexibility, query latency and overall performance suffered. But the Hadoop ecosystem has continued to expand to meet user needs: Spark massively increased performance, and Hive provided SQL querying.
Hadoop is not suitable for everything. Transactional processing is still a better fit for an RDBMS. Businesses must use the appropriate technology for their various OLAP and OLTP needs.
HDFS became the de facto data lake for many enterprises
Like object storage, Hadoop was designed as a scale-out architecture on cheap commodity hardware. The Hadoop Distributed File System (HDFS) began with the premise that compute should be moved to the data: it places data on locally attached storage on the compute nodes themselves, stored in a form that can be read directly by locally running tasks without a network hop.
While this is beneficial for many types of workloads, it wasn't the end of ETL envisioned by some. By placing readable copies of data directly on compute nodes, HDFS can't take advantage of erasure coding to save costs; when data reliability is required, replication is simply not as cost-effective. Furthermore, storage can't be scaled independently of compute. As workloads diversified, this inflexibility caused management and cost issues. For many jobs, the network wasn't actually the bottleneck: compute bottlenecks are typically CPU or memory issues, and storage bottlenecks are typically hard drive spindle related, either throughput or seek constrained.
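To make the cost difference concrete, here is a back-of-the-envelope comparison (the 3x factor is HDFS's default replication; the 10+4 erasure-code layout is simply an illustrative choice, not a figure from this article):

```python
# Raw capacity needed to store 1 PB of data reliably under two protection schemes.
usable_pb = 1.0

# HDFS-style triple replication: every block is stored three times.
replication_factor = 3
replicated_raw = usable_pb * replication_factor             # 3.0 PB of raw disk

# Erasure coding with 10 data + 4 parity fragments (illustrative layout).
data_fragments, parity_fragments = 10, 4
erasure_overhead = (data_fragments + parity_fragments) / data_fragments
erasure_raw = usable_pb * erasure_overhead                   # 1.4 PB of raw disk

print(f"Replication:    {replicated_raw:.1f} PB raw for {usable_pb} PB usable")
print(f"Erasure coding: {erasure_raw:.1f} PB raw for {usable_pb} PB usable")
```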
When you separate compute from storage, both sides benefit. Compute nodes are cheaper because they don't need large amounts of storage, they can be scaled up or down quickly without massive data migration costs, and jobs can even be run in isolation when necessary.
For storage, you want to spread the load across as many spindles as possible, so serving a smaller active data set from a large data pool is beneficial, and your dedicated storage budget can be reserved for smaller flash clusters when latency really is an issue.
The scale and cost savings of Hadoop attracted many businesses to use it wherever possible, and many businesses ended up using HDFS as their primary data store. This has led to cost, manageability, and flexibility issues.
The data lake of the future supports all compute workloads
Object storage has always been useful for backing up databases, and it can also be used to offload data from HDFS to a lower-cost tier. In fact, this is typically the first thing enterprises do.
But, a data lake is more than just a data dump.
A data lake is a place to collect an organization’s data for future use. Yes, it needs to store and protect data in a highly scalable, secure, and cost-effective manner, and object storage has always provided this. But when data is stored in the data lake, it is often not known how it will be used and how it will be turned into value. Thus, it is essential that the data lake include good integration with a range of data processing, analytics, and AI tools.
Typical tools include not only big data tools such as Hadoop, Spark, and Hive, but also deep learning frameworks (such as TensorFlow) and analytics tools (such as Pandas). In addition, it is essential for a data lake to support tools for cataloging, messaging, and transforming the data, so that data assets can be explored and repurposed.
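For example, an analyst could pull a dataset from the lake straight into a DataFrame. The sketch below assumes the s3fs and pyarrow packages are installed and uses placeholder credentials, bucket, and endpoint values:

```python
import pandas as pd

# pandas delegates "s3://" URLs to s3fs; point it at the (placeholder) COS endpoint.
df = pd.read_parquet(
    "s3://my-data-lake/curated/sales/2021/",
    storage_options={
        "key": "<ACCESS_KEY>",
        "secret": "<SECRET_KEY>",
        "client_kwargs": {
            "endpoint_url": "https://s3.us-south.cloud-object-storage.appdomain.cloud"
        },
    },
)
print(df.describe())
```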
Object storage can store data in formats native to big data and analytics tools. Your Hadoop and Spark jobs can directly access object storage through the S3A or Stocator connectors using the IBM Analytics Engine. IBM uses these techniques against IBM Cloud Object Storage for its own operational and analytic needs.
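A sketch of that access path using the generic S3A connector (Stocator follows the same pattern with cos:// URIs); the endpoint, credentials, bucket, and column name are placeholders:

```python
from pyspark.sql import SparkSession

# Point Hadoop's S3A connector at the (placeholder) IBM COS endpoint.
spark = (
    SparkSession.builder.appName("cos-analytics")
    .config("spark.hadoop.fs.s3a.endpoint",
            "https://s3.us-south.cloud-object-storage.appdomain.cloud")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

# Read Parquet objects directly from the bucket, with no HDFS copy step.
events = spark.read.parquet("s3a://my-data-lake/raw/events/")
events.groupBy("event_type").count().show()
```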
Just like with Hadoop, you can also leverage object storage to directly perform SQL queries. The IBM Cloud SQL Query service uses Spark internally to perform ad-hoc and OLAP queries against data stored directly in IBM Cloud Object Storage buckets.
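A hedged sketch of that pattern using the ibmcloudsql Python client; the constructor arguments, bucket URLs, and query are illustrative assumptions, so consult the service documentation for the exact interface:

```python
import ibmcloudsql

# Placeholder credentials and result location (assumed parameter names).
sql_client = ibmcloudsql.SQLQuery(
    api_key="<IBM_CLOUD_API_KEY>",
    instance_crn="<SQL_QUERY_INSTANCE_CRN>",
    target_cos_url="cos://us-geo/my-results-bucket/",
)

# Ad hoc OLAP query run directly against objects in a COS bucket.
result_df = sql_client.run_sql(
    "SELECT event_type, COUNT(*) AS events "
    "FROM cos://us-geo/my-data-lake/raw/events/ STORED AS PARQUET "
    "GROUP BY event_type"
)
print(result_df.head())
```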
TensorFlow can also be used to train and deploy ML models directly using data in object storage.
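For instance, a tf.data pipeline can stream training records straight from a bucket. This sketch assumes TensorFlow's S3 filesystem support (registered by the tensorflow-io package) and HMAC credentials supplied through AWS-style environment variables; the paths and variable names follow that filesystem's conventions and may vary by version:

```python
import os
import tensorflow as tf
import tensorflow_io  # noqa: F401  # registers the s3:// filesystem (assumes the package is installed)

# HMAC credentials and the (placeholder) COS endpoint, passed the way the S3 filesystem expects.
os.environ["AWS_ACCESS_KEY_ID"] = "<ACCESS_KEY>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<SECRET_KEY>"
os.environ["S3_ENDPOINT"] = "s3.us-south.cloud-object-storage.appdomain.cloud"

# Stream TFRecord training examples directly from object storage.
dataset = (
    tf.data.TFRecordDataset("s3://my-data-lake/training/images.tfrecord")
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```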
This is the true future of object storage
As organizations look to modernize their information architecture, they save money when they build their data lake with object storage.
For existing HDFS shops, this transition can be made incrementally, but be sure to keep your data carefully organized as you go, so you can take full advantage of the rich capabilities available now and in the future.
Authors: Wesly Leggette & Michael Factor
Source: IBM