  • Get the most out of a data lake, avoid building a data swamp

    As an industry, we’ve been talking about the promise of data lakes for more than a decade. It’s a compelling concept: put an end to data silos with a single repository for big data analytics. Imagine having one place to house all your data for analytics, supporting product-led growth and business insight. Sadly, the data lake idea went cold for a while because early attempts were built on Hadoop-based repositories that were on-prem and lacked resources and scalability. We ended up with a “Hadoop hangover.”

    Data lakes of the past were known for management challenges and slow time-to-value. But the accelerated adoption of cloud object storage, along with the exponential growth of data, has made them attractive again.

    In fact, we need data lakes to support data analytics now more than ever. Cloud object storage first became popular as a cost-effective way to temporarily store or archive data, but it has caught on because it is inexpensive, secure, durable, and elastic, and because it is easy to stream data into. These features make the cloud a perfect place to build a data lake, with one addressable exception.

    Data lake or data swamp?

    The economics, built-in security, and scalability of cloud object storage encourage organizations to store more and more data, creating a massive data lake with limitless potential for data analytics. Businesses understand that having more data (not less) can be a strategic advantage. Unfortunately, many recent data lake initiatives failed because the data lake became a data swamp: a pool of cold data that could not be easily accessed or used. Many found that it’s easy to send data to the cloud, but hard to make it accessible to the users across the organization who could analyze it and act on the insights. These data lakes became a dumping ground for multi-structured datasets, collecting digital dust without a glimmer of the promised strategic advantage.

    Simply put, cloud object storage wasn’t built for general-purpose analytics, just as Hadoop wasn’t. To gain insights, data must be transformed and moved out of the lake into an analytical database such as Splunk, MySQL, or Oracle, depending on the use case. This process is complex, slow, and costly. It is made harder by the industry’s current shortage of data engineers, who are needed to cleanse and transform data and build the pipelines that feed these analytical systems.
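
    To make that cost concrete, here is a minimal sketch of the kind of transform-and-load hop described above, assuming a hypothetical bucket, object key, and target table, with SQLite standing in for the analytical database:

    ```python
    # Illustrative only: the bucket, key, column names, and table are hypothetical.
    import boto3
    import pandas as pd
    from sqlalchemy import create_engine

    # 1. Pull the raw object out of the lake.
    s3 = boto3.client("s3")  # credentials come from the environment
    raw = s3.get_object(Bucket="my-data-lake", Key="events/2021/03/events.csv")

    # 2. Cleanse and transform in memory (real pipelines do much more than this).
    df = pd.read_csv(raw["Body"])
    df = df.dropna(subset=["user_id"])
    df["event_time"] = pd.to_datetime(df["event_time"])

    # 3. Load into a separate analytical database (SQLite stands in here).
    engine = create_engine("sqlite:///analytics.db")
    df.to_sql("events", engine, if_exists="append", index=False)
    ```

    Every dataset and every use case needs its own version of this pipeline, which is where the engineering cost accumulates.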

    Gartner found that more than half of enterprises plan to invest in a data lake within the next two years despite these well-known challenges. There are an incredible number of use cases for the data lake, from investigating cyber-breaches through security logs to researching and improving customer experience. It’s no wonder that businesses are still holding onto the promise of the data lake. So how can we clean up the swamp and make sure these efforts don’t fail? And critically, how do we unlock and provide access to data stored in the cloud, the most significant barrier of all?

    Turning up the heat on cold cloud storage

    It’s possible (and preferable) to make cloud object storage hot for data analytics, but it requires rethinking the architecture. The storage needs the look and feel of a database; in essence, cloud object storage becomes a high-performance analytics database or warehouse. “Hot data” means fast, easy access in minutes, not weeks or months, even when processing tens of terabytes per day. That type of performance requires a different approach to pipelining data, one that avoids transformation and movement. The architecture is as simple as compressing, indexing, and publishing data to tools such as Kibana and/or Looker via well-known APIs, so that data is stored once and moved and processed less.

    One of the most important ways to turn up the heat on data analytics is by facilitating search. Search is the ultimate democratizer of data, allowing self-service data stream selection and publishing without IT admins or database engineers. All data should be fully searchable and available for analysis using existing data tools. Imagine giving users the ability to search and query at will, asking questions and analyzing data with ease. Most of the better-known data warehouse and data lakehouse platforms don’t provide this critical functionality.
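
    As a rough sketch of what that self-service looks like in practice: platforms that publish indexed object-storage data through an Elasticsearch-compatible API let existing tools and scripts query it directly. The endpoint and index name below are hypothetical; the query body is standard Elasticsearch query DSL, the same language Kibana speaks.

    ```python
    # Illustrative only: the endpoint and index name are hypothetical. The point
    # is that data indexed over object storage can be queried through the same
    # Elasticsearch-compatible _search API that tools like Kibana already use.
    import requests

    ENDPOINT = "https://analytics.example.com"   # platform's API endpoint (assumed)
    INDEX = "app-logs"                           # a published data stream (assumed)

    query = {
        "query": {
            "bool": {
                "must": [{"match": {"status": "error"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-24h"}}}],
            }
        },
        "size": 20,
    }

    resp = requests.post(f"{ENDPOINT}/{INDEX}/_search", json=query, timeout=30)
    resp.raise_for_status()
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_source"])
    ```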

    But some forward-leaning enterprises have found a way. Take, for example, BAI Communications, whose data lake strategy embraces this type of architecture. In major commuter cities, BAI provides state-of-the-art communications infrastructure (cellular, Wi-Fi, broadcast, radio, and IP networks). BAI streams its data to a centralized data lake built on Amazon S3 cloud object storage, where it is secure and compliant with numerous government regulations. Using its data lake built on cloud object storage, activated for analytics through a multi-API data lake platform, BAI can find, access, and analyze its data faster, more easily, and in a more cost-controlled manner than ever before. The company uses insights generated from its global networks over multiple years to help rail operators maintain the flow of traffic and optimize routes, turning data insights into business value. This approach proved especially valuable when the pandemic hit: BAI was able to deeply understand how COVID-19 affected public transit networks around the world, so it could continue providing critical connectivity to citizens.

    Another example is Blackboard, the leader in education technology serving K–12 education, business, and government clients. Blackboard’s product development team typically used log analytics to monitor cloud deployments of the company’s SaaS-based learning management system (LMS) and to troubleshoot application issues. But when COVID-19 hit, millions of students switched to online learning and log volumes skyrocketed; product usage grew by 3,000% in 2020 when the world went virtual. The company’s custom-managed ELK (Elasticsearch, Logstash, Kibana) stacks and managed Elasticsearch service for centralized log management couldn’t support the new log volumes, at a time when that log data was most valuable. The Blackboard team needed to analyze short-term data for troubleshooting as well as long-term data for deeper analysis and compliance purposes. It moved its log data to a data lake platform running directly on Amazon S3 and serving analytics to end users via Kibana, which is included natively under the hood. The company now has day-to-day visibility into its cloud computing environments at scale, application troubleshooting and alerting over long periods of time, root cause analysis without data retention limits, and fast resolution of application performance issues.

    Now we’re cooking

    Cloud storage has the potential to truly democratize data analytics for businesses. There’s no better or more cost-effective place to store a company’s treasure trove of information. The trick is unlocking cloud object storage for analytics without data movement or pipelining. Many data lake, warehouse, and even lakehouse providers have the right idea, but their underlying architectures are based on 1970s computer science, making the process brittle, complex, and slow.

    If you are developing or implementing a data lake and want to avoid building a swamp, ask yourself these questions:

    • What business use cases or analytics questions should we be able to address with the data lake?
    • How will data get into the data lake?
    • How will users across the organization get access to the data in the lake?
    • What analytics tools need to be connected to the data lake to facilitate the democratization of insights?

    It is important to find a solution that allows you to turn up the heat in the data lake with a platform that is cost-effective, elastically scalable, fast, and easily accessible. A winning solution allows business analysts to query all the data in the data lake using the BI tools they know and love, without any data movement, transformation, or governance risk.

    Author: Thomas Hazel

    Source: Database Trends & Applications

  • IBM: Hadoop solutions and the data lakes of the future

    The foundation of the AI Ladder is Information Architecture. The modern data-driven enterprise needs to leverage the right tools to collect, organize, and analyze its data before it can infuse its business with the results.

    Businesses have many types of data and many ways to apply it. We must look for approaches to manage all forms of data, regardless of technique (e.g., relational, map reduce) or use case (e.g., analytics, business intelligence, business process automation). Data must be stored securely and reliably, while minimizing costs and maximizing utility.

    Object storage is the ideal place to collect, store, and manage data assets which will be used to generate business value.

    Object storage started as an archive

    Object storage was first conceived as a simplification: How can we remove the extraneous functions seen in file systems to make storage more scalable, reliable, and low-cost? Technologies like erasure coding massively reduced costs by allowing reliable storage to be built on cheap commodity hardware. The interface was simple: a uniform, limitless namespace and atomic data writes, all available over HTTP.
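
    The whole interface fits in a few lines. Below is a minimal sketch using boto3 against a generic S3-compatible endpoint; the endpoint URL, bucket, and key are placeholders, not any particular product’s values.

    ```python
    # Minimal sketch of the object interface: write an object, read it back.
    # The endpoint URL, bucket, and key are placeholders for any S3-compatible store.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example-objectstore.com",  # assumed endpoint
    )

    # Atomic write: the object appears fully written or not at all.
    s3.put_object(Bucket="my-bucket", Key="archive/2021/report.json",
                  Body=b'{"status": "ok"}')

    # Read it back over the same flat, limitless namespace.
    obj = s3.get_object(Bucket="my-bucket", Key="archive/2021/report.json")
    print(obj["Body"].read())
    ```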

    And object storage excels at storing data. For example, IBM Cloud Object Storage is designed for over ten 9’s of durability, offers robust data protection and security features and flexible pricing, and is natively integrated with IBM Aspera on Cloud high-speed data transfer.

    The first use cases were obvious: Object storage was a more scalable file system that could be used to store unstructured data like music, images, and video, or to store backup files, database dumps, and log files. Its single namespace and data tiering options allow it to be used for data archiving, and its HTTP interface makes it convenient for serving static website content as part of cloud-native applications.

    But, beyond that, it was just a data dump.

    Map reduce and the rise of Hadoop solutions

    At the same time object storage was replacing file system use cases, the map reduce programming model was emerging in data analytics. Apache Hadoop provided a software framework for processing big data workloads that traditional relational database management system (RDBMS) solutions could not effectively manage. Data scientists and analysts had to give up declarative data management and SQL queries, but gained the ability to work with exponentially larger data sets and the freedom to explore unstructured and semi-structured data.
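
    For readers who have not worked with the model, here is a toy word-count sketch of map reduce in plain Python, with no Hadoop involved: map emits key/value pairs, the framework groups values by key, and reduce aggregates each group. At scale, the map and reduce calls run in parallel across a cluster.

    ```python
    # Toy illustration of the map reduce model (word count), not Hadoop itself:
    # map emits (key, value) pairs, the framework groups by key, reduce aggregates.
    from collections import defaultdict

    def map_phase(record):
        for word in record.split():
            yield word.lower(), 1

    def reduce_phase(key, values):
        return key, sum(values)

    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Shuffle/sort: group every emitted value by its key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)

    # Reduce each group independently (this is what runs in parallel at scale).
    counts = dict(reduce_phase(k, v) for k, v in groups.items())
    print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, ...}
    ```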

    In the beginning, Hadoop was seen by some as a step backwards. It achieved a measure of scale and cost savings but gave up much of what made RDBMS systems so powerful and easy to work with. Not requiring schemas added flexibility, but query latencies rose and overall performance dropped. Still, the Hadoop ecosystem has continued to expand and meet user needs: Spark massively increased performance, and Hive provided SQL querying.

    Hadoop is not suitable for everything. Transactional processing is still a better fit for an RDBMS. Businesses must use the appropriate technology for their various OLAP and OLTP needs.

    HDFS became the de facto data lake for many enterprises

    Like object storage, Hadoop was also designed as a scale-out architecture on cheap commodity hardware. The Hadoop Distributed File System (HDFS) began with the premise that compute should be moved to the data. It was designed to place data on locally attached storage on the compute nodes themselves, stored in a form that could be read directly by locally running tasks without a network hop.

    While this is beneficial for many types of workloads, it wasn’t the end of ETL envisioned by some. By placing readable copies of data directly on compute nodes, HDFS can’t take advantage of erasure coding to save costs; when data reliability is required, replication is simply not as cost-effective. Furthermore, storage can’t be scaled independently from compute. As workloads diversified, this inflexibility caused management and cost issues. For many jobs, the network wasn’t actually the bottleneck: compute bottlenecks are typically CPU or memory issues, and storage bottlenecks are typically spindle related, constrained by disk throughput or seeks.
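
    A quick back-of-the-envelope comparison shows the gap. The numbers below assume 3x replication (the HDFS default) and a 10+4 erasure-coding layout; the exact scheme varies by deployment, but the shape of the result does not.

    ```python
    # Back-of-the-envelope storage overhead: 3x replication vs. erasure coding.
    # The 10+4 scheme is an illustrative choice, not a recommendation.
    usable_tb = 1000            # logical data to protect

    replication_factor = 3      # HDFS default
    raw_replicated = usable_tb * replication_factor

    data_fragments, parity_fragments = 10, 4     # illustrative erasure-coding layout
    expansion = (data_fragments + parity_fragments) / data_fragments
    raw_erasure_coded = usable_tb * expansion

    print(f"Replication:    {raw_replicated:.0f} TB raw (overhead {replication_factor - 1:.0%})")
    print(f"Erasure coding: {raw_erasure_coded:.0f} TB raw (overhead {expansion - 1:.0%})")
    # -> 3000 TB vs. 1400 TB of raw capacity for the same 1000 TB of protected data.
    ```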

    When you separate compute from storage, both sides benefit. Compute nodes are cheaper because they don’t need large amounts of storage, can be scaled up or down quickly without massive data migration costs, and jobs can even be run in isolation when necessary.

    For storage, you want to spread out your load onto as many spindles as possible, so using a smaller active data set on a large data pool is beneficial, and your dedicated storage budget can be reserved for smaller flash clusters when latency is really an issue.

    The scale and cost savings of Hadoop attracted many businesses to use it wherever possible, and many businesses ended up using HDFS as their primary data store. This has led to cost, manageability, and flexibility issues.

    The data lake of the future supports all compute workloads

    Object storage was always useful as a way of backing up your databases, and it could be used to offload data from HDFS to a lower-cost tier as well. In fact, this is the first thing enterprises typically do.

    But, a data lake is more than just a data dump.

    A data lake is a place to collect an organization’s data for future use. Yes, it needs to store and protect data in a highly scalable, secure, and cost-effective manner, and object storage has always provided this. But when data is stored in the data lake, it is often not known how it will be used and how it will be turned into value. Thus, it is essential that the data lake include good integration with a range of data processing, analytics, and AI tools.

    Typical tools will include not only big data tools such as Hadoop, Spark, and Hive, but also deep learning frameworks (such as TensorFlow) and analytics tools (such as Pandas). In addition, it is essential for a data lake to support tools for cataloging, messaging, and transforming the data, to enable exploration and repurposing of data assets.

    Object storage can store data in formats native to big data and analytics tools. Your Hadoop and Spark jobs can directly access object storage through the S3a or Stocator connectors using the IBM Analytics Engine. IBM uses these techniques against IBM Cloud Object Storage for operational and analytic needs.

    Just like with Hadoop, you can also leverage object storage to directly perform SQL queries. The IBM Cloud SQL Query service uses Spark internally to perform ad-hoc and OLAP queries against data stored directly in IBM Cloud Object Storage buckets.
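
    The pattern both of these paragraphs describe looks roughly like the sketch below: Spark reads objects in place over the S3A connector and runs SQL against them. The endpoint, credentials, bucket path, and column names are hypothetical, and a managed service such as IBM Analytics Engine or SQL Query handles this wiring for you.

    ```python
    # Sketch of querying data in place on object storage with Spark SQL.
    # Bucket, path, endpoint, credentials, and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("lake-query")
        # Standard Hadoop S3A settings for an S3-compatible object store.
        .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example-objectstore.com")
        .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
        .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
        .getOrCreate()
    )

    # Read Parquet objects directly; no copy into HDFS or a warehouse first.
    events = spark.read.parquet("s3a://my-data-lake/events/")
    events.createOrReplaceTempView("events")

    # Ad-hoc SQL over the same objects.
    spark.sql("""
        SELECT region, COUNT(*) AS sessions
        FROM events
        WHERE event_date >= '2021-01-01'
        GROUP BY region
        ORDER BY sessions DESC
    """).show()
    ```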

    TensorFlow can also be used to train and deploy ML models directly using data in object storage.
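
    A minimal sketch of that workflow, assuming a hypothetical bucket and a small CSV of four feature columns plus a binary label: pull the object with boto3, load it into NumPy, and train a tiny Keras model.

    ```python
    # Sketch: train a small model on data pulled straight from object storage.
    # Bucket, key, and the CSV layout (4 feature columns + a binary label) are assumed.
    import io

    import boto3
    import numpy as np
    import tensorflow as tf

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-data-lake", Key="training/features.csv")
    data = np.loadtxt(io.BytesIO(obj["Body"].read()), delimiter=",")
    features, labels = data[:, :4], data[:, 4]

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(features, labels, epochs=5, batch_size=32)
    ```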

    This is the true future of object storage

    As organizations look to modernize their information architecture, they save money when they build their data lake with object storage.

    For existing HDFS shops, this can be done incrementally, but be sure to keep your data carefully organized as you do so, in order to take full advantage of the rich capabilities available now and in the future.

    Authors: Wesly Leggette & Michael Factor

    Source: IBM
