2 items tagged "data warehouse"

  • Data Warehousing Lessons for A Data Lake World



    Over the past 2 decades, we have spent considerable time and effort trying to perfect the world of data warehousing. We took the technology that we were given and the data that would fit into that technology, and tried to provide our business constituents with the reports and dashboards necessary to run the businesses.

    It was a lot of hard work and we had to do many “unnatural” acts to get these OLTP (Online Transaction Processing)-centric technologies to work; aggregated tables, plethora of indices, user defined functions (UDF) in PL/SQL, and materialized views just to name a few. Kudos to us!!

    Now as we get ready for the full onslaught of the data lake, what lessons can we take away from our data warehousing experiences? I don’t have all the insights, but I offer this blog in hopes that others will comment and contribute. In the end, we want to learn from our data warehousing mistakes, but we don’t want to throw out those valuable learnings.

    Special thanks to Joe DosSantos (@JoeDosSantos) for his help on this blog.

    Why Did Data Warehousing Fail?

    Below is the list of areas where data warehousing struggled or outright failed. Again, this list is not comprehensive, and I encourage your contributions.

    • Adding New Data Takes Too Long. It took too long to load new data into the data warehouse. The general rule to add new data to a data warehouse was 3 months and $1 million. Because of the need to pre-build a schema before loading data into the data warehouse, the addition of new data sources to the data warehouse was a major effort. We had to conduct weeks of interviews with every potential user to capture every question they might ever want to ask in order to build a schema that handled all of their query and reporting requirements. This greatly hindered our ability to quickly explore new data sources, so organizations resorted to other options, which leads to…
    • Data Silos. Because it took so long to add new data sources to the data warehouse, organizations found it more expedient to build their own data marts, spreadmarts[1] or Access databases. Very quickly there was a wide-spread proliferation of these purpose built data stores across the organization. The result: no single version of the truth and lots of executive meetings wasting time debating whose version of the data was most accurate, which leads to…
    • Lack of Business Confidence. Because there was this proliferation of data across the organization and the resulting executive debates around whose data was most accurate, business leaders’ confidence in the data (and the data warehouse) quickly faded. This became especially true when the data being used to run a business unit was redefined for corporate use in such a way that it was not useful to the business. Take, for instance, a sales manager looking to assign a quota to his rep that manages the GE account and wants a report of historical sales. For him, sales might be Gross and GE might include Synchrony, whereas the corporate division might look at sales as Net or Adjusted and GE as its legal entities. It’s not so much a question of right and wrong as much as it is the enterprise introducing definitions that undermines confidence, which leads to…
    • Underinvestment In Metadata. No business leader had the time to verify the accuracy of the data, and no IT person knew the business well enough to make those data accuracy decisions. Plus, spending the money to hire consultants to do our job for us was always a hard internal sell, which leads to the metadata management denial cycle:
      • IT: “You business users need to own the data.”
      • Business: “We don’t have time to do that.”
      • IT: “Okay, let’s hire consultants.”
      • Business: “Shouldn’t we know our data better than consultants?”
      • IT: “Okay, you business users need to own the data”
      • And so forth…
    • Inability to Easily Share Data. The data warehouse lacked the ability to quickly ingest and consequently easily share data across different business functions and use cases. The data warehouse failed to become that single repository for the storage of the organization’s data assets because of the complexity, difficulty and slowness to add new data to the data warehouse, which leads to…
    • Shadow IT Spend. Nothing confirms the failure of the data warehouse more than shadow IT spend. Business users did not have confidence in how the data warehouse could help them address urgent business needs. Consequently, many line of business leaders pursued their own one-off IT initiatives (call center operations, sales force automation, campaign marketing, logistics planning, financial planning, etc.), which also further contributed to the unmanageable proliferation of data across the organizational data silos.
    • Inability to Handle Unstructured Data. Data warehouses cannot handle unstructured data. Unfortunately the bulk of the world’s data is now found in semi-structured data (log files, sensors, beacons, routers, MAC addresses) and unstructured data (text files, social media postings, audio files, photos, video files). Organizations who wanted a holistic view of the business had to make do with only 10 to 20% of the available organizational data. Hard to provide a holistic view with a 80% to 90% hole in that view.
    • No Predictive Analytic Capabilities. Business Intelligence solutions provide the summarized data necessary to support the organization’s operational and management reporting needs (descriptive analytics). However, most data warehouses lacked the detailed data across a wide variety of structured and unstructured data sources to support the organization’s predictive and prescriptive analytic needs.
    • Too Damned Expensive. Data science is about creating behavioral analytics at the individual levels (e.g., customers, employees, jet engine, train engine, truck, wind turbine, etc.). To uncover these behavioral analytics at the individual level, data scientists need the complete history of detailed transactional, operational and engagement data. The data scientists don’t want 13 months of aggregated data; they want 17 years of detailed transactions, even if that data is now located on mag tape. Trying to gather all of the voluminous data on a data warehouse is a recipe for organizational bankruptcy.
    • Inadequate Processing Power. Let’s face it; data warehouses lacked the economical processing power necessary to analyze petabytes of customer and machine data to uncover behavioral patterns and propensities. The data lake is built on modern, big data scale-out environments using open source software built on commodity servers are game changers in allowing organizations to store and analyze data volumes magnitudes bigger than one could ever economically fit into a data warehouse.

    What Did Data Warehousing Get Right?

    Okay, I was pretty harsh on the data warehouse world in which I grew up. But again, it was amazing what we were able to do with technology designed to deal with single records (insert, update, delete). I have never constructed analytics that uses only a single record. Analytics requires a massive number of records in order to uncover individual behaviors, propensities, tendencies, patterns, etc.

    So what did we get right, and what should we preserve as we move into the modern data lake world?

    • Data Governance. Data governance, into which I also group things like data accuracy, data lineage and data traceability, is as important now as it was in the data warehouse world. Having a process that allows the data science team to quickly ingest and explore the data unencumbered by data governance is a good practice. However you will need data governance rules, policies and procedures once you have determined that there is value in that data to support key decisions. If the business users do not have confidence in the data, then all is lost.
    • Metadata Management. The importance of metadata only becomes clearer as we begin to integrate data and analytics into the organization’s key business processes. The more metadata that we have about the data, the easier it is to get value from that data. Investing in the associated metadata carries the same economic value as investing it the data itself, IMHO. We want to enrich the data as much as possible, and a solid metadata management strategy is key for making that happen.
    • Conformed Dimensions. Having a single master file – or conformed dimension – for key business entities (e.g., products, customers, employees, physicians, teachers, stores, jet engines, locomotives, delivery trucks, etc.) is critical. It is these conformed dimensions that allow the data science team to tie together the wide variety of data sources to create the detailed analytic and behavioral profiles. Maintaining these conformed dimensions is hard work, but without them, there is no way to turn all this valuable data (and metadata) into actionable insights.
    • Single Version of The Truth. While I have always hated the term “single version of the truth,” operationally it is important to have all the data about your key business entities in a single (physical or logical) location. Also, in the Big Data world, the notion of data that is fit for purpose becomes critical. There may not be one truth, but there should be clarity as to how numbers are produced to provide transparency and trust.
    • Analytics Self-service. The idea of creating a self-service environment around analytics is very powerful. How do I pull IT out of the middle of the analytics request and provisioning process? If I truly want to create an environment where the analysts can quickly spin up an analytics sandbox and populate with data, I can’t have heavy manual processes in the middle of that process.
    • Reports Starting Point. The many reports and dashboards that have been built upon your data warehouse are a great starting point for your data lake journey. Business users have requested those reports for a reason. Instead of focusing time and effort to create yet more reports, first try to understand what questions and decisions the business users hoped to address with those reports, and what additional predictive and prescriptive insights do they need from those reports.
    • Yeah, SQL is still the query language of choice and we need to embrace how we help SQL-trained analysts to use that tool on the data lake. Open-source tools like Hive, HBase, and HAWQ are all designed to enable that army of SQL-trained business users and analysts to have access to the wealth of data in the data lake.


    There is much that can be learned from our data warehousing experiences. The key is to understand what to keep and what to throw out. That means a single data lake (not data lakes). That means data governance. That means metadata management, and even more that we learned from our data warehousing experiences. We must learn from our experiences, otherwise…

    “Those who do not learn history are doomed to repeat it.”

    [1] Spreadmart (short for “spreadsheet data mart”) is a business intelligence term that refers to the propensity of some organizations or departments within organizations to use individual, desktop-based databases like spreadsheets as a primary means of data organization, storage, and dissemination.



  • How modern ETL processes lead to better decision making

    How modern ETL processes lead to better decision making

    We live in a world of data: there’s more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with data is your window into the ways data teams are tackling the challenges of this new world to help their companies and their customers thrive.

    In recent years we’ve seen data become vastly more available to businesses. This is mostly due to the rise of data warehouses, decrease in cost for data storage, and modern ETL tools that make storing and accessing data more approachable than ever before. This has allowed companies to become more and more data driven in all areas of their business. In fact, being data driven has become ubiquitous and imperative to survival in today’s climate. This article will discuss at a high level how modern businesses are leveraging new technology to ingest a wider variety of data sources. 

    As of 2019, according to Bissfully’s 2020 SaaS trends report, companies in all phases of business growth make use of many SaaS aaplications. smaller companies (0-50 employees) use an average of 102 SaaS applications, while mid-sized ones (101-250 employees) use about 137 SaaS applications. Enterprises, meanwhile, can have as many as 288 SaaS applications in use across the entire organization!

    These SaaS products range from CRMs to marketing automation apps to billing software to help desk tools, etc. These applications can log thousands to millions of rows of data made available via API, much of which can hold valuable insight for businesses. Add to this the company’s own internally-gathered product data and it’s clear that even a small or medium-sized business can do a lot with the data at its disposal. In this article, we’ll cover what the ETL process is, why it’s important to powering data-driven businesses, and how the rise of modern ETL tools has changed the ETL process forever.

    The ETL process: foundation of data-driven businesses

    Every company wants every team within their business to make smarter, data-driven decisions. Customer support teams look at trends in support tickets or do text analysis on conversations to understand where they can provide better onboarding and documentation. Marketing teams want better visibility into their ad performance across different platforms and the ROI on their spend. Product and engineering teams dig into productivity metrics or bug reports to help them better prioritize their resources. 

    The ETL process empowers all these different teams to get the information they need to understand and perform their jobs better. Businesses ingest data from a wide array of sources through the ETL process, which stands for Extract, Transform, Load. The prepared data is then available for analysis and use by the various teams who need it, as well as for advanced analytics, embedding into applications, and use for other data monetization efforts. Whatever you want to do with data, you have to ETL it first. 

    Explaining the traditional ETL process

    Traditionally, the extraction step of the process involves batch-pulling the data from the source via files, application database, or an API. In the transformation step, the data is cleaned and modeled so that it’s fit for analysis and ingestible by the data warehouse. Finally, the cleaned tables are loaded into the data warehouse for analysis. 

    This traditional ETL process is very difficult to scale. It often requires full-time data engineers to develop and maintain the scripts that keep the data flowing. This is because the data providers often make changes to their schemas or APIs, which then break the scripts that power the ETL process. Every time there’s a change, the data engineers scramble to update their scripts to accommodate them, resulting in downtime. With businesses now needing to ingest data from so many disparate (often fast-moving) data sources, maintaining ETL scripts for each one is not scalable. 

    Better living through ETL tools: the modern ETL process

    The modern ETL process follows a slightly different order of operations, dubbed ELT. This new process arose as a result of the introduction of tools to update the ETL process, as well as the rise of modern data warehouses with relatively low storage costs.

    Today, ETL tools do the heavy lifting for you. They have integrations for many of the major SaaS applications, and have teams of engineers who maintain those integrations, taking the pressure off of your in-house data team. These ETL tools are built to connect to most major data warehouses, allowing businesses to plug in their apps on one end and their warehouse on the other, while the ETL tools do the rest.

    Users can usually control orchestration via a simple drop-down selections within the apps, alleviating the need to stand up your own servers or EC2 box or building DAGs to run on platforms like Airflow. ETL tools can also typically offer more robust options for appending new data incrementally, or only updating new and modified rows, which can allow for more frequent loads, and closer to real-time data for the business. With this simplified process for making data available for analysis, data teams can focus on finding new applications for data to generate value for the business.

    The ETL process and data warehouses

    Data warehouses are the present and future of data and analytics. Storage costs on data warehouses have drastically reduced in recent years, which allows businesses to load as many raw data sources as possible without the same concerns they might have had before.

    Today, data teams can ingest raw data before transforming it, allowing them to do the transformations in the warehouse instead of a separate staging area. With the increased availability of data and a common language to access that data, SQL, it allows the business more flexibility in leveraging their data to make the right decisions

    The modern ETL process: delivering better results, faster

    Under the traditional ETL process, as data and processing requirements grew, the chance that on-premise data warehouses would fail grew as well. When this happened, IT would have to swoop in to fix the issue, which usually meant adding more hardware.

    The modern ETL process in today’s data warehouses sidesteps this issue by offloading the compute resource management to the cloud data warehouse. Many cloud data warehouses offer compute scaling that allows for dynamic scaling when needs spike. This allows data teams to still see scalable performance while holding increased numbers of computationally expensive data models and ingesting more large data sources. The decreased cost in compute power along with compute scaling in cloud data warehouses allows data teams to efficiently scale resources up or down to suit their needs and better ensure no downtime. The bottom line is that, instead of having your in-house data and/or IT team fretting over your data storage and computing issues, you can offload that pretty much completely to the data warehouse provider.

    Data teams can then build tests on top of their cloud data warehouse to monitor their data sources for quality, freshness, etc. giving them quicker, more proactive visibility into any problems with their data pipelines.

    From ETL to ELT and beyond

    Data, analytics, and BI have radically evolved since their inception. We are leaps and bounds beyond Excel tables and on-prem-centric data sources. Cloud-native data warehouses, cloud-native architecture in analytics and BI platforms, and embedded analytics powered by these systems have redefined what it means to be truly data-driven in our modern age.

    The ETL process has been updated and can now deliver insights from a wide array of datasets, which helps companies and teams of all kinds make smarter decisions, faster. It also opens the doors for advanced analytics, next-level data monetization, and much more. Whatever you’re building with your data, a modern ELT setup will help you get more from your data, easier.

    While we have seen a large migration to data warehouses, looking to the future, we might see even another evolution that involves data lakes being more widely used as the initial destination for all raw data sources, offering even further cost benefits. Additionally, new tools are starting to take form that allow for transformation within these data lakes, thus continuing the evolution of tools and processes within the data pipeline following the path of ETL to ELT.

    Author: Adam Luba

    Source: Sisense

EasyTagCloud v2.8