ETL or ELT? Five arguments in favor of ETL
The ETL and data integration market is highly mature, but it still sees new trends emerge, especially when underlying platforms, technologies, and tools see innovation. We outlined one new trend – ETL++ – in this article on Medium.
Another trend sweeping the data integration community is ELT – extract, load, and transform. It is a new pattern of data integration whereby data is first extracted and loaded into a cloud data warehouse (CDW), then it is transformed inside the CDW. This trend has emerged specifically in the cloud, taking advantage of powerful new CDW platforms.
At the same time as the ELT model emerged, ETL evolved into ETL++, becoming “code-free” and “schema-free” and gaining re-use and extensibility. This eliminated mundane and error-prone coding tasks, made ETL data pipelines far more reliable, and made data pipeline creation, testing, and deployment far easier and more efficient.
There are advantages and disadvantages of each approach. Here we will explore five arguments in favor of using an ETL approach.
1. Less Data Replication
With ELT, you are replicating all the raw data you need for analysis into your CDW, then transforming it into an analytics-ready form. For many organizations, this will triple or even quadruple the amount of data that is being replicated for analytics.
With ETL, you transform the data in-flight, requiring no extra data movement and storage. All you load into your CDW is analytics-ready data. As you will see in upcoming sections, having less data replication can eliminate potential security risks, reduce your governance burden, and lower the overall cost of your analytics. In general, the less you replicate data, the more efficient your data and analytics stack is.
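The transform-in-flight idea above can be sketched in a few lines. This is a minimal, illustrative sketch only – the function names and record shapes are invented for the example, not taken from any particular ETL product:

```python
# Minimal sketch of the ETL pattern: records are transformed in flight,
# so only clean, analytics-ready rows reach the warehouse. All names and
# data here are hypothetical, for illustration only.

def extract():
    # Stand-in for reading raw records from a source system.
    yield {"order_id": 1, "amount": "120.50", "country": "us"}
    yield {"order_id": 2, "amount": "80.00", "country": "de"}
    yield {"order_id": 3, "amount": "n/a", "country": "us"}  # malformed row

def transform(rows):
    # Clean and reshape in flight; malformed rows never get replicated.
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        yield {
            "order_id": row["order_id"],
            "amount": amount,
            "country": row["country"].upper(),
        }

def load(rows):
    # Stand-in for a warehouse bulk load; here we just collect the rows.
    return list(rows)

warehouse = load(transform(extract()))
print(warehouse)  # only the two valid, analytics-ready rows are stored
```

In the ELT equivalent, all three raw rows (including the malformed one) would first be replicated into the CDW and cleaned there afterwards.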
2. Easier, More Effective Data Security and Governance
The more data you replicate for analytics, the greater the burden and risk of securing and governing this data. All the raw data that is moved into the CDW will need to be secured and governed, adding extra work to already taxed data security and governance teams.
With ETL, you are only moving analytics-ready data into the CDW, reducing the potential for security holes and lowering governance risk and effort. Using an ETL platform with highly evolved security and governance features ensures consistency and high levels of data security and governance across the entire data lifecycle – extract, transformation, and loading.
3. Broader Data Transformation Capabilities
SQL has evolved over the years to support most standard data transformation capabilities, and what SQL lacks can be supplemented with user-defined functions (UDFs). But SQL is designed to work only on highly structured data (tables) and can require complex coding for more sophisticated transformations.
The code-free approach of modern ETL tools makes it fast and easy to graphically apply a wide range of functions that also work on more modern, complex data formats. Spectrum offers over 300 functions, some extremely sophisticated, that can be applied in a single click – extracting meaning or sentiment from text fields, completely re-organizing datasets, slicing and dicing data into intricate buckets, or encoding data for AI and machine learning models.
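To see why this kind of transformation is awkward in plain SQL, consider sentiment tagging of free-text fields. The toy keyword-based classifier below is a deliberately crude sketch (real tools, including Spectrum’s text functions, are far more sophisticated); the word lists and logic are invented for illustration:

```python
# Toy illustration of a transformation that is hard to express in plain SQL:
# labeling free-text feedback with a crude sentiment tag. The keyword lists
# are hypothetical; production tools use proper NLP models.

POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"slow", "broken", "bad", "hate"}

def sentiment(text):
    # Naive bag-of-words scoring; punctuation handling is omitted on purpose.
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

feedback = [
    "Great service, love the fast delivery",
    "The app is slow and the checkout is broken",
    "Order arrived on Tuesday",
]
print([sentiment(f) for f in feedback])
# -> ['positive', 'negative', 'neutral']
```

Expressing even this simple logic in SQL would require nested string functions or a UDF, whereas a code-free tool surfaces it as a ready-made transformation step.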
4. More Reliable Data Pipelines
With modern, code-free ETL tools, all data pipelines can be defined and deployed without writing one line of code. Not only does this make the definition and deployment process faster and more efficient, but it also eliminates potential coding errors, making your data pipelines more reliable.
The schema-free approach, graphical tools, and full data lineage of modern ETL tools such as Spectrum also reduce potential requirements mismatches during projects. Spectrum also provides end-to-end data pipeline monitoring and auditability, including the T process, for data observability.
5. Transparent Costs
As we’ve discussed, in an ELT model, raw data is stored in a normalized schema in the CDW, then transformed into an analytics-ready form, such as a denormalized materialized table or a multi-dimensional aggregated table. This will increase:
- CDW storage costs as both raw and analytics-ready data are stored
- CDW compute costs as transformation queries are pushed down into the CDW
Transforming from a normalized raw schema to a denormalized analytics-ready one requires many-way joins, unions, and aggregations, which are extremely “expensive” compute operations in a CDW. These extra hidden costs will increase your monthly CDW bill, and there are additional costs of teams trying to write, debug, and deploy SQL transformations.
There are advantages and disadvantages of both ETL and ELT models for data integration. For instance, some organizations are very efficient at using SQL for transforming data and know how to effectively manage their CDW usage.
So, why not hedge your bets and use a data integration platform that supports BOTH an ETL and an ELT model? This way, you can choose the best approach for each data pipeline based on its unique needs.
Author: John Morrell