Approaching the Transformation phase in ELT processes
Much has been written about the shift from ETL to ELT and how ELT enables superior speed and agility for modern analytics. One important move to support this speed and agility is creating a workflow that enables data transformation to be exploratory and iterative. Defining an analysis requires an iterative loop of forming and testing these hypotheses via data transformation. Reducing the latency of that interactive loop is crucial to reducing the overall time it takes to build data pipelines.
ELT achieves flexibility by enabling access to raw data instead of predefined subsets or aggregates. The process achieves speed by leveraging the processing power of Cloud Data Warehouses and Data Lakes. A simple, dominant pattern is emerging: move all of your data to cost effective cloud storage and leverage cloud compute for transformation of data prior to analysis.
What this takes:
- Extract your data from source systems
- Load your data into the cloud platform
- Transform your data in the cloud!
There are a few different approaches to doing the transformation work as part of the ELT process.
Code only solutions
Individuals and teams proficient in languages such as SQL or Python can write transformations that run directly against the cloud data warehouse or data lake. Tools such as DBT and Dataform provide infrastructure around code to help teams build more maintainable pipelines. Code gives its authors ultimate flexibility to build anything the underlying system supports. Additionally, there are large communities of Python and SQL developers as well as a wealth of examples, best practices, and forums to learn from. Of course, there are many individuals in organizations that do not know how to write code or simply prefer not to but still need to efficiently transform data.
Visual only solutions
Individuals and teams that prefer visual transformation can leverage visual tools to build their pipelines. To gain the benefits of ELT, visual tools increasingly execute these pipelines directly in Cloud Data Warehouses and Lakes instead of in proprietary run times. Visual solutions appeal to a large community of data pipeline developers who need to produce data assets but don’t necessarily want to or know how to code. These solutions often provide more automated approaches to build pipelines, increasing efficiency for many use cases. However, visual only approaches can at times be not as flexible as coding in the underlying system: certain use cases are not performant enough or simply not possible in the tool.
Visual + code solutions
We believe increasingly that modern data transformation tools will support both approaches to enable a broader community to collaborate and scale how this work is done in organizations. Use a visual approach where it makes sense but also enable users to leverage code where it makes sense. This best of both worlds approach has two major benefits:
- Increased efficiency for individuals: While some have strong preferences for doing their work in code or in a visual tool, we find that many people just want to get the job done as efficiently and effectively as possible. Each approach has advantages and disadvantages – providing flexibility allows an individual to choose the right tool for the job.
- Collaboration across the organization: Organizations have some users who prefer to code and some prefer not to. Solutions providing support for both have the potential to enable collaboration across users with varied skill sets and use cases.
Approach at Trifacta
Moving forward, Trifacta is increasingly investing in the two areas to enable collaboration across teams and individuals working in both code + user interfaces:
- Running code within Trifacta: Most of Trifacta's customers primarily leverage it's visual interface to build transformations, but many of them also use the existing functionality for running SQL queries from directly within Trifacta, building pipelines with SQL and/or our internal DSL. Soon, Trifacta plans to support other languages such as Python.
- Generating code with Trifacta: The pipelines customers build in Trifacta are built on top of our internal transformation language. This language is then translated into code that can run across a variety of different platforms. Today Spark, Google Dataflow, Photon (Trifacta's engine for in-browser computation) are supported and common transformations pushed down into databases and cloud data warehouses like Snowflake, Big Query and Redshift. To date, this code is ran on behalf of the customers, but Trifacta received many requests to take the code that is generated and use that completely outside of Trifacta.
Author: Sean Kandel