2 items tagged "ELT"

  • Approaching the Transformation phase in ELT processes

    Much has been written about the shift from ETL to ELT and how ELT enables superior speed and agility for modern analytics. One important move to support this speed and agility is creating a workflow that enables data transformation to be exploratory and iterative. Defining an analysis requires an iterative loop of forming and testing hypotheses via data transformation. Reducing the latency of that iterative loop is crucial to reducing the overall time it takes to build data pipelines.

    ELT achieves flexibility by enabling access to raw data instead of predefined subsets or aggregates. The process achieves speed by leveraging the processing power of Cloud Data Warehouses and Data Lakes. A simple, dominant pattern is emerging: move all of your data to cost-effective cloud storage and leverage cloud compute to transform data prior to analysis.

    What this takes:

    1. Extract your data from source systems
    2. Load your data into the cloud platform
    3. Transform your data in the cloud! (A minimal sketch of these steps follows below.)
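
    Here is that sketch in Python. It uses the standard library's sqlite3 module as a stand-in for a cloud warehouse connection; the CSV file, table, and column names are hypothetical.

    # Minimal ELT sketch (illustration only). sqlite3 stands in for a cloud
    # warehouse connection; the CSV file, tables, and columns are hypothetical.
    import csv
    import sqlite3

    # 1. Extract: read raw rows from a source system (here, a CSV export).
    with open("orders_export.csv", newline="") as f:
        raw_rows = list(csv.DictReader(f))

    warehouse = sqlite3.connect("warehouse.db")

    # 2. Load: land the data untransformed in a staging table.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, customer_id TEXT, amount TEXT, created_at TEXT)"
    )
    warehouse.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer_id, :amount, :created_at)",
        raw_rows,
    )

    # 3. Transform: use the warehouse's own compute to shape data for analysis.
    warehouse.execute("""
        CREATE TABLE IF NOT EXISTS daily_revenue AS
        SELECT DATE(created_at) AS day, SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        GROUP BY DATE(created_at)
    """)
    warehouse.commit()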

    There are a few different approaches to doing the transformation work as part of the ELT process.

    Code-only solutions

    Individuals and teams proficient in languages such as SQL or Python can write transformations that run directly against the cloud data warehouse or data lake. Tools such as dbt and Dataform provide infrastructure around code to help teams build more maintainable pipelines. Code gives its authors ultimate flexibility to build anything the underlying system supports. Additionally, there are large communities of Python and SQL developers, as well as a wealth of examples, best practices, and forums to learn from. Of course, there are many individuals in organizations who do not know how to write code, or simply prefer not to, but still need to transform data efficiently.
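
    For illustration, the sketch below shows the kind of structure such tools encourage: a transformation defined as a small, reusable unit of SQL plus a simple data test, run from Python against the warehouse. It mimics the spirit of tools like dbt rather than their actual APIs; sqlite3 again stands in for a warehouse client, and the table and column names are hypothetical (raw_orders is the staging table from the sketch above).

    # Illustrative only: a transformation expressed as code, paired with a
    # simple data test, in the spirit of tools like dbt (not their actual APIs).
    import sqlite3

    def build_customer_revenue(conn: sqlite3.Connection) -> None:
        """Materialize a derived table from raw orders (hypothetical schema)."""
        conn.execute("DROP TABLE IF EXISTS customer_revenue")
        conn.execute("""
            CREATE TABLE customer_revenue AS
            SELECT customer_id, SUM(CAST(amount AS REAL)) AS total_revenue
            FROM raw_orders
            GROUP BY customer_id
        """)

    def test_customer_revenue(conn: sqlite3.Connection) -> None:
        """A minimal data test: no NULL keys and no negative totals."""
        bad = conn.execute("""
            SELECT COUNT(*) FROM customer_revenue
            WHERE customer_id IS NULL OR total_revenue < 0
        """).fetchone()[0]
        assert bad == 0, f"customer_revenue failed data test ({bad} bad rows)"

    if __name__ == "__main__":
        warehouse = sqlite3.connect("warehouse.db")
        build_customer_revenue(warehouse)
        test_customer_revenue(warehouse)
        warehouse.commit()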

    Visual-only solutions

    Individuals and teams that prefer visual transformation can leverage visual tools to build their pipelines. To gain the benefits of ELT, visual tools increasingly execute these pipelines directly in Cloud Data Warehouses and Lakes instead of in proprietary runtimes. Visual solutions appeal to a large community of data pipeline developers who need to produce data assets but don’t necessarily want to or know how to code. These solutions often provide more automated approaches to building pipelines, increasing efficiency for many use cases. However, visual-only approaches can be less flexible than coding directly in the underlying system: certain use cases are not performant enough or are simply not possible in the tool.

    Visual + code solutions

    We increasingly believe that modern data transformation tools will support both approaches, enabling a broader community to collaborate and scale how this work is done in organizations. Use a visual approach where it makes sense, but also enable users to leverage code where it makes sense. This best-of-both-worlds approach has two major benefits:

    1. Increased efficiency for individuals: While some have strong preferences for doing their work in code or in a visual tool, we find that many people just want to get the job done as efficiently and effectively as possible. Each approach has advantages and disadvantages; providing flexibility allows an individual to choose the right tool for the job.
    2. Collaboration across the organization: Organizations have some users who prefer to code and some who prefer not to. Solutions that support both have the potential to enable collaboration across users with varied skill sets and use cases.

    Approach at Trifacta

    Moving forward, Trifacta is increasingly investing in two areas to enable collaboration across teams and individuals working in both code and user interfaces:

    1. Running code within Trifacta: Most of Trifacta's customers primarily leverage its visual interface to build transformations, but many of them also use the existing functionality for running SQL queries directly within Trifacta, building pipelines with SQL and/or our internal DSL. Soon, Trifacta plans to support other languages such as Python.
    2. Generating code with Trifacta: The pipelines customers build in Trifacta are built on top of our internal transformation language. This language is then translated into code that can run across a variety of different platforms. Today, Spark, Google Dataflow, and Photon (Trifacta's engine for in-browser computation) are supported, and common transformations are pushed down into databases and cloud data warehouses such as Snowflake, BigQuery, and Redshift. To date, this code has been run on behalf of customers, but Trifacta has received many requests to take the generated code and use it entirely outside of Trifacta.
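
    To make the idea of generating code concrete, here is a small illustrative sketch of how a declarative description of a transformation step could be rendered into SQL for a warehouse to execute. This is not Trifacta's DSL or code generator; the spec format and all names are invented for illustration.

    # Illustrative only: translating a small declarative transform spec into SQL
    # that could be pushed down to a warehouse. Not Trifacta's DSL or generator.
    def to_sql(spec: dict) -> str:
        """Render a {source, columns, filter, group_by, aggregates} spec as SQL."""
        select_parts = list(spec.get("columns", []))
        select_parts += [f"{fn}({col}) AS {alias}"
                         for alias, (fn, col) in spec.get("aggregates", {}).items()]
        sql = f"SELECT {', '.join(select_parts)} FROM {spec['source']}"
        if "filter" in spec:
            sql += f" WHERE {spec['filter']}"
        if "group_by" in spec:
            sql += f" GROUP BY {', '.join(spec['group_by'])}"
        return sql

    # Example: a "group and sum" step expressed declaratively, then rendered as SQL.
    step = {
        "source": "raw_orders",
        "columns": ["customer_id"],
        "filter": "amount > 0",
        "group_by": ["customer_id"],
        "aggregates": {"total_revenue": ("SUM", "amount")},
    }
    print(to_sql(step))
    # SELECT customer_id, SUM(amount) AS total_revenue FROM raw_orders WHERE amount > 0 GROUP BY customer_id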

    Author: Sean Kandel

    Source: Trifacta

  • Data warehousing: ETL, ELT, and the use of big data

    If your company keeps up with the trends in data management, you have likely encountered the concepts and definitions of data warehouses and big data. When your data professionals implement data extraction for your business, they need a data repository. For this purpose, they can use a data warehouse or a data lake.

    Roughly speaking, a data lake is mainly used to gather and preserve unstructured data, while a data warehouse is intended for structured and semi-structured data.

    Data warehouse modeling concepts

    All data in a data warehouse is well organized, archived, and arranged in a particular way. Not all data that can be gathered from multiple sources reaches a data warehouse. The source of data is crucial since it impacts the quality of data-driven insights and, hence, business decisions.

    During the phase of data warehouse development, a lot of time and effort is needed to analyze data sources and select useful ones. Whether a data source has value depends on the business processes. Data only gets into the warehouse when its value is confirmed.

    On top of that, the way data is represented in your database plays a critical role. Data modeling concepts in a data warehouse are a powerful expression of business requirements specific to a company. A data model determines how data scientists and software engineers will design, create, and implement a database.

    There are three basic types of modeling. A conceptual data model describes all the entities a business needs information about. It provides facts about real-world things, such as customers and other business-related objects and relations.
    The goal of creating this data model is to synthesize and store all the data needed to gain an understanding of the whole business. This model is designed for a business audience.

    A logical data model describes data in more depth. It covers the structure of data elements, their attributes, and the ways these elements interrelate. For instance, this model can be used to identify relationships between customers and the products of interest to them. This model is characterized by a high level of clarity and accuracy.

    A physical data model describes the specific data and relationships needed for a particular case, as well as the way the data model is implemented in the database. It provides a wealth of metadata and makes it easier to visualize the structure of a database. Metadata can include access rules, constraints, indexes, and other features.
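
    As a small illustration of how the three levels differ, the physical model is where entities such as "customer" and "order" finally become concrete tables with keys, constraints, and indexes. The sketch below uses SQLite DDL run from Python; the schema is hypothetical.

    # Illustrative only: a tiny physical data model with keys, a constraint, and
    # an index, for a hypothetical "customers place orders" relationship that the
    # conceptual and logical levels would have described in business terms.
    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            country     TEXT
        );

        CREATE TABLE IF NOT EXISTS orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            amount      REAL CHECK (amount >= 0),
            created_at  TEXT NOT NULL
        );

        -- An index belongs to the physical model: it serves queries, not meaning.
        CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders(customer_id);
    """)
    conn.commit()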

    ELT and ETL data warehouse concepts

    Large amounts of data sorted for warehousing and analytics require a special approach. Businesses need to gather and process data to retrieve meaningful insights. Thus, data should be manageable, clean, and suitable for molding and transformation.

    ETL (extract, transform, load) and ELT (extract, load, transform) are two approaches that have technological differences but serve the same purpose: to manage and analyze data.

    ETL is the paradigm of extracting data from multiple sources and pulling it into a single database to serve a business.

    At the first stage of the ETL process, engineers extract data from different databases and gather it in a single place. The collected data undergoes transformation to take the form required by the target repository. The data then arrives in a data warehouse or a target database.

    If you switch the letters 'T' and 'L', you get the ELT process. After retrieval, the data can be loaded straight into the target database. Cloud technology enables large, scalable storage, so massive datasets can be loaded first and then transformed according to business requirements and needs.

    The ELT paradigm is a newer alternative to the well-established ETL process. It is flexible and allows raw data to be processed quickly. On the one hand, ELT requires special tools and frameworks; on the other, it enables unlimited access to business data, saving BI and data analytics experts a great deal of time.
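
    A compact way to see the difference is where the transformation runs. In the hypothetical sketch below (Python, with sqlite3 standing in for the target database), the ETL variant reshapes the rows in application code before loading them, while the ELT variant loads the raw rows first and reshapes them with SQL inside the target system.

    # Illustrative only: the same reshaping done ETL-style (before load) and
    # ELT-style (after load). Data, tables, and the sqlite3 target are hypothetical.
    import sqlite3

    raw_rows = [("2024-01-01", "19.90"), ("2024-01-01", "5.10"), ("2024-01-02", "7.00")]
    db = sqlite3.connect(":memory:")

    # ETL: transform in application code, then load the already-shaped result.
    daily = {}
    for day, amount in raw_rows:
        daily[day] = daily.get(day, 0.0) + float(amount)
    db.execute("CREATE TABLE etl_daily_revenue (day TEXT, revenue REAL)")
    db.executemany("INSERT INTO etl_daily_revenue VALUES (?, ?)", daily.items())

    # ELT: load the raw rows untouched, then transform inside the target system.
    db.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)
    db.execute("""
        CREATE TABLE elt_daily_revenue AS
        SELECT day, SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_sales
        GROUP BY day
    """)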

    ETL testing concepts are also essential to ensure that data is loaded into the data warehouse correctly and accurately. This testing involves data verification at transitional phases, so that the data's quality and usefulness are verified before it reaches its destination.
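
    For instance, a basic verification at a transitional phase might reconcile row counts between the extracted and loaded data and check key completeness before the load is accepted. A hypothetical sketch (table names invented, sqlite3 as the stand-in warehouse):

    # Illustrative only: two basic checks run at a transitional phase, using
    # hypothetical staging tables in a sqlite3 stand-in for the warehouse.
    import sqlite3

    conn = sqlite3.connect("warehouse.db")

    source_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    loaded_count = conn.execute("SELECT COUNT(*) FROM staged_orders").fetchone()[0]
    assert source_count == loaded_count, "row counts diverged between extract and load"

    null_keys = conn.execute(
        "SELECT COUNT(*) FROM staged_orders WHERE order_id IS NULL"
    ).fetchone()[0]
    assert null_keys == 0, "staged_orders contains rows with missing keys"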

    Types of data warehouse for your company

    Different data warehouse concepts presuppose the use of particular techniques and tools to work with data. Basic data warehouse concepts also differ depending on a company’s size and its purposes for using data.

    An enterprise data warehouse enables a unified approach to organizing, visualizing, and representing all the data across a company. Data can be classified by subject and accessed based on that attribute.

    A data mart is a subset of a data warehouse designed for specific tasks in business areas such as retail, finance, and so forth. Data comes into a data mart straight from the sources.

    An operational data store satisfies the reporting needs within a company. It is updated in real time, which makes this solution well suited for keeping all business records current.

    Big data and data warehouse ambiguity

    A data warehouse is an architecture that has proved to be valuable for data storage over the years. It holds data that has a defined value and can be used from the start to address business needs. Everyone can access this data, and its datasets are characterized by reliability and accuracy.

    Big data is a much-hyped field these days. It is the technology that allows data to be retrieved from heterogeneous sources. The key features of big data are volume, velocity (the speed of data streams), and variety of data formats. Unlike a data warehouse, big data is a repository that can hold unstructured data as well.

    Companies seek to adopt custom big data solutions to unlock useful information that can help improve decision-making. These solutions help drive revenue, increase profitability, and cut customer churn thanks to the comprehensive information collected and available in one place.

    Data warehouse implementation brings advantages in terms of making informed decisions. It provides comprehensive insights into what is going on within a company, while big data can take the shape of massive but disorganized datasets. However, big data can later be used for data warehousing.

    Running a data-driven business means dealing with billions of data points on in-house and external operations, consumers, and regulations.

    Author: Katrine Spirina

    Source: In Data Labs
