5 items tagged "ETL"

  • Data warehousing: ETL, ELT, and the use of big data

    If your company keeps up with trends in data management, you have likely encountered the concepts of the data warehouse and big data. When your data professionals implement data extraction for your business, they need a data repository. For this purpose, they can use a data warehouse or a data lake.

    Roughly speaking, a data lake is mainly used to gather and preserve unstructured data, while a data warehouse is intended for structured and semi-structured data.

    Data warehouse modeling concepts

    All data in a data warehouse is well organized, archived, and arranged in a particular way. Not all data that can be gathered from multiple sources reaches a data warehouse. The source of the data is crucial, since it affects the quality of data-driven insights and, hence, business decisions.

    During data warehouse development, a lot of time and effort is needed to analyze data sources and select the useful ones. Whether a data source has value depends on the business processes it supports. Data only gets into the warehouse once its value is confirmed.

    On top of that, the way data is represented in your database plays a critical role. Data modeling concepts in a data warehouse are a powerful expression of a company's specific business requirements. A data model determines how data scientists and software engineers will design, create, and implement a database.

    There are three basic types of data models. A conceptual data model describes all the entities a business needs information about. It captures facts about real-world things such as customers and other business-related objects, along with their relations.
    The goal of this model is to synthesize and store all the data needed to understand the business as a whole. It is designed for a business audience.

    A logical data model goes into more depth. It describes the structure of data elements, their attributes, and the ways these elements interrelate. For instance, this model can be used to identify relationships between customers and the products they are interested in. It is characterized by a high level of clarity and accuracy.

    A physical data model describes the specific data and relationships needed for a particular case, as well as the way the data model is implemented in the database. It provides a wealth of metadata and makes the structure of a database easy to visualize. This metadata can include access rules, constraints, indexes, and other features.
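
    To make the distinction concrete, here is a minimal sketch of what a physical model boils down to: concrete tables, columns, types, keys, and indexes handed to a specific database engine. Python's built-in sqlite3 module stands in for the real database, and the schema is purely hypothetical.

    ```python
    import sqlite3

    # In-memory database used only to illustrate a physical model (hypothetical schema).
    conn = sqlite3.connect(":memory:")

    conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        full_name   TEXT NOT NULL,
        region      TEXT
    );

    CREATE TABLE purchase (
        purchase_id  INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
        product      TEXT NOT NULL,
        amount       REAL NOT NULL,
        purchased_at TEXT NOT NULL
    );

    -- An index: exactly the kind of metadata the physical model captures.
    CREATE INDEX idx_purchase_customer ON purchase(customer_id);
    """)
    conn.close()
    ```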

    ELT and ETL data warehouse concepts

    Large amounts of data sorted for warehousing and analytics require a special approach. Businesses need to gather and process data to retrieve meaningful insights. Thus, data should be manageable, clean, and suitable for molding and transformation.

    ETL (extract, transform, load) and ELT (extract, load, transform) are the two approaches that have technological differences but serve the same purpose – to manage and analyze data.

    ETL is the paradigm in which data is extracted from multiple sources and pulled into a single database to serve the business.

    At the first stage of the ETL process, engineers extract data from different databases and gather it in a single place. The collected data then undergoes transformation to take the form required by the target repository. Finally, the data is loaded into a data warehouse or another target database.
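
    A minimal sketch of those three stages in Python; the source file, field names, and SQLite target are hypothetical stand-ins for real source systems and a real warehouse.

    ```python
    import csv
    import sqlite3

    def extract(path):
        """Extract: read raw rows from a source system (here, a CSV export)."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        """Transform: clean and reshape rows in flight into the target schema."""
        cleaned = []
        for r in rows:
            cleaned.append((
                r["order_id"],
                r["customer_email"].strip().lower(),  # normalize the email field
                float(r["amount"]),                   # cast amount to a number
            ))
        return cleaned

    def load(rows, conn):
        """Load: write analytics-ready rows into the warehouse table."""
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        conn.commit()

    warehouse = sqlite3.connect("warehouse.db")
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, email TEXT, amount REAL)"
    )
    load(transform(extract("orders_export.csv")), warehouse)
    ```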

    If you switch the letters 'T' and 'L', you get the ELT process. After retrieval, the data can be loaded straight into the target database. Cloud technology enables large, scalable storage, so massive datasets can be loaded first and then transformed according to business requirements and needs.
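
    For contrast, here is the same flow sketched as ELT: the raw rows are loaded first, and the transformation runs afterwards as SQL inside the target database. Again, SQLite and all table names are only illustrative stand-ins for a cloud warehouse.

    ```python
    import csv
    import sqlite3

    warehouse = sqlite3.connect("warehouse.db")

    # Load first: copy the raw extract into the warehouse untouched.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, customer_email TEXT, amount TEXT)"
    )
    with open("orders_export.csv", newline="") as f:
        rows = [(r["order_id"], r["customer_email"], r["amount"]) for r in csv.DictReader(f)]
    warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform afterwards, inside the warehouse, with plain SQL.
    warehouse.executescript("""
        DROP TABLE IF EXISTS orders;
        CREATE TABLE orders AS
        SELECT order_id,
               lower(trim(customer_email)) AS email,
               CAST(amount AS REAL)        AS amount
        FROM raw_orders;
    """)
    warehouse.commit()
    ```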

    The ELT paradigm is a newer alternative to the well-established ETL process. It is flexible and allows raw data to be processed quickly. On the one hand, ELT requires special tools and frameworks, but on the other, it gives unlimited access to business data, saving BI and data analytics experts a great deal of time.

    ETL testing concepts are also essential to ensure that data is loaded into a data warehouse correctly and accurately. This testing involves data verification at transitional phases, so that by the time data reaches its destination, its quality and usefulness have already been verified.
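
    A rough sketch of the kind of checks such testing involves, run at a transitional phase before rows reach the destination; the rules and field names are invented for illustration.

    ```python
    def validate(rows):
        """Basic ETL checks: flag rows that would degrade warehouse quality."""
        errors = []
        seen_ids = set()
        for i, row in enumerate(rows):
            if row["order_id"] in seen_ids:
                errors.append(f"row {i}: duplicate order_id {row['order_id']}")
            seen_ids.add(row["order_id"])
            if not row["customer_email"] or "@" not in row["customer_email"]:
                errors.append(f"row {i}: invalid email")
            if float(row["amount"]) < 0:
                errors.append(f"row {i}: negative amount")
        return errors

    rows = [
        {"order_id": "A1", "customer_email": "a@example.com", "amount": "19.90"},
        {"order_id": "A1", "customer_email": "bad-address", "amount": "-5"},
    ]
    problems = validate(rows)
    if problems:
        print("ETL validation failed:", *problems, sep="\n  ")
    ```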

    Types of data warehouse for your company

    Different data warehouse concepts presuppose the use of particular techniques and tools to work with data. Basic data warehouse concepts also differ depending on a company’s size and how it uses data.

    An enterprise data warehouse enables a unified approach to organizing, visualizing, and representing all the data across a company. Data can be classified by subject and accessed based on that attribute.

    A data mart is a subcategory of a data warehouse designed for specific tasks in business areas such as retail, finance, and so forth. Data comes into a data mart straight from the sources.

    An operational data store satisfies a company’s reporting needs. It is updated in real time, which makes it well suited for keeping all business records up to date.

    Big data and data warehouse ambiguity

    A data warehouse is an architecture that has proved valuable for data storage over the years. It holds data that has a defined value and can be used from the start to address business needs. Everyone can access this data, and its datasets are reliable and accurate.

    Big data is a much-hyped field these days. It is the technology that allows data to be retrieved from heterogeneous sources. The key features of big data are volume, velocity (the speed of data streams), and a variety of data formats. Unlike a data warehouse, a big data repository can also hold unstructured data.

    Companies seek to adopt custom big data solutions to unlock useful information that can help improve decision-making. These solutions help drive revenue, increase profitability, and cut customer churn thanks to the comprehensive information collected and available in one place.

    Data warehouse implementation brings advantages in terms of making informed decisions. It provides comprehensive insight into what is going on within a company, while big data often takes the shape of massive but disorganized datasets. However, big data can later be used for data warehousing.

    Running a data-driven business means dealing with billions of data points on in-house and external operations, consumers, and regulations.

    Author: Katrine Spirina

    Source: In Data Labs

  • ETL or ELT? Five arguments in favor of ETL

    The ETL and data integration market is highly mature, but it still sees new trends emerge, especially when underlying platforms, technologies, and tools see innovation. We outlined one new trend – ETL++ – in this article on Medium.

    Another trend sweeping the data integration community is ELT – extract, load, and transform.  It is a new pattern of data integration whereby data is first extracted and loaded into a cloud data warehouse (CDW), then it is transformed inside the CDW.  This trend has emerged specifically in the cloud, taking advantage of powerful new CDW platforms.

    At the same time as the ELT model emerged, ETL evolved into ETL++, in the process becoming “code-free” and “schema-free” and gaining re-use and extensibility. This eliminated mundane and error-prone coding tasks, making ETL data pipelines far more reliable, and made data pipeline creation, testing, and deployment far easier and more efficient.

    There are advantages and disadvantages of each approach. Here we will explore five arguments in favor of using an ETL approach.

    1. Less Data Replication

    With ELT, you are replicating all the raw data you need for analysis into your CDW, then transforming it into an analytics-ready form. For many organizations, this will triple or even quadruple the amount of data that is being replicated for analytics.

    With ETL, you transform the data in-flight, requiring no extra data movement and storage.  All you load into your CDW is analytics-ready data. As you will see in upcoming sections, having less data replication can eliminate potential security risks, reduce your governance burden, and lower the overall cost of your analytics. In general, the less you replicate data, the more efficient your data and analytics stack is.

    2. Easier, More Effective Data Security and Governance

    The more data you replicate for analytics, the greater the burden and risk of securing and governing this data. All the raw data that is moved into the CDW will need to be secured and governed, adding extra work to already taxed data security and governance teams.

    With ETL, you are only moving analytics-ready data into the CDW, reducing the potential for security holes and lowering governance risk and effort. Using an ETL platform with highly evolved security and governance features ensures consistency and high levels of data security and governance across the entire data lifecycle – extract, transformation, and loading.

    3. Broader Data Transformation Capabilities

    SQL has evolved over the years to support many of the standard data transformation capabilities. What SQL doesn’t have can be supplemented with user-defined functions (UDFs). But SQL is designed to work only on highly structured data (tables) and can require complex coding for more sophisticated transformations.

    The code-free approach of modern ETL tools makes it fast and easy to graphically apply a wide range of functions that also work on more modern, complex data formats. Spectrum offers over 300 functions, some extremely sophisticated, that can be applied in a single click, such as extracting meaning or sentiment from text fields, completely reorganizing datasets, slicing and dicing data into intricate buckets, or encoding data for AI and machine learning models.
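
    As a rough illustration of a transformation that is awkward in plain SQL but trivial as a pipeline function, the sketch below tags a free-text field with a naive keyword-based sentiment bucket. It is not how Spectrum implements its functions; the keyword lists and field names are made up.

    ```python
    POSITIVE = {"great", "love", "excellent", "happy"}
    NEGATIVE = {"bad", "slow", "broken", "refund"}

    def sentiment(text: str) -> str:
        """Naive sentiment bucket for a free-text field (illustrative only)."""
        words = set(text.lower().split())
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    tickets = [
        {"id": 1, "body": "Love the new dashboard, excellent work"},
        {"id": 2, "body": "The export is broken and slow"},
    ]
    for t in tickets:
        t["sentiment"] = sentiment(t["body"])  # enrich rows in-flight, before loading
    print(tickets)
    ```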

    4. More Reliable Data Pipelines

    With modern, code-free ETL tools, all data pipelines can be defined and deployed without writing one line of code. Not only does this make the definition and deployment process faster and more efficient, but it also eliminates potential coding errors, making your data pipelines more reliable.

    The schema-free approach, graphical tools, and full data lineage of modern ETL tools such as Spectrum also reduce potential requirements mismatches during projects. Spectrum also provides end-to-end data pipeline monitoring and auditability, including the T process, for data observability. 

    5. Transparent Costs

    As we’ve discussed, in an ELT model, raw data is stored in a normalized schema in the CDW, then transformed into an analytics-ready form, such as a denormalized materialized table or a multi-dimensional aggregated table. This will increase:

    • CDW storage costs as both raw and analytics-ready data are stored
    • CDW compute costs as transformation queries are pushed down into the CDW

    Transforming from a normalized schema into a denormalized, analytics-ready one requires many-way joins, unions, and aggregations, which are extremely “expensive” compute operations in a CDW. These hidden extra costs will increase your monthly CDW bill, and there are additional costs from teams writing, debugging, and deploying SQL transformations.
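
    To make that concrete, the query below is the kind of join-plus-aggregation transform an ELT job pushes down into the warehouse on every run, billed as CDW compute; the schema and table names are hypothetical.

    ```python
    # Illustrative only: a denormalizing transform submitted to the warehouse by an ELT job.
    BUILD_SALES_SUMMARY = """
    CREATE TABLE analytics.daily_sales AS
    SELECT o.order_date,
           c.region,
           p.category,
           SUM(o.amount)        AS revenue,
           COUNT(DISTINCT o.id) AS orders
    FROM raw.orders o
    JOIN raw.customers c ON c.id = o.customer_id
    JOIN raw.products  p ON p.id = o.product_id
    GROUP BY o.order_date, c.region, p.category;
    """
    # warehouse_cursor.execute(BUILD_SALES_SUMMARY)  # each run re-scans and joins the raw tables
    ```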

    Wrap-Up

    There are advantages and disadvantages of both ETL and ELT models for data integration. For instance, some organizations are very efficient at using SQL for transforming data and know how to effectively manage their CDW usage.

    So, why not hedge your bets and use a data integration platform that supports BOTH an ETL and an ELT model? This way, you can choose the best approach for each data pipeline based on its unique needs.

    Author: John Morrell

    Source: Datameer

  • How modern ETL processes lead to better decision making

    We live in a world of data: there’s more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with data is your window into the ways data teams are tackling the challenges of this new world to help their companies and their customers thrive.

    In recent years we’ve seen data become vastly more available to businesses. This is mostly due to the rise of data warehouses, the decrease in the cost of data storage, and modern ETL tools that make storing and accessing data more approachable than ever before. This has allowed companies to become more and more data-driven in all areas of their business. In fact, being data-driven has become ubiquitous and imperative to survival in today’s climate. This article will discuss at a high level how modern businesses are leveraging new technology to ingest a wider variety of data sources.

    As of 2019, according to Blissfully’s 2020 SaaS trends report, companies in all phases of business growth make use of many SaaS applications. Smaller companies (0-50 employees) use an average of 102 SaaS applications, while mid-sized ones (101-250 employees) use about 137. Enterprises, meanwhile, can have as many as 288 SaaS applications in use across the entire organization!

    These SaaS products range from CRMs to marketing automation apps to billing software to help desk tools, etc. These applications can log thousands to millions of rows of data made available via API, much of which can hold valuable insight for businesses. Add to this the company’s own internally-gathered product data and it’s clear that even a small or medium-sized business can do a lot with the data at its disposal. In this article, we’ll cover what the ETL process is, why it’s important to powering data-driven businesses, and how the rise of modern ETL tools has changed the ETL process forever.

    The ETL process: foundation of data-driven businesses

    Every company wants every team within their business to make smarter, data-driven decisions. Customer support teams look at trends in support tickets or do text analysis on conversations to understand where they can provide better onboarding and documentation. Marketing teams want better visibility into their ad performance across different platforms and the ROI on their spend. Product and engineering teams dig into productivity metrics or bug reports to help them better prioritize their resources. 

    The ETL process empowers all these different teams to get the information they need to understand and perform their jobs better. Businesses ingest data from a wide array of sources through the ETL process, which stands for Extract, Transform, Load. The prepared data is then available for analysis and use by the various teams who need it, as well as for advanced analytics, embedding into applications, and use for other data monetization efforts. Whatever you want to do with data, you have to ETL it first. 

    Explaining the traditional ETL process

    Traditionally, the extraction step of the process involves batch-pulling the data from the source via files, application database, or an API. In the transformation step, the data is cleaned and modeled so that it’s fit for analysis and ingestible by the data warehouse. Finally, the cleaned tables are loaded into the data warehouse for analysis. 

    This traditional ETL process is very difficult to scale. It often requires full-time data engineers to develop and maintain the scripts that keep the data flowing, because data providers often make changes to their schemas or APIs, which then break the scripts that power the ETL process. Every time there is a change, the data engineers scramble to update their scripts to accommodate it, resulting in downtime. With businesses now needing to ingest data from so many disparate (often fast-moving) data sources, maintaining ETL scripts for each one is not scalable.
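
    A short sketch of why such hand-written scripts are brittle: the extraction below hard-codes the provider's field names, so a renamed or dropped field breaks the run until an engineer updates the mapping. The endpoint and field names are invented.

    ```python
    import json
    import urllib.request

    # Hypothetical SaaS endpoint; a real extract would also handle auth and pagination.
    SOURCE_URL = "https://api.example-crm.com/v1/contacts"

    # Hard-coded mapping: warehouse column -> the provider's current field name.
    FIELD_MAP = {"id": "contact_id", "email": "email_address", "created_at": "created"}

    def extract_contacts():
        with urllib.request.urlopen(SOURCE_URL) as resp:
            payload = json.load(resp)
        rows = []
        for record in payload["results"]:
            # Breaks with a KeyError the moment the provider renames or drops a field:
            rows.append({column: record[source] for column, source in FIELD_MAP.items()})
        return rows
    ```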

    Better living through ETL tools: the modern ETL process

    The modern ETL process follows a slightly different order of operations, dubbed ELT. This new process arose as a result of the introduction of tools to update the ETL process, as well as the rise of modern data warehouses with relatively low storage costs.

    Today, ETL tools do the heavy lifting for you. They have integrations for many of the major SaaS applications, and have teams of engineers who maintain those integrations, taking the pressure off of your in-house data team. These ETL tools are built to connect to most major data warehouses, allowing businesses to plug in their apps on one end and their warehouse on the other, while the ETL tools do the rest.

    Users can usually control orchestration via simple drop-down selections within the apps, alleviating the need to stand up your own servers or EC2 boxes or to build DAGs to run on platforms like Airflow. ETL tools can also typically offer more robust options for appending new data incrementally, or for updating only new and modified rows, which allows for more frequent loads and closer-to-real-time data for the business. With this simplified process for making data available for analysis, data teams can focus on finding new applications for data to generate value for the business.
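
    A minimal sketch of that incremental pattern, using a stored high-water mark so each run pulls only rows modified since the last load; the fetch and load functions are placeholders for whatever source and warehouse a pipeline actually uses.

    ```python
    import json

    STATE_FILE = "last_sync.json"   # where the pipeline remembers its high-water mark

    def read_watermark():
        """Timestamp of the last successful sync (epoch start if we have never run)."""
        try:
            with open(STATE_FILE) as f:
                return json.load(f)["last_synced_at"]
        except FileNotFoundError:
            return "1970-01-01T00:00:00+00:00"

    def write_watermark(ts):
        with open(STATE_FILE, "w") as f:
            json.dump({"last_synced_at": ts}, f)

    def incremental_sync(fetch_rows, load_rows):
        """fetch_rows(since) and load_rows(rows) are placeholders supplied by the pipeline."""
        since = read_watermark()
        rows = fetch_rows(since)                 # only rows with updated_at > since
        if rows:
            load_rows(rows)                      # upsert new and modified rows
            write_watermark(max(r["updated_at"] for r in rows))
        return len(rows)
    ```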

    The ETL process and data warehouses

    Data warehouses are the present and future of data and analytics. Storage costs on data warehouses have drastically reduced in recent years, which allows businesses to load as many raw data sources as possible without the same concerns they might have had before.

    Today, data teams can ingest raw data before transforming it, allowing them to do the transformations in the warehouse instead of in a separate staging area. With the increased availability of data and a common language, SQL, to access it, the business has more flexibility in leveraging its data to make the right decisions.

    The modern ETL process: delivering better results, faster

    Under the traditional ETL process, as data and processing requirements grew, the chance that on-premise data warehouses would fail grew as well. When this happened, IT would have to swoop in to fix the issue, which usually meant adding more hardware.

    The modern ETL process in today’s data warehouses sidesteps this issue by offloading the compute resource management to the cloud data warehouse. Many cloud data warehouses offer compute scaling that allows for dynamic scaling when needs spike. This allows data teams to still see scalable performance while holding increased numbers of computationally expensive data models and ingesting more large data sources. The decreased cost in compute power along with compute scaling in cloud data warehouses allows data teams to efficiently scale resources up or down to suit their needs and better ensure no downtime. The bottom line is that, instead of having your in-house data and/or IT team fretting over your data storage and computing issues, you can offload that pretty much completely to the data warehouse provider.

    Data teams can then build tests on top of their cloud data warehouse to monitor their data sources for quality, freshness, and more, giving them quicker, more proactive visibility into any problems with their data pipelines.
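
    A simple freshness test might look like the sketch below, run against the warehouse after each load; SQLite stands in for the cloud warehouse, and the table name and threshold are arbitrary.

    ```python
    import sqlite3
    from datetime import datetime, timedelta, timezone

    FRESHNESS_LIMIT = timedelta(hours=6)   # arbitrary threshold for this sketch

    def check_freshness(conn, table, ts_column="loaded_at"):
        """Fail if the newest row in `table` is older than the allowed threshold.

        Assumes timestamps are stored as ISO-8601 strings with a timezone offset.
        """
        (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
        if latest is None:
            raise AssertionError(f"{table} is empty")
        age = datetime.now(timezone.utc) - datetime.fromisoformat(latest)
        if age > FRESHNESS_LIMIT:
            raise AssertionError(f"{table} is stale: last load was {age} ago")

    # Example usage: check_freshness(sqlite3.connect("warehouse.db"), "orders")
    ```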

    From ETL to ELT and beyond

    Data, analytics, and BI have radically evolved since their inception. We are leaps and bounds beyond Excel tables and on-prem-centric data sources. Cloud-native data warehouses, cloud-native architecture in analytics and BI platforms, and embedded analytics powered by these systems have redefined what it means to be truly data-driven in our modern age.

    The ETL process has been updated and can now deliver insights from a wide array of datasets, which helps companies and teams of all kinds make smarter decisions, faster. It also opens the doors for advanced analytics, next-level data monetization, and much more. Whatever you’re building with your data, a modern ELT setup will help you get more from your data, easier.

    While we have seen a large migration to data warehouses, looking to the future we might see yet another evolution, with data lakes more widely used as the initial destination for all raw data sources, offering even further cost benefits. Additionally, new tools are starting to take shape that allow for transformation within these data lakes, continuing the evolution of tools and processes within the data pipeline along the path from ETL to ELT.

    Author: Adam Luba

    Source: Sisense

  • Migros: an example of seizing the opportunities BI offers

    Migros is the largest retailer in Turkey, with more than 2500 outlets selling fresh produce and groceries to millions of people. To maintain high-quality operations, the company depends on fresh, accurate data. And to ensure high data quality, Migros depends on Talend.

    The sheer volume of data managed by Migros is astonishing. The company’s data warehouse currently holds more than 200 terabytes, and Migros is running more than 7,000 ETL (extract, transform, load) jobs every day. Recently, the quality of that data became the focal point for the BI (business intelligence) team at Migros.

    “We have 4,000 BI users in this company,” said Ahmet Gozmen, Senior Manager of IT Data Quality and Governance at Migros. “We produce 5-6 million mobile reports every year that our BI analysts see on their personal dashboards. If they can’t trust the timeliness or the accuracy of the reports, they can’t provide trustworthy guidance on key business decisions.”

    In 2019, Mr. Gozmen and his team decided they needed a more reliable foundation on which to build data quality. “We were having a few issues with our data at that time,” he said. “There would be occasional problematic or unexpected values in reports—a store’s stock would indicate an abnormal level, for example—and the issue was in the data, not the inventory. We had to address these problems, and more than that we wanted to take our data analysis and BI capabilities to a higher level.”

    From Community to Commercial

    Initially, Mr. Gozmen’s team used the non-commercial version of Talend Data Quality. “It was an open-source solution that we could download and set up in one day,” he said. “At first, we just wanted to see whether we could do something with this tool or not. We explored its capabilities, and we asked the Talend Community if we had questions or needed advice.”

    Mr. Gozmen discovered that Talend had far more potential than he expected. “We found that the data quality tool was very powerful, and we started exploring what else we could do with Talend,” he said. “So we also downloaded the data integration package, then the big data package. Talend could handle the huge volumes of data we were dealing with. And very soon we started thinking about the licensed, commercial versions of these solutions, because we saw a great deal of potential not only for immediate needs but for future plans.”

    By upgrading to the commercial versions, Migros also elevated the level of service and support that was available. “The Community served our purposes well in the early stages,” said Mr. Gozmen, “but with the commercial license we now have more personalized support and access to specialists who can help us immediately with any aspect of our implementation.”

    From Better Data Quality to Big Data Dreams

    With Talend Data Quality, Migros has improved the accuracy and reliability of its BI reporting, according to Mr. Gozmen. “We are a small department in a very big company,” he said, “but with help from Talend we can instill confidence in our reporting, and we can start to support other departments and have a larger impact on improving processes and even help generate more income.”

    The higher level of data quality Migros has achieved with Talend has also led Mr. Gozmen to consider using Talend for future data initiatives. “We have big dreams, and we are testing the waters on several fronts,” he said. “We are exploring the possibilities for predictive analytics, and we feel Talend’s big data capabilities are a good match.”

    The Migros team is also considering using Talend to move from its current batch processing mode to real-time data analysis, according to Mr. Gozmen. “We are currently using date-minus-one or date-minus-two batch processing, but we want to move to real-time big data predictive analytics and other advanced capabilities as soon as possible,” he said. “We are currently testing new models that can help us achieve these goals.”

    While the business advantages of using Talend are manifesting themselves in past, present, and future use cases, Mr. Gozmen sums them up this way: “With Talend we can trust our data, so business leaders can trust our reports, so we can start to use big data in new ways to improve processes, supply chain performance, and business results.”

    Author: Laura Ventura

    Source: Talend

  • The transformation of raw data into actionable insights in 5 steps

    We live in a world of data: there’s more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with Data is your window into the ways organizations tackle the challenges of this new world to help their companies and their customers thrive.

    In a world of proliferating data, every company is becoming a data company. The route to future success is increasingly dependent on effectively gathering, managing, and analyzing your data to reveal insights that you’ll use to make smarter decisions. Doing this will require rethinking how you handle data, learn from it, and how data fits in your digital transformation.

    Simplifying digital transformation

    The growing amount and increasingly varied sources of data that every organization generates make digital transformation a daunting prospect. But it doesn’t need to be. At Sisense, we’re dedicated to making this complex task simple, putting power in the hands of the builders of business data and strategy, and providing insights for everyone. The launch of the Google Sheets analytics template illustrates this.

    Understanding how data becomes insights

    A big barrier to analytics success has been that typically only experts in the data field (data engineers, scientists, analysts and developers) understood this complex topic. As access to and use of data has now expanded to business team members and others, it’s more important than ever that everyone can appreciate what happens to data as it goes through the BI and analytics process. 

    Your definitive guide to data and analytics processes

    The following guide shows how raw data becomes actionable insights in 5 steps. It will navigate you through every consideration you might need to make about what BI and analytics capabilities you need, and every step of the way that leads to potentially game-changing decisions for you and your company.

    1. Generating and storing data in its raw state

    Every organization generates and gathers data, both internally and from external sources. The data takes many formats and covers all areas of the organization’s business (sales, marketing, payroll, production, logistics, etc.). External data sources include partners, customers, potential leads, etc.

    Traditionally all this data was stored on-premises, in servers, using databases that many of us will be familiar with, such as SAP, Microsoft Excel, Oracle, Microsoft SQL Server, IBM DB2, PostgreSQL, MySQL, Teradata.

    However, cloud computing has grown rapidly because it offers more flexible, agile, and cost-effective storage solutions. The trend has been towards using cloud-based applications and tools for different functions, such as Salesforce for sales, Marketo for marketing automation, and large-scale data storage like AWS or data lakes such as Amazon S3, Hadoop and Microsoft Azure.

    An effective, modern BI and analytics platform must be capable of working with all of these means of storing and generating data.

    2. Extract, Transform, and Load: Prepare data, create staging environment and transform data, ready for analytics

    For data to be properly accessed and analyzed, it must be taken from raw storage databases and, in some cases, transformed. In all cases the data will eventually be loaded into a different place so it can be managed and organized, using a package such as Sisense for Cloud Data Teams. Using data pipelines and data integration between data storage tools, engineers perform ETL (extract, transform, and load): they extract the data from its sources, transform it into a uniform format that enables it all to be integrated, and then load it into the repository they have prepared for their databases.

    In the age of the Cloud, the most effective repositories are cloud-based storage solutions like Amazon Redshift, Google BigQuery, Snowflake, Amazon S3, Hadoop, and Microsoft Azure. These huge, powerful repositories have the flexibility to scale storage capabilities on demand with no need for extra hardware, making them more agile and cost-effective, as well as less labor-intensive than on-premises solutions. They hold structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). Sisense provides instant access to your cloud data warehouses.

    3. Data modeling: Create relationships between data. Connect tables

    Once the data is stored, data engineers can pull from the data warehouse or data lake to create tables and objects organized in more easily accessible and usable ways. They create relationships between data and connect tables, modeling the data in a way that sets relationships which will later be translated into query paths for joins when a dashboard designer initiates a query in the front end. Users, in this case BI and business analysts, can then examine the data, connect and compare different tables, and develop analytics from it.

    The combination of a powerful storage repository and a powerful BI and analytics platform enables such analysts to transform live Big Data from cloud data warehouses into interactive dashboards in minutes. They use an array of tools to help achieve this. Dimension tables include information that can be sliced and diced as required for customer analysis (date, location, name, etc.). Fact tables include transactional information, which we aggregate. The Sisense ElastiCube enables analysts to mash up any data from anywhere. The result: highly effective data modeling that maps out all the different places where a software system or application stores information, and works out how these sources of data will fit together, flow into one another, and interact.
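
    As a rough illustration of that fact/dimension split, the query below joins a fact table of transactions to two dimension tables and aggregates the result the way a dashboard's generated query path might; all table and column names are hypothetical.

    ```python
    # Illustrative star-schema query: a fact table joined to two dimension tables,
    # aggregated the way a dashboard widget's query might be generated.
    REVENUE_BY_REGION_AND_MONTH = """
    SELECT d.year,
           d.month,
           c.region,
           SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date     d ON d.date_key     = f.date_key
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY d.year, d.month, c.region
    ORDER BY d.year, d.month;
    """
    ```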

    After this, the process follows one of two paths:

    4. Building dashboards and widgets

    Now, developers pick up the baton and they create dashboards so that business users can easily visualize data and discover insights specific to their needs. They also build actionable analytics apps, thereby integrating data insights into workflows by taking data-driven actions through analytic apps. And they define exploration layers, using an enhanced gallery of relationships between widgets.

    Advanced tools that help deliver insights include universal knowledge graphs and augmented analytics that use machine learning (ML)/artificial intelligence (AI) techniques to automate data preparation, insight discovery, and sharing. These drive automatic recommendations arising from data analysis and predictive analytics, respectively. Natural language querying puts the power of analytics in the hands of even non-technical users by enabling them to ask questions of their datasets without needing code, and to tailor visualizations to their own needs.

    5. Embed analytics into customers’ products and services

    Extending analytics capabilities even further, developers can create applications that they embed directly into customers’ products and services, so that they become instantly actionable. This means that at the end of the BI and analytics process, when you have extracted insights, you can immediately apply what you’ve learned in real time at the point of insight, without needing to leave your analytics platform and use alternative tools. As a result, you can create value for your clients by enabling data-driven decision-making and self-service analysis. 

    With a package like Sisense for Product Teams, product teams can build and scale custom actionable analytic apps and seamlessly integrate them into other applications, opening up new revenue streams and providing a powerful competitive advantage.

    Author: Adam Murray

    Source: Sisense
