4 items tagged "data analysis"

  • Solutions to help you deal with heterogeneous data sources

    With enterprise data pouring in from different sources (CRM systems, web applications, databases, files, etc.), streamlining data processes is a significant challenge, as it requires integrating heterogeneous data streams. In such a scenario, standardizing data becomes a prerequisite for effective and accurate data analysis. Without the right integration strategy, application-specific and intradepartmental data silos arise, hindering productivity and delaying results.

    Consolidating data from disparate structured, unstructured, and semi-structured sources can be complex. A survey conducted by Gartner revealed that one-third of respondents consider 'integrating multiple data sources' as one of the top four integration challenges.

    Understanding the common issues faced during this process can help enterprises successfully counteract them. Here are three challenges generally faced by organizations when integrating heterogeneous data sources, as well as ways to resolve them:

    Data extraction

    Challenge: Pulling source data is the first step in the integration process. But it can be complicated and time-consuming if data sources have different formats, structures, and types. Moreover, once the data is extracted, it needs to be transformed to make it compatible with the destination system before integration.

    Solution: The best way to go about this is to create a list of sources that your organization deals with regularly. Look for an integration tool that supports extraction from all these sources. Preferably, go with a tool that supports structured, unstructured, and semi-structured sources to simplify and streamline the extraction process.

    Data integrity

    Challenge: Data quality is a primary concern in every data integration strategy. Poor data quality is a compounding problem that can affect the entire integration cycle: processing invalid or incorrect data leads to faulty analytics which, if passed downstream, can corrupt results.

    Solution: To ensure that only correct and accurate data enters the data pipeline, create a data quality management plan before starting the project. Outlining these steps helps keep bad data out of every stage of the pipeline, from development to processing.


    Data scalability

    Challenge: Data heterogeneity means data flows into a unified system from diverse sources, which can ultimately lead to exponential growth in data volume. Organizations need a robust integration solution that can handle both high volume and high disparity in data without compromising performance.

    Solution: Anticipating the extent of growth in enterprise data helps organizations select an integration solution that meets their scalability and diversity requirements. Integrating one dataset at a time is beneficial in this scenario: evaluating the value of each dataset with respect to the overall integration strategy helps prioritize and plan. Say an enterprise wants to consolidate data from three different sources: Salesforce, SQL Server, and Excel files. The data within each system can be categorized into unique datasets, such as sales, customer information, and financial data. Prioritizing and integrating these datasets one at a time lets the organization scale its data processes gradually.
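    The one-dataset-at-a-time approach described above can be sketched in a few lines of Python. Everything here (the extracts, field names, and priority order) is invented for illustration; real Salesforce, SQL Server, and Excel extracts would come from connectors or exported files.

```python
# Hypothetical extracts from three heterogeneous sources. Field names and
# values are made up for the example.
salesforce_sales = [{"AccountName": "Acme", "Amount": "1200.50"}]
sqlserver_customers = [{"cust_name": "Acme", "region": "EMEA"}]
excel_financials = [{"Company": "Acme", "Q1 Revenue": "5,000"}]

def normalize_sales(rows):
    # Map the source-specific schema onto a shared one.
    return [{"entity": r["AccountName"], "dataset": "sales",
             "value": float(r["Amount"])} for r in rows]

def normalize_customers(rows):
    return [{"entity": r["cust_name"], "dataset": "customer",
             "value": r["region"]} for r in rows]

def normalize_financials(rows):
    # Strip thousands separators before converting to a number.
    return [{"entity": r["Company"], "dataset": "financial",
             "value": float(r["Q1 Revenue"].replace(",", ""))} for r in rows]

# Priority order: integrate the most valuable dataset first.
pipeline = [
    ("sales", salesforce_sales, normalize_sales),
    ("customer", sqlserver_customers, normalize_customers),
    ("financial", excel_financials, normalize_financials),
]

warehouse = []
for name, extract, normalize in pipeline:
    warehouse.extend(normalize(extract))  # one dataset at a time
```

    Each dataset lands in the unified store only after it has been normalized, so the pipeline can grow one prioritized source at a time.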

    Author: Ibrahim Surani

    Source: Dataversity

  • The 4 steps of the big data life cycle

    Simply put, the life cycle of big data consists of four stages:

    1. Big data collection
    2. Big data preprocessing
    3. Big data storage
    4. Big data analysis

    Together, these four stages constitute the core technology of the big data life cycle.

    Big data collection

    Big data collection is the gathering of massive amounts of structured and unstructured data from various sources.

    Database collection: Sqoop and ETL tools are popular choices, and the traditional relational databases MySQL and Oracle still serve as data stores for many enterprises. Open-source tools such as Kettle and Talend also build in big data integration, enabling data synchronization and integration between HDFS, HBase, and mainstream NoSQL databases.

    Network data collection: a method that uses web crawlers or public website APIs to obtain unstructured or semi-structured data from web pages and unify it into structured local data.
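    As a small illustration of the crawler side of network data collection, the standard-library HTML parser below turns a semi-structured page fragment into structured records. The snippet, tag layout, and class name are invented; a real crawler would fetch the page over HTTP first.

```python
from html.parser import HTMLParser

# Stand-in for a fetched page; a real crawler would download this over HTTP.
PAGE = """
<ul>
  <li class="product">Widget A - 9.99</li>
  <li class="product">Widget B - 14.50</li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collects name/price records from <li class="product"> elements."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "product") in attrs:
            self.in_product = True

    def handle_data(self, data):
        text = data.strip()
        if self.in_product and text:
            # "Widget A - 9.99" -> structured record
            name, price = text.rsplit(" - ", 1)
            self.records.append({"name": name, "price": float(price)})
            self.in_product = False

parser = ProductParser()
parser.feed(PAGE)
```

    The result is a list of uniform dictionaries, i.e. the "unified into structured local data" step the article describes.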

    File collection: includes real-time file collection and processing technologies such as Flume, ELK-based log collection, incremental collection, and so on.

    Big data preprocessing

    Big data preprocessing refers to operations such as cleaning, filling, smoothing, merging, normalization, and consistency checking, performed on the collected raw data before analysis in order to improve data quality and lay the foundation for later analysis. Data preprocessing mainly includes four parts:

    1. Data cleaning
    2. Data integration
    3. Data transformation
    4. Data reduction

    Data cleaning refers to the use of ETL and similar cleaning tools to deal with missing data (attributes of interest are absent), noisy data (errors, or values that deviate from what is expected), and inconsistent data.
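    The three cleaning tasks just listed can be sketched with the standard library alone. The sensor-style records, the label map, and the plausibility range are invented for the example; a real pipeline would use an ETL tool's cleansing stages.

```python
from statistics import mean

rows = [
    {"city": "NYC", "temp": 21.0},
    {"city": "new york", "temp": None},  # missing data
    {"city": "NYC", "temp": 300.0},      # noisy data (implausible reading)
    {"city": "NYC", "temp": 22.5},
]

# 1. Inconsistent data: map label variants onto one canonical form.
canonical = {"new york": "NYC"}
for r in rows:
    r["city"] = canonical.get(r["city"], r["city"])

# 2. Noisy data: treat readings outside a plausible range as invalid.
for r in rows:
    if r["temp"] is not None and not (-50.0 <= r["temp"] <= 60.0):
        r["temp"] = None

# 3. Missing data: impute the remaining gaps with the mean of valid readings.
valid = [r["temp"] for r in rows if r["temp"] is not None]
fill = mean(valid)
for r in rows:
    if r["temp"] is None:
        r["temp"] = fill
```

    Note the ordering: noisy values are invalidated before imputation so that an impossible reading cannot distort the fill value.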

    Data integration refers to consolidating and storing data from different sources in a unified database. It focuses on solving three problems: schema matching, data redundancy, and the detection and resolution of data value conflicts.

    Data transformation refers to processing the inconsistencies in the extracted data. It also includes cleaning work, i.e., removing abnormal data according to business rules to ensure the accuracy of subsequent analysis.

    Data reduction refers to minimizing the volume of data to obtain a smaller data set while preserving the original character of the data as far as possible. Techniques include data cube aggregation, dimensionality reduction, data compression, numerosity reduction, concept hierarchies, etc.
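    Two of those techniques, aggregation and numerosity reduction, are easy to sketch on made-up daily sales records (the dates and amounts below are invented):

```python
import random
from collections import defaultdict

daily = [("2023-01-01", 100), ("2023-01-15", 150), ("2023-02-03", 200)]

# Aggregation: roll day-level rows up into month-level totals, shrinking
# the data set while keeping its overall shape.
monthly = defaultdict(int)
for date, amount in daily:
    monthly[date[:7]] += amount  # key by "YYYY-MM"

# Numerosity reduction: keep a representative random sample rather than
# every row.
random.seed(0)
sample = random.sample(daily, k=2)
```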

    Big data storage

    Big data storage refers to persisting the collected data in databases. There are three typical routes:

    New database clusters based on the MPP architecture: these combine a shared-nothing architecture and MPP's efficient distributed computing model with columnar storage, coarse-grained indexing, and other big data processing techniques, with a focus on data storage for industry big data. Offering low cost, high performance, and high scalability, they are widely used in enterprise analytics applications.

    Compared with traditional databases, MPP products have significant advantages in PB-scale data analysis, and MPP databases have naturally become the preferred choice for the new generation of enterprise data warehouses.

    Technology expansion and encapsulation based on Hadoop: this route targets data and scenarios that traditional relational databases struggle with, such as the storage and computation of unstructured data. It leverages Hadoop's open-source advantages and related strengths (handling unstructured and semi-structured data, complex ETL processes, and complex data mining and computation models) to derive relevant big data technology.

    With the advancement of technology, its application scenarios will gradually expand. The most typical application scenario at present is to support the Internet big data storage and analysis by expanding and encapsulating Hadoop, involving dozens of NoSQL technologies.

    Big data all-in-one: This is a combination of software and hardware designed for the analysis and processing of big data. It consists of a set of integrated servers, storage devices, operating systems, database management systems, and pre-installed and optimized software for data query, processing, and analysis. It has good stability and vertical scalability.

    Big data analysis and mining

    Big data analysis and mining is the process of extracting, refining, and analyzing chaotic data by means of visual analysis, data mining algorithms, predictive analysis, semantic engines, data quality management, and so on.

    Visual analysis: an analysis method that conveys and communicates information clearly and effectively by graphical means. It is mainly used for association analysis on massive data: with the help of a visual data analysis platform, association analysis is performed on dispersed, heterogeneous data and a complete analysis chart is produced. It is simple, clear, intuitive, and easy to accept.

    Data mining algorithms: data analysis methods that test and calculate data by creating data mining models. They are the theoretical core of big data analysis.

    There are many data mining algorithms, and different algorithms exhibit different characteristics depending on data types and formats. Generally speaking, though, the process of creating a model is similar: first analyze the data provided by the user, then search for specific types of patterns and trends, use the results to define the optimal parameters for the mining model, and finally apply those parameters to the entire data set to extract actionable patterns and detailed statistics.

    Data quality management: the identification, measurement, monitoring, and early warning of the data quality problems that can arise at each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application, retirement, etc.), carried out as a series of management activities aimed at improving data quality.

    Predictive analysis: one of the most important application areas of big data analysis. It combines a variety of advanced analytics capabilities (specialized statistical analysis, predictive modeling, data mining, text analytics, entity analytics, optimization, real-time scoring, machine learning, etc.) to predict uncertain events.

    It helps users analyze trends, patterns, and relationships in structured and unstructured data, and use these indicators to predict future events, providing a basis for taking action.

    Semantic Engine: Semantic engine refers to the operation of adding semantics to existing data to improve users’ Internet search experience.

    Author: Sajjad Hussain

    Source: Medium


  • Two modern-day shifts in market research

    In an industry that is changing tremendously, traditional ways of doing things will no longer suffice. Timelines are shortening as demands for ever-faster insights increase, and we are seeking these insights in a vast sea of data. The only way to address the combination of these two issues is with technology.

    The human-machine relationship

    One good example of this shift is the arena of computational text analysis. Smarter, artificial intelligence (AI)-based approaches are completely changing the way we approach this task. In the past, human-based analysis only allowed us to skim the text, use a small sample, and analyze it with subjective bias. That generalized approach is being replaced by a computational methodology that incorporates all the text while discarding what the computer views as non-essential information. Without the right program, much of the meaning can be lost; however, the machine-based approach can work with large amounts of data quickly.

    When we start to dive deeper into AI-based solutions, we see that technology can shoulder much of the hard work to free up humans to do what we can do better. What the machine does really well is finding the data points that can help us tell a better, richer story. It can run algorithms and find patterns in natural language, taking care of the heavy lifting. Then the human can come in, add color and apply sensible intelligence to the data. This human-machine tension is something I predict that we’ll continue to see as we accommodate our new reality. The end goal is to make the machine as smart as possible to really leverage our own limited time in the best ways possible.

    Advanced statistical analysis

    Another big change taking place surrounds the statistical underpinnings we use for analysis. Traditionally we have found things out by using the humble crosstab tool. But if we truly want to understand what’s driving, for example, differences between groups, it is simply not efficient to go through crosstab after crosstab. It is much better to have the machine do it for you and reveal just the differences that matter. When you do that, though, classical statistics break down because false positives become statistically inevitable.

    Bayesian statistics do not suffer the same problem when a high volume of tests is required. In short, a Bayesian approach allows researchers to test whether a hypothesis holds given the data, rather than the more commonly used significance tests, which ask whether the data are consistent with a given hypothesis.
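    The multiple-testing problem can be seen in a short simulation: run many significance tests on pure noise and count how often an effect is "found". The sample sizes, test count, and threshold below are illustrative choices, not figures from the article.

```python
import random
from math import sqrt

random.seed(42)
N_TESTS, N_SAMPLES, Z_CRIT = 200, 50, 1.96  # 1.96 = two-sided 5% threshold

false_positives = 0
for _ in range(N_TESTS):
    # The null hypothesis is true by construction: pure standard-normal noise.
    data = [random.gauss(0, 1) for _ in range(N_SAMPLES)]
    # z-statistic for "the true mean is zero" (known variance of 1).
    z = (sum(data) / N_SAMPLES) * sqrt(N_SAMPLES)
    if abs(z) > Z_CRIT:
        false_positives += 1  # "significant" despite there being no effect

# Around 5% of the tests are expected to come out significant purely by
# chance, which is exactly the classical breakdown described above.
```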

    There are a host of other models that are changing the way we approach our daily jobs in market research. New tools, some of them based in a completely different set of underlying principles (like Bayesian statistics), are giving us new opportunities. With all these opportunities, we are challenged to work in a new set of circumstances and learn to navigate a new reality.

    We can’t afford to wait any longer to change the way we are doing things. The industry and our clients’ industries are moving too quickly for us to hesitate. I encourage researchers to embrace this new paradigm so that they will have the skill advantage. Try new tools, even if you don’t understand how they work; many of them can help you do what you do, better. Doing things in new ways can lead to better, faster insights. Go for it!

    Author: Geoff Lowe

    Source: Greenbook Blog

  • Using Hierarchical Clustering in data analysis

    This article discusses the analytical method of Hierarchical Clustering and how it can be used within an organization for analytical purposes.

    What is Hierarchical Clustering?

    Hierarchical Clustering is a process by which objects are classified into a number of groups so that the objects are as dissimilar as possible between groups and as similar as possible within each group.

    For example, if you want to create four groups of items, the items within each group should be as similar as possible in terms of their attributes, while the items in group 1 and group 2 should be as dissimilar as possible. In the top-down approach, all items start in one cluster and are then divided into two clusters, so that the data points within one cluster are as similar as possible and as dissimilar as possible from those in the other cluster. The process is repeated on each cluster until the specified number of clusters is reached (four in this case).
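    As a concrete sketch, here is a minimal bottom-up (agglomerative, single-linkage) version of hierarchical clustering in plain Python. The article describes the top-down variant, but both build the same kind of nested grouping; the points and cluster count below are invented for the example.

```python
from math import dist

def hierarchical_clusters(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of
                # points drawn from the two clusters.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
    return clusters

# Two visually obvious groups in 2-D.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = hierarchical_clusters(points, k=2)
```

    With k=2, the three points near the origin end up in one cluster and the three near (10, 10) in the other, matching the "similar within, dissimilar between" goal.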

    This type of analysis can be applied to segment customers by purchase history, segment users by the types of activities they perform on websites or applications, develop personalized consumer profiles based on activities or interests, recognize market segments, and more.

    How does an organization use Hierarchical Clustering to analyze data?

    In order to understand the application of Hierarchical Clustering for organizational analysis, let us consider two use cases.

    Use case one

    Business problem: A bank wants to group loan applicants into high/medium/low risk based on attributes such as loan amount, monthly installments, employment tenure, the number of times the applicant has been delinquent on other payments, annual income, debt-to-income ratio, etc.

    Business benefit: Once the segments are identified, the bank will have a loan applicant dataset with each applicant labeled as high/medium/low risk. Based on these labels, the bank can easily decide whether to grant a loan to an applicant, how much credit to extend, and what interest rate to offer, according to the amount of risk involved.

    Use case two

    Business problem: The enterprise wishes to organize customers into groups/segments based on similar traits, product preferences and expectations. Segments are constructed based on customer demographic characteristics, psychographics, past behavior and product use behavior.

    Business benefit: Once the segments are identified, marketing messages and products can be customized for each segment. The better the segment(s) chosen for targeting by a particular organization, the more successful the business will be in the market.

    Hierarchical Clustering can help an enterprise organize data into groups to identify similarities and, equally important, dissimilar groups and characteristics, so that the business can target pricing, products, services, marketing messages and more.

    Author: Kartik Patel

    Source: Dataversity
