2 items tagged "database management"

  • Data wrangling in SQL: 5 recommended methods

Data wrangling is an essential job function for data engineering, data science, and machine learning roles. As knowledgeable coders, many of these professionals rely on their programming skills and help from libraries like Pandas to wrangle data. However, it is often better to manipulate data directly at the source with SQL scripts. Here are some data wrangling techniques every data expert should know.

Note: for the purposes of this blog post I will not go into the details of the many versions of SQL. Much of the syntax mentioned conforms to the SQL-92 standard, but where there are platform-specific clauses and functions I will try to point them out (PostgreSQL, MySQL, T-SQL, PL/SQL, etc.), though not extensively. Remember, however, that it is common to implement a missing function directly in your favorite flavor of SQL. For example, the correlation coefficient function CORR is present in PostgreSQL and PL/SQL but not MySQL, where it can be implemented directly or imported from an external library. To keep this blog brief, I don’t spell out the implementation of every technique, but they are widely available and will also be covered in my upcoming virtual training session, where I’ll go into further detail on data wrangling.

    1. Data Profiling with SQL

Data profiling is often the first step in any data science or machine learning project and an important preliminary step to explore the data prior to data preparation or transformation. The goal is straightforward: determine whether the data is accurate, reliable, and representative enough to build well-performing models, since the resulting models are only as good as the data that goes into them.

    Core SQL Skills for Data Profiling

• Descriptive statistics: MIN, MAX, AVG, MEDIAN, STDEV/STDEVP (T-SQL), STDDEV (PostgreSQL)
• Correlation functions: CORR (PostgreSQL)
    • Aggregation functions: COUNT, SUM, GROUP BY

Data profiling starts with some basic descriptive statistic functions such as MIN, MAX, and STDEV to help you understand the distribution and range of your data. Even basic aggregations like COUNT and GROUP BY are useful. From there, the standard deviation functions STDEV/STDEVP can tell you a lot about how widely your data is spread. For a quick look at correlations between features and the target feature, the CORR correlation coefficient function can give some quick insight.
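For illustration, here is a minimal profiling query, assuming a hypothetical sales table with region, order_total, and discount columns (PostgreSQL syntax; the statistics functions go by different names, or may be absent, in other dialects):

    -- Basic profile of order_total per region (illustrative table and columns)
    SELECT
        region,
        COUNT(*)                    AS n_rows,
        MIN(order_total)            AS min_total,
        MAX(order_total)            AS max_total,
        AVG(order_total)            AS avg_total,
        STDDEV(order_total)         AS sd_total,            -- STDEV / STDEVP in T-SQL
        CORR(order_total, discount) AS corr_total_discount  -- PostgreSQL / Oracle
    FROM sales
    GROUP BY region;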

    2. Detecting Outliers with SQL

An essential part of data profiling is identifying outliers, and SQL can be a quick and easy way to find extreme values. You can get quite far with the simple techniques below combined with basic sorting.

    Core SQL Skills for Detecting Outliers 

• Sorting & filtering: ORDER BY, CASE, WHERE, HAVING
• Descriptive statistics: AVG, MEDIAN
• Winsorization: PERCENTILE_CONT, PERCENT_RANK (T-SQL), subqueries, VAR (T-SQL)
    • Fixing outliers: UPDATE, NULL, AVG

Simple sorting with ORDER BY will produce quick results, and you can classify further with CASE statements, grouping, and summation. More sophisticated approaches use various forms of Winsorization and variance functions such as VAR (T-SQL). To identify outliers you can use percentile calculations, while VAR (T-SQL) measures the statistical variance.
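As a sketch of the idea, the query below flags rows more than three standard deviations from the mean, again assuming the hypothetical sales table; the 3-sigma cutoff and column names are illustrative choices, not a prescription:

    -- Flag extreme order_total values using a simple 3-sigma screen
    SELECT s.order_id,
           s.order_total,
           (s.order_total - a.mean_total) / a.sd_total AS z_score
    FROM sales s
    CROSS JOIN (
        SELECT AVG(order_total)    AS mean_total,
               STDDEV(order_total) AS sd_total
        FROM sales
    ) a
    WHERE ABS(s.order_total - a.mean_total) > 3 * a.sd_total
    ORDER BY s.order_total DESC;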

There are various techniques for handling outliers, from ignoring them to removing (trimming) them. Replacement techniques such as Winsorization essentially change extreme values in the dataset to less extreme values close to them, which typically requires calculating percentiles. We can use UPDATE with PERCENTILE_CONT (present in PostgreSQL, T-SQL, and PL/SQL) to, for example, set all observations greater than the 95th percentile equal to the value at the 95th percentile and all observations less than the 5th percentile equal to the value at the 5th percentile.
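A minimal sketch of that Winsorization step in PostgreSQL syntax, once more assuming the hypothetical sales table (in T-SQL, PERCENTILE_CONT is a window function with an OVER clause, so the query would look slightly different):

    -- Clamp order_total to the 5th and 95th percentiles
    WITH bounds AS (
        SELECT
            PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY order_total) AS p05,
            PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY order_total) AS p95
        FROM sales
    )
    UPDATE sales
    SET order_total = LEAST(GREATEST(order_total, bounds.p05), bounds.p95)
    FROM bounds
    WHERE order_total < bounds.p05
       OR order_total > bounds.p95;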

3. Time Series, Reindexing, and Pivoting with SQL

Time-series databases such as Prometheus, TDengine, and InfluxDB are quite popular, but more often than not you will be dealing with a standard relational database, where pivoting, reindexing, or transforming data into a time series can be a challenge.

Core SQL Skills for Time Series, Reindexing, and Pivoting

• Window functions: LAG, OVER (PostgreSQL, T-SQL, MySQL)
• Time-series generation
• Pivoting: PIVOT (T-SQL), CROSSTAB (PostgreSQL), CASE
• Data trends: COALESCE

For time series and data realignment, window functions perform calculations on a set of related rows, which makes them perfect for this type of data transformation. For example, if we have a set of time-stamped data and we want the difference at each step, we can use the LAG function. The LAG() function provides access to a row that comes before the current row. This works because window operations like LAG() do not collapse groups of query rows into a single output row; instead, they produce a result for each row. The OVER clause specifies how query rows are partitioned into groups for processing by the window function.
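For example, a sketch assuming a hypothetical sensor_readings table of time-stamped values per sensor:

    -- Difference between each reading and the previous one for the same sensor
    SELECT
        sensor_id,
        reading_time,
        value,
        value - LAG(value) OVER (
            PARTITION BY sensor_id
            ORDER BY reading_time
        ) AS delta_from_previous
    FROM sensor_readings
    ORDER BY sensor_id, reading_time;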

Reindexing (not to be confused with database indexing) or aligning datasets is another key technique. For example, you may have sales from different time periods you’d like to compare. Using window functions and subqueries you can rebase the data to a common starting point (or zero) or align it on the same period units in time.
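One way to sketch this rebasing, assuming a hypothetical period_sales table, is to subtract each period’s first value with FIRST_VALUE so every period starts at zero:

    -- Rebase each period's running sales so all periods start at 0 and can be compared
    SELECT
        period_id,
        sale_date,
        amount,
        amount - FIRST_VALUE(amount) OVER (
            PARTITION BY period_id
            ORDER BY sale_date
        ) AS amount_rebased
    FROM period_sales;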

Pivoting is a widely used spreadsheet technique that also has its place in data wrangling. Often when looking at periodic or time-stamped data you get a single column of data you’d like to pivot into multiple columns; for example, pivoting a single column of monthly sales listed by rows into multiple columns of sales by month. PIVOT and CROSSTAB can be used, but for SQL flavors that don’t support those functions you can implement the same thing with CASE statements.
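A sketch of the CASE approach, assuming a hypothetical monthly_sales table with one row per product per month:

    -- Pivot monthly rows into one column per month (first quarter shown)
    SELECT
        product_id,
        SUM(CASE WHEN sale_month = 1 THEN amount ELSE 0 END) AS jan_sales,
        SUM(CASE WHEN sale_month = 2 THEN amount ELSE 0 END) AS feb_sales,
        SUM(CASE WHEN sale_month = 3 THEN amount ELSE 0 END) AS mar_sales
    FROM monthly_sales
    GROUP BY product_id;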

    4. Fake Data Generation with SQL

    There are a number of reasons to generate fake data, including:

    • Anonymizing production and sensitive data before porting it to other data stores
    • Generating more data that has similar characteristics to existing data for load testing, benchmarking, robustness testing, or scaling. 
• Varying the distribution of your data to understand how your data science or ML model performs in terms of bias, loss function, or other metrics

    Core SQL skills:

    • RANDOM, RAND
    • CREATE, UPDATE, SELECT, INSERT, and subqueries
• JOIN and CROSS JOIN, MERGE, CAST
• String manipulation functions: CONCAT, SUBSTR, TRIM, etc.

SQL can be useful for generating fake data for load testing, benchmarking, or model training and testing. For numerical data, RANDOM, RAND(), and their variants are the classic random generators. You can generate data into new tables (CREATE) or update existing ones (UPDATE), and data selection with subqueries is generally a prerequisite. CROSS JOIN is a favorite because it combines data (a cartesian join) to make new data sets, and MERGE can be used similarly. CAST is useful for converting variables into different data types. When generating text, string manipulation functions such as CONCAT, SUBSTR, TRIM, and a host of others can be used to create new text variables.
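As an illustrative sketch in PostgreSQL syntax (table and column names are hypothetical), RANDOM and CROSS JOIN can be combined to synthesize rows for load testing:

    -- Build synthetic orders by pairing every customer with every product
    INSERT INTO load_test_orders (customer_id, product_id, order_total, order_note)
    SELECT
        c.customer_id,
        p.product_id,
        ROUND((RANDOM() * 500)::numeric, 2)        AS order_total,  -- random amount 0-500
        CONCAT('synthetic-', TRIM(p.product_code)) AS order_note
    FROM customers c
    CROSS JOIN products p;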

    5. Loading Flat Files with SQL 

This may look like a bonus-round data wrangling technique, but I’ve worked with many data experts who struggle to load data files directly into a database (the L in ETL). Why do this? For many research projects data is not readily available in data lakes, APIs, or other data stores. Generally, it has to be collected in the wild, more often than not in a flat-file format or across multiple flat files.

    Core Skills required

• SQL clients: understand delimiters, text delimiting, and special characters
• Table creation: CREATE TABLE, TEMPORARY TABLE
• Importing data: IMPORT, INSERT, SELECT INTO
• Normalizing data with SQL clauses including SELECT, UPDATE, WHERE, JOIN, and subqueries
• Transaction control: BEGIN, COMMIT, and ROLLBACK

ETL (Extract, Transform, Load) tooling is the most common way to import CSV files, but you can still use SQL for the all-important verification step. Simple verification includes checking for nulls, understanding the range and distribution of imported values, finding extreme values, and even simple count checks.
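A quick post-load verification pass might look like the following, assuming a hypothetical imported_orders staging table:

    -- Sanity checks after an import: row count, nulls, and value range
    SELECT
        COUNT(*)                                             AS total_rows,
        SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_ids,
        MIN(order_total)                                     AS min_total,
        MAX(order_total)                                     AS max_total
    FROM imported_orders;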

However, loading flat-file data into a database is not as trivial as it seems, thanks to the need for specific delimiters. Most SQL clients allow you to specify custom delimiters. For text-heavy files you can’t rely on simple quote delimiters alone; double quoting and other escaping techniques must be employed. Also be careful to check for whitespace characters, including newline characters.

Importing a single flat file like a CSV is somewhat straightforward once the above steps are done. However, flat files are by nature not normalized, so you can use SQL to create a temporary table to load the data and then use filtering and UPDATE clauses to port that data to the final destination tables. A common technique is to import the data into temporary tables in “raw” format and then normalize it into other tables. As always, employ transaction control when inserting or updating data.
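A minimal sketch of that staged load in PostgreSQL (the file path, table names, and COPY options are illustrative; other platforms use BULK INSERT, LOAD DATA INFILE, or a client-side import instead):

    BEGIN;

    -- Land the raw CSV in a temporary staging table
    CREATE TEMPORARY TABLE staging_customers (
        raw_name    TEXT,
        raw_email   TEXT,
        raw_country TEXT
    );

    COPY staging_customers FROM '/tmp/customers.csv'
    WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"');

    -- Normalize into the destination table, trimming whitespace as we go
    INSERT INTO customers (name, email, country)
    SELECT TRIM(raw_name), LOWER(TRIM(raw_email)), NULLIF(TRIM(raw_country), '')
    FROM staging_customers
    WHERE raw_email IS NOT NULL;

    COMMIT;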

    Author: Seamus McGovern

    Source: Open Data Science

  • The Bright Future of SQL

    Machine learning, big data analytics or AI may steal the headlines, but if you want to hone a smart, strategic skill that can elevate your career, look no further than SQL.

    Data engineering and data science are fast-moving, competitive fields. Technologies come and go, so keeping your skillset updated is something all ambitious data pros can agree on. Where data engineers and scientists disagree is exactly what skills will be most valuable in the future.

Regardless of the bewildering array of tools and services available to data scientists, it is still humble SQL that forms the bedrock of a data scientist’s stack. While SQL is usually seen as a baseline skill, it is, in fact, much more than that. Despite being almost 50 years old, SQL is becoming more, not less, relevant.

    Machine learning, big data analytics or AI may steal the headlines, but if you want to hone a smart, strategic skill that can elevate your career, look no further than SQL. Here’s why. 

    SQL Dominates Databases

First off, “SQL is really the language of data,” in the words of Benjamin Rogojan (aka Seattle Data Guy). This is down to the fact that the majority of databases are built on one or other of the SQL-based technologies. All but two of the top ten most popular databases today are based on SQL; the exceptions (MongoDB and Redis) are ranked fifth and sixth respectively, and even they can be used with SQL. It’s easy to see why anyone who needs to query, update, change or in any way engage with data in relational databases is going to be well served by a solid working knowledge of SQL, no matter what specialism they end up pursuing.

    Demand for SQL Skills is High, and Growing

Despite its age, SQL is far from a legacy skill. As data engineering has advanced into the cloud, SQL has followed, and, according to Dataquest, SQL was the most in-demand skill among all jobs in data in 2021, especially at the more junior end of the spectrum. Even more experienced data scientist job postings still list SQL in almost 60 percent of vacancies. What’s more, doubtless due to surging demand for data-related expertise, demand for SQL skills appears to be growing, despite a brief dip in 2020. Pandemic aside, the SQL server transformation market, which helps businesses address the need for data transformation, is predicted to grow steadily at more than 10 percent CAGR until the end of the decade.

    Should Savvy Data Scientists Prioritize SQL?

    The future of SQL looks safe, but it does not necessarily follow that budding data scientists who already have a working knowledge of it will prioritize deepening their SQL skills to further their career progress. 

    They should. 

    With so many tools and nascent technologies to help them at the ELT / ETL stage, for BI and for both predictive and historic analytics, data scientists need to be savvy about where to plough their energy. The continually shrinking half-life of high-tech skills means that the tools and skills data scientists learn can be career-defining – or career-limiting. 

How SQL is Taking Center Stage

No one wants to spend six months figuring out a tool that only delivers half of what was expected of it, let alone recommend it to the wider team only to find that it underwhelms. So, when data scientists look at the services and techniques out there that will help them query their data more effectively, they are probably going to look at the best BI tools and ML extensions that will enable them to prep the data, create the model and then train it. But all these different stages take time and demand high levels of expertise. We’ve been conditioned into accepting that ML modelling requires the data to be extracted from the database, usually using a BI tool, transformed and loaded into the BI system, before being exported (again) to the ML tool, where the magic happens, and then transported back to the BI tool for visualization.

What if I told you there is a way of taking ML models to the data, enabling you to query predictions from inside the database using (you guessed it) SQL? There is. It’s part of a small but rapidly growing movement that brings intelligence into the data layer, rather than painstakingly taking the data to the ML tool.

    In-database Innovation

    In-database ML is a much simpler way to use existing data to predict future events…and it uses standard SQL commands. In-database ML is a bit like giving your database a brain. It means data scientists – and data engineers, and indeed anyone with SQL skills – can work within the database, running ML models to answer almost any business question. Predicting customer churn, credit scoring, customer lifecycle optimization, fraud detection, inventory management, price modelling, and predicting patient health outcomes are just a few of the many use cases in-database modelling has enabled. With this approach, all the ML models can be created, queried and maintained as if they were database tables, using the SQL language and bringing powerful predictive capabilities to a much wider range of data pros.  
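As a purely hypothetical sketch of what that workflow can feel like (the CREATE MODEL syntax below is illustrative only; each in-database ML product defines its own dialect):

    -- Hypothetical syntax: train a model on existing rows...
    CREATE MODEL churn_model
    FROM (SELECT * FROM customers)
    PREDICT churned;

    -- ...then query it like a table to get predictions
    SELECT customer_id, churned AS predicted_churn
    FROM churn_model
    WHERE tenure_months < 6;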

In-database ML is a relatively new field, but it is one part of a wider, fast-growing movement to simplify and democratize data engineering and data science, breaking down the technical barriers that currently exist for those working with data. Take, for example, dbt Labs, a company that’s taken the data world by storm, having recently secured $222 million of funding and been valued at $4.2 billion. Its data transformation product enables data engineers to build production-grade data pipelines from within the data warehouse using SQL commands, radically simplifying, speeding up, and scaling the process of prepping data.

    SQL – Not Old, But Evergreen

We’re fortunate to be living in a golden age of digital innovation. However, against a business backdrop that prizes the insights data can offer, data scientists are under pressure like never before to produce miracles out of data. A dizzying range of tools and services has grown out of the need to speed up and scale data analytics, and these tools often demand an investment in time and skill development to fully realize their benefits. One skill that is often overlooked, however, is humble SQL, the data scientist’s best friend. Not only is SQL not going anywhere; as the growing movement to innovate closer to the data shows, it is becoming the data scientist’s strategic secret weapon.

    Author: Jorge Torres

    Source: KDnuggets
