9 items tagged "data storage"

  • 3 Predicted trends in data analytics for 2021

    It’s that time of year again for prognosticating trends and making annual technology predictions. As we move into 2021, there are three trends data analytics professionals should keep their eyes on: OpenAI, optimized big data storage layers, and data exchanges. What ties these three technologies together is the maturation of the data, AI and ML landscapes. Because there already is a lot of conversation surrounding these topics, it is easy to forget that these technologies and capabilities are fairly recent evolutions. Each technology is moving in the same direction -- going from the concept (is something possible?) to putting it into practice in a way that is effective and scalable, offering value to the organization.

    I predict that in 2021 we will see these technologies fulfilling the promise they set out to deliver when they were first conceived.

    #1: OpenAI and AI’s Ability to Write

    OpenAI is a research and deployment company that last year released GPT-3, an artificial intelligence model that generates text mimicking text produced by humans. This AI offering can write prose for blog posts, answer questions as a chatbot, or write software code. It has risen to a level of sophistication where it is getting more difficult to discern whether what it generated was written by a human or a machine. Most people already know this type of AI from writing email messages: Gmail anticipates what the user will write next and offers word or sentence prompts. GPT-3 goes further: the user can create a title or designate a topic and GPT-3 will write a thousand-word blog post.

    This is an inflection point for AI, which, frankly, hasn’t been all that intelligent up to now. Right now, GPT-3 is on a slow rollout and is being used primarily by game developers, enabling video gamers to play, for example, Dungeons and Dragons without other humans.

    Who would benefit from this technology? Anyone who needs content. It will write code. It can design websites. It can produce articles and content. Will it totally replace humans who currently handle these duties? Not yet, but it can offer production value when an organization is short-staffed. As this technology advances, it will cease to feel artificial and will eventually be truly intelligent. It will be everywhere and we’ll be oblivious to it.

    #2: Optimized Big Data Storage Layers

    Historically, massive amounts of data have been stored in the cloud, on hard drives, or wherever your company holds information for future use. The problem with these systems has been finding the right data when needed. It hasn’t been well optimized, and the adage “like looking for a needle in the haystack” has been an accurate portrayal of the associated difficulties. The bigger the data got, the bigger the haystack got, and the harder it became to find the needle.

    In the past year, a number of technologies have emerged, including Iceberg, Hudi, and Delta Lake, that are optimizing the storage of large analytics data sets and making it easier to find that needle. They organize the hay in such a way that you only have to look at a small, segmented area, not the entire data haystack, making the search much more precise.
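
    The "small, segmented area of the haystack" idea can be sketched in a few lines. This is a minimal illustration, assuming each data file carries minimum and maximum values for a date column; the `DataFile` class, the file names, and `files_to_scan` are hypothetical and are not the actual API of Iceberg, Hudi, or Delta Lake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical metadata entry kept for one data file of a table."""
    path: str
    min_day: str  # smallest event date stored in the file (ISO format)
    max_day: str  # largest event date stored in the file (ISO format)


def files_to_scan(manifest, day):
    """Return only the files whose date range could contain `day`."""
    # ISO dates compare correctly as plain strings.
    return [f.path for f in manifest if f.min_day <= day <= f.max_day]


manifest = [
    DataFile("part-000.parquet", "2020-01-01", "2020-01-31"),
    DataFile("part-001.parquet", "2020-02-01", "2020-02-29"),
    DataFile("part-002.parquet", "2020-03-01", "2020-03-31"),
]

# A query for one day only has to open one of the three files.
selected = files_to_scan(manifest, "2020-02-14")
```

    The query engine consults the small metadata list first and skips every file that cannot match, which is why the search no longer grows with the size of the whole haystack.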

    This is valuable not only because you can access the right data more efficiently, but because it makes the data retrieval process more approachable, allowing for widespread adoption in companies. Traditionally, you had to be a data scientist or engineer and had to know a lot about underlying systems, but these optimized big data storage layers make it more accessible for the average person. This should decrease the time and cost of accessing and using the data.

    For example, Iceberg came out of an R&D project at Netflix and is now open source. Netflix generates a lot of data, and if an executive wanted to use that data to predict what the next big hit will be in its programming, it could take three engineers upwards of four weeks to come up with an answer. With these optimized storage layers, you can now get answers faster, and that leads to more specific questions with more efficient answers.

    #3: Data Exchanges

    Traditionally, data has stayed siloed within an organization and never leaves. It has become clear that another company may have valuable data in their silo that can help your organization offer a better service to your customers. That’s where data exchanges come in. However, to be effective, a data exchange needs a platform that offers transparency, quality, security, and high-level integration.

    Going into 2021, data exchanges are emerging as an important component of the data economy, according to research from Eckerson Group. According to this recent report, “A host of companies are launching data marketplaces to facilitate data sharing among data suppliers and consumers. Some are global in nature, hosting a diverse range of data sets, suppliers, and consumers. Others focus on a single industry, functional area (e.g., sales and marketing), or type of data. Still, others sell data exchange platforms to people or companies who want to run their own data marketplace. Cloud data platform providers have the upper hand since they’ve already captured the lion’s share of data consumers who might be interested in sharing data.”

    Data exchanges are very much related to the first two focal points we already mentioned, so much so that data exchanges are emerging as a must-have component of any data strategy. Once you can store data more efficiently, you don’t have to worry about adding greater amounts of data, and when you have AI that works intelligently, you want to be able to use the data you have on hand to fill your needs.

    We might reach a point where Netflix isn’t just asking the technology what kind of content to produce but the technology starts producing the content. It uses the data it collects through the data exchanges to find out what kind of shows will be in demand in 2022, and then the AI takes care of the rest. It’s the type of data flow that today might seem far-fetched, but that’s the direction we’re headed.

    A Final Thought

    One technology is about getting access, one is understanding new data, and one is executing information based on the data. As these three technologies begin to mature, we can expect to see a linear growth pattern and see them all intersect at just the right time.

    Author: Nick Jordan

    Source: TDWI

  • Database possibilities in an era of big data

    We live in an era of big data. The sheer volume of data currently existing is huge enough without also grappling with the amount of new information that’s generated every day. Think about it: financial transactions, social media posts, web traffic, IoT sensor data, and much more, being ceaselessly pulled into databases the world over. Outdated technology simply can’t keep up.

    The modern types of databases that have arisen to tackle the challenges of big data take a variety of forms, each suited for different kinds of data and tasks. Whatever your company does, choosing the right database to build your product or service on top of is a vital decision. In this article, we’ll dig into the different types of database options you could be considering for your unique challenges, as well as the underlying database technologies you should be familiar with. We’ll be focusing on relational database management systems (RDBMS), NoSQL DBMS, columnar stores, and cloud solutions.


    First up, the reliable relational database management system. This widespread variety is renowned for its focus on the core database attributes of atomicity (keeping tasks indivisible and irreducible), consistency (actions taken by the database obey certain constraints), isolation (a transaction’s intermediate state is invisible to other transactions), and durability (data changes reliably persist). Data in an RDBMS is stored in tables, and an RDBMS is able to tackle tons of data and complex queries, as opposed to flat files, which tend to take up more memory and are less efficient. An RDBMS is usually made up of a collection of tables, each with columns (fields) and rows (records). Popular examples of RDBMSs include Microsoft SQL Server, Oracle, MySQL, and PostgreSQL.
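
    Atomicity and consistency are easiest to see in a small example. The sketch below uses SQLite from Python's standard library; the `accounts` table, the names, and the `transfer` helper are made up for illustration. A transfer that would break a constraint is rolled back as a whole, so the credit that had already been applied disappears too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

    After the second call, bob's balance does not include the 500 that was credited before the debit failed: the database undid the half-finished work.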

    Some of the strengths of an RDBMS include flexibility and scalability. Given the huge amounts of information that modern businesses need to handle, these are important factors to consider when surveying different types of databases. Ease of management is another strength since each of the constituent tables can be changed without impacting the others. Additionally, administrators can choose to share different tables with certain users and not others (ideal if working with confidential information you might not want shared with all users). It’s easy to update data and expand your database, and since each piece of data is stored at a single point, it’s easy to keep your system free from errors as well.

    No system is perfect, however. A traditional RDBMS is typically built on a single server, so once you hit the limits of the machine you’ve got, you need to buy a bigger one. Rapidly changing data can also challenge these systems, as increased volume, variety, velocity, and complexity create complicated relationships that the RDBMS can have trouble keeping up with. Lastly, despite having 'relation' in the name, relational database management systems don’t store the relationships between elements, meaning that the system doesn’t actually understand the connections between data as pertains to various joins you may be using.


    NoSQL (originally, 'non relational' or 'not SQL') DBMS emerged as web applications were becoming more complex. These types of databases are designed to handle heterogeneous data that’s difficult to stick in a normalization schema. While they can take a wide array of forms, the most important difference between NoSQL and RDBMS is that while relational databases rigidly define how all the data contained within must be arranged, NoSQL databases can be schema agnostic. This means that if you’ve got unstructured and semi-structured data, you can store and manipulate it easily, whereas an RDBMS might not be able to handle it at all. 

    Considering this, it’s no wonder that NoSQL databases are seeing a lot of use in big data and real-time web apps. Examples of these database technologies include MongoDB, Riak, Amazon S3, Cassandra, and HBase. One drawback is that many NoSQL databases offer only 'eventual consistency', meaning that all nodes will eventually have the same data, but not immediately. Since there’s a lag while all the nodes update, it’s possible to get out-of-sync data depending on which node you end up querying during the update window. Data consistency is therefore a challenge with NoSQL systems, since many of them do not perform full ACID transactions.
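
    The update window is easier to picture with a toy model. The sketch below is an in-memory simulation under simplifying assumptions (one node accepts writes, and replication happens only when `sync` is called); the class and key names are hypothetical and do not mirror any particular NoSQL product.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on node 0 and reach the other nodes later."""

    def __init__(self, n_nodes=3):
        self.nodes = [{} for _ in range(n_nodes)]
        self.pending = []  # replication queue of (node_index, key, value)

    def write(self, key, value):
        self.nodes[0][key] = value
        # Other nodes are only scheduled for an update, not updated yet.
        for i in range(1, len(self.nodes)):
            self.pending.append((i, key, value))

    def read(self, key, node_index):
        return self.nodes[node_index].get(key)

    def sync(self):
        """Apply all queued updates; afterwards every node agrees."""
        for i, key, value in self.pending:
            self.nodes[i][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart:42", ["book"])
fresh = store.read("cart:42", node_index=0)      # node that took the write
stale = store.read("cart:42", node_index=2)      # queried during the window
store.sync()
converged = store.read("cart:42", node_index=2)  # after replication catches up
```

    Which answer a client gets during the window depends purely on which node it happens to query.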

    Columnar storage database

    A columnar storage database’s defining characteristic is that it stores data tables by column rather than by row. The main benefit of this configuration is that it accelerates analyses because the system only has to read the locations your query is interested in, which sit together within a column. These systems also compress repeating values in storage more effectively, since the data within any one column is homogeneous (every value in a column has the same type: integers, strings, etc.).
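
    Both effects can be shown with plain Python lists. This is a minimal sketch with made-up sample rows; real columnar engines use far more elaborate encodings, and run-length encoding is just one simple compression scheme chosen here for illustration.

```python
from itertools import groupby

# Row-oriented view: each record keeps all of its fields together.
rows = [
    {"id": 1, "country": "NL", "amount": 10},
    {"id": 2, "country": "NL", "amount": 25},
    {"id": 3, "country": "NL", "amount": 5},
    {"id": 4, "country": "US", "amount": 40},
]

# Column-oriented view: each field's values are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate touches one column and never looks at "id" or "country".
total = sum(columns["amount"])


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Homogeneous, repetitive columns compress well.
encoded_country = run_length_encode(columns["country"])
```

    Here the four country values shrink to two pairs, and the sum never reads the other two columns.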

    However, due to this layout, columnar storage databases are not typically used to build transactional databases. One of the drawbacks of these types of database is that inserts and updates on an entire row (necessary for apps like ERPs and CRMs, for example) can be expensive and slow. For example, when opening an account’s page in a CRM, the app needs to read the entire row (name, address, email, account ID, etc.) to populate the page and write all of that back as well. In this example, a row-oriented relational database would be more efficient.

    Cloud solutions

    While not technically a type of database themselves, no discussion of modern types of database solutions would be complete without discussing the cloud. In this age of big data and fast-moving data sources, data engineers are increasingly turning to cloud solutions (AWS, Snowflake, etc.) to store, access, and analyze their data. One of the biggest advantages of cloud options is that you don’t have to pay for the physical space or the physical machine associated with your database (or its upkeep, emergency backups, etc.). Additionally, you only pay for what you use: as your memory and processing power needs scale up, you pay for the level of service you need, but you don’t have to pre-purchase these capabilities.

    There are some drawbacks to using a cloud solution, however. First off, since you’re connecting to a remote resource, bandwidth limitations can be a factor. Additionally, even though the cloud does offer cost savings, especially when starting a company from scratch, the lifetime costs of paying your server fees could exceed what you would have paid buying your own equipment. Lastly, depending on the type of data you’re dealing with, compliance and security can be issues because the responsibility of managing the data and its security is handled no longer by you, the data owner, but by the third-party provider. For example, unsecured APIs and interfaces can be more readily exploited, the risk of data breaches, data loss, or leakage can be elevated, and unauthorized access through improperly configured firewalls is another way in which cloud databases can be compromised.
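
    The lifetime-cost point comes down to simple arithmetic. The sketch below finds the first month in which cumulative cloud fees exceed the cost of buying and running your own equipment; all figures are invented for illustration and the function name is hypothetical.

```python
import math


def breakeven_month(upfront_cost, on_prem_monthly, cloud_monthly):
    """First month in which total cloud spend exceeds total on-premises spend.

    Returns None when the cloud is never more expensive per month.
    """
    monthly_gap = cloud_monthly - on_prem_monthly
    if monthly_gap <= 0:
        return None
    # Cloud overtakes once monthly_gap * months > upfront_cost.
    return math.floor(upfront_cost / monthly_gap) + 1


# Hypothetical numbers: 24,000 of hardware plus 500 a month to run it,
# versus 1,500 a month in cloud fees.
month = breakeven_month(upfront_cost=24_000, on_prem_monthly=500, cloud_monthly=1_500)
```

    With these made-up numbers the two options cost the same after 24 months, and the cloud becomes the more expensive choice from month 25 onward. Real comparisons also need to account for staffing, upgrades, and unused capacity.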

    Decision time

    The era of Big Data is changing the way companies deal with their data. This means choosing new database models and finding the right analytics and BI tools to help your team get the most out of your data and build the apps, products, and services that will shape the world. Whatever you’re creating, pick the right database type for you, and build boldly.

    Author: Jack Cieslak

    Source: Sisense

  • IBM: Hadoop solutions and the data lakes of the future

    The foundation of the AI Ladder is Information Architecture. The modern data-driven enterprise needs to leverage the right tools to collect, organize, and analyze their data before they can infuse their business with the results.

    Businesses have many types of data and many ways to apply it. We must look for approaches to manage all forms of data, regardless of techniques (e.g., relational, map reduce) or use case (e.g., analytics, business intelligence, business process automation). Data must be stored securely and reliably, while minimizing costs and maximizing utility.

    Object storage is the ideal place to collect, store, and manage data assets which will be used to generate business value.

    Object storage started as an archive

    Object storage was first conceived as a simplification: How can we remove the extraneous functions seen in file systems to make storage more scalable, reliable, and low-cost? Technologies like erasure coding massively reduced costs by allowing reliable storage to be built on cheap commodity hardware. The interface was simple: a uniform limitless namespace, atomic data writes, all available over HTTP.
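
    That interface is small enough to model in a few lines. The following is a toy in-memory sketch, not a client for IBM Cloud Object Storage or any real service; the `ObjectStore` class, its method names, and the keys are hypothetical.

```python
class ObjectStore:
    """Toy object store: one flat namespace mapping keys to whole objects."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # The whole object is swapped in with one assignment, so a reader sees
        # either the old object or the new one, never a partial write.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # "Folders" are only a naming convention: listing filters on a prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ObjectStore()
store.put("logs/2020/01/app.log", b"first version")
store.put("logs/2020/01/app.log", b"second version")  # replaces the whole object
store.put("images/cat.png", b"\x89PNG")
log_keys = store.list_keys("logs/")
```

    There are no directories, appends, or in-place edits to coordinate, which is a large part of why this model scales out so easily.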

    And, object storage excels at data storage. For example, IBM Cloud Object Storage is designed for over 10 9’s of durability, has robust data protection and security features, flexible pricing, and is natively integrated with IBM Aspera on Cloud high-speed data transfer.

    The first use cases were obvious: Object storage was a more scalable file system that could be used to store unstructured data like music, images, and video, or to store backup files, database dumps, and log files. Its single namespace and data tiering options allow it to be used for data archiving, and its HTTP interface makes it convenient to serve static website content as part of cloud native applications.

    But, beyond that, it was just a data dump.

    Map reduce and the rise of Hadoop solutions

    At the same time object storage was replacing file system use cases, the map reduce programming model was emerging in data analytics. Apache Hadoop provided a software framework for processing big data workloads that traditional relational database management system (RDBMS) solutions could not effectively manage. Data scientists and analysts had to give up declarative data management and SQL queries, but gained the ability to work with exponentially larger data sets and the freedom to explore unstructured and semi-structured data.
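
    For readers who have not met the programming model, here is the classic word-count example written as plain Python functions. It is a single-process sketch of the map, shuffle, and reduce steps, not Hadoop code; in a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data lakes store data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

    Note that nothing here declares a schema or a query: the analyst writes the processing steps by hand, which is the trade-off against SQL that the paragraph above describes.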

    In the beginning, Hadoop was seen by some as a backwards step. It achieved a measure of scale and cost savings but gave up much of what made RDBMS systems so powerful and easy to work with. While not requiring schemas added flexibility, query latencies and overall performance suffered. But the Hadoop ecosystem has continued to expand and meet user needs. Spark massively increased performance, and Hive provided SQL querying.

    Hadoop is not suitable for everything. Transactional processing is still a better fit for RDBMS. Businesses must use the appropriate technology for their various OLAP and OLTP needs.

    HDFS became the de facto data lake for many enterprises

    Like object storage, Hadoop was also designed as a scale-out architecture on cheap commodity hardware. The Hadoop Distributed File System (HDFS) began with the premise that compute should be moved to the data. It was designed to place data on locally attached storage on the compute nodes themselves. Data was stored in a form that could be directly read by locally running tasks, without a network hop.

    While this is beneficial for many types of workloads, it wasn’t the end of ETL envisioned by some. By placing readable copies of data directly on compute nodes, HDFS can’t take advantage of erasure coding to save costs. When data reliability is required, replication is just not as cost-effective. Furthermore, storage can’t be scaled independently of compute. As workloads diversified, this inflexibility caused management and cost issues. For many jobs, the network wasn’t actually the bottleneck. Compute bottlenecks are typically CPU or memory issues, and storage bottlenecks are typically hard drive spindle related, either disk throughput or seek constrained.
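
    The cost gap between replication and erasure coding is a matter of arithmetic. The sketch below compares the raw capacity needed to hold the same data under three full copies and under a hypothetical 10+4 erasure-coded layout (10 data shards plus 4 parity shards); the layout and the 100 TB figure are illustrative assumptions, not a description of any vendor's configuration.

```python
def raw_capacity_replication(logical_tb, copies=3):
    """Raw storage needed when every byte is stored `copies` times."""
    return logical_tb * copies


def raw_capacity_erasure(logical_tb, data_shards, parity_shards):
    """Raw storage needed when data is split into shards with added parity."""
    return logical_tb * (data_shards + parity_shards) / data_shards


logical_tb = 100
replicated_tb = raw_capacity_replication(logical_tb, copies=3)
erasure_tb = raw_capacity_erasure(logical_tb, data_shards=10, parity_shards=4)
```

    Three copies survive the loss of two of them but need 300 TB of raw disk for 100 TB of data, while the 10+4 layout survives the loss of any four shards and needs 140 TB.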

    When you separate compute from storage, both sides benefit. Compute nodes are cheaper because they don’t need large amounts of storage, can be scaled up or down quickly without massive data migration costs, and jobs can even be run in isolation when necessary.

    For storage, you want to spread out your load onto as many spindles as possible, so using a smaller active data set on a large data pool is beneficial, and your dedicated storage budget can be reserved for smaller flash clusters when latency is really an issue.

    The scale and cost savings of Hadoop attracted many businesses to use it wherever possible, and many businesses ended up using HDFS as their primary data store. This has led to cost, manageability, and flexibility issues.

    The data lake of the future supports all compute workloads

    Object storage was always useful as a way of backing up your databases, and it could be used to offload data from HDFS to a lower-cost tier as well. In fact, this is the first thing enterprises typically do.

    But, a data lake is more than just a data dump.

    A data lake is a place to collect an organization’s data for future use. Yes, it needs to store and protect data in a highly scalable, secure, and cost-effective manner, and object storage has always provided this. But when data is stored in the data lake, it is often not known how it will be used and how it will be turned into value. Thus, it is essential that the data lake include good integration with a range of data processing, analytics, and AI tools.

    Typical tools will not only include big data tools such as Hadoop, Spark, and Hive, but also deep learning frameworks (such as TensorFlow) and analytics tools (such as Pandas). In addition, it is essential for a data lake to support tools for cataloging, messaging, and transforming the data to support exploration and repurposing of data assets.

    Object storage can store data in formats native to big data and analytics tools. Your Hadoop and Spark jobs can directly access object storage through the S3a or Stocator connectors using the IBM Analytics Engine. IBM uses these techniques against IBM Cloud Object Storage for operational and analytic needs.

    Just like with Hadoop, you can also leverage object storage to directly perform SQL queries. The IBM Cloud SQL Query service uses Spark internally to perform ad-hoc and OLAP queries against data stored directly in IBM Cloud Object Storage buckets.

    TensorFlow can also be used to train and deploy ML models directly using data in object storage.

    This is the true future of object storage

    As organizations look to modernize their information architecture, they save money when they build their data lake with object storage.

    For existing HDFS shops, this can be done incrementally, but you need to make sure to keep your data carefully organized as you do so to take full advantage of the rich capabilities available now and in the future.

    Author: Wesly Leggette & Michael Factor

    Source: IBM

  • Keeping the data of your organization safe by storing it in the cloud

    We now live within the digital domain, and accessing vital information is more important than ever. Up until rather recently, most businesses tended to employ on-site data storage methods such as network servers, SSD hard drives, and direct-attached storage (DAS). However, cloud storage systems have now become commonplace.

    Perhaps the most well-known benefit of cloud storage solutions is that their virtual architecture ensures that all information will remain accessible in the event of an on-site system failure. However, we tend to overlook the security advantages of cloud storage compared with traditional strategies. Let us examine some key takeaway points.

    Technical Experts at Your Disposal

    A recent survey found that 73% of all organizations felt that they were unprepared in the event of a cyberattack. As this article points out, a staggering 40% suspected that their systems had been breached. It is therefore clear that legacy in-house approaches are failing to provide adequate security solutions.

    One of the main advantages of cloud-based data storage is that these services can provide targeted and customized data security solutions. Furthermore, a team of professionals is always standing by if a fault is suspected. This enables the storage platform to quickly diagnose and rectify the problem before massive amounts of data are lost or otherwise compromised. 

    Restricted Digital Access

    We also need to remember that one of the most profound threats to in-house data storage involves its physical nature. In other words, it is sometimes possible for unauthorized users (employees or even third parties) to gain access to sensitive information. Not only may this result in data theft, but the devices themselves could be purposely sabotaged, resulting in a massive data loss.

    The same cannot be said of cloud storage solutions. The information itself could very well be stored on a server located thousands of miles away from the business in question. This makes an intentional breach much less likely. Other security measures such as biometric access devices, gated entry systems, and CCTV cameras will also help deter any would-be thieves. 

    Fewer (if Any) Vulnerabilities

    The number of cloud-managed services is on the rise, and for good reason. These platforms allow businesses to optimize many factors such as CRM, sales, marketing campaigns, and e-commerce concerns. In the same respect, these bundles offer a much more refined approach to security. 

    This often comes with the ability to thwart what would otherwise remain in-house vulnerabilities. Some ways in which cloud servers can offer more robust storage solutions include:

    • 256-bit AES encryption
    • Highly advanced firewalls
    • Automatic threat detection systems
    • Multi-factor authentication

    In-house services may not be equipped with such protocols. As a result, they can be more vulnerable to threats such as phishing, compromised passwords, and distributed denial-of-service (DDoS) attacks.
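
    As one concrete illustration of the measures listed above, multi-factor authentication is often implemented with time-based one-time passwords. The sketch below follows the published HOTP and TOTP algorithms (RFC 4226 and RFC 6238) using only the Python standard library; it is a teaching example, not a statement of how any particular cloud provider implements its login flow, and the secret shown is the RFC's public test key.

```python
import hashlib
import hmac
import struct


def hotp(secret, counter, digits=6):
    """Counter-based one-time password (RFC 4226)."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret, unix_time, step=30, digits=6):
    """Time-based one-time password (RFC 6238): HOTP over a time counter."""
    return hotp(secret, int(unix_time) // step, digits)


# Public test key from the RFCs; a real secret would be random and private.
secret = b"12345678901234567890"
code = totp(secret, unix_time=59)
```

    The server and the user's device share the secret and the clock, so both can compute the same short-lived code without sending the secret over the network.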

    The Notion of Data Redundancy

    The “Achilles’ heel” of on-site data storage has always stemmed from its physical nature. This is even more relevant when referring to unexpected natural disasters. Should a business endure a catastrophic situation, sensitive data could very well be lost permanently. This is once again when cloud storage solutions come into play.

    The virtual nature of these systems ensures that businesses can enjoy a much greater degree of redundancy. As opposed to having an IT team struggle for days or even weeks at a time to recover lost information, cloud servers provide instantaneous access to avoid potentially crippling periods of downtime. 

    Doing Away with Legacy Technology

    Another flaw that is often associated with in-house data storage solutions involves the use of legacy technology. Because the digital landscape is evolving at a frenetic pace, the chances are high that many of these systems are no longer relevant. What could have worked well yesterday may very well be obsolete tomorrow. Cloud solutions do not suffer from this drawback. Their architecture is updated regularly to guarantee that customers are always provided with the latest security protocols. Thus, their vital information will always remain behind closed (digital) doors.

    Brand Reputation

    A final and lesser-known benefit of cloud-based security is that clients are becoming more technically adept than in the past. They are aware of issues such as the growth of big data and GDPR compliance concerns. The reputation of businesses that continue to use outdated storage methods could therefore suffer as a result. Customers who are confident that their data is safe are much more likely to remain loyal over time. 

    Cloud Storage: Smart Solutions for Modern Times

    We can now see that there are several security advantages that cloud storage solutions have to offer. Although on-site methods may have been sufficient in the past, this is certainly no longer the case. Thankfully, there are many cloud providers associated with astounding levels of security. Any business that hopes to remain safe should therefore make this transition sooner rather than later. 

    Author: George Tuohy

    Source: Dataversity

  • The (near) future of data storage

    As data proliferates at an exponential rate, companies must do more than store it: they must approach Data Management expertly and look to new approaches. Companies that take new and creative approaches to data storage will be able to transform their operations and thrive in the digital economy.

    How should companies approach data storage in the years to come? As we look into our crystal ball, here are important trends in 2020. Companies that want to make the most of data storage should be on top of these developments.

    A data-centric approach to data storage

    Companies today are generating oceans of data, and not all of that data is equally important to their function. Organizations that know this, and know which pieces of data are more critical to their success than others, will be in a position to better manage their storage and better leverage their data.

    Think about it. As organizations deal with a data deluge, they are trying hard to maximize their storage pools. As a result, they can inadvertently end up putting critical data on less critical servers. Doing so is a problem because it typically takes longer to access data on slower, secondary machines. It’s this lack of speed and agility that can have a detrimental impact on businesses’ ability to leverage their data.

    Traditionally organizations have taken a server-based approach to their data backup and recovery deployments. Their priority is to back up their most critical machines rather than focusing on their most business-critical data.

    So, rather than having backup and recovery policies based on the criticality of each server, we will start to see organizations match their most critical servers with their most important data. In essence, the actual content of the data will become more of a decision-driver from a backup point of view.

    The most successful companies in the digital economy will be those that implement storage policies based not on their server hierarchy but on the value of their data.

    The democratization of flash storage

    With the continuing rise of technologies like IoT, artificial intelligence, and 5G, there will be an ever-greater need for high-performance storage. This will lead to the broader acceptance of all-flash storage. The problem, of course, is that flash storage is like a high-performance car: cool and sexy, but the price is out of reach for most.

    And yet traditional disk storage simply isn’t up to the task. Disk drives are like your family’s old minivan: reliable but boring and slow, unable to turn on a dime. But we’re increasingly operating in a highly digital world where data has to be available the instant it’s needed, not the day after. In this world, every company (not just the biggest and wealthiest ones) needs high-performance storage to run their business effectively.

    As the cost of flash storage drops, more storage vendors are bringing all-flash arrays to the mid-market, and more organizations will be able to afford this high-performance solution. This price democratization will ultimately enable every business to benefit from the technology.

    The repatriation of cloud data

    Many companies realize that moving to the cloud is not as cost-effective, secure, or scalable as they initially thought. They’re now looking to return at least some of their core data and applications to their on-premises data centers.

    The truth is that data volumes in the cloud have become unwieldy. And organizations are discovering that storing data in the cloud is not only more expensive than they thought, but that it’s also hard to access that data expeditiously due to the cloud’s inherent latency.

    As a result, it can be more beneficial in terms of cost, security, and performance to move at least some company data back on-premises.

    Now that they realize the cloud is not a panacea, organizations are embracing the notion of cloud data repatriation. They’re increasingly deploying a hybrid infrastructure in which some data and applications remain in the cloud, while more critical data and applications come back home to an on-premises storage infrastructure.

    Immutable storage for businesses of all sizes

    Ransomware will continue to be a scourge to all companies. Because hackers have realized that data stored on network-attached storage devices is extremely valuable, their attacks will become more sophisticated and targeted. This is a serious problem because backup data is typically the last line of defense. Hackers are also attacking unstructured data. The reason is that if the primary and secondary (backup) data is encrypted, businesses will have to pay the ransom if they want their data back. This increases the likelihood that an organization, without a specific and immutable recovery plan in place, will pay a ransom to regain control over its data.

    It is not a question of if, but when, an organization will need to recover from a ‘successful’ ransomware attack. Therefore, it’s more important than ever to protect this data with immutable object storage and continuous data protection. Organizations should look for a storage solution that protects information continuously by taking snapshots as frequently as possible (e.g., every 90 seconds). That way, even when data is overwritten, the older objects remain as part of the snapshot, so there will always be another, immutable copy of the original objects that constitute the company’s data, and it can be recovered instantly even if it’s hundreds of terabytes.
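
    As a rough illustration of why frequent immutable snapshots help, the sketch below models a toy in-memory object store. The `SnapshotStore` class and its method names are hypothetical and do not correspond to any vendor's product; a real system would take snapshots on a timer and keep them on separate, write-protected media.

```python
from types import MappingProxyType


class SnapshotStore:
    """Toy object store that keeps read-only point-in-time snapshots (hypothetical design)."""

    def __init__(self):
        self._live = {}        # current, writable view of the objects
        self._snapshots = []   # read-only copies, oldest first

    def put(self, key, data: bytes):
        # Overwrites only touch the live view, never an existing snapshot.
        self._live[key] = data

    def snapshot(self):
        # Freeze a copy of the live view; MappingProxyType rejects writes.
        frozen = MappingProxyType(dict(self._live))
        self._snapshots.append(frozen)
        return len(self._snapshots) - 1

    def restore(self, snapshot_id):
        # Recovery replaces the live view with a copy of the chosen snapshot.
        self._live = dict(self._snapshots[snapshot_id])

    def get(self, key):
        return self._live[key]


store = SnapshotStore()
store.put("report.docx", b"original contents")
clean = store.snapshot()                               # e.g. triggered every 90 seconds
store.put("report.docx", b"ENCRYPTED-BY-RANSOMWARE")   # attack overwrites the live data
store.restore(clean)
recovered = store.get("report.docx")
```

    After the restore, `recovered` holds the original bytes again, because the overwrite never reached the snapshot.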

    Green storage

    Global data centers consume massive amounts of energy, which contributes to global warming. Data centers now consume around 3% of the world’s electricity supply and are responsible for approximately 2% of global greenhouse gas emissions. These numbers put the carbon footprint of data centers on par with the entire airline industry.

    Many companies are seeking to reduce their carbon footprint and be good corporate citizens. As part of this effort, they are increasingly looking for more environmentally-friendly storage solutions, those that can deliver the highest levels of performance and capacity at the lowest possible power consumption.

    In 2020, organizations of all sizes will work hard to get the most from the data they create and store. By leveraging these five trends and adopting a modern approach to data storage, organizations can more effectively transform their business and thrive in the digital economy.

    The ‘Prevention Era’ will be overtaken by the ‘Recovery Era’

    Organizations will have to look to more efficient and different ways to protect unstructured and structured data. An essential element to being prepared in the ‘recovery era’ will involve moving unstructured data to immutable object storage with remote replication, which will eliminate the need for traditional backup. The nightly backup will become a thing of the past, replaced by snapshots every 90 seconds. This approach will free up crucial primary storage budget, VMware/Hyper-V storage, and CPU/memory for critical servers.

    While data protection remains crucial, in the data recovery era, the sooner organizations adopt a restore and recover mentality, the better they will be able to benefit from successful business continuity strategies in 2020 and beyond.

    Author: Sean Derrington

    Source: Dataversity

  • The 4 steps of the big data life cycle

    The 4 steps of the big data life cycle

    Simply put, from the perspective of the big data life cycle, there are no more than four stages:

    1. Big data collection
    2. Big data preprocessing
    3. Big data storage
    4. Big data analysis

    Together, these four stages constitute the core technology in the big data life cycle.

    Big data collection

    Big data collection is the collection of structured and unstructured massive data from various sources.

    Database collection: Sqoop and ETL tools are popular, and traditional relational databases such as MySQL and Oracle still serve as the data stores of many enterprises. The open source tools Kettle and Talend also include big data integration features, which can synchronize and integrate data between HDFS, HBase and mainstream NoSQL databases.

    Network data collection: A data collection method that uses web crawlers or public website APIs to obtain unstructured or semi-structured data from web pages and consolidate it into local data.

    File collection: This includes real-time file collection and processing technology such as Flume, ELK-based log collection, incremental collection, and so on.

    Big data preprocessing

    Big data preprocessing refers to a series of operations on the collected raw data before analysis, such as cleaning, filling, smoothing, merging, normalization, and consistency checks, in order to improve data quality and lay the foundation for later analysis work. Data preprocessing mainly includes four parts:

    1. Data cleaning
    2. Data integration
    3. Data conversion
    4. Data specification

    Data cleaning refers to the use of cleaning tools such as ETL to deal with missing data (missing attributes of interest), noisy data (errors in the data, or data that deviates from expected values), and inconsistent data.
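
    The three kinds of problems named above can be sketched in a few lines of Python. The records, the field names, and the plausible-value range are all made up for illustration, and filling with the median is only one of several possible strategies.

```python
from statistics import median

# Hypothetical raw sensor readings: None marks a missing value, 999.0 is noise.
raw = [
    {"sensor": "A", "temp_c": 21.5},
    {"sensor": "a ", "temp_c": None},
    {"sensor": "A", "temp_c": 999.0},
    {"sensor": "A", "temp_c": 22.1},
]

VALID_RANGE = (-50.0, 60.0)  # assumed plausible range for this field


def clean(records):
    # Inconsistent data: normalize the identifier's case and whitespace.
    rows = [{**r, "sensor": r["sensor"].strip().upper()} for r in records]

    def is_valid(value):
        return value is not None and VALID_RANGE[0] <= value <= VALID_RANGE[1]

    # Missing and noisy data: replace both with the median of the valid readings.
    fill = median(r["temp_c"] for r in rows if is_valid(r["temp_c"]))
    return [r if is_valid(r["temp_c"]) else {**r, "temp_c": fill} for r in rows]


cleaned = clean(raw)
```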

    Data integration refers to the consolidation and storage of data from different data sources in a unified database. It focuses on solving three problems: schema (pattern) matching, data redundancy, and the detection and handling of data value conflicts.
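
    A minimal sketch of those three concerns follows, assuming two hypothetical sources (`crm` and `billing`) that describe the same customers under different field names. The field mapping and the policy of keeping the first value and flagging the conflict are illustrative choices, not a prescribed method.

```python
# Two hypothetical sources describing the same customers with different field names.
crm = [
    {"cust_id": 1, "email": "ann@example.com"},
    {"cust_id": 2, "email": "bob@example.com"},
]
billing = [
    {"customer": 1, "mail": "ann@example.com"},
    {"customer": 2, "mail": "robert@example.com"},
]

# Matching: map each source's field names onto one unified schema.
FIELD_MAP = {"cust_id": "id", "customer": "id", "email": "email", "mail": "email"}


def to_unified(record):
    return {FIELD_MAP[key]: value for key, value in record.items()}


def integrate(*sources):
    merged, conflicts = {}, []
    for source in sources:
        for record in map(to_unified, source):
            existing = merged.get(record["id"])
            if existing is None:
                merged[record["id"]] = record          # first sighting of this id
            elif existing != record:
                conflicts.append((existing, record))   # value conflict: flag for review
            # Identical records are redundant and are simply dropped.
    return merged, conflicts


merged, conflicts = integrate(crm, billing)
```

    Here customer 1 is stored once despite appearing in both sources, while the two different email addresses for customer 2 end up in `conflicts`.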

    Data conversion refers to the process of processing the inconsistencies in the extracted data. It also includes data cleaning, that is, cleaning abnormal data according to business rules to ensure the accuracy of subsequent analysis results.

    Data specification (often called data reduction) refers to minimizing the amount of data to obtain a smaller data set while preserving the original character of the data as far as possible. It includes data cube aggregation, dimension specification, data compression, numerical specification, concept layering, etc.
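
    Aggregation is probably the easiest of these techniques to picture. The sketch below rolls hypothetical daily sales rows up to one row per month and region, which shrinks the data set while keeping the totals an analysis would need.

```python
from collections import defaultdict

# Hypothetical daily sales rows: (date, region, amount).
daily = [
    ("2020-01-03", "EU", 120.0),
    ("2020-01-17", "EU", 80.0),
    ("2020-01-09", "US", 200.0),
    ("2020-02-11", "EU", 50.0),
]


def aggregate_by_month(rows):
    totals = defaultdict(float)
    for day, region, amount in rows:
        month = day[:7]                    # "YYYY-MM": drop the day-level detail
        totals[(month, region)] += amount
    return dict(totals)


monthly = aggregate_by_month(daily)
```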

    Big data storage

    Big data storage refers to the process of using storage devices to persist the collected data in the form of a database, along three typical routes:

    New database cluster based on MPP architecture: This route uses a Shared Nothing architecture combined with the efficient distributed computing model of MPP, and applies big data processing techniques such as column storage and coarse-grained indexing. It is a data storage approach developed for industry big data. With its low cost, high performance and high scalability, it is widely used in enterprise analysis applications.

    Compared with traditional databases, MPP products have significant advantages in PB-level data analysis. Naturally, the MPP database has also become the best choice for a new generation of enterprise data warehouses.

    Technology expansion and packaging based on Hadoop: This route targets data and scenarios that are difficult to process with traditional relational databases, such as the storage and computation of unstructured data. It derives the relevant big data technology by building on Hadoop’s open source advantages and related strengths: handling unstructured and semi-structured data, complex ETL processes, and complex data mining and calculation models.

    With the advancement of technology, its application scenarios will gradually expand. The most typical application scenario at present is to support Internet big data storage and analysis by expanding and encapsulating Hadoop, involving dozens of NoSQL technologies.
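
    The shared-nothing idea can be illustrated with a single-process simulation: rows are hash-partitioned across nodes, each node keeps its partition column by column and aggregates it independently, and a coordinator merges the partial results. The node count and the data are assumptions, and a real MPP database would run each partition on a separate machine.

```python
from collections import defaultdict
from zlib import crc32

NUM_NODES = 3  # assumed cluster size


def node_for(key: str) -> int:
    # Stable hash so the same key always lands on the same node.
    return crc32(key.encode()) % NUM_NODES


def distribute(rows):
    # Each node holds only its own partition, stored column by column.
    nodes = [{"region": [], "amount": []} for _ in range(NUM_NODES)]
    for region, amount in rows:
        partition = nodes[node_for(region)]
        partition["region"].append(region)
        partition["amount"].append(amount)
    return nodes


def local_sum(partition):
    # Runs independently on one node and reads only the columns it needs.
    totals = defaultdict(float)
    for region, amount in zip(partition["region"], partition["amount"]):
        totals[region] += amount
    return totals


def total_by_region(nodes):
    # The coordinator merges the per-node partial results.
    result = defaultdict(float)
    for partial in map(local_sum, nodes):
        for region, subtotal in partial.items():
            result[region] += subtotal
    return dict(result)


rows = [("EU", 10.0), ("US", 5.0), ("EU", 2.5), ("APAC", 7.0)]
nodes = distribute(rows)
totals = total_by_region(nodes)
```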

    Big data all-in-one: This is a combination of software and hardware designed for the analysis and processing of big data. It consists of a set of integrated servers, storage devices, operating systems, database management systems, and pre-installed and optimized software for data query, processing, and analysis. It has good stability and vertical scalability.

    Big data analysis and mining

    Big data analysis and mining is the process of extracting, refining and analyzing chaotic data through visual analysis, data mining algorithms, predictive analysis, semantic engines, data quality management, and so on.

    Visual analysis: Visual analysis refers to an analysis method that conveys and communicates information clearly and effectively with the aid of graphical means. It is mainly used in association analysis of massive data: with the help of a visual data analysis platform, association analysis is performed on dispersed, heterogeneous data and presented as a complete analysis chart. It is simple, clear, intuitive and easy to accept.

    Data mining algorithm: Data mining algorithms are data analysis methods that test and calculate data by creating data mining models. They are the theoretical core of big data analysis.

    There are various data mining algorithms, and different algorithms show different data characteristics due to different data types and formats. Generally speaking, though, the process of creating a model is similar: first analyze the data provided by the user, then search for specific types of patterns and trends, use the analysis results to define the best parameters for the mining model, and finally apply these parameters to the entire data set to extract feasible patterns and detailed statistics.
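
    That general process can be shown with a deliberately tiny model: a single cut-off value learned from a labelled sample and then applied to new data. The sample, the fraud scenario, and the one-parameter model are assumptions chosen for brevity; real mining models have many more parameters.

```python
# Hypothetical labelled sample: (transaction amount, was_fraud).
sample = [(12, False), (30, False), (45, False), (220, True), (310, True), (95, False)]


def fit_threshold(data):
    """Pick the cut-off that misclassifies the fewest sample rows."""
    candidates = sorted({amount for amount, _ in data})

    def errors(threshold):
        # A row is predicted as fraud when its amount exceeds the threshold.
        return sum((amount > threshold) != label for amount, label in data)

    return min(candidates, key=errors)


def apply_model(threshold, amounts):
    # Apply the learned parameter to the rest of the data set.
    return [amount > threshold for amount in amounts]


threshold = fit_threshold(sample)
flags = apply_model(threshold, [20, 150, 500])
```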

    Data quality management refers to a series of management activities that identify, measure, monitor, and give early warning of the data quality problems that may arise in each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application, extinction, etc.) in order to improve data quality.
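
    Measurement and early warning can be as simple as tracking one metric against an agreed minimum. The sketch below uses completeness (the share of non-empty values) as that metric; the records and the thresholds are hypothetical.

```python
def completeness(records, field):
    """Share of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def quality_alerts(records, thresholds):
    alerts = {}
    for field, minimum in thresholds.items():
        score = completeness(records, field)
        if score < minimum:
            alerts[field] = score   # early warning: below the agreed minimum
    return alerts


records = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": None},
    {"id": 4, "email": "dan@example.com"},
]
alerts = quality_alerts(records, {"id": 1.0, "email": 0.9})
```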

    Predictive analysis: Predictive analysis is one of the most important application areas of big data analysis. It combines a variety of advanced analysis functions (special statistical analysis, predictive modeling, data mining, text analysis, entity analysis, optimization, real-time scoring, machine learning, etc.), to achieve the purpose of predicting uncertain events.

    It helps users analyze trends, patterns, and relationships in structured and unstructured data, and use these indicators to predict future events and provide a basis for taking measures.

    Semantic Engine: Semantic engine refers to the operation of adding semantics to existing data to improve users’ Internet search experience.

    Author: Sajjad Hussain

    Source: Medium


  • The differences between data lakes and data warehouses: a brief explanation

    The differences between data lakes and data warehouses: a brief explanation

    When comparing data lake vs. data warehouse, it's important to know that these two things actually serve quite different roles. They manage data differently and serve their own types of functions.

    The market for data warehouses is booming. One study forecasts that the market will be worth $23.8 billion by 2030. Demand is growing at an annual pace of 29%.

    While there is a lot of discussion about the merits of data warehouses, not enough discussion centers around data lakes. 

    Both data warehouses and data lakes are used for storing big data, but they are not the same. A data warehouse is a storage area for filtered, structured data that has already been processed for a particular use, while a data lake is a massive pool of raw data whose purpose is not yet defined.

    Many people confuse the two, but the only similarity between them is the high-level principle of storing data. It is vital to know the difference, as they serve different purposes and need different sets of eyes to be adequately optimized. A data lake may suit one company, while a data warehouse is a better fit for another.

    This blog outlines the differences between the data warehouse and the data lake. Their notable differences are listed below.

    Data Lake

    • Data Type: Structured and unstructured data from different sources
    • Purpose: Cost-efficient big data storage
    • Users: Engineers and scientists
    • Tasks: Storing data as well as big data analytics, such as real-time analytics and deep learning
    • Size: Stores all data that might be utilized

    Data Warehouse

    • Data Type: Historical data that has been structured to fit a relational database schema
    • Purpose: Business decision analytics
    • Users: Business analysts and data analysts
    • Tasks: Read-only queries for summarizing and aggregating data
    • Size: Stores only data pertinent to the analysis

    Data Type

    Data cleaning is a vital data skill, as data arrives in imperfect and messy forms. Raw data that has not been cleaned is known as unstructured data; this includes chat logs, pictures, and PDF files. Data that has been cleaned to fit a schema, sorted into tables, and defined by relationships and types is known as structured data. This is a vital difference between data warehouses and data lakes.

    Data warehouses contain historical information that has been cleaned to fit a relational schema. Data lakes, on the other hand, store data from an extensive array of sources such as real-time social media streams, Internet of Things devices, web app transactions, and user data. This data is sometimes structured, but most of the time it is messy, because it is ingested straight from the data source.


    Purpose

    A data lake is used for cost-efficient storage of large amounts of data from various sources. Accepting data of any structure reduces cost, because the storage is flexible and scalable and the data does not have to fit a particular schema. On the other hand, structured data is easier to analyze because it is cleaner and has a uniform schema to query against. By restricting data to a schema, a data warehouse is very useful for analyzing historical data to support specific business decisions.

    You might see that the two complement each other in the data workflow. Ingested data is stored right away in the data lake. Once a particular business question arises, the part of the data considered relevant is taken out of the lake, cleaned, and exported.
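
    The workflow just described can be sketched with the standard library alone. Here a list of JSON lines stands in for the lake and an in-memory SQLite database stands in for the warehouse; the event fields, the table name, and the business question are all hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events as they might sit in a data lake: one JSON document
# per line, with no enforced schema.
lake = [
    '{"user": "ann", "event": "purchase", "amount": "19.99"}',
    '{"user": "bob", "event": "page_view"}',
    '{"user": "ann", "event": "purchase", "amount": "5.01", "coupon": "X1"}',
]

warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
warehouse.execute("CREATE TABLE purchases (user TEXT NOT NULL, amount REAL NOT NULL)")

# A business question arrives ("how much did each user spend?"), so only the
# relevant slice is extracted, cleaned to fit the schema, and loaded.
for line in lake:
    doc = json.loads(line)
    if doc.get("event") == "purchase":
        warehouse.execute(
            "INSERT INTO purchases VALUES (?, ?)",
            (doc["user"], float(doc["amount"])),
        )

spend = warehouse.execute(
    "SELECT user, ROUND(SUM(amount), 2) FROM purchases GROUP BY user"
).fetchall()
```

    The page view and the extra `coupon` field never reach the warehouse, which is the point: the lake keeps everything, and the warehouse keeps only what fits the schema for the question at hand.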


    Users

    Each one has different applications, but both are very valuable for different users. Business analysts and data analysts often work in a data warehouse, which contains explicitly relevant data that has already been processed for the job. A data warehouse requires less knowledge of data science and programming to use.

    Engineers set up and maintain data lakes and integrate them into the data pipeline. Data scientists also work closely with data lakes because they contain data of a broader and more current scope.


    Tasks

    Engineers use data lakes to store incoming data. However, data lakes are not restricted to storage. Storing data without a fixed structure is scalable and flexible, which suits big data analytics. Big data analytics can run on data lakes using Apache Spark as well as Hadoop. This is especially true for deep learning, which needs to scale with a growing amount of training data.

    Data warehouses are usually set to read-only for users, especially analysts, who are first and foremost reading and aggregating data for insights. Because the data is already clean and archival, there is usually no need to update or insert data.
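
    A small sketch of that read-only access pattern uses SQLite as a stand-in for a warehouse. The table and data are made up; the `mode=ro` URI parameter is SQLite's own way of opening a database file read-only, and other warehouses enforce the same idea through user permissions instead. The file path is assumed to be a POSIX-style path.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "warehouse.db")

# The load job is the only writer.
loader = sqlite3.connect(path)
loader.execute("CREATE TABLE sales (region TEXT, amount REAL)")
loader.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 10.0), ("EU", 5.0), ("US", 7.0)],
)
loader.commit()
loader.close()

# Analysts connect in read-only mode and run summarizing queries.
analyst = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
summary = analyst.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Any attempt to modify the data is rejected by the database itself.
try:
    analyst.execute("INSERT INTO sales VALUES ('EU', 1.0)")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True
```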


    Size

    When it comes to size, a data lake is much bigger than a data warehouse, because it retains all data that may be pertinent to a business or organization. Data lakes are frequently petabytes in size (one petabyte is 1,000 terabytes). The data warehouse, on the other hand, is more selective about what data is stored.

    Understand the Significance of Data Warehouses and Data Lakes

    If you are deciding between a data warehouse and a data lake, review the categories above to determine which one will meet your needs and fit your case. If you are interested in a deeper dive into the differences, or in learning how to build data warehouses, you can take some of the lessons offered online.

    Keep in mind that sometimes you will want a combination of these two storage solutions, especially when developing data pipelines.

    Author: Liraz Postan

    Source: Smart Data Collective

  • What is dark data? And how to deal with it

    What is dark data? And how to deal with it

    It’s easier than ever to collect data without a specific purpose, under the assumption that it may be useful later. Often, though, that data ends up unused and even forgotten because of several simple factors: The fact that the data is being collected isn’t effectively communicated to potential users within an organization. The repositories that hold the data aren’t widely known. Or perhaps there simply isn’t enough analysis capacity within the company to process it. This data that is collected but not used is often termed 'dark data'. 

    Dark data presents an organization with tremendous opportunities, as well as liabilities. If it is harnessed effectively, it can be used to produce insights that wouldn’t otherwise be available. With that in mind, it’s important to make this dark data accessible so it can power those innovative use cases.

    On the other hand, lack of visibility into all the data being collected within an organization can make it difficult to accurately manage costs, and easy to accidentally run afoul of retention policies. It can also hamper efforts to ensure compliance with regulations like the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).

    So what can be done to maximize the benefits of dark data and avoid these problems?

    Some best practices

    When dealing with dark data, the foremost best practice is to shine a spotlight on it by communicating to potential users within the organization what data is being collected.

    Secondly, organizations need to evaluate whether and for how long it makes sense to retain the data. This is crucial to avoid incurring potentially substantial costs collecting and storing data that isn’t being used and won’t be used in the future, and even more importantly to ensure that the data is being handled and secured properly.

    Perhaps the biggest challenge when working with dark data is simply getting access to it, as it’s often stored in siloed repositories close to where the data is being collected. Additionally, it may be stored in systems and formats that are difficult to query or have limited analytics capabilities.

    So the next step is to ensure that the data that is collected can actually be used effectively. The two main approaches are: (1) investing in tooling that can query the data where it is currently stored, and (2) moving the data into centralized data storage platforms. 

    I recommend combining these two approaches. First, adopt tools that provide the ability to discover, analyze, and visualize data from multiple platforms and locations via a single interface, which will increase data visibility and reduce the tendency to store the same data multiple times. Second, leverage storage platforms that can efficiently aggregate and store data that would otherwise be inaccessible, in order to reduce the number of data stores that must be tracked and managed.
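
    As a starting point for that kind of visibility, even a simple inventory of data sets and their last access dates makes dark data easy to spot. The sketch below assumes a hypothetical catalog and an arbitrary 90-day idle threshold; in practice the entries would come from whatever discovery tooling the organization adopts.

```python
from datetime import date, timedelta

# Hypothetical inventory entries; last_read is None if the data was never queried.
catalog = [
    {"name": "clickstream_raw", "last_read": date(2020, 1, 10)},
    {"name": "sensor_dump_2018", "last_read": None},
    {"name": "billing_events", "last_read": date(2020, 5, 30)},
]


def find_dark_data(entries, today, idle_after=timedelta(days=90)):
    """Return the names of data sets nobody has read within `idle_after`."""
    return [
        entry["name"]
        for entry in entries
        if entry["last_read"] is None or today - entry["last_read"] > idle_after
    ]


dark = find_dark_data(catalog, today=date(2020, 6, 1))
```

    The resulting list is a shortlist for review: each data set on it should either be publicized to potential users or evaluated against the retention policy.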

    Considering the potential power and pitfalls that come with having dark data in your organization, it’s definitely worth the effort to bring it out of the dark.

    Author: Dan Cech

    Source: Insidebigdata

  • Why data lakes are the future of data storage

    Why data lakes are the future of data storage

    The term big data has been around since 2005, but what does it actually mean? Exactly how big is big? We are creating data every second. It’s generated across all industries and by a myriad of devices, from computers to industrial sensors to weather balloons and countless other sources. According to a recent study conducted by Data Never Sleeps, there are a quintillion bytes of data generated each minute, and the forecast is that our data will only keep growing at an unprecedented rate.

    We have also come to realize just how important data really is. Some liken its value to something as precious to our existence as water or oil, although those aren’t really valid comparisons. Water supplies can fall and petroleum stores can be depleted, but data isn’t going anywhere. It only continues to grow. Not just in volume, but also in variety and velocity. Thankfully, over the past decade, data storage has become cheaper, faster and more easily available, and as a result, where to store all this information isn’t the biggest concern anymore. Industries that work in the IoT and faster payments space are now starting to push data through at a very high speed and that data is constantly changing shape.

    In essence, all this gives rise to a 'data demon'. Our data has become so complex that normal techniques for harnessing it often fail, keeping us from realizing data’s full potential.

    Most organizations currently treat data as a cost center. Each time a data project is spun off, there is an 'expense' attached to it. It’s contradictory. On the one side, we’re proclaiming that data is our most valuable asset, but on the other side we perceive it as a liability. It’s time to change that perception, especially when it comes to banks. The volumes of data financial institutions have can be used to create tremendous value. Note that I’m not talking about 'selling the data', but leveraging it more effectively to provide crisp analytics that deliver knowledge and drive better business decisions.

    What’s stopping people from converting data from an expense to an asset, then? The technology and talent exist, but the thought process is lacking.

    Data warehouses have been around for a long time and traditionally were the only way to store large amounts of data that’s used for analytical and reporting purposes. However, a warehouse, as the name suggests, immediately makes one think of a rigid structure that’s limited. In a physical warehouse, you can store products in three dimensions: length, breadth and height. These dimensions, though, are limited by your warehouse’s architecture. If you want to add more products, you must go through a massive upgrade process. Technically, it’s doable, but not ideal. Similarly, data warehouses present a bit of rigidity when handling constantly changing data elements.

    Data lakes are a modern take on big data. When you think of a lake, you cannot define its shape and size, nor can you define what lives in it and how. Lakes just form, even if they are man-made. There is still an element of randomness to them and it’s this randomness that helps us in situations where the future is, well, sort of unpredictable. Lakes expand and contract, they change over periods of time, and they have an ecosystem that’s home to various types of animals and organisms. This lake can be a source of food (such as fish) or fresh water and can even be the locale for water-based adventures. Similarly, a data lake contains a vast body of data and is able to handle that data’s volume, velocity and variety.

    When the mammoth data organizations like Yahoo, Google, Facebook and LinkedIn started to realize that their data and data usage were drastically different and that it was almost impossible to use traditional methods to analyze it, they had to innovate. This in turn gave rise to technologies like document-based databases and big data engines like Hadoop, Spark, HPCC Systems and others. These technologies were designed to allow the flexibility one needs when handling unpredictable data inputs.

    Jeff Lewis is SVP of Payments at Sutton Bank, a small community bank that’s challenging the status quo for other banks in the payments space. 'Banks have to learn to move on from data warehouses to data lakes. The speed, accuracy and flexibility of information coming out of a data lake is crucial to the increased operational efficiency of employees and to provide a better regulatory oversight', said Lewis. 'Bankers are no longer old school and are ready to innovate with the FinTechs of the world. A data centric thought process and approach is crucial for success'.

    Data lakes are a natural choice to handle the complexity of such data, and the application of machine learning and AI is becoming more common as well. From using AI to clean and augment incoming data, to running complex algorithms to correlate different sources of information to detect complex fraud, there is an algorithm for just about everything. And now, with the help of distributed processing, these algorithms can be run on multiple clusters and the workload can be spread across nodes.

    One thing to remember is that you should be building a data lake and not a data swamp. It’s hard to control a swamp. You cannot drink from it, nor can you navigate it easily. So, when you look at creating a data lake, think about what the ecosystem looks like and who your consumers are. Then, embark on a journey to build a lake on your own.

    Source: Insidebigdata
