5 items tagged "data storage"

  • Database possibilities in an era of big data


    We live in an era of big data. The sheer volume of data already in existence is staggering, even before you grapple with the amount of new information generated every day. Think about it: financial transactions, social media posts, web traffic, IoT sensor data, and much more, being ceaselessly pulled into databases the world over. Outdated technology simply can’t keep up.

    The modern types of databases that have arisen to tackle the challenges of big data take a variety of forms, each suited for different kinds of data and tasks. Whatever your company does, choosing the right database to build your product or service on top of is a vital decision. In this article, we’ll dig into the different types of database options you could be considering for your unique challenges, as well as the underlying database technologies you should be familiar with. We’ll be focusing on relational database management systems (RDBMS), NoSQL DBMS, columnar stores, and cloud solutions.


    First up, the reliable relational database management system. This widespread variety is renowned for its focus on the core database attributes of atomicity (keeping tasks indivisible and irreducible), consistency (actions taken by the database obey certain constraints), isolation (a transaction’s intermediate state is invisible to other transactions), and durability (data changes reliably persist). Data in an RDBMS is stored in tables, and an RDBMS can tackle huge volumes of data and complex queries, unlike flat files, which tend to take up more space and are less efficient to query. An RDBMS is usually made up of a collection of tables, each with columns (fields) and records (rows). Popular examples of RDBMSs include Microsoft SQL Server, Oracle, MySQL, and PostgreSQL.
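    The ACID behavior described above can be seen in miniature with Python’s built-in sqlite3 module. This is a minimal sketch (the table and values are invented for illustration): the transfer inside the transaction either fully applies or fully rolls back.

```python
import sqlite3

# In-memory relational database: a table with typed columns (fields) and rows (records).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 'alice', 100.0), (2, 'bob', 50.0)")
conn.commit()

# Atomicity: the transfer either fully applies or fully rolls back.
try:
    with conn:  # the with-block is one transaction
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on any error, neither UPDATE persists

balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(balances)  # alice: 70.0, bob: 80.0
```

    If either UPDATE raised an error, the `with conn:` block would roll both back, leaving the balances untouched.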

    Some of the strengths of an RDBMS include flexibility and scalability. Given the huge amounts of information that modern businesses need to handle, these are important factors to consider when surveying different types of databases. Ease of management is another strength since each of the constituent tables can be changed without impacting the others. Additionally, administrators can choose to share different tables with certain users and not others (ideal if working with confidential information you might not want shared with all users). It’s easy to update data and expand your database, and since each piece of data is stored at a single point, it’s easy to keep your system free from errors as well.

    No system is perfect, however. An RDBMS traditionally runs on a single server, so once you hit the limits of the machine you’ve got, you need to buy a bigger one. Rapidly changing data can also challenge these systems, as increased volume, variety, velocity, and complexity create complicated relationships that the RDBMS can have trouble keeping up with. Lastly, despite having 'relational' in the name, relational database management systems don’t store the relationships between elements, meaning the system doesn’t actually understand the connections between your data; those connections exist only in the joins you write.


    NoSQL (originally 'non-relational' or 'not SQL') DBMSs emerged as web applications were becoming more complex. These types of databases are designed to handle heterogeneous data that’s difficult to fit into a normalized schema. While they can take a wide array of forms, the most important difference between NoSQL and RDBMS is that while relational databases rigidly define how all the data they contain must be arranged, NoSQL databases can be schema-agnostic. This means that if you’ve got unstructured or semi-structured data, you can store and manipulate it easily, whereas an RDBMS might not be able to handle it at all.
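    As an illustrative sketch (not the API of any real NoSQL product), a schema-agnostic 'collection' can be modeled as a list of free-form dictionaries: heterogeneous records coexist without a shared schema, and queries filter on whatever fields happen to exist.

```python
# Illustrative document-store sketch: each record is a free-form dictionary
# rather than a row in a rigidly defined table.
documents = []

def insert(doc):
    documents.append(dict(doc))

# Records with completely different shapes coexist without a shared schema:
insert({"type": "tweet", "user": "ana", "text": "hello", "likes": 3})
insert({"type": "sensor", "device_id": 42, "readings": [21.5, 21.7, 22.0]})
insert({"type": "profile", "user": "ana", "links": {"site": "https://example.com"}})

# Queries filter on whatever fields happen to be present:
def find(predicate):
    return [d for d in documents if predicate(d)]

ana_docs = find(lambda d: d.get("user") == "ana")
print(len(ana_docs))  # 2
```

    In an RDBMS, the sensor record and the tweet record would each need their own table and schema; here they simply land in the same collection.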

    Considering this, it’s no wonder that NoSQL databases are seeing a lot of use in big data and real-time web apps. Examples of these database technologies include MongoDB, Riak, Amazon S3, Cassandra, and HBase. One drawback of NoSQL databases is that many offer only 'eventual consistency', meaning that all nodes will eventually converge on the same data. Since there’s a lag while the nodes update, however, it’s possible to read out-of-date data depending on which node you end up querying during the update window. Data consistency is a challenge with NoSQL systems because many of them do not perform ACID transactions.
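    Eventual consistency can be sketched with a toy model (not any specific database’s replication protocol): a write lands on one node first, replicas catch up asynchronously, and a read during the update window can return stale data.

```python
# Toy model of eventual consistency: a write is accepted by one node and
# propagates to the replicas later, so reads can be stale in between.
nodes = {"node_a": {}, "node_b": {}, "node_c": {}}
pending = []  # replication queue

def write(key, value):
    nodes["node_a"][key] = value   # accepted by one node first
    pending.append((key, value))   # replicas catch up later

def replicate():
    while pending:
        key, value = pending.pop(0)
        for name in ("node_b", "node_c"):
            nodes[name][key] = value

write("cart", ["book"])
stale = nodes["node_b"].get("cart")  # read during the update window
replicate()                          # the nodes eventually converge
fresh = nodes["node_b"].get("cart")
print(stale, fresh)  # None ['book']
```

    A strongly consistent system would block or redirect the first read instead of returning `None`; eventual consistency trades that guarantee for availability and speed.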

    Columnar storage database

    A columnar storage database’s defining characteristic is that it stores data tables by column rather than by row. The main benefit of this configuration is that it accelerates analyses: the system only has to read the locations your query is interested in, all within a single column. These systems also achieve better compression, because the data within any one column is homogeneous (all values are of the same type: integers, strings, etc.), so repeated values can be encoded far more compactly than in row-oriented storage.
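    The compression advantage can be shown in a few lines of Python: a hypothetical three-row table, stored column-wise and then run-length encoded. Repeated values within a homogeneous column collapse into short runs.

```python
# Row-oriented layout: one tuple per record.
rows = [
    ("alice", "US", 2024),
    ("bob",   "US", 2024),
    ("carol", "US", 2023),
]

# Column-oriented layout: each column stored contiguously.
names, countries, years = (list(col) for col in zip(*rows))

def run_length_encode(column):
    """Collapse consecutive repeats into [value, count] runs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

print(run_length_encode(countries))  # [['US', 3]]
print(run_length_encode(years))      # [[2024, 2], [2023, 1]]
```

    A query like "how many accounts per country" now touches only the tiny encoded `countries` column, never the names or years.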

    Due to this design, however, columnar storage databases are not typically used to build transactional databases. One of the drawbacks of these types of database is that inserts and updates that touch an entire row (common in apps like ERPs and CRMs, for example) can be expensive, and whole-row reads are slower too. For example, when opening an account’s page in a CRM, the app needs to read the entire row (name, address, email, account ID, etc.) to populate the page, and write it all back as well. For this workload, a row-oriented relational database would be more efficient.

    Cloud solutions

    While not technically a type of database themselves, no discussion of modern types of database solutions would be complete without discussing the cloud. In this age of big data and fast-moving data sources, data engineers are increasingly turning to cloud solutions (AWS, Snowflake, etc.) to store, access, and analyze their data. One of the biggest advantages of cloud options is that you don’t have to pay for the physical space or the physical machine associated with your database (or its upkeep, emergency backups, etc.). Additionally, you only pay for what you use: as your memory and processing power needs scale up, you pay for the level of service you need, but you don’t have to pre-purchase these capabilities.

    There are some drawbacks to using a cloud solution, however. First off, since you’re connecting to a remote resource, bandwidth limitations can be a factor. Additionally, even though the cloud does offer cost savings, especially when starting a company from scratch, the lifetime costs of paying your server fees could exceed what you would have paid buying your own equipment. Lastly, depending on the type of data you’re dealing with, compliance and security can be issues because the responsibility of managing the data and its security is no longer handled by you, the data owner, and instead by the third party provider. For example, unsecured APIs and interfaces that can be more readily exploited, data breaches, data loss or leakage risks can be elevated, and unauthorized access through improperly configured firewalls are some ways in which cloud databases can be compromised.

    Decision time

    The era of big data is changing the way companies deal with their data. This means choosing new database models and finding the right analytics and BI tools to help your team get the most out of your data and build the apps, products, and services that will shape the world. Whatever you’re creating, pick the right database type for your needs, and build boldly.

    Author: Jack Cieslak

    Source: Sisense

  • IBM: Hadoop solutions and the data lakes of the future


    The foundation of the AI Ladder is Information Architecture. The modern data-driven enterprise needs to leverage the right tools to collect, organize, and analyze their data before they can infuse their business with the results.

    Businesses have many types of data and many ways to apply it. We must look for approaches to manage all forms of data, regardless of technique (e.g., relational, MapReduce) or use case (e.g., analytics, business intelligence, business process automation). Data must be stored securely and reliably, while minimizing costs and maximizing utility.

    Object storage is the ideal place to collect, store, and manage data assets which will be used to generate business value.

    Object storage started as an archive

    Object storage was first conceived as a simplification: How can we remove the extraneous functions seen in file systems to make storage more scalable, reliable, and low-cost? Technologies like erasure coding massively reduced costs by allowing reliable storage to be built on cheap commodity hardware. The interface was simple: a uniform, limitless namespace, atomic data writes, all available over HTTP.
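    That simplified interface can be sketched as a toy in-memory object store (illustrative only; real systems expose this model over HTTP, as the S3 API does): one flat namespace of keys, whole-object atomic writes, and no directories, seeks, or partial updates.

```python
# Illustrative sketch of the object-storage model: a flat key namespace
# with whole-object puts and gets, and nothing else.
class ObjectStore:
    def __init__(self):
        self._objects = {}  # key -> object bytes

    def put(self, key, data):
        # A put replaces the entire object atomically; there is no append or seek.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def keys(self):
        return sorted(self._objects)

store = ObjectStore()
store.put("backups/2024/db.dump", b"\x00\x01")
store.put("media/song.mp3", b"ID3...")
print(store.keys())  # all keys live in one uniform namespace
```

    The slash-separated keys look like paths, but the store treats them as opaque names: there is no directory tree to traverse, which is part of what makes the model so scalable.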

    And object storage excels at data storage. For example, IBM Cloud Object Storage is designed for over 10 nines (99.99999999%) of durability, has robust data protection and security features, flexible pricing, and is natively integrated with IBM Aspera on Cloud high-speed data transfer.

    The first use cases were obvious: Object storage was a more scalable file system that could be used to store unstructured data like music, images, and video, or to store backup files, database dumps, and log files. Its single namespace and data tiering options allow it to be used for data archiving, and its HTTP interface makes it convenient for serving static website content as part of cloud-native applications.

    But, beyond that, it was just a data dump.

    MapReduce and the rise of Hadoop solutions

    At the same time object storage was replacing file system use cases, the MapReduce programming model was emerging in data analytics. Apache Hadoop provided a software framework for processing big data workloads that traditional relational database management system (RDBMS) solutions could not effectively manage. Data scientists and analysts had to give up declarative data management and SQL queries, but gained the ability to work with exponentially larger data sets and the freedom to explore unstructured and semi-structured data.
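    The MapReduce model itself fits in a few lines of Python. This toy word count runs the map, shuffle, and reduce phases sequentially, where Hadoop would distribute the same three phases across a cluster:

```python
from collections import defaultdict

# Map phase: each input line is turned into (key, value) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle: group all values by key (Hadoop does this across the network).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate each group independently.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big data tools"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

    Because each map call and each reduce group is independent, the framework can run them on different machines without changing the program's logic, which is the source of MapReduce's scalability.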

    In the beginning, Hadoop was seen by some as a backwards step. It achieved a measure of scale and cost savings but gave up much of what made RDBMSs so powerful and easy to work with. While dropping the schema requirement added flexibility, query latencies and overall performance suffered. But the Hadoop ecosystem has continued to expand and meet user needs: Spark massively increased performance, and Hive provided SQL querying.

    Hadoop is not suitable for everything. Transactional processing is still a better fit for RDBMS. Businesses must use the appropriate technology for their various OLAP and OLTP needs.

    HDFS became the de facto data lake for many enterprises

    Like object storage, Hadoop was also designed as a scale-out architecture on cheap commodity hardware. The Hadoop Distributed File System (HDFS) began with the premise that compute should be moved to the data. It was designed to place data on locally attached storage on the compute nodes themselves, in a form that could be directly read by locally running tasks, without a network hop.

    While this is beneficial for many types of workloads, it wasn’t the end of ETL envisioned by some. By placing readable copies of data directly on compute nodes, HDFS can’t take advantage of erasure coding to save costs. When data reliability is required, replication is just not as cost-effective. Furthermore, it can’t be independently scaled from compute. As workloads diversified, this inflexibility caused management and cost issues. For many jobs, network wasn’t actually the bottleneck. Compute bottlenecks are typically CPU or memory issues, and storage bottlenecks are typically hard drive spindle related, either disk throughput or seek constrained.
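    The cost gap between replication and erasure coding is easy to quantify. The figures below (100 TB of data, 3-way replication, a 10+4 erasure code) are illustrative, not vendor numbers:

```python
# Back-of-the-envelope raw-disk overhead for storing 100 TB reliably.
data_tb = 100

# HDFS-style replication keeps full readable copies on compute nodes.
replication_factor = 3
replicated_raw = data_tb * replication_factor  # 300 TB of raw disk

# A k+m erasure code stores k data shards plus m parity shards.
k, m = 10, 4
erasure_raw = data_tb * (k + m) / k            # 140 TB of raw disk

print(replicated_raw, erasure_raw)  # 300 140.0
```

    The erasure-coded layout survives the loss of any m = 4 shards while using less than half the raw capacity of 3-way replication, which is why object storage built on erasure coding is so much cheaper per usable terabyte.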

    When you separate compute from storage, both sides benefit. Compute nodes are cheaper because they don’t need large amounts of storage, can be scaled up or down quickly without massive data migration costs, and jobs can even be run in isolation when necessary.

    For storage, you want to spread out your load onto as many spindles as possible, so using a smaller active data set on a large data pool is beneficial, and your dedicated storage budget can be reserved for smaller flash clusters when latency is really an issue.

    The scale and cost savings of Hadoop attracted many businesses to use it wherever possible, and many businesses ended up using HDFS as their primary data store. This has led to cost, manageability, and flexibility issues.

    The data lake of the future supports all compute workloads

    Object storage was always useful as a way of backing up your databases, and it could be used to offload data from HDFS to a lower-cost tier as well. In fact, this is the first thing enterprises typically do.

    But, a data lake is more than just a data dump.

    A data lake is a place to collect an organization’s data for future use. Yes, it needs to store and protect data in a highly scalable, secure, and cost-effective manner, and object storage has always provided this. But when data is stored in the data lake, it is often not known how it will be used and how it will be turned into value. Thus, it is essential that the data lake include good integration with a range of data processing, analytics, and AI tools.

    Typical tools will not only include big data tools such as Hadoop, Spark, and Hive, but also deep learning frameworks (such as TensorFlow) and analytics tools (such as Pandas). In addition, it is essential for a data lake to support tools for cataloging, messaging, and transforming the data to support exploration and repurposing of data assets.

    Object storage can store data in formats native to big data and analytics tools. Your Hadoop and Spark jobs can directly access object storage through the S3a or Stocator connectors using the IBM Analytics Engine. IBM uses these techniques against IBM Cloud Object Storage for operational and analytic needs.

    Just like with Hadoop, you can also leverage object storage to directly perform SQL queries. The IBM Cloud SQL Query service uses Spark internally to perform ad-hoc and OLAP queries against data stored directly in IBM Cloud Object Storage buckets.

    TensorFlow can also be used to train and deploy ML models directly using data in object storage.

    This is the true future of object storage

    As organizations look to modernize their information architecture, they save money when they build their data lake with object storage.

    For existing HDFS shops, this can be done incrementally, but you need to make sure to keep your data carefully organized as you do so to take full advantage of the rich capabilities available now and in the future.

    Author: Wesly Leggette & Michael Factor

    Source: IBM

  • The (near) future of data storage


    As data proliferates at an exponential rate, companies must do more than simply store it: they must approach Data Management expertly and look to new approaches. Companies that take new and creative approaches to data storage will be able to transform their operations and thrive in the digital economy.

    How should companies approach data storage in the years to come? As we look into our crystal ball, here are important trends in 2020. Companies that want to make the most of data storage should be on top of these developments.

    A data-centric approach to data storage

    Companies today are generating oceans of data, and not all of that data is equally important to their function. Organizations that know this, and know which pieces of data are more critical to their success than others, will be in a position to better manage their storage and better leverage their data.

    Think about it. As organizations deal with a data deluge, they are trying hard to maximize their storage pools. As a result, they can inadvertently end up putting critical data on less critical servers. Doing so is a problem because it typically takes longer to access data on slower, secondary machines. It’s this lack of speed and agility that can have a detrimental impact on businesses’ ability to leverage their data.

    Traditionally organizations have taken a server-based approach to their data backup and recovery deployments. Their priority is to back up their most critical machines rather than focusing on their most business-critical data.

    So, rather than having backup and recovery policies based on the criticality of each server, we will start to see organizations match their most critical servers with their most important data. In essence, the actual content of the data will become more of a decision-driver from a backup point of view.
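    A data-centric policy might look like the following hypothetical sketch, where the backup plan is derived from each dataset’s business criticality rather than from the server it happens to live on (all names and numbers are invented for illustration):

```python
# Hypothetical backup policies keyed by data criticality, not by server.
POLICIES = {
    "critical": {"frequency_minutes": 15,   "tier": "flash",  "retention_days": 365},
    "standard": {"frequency_minutes": 240,  "tier": "disk",   "retention_days": 90},
    "archive":  {"frequency_minutes": 1440, "tier": "object", "retention_days": 30},
}

datasets = [
    {"name": "payments_ledger",  "criticality": "critical", "server": "legacy-07"},
    {"name": "marketing_assets", "criticality": "archive",  "server": "prod-01"},
]

# The policy follows the data: the ledger on the old "legacy-07" box still
# gets the most aggressive protection, regardless of its host.
plans = {d["name"]: POLICIES[d["criticality"]] for d in datasets}
print(plans["payments_ledger"]["tier"])  # flash
```

    Note how the critical ledger sitting on an aging server still receives flash-tier, 15-minute backups, while less important data on a newer machine does not; a server-based policy would have gotten this exactly backwards.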

    The most successful companies in the digital economy will be those that implement storage policies based not on their server hierarchy but on the value of their data.

    The democratization of flash storage

    With the continuing rise of technologies like IoT, artificial intelligence, and 5G, there will be an ever-greater need for high-performance storage. This will lead to the broader acceptance of all-flash storage. The problem, of course, is that flash storage is like a high-performance car: cool and sexy, but the price is out of reach for most.

    And yet traditional disk storage simply isn’t up to the task. Disk drives are like your family’s old minivan: reliable but boring and slow, unable to turn on a dime. But we’re increasingly operating in a highly digital world where data has to be available the instant it’s needed, not the day after. In this world, every company (not just the biggest and wealthiest ones) needs high-performance storage to run their business effectively.

    As the cost of flash storage drops, more storage vendors are bringing all-flash arrays to the mid-market, and more organizations will be able to afford this high-performance solution. This price democratization will ultimately enable every business to benefit from the technology.

    The repatriation of cloud data

    Many companies realize that moving to the cloud is not as cost-effective, secure, or scalable as they initially thought. They’re now looking to return at least some of their core data and applications to their on-premises data centers.

    The truth is that data volumes in the cloud have become unwieldy. Organizations are discovering that storing data in the cloud is not only more expensive than they thought, but it’s also hard to access that data expeditiously due to the cloud’s inherent latency.

    As a result, it can be more beneficial in terms of cost, security, and performance to move at least some company data back on-premises.

    Now that they realize the cloud is not a panacea, organizations are embracing the notion of cloud data repatriation. They’re increasingly deploying a hybrid infrastructure in which some data and applications remain in the cloud, while more critical data and applications come back home to an on-premises storage infrastructure.

    Immutable storage for businesses of all sizes

    Ransomware will continue to be a scourge for all companies. Because hackers have realized that data stored on network-attached storage devices is extremely valuable, their attacks will become more sophisticated and targeted. This is a serious problem because backup data is typically the last line of defense. Hackers are also attacking unstructured data: if both the primary and secondary (backup) copies are encrypted, businesses will have to pay the ransom if they want their data back. This increases the likelihood that an organization without a specific and immutable recovery plan in place will pay a ransom to regain control over its data.

    It is not a question of if, but when, an organization will need to recover from a ‘successful’ ransomware attack. Therefore, it’s more important than ever to protect this data with immutable object storage and continuous data protection. Organizations should look for a storage solution that protects information continuously by taking snapshots as frequently as possible (e.g., every 90 seconds). That way, even when data is overwritten, there will always be another, immutable copy of the original objects that constitute the company’s data, and it can be recovered instantly… even if it’s hundreds of terabytes.
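    The idea of immutability via snapshots can be sketched as simple versioning (a toy model, not any vendor’s implementation): an overwrite appends a new version instead of destroying the old object, so a pre-attack state can always be restored.

```python
# Toy versioned store: every put appends a version; nothing is ever destroyed.
versions = {}  # key -> list of historical values, oldest first

def put(key, data):
    versions.setdefault(key, []).append(data)

def restore(key, version=0):
    return versions[key][version]  # version 0 = the original object

put("invoices.db", b"clean data")
put("invoices.db", b"ENCRYPTED-BY-RANSOMWARE")  # the attack overwrites the file...
print(restore("invoices.db"))                   # ...but the original survives
```

    In a real continuous-data-protection product, the snapshot cadence (e.g., every 90 seconds) determines how much work could be lost between the last clean version and the attack.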

    Green storage

    Global data centers consume massive amounts of energy, which contributes to global warming. Data centers now eat up around 3% of the world’s electricity supply and are responsible for approximately 2% of global greenhouse gas emissions. These numbers put the carbon footprint of data centers on par with that of the entire airline industry.

    Many companies are seeking to reduce their carbon footprint and be good corporate citizens. As part of this effort, they are increasingly looking for more environmentally-friendly storage solutions, those that can deliver the highest levels of performance and capacity at the lowest possible power consumption.

    In 2020, organizations of all sizes will work hard to get the most from the data they create and store. By leveraging these five trends and adopting a modern approach to data storage, organizations can more effectively transform their business and thrive in the digital economy.

    The ‘Prevention Era’ will be overtaken by the ‘Recovery Era’

    Organizations will have to look to more efficient and different ways to protect unstructured and structured data. An essential element to being prepared in the ‘recovery era’ will involve moving unstructured data to immutable object storage with remote replication, which will eliminate the need for traditional backup. The nightly backup will become a thing of the past, replaced by snapshots every 90 seconds. This approach will free up crucial primary storage budget, VMware/Hyper-V storage, and CPU/memory for critical servers.

    While data protection remains crucial, in the data recovery era, the sooner organizations adopt a restore and recover mentality, the better they will be able to benefit from successful business continuity strategies in 2020 and beyond.

    Author: Sean Derrington

    Source: Dataversity

  • What is dark data? And how to deal with it


    It’s easier than ever to collect data without a specific purpose, under the assumption that it may be useful later. Often, though, that data ends up unused and even forgotten because of several simple factors: The fact that the data is being collected isn’t effectively communicated to potential users within an organization. The repositories that hold the data aren’t widely known. Or perhaps there simply isn’t enough analysis capacity within the company to process it. This data that is collected but not used is often termed 'dark data'. 

    Dark data presents an organization with tremendous opportunities, as well as liabilities. If it is harnessed effectively, it can be used to produce insights that wouldn’t otherwise be available. With that in mind, it’s important to make this dark data accessible so it can power those innovative use cases.

    On the other hand, lack of visibility into all the data being collected within an organization can make it difficult to accurately manage costs, and easy to accidentally run afoul of retention policies. It can also hamper efforts to ensure compliance with regulations like the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).

    So what can be done to maximize the benefits of dark data and avoid these problems?

    Some best practices

    When dealing with dark data, the foremost best practice is to shine a spotlight on it by communicating to potential users within the organization what data is being collected.

    Secondly, organizations need to evaluate whether and for how long it makes sense to retain the data. This is crucial to avoid incurring potentially substantial costs collecting and storing data that isn’t being used and won’t be used in the future, and even more importantly to ensure that the data is being handled and secured properly.
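    A retention check over a (hypothetical) data catalog might look like this sketch, which flags datasets whose last access falls outside their retention window as candidates for review or deletion:

```python
from datetime import date, timedelta

# Hypothetical catalog: each entry records when the data was last used
# and how long it is worth keeping.
catalog = [
    {"name": "clickstream_raw", "last_accessed": date(2019, 1, 10), "retention_days": 180},
    {"name": "orders",          "last_accessed": date(2020, 3, 1),  "retention_days": 365},
]

def stale_datasets(catalog, today):
    """Return datasets not accessed within their retention window."""
    return [
        entry["name"]
        for entry in catalog
        if today - entry["last_accessed"] > timedelta(days=entry["retention_days"])
    ]

print(stale_datasets(catalog, today=date(2020, 4, 1)))  # ['clickstream_raw']
```

    Running a check like this regularly surfaces dark data before it silently accumulates storage costs or drifts out of compliance with retention rules.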

    Perhaps the biggest challenge when working with dark data is simply getting access to it, as it’s often stored in siloed repositories close to where the data is being collected. Additionally, it may be stored in systems and formats that are difficult to query or have limited analytics capabilities.

    So the next step is to ensure that the data that is collected can actually be used effectively. The two main approaches are: (1) investing in tooling that can query the data where it is currently stored, and (2) moving the data into centralized data storage platforms. 

    I recommend combining these two approaches. Firstly, adopt tools that provide the ability to discover, analyze, and visualize data from multiple platforms and locations via a single interface, which will increase data visibility and reduce the tendency to store the same data multiple times. Second, leverage storage platforms that can efficiently aggregate and store data that would otherwise be inaccessible, in order to reduce the number of data stores that must be tracked and managed.

    Considering the potential power and pitfalls that come with having dark data in your organization, it’s definitely worth the effort to bring it out of the dark.

    Author: Dan Cech

    Source: Insidebigdata

  • Why data lakes are the future of data storage


    The term big data has been around since 2005, but what does it actually mean? Exactly how big is big? We are creating data every second. It’s generated across all industries and by a myriad of devices, from computers to industrial sensors to weather balloons and countless other sources. According to a recent study conducted by Data Never Sleeps, there are a quintillion bytes of data generated each minute, and the forecast is that our data will only keep growing at an unprecedented rate.

    We have also come to realize just how important data really is. Some liken its value to something as precious to our existence as water or oil, although those aren’t really valid comparisons. Water supplies can fall and petroleum stores can be depleted, but data isn’t going anywhere. It only continues to grow. Not just in volume, but also in variety and velocity. Thankfully, over the past decade, data storage has become cheaper, faster and more easily available, and as a result, where to store all this information isn’t the biggest concern anymore. Industries that work in the IoT and faster payments space are now starting to push data through at a very high speed and that data is constantly changing shape.

    In essence, all this gives rise to a 'data demon'. Our data has become so complex that normal techniques for harnessing it often fail, keeping us from realizing data’s full potential.

    Most organizations currently treat data as a cost center. Each time a data project is spun off, there is an 'expense' attached to it. It’s contradictory: on the one hand, we proclaim that data is our most valuable asset, but on the other, we perceive it as a liability. It’s time to change that perception, especially when it comes to banks. The volumes of data financial institutions hold can be used to create tremendous value. Note that I’m not talking about 'selling the data', but leveraging it more effectively to provide crisp analytics that deliver knowledge and drive better business decisions.

    What’s stopping people from converting data from an expense to an asset, then? The technology and talent exist, but the thought process is lacking.

    Data warehouses have been around for a long time and traditionally were the only way to store large amounts of data that’s used for analytical and reporting purposes. However, a warehouse, as the name suggests, immediately makes one think of a rigid structure that’s limited. In a physical warehouse, you can store products in three dimensions: length, breadth and height. These dimensions, though, are limited by your warehouse’s architecture. If you want to add more products, you must go through a massive upgrade process. Technically, it’s doable, but not ideal. Similarly, data warehouses present a bit of rigidity when handling constantly changing data elements.

    Data lakes are a modern take on big data. When you think of a lake, you cannot define its shape and size, nor can you define what lives in it and how. Lakes just form, even if they are man-made. There is still an element of randomness to them and it’s this randomness that helps us in situations where the future is, well, sort of unpredictable. Lakes expand and contract, they change over periods of time, and they have an ecosystem that’s home to various types of animals and organisms. This lake can be a source of food (such as fish) or fresh water and can even be the locale for water-based adventures. Similarly, a data lake contains a vast body of data and is able to handle that data’s volume, velocity and variety.

    When the mammoth data organizations like Yahoo, Google, Facebook and LinkedIn started to realize that their data and data usage were drastically different and that it was almost impossible to use traditional methods to analyze it, they had to innovate. This in turn gave rise to technologies like document-based databases and big data engines like Hadoop, Spark, HPCC Systems and others. These technologies were designed to allow the flexibility one needs when handling unpredictable data inputs.

    Jeff Lewis is SVP of Payments at Sutton Bank, a small community bank that’s challenging the status quo for other banks in the payments space. 'Banks have to learn to move on from data warehouses to data lakes. The speed, accuracy and flexibility of information coming out of a data lake is crucial to the increased operational efficiency of employees and to provide a better regulatory oversight', said Lewis. 'Bankers are no longer old school and are ready to innovate with the FinTechs of the world. A data centric thought process and approach is crucial for success'.

    Data lakes are a natural choice to handle the complexity of such data, and the application of machine learning and AI are also becoming more common, as well. From using AI to clean and augment incoming data, to running complex algorithms to correlate different sources of information to detect complex fraud, there is an algorithm for just about everything. And now, with the help of distributed processing, these algorithms can be run on multiple clusters and the workload can be spread across nodes.

    One thing to remember is that you should be building a data lake and not a data swamp. It’s hard to control a swamp. You cannot drink from it, nor can you navigate it easily. So, when you look at creating a data lake, think about what the ecosystem looks like and who your consumers are. Then, embark on a journey to build a lake on your own.

    Source: Insidebigdata
