3 items tagged "NoSQL"

  • Hadoop engine benchmark: How Spark, Impala, Hive, and Presto compare

AtScale recently performed benchmark tests on the Hadoop engines Spark, Impala, Hive, and Presto. Find out the results, and discover which option might be best for your enterprise.

    The global Hadoop market is expected to expand at an average compound annual growth rate (CAGR) of 26.3% between now and 2023, a testimony to how aggressively companies have been adopting this big data software framework for storing and processing the gargantuan files that characterize big data. But to turbo-charge this processing so that it performs faster, additional engine software is used in concert with Hadoop.

    AtScale, a business intelligence (BI) Hadoop solutions provider, periodically performs BI-on-Hadoop benchmarks that compare the performances of various Hadoop engines to determine which engine is best for which Hadoop processing scenario. The benchmark results assist systems professionals charged with managing big data operations as they make their engine choices for different types of Hadoop processing deployments.

    Recently, AtScale published a new survey that I discussed with Josh Klahr, AtScale's vice president of product management.

    "In this benchmark, we tested four different Hadoop engines," said Klahr. "The engines were Spark, Impala, Hive, and a newer entrant, Presto. We used the same cluster size for the benchmark that we had used in previous benchmarking."

    What AtScale found is that there was no clear engine winner in every case, but that some engines outperformed others depending on what the big data processing task involved. In one case, the benchmark looked at which Hadoop engine performed best when it came to processing large SQL data queries that involved big data joins.

    "There are companies out there that have six billion row tables that they have to join for a single SQL query," said Klahr. "The data architecture that these companies use includes runtime filtering and pre-filtering of data based upon certain data specifications or parameters that end users input, which also contribute to the processing load. In these cases, Spark and Impala performed very well. However, if it was a case of many concurrent users requiring access to the data, Presto processed more data."
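The pre-filtering Klahr describes can be sketched in plain Python. This is not AtScale's benchmark code or any engine's implementation; it is a minimal illustration, with made-up table and field names, of why filtering rows before a join cuts the processing load. Engines like Spark and Impala apply the same idea (runtime filters) at cluster scale.

```python
# Sketch: pre-filter one side of a join before probing it, so that most
# rows are discarded with a cheap dict lookup instead of being joined.

def filtered_join(orders, customers, region):
    """Join orders to customers, keeping only customers in `region`.

    Building the lookup table from the filtered customer set means every
    non-matching order row is rejected by a single hash probe.
    """
    # Pre-filter: only customers in the requested region enter the lookup.
    lookup = {c["id"]: c for c in customers if c["region"] == region}
    # Probe: keep only orders whose customer survived the filter.
    return [
        {**o, "customer": lookup[o["customer_id"]]["name"]}
        for o in orders
        if o["customer_id"] in lookup
    ]

customers = [
    {"id": 1, "name": "Acme", "region": "EU"},
    {"id": 2, "name": "Globex", "region": "US"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 99.0},
    {"order_id": 11, "customer_id": 2, "total": 42.0},
]
result = filtered_join(orders, customers, "EU")
```

At billions of rows the same principle holds: the earlier the filter is pushed, the less data the join has to move.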

    The AtScale benchmark also looked at which Hadoop engine had attained the greatest improvement in processing speed over the past six months.

    "The most noticeable gain that we saw was with Hive, especially in the process of performing SQL queries," said Klahr. "In the past six months, Hive has moved from release 1.4 to 2.1—and on an average, is now processing data 3.4 times faster."

    Other Hadoop engines also experienced processing performance gains over the past six months. Spark was processing data 2.4 times faster than it was six months ago, and Impala had improved its processing speed by 2.8 times. In all cases, better processing speeds were being delivered to users.

    "What we found is that all four of these engines are well suited to the Hadoop environment and deliver excellent performance to end users, but that some engines perform in certain processing contexts better than others," said Klahr. "For instance, if your organization must support many concurrent users of your data, Presto and Impala perform best. However, if you are looking for the greatest amount of stability in your Hadoop processing engine, Hive is the best choice. And if you are faced with billions of rows of data that you must combine in complicated data joins for SQL queries in your big data environment, Spark is the best performer."

    Klahr said that many sites seem to be relatively savvy about Hadoop performance and engine options, but that a majority really hadn't done much benchmarking when it came to using SQL.

    "The best news for users is that all of these engines perform capably with Hadoop," said Klahr. "Now that we also have benchmark information on SQL performance, this further enables sites to make the engine choices that best suit their Hadoop processing scenarios."

    Source: techrepublic.com, October 29, 2016

  • Modern Information Management: Understanding Big Data at Rest and in Motion

    Big data is the buzzword of the century, it seems. But, why is everyone so obsessed with it? Here’s what it’s all about, how companies are gathering it, and how it’s stored and used.


    What is it?

    Big data is simply large data sets that need to be analyzed computationally in order to reveal patterns, associations, or trends. This data is usually collected by governments and businesses on citizens and customers, respectively.

    The IT industry has had to shift its focus to big data over the last few years because of the sheer amount of interest being generated by big business. By collecting massive amounts of data, companies like Amazon.com, Google, Walmart, and Target are able to track the buying behaviors of specific customers.

    Once enough data is collected, these companies then use the data to help shape advertising initiatives. For example, Target has used its big data collection initiative to help target (no pun intended) its customers with products it thought would be most beneficial given their past purchases.

    How Companies Store and Use It

    There are two ways that companies can use big data. The first way is to use the data at rest. The second way is to use it in motion.

    At Rest Data – Data at rest refers to information that’s collected and analyzed after the fact. It tells businesses what’s already happened. The analysis is done separately and distinctly from any actions that are taken upon conclusion of said analysis.

    For example, a retailer that wanted to analyze the previous month's sales data would use data at rest to look over the previous month's sales totals. Then, it would take those sales totals and make strategic decisions about how to move forward given what's already happened.

    In essence, the company is using past data to guide future business activities. The data might drive the retailer to create new marketing initiatives, customize coupons, increase or decrease inventory, or to otherwise adjust merchandise pricing.

    Some companies might use this data to determine just how much of a discount is needed on promotions to spur sales growth.

    Some companies may use it to figure out how much they are able to discount in the spring and summer without creating a revenue problem later on in the year. Or, a company may use it to predict large sales events, like Black Friday or Cyber Monday.

    This type of data is batch processed since there’s no need to have the data instantly accessible or “streaming live.” There is a need, however, for storage of large amounts of data and for processing unstructured data. Companies often use a public cloud infrastructure due to the costs involved in storage and retrieval.
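A batch job over data at rest can be as simple as aggregating an already-collected set of records. The sketch below is illustrative only (the field names and figures are invented), but it captures the pattern: the data is complete before the analysis starts, and nothing about the computation needs to happen in real time.

```python
from collections import defaultdict

# Sketch: batch-processing "data at rest" -- summing last month's sales
# per product after the month has already closed.

def monthly_totals(sales):
    """Return total sales amount per product from a collected batch."""
    totals = defaultdict(float)
    for sale in sales:
        totals[sale["product"]] += sale["amount"]
    return dict(totals)

last_month = [
    {"product": "widget", "amount": 19.99},
    {"product": "gadget", "amount": 5.00},
    {"product": "widget", "amount": 19.99},
]
totals = monthly_totals(last_month)
```

A real deployment would run the same kind of aggregation over files in cloud storage rather than an in-memory list, but the batch character of the work is identical.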

    Data In Motion – Data in motion refers to data that’s analyzed in real-time. Like data at rest, data may be captured at the point of sale, or at a contact point with a customer along the sales cycle. The difference between data in motion and data at rest is how the data is analyzed.

    Instead of batch processing and analysis after the fact, data in motion typically runs on a bare metal cloud environment, because this type of infrastructure uses dedicated servers offering cloud-like features without virtualization.

    This allows for real-time processing of large amounts of data. Latency is also a concern for large companies because they need to be able to manage and use the data quickly. This is why many companies send their IT professionals to Simplilearn Hadoop admin training and then follow it up with cloud-based training and database training in areas like NoSQL.
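The contrast with the batch case can be shown with a running statistic. In this sketch (a toy stand-in for a live feed, not any particular streaming framework), each incoming value updates the result immediately; there is never a "wait for the full batch" step.

```python
# Sketch: "data in motion" -- compute a statistic incrementally as each
# event arrives, instead of collecting a batch and analyzing it later.

def running_average(stream):
    """Yield the average-so-far after every incoming value."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        yield total / count

# Any iterable works as the "stream"; a real system would read from a
# message queue or socket instead of a list.
averages = list(running_average([10.0, 20.0, 30.0]))
```

The per-event cost here is constant, which is exactly the property low-latency infrastructure is meant to preserve at scale.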


    Big Data For The Future

    Some awesome, and potentially frightening, uses for big data are on the horizon. For example, in February 2014, the Chicago Police Department sent uniformed officers to make notification visits to targeted individuals they had identified as potential criminals. They used a computer-generated list which gathered data about those individuals’ backgrounds.

    Another possible use for big data is development of hiring algorithms. More and more companies are trying to figure out ways to hire candidates without trusting slick resume writing skills. New algorithms may eliminate job prospects based on statistics, rather than skillsets, however. For example, some algorithms find that people with shorter commutes are more likely to stay in a job longer.

    So, people who have long commutes are filtered out of the hiring process quickly.

    Finally, some insurance companies might use big data to analyze your driving habits and adjust your insurance premium accordingly. That might sound nice if you're a good driver, but insurers know that driving late at night increases the risk of getting into an accident. The problem is that poorer people tend to work late shifts, overnights, or second jobs just to make ends meet. The people who are least able to afford insurance hikes may be the ones who have to pay them.

    Source: Mobilemag

  • NoSQL and the Internet of Things

    Internet of Things technology is a hot topic. You can’t read a tech news site without coming across at least one mention of IoT. But if you’re looking to take advantage of sensors, you will likely have to update your data store to handle the workload. Once you’re set up data-wise, get ready to monitor everything from weather and the environment to overseas factory floors and even fleets of trucks.

    Why NoSQL for IoT?

    You might think your data needs for sensors are as tiny as these little devices, but there are several reasons you should consider a NoSQL database.

    The first reason is that these sensors can send huge amounts of data since they run 24/7. All of that data adds up to the need for a larger storage capacity. While you might be tempted to use an RDBMS, relational databases were never really meant to deal with the kind of data that sensors generate. For one thing, sensor data doesn’t always make sense in tabular format.

    SQL was originally designed for relatively static data structured as a table. Data from sensors can change a lot and provides a continuous stream. And you need to be able to add or remove entries on the fly, which can prove difficult with relational databases.

    NoSQL databases are also more scalable, offering flexibility in data models. You can have a structure similar to SQL with wide tables, or you might choose to go with a document-oriented database, key-value database, or graph database. Time series databases are one of the more obvious choices for Internet of Things applications specifically.
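The schema flexibility described above is easy to see in miniature. The sketch below is not a real document database, just an in-memory stand-in with invented sensor names, but it shows the key property: two sensor types with completely different fields coexist in one store, and a new field never requires a schema migration.

```python
# Sketch: document-style storage for heterogeneous sensor readings.
# The store maps (sensor, timestamp) -> a schemaless document.

store = {}

def insert(doc):
    """Store one reading; documents may carry any fields they like."""
    store[(doc["sensor"], doc["ts"])] = doc

def readings_for(sensor):
    """Query all documents for one sensor, in time order."""
    return sorted(
        (doc for (s, _), doc in store.items() if s == sensor),
        key=lambda d: d["ts"],
    )

# A thermometer and a GPS tracker share the store despite having
# nothing in common beyond the sensor id and timestamp.
insert({"sensor": "thermo-1", "ts": 1, "temp_c": 21.5})
insert({"sensor": "thermo-1", "ts": 2, "temp_c": 21.7})
insert({"sensor": "gps-7", "ts": 1, "lat": 52.1, "lon": 4.3, "speed": 88})
```

A document-oriented NoSQL database provides the same model with persistence, indexing, and horizontal scaling on top.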

    Some businesses may join the big data revolution without knowing where they are actually going to store their data. You could have a cluster dedicated to your data and another to your analytics, but that’s expensive. Wouldn’t it be great if you could have your data and analytics in the same cluster? NoSQL eliminates budget waste for those with two different clusters that amount to the same thing.


    So now that you’ve got your IoT-capable database, what can you do with Internet of Things technology?

    In “The Only Living Boy in New York,” Paul Simon famously got all the news he needed on the weather report. While the weather might seem to most of us like no more than a cliché, “safe” topic for conversation, for many people, receiving the weather report is a matter of safety and survival.

    The severe weather that has impacted much of the U.S. in recent years shows how timely weather forecasts can save lives by allowing forecasters to give accurate, quick, and up-to-the-moment warnings and alerts. Both the National Weather Service and private forecasters use sophisticated models to predict the weather, and those models get better all the time. One of the primary reasons they continue to improve is that the forecasters feed the programs with real data gathered from weather stations around the world. The Weather Channel acquired Weather Underground largely for its extensive network of weather stations operated by enthusiasts.

    If you’re not interested in weather monitoring, IoT offers other options, such as monitoring pollution. Sensors can measure particles in the air, or chemicals and bacteria in the water. Agencies could use this information to plan congestion pricing for commuters or direct cleanup resources.

    Nearly every state in the U.S. has called for more manufacturers to bring jobs back to the U.S. instead of offshoring, but manufacturers cite high costs as a reason to keep factory jobs overseas. One way to monitor operations abroad is to deploy IoT on factory floors. While automated process control is nothing new, what is new is the ability to connect directly to factory floors from around the world. Businesses can monitor production and instantly track problems before they become big ones.

    One of the biggest successes for the Internet of Things industry is in logistics, particularly fleet tracking. Trucking companies can see instantly where their vehicles are, and customers can know exactly where their shipments are. Managers can even track fuel usage and see when trucks are due for maintenance. All of these factors help logistics companies cut costs, save fuel, and keep customers.
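Fleet tracking maps naturally onto a key-value model, which is one of the NoSQL shapes mentioned earlier. This sketch uses invented truck IDs and coordinates; the point is only that each GPS event overwrites the previous state for that vehicle, so "where is every truck right now?" is a single lookup rather than a scan.

```python
# Sketch: fleet tracking as a key-value "latest state" view.
# Each incoming GPS event replaces the previous entry for that truck.

latest = {}

def on_gps_event(truck_id, lat, lon, fuel_pct):
    """Record the most recent known position and fuel level for a truck."""
    latest[truck_id] = {"lat": lat, "lon": lon, "fuel_pct": fuel_pct}

on_gps_event("truck-42", 40.7, -74.0, 81)
on_gps_event("truck-42", 40.8, -73.9, 79)   # newer event wins
on_gps_event("truck-7", 34.0, -118.2, 55)

position = latest["truck-42"]
```

Maintenance and fuel alerts fall out of the same view: a periodic pass over `latest` can flag any truck whose `fuel_pct` drops below a threshold.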


    NoSQL may be just the solution you need to venture into IoT technology. With the ability to handle the vast workloads from sensors running 24/7, you’ll be able to react to new situations quickly. NoSQL can help you save money, save time, and even save lives. 

    Source: Smartdatacollective
