5 items tagged "Hadoop"

  • 10 Big Data Trends for 2017

    Infogix, a leader in helping companies provide end-to-end data analysis across the enterprise, today highlighted the top 10 data trends it foresees will be strategic for most organizations in 2017.
     
    “This year’s trends examine the evolving ways enterprises can realize better business value with big data and how improving business intelligence can help transform organization processes and the customer experience (CX),” said Sumit Nijhawan, CEO and President of Infogix. “Business executives are demanding better data management for compliance and increased confidence to steer the business, more rapid adoption of big data and innovative and transformative data analytic technologies.”
     
    The top 10 data trends for 2017 are assembled by a panel of Infogix senior executives. The key trends include:
     
    1.    The Proliferation of Big Data
        Proliferation of big data has made it crucial to analyze data quickly to gain valuable insight.
        Organizations must turn the terabytes of big data that is not being used, classified as dark data, into usable data.
        Big data has not yet yielded the substantial results that organizations require to develop new insights for new, innovative offerings to derive a competitive advantage.
     
    2.    The Use of Big Data to Improve CX
        Using big data to improve CX by moving from legacy to vendor systems, during M&A, and with core system upgrades.
        Analyzing data with self-service flexibility to quickly harness insights about leading trends, along with competitive insight into new customer acquisition growth opportunities.
        Using big data to better understand customers in order to improve top line revenue through cross-sell/upsell or remove risk of lost revenue by reducing churn.
     
    3.    Wider Adoption of Hadoop
        More and more organizations will adopt Hadoop and other big data stores; in turn, vendors will rapidly introduce new, innovative Hadoop solutions.
        With Hadoop in place, organizations will be able to crunch large amounts of data using advanced analytics to find nuggets of valuable information for making profitable decisions.
     
    4.    Hello to Predictive Analytics
        Precisely predict future behaviors and events to improve profitability.
        Make a leap in improving fraud detection rapidly to minimize revenue risk exposure and improve operational excellence.
     
    5.    More Focus on Cloud-Based Data Analytics
        Moving data analytics to the cloud accelerates adoption of the latest capabilities to turn data into action.
        Cut costs in ongoing maintenance and operations by moving data analytics to the cloud.
     
    6.    The Move toward Informatics and the Ability to Identify the Value of Data
        Use informatics to help integrate the collection, analysis, and visualization of complex data to derive revenue and efficiency value from that data.
        Tap an underused resource – data – to increase business performance.
     
    7.    Achieving Maximum Business Intelligence with Data Virtualization
        Data virtualization unlocks what is hidden within large data sets.
        Graphic data virtualization allows organizations to retrieve and manipulate data on the fly regardless of how the data is formatted or where it is located.
     
    8.    Convergence of IoT, the Cloud, Big Data, and Cybersecurity
        Data management technologies such as data quality, data preparation, data analytics, and data integration will continue to converge.
        As we continue to become more reliant on smart devices, inter-connectivity and machine learning will become even more important to protect these assets from cyber security threats.
     
    9.    Improving Digital Channel Optimization and the Omnichannel Experience
        Balancing traditional and digital channels to connect with customers through their preferred channel.
        Continuously looking for innovative ways to enhance CX across channels to achieve a competitive advantage.
     
    10.    Self-Service Data Preparation and Analytics to Improve Efficiency
        Self-service data preparation tools boost time to value enabling organizations to prepare data regardless of the type of data, whether structured, semi-structured or unstructured.
        Decreased reliance on development teams to massage the data by introducing more self-service capabilities to give power to the user and, in turn, improve operational efficiency.
     
    “Every year we see more data being generated than ever before and organizations across all industries struggle with its trustworthiness and quality. We believe the technology trends of cloud, predictive analysis and big data will not only help organizations deal with the vast amount of data, but help enterprises address today’s business challenges,” said Nijhawan. “However, before these trends lead to the next wave of business, it’s critical that organizations understand that the success is predicated upon data integrity.”
     
    Source: dzone.com, November 20, 2016
  • Big data vendors see the internet of things (IoT) opportunity, pivot tech and message to compete

    Open source big data technologies like Hadoop have done much to begin the transformation of analytics. We're moving from expensive and specialist analytics teams towards an environment in which processes, workflows, and decision-making throughout an organisation can - in theory at least - become usefully data-driven. Established providers of analytics, BI and data warehouse technologies liberally sprinkle Hadoop, Spark and other cool project names throughout their products, delivering real advantages and real cost-savings, as well as grabbing some of the Hadoop glow for themselves. Startups, often closely associated with shepherding one of the newer open source projects, also compete for mindshare and custom.

    And the opportunity is big. Hortonworks, for example, has described the global big data market as a $50 billion opportunity. But that pales into insignificance next to what Hortonworks (again) describes as a $1.7 trillion opportunity. Other companies and analysts have their own numbers, which do differ, but the step-change is clear and significant. Hadoop, and the vendors gravitating to that community, mostly address 'data at rest'; data that has already been collected from some process or interaction or query. The bigger opportunity relates to 'data in motion,' and to the internet of things that will be responsible for generating so much of this.

    My latest report, Streaming Data From The Internet Of Things Will Be The Big Data World’s Bigger Second Act, explores some of the ways that big data vendors are acquiring new skills and new stories with which to chase this new opportunity.

    For CIOs embarking on their IoT journey, it may be time to take a fresh look at companies previously so easily dismissed as just 'doing the Hadoop thing.' 

    Source: Forrester.com

  • Hadoop engine benchmark: How Spark, Impala, Hive, and Presto compare

    AtScale recently performed benchmark tests on the Hadoop engines Spark, Impala, Hive, and Presto. Find out the results, and discover which option might be best for your enterprise.

    The global Hadoop market is expected to expand at an average compound annual growth rate (CAGR) of 26.3% between now and 2023, a testimony to how aggressively companies have been adopting this big data software framework for storing and processing the gargantuan files that characterize big data. But to turbo-charge this processing so that it performs faster, additional engine software is used in concert with Hadoop.
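
    As a rough illustration of what a 26.3% compound annual growth rate implies, the short calculation below compounds that rate over the roughly seven-year span from 2016 to 2023; the span and starting point are assumptions chosen for illustration rather than figures from the article.

    ```python
    # Illustrative only: what a 26.3% CAGR means over an assumed 2016-2023 span.
    cagr = 0.263
    years = 2023 - 2016  # seven years of compounding

    growth_factor = (1 + cagr) ** years
    print(f"Growth factor over {years} years: {growth_factor:.1f}x")
    # -> roughly 5.1x, i.e., a market a bit over five times its starting size
    ```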

    AtScale, a business intelligence (BI) Hadoop solutions provider, periodically performs BI-on-Hadoop benchmarks that compare the performances of various Hadoop engines to determine which engine is best for which Hadoop processing scenario. The benchmark results assist systems professionals charged with managing big data operations as they make their engine choices for different types of Hadoop processing deployments.

    Recently, AtScale published a new survey that I discussed with Josh Klahr, AtScale's vice president of product management.

    "In this benchmark, we tested four different Hadoop engines," said Klahr. "The engines were Spark, Impala, Hive, and a newer entrant, Presto. We used the same cluster size for the benchmark that we had used in previous benchmarking."

    What AtScale found is that there was no clear engine winner in every case, but that some engines outperformed others depending on what the big data processing task involved. In one case, the benchmark looked at which Hadoop engine performed best when it came to processing large SQL data queries that involved big data joins.

    "There are companies out there that have six billion row tables that they have to join for a single SQL query," said Klahr. "The data architecture that these companies use include runtime filtering and pre-filtering of data based upon certain data specifications or parameters that end users input, and which also contribute to the processing load. In these cases, Spark and Impala performed very well. However, if it was a case of many concurrent users requiring access to the data, Presto processed more data."

    The AtScale benchmark also looked at which Hadoop engine had attained the greatest improvement in processing speed over the past six months.

    "The most noticeable gain that we saw was with Hive, especially in the process of performing SQL queries," said Klahr. "In the past six months, Hive has moved from release 1.4 to 2.1—and on an average, is now processing data 3.4 times faster."
     
    Other Hadoop engines also experienced processing performance gains over the past six months. Spark was processing data 2.4 times faster than it was six months ago, and Impala had improved processing over the past six months by 2.8%. In all cases, better processing speeds were being delivered to users.

    "What we found is that all four of these engines are well suited to the Hadoop environment and deliver excellent performance to end users, but that some engines perform in certain processing contexts better than others," said Klahr. "For instance, if your organization must support many concurrent users of your data, Presto and Impala perform best. However, if you are looking for the greatest amount of stability in your Hadoop processing engine, Hive is the best choice. And if you are faced with billions of rows of data that you must combine in complicated data joins for SQL queries in your big data environment, Spark is the best performer."

    Klahr said that many sites seem to be relatively savvy about Hadoop performance and engine options, but that a majority really hadn't done much benchmarking when it came to using SQL.

    "The best news for users is that all of these engines perform capably with Hadoop," sad Klahr. "Now that we also have benchmark information on SQL performance, this further enables sites to make the engine choices that best suit their Hadoop processing scenarios."

    Source: techrepublic.com, October 29, 2016

  • Hadoop: what is it for, exactly?

    Flexible and scalable management of big data

    Data infrastructure is the most important organ for creating and delivering good business insights. To take advantage of the diversity of data available and to modernize their data architecture, many organizations deploy Hadoop. A Hadoop-based environment is flexible and scalable in managing big data. What is the impact of Hadoop? The Aberdeen Group investigated the impact of Hadoop on data, people, and business performance.

    New data from a variety of sources

    A great deal of data has to be captured, moved, stored, and archived. But companies are now gaining insights from hidden data beyond the traditional structured transaction records: emails, social data, multimedia, GPS information, and sensor data. Alongside these new data sources, a wide range of new technologies has emerged to manage and exploit all of this data. Together, this information and these technologies are shifting big data from a problem into an opportunity.

    What are the benefits of this yellow elephant (Hadoop)?

    A major forerunner of this big data opportunity is the Hadoop data architecture. The research shows that companies using Hadoop are more driven to make use of unstructured and semi-structured data. Another important trend is a shift in company mindset: data is increasingly seen as a strategic asset and an essential part of the organization.

    The need for user empowerment and user satisfaction is one reason companies choose Hadoop. In addition, a Hadoop-based architecture offers two benefits for end users:

    1. Data flexibility – All data under one roof, resulting in higher quality and usability.
    2. Data elasticity – The architecture is significantly more flexible in adding new data sources.

    What is the impact of Hadoop on your organization?

    What else can you do with Hadoop, and how can you best apply this data architecture to your data sources? Read in this report how deploying Hadoop can save even more time in analyzing data and ultimately deliver more profit.

    Source: Analyticstoday

  • IBM: Hadoop solutions and the data lakes of the future

    The foundation of the AI Ladder is Information Architecture. The modern data-driven enterprise needs to leverage the right tools to collect, organize, and analyze their data before they can infuse their business with the results.

    Businesses have many types of data and many ways to apply it. We must look for approaches to manage all forms of data, regardless of techniques (e.g., relational, map reduce) or use case (e.g., analytics, business intelligence, business process automation). Data must be stored securely and reliably, while minimizing costs and maximizing utility.

    Object storage is the ideal place to collect, store, and manage data assets which will be used to generate business value.

    Object storage started as an archive

    Object storage was first conceived as a simplification: how can we remove the extraneous functions seen in file systems to make storage more scalable, reliable, and low-cost? Technologies like erasure coding massively reduced costs by allowing reliable storage to be built on cheap commodity hardware. The interface was simple: a uniform, limitless namespace and atomic data writes, all available over HTTP.
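
    As a minimal sketch of that simple HTTP-style interface, the snippet below writes and reads an object through an S3-compatible API (the kind IBM Cloud Object Storage exposes) using boto3; the endpoint, bucket, key, and credentials are placeholders.

    ```python
    import boto3

    # Placeholder endpoint and credentials; IBM Cloud Object Storage exposes an
    # S3-compatible API, so a standard S3 client can talk to it.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # One flat, effectively limitless namespace: bucket + key.
    s3.put_object(Bucket="my-data-lake", Key="raw/logs/2016-11-20.json",
                  Body=b'{"event": "example"}')

    obj = s3.get_object(Bucket="my-data-lake", Key="raw/logs/2016-11-20.json")
    print(obj["Body"].read())
    ```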

    And object storage excels at data storage. For example, IBM Cloud Object Storage is designed for over ten 9's of durability, has robust data protection and security features, offers flexible pricing, and is natively integrated with IBM Aspera on Cloud high-speed data transfer.

    The first use cases were obvious: Object storage was a more scalable file system that could be used to store unstructured data like music, images, and video. Or to store backup files, database dumps, and log files. Its single-namespace, data tiering options allow it to be used for data archiving, and its HTTP interface makes it convenient to serve static website content as part of cloud native applications.

    But, beyond that, it was just a data dump.

    Map reduce and the rise of Hadoop solutions

    At the same time object storage was replacing file system use cases, the map reduce programming model was emerging in data analytics. Apache Hadoop provided a software framework for processing big data workloads that traditional relational database management system (RDBMS) solutions could not effectively manage. Data scientists and analysts had to give up declarative data management and SQL queries, but gained the ability to work with exponentially larger data sets and the freedom to explore unstructured and semi-structured data.
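
    For readers unfamiliar with the programming model itself, here is a tiny, self-contained Python sketch of the map reduce idea (a word count), independent of any Hadoop API: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group.

    ```python
    from collections import defaultdict

    documents = ["big data needs big storage", "object storage stores big data"]

    # Map: emit (word, 1) pairs for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the pairs by key (the word).
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: aggregate each group into a final count.
    word_counts = {word: sum(counts) for word, counts in groups.items()}
    print(word_counts)  # {'big': 3, 'data': 2, 'needs': 1, ...}
    ```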

    In the beginning, Hadoop was seen by some as a step backwards. It achieved a measure of scale and cost savings but gave up much of what made RDBMS systems so powerful and easy to work with. While not requiring schemas added flexibility, query latencies increased and overall performance suffered. But the Hadoop ecosystem has continued to expand and meet user needs: Spark massively increased performance, and Hive provided SQL querying.

    Hadoop is not suitable for everything. Transactional processing is still a better fit for RDBMS. Businesses must use the appropriate technology for their various OLAP and OLTP needs.

    HDFS became the de facto data lake for many enterprises

    Like object storage, Hadoop was also designed as a scale-out architecture on cheap commodity hardware. The Hadoop Distributed File System (HDFS) began with the premise that compute should be moved to the data. It was designed to place data on locally attached storage on the compute nodes themselves. Data was stored in a form that could be directly read by locally running tasks, without a network hop.

    While this is beneficial for many types of workloads, it wasn’t the end of ETL envisioned by some. By placing readable copies of data directly on compute nodes, HDFS can’t take advantage of erasure coding to save costs. When data reliability is required, replication is just not as cost-effective. Furthermore, it can’t be independently scaled from compute. As workloads diversified, this inflexibility caused management and cost issues. For many jobs, network wasn’t actually the bottleneck. Compute bottlenecks are typically CPU or memory issues, and storage bottlenecks are typically hard drive spindle related, either disk throughput or seek constrained.
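
    A back-of-the-envelope comparison shows why this matters. The replication factor and erasure-coding layout below are common defaults used purely for illustration (HDFS traditionally replicates each block three times; a 10+4 erasure-coded stripe is one typical object storage layout):

    ```python
    # Raw capacity needed to store 1 PB of usable data, illustrative defaults only.
    usable_pb = 1.0

    # 3x replication (a traditional HDFS default): every byte is stored three times.
    replication_raw = usable_pb * 3              # 3.0 PB raw, 200% overhead

    # 10+4 erasure coding: 10 data fragments plus 4 parity fragments per stripe.
    erasure_raw = usable_pb * (10 + 4) / 10      # 1.4 PB raw, 40% overhead

    print(f"Replication:    {replication_raw:.1f} PB raw")
    print(f"Erasure coding: {erasure_raw:.1f} PB raw")
    ```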

    When you separate compute from storage, both sides benefit. Compute nodes are cheaper because they don’t need large amounts of storage, can be scaled up or down quickly without massive data migration costs, and jobs can even be run in isolation when necessary.

    For storage, you want to spread out your load onto as many spindles as possible, so using a smaller active data set on a large data pool is beneficial, and your dedicated storage budget can be reserved for smaller flash clusters when latency is really an issue.

    The scale and cost savings of Hadoop attracted many businesses to use it wherever possible, and many businesses ended up using HDFS as their primary data store. This has led to cost, manageability, and flexibility issues.

    The data lake of the future supports all compute workloads

    Object storage was always useful as a way of backing up your databases, and it could be used to offload data from HDFS to a lower-cost tier as well. In fact, this is the first thing enterprises typically do.

    But, a data lake is more than just a data dump.

    A data lake is a place to collect an organization’s data for future use. Yes, it needs to store and protect data in a highly scalable, secure, and cost-effective manner, and object storage has always provided this. But when data is stored in the data lake, it is often not known how it will be used and how it will be turned into value. Thus, it is essential that the data lake include good integration with a range of data processing, analytics, and AI tools.

    Typical tools will not only include big data tools such as Hadoop, Spark, and Hive, but also deep learning frameworks (such as TensorFlow) and analytics tools (such as Pandas). In addition, it is essential for a data lake to support tools for cataloging, messaging, and transforming the data to support exploration and repurposing of data assets.

    Object storage can store data in formats native to big data and analytics tools. Your Hadoop and Spark jobs can directly access object storage through the S3a or Stocator connectors using the IBM Analytics Engine. IBM uses these techniques against IBM Cloud Object Storage for operational and analytic needs.
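
    A minimal sketch of that pattern with the s3a connector follows; the endpoint, credentials, and bucket path are placeholders, and in an IBM Analytics Engine environment much of this configuration would typically be provided for you.

    ```python
    from pyspark.sql import SparkSession

    # Placeholder endpoint and credentials for an S3-compatible object store such
    # as IBM Cloud Object Storage; the S3A connector must be available on the cluster.
    spark = (
        SparkSession.builder.appName("cos-s3a-sketch")
        .config("spark.hadoop.fs.s3a.endpoint",
                "https://s3.us-south.cloud-object-storage.appdomain.cloud")
        .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
        .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
        .getOrCreate()
    )

    # Read Parquet data straight from an object storage bucket and query it with SQL.
    df = spark.read.parquet("s3a://my-data-lake/events/")
    df.createOrReplaceTempView("events")
    spark.sql("SELECT region, COUNT(*) AS n FROM events GROUP BY region").show()
    ```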

    Just like with Hadoop, you can also leverage object storage to directly perform SQL queries. The IBM Cloud SQL Query service uses Spark internally to perform ad-hoc and OLAP queries against data stored directly in IBM Cloud Object Storage buckets.

    TensorFlow can also be used to train and deploy ML models directly using data in object storage.
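
    As one hedged illustration of that workflow, the sketch below stages a CSV out of object storage with the same S3-compatible client shown earlier and trains a small Keras model on it; the bucket, object key, column names, and model are all invented for illustration.

    ```python
    import io

    import boto3
    import pandas as pd
    import tensorflow as tf

    # Stage a (hypothetical) training CSV out of object storage.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )
    body = s3.get_object(Bucket="my-data-lake", Key="training/churn.csv")["Body"].read()
    df = pd.read_csv(io.BytesIO(body))

    features = df[["tenure_months", "monthly_spend"]].values  # assumed columns
    labels = df["churned"].values                             # assumed 0/1 labels

    # Tiny binary classifier, purely illustrative.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(features, labels, epochs=5, batch_size=32)
    ```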

    This is the true future of object storage

    As organizations look to modernize their information architecture, they save money when they build their data lake with object storage.

    For existing HDFS shops, this can be done incrementally, but you need to make sure to keep your data carefully organized as you do so to take full advantage of the rich capabilities available now and in the future.

    Authors: Wesly Leggette & Michael Factor

    Source: IBM
