70 items tagged "machine learning"

  • 2017 Investment Management Outlook

    Several major trends will likely impact the investment management industry in the coming year. These include shifts in buyer behavior as the Millennial generation becomes a greater force in the investing marketplace; increased regulation from the Securities and Exchange Commission (SEC); and the transformative effect that blockchain, robotic process automation, and other emerging technologies will have on the industry.

    Economic outlook: Is a major stimulus package in the offing?

    President-elect Donald Trump may have to depend heavily on private-sector funding to proceed with his $1 trillion infrastructure spending program, considering Congress's ongoing reluctance to increase spending. The US economy may be nearing full employment, with younger cohorts entering the labor market as more Baby Boomers retire. In addition, the prospects for a fiscal stimulus seem greater now than they were before the 2016 presidential election.

    The most likely scenario for 2017 is steady improvement and stability. Although weak foreign demand may continue to weigh on growth, domestic demand should be strong enough to provide employment for workers returning to the labor force, as the unemployment rate is expected to remain at approximately 5 percent. GDP annual growth is likely to hit a maximum of 2.5 percent. In the medium term, low productivity growth will likely put a ceiling on the economy, and by 2019, US GDP growth may be below 2 percent, despite the fact that the labor market might be at full employment. Inflation is expected to remain subdued. Interest rates are likely to rise in 2017, but should remain at historically low levels throughout the year. If the forecast holds, asset allocation shifts among cash, commodities, and fixed income may begin by the end of 2017.

    Investment industry outlook: Building upon last year’s performance
    Mutual funds and exchange-traded funds (ETFs) have experienced positive growth. Worldwide regulated funds grew at 9.1 percent CAGR versus 8.6 percent for US mutual funds and ETFs. Non-US investments grew at a slightly faster pace due to global demand. Both worldwide and US investments seemed to show declining demand in 2016 as returns remained low.

    Hedge fund assets have experienced steady growth over the past five years, even through performance swings.

    Private equity investments continued a track record of strong asset appreciation. Private equity has continued to attract investment even with current high valuations. Fundraising increased incrementally over the past five years as investors increased allocations in the sector.

    Shifts in investor buying behavior: Here come the Millennials
    Both institutional and retail customers are expected to continue to drive change in the investment management industry. The two customer segments are voicing concerns about fee sensitivity and transparency. Firms that enhance the customer experience and position advice, insight, and expertise as components of value should have a strong chance to set themselves apart from their competitors.

    Leading firms may get out in front of these issues in 2017 by developing efficient data structures to facilitate accounting and reporting and by making client engagement a key priority. On the retail front, the SEC is acting on retail investors’ behalf with reporting modernization rule changes for mutual funds. This focus on engagement, transparency, and relationships over product sales is integral to creating a strong brand as a fiduciary, and it may prove to differentiate some firms in 2017.

    Growth in index funds and other passive investments should continue as customers react to market volatility. Investors favor the passive approach in all environments, as shown by net flows. They are using passive investments alongside active investments, rather than replacing the latter with the former. Managers will likely continue to add index share classes and index-tracking ETFs in 2017, even if profitability is challenged. In addition, the Department of Labor’s new fiduciary rule is expected to promote passive investments as firms alter their product offerings for retirement accounts.

    Members of the Millennial generation—which comprises individuals born between 1980 and 2000—often approach investing differently due to their open use of social media and interactions with people and institutions. This market segment faces different challenges than earlier generations, which influences their use of financial services.

    Millennials may be less prosperous than their parents and may need to own less in order to fully fund retirement. Many start their careers burdened by student debt. They may have a negative memory of recent stock market volatility, distrust financial institutions, favor socially conscious investments, and rely on recommendations from their friends when seeking financial advice.

    Investment managers likely need to consider several steps when targeting Millennials. These include revisiting product lines, offering socially conscious “impact investments,” assigning Millennial advisers to client service teams, and employing digital and mobile channels to reach and serve this market segment.

    Regulatory developments: Seeking greater transparency, incentive alignment, and risk control
    Even with a change in leadership in the White House and at the SEC, outgoing Chair Mary Jo White’s major initiatives are expected to endure in 2017 as they seek to enhance transparency, incentive alignment, and risk control, all of which build confidence in the markets. These changes include the following:

    Reporting modernization. Passed in October 2016, this new requirement of forms, rules, and amendments for information disclosure and standardization will require development by registered investment companies (RICs). Advisers will need technology solutions that can capture data that may not currently exist from multiple sources; perform high-frequency calculations; and file requisite forms with the SEC.

    Liquidity risk management (LRM). Passed in October 2016, this rule requires the establishment of LRM programs by open-end funds (except money market) and ETFs to reduce the risk of inability to meet redemption requirements without dilution of the interests of remaining shareholders.

    Swing pricing. Also passed in October 2016, this regulation provides an option for open-end funds (except money market and ETFs) to adjust net asset values to pass the costs stemming from purchase and redemption activity to shareholders.

    Use of derivatives. Proposed in December 2015, this requires RICs and business development companies to limit the use of derivatives and put risk management measures in place.

    Business continuity and transition plans. Proposed in June 2016, this measure requires registered investment advisers to implement written business continuity and transition plans to address operational risk arising from disruptions.

    The Dodd-Frank Act, Section 956. Reproposed in May 2016, this rule prohibits compensation structures that encourage individuals to take inappropriate risks that may result in either excessive compensation or material loss.

    The DOL’s Conflict-of-Interest Rule. In 2017, firms must comply with this major expansion of the “investment advice fiduciary” definition under the Employee Retirement Income Security Act of 1974. There are two phases to compliance:

    Phase one requires compliance with investment advice standards by April 10, 2017. Distribution firms and advisers must adhere to the impartial conduct standards and provide a notice to retirement investors that acknowledges their fiduciary status and describes their material conflicts of interest. Firms must also designate a person responsible for addressing material conflicts of interest and for monitoring advisers' adherence to the impartial conduct standards.

    Phase two requires compliance with exemption requirements by January 1, 2018. Distribution firms must be in full compliance with exemptions, including contracts, disclosures, policies and procedures, and documentation showing compliance.

    Investment managers may need to create new, customized share classes driven by distributor requirements; drop distribution of certain share classes post-rule implementation; and offer more fee reductions for mutual funds.

    Financial advisers may need to take another look at fee-based models, if they are not already using them; evolve their viewpoint on share classes; consider moving to zero-revenue share lineups; and contemplate higher use of ETFs, including active ETFs with a low-cost structure and 22(b) exemption (which enables broker-dealers to set commission levels on their own).

    Retirement plan advisers may need to look for low-cost share classes (R1-R6) to be included in plan options and potentially new low-cost structures.

    Key technologies: Transforming the enterprise

    Investment management is poised to become even more driven by advances in technology in 2017, as digital innovations play a greater role than ever before.

    Blockchain. A secure and effective technology for tracking transactions, blockchain should move closer to commercial implementation in 2017. Already, many blockchain-based use cases and prototypes can be found across the investment management landscape. With testing and regulatory approvals, it might take one to two years before commercial rollout becomes more widespread.

    Big data, artificial intelligence, and machine learning. Leading asset management firms are combining big data analytics with artificial intelligence (AI) and machine learning to achieve two objectives: (1) provide insights and analysis for investment selection to generate alpha, and (2) improve cost effectiveness by leveraging expensive human analyst resources with scalable technology. Expect this trend to gain momentum in 2017.

    Robo-advisers. Fiduciary standards and regulations should drive the adoption of robo-advisers, online investment management services that provide automated portfolio management advice. Improvements in computing power are making robo-advisers more viable for both retail and institutional investors. In addition, some cutting-edge robo-adviser firms could emerge with AI-supported investment decision and asset allocation algorithms in 2017.

    Robotic process automation. Look for more investment management firms to employ sophisticated robotic process automation (RPA) tools to streamline both front- and back-office functions in 2017. RPA can automate critical tasks that require manual intervention, are performed frequently, and consume a significant amount of time, such as client onboarding and regulatory compliance.

    Change, development, and opportunity
    The outlook for the investment management industry in 2017 is one of change, development, and opportunity. Investment management firms that execute plans that help them anticipate demographic shifts, improve efficiency and decision making with technology, and keep pace with regulatory changes will likely find themselves ahead of the competition.

    Source: Deloitte.com


  • 6 Changes in the jobs of marketers and market analysts caused by AI

    Artificial intelligence is having a profound impact on the state of marketing in 2019. And AI technology will be even more influential in the years to come.

    If you’re a marketer or a business owner in today’s competitive marketplace, you’ve probably tried just about everything you can think of to maximize your success. You’ve dabbled in digital marketing, visited trade shows, paid for print advertising, and incentivized customer testimonials. It’s probably resulted in lots of stress, sleepless nights, and even CBD oil drops to give you the energy and focus to keep going.

    Marketing requires multiple approaches to succeed, so while you should stick with the things you’ve been doing successfully, you’ll also want to include artificial intelligence (AI) in your current strategy. If you haven’t already, you might be falling behind your competitors. As many as 80% of marketers believe that AI will be the most effective tool by 2020.

    AI marketing techniques increase the efficiency of your marketing and are often more effective than some of the traditional tactics you may be using. You’ll combine big data and inbound marketing to deliver a practical marketing strategy that drives conversions. Here are some ways you can apply this seemingly magical tool:

    1. Customer personas

    The most basic rule about marketing is that you can’t hope to run successful campaigns if you don’t know who you’re targeting. A good marketer will create customer personas that tell you who your target market is and how you can best service them. Personas are made at the basic level by listing demographics, interests, and other information that can help you target an audience.

    About 53% of marketers say that AI is extremely useful in identifying customers. It provides information that you might not have otherwise considered when drafting a marketing strategy. This is extremely valuable since more specific information leads to more effective marketing.

    To capture this essential data, look through your company analytics. Define the demographics of those who follow you on social media, make purchases on your website, and comment or inquire about your products/services. This data can help you build a more detailed persona designed to target the right customer base.
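    As a sketch of what that analytics pass might look like, the snippet below tallies the demographics of buyers in a small, entirely hypothetical interaction log (the field names and records are illustrative, not from any real analytics platform):

    ```python
    from collections import Counter

    # Hypothetical analytics export: one record per customer interaction.
    records = [
        {"age_band": "25-34", "channel": "social", "purchased": True},
        {"age_band": "25-34", "channel": "website", "purchased": True},
        {"age_band": "45-54", "channel": "website", "purchased": False},
        {"age_band": "18-24", "channel": "social", "purchased": True},
        {"age_band": "25-34", "channel": "email", "purchased": False},
    ]

    # Tally the demographics of the customers who actually bought something.
    buyer_ages = Counter(r["age_band"] for r in records if r["purchased"])
    top_band, top_count = buyer_ages.most_common(1)[0]
    print(top_band, top_count)
    ```

    The most common age band among buyers becomes the anchor of a data-backed persona; a real pipeline would pull these records from your analytics export and layer in interests, channels, and purchase history.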

    2. Digital advertising campaigns

    Many marketers have heard about the essentials of a digital advertising campaign in furthering sales, but they haven’t seen the results they hoped for. Artificial intelligence can significantly improve these campaigns. Once you’ve created a comprehensive view of your customer base, you’ll experience far more effective digital advertising campaigns.

    A great example of this is Facebook advertising, which is named by many marketing experts as the best bang for your buck. It allows you to create advertisements that are specifically targeted towards those who are most likely to make a purchase. However, it only works if you know exactly who your target audience is.

    Thanks to the abundance of consumer data collected by websites, social sites, and keyword searches, you’ll have all the information you need for more effective digital ads.

    3. Automated e-mail and SMS campaigns

    E-mail and SMS marketing are considered some of the best lead-generating marketing tactics out there. E-mail is the number-one source of business communication with 86% of consumers and business professionals reporting it as their preferred source. More importantly for sales, nearly 60% say it’s their most effective channel for revenue generation.

    SMS marketing, although not as popular as e-mail among marketers, boasts similar data for millennial clients, or those aged 18-36. Thanks to AI, we know that about 83% of millennials open an SMS message within 90 seconds of receiving it. Three quarters say they prefer SMS for promotions, surveys, reminders, and similar communications from brands.

    With the help of AI, we not only understand the essentials of e-mail and SMS marketing, but also have the insights to make them better. AI-enabled tools facilitate targeted campaigns to a specific audience. They handle the busy work behind these campaigns so that you can focus more on developing products and customer service.

    4. Market research

    Savvy marketers begin every new campaign with market research, gathering information about customers, effective marketing strategies, and trends in the industry. This information is invaluable for directing campaigns effectively and making products more appealing to the intended audience.

    Big data provides all that information for you, although it’s difficult to understand it all on the surface. There’s so much information that you’ll need analytics tools to decipher the most useful data that can be used to direct your marketing efforts.

    Once you’ve broadened your horizons with data-deciphering tools, you’ll have an easier time interpreting customer emotions and their perceptions of your brand. You’ll be able to make changes or continue implementing an effective strategy with this insightful information.

    5. User experience

    As business owners know, it’s all about the user experience. A good marketing campaign begins with a website and advertisements designed specifically for customers’ benefits. In fact, customers are beginning to demand information, products, and services at lightning speed. AI can help you give that to them.

    One example is the use of chatbots for customer service. When customers reach out to you on Facebook Messenger, for example, you can set up a chatbot to respond immediately and let them know you’ll be with them shortly.

    Another example is personalization that comes through AI. As you get to know your audience better, you can tailor your advertisements and website experiences to the individual. Each time they log onto your website, they’ll be greeted by name, and advertisements all over the web will show them only the things they’re interested in seeing. E-mail marketing will improve with personalization as well.

    Social media and Google advertisements are all about catering more directly to the user experience. The data you collect about individual consumers all but guarantees your ads will be shown to the right people.

    6. Sales forecasting

    Fruitful marketing drives sales, a metric that’s easier to forecast and understand with the use of AI. Marketers can use all the information derived from inbound communication and compare it to traditional metrics in order to determine updates and improvements for sales strategies.

    AI can forecast the likely results of a given strategy, so you can determine whether it’s worth the expense before committing. This can save marketers significant time and money while driving more sales and growth as a result.
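    As a minimal illustration of the forecasting idea, the sketch below fits a straight-line trend to hypothetical monthly sales figures and projects the next month. Production AI forecasting uses far richer models, but the mechanics of learning from past metrics and extrapolating are the same:

    ```python
    # Hypothetical monthly sales figures for the metric being forecast.
    sales = [120, 135, 150, 170, 185, 200]
    n = len(sales)
    xs = range(n)

    # Ordinary least-squares fit of a linear trend: y = slope * x + intercept.
    x_mean = sum(xs) / n
    y_mean = sum(sales) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den
    intercept = y_mean - slope * x_mean

    # Project one period ahead.
    next_month = slope * n + intercept
    print(round(next_month, 1))
    ```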

    AI is redefining the state of marketing

    Artificial intelligence is having a profound impact on the state of marketing in 2019. And AI technology will be even more influential in the years to come. Make sure that you understand its impact and find ways to utilize it to its full potential.

    Author: Diana Hope

    Source: SmartDataCollective

  • 9 Data issues to deal with in order to optimize AI projects

    The quality of your data affects how well your AI and machine learning models will operate. Getting ahead of these nine data issues will poise organizations for successful AI models.

    At the core of modern AI projects are machine-learning-based systems which depend on data to derive their predictive power. Because of this, all artificial intelligence projects are dependent on high data quality.

    However, obtaining and maintaining high quality data is not always easy. There are numerous data quality issues that threaten to derail your AI and machine learning projects. In particular, these nine data quality issues need to be considered and prevented before issues arise.

    1. Inaccurate, incomplete and improperly labeled data

    Inaccurate, incomplete or improperly labeled data is among the most common causes of AI project failure. These data issues can range from bad data at the source to data that has not been cleaned or prepared properly. Data might be in the incorrect fields or have the wrong labels applied.

    Data cleanliness is such an issue that an entire industry of data preparation has emerged to address it. While it might seem an easy task to clean gigabytes of data, imagine having petabytes or zettabytes of data to clean. Traditional approaches simply don't scale, which has resulted in new AI-powered tools to help spot and clean data issues.
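    A minimal sketch of what rule-based cleaning looks like before any AI tooling gets involved, using hypothetical records (real preparation pipelines apply many more rules, at far larger scale):

    ```python
    # Hypothetical raw extract: labels are inconsistent and some fields are missing.
    raw = [
        {"label": "Fraud", "amount": 120.0},
        {"label": "fraud ", "amount": 80.5},
        {"label": None, "amount": 33.0},     # missing label: unusable for training
        {"label": "legit", "amount": None},  # missing value: unusable for training
        {"label": "LEGIT", "amount": 15.0},
    ]

    # Drop incomplete rows and normalize label spelling before training.
    cleaned = [
        {"label": r["label"].strip().lower(), "amount": r["amount"]}
        for r in raw
        if r["label"] is not None and r["amount"] is not None
    ]
    print(cleaned)
    ```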

    2. Having too much data

    Since data is important to AI projects, it's a common thought that the more data you have, the better. However, when using machine learning sometimes throwing too much data at a model doesn't actually help. Therefore, a counterintuitive issue around data quality is actually having too much data.

    While it might seem like too much data can never be a bad thing, more often than not, a good portion of the data is not usable or relevant. Having to sift through a large data set to separate the useful data from the rest wastes organizational resources. In addition, all that extra data might result in data "noise" that can cause machine learning systems to learn from the nuances and variances in the data rather than the more significant overall trend.

    3. Having too little data

    On the flip side, having too little data presents its own problems. While training a model on a small data set may produce acceptable results in a test environment, bringing this model from proof of concept or pilot stage into production typically requires more data. In general, small data sets can produce results that have low complexity, are biased, or are overfitted, and will not be accurate when working with new data.

    4. Biased data

    In addition to incorrect data, another issue is that the data might be biased. The data might be selected from larger data sets in ways that don't appropriately represent the wider data set. In other cases, data might be derived from older information that was the result of human bias. Or perhaps there are issues with the way that data is collected or generated that result in a final biased outcome.

    5. Unbalanced data

    While everyone wants to try to minimize or eliminate bias from their data sets, this is much easier said than done. There are several factors that can come into play when addressing biased data. One factor can be unbalanced data. Unbalanced data sets can significantly hinder the performance of machine learning models. Unbalanced data has an overrepresentation of data from one community or group while unnecessarily reducing the representation of another group.

    An example of an unbalanced data set can be found in some approaches to fraud detection. In general, most transactions are not fraudulent, which means that only a very small portion of your data set will be fraudulent transactions. Since a model trained on this data receives significantly more examples from one class than the other, the results will be biased towards the class with more examples. That's why it's essential to conduct thorough exploratory data analysis to discover such issues early and consider solutions that can help balance data sets.
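    One simple balancing scheme is random oversampling of the minority class. The sketch below applies it to hypothetical fraud labels; real projects often use more sophisticated techniques such as SMOTE, but the goal of equalizing class representation is the same:

    ```python
    import random

    random.seed(0)  # deterministic for illustration

    # Hypothetical transaction labels: fraud is the rare class.
    labels = ["ok"] * 95 + ["fraud"] * 5
    majority = [l for l in labels if l == "ok"]
    minority = [l for l in labels if l == "fraud"]

    # Duplicate minority examples at random until both classes are the same size.
    extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    print(len(balanced), balanced.count("fraud"))
    ```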

    6. Data silos

    Related to the issue of unbalanced data is the issue of data silos. A data silo is where only a certain group or limited number of individuals at an organization have access to a data set. Data silos can result from several factors, including technical challenges or restrictions in integrating data sets as well as issues with proprietary or security access control of data.

    They are also the product of structural breakdowns at organizations where only certain groups have access to certain data as well as cultural issues where lack of collaboration between departments prevents data sharing. Regardless of the reason, data silos can limit the ability of those at a company working on artificial intelligence projects to gain access to comprehensive data sets, possibly lowering quality results.

    7. Inconsistent data

    Not all data is created the same. Just because you're collecting information, that doesn't mean that it can or should always be used. Related to the collection of too much data is the challenge of collecting irrelevant data to be used for training. Training the model on clean, but irrelevant data results in the same issues as training systems on poor quality data.

    In conjunction with the concept of data irrelevancy is inconsistent data. In many circumstances, the same records might exist multiple times in different data sets but with different values, resulting in inconsistencies. Duplicate data is one of the biggest problems for data-driven businesses. When dealing with multiple data sources, inconsistency is a big indicator of a data quality problem.

    8. Data sparsity

    Another issue is data sparsity. Data sparsity is when there is missing data or when there is an insufficient quantity of specific expected values in a data set. Data sparsity can change the performance of machine learning algorithms and their ability to calculate accurate predictions. If data sparsity is not identified, it can result in models being trained on noisy or insufficient data, reducing the effectiveness or accuracy of results.
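    A sketch of how sparsity can be detected up front, using hypothetical records: compute the fraction of missing values per field and flag any field above a chosen threshold before training begins (the 0.25 threshold here is arbitrary):

    ```python
    # Hypothetical records with holes in several fields.
    records = [
        {"age": 34, "income": 52000, "region": "north"},
        {"age": None, "income": None, "region": "south"},
        {"age": 29, "income": None, "region": None},
        {"age": 41, "income": 61000, "region": "north"},
    ]

    # Sparsity per field: the fraction of records where the value is missing.
    fields = ["age", "income", "region"]
    sparsity = {f: sum(r[f] is None for r in records) / len(records) for f in fields}

    # Flag fields whose missing rate exceeds the threshold.
    flagged = [f for f, rate in sparsity.items() if rate > 0.25]
    print(sparsity, flagged)
    ```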

    9. Data labeling issues

    Supervised machine learning models, one of the fundamental types of machine learning, require data to be labeled with correct metadata for machines to be able to derive insights. Data labeling is a hard task, often requiring human resources to put metadata on a wide range of data types. This can be both complex and expensive. One of the biggest data quality issues currently challenging in-house AI projects is the lack of proper labeling of machine learning training data. Accurately labeled data ensures that machine learning systems establish reliable models for pattern recognition, forming the foundations of every AI project. Good quality labeled data is paramount to accurately training the AI system on what data it is being fed.

    Organizations looking to implement successful AI projects need to pay attention to the quality of their data. While reasons for data quality issues are many, a common theme that companies need to remember is that in order to have data in the best condition possible, proper management is key. It's important to keep a watchful eye on the data that is being collected, run regular checks on this data, keep the data as accurate as possible, and get the data in the right format before having machine learning models learn on this data. If companies are able to stay on top of their data, quality issues are less likely to arise.

    Author: Kathleen Walch

    Source: TechTarget

  • A brief look into Reinforcement Learning

    Reinforcement Learning (RL) is a very interesting topic within Artificial Intelligence, and the concept is quite fascinating. In this post I will try to give a nice initial picture for those who want to know more about RL.

    What is Reinforcement Learning?

    Conceptually, RL is a framework that describes systems (here called agents) that are able to learn how to interact with the surrounding environment only by means of gathered experience. After each action (or interaction), the agent earns some reward, feedback from the environment that quantifies the quality of that given action.

    Humans learn by the same principle. Think about a baby walking around. For this baby, everything is new. How can a baby know that grabbing something hot is dangerous? After touching a hot object, the baby gets a painful burn. With this bad reward (or punishment), the baby learns that it is best to avoid touching anything too hot.

    It is important to point out that the terms agent and environment must be interpreted in a broader sense. It is easiest to visualize the agent as something like a robot and the environment as the place where it is situated. This analogy is valid; however, the setup can be much more complex. I like to think of the agent as a controller in a closed-loop system: it is basically an algorithm responsible for making decisions. The environment can be anything that the agent interacts with.

    A simple example to help you understand

    For a better understanding I will use a simple example here. Imagine a wheeled robot inside of a maze, trying to learn how to reach a goal marker. However, some obstacles are in its way. The aim is that the agent learns how to reach the goal without crashing into the obstacles. So, let's highlight the main components that compose this RL problem:

    • Agent: The decision making system. The robot, in our example.
    • Environment: A system which the agent interacts with. The maze, in this case.
    • State: For the agent to choose how to behave, it is necessary to estimate the environment's state. For each state, there should exist an optimal action for the agent to choose. The state can be the robot's position, or an obstacle detected by the sensors.
    • Action: This is how the agent interacts with the environment. Usually there is a finite number of actions that the agent is able to perform. In our example it is the direction that the robot should move to.
    • Reward: It is the feedback that allows the agent to know if the action was good or not. A bad reward (it can be a low or negative value) can be also interpreted as a punishment. The main goal of RL algorithms is to maximize the long-term reward. If the robot achieves the goal mark, a big reward should be given. However, if it crashes into an obstacle, a punishment should be given instead.
    • Episode: Most RL problems are episodic, meaning there must exist some event that terminates the episode. In our example the episode should finish when the robot reaches the goal or when some time limit is exceeded (to prevent the robot from standing still forever).

    Usually, the agent is assumed to have no previous knowledge about the environment. Therefore, in the beginning, actions will be chosen randomly. For each wrong decision (for example, crashing into an obstacle), the agent will be punished. Good decisions, on the other hand, will be rewarded. Learning happens as the agent figures out how to avoid situations where punishment may occur and chooses actions that will lead it to the goal.

    The reward accumulated in each episode is expected to increase and can be used to track the agent's learning progress. After many episodes, the robot should know how to behave in order to find the goal marker while avoiding any occasional obstacle, with no previous information about the environment. Of course there are many other things to be considered, but let's keep it simple for now.
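    The loop described above can be sketched with tabular Q-learning, one of the classic RL algorithms, on a toy corridor version of the maze. All the states, rewards, and hyperparameters below are illustrative choices, not part of the original example:

    ```python
    import random

    random.seed(1)  # deterministic for illustration

    # Toy corridor "maze": states 0..4, start at 0, goal marker at 4,
    # an obstacle at state 2 that punishes the agent as it passes.
    # Actions: 0 = move left, 1 = move right.
    N_STATES, GOAL, OBSTACLE = 5, 4, 2
    alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # one value per (state, action)

    for episode in range(200):
        state = 0
        for step in range(50):  # time limit also terminates the episode
            if random.random() < epsilon:        # explore: random action
                action = random.randint(0, 1)
            else:                                # exploit: best known action
                action = 0 if Q[state][0] > Q[state][1] else 1
            nxt = max(0, min(N_STATES - 1, state + (1 if action else -1)))
            # Reward: big bonus at the goal, punishment at the obstacle,
            # small step cost everywhere else.
            reward = 10 if nxt == GOAL else (-5 if nxt == OBSTACLE else -1)
            # Standard Q-learning update toward the observed reward plus
            # the discounted value of the best next action.
            Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
            state = nxt
            if state == GOAL:  # reaching the goal ends the episode
                break

    # Greedy policy after training: 1 (right) should dominate near the goal.
    policy = [0 if q[0] > q[1] else 1 for q in Q]
    print(policy)
    ```

    After training, the greedy policy in the states near the goal should point right, showing that the agent learned to head for the reward despite the punishing obstacle along the way.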

    Author: Felp Roza

    Source: Towards Data Science

  • A Shortcut Guide to Machine Learning and AI in The Enterprise


    Predictive analytics / machine learning / artificial intelligence is a hot topic – what’s it about?

    Using algorithms to help make better decisions has been the “next big thing in analytics” for over 25 years. It has been used in key areas such as fraud detection the entire time. But it’s now become a full-throated mainstream business meme that features in every enterprise software keynote — although the industry is battling with what to call it.

    It appears that terms like Data Mining, Predictive Analytics, and Advanced Analytics are considered too geeky or old for industry marketers and headline writers. The term Cognitive Computing seemed to be poised to win, but IBM’s strong association with the term may have backfired — journalists and analysts want to use language that is independent of any particular company. Currently, the growing consensus seems to be to use Machine Learning when talking about the technology and Artificial Intelligence when talking about the business uses.

    Whatever we call it, it’s generally proposed in two different forms: either as an extension to existing platforms for data analysts; or as new embedded functionality in diverse business applications such as sales lead scoring, marketing optimization, sorting HR resumes, or financial invoice matching.

    Why is it taking off now, and what’s changing?

    Artificial intelligence is now taking off because there’s a lot more data available, along with affordable, powerful systems to crunch through it all. It’s also much easier to get access to powerful algorithm-based software in the form of open-source products or embedded as a service in enterprise platforms.

    Organizations today are also more comfortable with manipulating business data, with a new generation of business analysts aspiring to become “citizen data scientists.” Enterprises can take their traditional analytics to the next level using these new tools.

    However, we’re now at the “Peak of Inflated Expectations” for these technologies according to Gartner’s Hype Cycle — we will soon see articles pushing back on the more exaggerated claims. Over the next few years, we will find out the limitations of these technologies even as they start bringing real-world benefits.

    What are the longer-term implications?

    First, easier-to-use predictive analytics engines are blurring the gap between “everyday analytics” and the data science team. A “factory” approach to creating, deploying, and maintaining predictive models means data scientists can have greater impact. And sophisticated business users can now access some of the power of these algorithms without having to become data scientists themselves.

    Second, every business application will include some predictive functionality, automating any areas where there are “repeatable decisions.” It is hard to think of a business process that could not be improved in this way, with big implications in terms of both efficiency and white-collar employment.

    Third, applications will use these algorithms on themselves to create “self-improving” platforms that get easier to use and more powerful over time (akin to how each new semi-autonomous-driving Tesla car can learn something new and pass it on to the rest of the fleet).

    Fourth, over time, business processes, applications, and workflows may have to be rethought. If algorithms are available as a core part of business platforms, we can provide people with new paths through typical business questions such as “What’s happening now? What do I need to know? What do you recommend? What should I always do? What can I expect to happen? What can I avoid? What do I need to do right now?”

    Fifth, implementing all the above will involve deep and worrying moral questions in terms of data privacy and allowing algorithms to make decisions that affect people and society. There will undoubtedly be many scandals and missteps before the right rules and practices are in place.

    What first steps should companies be taking in this area?

    As usual, the barriers to business benefit are more likely to be cultural than technical.

    Above all, organizations need to make sure they have the right technical expertise to be able to navigate the confusion of new vendor offerings, the right business knowledge to know where best to apply them, and the awareness that their technology choices may have unforeseen moral implications.

    Source: timoelliott.com, October 24, 2016


  • A three-stage approach to make your business AI ready

    A three-stage approach to make your business AI ready

    Organizations implementing artificial intelligence (AI) have increased by 270% over the last four years, according to a recent survey by Gartner. Even though the implementation of AI is a growing trend, 63% of organizations haven’t deployed this technology. What is holding them back: cost, a talent shortage, or something else?

    For many organizations it is the inability to reach the desired confidence level in the algorithm itself. Data science teams often blow their budget, time and resources on AI models that never make it out of the beginning stages of testing. And even if projects make it out of the initial stage, not all projects are successful.

    One example we saw last year was Amazon’s attempt to implement AI in its HR department. Amazon received a huge number of resumes for its thousands of open positions and hypothesized that it could use machine learning to go through all of them and find the top talent. The system was able to filter the resumes and score the candidates, but it also showed gender bias. The proof of concept had been approved, but the team hadn’t watched for bias in the training data, and the project was scrapped.

    Companies want to jump on the “Fourth Industrial Revolution” bandwagon and prove that AI will deliver ROI for their businesses. The truth is AI is in its early stages and many companies are just now getting AI ready. For machine learning (ML) project teams that are starting a project for the first time, a deliberate, three-stage approach to project evolution will pave a shortcut to success:

    1. Test the fundamental efficacy of your model with an internal Proof of Concept (POC)

    The point of a POC is to prove that in a certain case it is possible to save money or improve a customer experience using AI. You are not attempting to get the model to the level of confidence needed to deploy it, but just to say (and show) the project can work.

    A POC like this is all about testing things to see if a given approach produces results. There is no sense in making deep investments for a POC. You can use an off-the-shelf algorithm, find open source training data, purchase a sample dataset, create your own algorithm with limited functionality, and/or label your own data. Find what works for you to prove that your project will achieve the intended corporate goal. A successful POC is what is going to get the rest of the project funded.

    In the grand scheme of your AI project, this step is the easiest part of your journey. Keep in mind, as you get further into training your algorithm, you will not be able to use sample data or prepare all of your training data yourself. The subsequent improvements in model confidence required to make your system production ready will take immense amounts of training data.

    2. Prepare the data you’ll need to train your algorithm… and keep going

    In this step the hard work really begins. Let’s say that your POC using pre-labeled data got your model to 60% confidence. 60% is not ready for primetime. In theory, that could mean that 40% of the interactions your algorithm has with customers will be unsatisfactory. How do you reach a higher level of confidence? More training data.

    Proving AI will work for your business is a huge step toward implementing it and actually reaping the benefits. But don’t let it lull you into thinking the next 10% of confidence will come easily. The ugly truth is that models have an insatiable appetite for training data, and getting from 60% to 70% confidence could take more training data than it took to reach the original 60%. The needs become exponential.
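    One way to make that appetite concrete is a rough empirical rule of thumb (an assumption of this sketch, not a claim from the article): model error often falls off only as a power law of training-set size, so each extra point of confidence costs disproportionately more data. A toy extrapolation with made-up numbers:

```python
import math

# Rule-of-thumb assumption: error decays as err(n) = a * n**(-b)
# for training-set size n. Fit a and b from two measured points,
# then extrapolate how much data a target error would need.
def fit_power_law(n1, err1, n2, err2):
    b = math.log(err1 / err2) / math.log(n2 / n1)
    a = err1 * n1 ** b
    return a, b

def samples_needed(a, b, target_err):
    return (a / target_err) ** (1.0 / b)

# Made-up numbers: 10k examples gave 40% error (60% confidence),
# 20k examples gave 35% error (65% confidence).
a, b = fit_power_law(10_000, 0.40, 20_000, 0.35)
print(round(samples_needed(a, b, 0.30)))  # data needed for 70% confidence
print(round(samples_needed(a, b, 0.20)))  # for 80%: a far bigger jump again
```

    Under this assumed curve, each further 10 points of confidence multiplies the required dataset size rather than adding to it.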

    3. Watch out for possible roadblocks

    Imagine: if it took tens of thousands of labeled images to prove one use case for a successful POC, it is going to take tens of thousands of images for each use case you need your algorithm to learn. How many use cases is that? Hundreds? Thousands? There are edge cases that will continually arise, and each of those will require training data. And on and on. It is understandable that data science teams often underestimate the quantity of training data they will need and attempt to do the labeling and annotating in-house. This could also partially account for why data scientists are leaving their jobs.

    While not enough training data is one common pitfall, there are others. It is essential that you are watching for and eliminating any sample, measurement, algorithm, or prejudicial bias in your training data as you go. You’ll want to implement agile practices to catch these things early and make adjustments.

    And one final thing to keep in mind: AI labs, data scientists, AI teams, and training data are expensive. Yet a Gartner report that puts AI projects among the top three priorities also notes that AI ranks thirteenth on the list of funding priorities. Yes, you’re going to need a bigger budget.

    Author: Glen Ford

    Source: Dataconomy

  • AI and the risks of Bias

    From facial recognition for unlocking our smartphones to speech recognition and intent analysis for voice assistance, artificial intelligence is all around us today. In the business world, AI is helping us uncover new insight from data and enhance decision-making.

    For example, online retailers use AI to recommend new products to consumers based on past purchases. And, banks use conversational AI to interact with clients and enhance their customer experiences.

    However, most of the AI in use now is “narrow AI,” meaning it is only capable of performing individual tasks. In contrast, general AI – which is not available yet – can replicate human thought and function, taking emotions and judgment into account. 

    General AI is still a way off so only time will tell how it will perform. In the meantime, narrow AI does a good job at executing tasks, but it comes with limitations, including the possibility of introducing biases.  

    AI bias may come from incomplete datasets or incorrect values. Bias may also emerge through interactions over time, skewing the machine’s learning. Moreover, a sudden business change, such as a new law or business rule, or ineffective training algorithms can also cause bias. We need to understand how to recognize these biases, and design, implement, and govern our AI applications to make sure the technology generates its desired business outcomes.

    Recognize and evaluate bias – in data samples and training

    One of the main drivers of bias is the lack of diversity in the data samples used to train an AI system. Sometimes the data is not readily available or it may not even exist, making it hard to address all potential use cases.

    For instance, airlines routinely run sensor data from in-flight aircraft engines through AI algorithms to predict needed maintenance and improve overall performance. But if the machine is trained only on data from flights over the Northern Hemisphere and is then applied to a flight across sub-Saharan Africa, it will produce inaccurate results. We need to evaluate the data used to train these systems and strive for well-rounded data samples.

    Another driver of bias is incomplete training algorithms. For example, a chatbot designed to learn from conversations may be exposed to politically incorrect language. Unless trained not to, the chatbot may start using the same language with consumers, which Microsoft unfortunately learned in 2016 with its now-defunct Twitter bot, “Tay.” If a system is incomplete or skewed through learning like Tay, then teams have to adjust the use case and pivot as needed.

    Rushed training can also lead to bias. We often get excited about introducing AI into our businesses so naturally want to start developing projects and see some quick wins. 

    However, early applications can quickly expand beyond their intended purpose. Given that current AI cannot cover the gamut of human thought and judgement, eliminating emerging biases becomes a necessary task. Therefore, people will continue to be important in AI applications. Only people have the domain knowledge – acquired industry, business, and customer knowledge – needed to evaluate the data for biases and train the models accordingly.

    Diversify datasets and the teams working with AI

    Diversity is the key to mitigating AI biases – diversity in the datasets and the workforce working day to day with the models. As stated above, we need to have comprehensive, well-rounded datasets that can broadly cover all possible use cases. If there is underrepresented or disproportionate internal data, such as if the AI only has homogenous datasets, then external sources may fill in the gaps in information. This gives the machine a richer pool of data to learn and work with – and leads to predictions that are far more accurate. 

    Likewise, diversity in the teams working with AI can help mitigate bias. When there is only a small group within one department working on an application, it is easy for the thinking of these individuals to influence the system’s design and algorithms. Starting with a diverse team or introducing others into an existing group can make for a much more holistic solution. A team with varying skills, thinking, approaches and backgrounds is better equipped to recognize existing AI bias and anticipate potential bias. 

    For example, one bank used AI to automate 80 percent of its financial spreading process for public and private companies. It involved extracting numbers out of documents and formatting them into templates, while logging each step along the way. To train the AI and make sure the system pulled the right data while avoiding bias, the bank relied on a diverse team of experts with data science, customer experience, and credit decisioning expertise. Today, it applies AI to spreading on 45,000 customer accounts across 35 countries.

    Consider emerging biases and preemptively train the machine

    While AI can introduce biases, proper design (including the data samples and models) and thoughtful usage (such as governance over the AI’s learning) can help reduce and prevent them. And, in many situations, AI can actually minimize bias that would otherwise be present in human decision-making. An objective algorithm can compensate for the natural bias that a human might introduce, such as approving a customer for a loan based on their appearance.

    In recruiting, an AI program can review job descriptions to eliminate unconscious gender biases by flagging and removing words that may be construed as more masculine or feminine, and replacing them with more neutral terms. It is important to note that a domain expert needs to go in and make sure the changes are still accurate, but the system can recognize things that people could miss. 

    Bias is an unfortunate reality in today’s AI applications. But by evaluating the data samples and training algorithms and making sure that both are comprehensive and complete, we can mitigate unintended biases. We need to task diverse teams with governing the machines to prevent unwanted outcomes. With the right protocol and measures, we can ensure that AI delivers on its promise and yields the best business results.

     Author: Sanjay Srivastava

    Source: Information Management

  • Turning AI into a successful strategy: 8 tips for marketers

    Turning AI into a successful strategy: 8 tips for marketers

    Artificial Intelligence (AI) should be the most important aspect of a data strategy. More than 60 percent of marketers think so, according to research by MemSQL. But actually deploying AI turns out to be another story. How can companies turn AI into a successful strategy? Here are 8 tips for marketers:

    1. Recommendation engines

    Focus on upselling by deploying recommendation engines. Recommendation engines are built to predict what else users might find interesting based on their search terms, especially when there is a lot of choice. Recommendation engines show users information or content they might otherwise never have seen, which can ultimately lead to higher revenue from more sales. As more becomes known about a visitor, ever better recommendations can be made, and with them the chance of a sale keeps growing. For example, more than 80 percent of the shows people watch on Netflix were found through its recommendation engine. How does this work? First, Netflix collects all the data from its users. What do they watch? What did they watch last year? Which series do they watch back to back? And so on. In addition, a group of freelance and in-house taggers reviews and tags all content. Is a series set in space, or is the hero a police officer? Everything gets a tag. Machine learning algorithms are then let loose on this combined data, and viewers are divided into more than 2,000 different ‘taste groups’. The group a user is assigned to determines which viewing suggestions he or she receives.
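    A drastically simplified sketch of how such an engine can work: user-based collaborative filtering with cosine similarity. The users, items, and ratings below are invented for illustration; Netflix’s actual system is far more elaborate.

```python
import math

# Invented ratings: user -> {item: rating}. Real systems use millions of rows.
ratings = {
    "ann":  {"space_show": 5, "cop_drama": 1, "baking": 4},
    "ben":  {"space_show": 4, "cop_drama": 2, "sci_docu": 5},
    "carl": {"cop_drama": 5, "baking": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, k=1):
    """Score items the user hasn't seen by similarity-weighted ratings."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("ann"))  # ben is most similar to ann -> ['sci_docu']
```

    The more ratings a user accumulates, the sharper the similarities become, which is exactly why recommendations improve as more is known about a visitor.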

    2. Forecasting

    Good sales forecasts help companies grow. But forecasts have been made by people for years, while emotions can make or break a quarter. Without science, forecasts are often either overly optimistic or overly pessimistic. AI can help with forecasting based purely on data and facts. Thanks to AI, these data and facts can also be explained, so companies can learn from earlier forecasts and each subsequent forecast only becomes more accurate.

    3. Counter churn

    As every marketer knows, acquiring new customers is much more expensive than retaining current ones. But how do you prevent customers from unsubscribing from your services or choosing other solutions? Make sure you understand customers who want to leave your website ever better and can predict their behavior, because that is how customer loss can be minimized. When you effectively engage customers who are about to leave your website, you increase the chance of conversion. By using AI to build a predictive analytics model that detects potential ‘churners’ and then targeting them with a marketing campaign, you prevent customer loss and can make changes to your product to counter churn.
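    A churn model like the one described can be sketched, in miniature, as a logistic regression trained by gradient descent. The features (days since last login, number of support tickets, both scaled to [0, 1]) and all the data are invented assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Invented features: [days_since_login, support_tickets], scaled to [0, 1].
# Label 1 = churned. Churners log in rarely and file many tickets.
X = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.3, 0.2], [0.7, 0.7]]
y = [0, 0, 1, 1, 0, 1]
w, b = train(X, y)

def churn_risk(x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

print(churn_risk([0.95, 0.90]))  # high risk: target with a retention campaign
print(churn_risk([0.05, 0.00]))  # low risk: no action needed
```

    Leads scoring above a chosen risk threshold would then be fed into the retention campaign the article mentions.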

    4. Content generation

    Content is still king. And you can capitalize on that with Natural Language Processing (NLP), the ability of a computer program to understand human language. NLP will keep developing in the near future and become more mainstream. As computers understand language ever better, simple content can increasingly be generated automatically. That content remains enormously important is shown by research from the Content Marketing Institute (CMI): content marketing turns out to deliver three times as many leads per dollar spent as paid search! Moreover, content marketing costs less while offering greater long-term benefits at the same time.

    5. Hyper-Targeted advertising

    Customers have ever more access to information and, faced with a surplus of choices, are becoming less loyal to a product or brand. The customer experience a company offers is increasingly important, so advertisements too must feel like a personal offer. Research by Salesforce shows that 51 percent of consumers expect that by 2020 companies will anticipate their needs and actively make relevant suggestions, in other words deploy hyper-targeted advertising. So use AI for data-driven customer segmentation and make advertisements ever more relevant for each target group.

    6. Price optimization

    McKinsey estimates that some 30% of all pricing decisions companies make each year fail to arrive at the optimal price. To stay competitive, it is essential to continuously find the balance between what customers are willing to pay for a product or service and what the profit margins can bear. Large companies show that price optimization is often crucial to their success: Walmart reportedly changes its prices more than 50,000 times a month. By using AI for dynamic pricing, prices can be updated continuously based on changing factors, and you are no longer dependent on static data.
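    The dynamic-pricing idea can be illustrated with a toy sketch: fit a linear demand curve to recent (price, units sold) observations, then pick the revenue-maximizing price on a grid. The numbers are made up, and real pricing systems weigh far more factors:

```python
# Recent (price, units_sold) observations; invented numbers.
observations = [(4.0, 120), (5.0, 100), (6.0, 80), (7.0, 60)]

def fit_linear(points):
    """Least-squares fit of demand = a + b * price."""
    n = len(points)
    sx = sum(p for p, _ in points)
    sy = sum(q for _, q in points)
    sxx = sum(p * p for p, _ in points)
    sxy = sum(p * q for p, q in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

def best_price(a, b, lo=1.0, hi=10.0, steps=901):
    """Pick the revenue-maximizing price on a grid (demand floored at 0)."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda p: p * max(a + b * p, 0.0))

a, b = fit_linear(observations)  # here the fit is exact: demand = 200 - 20 * price
print(best_price(a, b))          # revenue p * (200 - 20p) peaks at p = 5.0
```

    Re-fitting the curve as new sales data arrives is what makes the pricing “dynamic” rather than based on static data.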

    7. Score better leads

    Deploy predictive lead scoring to score better leads and focus all your efforts on those most likely to buy. An IDC survey shows that 83 percent of companies already use predictive lead scoring for sales and marketing or plan to do so. And with the help of AI, major gains can be made here. Predictive lead scoring is specifically designed to determine which criteria characterize a good lead. It uses algorithms that can establish which traits converted and non-converted leads have in common. With that knowledge, lead scoring software can build and test several predictive lead scoring models, and then automatically select the model best suited to a set of sample data. Because lead scoring software also uses machine learning, lead scores become ever more accurate.
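    A bare-bones illustration of the idea behind predictive lead scoring (not any vendor’s actual algorithm): weight each trait by how strongly it separates converted from non-converted leads, then score new leads with those weights. All traits and data are invented:

```python
# Invented binary traits for each lead, aligned with FEATURES.
FEATURES = ["visited_pricing", "opened_emails", "company_size_large"]

converted     = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
not_converted = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]

def mean(values):
    return sum(values) / len(values)

def trait_weights(pos, neg):
    """Weight = how much more common a trait is among converted leads."""
    return {
        name: mean([row[j] for row in pos]) - mean([row[j] for row in neg])
        for j, name in enumerate(FEATURES)
    }

def score(lead, weights):
    return sum(weights[name] * value for name, value in zip(FEATURES, lead))

weights = trait_weights(converted, not_converted)
print(weights)
print(score([1, 1, 1], weights), score([0, 0, 0], weights))  # hot vs cold lead
```

    Production systems replace this mean-difference heuristic with trained classifiers, but the principle, learning which traits separate converters from non-converters, is the same.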

    8. Marketing attribution

    And finally: understand in detail where the best (and worst) conversions come from, so you can act on it. With conversion attribution, you can accurately measure through which website, search engine, advertisement, etc. a visitor arrived at your website and whether or not they placed an order there. With the help of machine learning, you can build a smarter marketing attribution system that can identify exactly what influences individuals to show the desired behavior; in this case, the desired behavior is making a purchase. A good AI-powered marketing attribution system can therefore drive more conversion.
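    Two classic attribution schemes can be sketched in a few lines: last-touch (all credit to the final touchpoint) and linear (credit split evenly across the journey). The journeys below are invented, and ML-based attribution goes well beyond both:

```python
from collections import defaultdict

# Invented customer journeys that each ended in a purchase.
journeys = [
    ["search_engine", "ad", "email"],
    ["ad", "email"],
    ["search_engine", "email"],
]

def last_touch(journeys):
    credit = defaultdict(float)
    for journey in journeys:
        credit[journey[-1]] += 1.0           # all credit to the final touchpoint
    return dict(credit)

def linear(journeys):
    credit = defaultdict(float)
    for journey in journeys:
        for channel in journey:
            credit[channel] += 1.0 / len(journey)  # split credit evenly
    return dict(credit)

print(last_touch(journeys))  # {'email': 3.0}: earlier channels get nothing
print(linear(journeys))      # search and ads now receive partial credit
```

    Comparing the two outputs shows why the choice of attribution model changes which channels look like the source of your best conversions.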

    Author: Hylke Visser

    Source: Emerce

  • An overview of Morgan Stanley's surge toward data quality

    An overview of Morgan Stanley's surge toward data quality

    Jeff McMillan, chief analytics and data officer at Morgan Stanley, has long worried about the risks of relying solely on data. If the data put into an institution's system is inaccurate or out of date, it will give customers the wrong advice. At a firm like Morgan Stanley, that just isn't an option.

    As a result, Morgan Stanley has been overhauling its approach to data. Chief among its goals is improving data quality in core business processing.

    “The acceleration of data volume and the opportunity this data presents for efficiency and product innovation is expanding dramatically,” said Gerard Hester, head of the bank’s data center of excellence. “We want to be sure we are ahead of the game.”

    The data center of excellence was established in 2018. Hester describes it as a hub with spokes out to all parts of the organization, including equities, fixed income, research, banking, investment management, wealth management, legal, compliance, risk, finance and operations. Each division has its own data requirements.

    “Being able to pull all this data together across the firm we think will help Morgan Stanley’s franchise internally as well as the product we can offer to our clients,” Hester said.

    The firm hopes that improved data quality will let the bank build higher quality artificial intelligence and machine learning tools to deliver insights and guide business decisions. One product expected to benefit from this is the 'next best action' the bank developed for its financial advisers.

    This next best action uses machine learning and predictive analytics to analyze research reports and market data, identify investment possibilities, and match them to individual clients’ preferences. Financial advisers can choose to use the next best action’s suggestions or not.

    Another tool that could benefit from better data is an internal virtual assistant called 'ask research'. Ask research provides quick answers to routine questions like, “What’s Google’s earnings per share?” or “Send me your latest model for Google.” This technology is currently being tested in several departments, including wealth management.

    New data strategy

    Better data quality is just one of the goals of the revamp. Another is to have tighter control and oversight over where and how data is being used, and to ensure the right data is being used to deliver new products to clients.

    To make this happen, the bank recently created a new data strategy with three pillars. The first is working with each business area to understand its data issues and begin to address them.

    “We have made significant progress in the last nine months working with a number of our businesses, specifically our equities business,” Hester said.

    The second pillar is tools and innovation that improve data access and security. The third pillar is an identity framework.

    At the end of February, the bank hired Liezel McCord to oversee data policy within the new strategy. Until recently, McCord was an external consultant helping Morgan Stanley with its Brexit strategy. One of McCord’s responsibilities will be to improve data ownership, to hold data owners accountable when the data they create is wrong and to give them credit when it’s right.

    “It’s incredibly important that we have clear ownership of the data,” Hester said. “Imagine you’re joining lots of pieces of data. If the quality isn’t high for one of those sources of data, that could undermine the work you’re trying to do.”

    Data owners will be held accountable for the accuracy, security and quality of the data they contribute and make sure that any issues are addressed.

    Trend of data quality projects

    Arindam Choudhury, the banking and capital markets leader at Capgemini, said many banks are refocusing on data as it gets distributed in new applications.

    Some are driven by regulatory concerns, he said. For example, the Basel Committee on Banking Supervision's standard number 239 (principles for effective risk data aggregation and risk reporting) is pushing some institutions to make data management changes.

    “In the first go-round, people complied with it, but as point-to-point interfaces and applications, which was not very cost effective,” Choudhury said. “So now people are looking at moving to the cloud or a data lake, they’re looking at a more rationalized way and a more cost-effective way of implementing those principles.”

    Another trend pushing banks to get their data house in order is competition from fintechs.

    “One challenge that almost every financial services organization has today is they’re being disintermediated by a lot of the fintechs, so they’re looking at assets that can be used to either partner with these fintechs or protect or even grow their business,” Choudhury said. “So they’re taking a closer look at the data access they have. Organizations are starting to look at data as a strategic asset and try to find ways to monetize it.”

    A third driver is the desire for better analytics and reports.

    "There’s a strong trend toward centralizing and figuring out, where does this data come from, what is the provenance of this data, who touched it, what kinds of rules did we apply to it?” Choudhury said. That, he said, could lead to explainable, valid and trustworthy AI.

    Author: Penny Crosman

    Source: Information-management

  • BERT-SQuAD: Interviewing AI about AI

    BERT-SQuAD: Interviewing AI about AI

    If you’re looking for a data science job, you’ve probably noticed that the field is hyper-competitive. AI can now even generate code in any language. Below, we’ll explore how AI can extract information from paragraphs to answer questions.

    One day you might be competing against AI, if AutoML isn’t that competitor already.

    What is BERT-SQuAD?

    BERT-SQuAD combines Google BERT with the Stanford Question Answering Dataset (SQuAD).

    BERT is a cutting-edge Natural Language Processing algorithm that can be used for tasks like question answering (which we’ll go into here), sentiment analysis, spam filtering, document clustering, and more. It’s all language!

    “Bidirectionality” refers to the fact that many words change meaning depending on their context, like “let’s hit the club” versus “an idea hit him”, so BERT considers words on both sides of the keyword.

    “Encoding” just means assigning numbers to characters, or turning an input like “let’s hit the club” into a machine-workable format.

    “Representations” are the general understanding of words you get by looking at many of their encodings in a corpus of text.

    “Transformers” are what you use to get from encodings to representations. This is the most complex part.
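    The “encoding” step can be illustrated with a toy whitespace tokenizer; real BERT uses a learned WordPiece vocabulary of roughly 30,000 subwords, so this is only meant to show the idea of assigning numbers to words:

```python
# Toy illustration of "encoding": turning text into numbers a model can use.
def build_vocab(corpus):
    vocab = {"[UNK]": 0}  # id 0 reserved for unknown words
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    return [vocab.get(w, vocab["[UNK]"]) for w in sentence.lower().split()]

vocab = build_vocab(["let's hit the club", "an idea hit him"])
print(encode("let's hit the club", vocab))  # [1, 2, 3, 4]
print(encode("an idea hit him", vocab))     # [5, 6, 2, 7]
```

    Note that both sentences map “hit” to the same id; it is the transformer’s bidirectional view of the surrounding words that later gives the two occurrences different representations.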

    As mentioned, BERT can be trained to work on basically any kind of language task, so SQuAD refers to the dataset we’re using to train it on a specific language task: Question answering.

    SQuAD is a reading comprehension dataset, containing questions asked by crowdworkers on Wikipedia articles, where the answer to every question is a segment of text from the corresponding passage.

    BERT-SQuAD, then, allows us to answer general questions by fishing out the answer from a body of text. It’s not cooking up answers from scratch, but rather, it understands the context of the text enough to find the specific area of an answer.

    For example, here’s a context paragraph about lasso and ridge regression:

    “You can quote ISLR’s authors Hastie, Tibshirani who asserted that, in presence of few variables with medium / large sized effect, use lasso regression. In presence of many variables with small / medium sized effect, use ridge regression.

    Conceptually, we can say, lasso regression (L1) does both variable selection and parameter shrinkage, whereas Ridge regression only does parameter shrinkage and end up including all the coefficients in the model. In presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least square estimates have higher variance. Therefore, it depends on our model objective.”

    Now, we could ask BERT-SQuAD:

    “When is Ridge regression favorable over Lasso regression?”

    And it’ll answer:

    “In presence of correlated variables”

    While I show around 100 words of context here, you could input far more context into BERT-SQuAD, like whole documents, and quickly retrieve answers. An intelligent Ctrl-F, if you will.
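    To make the “intelligent Ctrl-F” analogy concrete, here is a deliberately crude stand-in: score each sentence of the context by keyword overlap with the question and return the best one. Unlike BERT-SQuAD, it has no understanding of language at all; the stopword list, question, and context are toy assumptions:

```python
# Toy extractive "QA": pick the context sentence with the most question
# keywords. BERT-SQuAD instead scores start/end token positions using
# learned contextual representations, which is vastly more accurate.
STOPWORDS = {"is", "the", "a", "of", "over", "when", "what", "in", "and"}

def keywords(text):
    return {w.strip(".,?").lower() for w in text.split()} - STOPWORDS

def answer(question, context):
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    q = keywords(question)
    return max(sentences, key=lambda s: len(keywords(s) & q))

context = ("Lasso regression does both variable selection and parameter "
           "shrinkage. In presence of correlated variables, ridge regression "
           "might be the preferred choice.")
print(answer("When is ridge regression the preferred choice over lasso "
             "regression?", context))
```

    The toy returns the whole best-matching sentence, whereas BERT-SQuAD pins down the exact answer span within it.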

    To test the following 7 questions, I used Gradio, a library that lets developers make interfaces out of models. In this case, I used the BERT-SQuAD interface created out of Google Colab.

    I used the contexts from a Kaggle thread as inputs, and modified the questions for simplicity’s sake.

    Q1: What will happen if you don’t rotate PCA components?

    The effect of PCA will diminish

    Q2. How do you reduce the dimensions of data to reduce computation time?

    We can separate the numerical and categorical variables and remove the correlated variables

    Q3: Why is Naive Bayes “naive”?

    It assumes that all of the features in a data set are equally important and independent

    Q4: Which algorithm should you use to tackle low bias and high variance?


    Q5: How are kNN and kmeans clustering different?

    kmeans is unsupervised in nature and kNN is supervised in nature

    Q6: When is Ridge regression favorable over Lasso regression?

    In presence of correlated variables

    Q7: What is convex hull?

    Represents the outer boundaries of the two group of data points

    Author: Frederik Bussler

    Source: Towards Data Science


  • Brewer AB InBev generates extra revenue with DataRobot AI platform

    Brewer AB InBev generates extra revenue with DataRobot AI platform

    AB InBev, the world’s largest beer producer, has selected DataRobot’s AI platform. Among other things, the company uses the platform to develop AutoML models for its six most important markets, including the Netherlands. Based on consumption figures, the models deliver insights that AB InBev’s sales staff use to advise customers on which products to stock in order to increase their sales. The data-driven recommendations from the DataRobot platform generated substantial extra revenue for AB InBev in the first half of 2021.

    AB InBev’s data science team already had experience with several AI platforms. However, they ran into problems with the limited scalability of the models, and accuracy and speed were insufficient. Data quality could not be guaranteed, and sales staff often could not share insights with customers until a week and a half later, by which time the information was already outdated. To work faster and better, and to make the power of AI broadly available, the company chose DataRobot.

    “Data science is usually approached very traditionally. The model has to be built by people and be understandable, otherwise it isn’t trusted. In that sense, deploying a new technology like AutoML was a challenge. With our choice of DataRobot we really broke through a barrier,” says Renato Piai, commercial & consumer analytics director at AB InBev. “Moreover, a wrong sales recommendation at scale can cost millions, so there is also a certain ‘fear to fail’. The DataRobot platform addresses this smartly with features such as Modelgrader, Bias & Fairness Production Monitoring, and Continuous AI, so we were quickly convinced.”

    Predicting the most valuable products

    In just nine weeks, AB InBev managed to set up the infrastructure, bring the required data from the regions together into a single data pipeline, implement the DataRobot platform, and design, validate and deploy the models to production. Thanks to DataRobot's AutoML technology, AB InBev can better predict which products from its extensive assortment are most valuable to hospitality venues. As a result, sales teams in various markets were able to make better product recommendations, which led to a sharp increase in revenue in the first half of the year, even though hospitality businesses were partly closed during the lockdown.
    Piai: “We can now show our business stakeholders the concrete value of data science and AI very clearly. With the DataRobot platform we deliver insights that optimize the product portfolios of our customers. With that we have realized a sharp increase in revenue, and in terms of ROI we quickly had a positive case. I expect that once the hospitality industry fully reopens and we roll out the solution to more markets, we can continue this growth on a larger scale. We invest in innovation to invest in the future.”
    “At the moment we are already training sales teams in other countries on how to use the dashboards with insights. With DataRobot we have moved the work that our data scientists did on a laptop for years to the cloud, making it faster and virtually infinitely scalable. In addition to recommendation models, we are now also looking at using the DataRobot platform for promotions,” says Piai.
    “It is great to see how AB InBev uses the power of our platform to apply AutoML at scale and with great impact,” says Joep Gerrits, regional director Benelux at DataRobot. “With the same team, AB InBev has been able to make a considerably larger contribution to business results, and has shown what data science and AI are capable of, provided you have the right people, processes and tools in place.”
    Source: DataRobot
  • Context & Uncertainty in Web Analytics

    Context & Uncertainty in Web Analytics

    Trying to make decisions with data

    “If a measurement matters at all, it is because it must have some conceivable effect on decisions and behaviour. If we can’t identify a decision that could be affected by a proposed measurement and how it could change those decisions, then the measurement simply has no value” - Douglas W. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business, 2007

    Like many digital businesses we use web analytics tools that measure how visitors interact with our websites and apps. These tools provide dozens of simple metrics, but in our experience their value for informing a decision is close to zero without first applying a significant amount of time, effort and experience to interpret them.

    Ideally we would like to use web analytics data to make inferences about what stories our readers value and care about. We can then use this to inform a range of decisions: what stories to commission, how many articles to publish, how to spot clickbait, which headlines to change, which articles to reposition on the page, and so on.

    Finding what is newsworthy cannot and should not be as mechanistic as analysing an e-commerce store, where the connection between the metrics and what you are interested in measuring (visitors and purchases) is more direct. We know that — at best — this type of data can only weakly approximate what readers really think, and too much reliance on data for making decisions will have predictable negative consequences. However, if there is something of value the data has to say, we would like to hear it.

    Unfortunately, simple web analytics metrics fail to account for key bits of context that are vital if we want to understand whether their values are higher or lower than what we should expect (and therefore interesting).

    Moreover, there is inherent uncertainty in the data we are using, and even if we can tell whether a value is higher or lower than expected, it is difficult to tell whether this is just down to chance.

    Good analysts, familiar with their domain, often get good at doing the mental gymnastics required to account for context and uncertainty, so they can derive the insights that support good decisions. But doing this systematically when presented with a sea of metrics is rarely possible, or the best use of an analyst’s valuable sense-making skills. Rather than all their time being spent trying to identify what is unusual, it would be better if their skills could be applied to learning why something is unusual or deciding how we might improve things. But if all of our attention is focused on the lower-level ‘what’ questions, we never get to the ‘why’ or ‘how’ questions — which is where we stand a chance of getting some value from the data.


    “The value of a fact shrinks enormously without context” - Howard Wainer, Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot, 1997

    Take two metrics that we would expect to be useful — how many people start reading an article (we call this the number of readers), and how long they spend on it (we call this the average dwell time). If the metrics worked as intended, they could help us identify the stories our readers care about, but in their raw form, they tell us very little about this.

    • Readers: If an article is in a more prominent position on the website or app, more people will see it and click on it.
    • Dwell time: If an article is longer, on average, people will tend to spend more time reading it.

    Counting the number of readers tells us more about where an article was placed, and dwell time more about the length of the article than anything meaningful.

    It’s not just length and position that matter. Other context such as the section, the day of the week, how long since it was published, and whether people are reading it on our website or apps all systematically influence these numbers. So much so, that we can do a reasonable job of predicting how many readers an article will get and how long they will spend on it using this context alone, and completely ignoring the content of the article.

    From this perspective, articles are a victim of circumstance, and the raw metrics we see in so many dashboards tell us more about their circumstances than anything more meaningful — it’s all noise and very little signal.

    Knowing this, what we really want to understand is how much better or worse an article did than we would expect, given that context. In our newsroom, we do this by turning each metric (readers, dwell time and some others) into an index that compares the actual metric for an article to its expected value. We score it on a scale from 1 to 5, where 3 is expected, 4 or 5 is better than expected and 1 or 2 is worse than expected.

    Article A: a longer article in a more prominent position. Neither the number of readers nor the time they spent reading it was different from what we would expect (both indices = 3).

    Article B: a shorter article in a less prominent position. Whilst it had the expected number of readers (index = 3), they spent longer reading it than we would expect (index = 4).

    The figures above show how we present this information when looking at individual articles. Article A had 7,129 readers, more than four thousand more readers than article B, and people spent 2m 44s reading article A, almost a minute longer than article B. A simple web analytics display would pick article A as the winner on both counts by a large margin. And completely mislead us.

    Once we take into account the context, and calculate the indices, we find that both articles had about as many readers as we would expect, no more or less. Even though article B had four thousand fewer, it was in a less prominent position, and so we wouldn’t expect so many. However, people did spend longer reading article B than we would expect, given factors such as its length (it was shorter than article A).

    The indices are the output of a predictive model, which predicts a certain value (e.g. number of readers) based on the context (the features in the model). The difference between the actual value and the predicted value (the residuals in the model) then forms the basis of the index, which we rescale into the 1–5 score. An additional benefit is that we also have a common scale for different measures, and a common language for discussing these metrics across the newsroom.
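
    As a rough illustration of that mechanism, here is a minimal sketch. The 0.5 and 1.5 residual-standard-deviation thresholds are assumptions for illustration, not the newsroom's actual cut-offs:

```python
def article_index(actual, predicted, residual_sd):
    """Map an article metric to a 1-5 index: 3 is 'as expected',
    4/5 better than expected, 1/2 worse than expected.
    The residual (actual - predicted) is scaled by the model's
    residual standard deviation, then bucketed."""
    z = (actual - predicted) / residual_sd
    if z < -1.5:
        return 1
    if z < -0.5:
        return 2
    if z <= 0.5:
        return 3
    if z <= 1.5:
        return 4
    return 5

# Article with 7,129 readers where the context model predicted ~7,000:
print(article_index(7129, 7000, residual_sd=800))   # prints 3: as expected
# A dwell time well above its prediction scores better than expected:
print(article_index(164, 120, residual_sd=30))      # prints 4
```

    The bucketing deliberately throws away precision: most residuals land in the middle bucket, which matches the goal of only flagging articles that are genuinely unusual.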

    Unless we account for context, we can only really use data for inspection: ‘Just tell me which article got me the most readers, I don’t care why’. If the article only had more readers because it was at the top of the edition, we’re not learning anything useful from the data, and at worst it creates a self-fulfilling feedback loop (more prominent articles get more readers — similar to the popularity bias that can occur in recommendation engines).

    In his excellent book Upstream, Dan Heath talks about moving from using data for inspection to using data for learning. Data for learning is fundamental if we want to make better decisions. If we want to use data for learning in the newsroom, it’s incredibly useful to be able to identify which articles are performing better or worse than we would expect, but that is only ever the start. The real learning comes from what we do with that information, trying something different, and seeing if it has a positive effect on our readers’ experience.

    “Using data for inspection is so common that leaders are sometimes oblivious to any other model.” - Dan Heath, Upstream: The Quest to Solve Problems Before They Happen, 2020


    “What is not surrounded by uncertainty cannot be truth” - Richard Feynman (probably)

    The metrics presented in web analytics tools are incredibly precise. 7,129 people read the article we looked at earlier. How do we compare that to an article with 7,130 readers? What about one with 8,000? When presented with numbers, we can’t help making comparisons, even if we have no idea whether the difference matters.

    We developed our indices to avoid meaningless comparisons that didn’t take into account context, but earlier versions of our indices were displayed in a way that suggested more precision than they provided — we used a scale from 0 to 200 (with 100* as expected).

    *Originally we had 0 as our expected value, but quickly learnt that nobody likes having a negative score for their article; something below 100 is more palatable.

    Predictably, people started worrying about small differences in the index values between articles. ‘This article scored 92, but that one scored 103; that second article did better, let’s look at what we can learn from it’. Sadly the model we use to generate the index is not that accurate, and models, like data, have uncertainty associated with them. Just as people agonise over small meaningless differences in raw numbers, the same was happening with the indices, and so we moved to a simple 5-point scale.

    Most articles get a 3, which can be interpreted as ‘we don’t think there is anything to see here, the article is doing as well as we’d expect on this measure’. An index of 2 or 1 means it is doing a bit worse or a lot worse than expected, and a 4 or a 5 means it is doing a bit better or a lot better than expected.

    In this format, the indices provide just enough information for us to know how an article is doing. We use this alongside other data visualisations of indices or raw metrics where more precision is helpful, but in all cases our aim is to help focus attention on what matters, and free up time to validate these insights and decide what to do with them.

    Why are context and uncertainty so often ignored?

    These problems are not new and are covered in many great books on data sense-making, some decades old, by authors such as Howard Wainer, Stephen Few and R J Andrews.

    Practical guidance on dealing with uncertainty is easier to come by, but in our experience, thinking about context is trickier. From some perspectives this is odd. Predictive models — the bread and butter of data scientists — inherently deal with context as well as uncertainty, as do many of the tools for analysing time series data and detecting anomalies (such as statistical process control). But we are also taught to be cautious when making comparisons where there are fundamental differences between the things we are measuring. Since there are so many differences between the articles we publish, from length, position, who wrote them, what they are about, to the section and day of week on which they appear, we are left wondering whether we can or should use data to compare any of them. Perhaps the guidance on piecing all of this together to build better measurement metrics is less common, because how you deal with context is so contextual.

    Even if you set out on this path, there are many mundane reasons to fail. Often the valuable context is not readily available. It took us months to bring basic metadata about our articles — such as length and the position in which they appear — into the same system as the web analytics data. An even bigger obstacle is how much time it takes just to maintain a metrics system like this (digital products are constantly changing, and this often breaks the web analytics data, including ours as I wrote this). Ideas for improving metrics often stay as ideas or proof of concepts that are not fully rolled out as you deal with these issues.

    If you do get started, there are myriad choices to make to account for context and uncertainty — from technical to ethical — all involving value judgements. If you stick with a simple metric you can avoid these choices. Bad choices can derail you, but even if you make good ones, if you can’t adequately explain what you have done, you can’t expect the people who use the metrics to trust them. By accounting for context and uncertainty you may replace a simple (but not very useful) metric with something that is in theory more useful, but whose opaqueness causes more problems than it solves. Even worse, people may place too much trust in the metric and use it without questioning it.

    As for using data to make decisions, we will leave that for another post. But if the data is all noise and no signal, how do you present it in a clear way so the people using it understand what decisions it can help them make? The short answer is you can’t. But if the pressure is on to present some data, it is easier to passively display it in a big dashboard filled with metrics and leave it to others to work out what to do, in the same way passive language can shield you if you have nothing interesting to say (or bullshit, as Carl T. Bergstrom would call it). This is something else we have battled with, and we have tried to avoid replacing big dashboards filled with metrics with big dashboards filled with indices.

    Adding an R for reliable and an E for explainable, we end up with a checklist to help us avoid bad — or CRUDE — metrics (Context, Reliability, Uncertainty, Decision orientated, Explainability). Checklists are always useful, as it’s easy to forget what matters along the way.

    Anybody promising a quick and easy path to metrics that solve all your problems is probably trying to sell you something. In our experience, it takes time and a significant commitment by everybody involved to build something better. If you don’t have this, it’s tough to even get started.

    Non-human metrics

    Part of the joy and pain of applying these principles to metrics used for analytics — that is, numbers that are put in front of people who then use them to help them make decisions — is that it provides a visceral feedback loop when you get it wrong. If the metrics cannot be easily understood, if they don’t convey enough information (or too much), if they are biased, or if they are unreliable or if they just look plain wrong vs. everything the person using them knows, you’re in trouble. Whatever the reason, you hear about it pretty quickly, and this is a good motivator for addressing problems head on if you want to maintain trust in the system you have built.

    Many metrics are not designed to be consumed by humans. The metrics that live inside automated decision systems are subject to many of the same considerations, biases and value judgements. It is sobering to consider the number of changes and improvements we have made based on the positive feedback loop from people using our metrics in the newsroom on a daily basis. This is not the case with many automated decision systems.

    Author: Dan Gilbert

    Source: Medium

  • DataRobot active in World Economic Forum AI initiative

    DataRobot active in World Economic Forum AI initiative

    For fairness, accountability and transparency in Artificial Intelligence

    DataRobot, a fast-growing provider of enterprise AI, has joined a new World Economic Forum initiative: 'Shaping the Future of Technology Governance: Artificial Intelligence and Machine Learning'. With this initiative, the WEF aims to increase the societal impact of AI and machine learning, while safeguarding equality, privacy, transparency, accountability and social impact.
    In this initiative, the World Economic Forum brings together experts from the public and private sectors to develop and test policy frameworks that should accelerate the development of AI and machine learning and reduce its risks. The initiative works on projects such as child protection, a modern AI regulator and policy around facial recognition technology. Within the initiative, DataRobot will work closely with researchers, organizations and other stakeholders to create new insights into how AI can and should be used to improve society while safeguarding ethics and fairness.
    The collaboration with the World Economic Forum builds on DataRobot's years of efforts in the areas of trust, monitoring and ethics. In 2019 it formed a Trusted AI team, led by Ted Kwartler. This team focuses on building and delivering trustworthy and ethical AI systems, and on guiding customers in this area. These customers include some of the largest banks in the world, health insurers and various government organizations.
    “As machine learning and AI continue to evolve and adoption grows, collaboration between organizations is essential to guarantee accountability, transparency, privacy and impartiality,” says Kay Firth-Butterfield, head of AI & Machine Learning and member of the executive committee of the World Economic Forum. “This initiative brings together experts who are not only looking at the positive impact AI can have on society at large, but who also want to safeguard trust in the organizations and individuals that use the technology.”
    “We are at a pivotal technological moment in history to drive change and shape a more equitable, AI-driven future for the benefit of everyone. As a leader in enterprise AI and machine learning, it is our responsibility to play an active role in ensuring that AI is used for the betterment of society,” says Ted Kwartler, VP Trusted AI at DataRobot. “We are delighted to join forces with the World Economic Forum to mobilize the resources needed to make technology more sustainable and inclusive. We look forward to sharing our insights with the industry and working together to build a more ethical, transparent and equitable AI ecosystem.”
    Source: DataRobot
  • Deepfake Neural Networks: What are GANs?

    Deepfake Neural Networks: What are GANs?

    Generative adversarial networks (GANs) are one of the newer machine learning algorithms that data scientists are tapping into. When I first heard the term, I wondered: how can networks be adversarial? I envisioned networks with swords drawn going at it. Close… but I can assure you that no networks were harmed in the making of this article.

    Let’s break GAN down further to understand how this algorithm works and dispel the mystery behind it.

    • Generative model: A statistical model that can generate new data, which includes modeling the distribution of the data.
    • Adversarial training process: There are two networks involved in training. One network generates the data (the generator) while the other network tries to discriminate (the discriminator) whether that data is real or fake. If it is deemed to be fake, the generator is notified and tries to improve on the next batch of generated data. The two networks are therefore training against each other, hence the adversarial part.
    • Deep learning networks: Deep learning methods use neural network architectures to process data, which is why they are often referred to as deep neural networks.
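
    The adversarial loop above can be made concrete with a deliberately tiny, dependency-free sketch: a two-parameter generator tries to mimic samples from a normal distribution, while a logistic-regression discriminator tries to tell real from fake. Everything here (the 1-D data, learning rate, step count) is an illustrative assumption; real GANs use deep networks on both sides.

```python
import math
import random

random.seed(0)

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# Generator G(z) = a*z + b tries to mimic real samples ~ Normal(4, 1).
a, b = 1.0, 0.0
# Discriminator D(x) = sigmoid(w*x + c) tries to tell real from fake.
w, c = 0.0, 0.0
lr = 0.02

for _ in range(3000):
    x_real = random.gauss(4, 1)
    z = random.gauss(0, 1)
    x_fake = a * z + b

    # Discriminator step: push D(x_real) up and D(x_fake) down.
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += lr * ((1 - d_real) * x_real - d_fake * x_fake)
    c += lr * ((1 - d_real) - d_fake)

    # Generator step: adjust a, b so D(x_fake) goes up, i.e. the
    # fake sample looks more "real" to the current discriminator.
    grad = (1 - sigmoid(w * x_fake + c)) * w
    a += lr * grad * z
    b += lr * grad

fake_mean = sum(a * random.gauss(0, 1) + b for _ in range(1000)) / 1000
print(round(b, 2), round(fake_mean, 2))
```

    After training, the generator's offset b should have drifted from 0 toward the real mean of 4, which is the adversarial dynamic in miniature: the discriminator's feedback is the only training signal the generator ever sees.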

    Why on earth would you want to use a GAN?

    Now that you know what a GAN is, what do you do with it? You may have heard of deepfakes and enjoyed seeing videos of political leaders uttering some unbelievable statements. (Somedays, I wonder how we would know the difference!) Other than playing tricks on the world, GANs do have a valuable purpose.

    Deep learning models are data-hungry. What if you could just snap your fingers and grow your training data set? Well, GANs can help you create synthetic data for those deep learning models. Synthetic data, or artificial data, serves as proxy data because it maintains the statistical characteristics of the real-world data that it is based on. Synthetic data should generate observations based on existing variable distributions and preserve correlations amongst the variables in the data set.

    Deepfakes typically use image data, and the type of GAN used to create synthetic image data is called a StyleGAN. However, other types of data, such as tabular data (think rows and columns of integers, text, etc.), can also be created. This is done with a tabular GAN.

    I see lots of potential with GANs and synthetic data. Synthetic data allows you to create deep learning models when you may not have previously been able to do so. There simply may not be the volume of data available that is required, especially when you are working with new products or processes. Data may also be expensive and time-consuming to acquire from third-party resources or through data collection methods such as surveys and studies. Synthetic data may also help fulfill the gaps in underrepresented groups such as customer segments, regions, or even the different driving conditions required by computer vision models for self-driving cars. Lastly, because this data is generated, it does not impact human privacy (think GDPR and personal data sharing regulations) and is less risky should the data be breached.

    As a reminder: while synthetic data has the potential to help us progress with deep learning, the patterns in the synthetic data must be representative of the real data and should be verified as an initial step in the modeling process.

    Author: Susan Kahler

    Source: Open Data Science

  • Determining the feature set complexity

    Determining the feature set complexity

    Thoughtful predictor selection is essential for model fairness

    One common AI-related fear I’ve often heard is that machine learning models will leverage oddball facts buried in vast databases of personal information to make decisions impacting lives. For example, the fact that you used Arial font in your resume, plus your cat ownership and fondness for pierogi, will prevent you from getting a job. Associated with such concerns is fear of discrimination based on sex or race due to this kind of inference. Are such fears silly or realistic? Machine learning models are based on correlation, and any feature associated with an outcome can be used as a decision basis; there is reason for concern. However, the risks of such a scenario occurring depend on the information available to the model and on the specific algorithm used. Here, I will use sample data to illustrate differences in incorporation of incidental information in random forest vs. XGBoost models, and discuss the importance of considering missing information, appropriateness and causality in assessing model fairness.

    Feature choice — examining what might be missing as well as what’s included — is very important for model fairness. Often feature inclusion is thought of only in terms of keeping or omitting “sensitive” features such as race or sex, or obvious proxies for these. However, a model may leverage any feature associated with the outcome, and common measures of model performance and fairness will be essentially unaffected. Incidental correlated features may not be appropriate decision bases, or they may represent unfairness risks. Incidental feature risks are highest when appropriate predictors are not included in the model. Therefore, careful consideration of what might be missing is crucial.


    This article builds on results from a previous blog post and uses the same dataset and code base to illustrate the effects of missing and incidental features [1, 2]. In brief, I use a publicly-available loans dataset, in which the outcome is loan default status (binary), and predictors include income, employment length, debt load, etc. I preferentially (but randomly) sort lower-income cases into a made-up “female” category, and for simplicity consider only two gender categories (“males” and “females”). The result is that “females” on average have a lower income, but male and female incomes overlap; some females are high-income, and some males low-income. Examining common fairness and performance metrics, I found similar results whether the model relied on income or on gender to predict defaults, illustrating risks of relying only on metrics to detect bias.

    My previous blog post showed what happens when an incidental feature substitutes for an appropriate feature. Here, I will discuss what happens when both the appropriate predictor and the incidental feature are included in the data. I test two model types and show that, as might be expected, the female status contributes to predictions despite the fact that it contains no additional information. However, the incidental feature contributes much more to the random forest model than to the XGBoost model, suggesting that model selection may help reduce unfairness risk, although tradeoffs should be considered.

    Fairness metrics and global importances

    In my example, the female feature adds no information to a model that already contains income. Any reliance on female status is unnecessary and represents “direct discrimination” risk. Ideally, a machine learning algorithm would ignore such a feature in favor of the stronger predictor.

    When the incidental feature, female status, is added to either a random forest or XGBoost model, I see little change in overall performance characteristics or performance metrics (data not shown). ROC scores barely budge (as should be expected). False positive rates show very slight changes.

    Demographic parity, or the difference in loan default rates for females vs. males, remains essentially unchanged for XGBoost (5.2% vs. 5.3%) when the female indicator is included, but for random forest, this metric does change from 4.3% to 5.0%; I discuss this observation in detail below.

    Global permutation importances show weak influences from the female feature for both model types. This feature ranks 12/14 for the random forest model, and 22/26 for XGBoost (when female=1). The fact that female status is of relatively low importance may seem reassuring, but any influence from this feature is a fairness risk.

    There are no clear red flags in global metrics when female status is included in the data — but this is expected as fairness metrics are similar whether decisions are based on an incidental or causal factor [1]. The key question is: does incorporation of female status increase disparities in outcome?

    Aggregated Shapley values

    We can measure the degree to which a feature contributes to differences in group predictions using aggregated Shapley values [3]. This technique distributes differences in predicted outcome rates across features so that we can determine what drives differences for females vs. males. Calculation involves constructing a reference dataset consisting of randomly selected males, calculating Shapley feature importances for randomly-selected females using this “foil”, and then aggregating the female Shapley values (also called “phi” values).
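
    For intuition, the aggregation step can be sketched for the special case of a linear model, where Shapley values have a closed form: the contribution of feature i for one case is the coefficient times the case's deviation from the reference-group mean. The data and coefficients below are made up for illustration; real models need sampling-based Shapley estimates like those in [3].

```python
# Assumed linear model: prediction = -0.8*income + 0.5*female.
weights = {"income": -0.8, "female": 0.5}
males   = [{"income": 5.0, "female": 0.0}, {"income": 4.0, "female": 0.0}]
females = [{"income": 3.0, "female": 1.0}, {"income": 2.0, "female": 1.0}]

def predict(x):
    return sum(weights[k] * x[k] for k in weights)

# Reference ("foil"): feature means of the male group.
ref = {k: sum(m[k] for m in males) / len(males) for k in weights}

# Per-feature Shapley (phi) values for each female vs. the male
# reference, averaged over the female group.
phi = {k: sum(weights[k] * (f[k] - ref[k]) for f in females) / len(females)
       for k in weights}

# The phi values exactly decompose the female-vs-male gap in
# average predictions.
gap = (sum(predict(f) for f in females) / len(females)
       - sum(predict(m) for m in males) / len(males))
print(phi, round(gap, 2))
```

    Here the made-up female coefficient contributes phi = 0.5 to the group gap even though income drives most of it, mirroring how a small but non-zero contribution from an incidental feature shows up in the aggregated plot.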

    Results are shown below for both model types, with and without the “female” feature. The top 5 features for the model not including female are plotted, along with female status for the model that includes that feature. All other features are summed into “other”.

    Image by author

    First, note that the blue bar for female (present for the model including female status only) is much larger for random forest than for XGBoost. The bar magnitudes indicate the amount of probability difference for women vs. men that is attributed to a feature. For random forest, the female status feature increases the probability of default for females relative to males by 1.6%, compared to 0.3% for XGBoost, an ~5x difference.

    For random forest, female status ranks in the top 3 influential features in determining the difference in prediction for males vs. females, even though the feature was the 12th most important globally. The global importance does not capture this feature’s impact on fairness.

    As mentioned in the section above, the random forest model shows decreased demographic parity when female status is included in the model. This effect is also apparent in the Shapley plots — the increase due to the female bar is not compensated for by any decrease in the other bars. For XGBoost, the small contribution from female status appears to be offset by tiny decreases in contributions from other features.

    The reduced impact of the incidental feature for XGBoost compared to random forest makes sense when we think about how the algorithms work. Random forests create trees using random subsets of features, which are examined for optimal splits. Some of these initial feature sets will include the incidental feature but not the appropriate predictor, in which case incidental features may be chosen for splits. For XGBoost models, split criteria are based on improvements to a previous model. An incidental feature can’t improve a model based on a stronger predictor; therefore, after several rounds, we expect trees to include the appropriate predictor only.

    Demographic parity decreases for random forest can also be understood considering model building mechanisms. When a subset of features to be considered for a split is generated in the random forest, we essentially have two “income” features, and so it’s more likely that (direct or indirect) income information will be selected.

    The random forest model effectively uses a larger feature set than XGBoost. Although numerous features are likely to appear in both model types to some degree, XGBoost solutions will be weighted towards a smaller set of more predictive features. This reduces, but does not eliminate, risks related to incidental features for XGBoost.

    Is XGBoost fairer than Random Forest?

    In a previous blog post [4], I showed that incorporation of interactions to mitigate feature bias was more effective for XGBoost than for random forest (for one test scenario). Here, I observe that the XGBoost model is also less influenced by incidental information. Does this mean that we should prefer XGBoost for fairness reasons?

    XGBoost has advantages when both an incidental and appropriate feature are included in the data but doesn’t reduce risk when only the incidental feature is included. A random forest model’s reliance on a larger set of features may be a benefit, especially when additional features are correlated with the missing predictor.

    Furthermore, the fact that XGBoost doesn’t rely much on the incidental feature does not mean that it doesn’t contribute at all. It may be that only a smaller number of decisions are based on inappropriate information.

    Leaving fairness aside, the fact that the random forest samples a larger portion of what you might think of as the “solution space”, and relies on more predictors, may have some advantages for model robustness. When a model is deployed and faces unexpected errors in data, the random forest model may be somewhat more able to compensate. (On the other hand, if random forest incorporates a correlated feature that is affected by errors, it might be compromised while an XGBoost model remains unaffected.)

    XGBoost may have some fairness advantages, but the “fairest” model type is context-dependent, and robustness and accuracy must also be considered. I feel that fairness testing and explainability, as well as thoughtful feature choices, are probably more valuable than model type in promoting fairness.

    What am I missing?

    Fairness considerations are crucial in feature selection for models that might affect lives. There are numerous existing feature selection methods, which generally optimize accuracy or predictive power, but do not consider fairness. One question that these don’t address is “what feature am I missing?”

    A model that relies on an incidental feature that happens to be correlated with a strong predictor may appear to behave in a reasonable manner, despite making unfair decisions [1]. Therefore, it’s very important to ask yourself, “what’s missing?” when building a model. The answer to this question may involve subject matter expertise or additional research. Missing predictors thought to have causal effects may be especially important to consider [5, 6].

    Obviously, the best solution for a missing predictor is to incorporate it. Sometimes, this may be impossible. Some effects can’t be measured or are unobtainable. But you and I both know that simple unavailability seldom determines the final feature set. Instead, it’s often, “that information is in a different database and I don’t know how to access it”, or “that source is owned by a different group and they are tough to work with”, or “we could get it, but there’s a license fee”. Feature choice generally reflects time and effort — which is often fine. Expediency is great when it’s possible. But when fairness is compromised by convenience, something does need to give. This is when fairness testing, aggregated Shapley plots, and subject matter expertise may be needed to make the case to do extra work or delay timelines in order to ensure appropriate decisions.

    What am I including?

    Another key question is “what am I including?”, which can often be restated as “for what could this be a proxy?” This question can be superficially applied to every feature in the dataset but should be very carefully considered for features identified as contributing to group differences; such features can be identified using aggregated Shapley plots or individual explanations. It may be useful to investigate whether such features contribute additional information above what’s available from other predictors.
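    One simple way to check whether a suspect feature adds information beyond the other predictors is a drop-column comparison: score the model with and without it. A minimal sketch on invented synthetic data (a small AUC gap suggests the feature is largely redundant and acting as a proxy):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Invented data: the outcome depends only on income; "proxy" is a
# suspect feature correlated with income.
rng = np.random.default_rng(2)
n = 3000
income = rng.normal(size=n)
proxy = income + rng.normal(scale=0.3, size=n)
y = (income + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_full = np.column_stack([income, proxy])
X_drop = X_full[:, :1]  # same data without the suspect feature

auc_full = cross_val_score(LogisticRegression(), X_full, y, cv=5, scoring="roc_auc").mean()
auc_drop = cross_val_score(LogisticRegression(), X_drop, y, cv=5, scoring="roc_auc").mean()
print(f"AUC with proxy: {auc_full:.3f}, without: {auc_drop:.3f}")
```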

    Who am I like, and what have they done before?

    A binary classification model predicting something like loan defaults, likelihood to purchase a product, or success at a job, is essentially asking the question, “Who am I like, and what have they done before?” The word “like” here means similar values of the features included in the data, weighted according to their predictive contribution to the model. We then model (or approximate) what this cohort has done in the past to generate a probability score, which we believe is indicative of future results for people in that group.
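    A k-nearest-neighbors classifier makes this “who am I like?” framing literal: it finds the most similar past cases and reports what fraction of them had each outcome. A toy sketch with invented borrower features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented borrower features; the outcome is driven by the first feature
rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# "Who am I like?": look up the 25 most similar past cases
knn = KNeighborsClassifier(n_neighbors=25).fit(X, y)
new_applicant = np.array([[1.2, -0.3]])
prob = knn.predict_proba(new_applicant)[0, 1]
print(f"Share of similar past cases with outcome 1: {prob:.2f}")
```

    Other model families answer the same question less explicitly, weighting the notion of “similar” by each feature's predictive contribution.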

    The “who am I like?” question gets to the heart of worries that people will be judged if they eat too many pierogis, own too many cats, or just happen to be a certain race, sex, or ethnicity. The concern is that it is just not fair to evaluate individual people due to their membership in such groups, regardless of the average outcome for overall populations. What is appropriate depends heavily on context — perhaps pierogis are fine to consider in a heart attack model, but would be worrisome in a criminal justice setting.

    Our models assign people to groups — even if models are continuous, we can think of that as the limit of very little buckets — and then we estimate risks for these populations. This isn’t much different than old-school actuarial tables, except that we may be using a very large feature set to determine group boundaries, and we may not be fully aware of the meaning of information we use in the process.

    Final thoughts

    Feature choice is more than a mathematical exercise, and likely requires the judgment of subject matter experts, compliance analysts, or even the public. A data scientist’s contribution to this process should involve applying explainability techniques to populations to discover features driving group differences. We can also identify at-risk populations and ask questions about features known to have causal relationships with outcomes.

    Legal and compliance departments often focus on included features, and their concerns may be primarily related to specific types of sensitive information. Considering what’s missing from a model is not very common. However, the question, “what’s missing?” is at least as important as, “what’s there?” in confirming that models make fair and appropriate decisions.

    Data scientists can be scrappy and adept at producing models with limited or noisy data. There is something satisfying about getting a model that “works” from less than ideal information. It can be hard to admit that something can’t be done, but sometimes fairness dictates that what we have right now really isn’t enough — or isn’t enough yet.

    Author: Valerie Carey

    Source: Towards Data Science

  • Different Roles in Data Science

    Different Roles in Data Science

    In this article, we will have a look at five distinct data careers, and hopefully provide some advice on how to get one's feet wet in this convoluted field.

    The data-related career landscape can be confusing, not only to newcomers, but also to those who have spent time working within the field.

    Get in where you fit in. Focusing on newcomers, however, I find from the requests I receive from those interested in joining the data field in some capacity that there is often (and rightly) a general lack of understanding of what one needs to know in order to decide where they fit in. In this article, we will have a look at five distinct data career archetypes, and hopefully provide some advice on how to get one's feet wet in this vast, convoluted field.

    We will focus solely on industry roles, as opposed to those in research, as not to add an additional layer of complication. We will also omit executive level positions such as Chief Data Officer and the like, mostly because if you are at the point in your career that this role is an option for you, you probably don't need the information in this article.

    So here are 5 data career archetypes, replete with descriptions and information on what makes them distinct from one another.

    Figure source: KDnuggets

    Data Architect

    The data architect focuses on engineering and managing data stores and the data that reside within them.

    The data architect is concerned with managing data and engineering the infrastructure which stores and supports this data. There is generally little to no data analysis in such a role (beyond data store analysis for performance tuning), and the use of languages such as Python and R is likely not necessary. Expert-level knowledge of relational and non-relational databases, however, will undoubtedly be necessary, as will selecting data stores for the appropriate types of data being stored, and transforming and loading the data. Databases, data warehouses, and data lakes: these are among the storage landscapes in the data architect's wheelhouse. This role is likely the one with the greatest understanding of, and closest relationship with, hardware (primarily that related to storage), and will probably have the best understanding of cloud computing architectures of anyone in this article as well.

    SQL and other data query languages (such as Jaql, Hive, and Pig) will be invaluable, and will likely be among the main tools of the data architect's ongoing daily work after a data infrastructure has been designed and implemented. Verifying the consistency of this data and optimizing access to it are also important tasks for this role. A data architect will have the know-how to maintain appropriate data access rights, ensure the infrastructure's stability, and guarantee the availability of the housed data.

    This is differentiated from the data engineer role by focus: while a data engineer is concerned with building and maintaining data pipelines (see below), the data architect is focused on the data itself. There may be overlap between the two roles, however, around ETL and any task which transforms or moves data, especially from one store to another, starting data on its journey down a pipeline.

    Like other roles in this article, you might not necessarily see a "data architect" role advertised as such, and might instead see related job titles, such as:

    • Database Administrator
    • Spark Administrator
    • Big Data Administrator
    • Database Engineer
    • Data Manager

    Data Engineer

    The data engineer focuses on engineering and managing the infrastructure which supports the data and data pipelines.

    What is the data infrastructure? It's the collection of software and storage solutions that allow for the retrieval of data from a data store, the processing of data in some specified manner (or series of manners), the movement of data between tasks (as well as the tasks themselves), as data is on its way to analysis or modeling, as well as the tasks which come after this analysis or modeling. It's the pathway that the data takes as it moves along its journey from its home to its ultimate location of usefulness, and beyond. The data engineer is certainly familiar with DataOps and its integration into the data lifecycle.

    From where does the data infrastructure come? Well, it needs to be designed and implemented, and the data engineer does this. If the data architect is the automobile mechanic, keeping the car running optimally, then data engineering can be thought of as designing the roadways and service centers that the automobile requires to both get around and to make the changes needed to continue on the next section of its journey. This pair of roles is crucial to both the functioning and movement of your automobile, and of equal importance when you are driving from point A to point B.

    Truth be told, some of the technologies and skills required for data engineering and data management are similar; however, the practitioners of these disciplines use and understand these concepts at different levels. The data engineer may have a foundational knowledge of securing data access in a relational database, while the data architect has expert-level knowledge; the data architect may have some understanding of the transformation process that an organization requires its stored data to undergo prior to a data scientist performing modeling with that data, while a data engineer knows this transformation process intimately. These roles speak their own languages, but these languages are more or less mutually intelligible.

    You might find related job titles advertised for such as:

    • Big Data Engineer
    • Data Infrastructure Engineer

    Data Analyst

    The data analyst focuses on the analysis and presentation of data.

    I'm using data analyst in this context to refer to roles related strictly to the descriptive statistical analysis and presentation of data. This includes the preparation of reporting, dashboards, KPIs, business performance metrics, as well as encompassing anything referred to as "business intelligence." The role often requires interaction with (or querying of) databases, both relational and non-relational, as well as with other data frameworks.

    While the previous pair of roles were related to designing the infrastructure to manage and facilitate the movement of the data, as well as managing the data itself, data analysts are chiefly concerned with pulling from the data and working with it as it currently exists. This can be contrasted with the following two roles, machine learning engineers and data scientists, both of which focus on eliciting insights from data above and beyond what it already tells us at face value. If we can draw parallels between data scientists and inferential statisticians, then data analysts are descriptive statisticians; here is the current data, here is what it looks like, and here is what we know from it.

    Data analysts require a unique set of skills among the roles presented. Data analysts need to have an understanding of a variety of different technologies, including SQL & relational databases, NoSQL databases, data warehousing, and commercial and open-source reporting and dashboard packages. Along with having an understanding of some of the aforementioned technologies, just as important is an understanding of the limitations of these technologies. Given that a data analyst's reporting can often be ad hoc in nature, knowing what can and cannot be done without spending an inordinate amount of time on a task prior to coming to this determination is important. If an analyst knows how data is stored, and how it can be accessed, they can also know what kinds of requests — often from people with absolutely no understanding of this — are and are not serviceable, and can suggest ways in which data can be pulled in a useful manner. Knowing how to quickly adapt can be key for a data analyst, and can separate the good from the great.
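    A typical slice of this descriptive work, KPIs and business metrics aggregated by group, might look like the following pandas sketch (table and column names invented for illustration):

```python
import pandas as pd

# Invented sales table
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "revenue": [120.0, 95.0, 210.0, 180.0, 150.0],
    "orders": [10, 8, 15, 12, 11],
})

# Descriptive reporting: KPIs aggregated by group
report = sales.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    order_count=("orders", "sum"),
)
print(report)
```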

    Related job titles include:

    Machine Learning Engineer

    The machine learning engineer develops and optimizes machine learning algorithms, and implements and manages (near) production level machine learning models.

    Machine learning engineers are those crafting and using the predictive and correlative tools used to leverage data. Machine learning algorithms allow for the application of statistical analysis at high speeds, and those who wield these algorithms are not content with letting the data speak for itself in its current form. Interrogation of the data is the modus operandi of the machine learning engineer, but with enough of a statistical understanding to know when one has pushed too far, and when the answers provided are not to be trusted.

    Statistics and programming are some of the biggest assets to the machine learning researcher and practitioner. Maths such as linear algebra and intermediate calculus are useful for those employing more complex algorithms and techniques, such as neural networks, or working in computer vision, while an understanding of learning theory is also useful. And, of course, a machine learning engineer must have an understanding of the inner workings of an arsenal of machine learning algorithms (the more algorithms the better, and the deeper the understanding the better!).

    Once a machine learning model is good enough for production, a machine learning engineer may also be required to take it to production. Those machine learning engineers looking to do so will need to have knowledge of MLOps, a formalized approach for dealing with the issues arising in productionizing machine learning models.

    Related job titles:

    • Machine Learning Scientist
    • Machine Learning Practitioner
    • <specific machine learning technology> Engineer, e.g. Natural Language Processing Engineer, Computer Vision Engineer, etc.

    Data Scientist

    The data scientist is concerned primarily with the data, the insights which can be extracted from it, and the stories that it can tell.

    The data architect and data engineer are concerned with the infrastructure which houses and transports the data. The data analyst is concerned with pulling descriptive facts from the data as it exists. The machine learning engineer is concerned with advancing and employing the tools available to leverage data for predictive and correlative capabilities, as well as making the resulting models widely-available. The data scientist is concerned primarily with the data, the insights which can be extracted from it, and the stories that it can tell, regardless of what technologies or tools are needed to carry out that task.

    The data scientist may use any of the technologies listed in any of the roles above, depending on their exact role. And this is one of the biggest problems related to "data science"; the term means nothing specific, but everything in general. This role is the Jack Of All Trades of the data world, knowing (perhaps) how to get a Spark ecosystem up and running; how to execute queries against the data stored within; how to extract data and house it in a non-relational database; how to take that non-relational data and extract it to a flat file; how to wrangle that data in R or Python; how to engineer features after some initial exploratory descriptive analysis; how to select an appropriate machine learning algorithm to perform some predictive analytics on the data; how to statistically analyze the results of said predictive task; how to visualize the results for easy consumption by non-technical folks; and how to tell a compelling story to executives with the end result of the data processing pipeline just described.

    And this is but one possible set of skills a data scientist may possess. Regardless, however, the emphasis in this role is on the data, and what can be gleaned from it. Domain knowledge is often a very large component of such a role as well, which is obviously not something that can be taught here. Key technologies and skills for a data scientist to focus on are statistics (!!!), programming languages (particularly Python, R, and SQL), data visualization, and communication skills — along with everything else noted in the above archetypes.
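    The middle stages of such a pipeline, split, preprocess, model, and evaluate, can be miniaturized in a few lines of scikit-learn; a real project would add the ingestion, wrangling, and storytelling steps around it. This sketch uses a dataset bundled with scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Split, preprocess, model, evaluate on a bundled dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
acc = accuracy_score(y_test, pipe.predict(X_test))
print(f"Held-out accuracy: {acc:.3f}")
```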

    There can be a lot of overlap between the data scientist and the machine learning engineer, at least in the realm of data modeling and everything that comes along with that. However, there is often confusion as to what the differences are between these roles as well. For a very solid discussion of the relationship between data engineers and data scientists, a pair of roles which can also have significant overlap, have a look at this great article by Mihail Eric.

    Remember that these are simply archetypes of five major data profession roles, and these can vary between organizations. The flowchart in the image from the beginning of the article can be useful in helping you navigate the landscape and where you might find your role within it. Enjoy the ride to your ideal data profession!

    Author: Matthew Mayo

    Source: KDnuggets

  • Do data scientists have the right stuff for the C-suite?

    What distinguishes strong from weak leaders? This raises the question of whether leaders are born or grown. It is the classic “nature versus nurture” debate. What matters more: genes or your environment?

    This question got me to thinking about whether data scientists and business analysts within an organization can be more than just support for others. Can they become leaders similar to C-level executives?

    Three primary success factors for effective leaders

    Having knowledge means nothing without having the right types of people. One person can make a big difference. They can be someone who somehow gets it all together and changes the fabric of an organization’s culture, not through mandating change but by engaging and motivating others.

    For weak and ineffective leaders, irritating people is not only a sport but also personal entertainment. Such leaders are rarely successful.

    One way to view successful leadership is to consider that there are three primary success factors for effective leaders. They are (1) technical competence, (2) critical thinking skills, and (3) communication skills. 

    You know there is a problem when a leader says, “I don’t do that; I have people who do that.” Good leaders do not necessarily have high intelligence, good memories, deep experience, or innate abilities that they are born with. They have problem solving skills. 

    As an example, Ford Motor Company CEO Alan Mulally came to the automotive business from Boeing in the aerospace industry, without deep automotive industry experience. He has nonetheless been successful at Ford. Why? Because he is an analytical type of leader.

    Effective managers are analytical leaders who are adaptable and possess systematic and methodical ways to achieve results. It may sound corny, but they apply the “scientific method” that involves formulating hypotheses and testing to prove or disprove them. We are back to basics.

    A major contributor to the “scientific method” was the German mathematician and astronomer Johannes Kepler. In the early 1600s Kepler’s three laws of planetary motion led to the Scientific Revolution. His three laws made the complex simple and understandable, suggesting that the seemingly inexplicable universe is ultimately lawful and within the grasp of the human mind. 

    Kepler did what analytical leaders do. They rely on searching for root causes and understanding cause-and-effect logic chains. Ultimately a well-formulated strategy, talented people, and the ability to execute the executive team’s strategy through robust communications are the key to performance improvement. 

    Key characteristics of the data scientist or analyst as leader

    The popular Moneyball book and subsequent movie about baseball in the US demonstrated that traditional baseball scouts’ methods (e.g., “He’s got a good swing.”) gave way to fact-based evidence and statistical analysis. Commonly accepted traits of a leader, such as being charismatic or strong, may also be misleading.

    My belief is that the most scarce resource in an organization is human ability and competence. That is why organizations should desire that every employee be developed for growth in their skills. But having sound competencies is not enough. Key personal qualities complete the package of an effective leader. 

    For a data scientist or analyst to evolve into an effective leader, three personal quality characteristics are needed: curiosity, imagination, and creativity. The three are sequentially linked. Curious people constantly ask “Why are things the way they are?” and “Is there a better way of doing things?” Without these personal qualities, innovation will be stifled. The emergence of analytics is creating opportunities for analysts as leaders.

    Weak leaders are prone to a diagnostic bias. They can be blind to evidence and somehow believe their intuition, instincts, and gut-feel are acceptable masquerades for having fact-based information. In contrast, a curious person always asks questions. They typically love what they do. If they are also a good leader they infect others with enthusiasm. Their curiosity leads to imagination. Imagination considers alternative possibilities and solutions. Imagination in turn sparks creativity.

    Creativity is the implementation of imagination

    Good data scientists and analysts have a primary mission: to gain insights relying on quantitative techniques to result in better decisions and actions. Their imagination that leads to creativity can also result in vision. Vision is a mark of a good leader. In my mind, an executive leader has one job (aside from hiring good employees and growing them). That job is to answer the question, “Where do we want to go?” 

    After that question is answered, managers and analysts, ideally supported by the CFO’s accounting and finance team, can answer the follow-up question, “How are we going to get there?” That is where analytics are applied with the various enterprise and corporate performance management (EPM/CPM) methods that I regularly write about. EPM/CPM methods include a strategy map and its associated balanced scorecard with KPIs; customer profitability analysis; enterprise risk management (ERM); and capacity-sensitive driver-based rolling financial forecasts and plans. Collectively they ensure that the executive team’s strategy can be fully executed.

    My belief is that that other perceived characteristics of a good leader are over-rated. These include ambition, team spirit, collegiality, integrity, courage, tenacity, discipline, and confidence. They are nice-to-have characteristics, but they pale compared to the technical competency and critical thinking and communications skills that I earlier described. 

    Be analytical and you can be a leader. You can eventually serve in a C-suite role.

    Author: Gary Cokins 

    Source: Information Management

  • Essential Data Science Tools And Frameworks

    Essential Data Science Tools And Frameworks

    The fields of data science and artificial intelligence see constant growth. As more companies and industries find value in automation, analytics, and insight discovery, there comes a need for the development of new tools, frameworks, and libraries to meet increased demand. There are some tools that seem to be popular year after year, but some newer tools emerge and quickly become a necessity for any practicing data scientist. As such, here are ten trending data science tools that you should have in your repertoire in 2021.


    PyTorch

    PyTorch can be used for a variety of functions, from building neural networks to decision trees, due to its variety of extensible libraries, including Scikit-Learn, making it easy to get on board. Importantly, the platform has gained substantial popularity and established community support that can be integral in solving usage problems. A key feature of PyTorch is its use of dynamic computational graphs, which record the order of computations as defined by the model structure, in a neural network for example.


    Scikit-learn

    Scikit-learn has been around for quite a while and is widely used by in-house data science teams. Thus it’s not surprising that it’s a platform for not only training and testing NLP models but also NLP and NLU workflows. In addition to working well with many of the libraries already mentioned such as NLTK, and other data science tools, it has its own extensive library of models. Many NLP and NLU projects involve classic workflows of feature extraction, training, testing, model fit, and evaluation, meaning scikit-learn’s pipeline module fits this purpose well.
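    A classic scikit-learn NLP workflow of the kind described, feature extraction and model fit chained in one pipeline object, can be sketched as follows (the tiny corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented corpus with sentiment labels
texts = ["great product, loved it",
         "terrible, broke in a day",
         "loved the quality",
         "awful experience, terrible support"]
labels = [1, 0, 1, 0]

# Feature extraction and model fit chained in one pipeline object
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),
])
clf.fit(texts, labels)
print(clf.predict(["loved the product"]))
```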


    CatBoost

    Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For a number of years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others. CatBoost is a popular open-source gradient boosting library with a whole set of advantages, such as being able to incorporate categorical features in your data (like music genre or city) with no additional preprocessing.


    Auto-Sklearn

    AutoML automatically finds well-performing machine learning pipelines, allowing data scientists to focus their efforts on other tasks, reducing the barrier to broadly applying machine learning and making it available for everyone. Auto-Sklearn frees a machine learning user from algorithm selection and hyperparameter tuning, allowing them to use other data science tools. It leverages recent advances in Bayesian optimization, meta-learning, and ensemble construction.


    Neo4j

    As data becomes increasingly interconnected and systems increasingly sophisticated, it’s essential to make use of the rich and evolving relationships within our data. Graphs are uniquely suited to this task because they are, very simply, a mathematical representation of a network. Neo4j is a native graph database platform, built from the ground up to leverage not only data but also data relationships.


    TensorFlow

    This Google-developed framework excels where many other libraries don’t, such as with its scalable nature designed for production deployment. TensorFlow is often used for solving deep learning problems and for training and evaluation processes up to model deployment. Apart from machine learning purposes, TensorFlow can also be used for building simulations based on partial differential equations. That’s why it is considered an all-purpose framework and one of the more popular data science tools for machine learning engineers.


    Apache Airflow

    Apache Airflow is a data science tool created by the Apache community to programmatically author, schedule, and monitor workflows. The biggest advantage of Airflow is the fact that it does not limit the scope of pipelines. Airflow can be used for building machine learning models, transferring data, or managing the infrastructure. The most important thing about Airflow is the fact that it is an “orchestrator.” Airflow does not process data on its own; Airflow only tells others what has to be done and when.


    Kubernetes

    Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery. Originally developed by Google, Kubernetes progressively rolls out changes to your application or its configuration, while monitoring application health to ensure it doesn’t kill all your instances at the same time.


    Pandas

    Pandas is a popular data analysis library built on top of the Python programming language, and getting started with Pandas is an easy task. It assists with common manipulations for data cleaning, joining, sorting, filtering, deduping, and more. First released in 2009, pandas now sits at the epicenter of Python’s vast data science ecosystem and is an essential tool in the modern data analyst’s toolbox.
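    A few of those common manipulations (deduping, cleaning, joining, and sorting) in one short sketch, with tables invented for illustration:

```python
import pandas as pd

# Invented order and customer tables
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["ann", "bob", "bob", "cid"],
    "amount": [50.0, None, None, 30.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cid"],
    "segment": ["retail", "wholesale", "retail"],
})

clean = (orders
         .drop_duplicates("order_id")               # deduping
         .fillna({"amount": 0.0})                   # cleaning
         .merge(customers, on="customer")           # joining
         .sort_values("amount", ascending=False))   # sorting
print(clean)
```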


    GPT-3

    Generative Pre-trained Transformer 3 (GPT-3) is a language model that uses deep learning to produce human-like text. GPT-3 is the most recent language model coming from the OpenAI research lab team, announced in a May 2020 research paper, “Language Models are Few-Shot Learners.” While a tool like this may not be something you use daily as an NLP professional, it’s still an interesting one to know: GPT-3 can produce human-like text, answer questions, and even generate code.

    Author: Alex Landa

    Source: Open Data Science

  • Four important drivers of data science developments

    Four important drivers of data science developments

    According to the Gartner Group, digital business reached a tipping point last year, with 49% of CIOs reporting that their enterprises have already changed their business models or are in the process of doing so. When Gartner asked CIOs and IT leaders which technologies they expect to be most disruptive, artificial intelligence (AI) was the top-mentioned technology.

    AI and ML are having a profound impact on enterprise digital transformation, becoming crucial for competitive advantage and even for survival. As the field grows, four trends are emerging that will shape data science over the next five years:

    Accelerate the full data science life-cycle

    The pressure to grow ROI from AI and ML initiatives has pushed demand for new, innovative solutions that accelerate AI and data science. Data science processes today are iterative and highly manual, yet more than 40% of data science tasks are expected to be automated by 2020, according to Gartner, resulting in increased productivity and broader usage of data across the enterprise.

    Recently, automated machine learning (AutoML) has become one of the fastest-growing technologies for data science. Machine learning, however, typically accounts for only 10-20% of the entire data science process. Real pain points exist before the machine learning stage, in data preparation and feature engineering. The new concept of data science automation goes beyond machine learning automation to include data preparation, feature engineering, machine learning, and the production of full data science pipelines. With data science automation, enterprises can genuinely accelerate AI and ML initiatives.
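    Full AutoML products automate far more than this, but a minimal instance of automating the machine learning stage is a hyperparameter search over a preprocessing-plus-model pipeline. The sketch below uses scikit-learn's GridSearchCV on synthetic data; the parameter grid is an arbitrary example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real business dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Preprocessing + model packaged as one searchable unit
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The "automation": cross-validated search over candidate settings
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)

best_c = search.best_params_["clf__C"]
```

    Data science automation extends this same idea upstream, searching over candidate features and data transformations, not just model settings.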

    Leverage existing resources for democratization

    Despite substantial investments in data science across many industries, the scarcity of data science skills and resources often limits the advancement of AI and ML projects in organizations. The shortage of data scientists has created a challenge for anyone implementing AI and ML initiatives, forcing a closer look at how to build and leverage data science resources.

    Other than the need for highly specialized technical skills and mathematical aptitude, data scientists must also couple these skills with domain/industry knowledge that is relevant to a specific business area. Domain knowledge is required for problem definition and result validation and is a crucial enabler to deliver business value from data science. Relying on 'data science unicorns' that have all these skill sets is neither realistic nor scalable.

    Enterprises are focusing on repurposing existing resources as 'citizen' data scientists. The rise of AutoML and data science automation can unlock data science to a broader user base and allow the practice to scale. By empowering citizen data scientists to execute standard use cases, skilled data scientists can focus on high-impact, technically challenging projects that produce greater value.

    Augment insights for greater transparency

    As more organizations are adopting data science in their business process, relying on AI-derived recommendations that lack transparency is becoming problematic. Increased regulatory oversight like the GDPR has exacerbated the problem. Transparent insights make AI models more 'oversight' friendly and have the added benefit of being far more actionable.

    White-box AI models help organizations maintain accountability in data-driven decisions and allow them to live within the boundaries of regulations. The challenge is the need for high-quality and transparent inputs (aka 'features'), often requiring multiple manual iterations to achieve the needed transparency. Data science automation allows data scientists to explore millions of hypotheses and augments their ability to discover transparent and predictive features as business insights.

    Operationalize data science in business

    Although ML models are often tiny pieces of code, when models are finally deemed ready for production, deploying them can be complicated and problematic. For example, since data scientists are not software engineers, the quality of their code may not be production-ready. Data scientists often validate models with down-sampled datasets in lab environments, and those models may not scale to production-size datasets. Also, the performance of deployed models decreases as data invariably changes, making ongoing model maintenance pivotal to continuously extracting business value from AI and ML models. Data and feature pipelines are much bigger and more complex than the ML models themselves, and operationalizing those pipelines is even more complicated. One promising approach is to leverage concepts from continuous deployment through APIs. Data science automation can generate APIs to execute the full data science pipeline, accelerating deployments while also providing an ongoing connection to development systems to accelerate the optimization and maintenance of models.
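    As a rough illustration of the "pipeline behind an API" idea, here is a minimal sketch using Flask. The endpoint name, input fields, and the stand-in preprocessing and model logic are all hypothetical; a generated API would wrap the real feature pipeline and trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def preprocess(record):
    # Stand-in for the data/feature pipeline: derive model-ready features
    return [float(record["age"]) / 100.0, float(record["income"]) / 1e5]

def predict(features):
    # Stand-in for the trained model: a fixed linear score
    score = 0.4 * features[0] + 0.6 * features[1]
    return 1 if score > 0.5 else 0

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # One API call executes the full pipeline: raw record -> features -> prediction
    record = request.get_json()
    return jsonify({"prediction": predict(preprocess(record))})
```

    Because the whole pipeline sits behind one endpoint, the pipeline can be re-optimized or retrained without changing anything on the caller's side.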

    Data science is at the heart of AI and ML. While the promise of AI is real, the problems associated with data science are also real. Through better planning, closer cooperation with lines of business, and automation of the more tedious and repetitive parts of the process, data scientists can finally begin to focus on what to solve rather than how to solve it.

    Author: Daniel Gutierrez

    Source: Insidebigdata

  • Getting Your Machine Learning Model To Production: Why Does It Take So Long?

    Getting Your Machine Learning Model To Production: Why Does It Take So Long?

    A Gentle Guide to the complexities of model deployment, and integrating with the enterprise application and data pipeline. What the Data Scientist, Data Engineer, ML Engineer, and ML Ops do, in Plain English.

    Let’s say we’ve identified a high-impact business problem at our company, built an ML (machine learning) model to tackle it, trained it, and are happy with the prediction results. This was a hard problem to crack that required much research and experimentation. So we’re excited about finally being able to use the model to solve our user’s problem!

    However, what we’ll soon discover is that building the model itself is only the tip of the iceberg. The bulk of the hard work needed to actually put this model into production is still ahead of us. I’ve found that this second stage can take up to 90% of the project’s time and effort.

    So what does this stage consist of? And why does it take so much time? That is the focus of this article.

    Over several articles, my goal is to explore various facets of an organization’s ML journey as it goes all the way from deploying its first ML model to setting up an agile development and deployment process for rapid experimentation and delivery of ML projects. In order to understand what needs to be done in the second stage, let’s first see what gets delivered at the end of the first stage.

    What does the Model Building and Training phase deliver?

    Models are typically built and trained by the Data Science team. When it is ready, we have model code in Jupyter notebooks along with trained weights.

    • It is often trained using a static snapshot of the dataset, perhaps in a CSV or Excel file.
    • The snapshot was probably a subset of the full dataset.
    • Training is run on a developer’s local laptop, or perhaps on a VM in the cloud.

    In other words, the development of the model is fairly standalone and isolated from the company’s application and data pipelines.

    What does “Production” mean?

    When a model is put into production, it operates in two modes:

    • Real-time Inference — perform online predictions on new input data, on a single sample at a time
    • Retraining — for offline retraining of the model nightly or weekly, with a current refreshed dataset

    The requirements and tasks involved for these two modes are quite different. This means that the model gets put into two production environments:

    • A Serving environment for performing Inference and serving predictions
    • A Training environment for retraining

    Real-time Inference and Retraining in Production (Source: Author)

    Real-time Inference is what most people would have in mind when they think of “production”. But there are also many use cases that do Batch Inference instead of Real-time.

    • Batch Inference — perform offline predictions nightly or weekly, on a full dataset

    Batch Inference and Retraining in Production (Source: Author)

    For each of these modes separately, the model now needs to be integrated with the company’s production systems — business application, data pipeline, and deployment infrastructure. Let’s unpack each of these areas to see what they entail.

    We’ll start by focusing on Real-time Inference, and after that, we’ll examine the Batch cases (Retraining and Batch Inference). Some of the complexities that come up are unique to ML, but many are standard software engineering challenges.

    Inference — Application Integration

    A model usually is not an independent entity. It is part of a business application for end users eg. a recommender model for an e-commerce site. The model needs to be integrated with the interaction flow and business logic of the application.

    The application might get its input from the end-user via a UI and pass it to the model. Alternately, it might get its input from an API endpoint, or from a streaming data system. For instance, a fraud detection algorithm that approves credit card transactions might process transaction input from a Kafka topic.

    Similarly, the output of the model gets consumed by the application. It might be presented back to the user in the UI, or the application might use the model’s predictions to make some decisions as part of its business logic.

    Inter-process communication between the model and the application needs to be built. For example, we might deploy the model as its own service accessed via an API call. Alternately, if the application is also written in the same programming language (eg. Python), it could just make a local function call to the model code.

    This work is usually done by the Application Developer working closely with the Data Scientist. As with any integration between modules in a software development project, this requires collaboration to ensure that assumptions about the formats and semantics of the data flowing back and forth are consistent on both sides. We all know the kinds of issues that can crop up. eg. If the model expects a numeric ‘quantity’ field to be non-negative, will the application do the validation before passing it to the model? Or is the model expected to perform that check? In what format is the application passing dates and does the model expect the same format?
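    Such data-contract assumptions can be made explicit in a small validation layer shared between application and model. A minimal sketch, with hypothetical field names and an assumed ISO 8601 date format:

```python
from datetime import datetime

def validate_model_input(record: dict) -> dict:
    """Check the application/model data contract before calling the model."""
    quantity = record["quantity"]
    if not isinstance(quantity, (int, float)) or quantity < 0:
        raise ValueError("'quantity' must be a non-negative number")
    # Both sides agree on one date format, e.g. ISO 8601 (raises on mismatch)
    datetime.strptime(record["order_date"], "%Y-%m-%d")
    return record
```

    Running checks like this at the boundary surfaces contract violations immediately, rather than as silently wrong predictions.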


    Real-time Inference Lifecycle (Source: Author)

    Inference — Data Integration

    The model can no longer rely on a static dataset that contains all the features it needs to make its predictions. It needs to fetch ‘live’ data from the organization’s data stores.

    These features might reside in transactional data sources (eg. a SQL or NoSQL database), or they might be in semi-structured or unstructured datasets like log files or text documents. Perhaps some features are fetched by calling an API, either an internal microservice or application (eg. SAP) or an external third-party endpoint.

    If any of this data isn’t in the right place or in the right format, some ETL (Extract, Transform, Load) jobs may have to be built to pre-fetch the data to the store that the application will use.
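    A minimal sketch of fetching 'live' features at inference time, using an in-memory SQLite database as a stand-in for a transactional store (the table and column names are hypothetical):

```python
import sqlite3

import pandas as pd

# In-memory database standing in for a production transactional store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, tenure_months INTEGER, plan TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 12, "basic"), (2, 30, "pro")])
conn.commit()

def fetch_features(customer_id: int) -> pd.DataFrame:
    """Fetch the live feature values for one customer at prediction time."""
    return pd.read_sql(
        "SELECT tenure_months, plan FROM customers WHERE id = ?",
        conn, params=(customer_id,),
    )

features = fetch_features(2)
```

    In production, this query would hit the real data store, which is exactly where the latency, access, and error-handling questions below come in.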

    Dealing with all the data integration issues can be a major undertaking. For instance:

    • Access requirements — how do you connect to each data source, and what are its security and access control policies?
    • Handle errors — what if the request times out, or the system is down?
    • Match latencies — how long does a query to the data source take, versus how quickly do we need to respond to the user?
    • Sensitive data — is there personally identifiable information that has to be masked or anonymized?
    • Decryption — does data need to be decrypted before the model can use it?
    • Internationalization — can the model handle the necessary character encodings and number/date formats?
    • and many more…
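    To make one of these concrete, handling errors from a flaky data source often comes down to a retry-with-fallback wrapper. A minimal sketch, with a hypothetical fetch function:

```python
import time

def fetch_with_retry(fetch, retries=3, delay=0.01, fallback=None):
    """Call a flaky data-source `fetch`; retry with backoff, then fall back."""
    for attempt in range(retries):
        try:
            return fetch()
        except TimeoutError:
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    return fallback  # e.g. a cached or neutral feature value

# Hypothetical source that times out twice before succeeding
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return {"feature": 42}

result = fetch_with_retry(flaky_fetch)
```

    The fallback value matters: the model still needs something sensible to predict with when the source stays down.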

    This tooling gets built by a Data Engineer. For this phase as well, they would interact with the Data Scientist to ensure that the assumptions are consistent and the integration goes smoothly. eg. Is the data cleaning and pre-processing done by the model enough, or do any more transformations have to be built?

    Inference — Deployment

    It is now time to deploy the model to the production environment. All the factors that one considers with any software deployment come up:

    • Model Hosting — on a mobile app? In an on-premise data center or on the cloud? On an embedded device?
    • Model Packaging — what dependent software and ML libraries does it need? These are typically different from your regular application libraries.
    • Co-location — will the model be co-located with the application? Or as an external service?
    • Model Configuration settings — how will they be maintained and updated?
    • System resources required — CPU, RAM, disk, and most importantly GPU, since that may need specialized hardware.
    • Non-functional requirements — volume and throughput of request traffic? What is the expected response time and latency?
    • Auto-Scaling — what kind of infrastructure is required to support it?
    • Containerization — does it need to be packaged into a Docker container? How will container orchestration and resource scheduling be done?
    • Security requirements — credentials to be stored, private keys to be managed in order to access data?
    • Cloud Services — if deploying to the cloud, is integration with any cloud services required eg. (Amazon Web Services) AWS S3? What about AWS access control privileges?
    • Automated deployment tooling — to provision, deploy and configure the infrastructure and install the software.
    • CI/CD — automated unit or integration tests to integrate with the organization’s CI/CD pipeline.

    The ML Engineer is responsible for implementing this phase and deploying the application into production. Finally, you’re able to put the application in front of the customer, which is a significant milestone!

    However, it is not yet time to sit back and relax 😃. Now begins the ML Ops task of monitoring the application to make sure that it continues to perform optimally in production.

    Inference — Monitoring

    The goal of monitoring is to check that your model continues to make correct predictions in production, with live customer data, as it did during development. It is quite possible that your metrics will not be as good.

    In addition, you need to monitor all the standard DevOps application metrics just like you would for any application — latency, response time, throughput as well as system metrics like CPU utilization, RAM, etc. You would run the normal health checks to ensure uptime and stability of the application.

    Equally importantly, monitoring needs to be an ongoing process, because there is every chance that your model’s evaluation metrics will deteriorate with time. Compare your evaluation metrics to past metrics to check that there is no deviation from historical trends.

    This can happen because of data drift.
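    A minimal sketch of this kind of check: compare the current evaluation metric against its historical baseline and flag deterioration (the numbers and tolerance are illustrative):

```python
def metric_alert(history, current, tolerance=0.05):
    """Flag when a live evaluation metric falls below its historical trend."""
    baseline = sum(history) / len(history)
    return current < baseline - tolerance

# Hypothetical weekly accuracy measurements from production
weekly_accuracy = [0.91, 0.90, 0.92, 0.91]

assert not metric_alert(weekly_accuracy, 0.90)  # within normal variation
assert metric_alert(weekly_accuracy, 0.80)      # deterioration: investigate
```

    A real setup would track several metrics this way and route alerts to the ML Ops team for investigation.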

    Inference — Data Validation

    As time goes on, your data will evolve and change — new data sources may get added, new feature values will get collected, new customers will input data with different values than before. This means that the distribution of your data could change.

    So validating your model with current data needs to be an ongoing activity. It is not enough to look only at evaluation metrics for the global dataset. You should evaluate metrics for different slices and segments of your data as well. It is very likely that as your business evolves and as customer demographics, preferences, and behavior change, your data segments will also change.
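    A minimal sketch of per-slice evaluation, using a hypothetical prediction log with a segment column:

```python
import pandas as pd

# Hypothetical evaluation log: one row per prediction, with its segment
log = pd.DataFrame({
    "segment": ["new", "new", "returning", "returning", "returning"],
    "correct": [1, 0, 1, 1, 1],
})

overall = log["correct"].mean()                      # global accuracy
by_slice = log.groupby("segment")["correct"].mean()  # accuracy per segment
```

    Here the global number looks healthy while one segment performs much worse, which is precisely what slice-level evaluation is meant to catch.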

    The data assumptions that were made when the model was first built may no longer hold true. To account for this, your model needs to evolve as well. The data cleaning and pre-processing that the model does might also need to be updated.

    And that brings us to the second production mode — that of Batch Retraining on a regular basis so that the model continues to learn from fresh data. Let’s look at the tasks required to set up Batch Retraining in production, starting with the development model.


    Retraining Lifecycle (Source: Author)

    Retraining — Data Integration

    When we discussed Data Integration for Inference, it involved fetching a single sample of the latest ‘live’ data. On the other hand, during Retraining, we need to fetch a full dataset of historical data. Also, this Retraining happens in batch mode, say every night or every week.

    Historical doesn’t necessarily mean “old and outdated” data — it could include all of the data gathered until yesterday, for instance.

    This dataset would typically reside in an organization’s analytics stores, such as a data warehouse or data lake. If some data isn’t present there, you might need to build additional ETL jobs to transfer that data into the warehouse in the required format.

    Retraining — Application Integration

    Since we’re only retraining the model by itself, the whole application is not involved. So no Application Integration work is needed.

    Retraining — Deployment

    Retraining is likely to happen with a massive amount of data, probably far larger than what was used during development.

    You will need to figure out the hardware infrastructure needed to train the model — what are its GPU and RAM requirements? Since training needs to complete in a reasonable amount of time, it will need to be distributed across many nodes in a cluster, so that training happens in parallel. Each node will need to be provisioned and managed by a Resource Scheduler so that hardware resources can be efficiently allocated to each training process.

    The setup will also need to ensure that these large data volumes can be efficiently transferred to all the nodes on which the training is being executed.

    And before we wrap up, let’s look at our third production use case — the Batch Inference scenario.

    Batch Inference

    Often, the Inference does not have to run ‘live’ in real-time for a single data item at a time. There are many use cases for which it can be run as a batch job, where the output results for a large set of data samples are pre-computed and cached.

    The pre-computed results can then be used in different ways depending on the use case. eg.

    • They could be stored in the data warehouse for reporting or for interactive analysis by business analysts.
    • They could be cached and displayed by the application to the user when they log in next.
    • Or they could be cached and used as input features by another downstream application.

    For instance, a model that predicts the likelihood of customer churn (ie. they stop buying from you) can be run every week or every night. The results could then be used to run a special promotion for all customers who are classified as high risks. Or they could be presented with an offer when they next visit the site.
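    A minimal sketch of such a nightly batch job: score every customer with a stand-in churn model and cache the results for downstream use (the scoring rule and thresholds are invented for illustration):

```python
def churn_score(customer):
    # Hypothetical stand-in model: short-tenure customers churn more often
    return 0.9 if customer["tenure_months"] < 6 else 0.2

# Hypothetical customer records pulled from the warehouse
customers = [
    {"id": 1, "tenure_months": 3},
    {"id": 2, "tenure_months": 24},
]

# Precompute and cache all scores (in practice: write to a warehouse table)
cache = {c["id"]: churn_score(c) for c in customers}
high_risk = [cid for cid, score in cache.items() if score > 0.5]
```

    A downstream application can then read `high_risk` from the cache to target its promotion, without ever calling the model at request time.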

    A Batch Inference model might be deployed as part of a workflow with a network of applications. Each application is executed after its dependencies have completed.

    Many of the same application and data integration issues that come up with Real-time Inference also apply here. On the other hand, Batch Inference does not have the same response-time and latency demands. But, it does have high throughput requirements as it deals with enormous data volumes.


    As we have just seen, there are many challenges and a significant amount of work to put a model in production. Even after the Data Scientists ready a trained model, there are many roles in an organization that all come together to eventually bring it to your customers and to keep it humming month after month. Only then does the organization truly get the benefit of harnessing machine learning.

    We’ve now seen the complexity of building and training a real-world model, and then putting it into production. In the next article, we’ll take a look at how the leading-edge tech companies have addressed these problems to churn out ML applications rapidly and smoothly.

    And finally, if you liked this article, you might also enjoy my other series on Transformers, Audio Deep Learning, and Geolocation Machine Learning.

    Author: Ketan Doshi

    Source: Towards Data Science

  • Hé Data Scientist! Are you a geek, nerd or suit?

    Data scientists are known for their unique skill sets. While thousands of compelling articles have been written about what a data scientist does, most of these articles fall short in examining what happens after you’ve hired a new data scientist to your team.

    The onboarding process for your data scientist should be based on the skills and areas of improvement you’ve identified for the tasks you want them to complete. Here’s how we do it at Elicit.

    We’ve all seen the data scientist Venn diagrams over the past few years, which include three high-level types of skills: programming, statistics/modeling, and domain expertise. Some even feature the ever-elusive “unicorn” at the center.

    While these diagrams provide us with a broad understanding of the skillset required for the role in general, they don’t have enough detail to differentiate data scientists and their roles inside a specific organization. This can lead to poor hires and poor onboarding experiences.

    If the root of what a data scientist does and is capable of is not well understood, then both parties are in for a bad experience. Near the end of 2016, Anand Ramanathan wrote a post that really stuck with me called The Data Science Delusion (medium.com/@anandr42/the-data-science-delusion-7759f4eaac8e). In it, Ramanathan talks about how within each layer of the data science Venn diagram there are degrees of understanding and capability.

    For example, Ramanathan breaks down the modeling aspect into four quadrants based on modeling difficulty and system complexity, explaining that not every data scientist has to be capable in all four quadrants—that different problems call for different solutions and different skillsets. 

    For example, if I want to understand customer churn, I probably don’t need a deep learning solution. Conversely, if I’m trying to recognize images, a logistic regression probably isn’t going to help me much.

    In short, you want your data scientist to be skilled in the specific areas that role will be responsible for within the context of your business.

    Ramanathan’s article also made me reflect on our data science team here at Elicit. Anytime we want to solve a problem internally or with a client we use our "Geek Nerd Suit" framework to help us organize our thoughts.

    Basically, it states that for any organization to run at optimal speed, the technology (Geek), analytics (Nerd), and business (Suit) functions must be collaborating and making decisions in lockstep. Upon closer inspection, the data science Venn diagram is actually composed of Geek (programming), Nerd (statistics/modeling), and Suit (domain expertise) skills.

    But those themes are too broad; they still lack the detail needed to differentiate the roles of a data scientist. And we’d heard this from our team internally: in a recent employee survey, the issue of career advancement, and more importantly, skills differentiation, cropped up from our data science team.

    As a leadership team, we always knew the strengths and weaknesses of our team members, but for their own sense of career progression they were asking us to be more specific and transparent about them. This pushed us to go through the exercise of taking a closer look at our own evaluation techniques, and resulted in a list of specific competencies within the Geek, Nerd, and Suit themes. We now use these competencies both to assess new hires and to help them develop in their careers once they’ve joined us.

    For example, under the Suit responsibilities we define a variety of competencies that, amongst other things, include adaptability, business acumen, and communication. Each competency then has an explicit set of criteria associated with it that illustrates a different level of mastery within that competency.

    We’ve established four levels of differentiation: “entry level,” “intermediate,” “advanced” and “senior.” To illustrate, here’s the distinction between “entry level” and “intermediate” for the Suit: Adaptability competency:

    Entry Level:

    • Analyzes both success and failures for clues to improvement.
    • Maintains composure during client meetings, remaining cool under pressure and not becoming defensive, even when under criticism.

    Intermediate:

    • Experiments and perseveres to find solutions.
    • Reads situations quickly.
    • Swiftly learns new concepts, skills, and abilities when facing new problems.

    And there are other specific criteria for the “advanced” and “senior” levels as well. 

    This led us to four unique data science titles—Data Scientist I, II, and III, as well as Senior Data Scientist, with the latter title still being explored for further differentiation. 

    The Geek Nerd Suit framework, and the definitions of the competencies within them, gives us clear, explicit criteria for assessing a new hire’s skillset in the three critical dimensions that are required for a data scientist to be successful.

    In Part 2, I’ll discuss what we specifically do within the Geek Nerd Suit framework to onboard a new hire once they’ve joined us—how we begin to groom the elusive unicorn. 

    Source: Information Management

    Author: Liam Hanham

  • How AI is influencing web design

    How AI is influencing web design

    Artificial intelligence in web design is making a major impact. This is what to know about how it works and how effective it can be.

    When Alan Turing first conceived of an intelligent machine, few could have predicted that such technology would become as widespread and ubiquitous as it is today.

    Since then, companies have adopted AI (artificial intelligence) for pretty much everything, from self-driving cars to medical technology to banking. We live in the age of big data, an age in which we use machines to collect and analyze massive amounts of data in a way that humans couldn’t do on their own. In many respects, the cognition of machines is already surpassing that of humans.

    With the explosion of the internet, AI has also become a critical element of web design. Artificial intelligence has helped with everything from the building and customization of websites and brands to the way users experience those websites themselves.

    Here are some of the ways AI is making web design increasingly sophisticated:

    AI designs websites

    Artificial design intelligence (ADI) tools are the building blocks of many of today’s websites. These days, ADI systems have evolved into effective tools with functional and attractive results. Wix and Bookmark, for example, offer popular automated website building tools with customizable options. Designers, developers, and everyday entrepreneurs no longer have to build websites from the ground up, nor do they need to spend hours choosing the perfect template. Instead, both Wix and Bookmark claim that websites can intelligently design themselves, using nothing more than the site’s name and the answers to a few quick questions.

    Not only does AI help engineer the web building process, but it’s also become the designer behind the brand names and logos that dominate a website’s home page. Companies are turning to artificial intelligence to automate their branding process, using AI tools like Tailor Brands to design their own customized logos in seconds. In this way, AI has made good web design more accessible and affordable for big companies and small-scale entrepreneurs alike.

    AI enhances user experience

    AI isn’t just changing web design on the developer end, it’s changing the way users experience websites, too. AI is the force behind the chatbots that offer conversation or assistance on many companies’ websites. While conversations with chatbots once felt frustrating, repetitive, and a little too robotic, more sophisticated AI-powered chatbots use natural language processing (NLP) to have more natural, authentic conversations and to genuinely “understand” their customers’ needs. Sephora’s chatbot on the Kik messaging platform is one example of a powerful NLP chatbot that understands customers’ beauty needs and provides them with recommendations based on those needs.

    In addition to the practical value of chatbots, the prevalence of chatbots indicates an increasing shift towards customer-focused websites, ones that prioritize drawing customers in over getting their message out. With the emergence of AI chatbots, websites have transformed into customer engagement platforms, where customers can offer their feedback, ask for help, or find products or services suited to their preferences.

    AI analyzes results

    We’ve seen how AI has benefitted both website building and user experience. A third way AI is affecting web design is by making possible analytics tools that help companies analyze their results and refine their websites accordingly.

    By crunching down big data into analyzable numbers and patterns, predictive analytics tools like TensorFlow and Infosys Nia reveal real-time insights about what does and doesn’t work for website visitors and prospective customers. This enables businesses to understand which types of customers are drawn to their site, and to accommodate those visitors with a seamless user experience. Using results from AI-powered analytics platforms, web developers and designers are able to tweak and refine their site and make it increasingly user-friendly.

    AI in web design: where is it heading next?

    AI is already being used in web design to make site building and design easier and more accessible, to enhance UX and further user engagement, and to drive site improvement through big data analytics. As artificial intelligence becomes even more advanced, affordable, and widespread, it will continue to affect web design in ways we can only imagine. Will improved natural language processing make chatbots indistinguishable from human representatives? Will websites readily adapt, real-time, to users’ preferences and needs? Whatever happens, AI is already the new normal.

    Author: Diana Hope

    Source: SmartDataCollective

  • How algorithms mislead the human brain in social media - Part 1

    How algorithms mislead the human brain in social media - Part 1

    Consider Andy, who is worried about contracting COVID-19. Unable to read all the articles he sees on it, he relies on trusted friends for tips. When one opines on Facebook that pandemic fears are overblown, Andy dismisses the idea at first. But then the hotel where he works closes its doors, and with his job at risk, Andy starts wondering how serious the threat from the new virus really is. No one he knows has died, after all. A colleague posts an article about the COVID “scare” having been created by Big Pharma in collusion with corrupt politicians, which jibes with Andy's distrust of government. His Web search quickly takes him to articles claiming that COVID-19 is no worse than the flu. Andy joins an online group of people who have been or fear being laid off and soon finds himself asking, like many of them, “What pandemic?” When he learns that several of his new friends are planning to attend a rally demanding an end to lockdowns, he decides to join them. Almost no one at the massive protest, including him, wears a mask. When his sister asks about the rally, Andy shares the conviction that has now become part of his identity: COVID is a hoax.

    This example illustrates a minefield of cognitive biases. We prefer information from people we trust, our in-group. We pay attention to and are more likely to share information about risks—for Andy, the risk of losing his job. We search for and remember things that fit well with what we already know and understand. These biases are products of our evolutionary past, and for tens of thousands of years, they served us well. People who behaved in accordance with them—for example, by staying away from the overgrown pond bank where someone said there was a viper—were more likely to survive than those who did not.

    Modern technologies are amplifying these biases in harmful ways, however. Search engines direct Andy to sites that inflame his suspicions, and social media connects him with like-minded people, feeding his fears. Making matters worse, bots—automated social media accounts that impersonate humans—enable misguided or malevolent actors to take advantage of his vulnerabilities.

    Compounding the problem is the proliferation of online information. Viewing and producing blogs, videos, tweets and other units of information called memes has become so cheap and easy that the information marketplace is inundated. Unable to process all this material, we let our cognitive biases decide what we should pay attention to. These mental shortcuts influence, often to a harmful extent, which information we search for, comprehend, remember and repeat.

    The need to understand these cognitive vulnerabilities and how algorithms use or manipulate them has become urgent. At the University of Warwick in England and at Indiana University Bloomington's Observatory on Social Media (OSoMe, pronounced “awesome”), our teams are using cognitive experiments, simulations, data mining and artificial intelligence to comprehend the cognitive vulnerabilities of social media users. Insights from psychological studies on the evolution of information conducted at Warwick inform the computer models developed at Indiana, and vice versa. We are also developing analytical and machine-learning aids to fight social media manipulation. Some of these tools are already being used by journalists, civil-society organizations and individuals to detect inauthentic actors, map the spread of false narratives and foster news literacy.

    Information Overload

    The glut of information has generated intense competition for people's attention. As Nobel Prize–winning economist and psychologist Herbert A. Simon noted, “What information consumes is rather obvious: it consumes the attention of its recipients.” One of the first consequences of the so-called attention economy is the loss of high-quality information. The OSoMe team demonstrated this result with a set of simple simulations. It represented users of social media such as Andy, called agents, as nodes in a network of online acquaintances. At each time step in the simulation, an agent may either create a meme or reshare one that he or she sees in a news feed. To mimic limited attention, agents are allowed to view only a certain number of items near the top of their news feeds.

    Running this simulation over many time steps, Lilian Weng of OSoMe found that as agents' attention became increasingly limited, the propagation of memes came to reflect the power-law distribution of actual social media: the probability that a meme would be shared a given number of times was roughly an inverse power of that number. For example, the likelihood of a meme being shared three times was roughly one-ninth the likelihood of its being shared once.

    This winner-take-all popularity pattern of memes, in which most are barely noticed while a few spread widely, could not be explained by some of them being more catchy or somehow more valuable: the memes in this simulated world had no intrinsic quality. Virality resulted purely from the statistical consequences of information proliferation in a social network of agents with limited attention. Even when agents preferentially shared memes of higher quality, researcher Xiaoyan Qiu, then at OSoMe, observed little improvement in the overall quality of those shared the most. Our models revealed that even when we want to see and share high-quality information, our inability to view everything in our news feeds inevitably leads us to share things that are partly or completely untrue.
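    A minimal sketch of this kind of limited-attention simulation (the network size, feed length, and posting probability here are illustrative assumptions, not the published OSoMe parameters):

```python
import random
from collections import Counter

random.seed(42)

N_AGENTS = 200       # nodes in the friendship network
FEED_LIMIT = 5       # limited attention: items an agent actually looks at
P_NEW_MEME = 0.1     # chance of posting a new meme rather than resharing
STEPS = 20000

# each agent's posts reach a handful of random followers
followers = {a: random.sample([b for b in range(N_AGENTS) if b != a], 8)
             for a in range(N_AGENTS)}
feeds = {a: [] for a in range(N_AGENTS)}    # newest items first
shares = Counter()                          # meme id -> times posted or reshared
next_id = 0

for _ in range(STEPS):
    agent = random.randrange(N_AGENTS)
    if random.random() < P_NEW_MEME or not feeds[agent]:
        meme, next_id = next_id, next_id + 1             # create a new meme
    else:
        meme = random.choice(feeds[agent][:FEED_LIMIT])  # reshare from feed top
    shares[meme] += 1
    for f in followers[agent]:
        feeds[f].insert(0, meme)
        del feeds[f][FEED_LIMIT * 4:]       # feeds are finite too

counts = Counter(shares.values())           # shares-per-meme distribution
print("memes:", len(shares), "| most shared:", max(shares.values()),
      "| shared once:", counts[1])
```

    Runs of this toy model tend to reproduce the heavy-tailed pattern described above: most memes are shared only a handful of times while a few spread far more widely, with no notion of quality anywhere in the code.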

    Cognitive biases greatly worsen the problem. In a set of groundbreaking studies in 1932, psychologist Frederic Bartlett told volunteers a Native American legend about a young man who hears war cries and, pursuing them, enters a dreamlike battle that eventually leads to his real death. Bartlett asked the volunteers, who were non-Native, to recall the rather confusing story at increasing intervals, from minutes to years later. He found that as time passed, the rememberers tended to distort the tale's culturally unfamiliar parts such that they were either lost to memory or transformed into more familiar things. We now know that our minds do this all the time: they adjust our understanding of new information so that it fits in with what we already know. One consequence of this so-called confirmation bias is that people often seek out, recall and understand information that best confirms what they already believe.

    This tendency is extremely difficult to correct. Experiments consistently show that even when people encounter balanced information containing views from differing perspectives, they tend to find supporting evidence for what they already believe. And when people with divergent beliefs about emotionally charged issues such as climate change are shown the same information on these topics, they become even more committed to their original positions.

    Making matters worse, search engines and social media platforms provide personalized recommendations based on the vast amounts of data they have about users' past preferences. They prioritize information in our feeds that we are most likely to agree with—no matter how fringe—and shield us from information that might change our minds. This makes us easy targets for polarization. Nir Grinberg and his co-workers at Northeastern University recently showed that conservatives in the U.S. are more receptive to misinformation. But our own analysis of consumption of low-quality information on Twitter shows that the vulnerability applies to both sides of the political spectrum, and no one can fully avoid it. Even our ability to detect online manipulation is affected by our political bias, though not symmetrically: Republican users are more likely to mistake bots promoting conservative ideas for humans, whereas Democrats are more likely to mistake conservative human users for bots.

    Social Herding

    In New York City in August 2019, people began running away from what sounded like gunshots. Others followed, some shouting, “Shooter!” Only later did they learn that the blasts came from a backfiring motorcycle. In such a situation, it may pay to run first and ask questions later. In the absence of clear signals, our brains use information about the crowd to infer appropriate actions, similar to the behavior of schooling fish and flocking birds.

    Such social conformity is pervasive. In a fascinating 2006 study involving 14,000 Web-based volunteers, Matthew Salganik, then at Columbia University, and his colleagues found that when people can see what music others are downloading, they end up downloading similar songs. Moreover, when people were isolated into “social” groups, in which they could see the preferences of others in their circle but had no information about outsiders, the choices of individual groups rapidly diverged. But the preferences of “nonsocial” groups, where no one knew about others' choices, stayed relatively stable. In other words, social groups create a pressure toward conformity so powerful that it can overcome individual preferences, and by amplifying random early differences, it can cause segregated groups to diverge to extremes.

    Social media follows a similar dynamic. We confuse popularity with quality and end up copying the behavior we observe. Experiments on Twitter by Bjarke Mønsted and his colleagues at the Technical University of Denmark and the University of Southern California indicate that information is transmitted via “complex contagion”: when we are repeatedly exposed to an idea, typically from many sources, we are more likely to adopt and reshare it. This social bias is further amplified by what psychologists call the “mere exposure” effect: when people are repeatedly exposed to the same stimuli, such as certain faces, they grow to like those stimuli more than those they have encountered less often.

    Such biases translate into an irresistible urge to pay attention to information that is going viral—if everybody else is talking about it, it must be important. In addition to showing us items that conform with our views, social media platforms such as Facebook, Twitter, YouTube and Instagram place popular content at the top of our screens and show us how many people have liked and shared something. Few of us realize that these cues do not provide independent assessments of quality.

    In fact, programmers who design the algorithms for ranking memes on social media assume that the “wisdom of crowds” will quickly identify high-quality items; they use popularity as a proxy for quality. Our analysis of vast amounts of anonymous data about clicks shows that all platforms—social media, search engines and news sites—preferentially serve up information from a narrow subset of popular sources.

    To understand why, we modeled how they combine signals for quality and popularity in their rankings. In this model, agents with limited attention—those who see only a given number of items at the top of their news feeds—are also more likely to click on memes ranked higher by the platform. Each item has intrinsic quality, as well as a level of popularity determined by how many times it has been clicked on. Another variable tracks the extent to which the ranking relies on popularity rather than quality. Simulations of this model reveal that such algorithmic bias typically suppresses the quality of memes even in the absence of human bias. Even when we want to share the best information, the algorithms end up misleading us.
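    The interplay of quality, popularity, and limited attention can be sketched roughly as follows (the ranking blend, the noisy early popularity, and all sizes are illustrative assumptions, not the authors' published model):

```python
import random

N_MEMES, CLICKS, ATTENTION = 200, 5000, 10

def run(pop_weight, seed=0):
    """Simulate users with limited attention clicking platform-ranked memes."""
    rng = random.Random(seed)
    quality = [rng.random() for _ in range(N_MEMES)]            # intrinsic quality
    popularity = [rng.randint(0, 100) for _ in range(N_MEMES)]  # noisy early clicks
    clicked_quality = 0.0
    for _ in range(CLICKS):
        # the platform ranks memes by a blend of popularity and quality
        score = lambda m: pop_weight * popularity[m] + (1 - pop_weight) * quality[m]
        top = sorted(range(N_MEMES), key=score, reverse=True)[:ATTENTION]
        m = rng.choice(top)       # limited attention: click a top-ranked item
        popularity[m] += 1
        clicked_quality += quality[m]
    return clicked_quality / CLICKS

print(f"ranking by quality:    avg clicked quality {run(0.0):.2f}")
print(f"ranking by popularity: avg clicked quality {run(1.0):.2f}")
```

    In this sketch, pure quality ranking concentrates clicks on the best items, while pure popularity ranking lets random early popularity lock in, dragging down the average quality of what gets clicked.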

    Want to continue reading? You can find part 2 of this article here.

    Author: Filippo Menczer

    Source: Scientific American

  • How algorithms mislead the human brain in social media - Part 2

    How algorithms mislead the human brain in social media - Part 2

    If you haven't read part 1 of this article yet, be sure to check it out here.

    Echo Chambers

    Most of us do not believe we follow the herd. But our confirmation bias leads us to follow others who are like us, a dynamic that is sometimes referred to as homophily—a tendency for like-minded people to connect with one another. Social media amplifies homophily by allowing users to alter their social network structures through following, unfriending, and so on. The result is that people become segregated into large, dense and increasingly misinformed communities commonly described as echo chambers.

    At OSoMe, we explored the emergence of online echo chambers through another simulation, EchoDemo. In this model, each agent has a political opinion represented by a number ranging from −1 (say, liberal) to +1 (conservative). These inclinations are reflected in agents' posts. Agents are also influenced by the opinions they see in their news feeds, and they can unfollow users with dissimilar opinions. Starting with random initial networks and opinions, we found that the combination of social influence and unfollowing greatly accelerates the formation of polarized and segregated communities.
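    The EchoDemo dynamics can be sketched in a few lines (the influence and tolerance parameters here are illustrative, not those of the published model):

```python
import random

N, STEPS = 50, 30000
INFLUENCE = 0.05     # how far an agent moves toward an opinion it sees
TOLERANCE = 0.8      # unfollow when opinions differ by more than this

random.seed(1)
opinion = [random.uniform(-1, 1) for _ in range(N)]   # -1 (liberal) .. +1 (conservative)
follows = {a: set(range(N)) - {a} for a in range(N)}  # start fully connected

for _ in range(STEPS):
    a = random.randrange(N)
    if not follows[a]:
        continue
    b = random.choice(tuple(follows[a]))        # a sees a post from b
    if abs(opinion[a] - opinion[b]) > TOLERANCE:
        follows[a].discard(b)                   # unfollow dissimilar users
    else:
        opinion[a] += INFLUENCE * (opinion[b] - opinion[a])   # social influence

links = sum(len(s) for s in follows.values())
print(f"{links} of {N * (N - 1)} follow links survive; "
      f"opinion spread {min(opinion):.2f} .. {max(opinion):.2f}")
```

    Unfollowing prunes cross-cutting links while social influence pulls the remaining neighbors together, the two ingredients that drive the segregation observed in the full model.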

    Indeed, the political echo chambers on Twitter are so extreme that individual users' political leanings can be predicted with high accuracy: you have the same opinions as the majority of your connections. This chambered structure efficiently spreads information within a community while insulating that community from other groups. In 2014 our research group was targeted by a disinformation campaign claiming that we were part of a politically motivated effort to suppress free speech. This false charge spread virally mostly in the conservative echo chamber, whereas debunking articles by fact-checkers were found mainly in the liberal community. Sadly, such segregation of fake news items from their fact-check reports is the norm.

    Social media can also increase our negativity. In a recent laboratory study, Robert Jagiello, also at Warwick, found that socially shared information not only bolsters our biases but also becomes more resilient to correction. He investigated how information is passed from person to person in a so-called social diffusion chain. In the experiment, the first person in the chain read a set of articles about either nuclear power or food additives. The articles were designed to be balanced, containing as much positive information (for example, about less carbon pollution or longer-lasting food) as negative information (such as risk of meltdown or possible harm to health).

    The first person in the social diffusion chain told the next person about the articles, the second told the third, and so on. We observed an overall increase in the amount of negative information as it passed along the chain—known as the social amplification of risk. Moreover, work by Danielle J. Navarro and her colleagues at the University of New South Wales in Australia found that information in social diffusion chains is most susceptible to distortion by individuals with the most extreme biases.

    Even worse, social diffusion also makes negative information more “sticky.” When Jagiello subsequently exposed people in the social diffusion chains to the original, balanced information—that is, the news that the first person in the chain had seen—the balanced information did little to reduce individuals' negative attitudes. The information that had passed through people not only had become more negative but also was more resistant to updating.

    A 2015 study by OSoMe researchers Emilio Ferrara and Zeyao Yang analyzed empirical data about such “emotional contagion” on Twitter and found that people overexposed to negative content tend to then share negative posts, whereas those overexposed to positive content tend to share more positive posts. Because negative content spreads faster than positive content, it is easy to manipulate emotions by creating narratives that trigger negative responses such as fear and anxiety. Ferrara, now at the University of Southern California, and his colleagues at the Bruno Kessler Foundation in Italy have shown that during Spain's 2017 referendum on Catalan independence, social bots were leveraged to retweet violent and inflammatory narratives, increasing their exposure and exacerbating social conflict.

    Rise of the Bots

    Information quality is further impaired by social bots, which can exploit all our cognitive loopholes. Bots are easy to create. Social media platforms provide so-called application programming interfaces that make it fairly trivial for a single actor to set up and control thousands of bots. But amplifying a message, even with just a few early upvotes by bots on social media platforms such as Reddit, can have a huge impact on the subsequent popularity of a post.

    At OSoMe, we have developed machine-learning algorithms to detect social bots. One of these, Botometer, is a public tool that extracts 1,200 features from a given Twitter account to characterize its profile, friends, social network structure, temporal activity patterns, language and other features. The program compares these characteristics with those of tens of thousands of previously identified bots to give the Twitter account a score for its likely use of automation.

    In 2017 we estimated that up to 15 percent of active Twitter accounts were bots—and that they had played a key role in the spread of misinformation during the 2016 U.S. election period. Within seconds of a fake news article being posted—such as one claiming the Clinton campaign was involved in occult rituals—it would be tweeted by many bots, and humans, beguiled by the apparent popularity of the content, would retweet it.

    Bots also influence us by pretending to represent people from our in-group. A bot only has to follow, like and retweet someone in an online community to quickly infiltrate it. OSoMe researcher Xiaodan Lou developed another model in which some of the agents are bots that infiltrate a social network and share deceptively engaging low-quality content—think of clickbait. One parameter in the model describes the probability that an authentic agent will follow bots—which, for the purposes of this model, we define as agents that generate memes of zero quality and retweet only one another. Our simulations show that these bots can effectively suppress the entire ecosystem's information quality by infiltrating only a small fraction of the network. Bots can also accelerate the formation of echo chambers by suggesting other inauthentic accounts to be followed, a technique known as creating “follow trains.”
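    A rough sketch of this kind of bot-infiltration model (the sizes, probabilities, and resharing rule are illustrative assumptions, not Lou's published model):

```python
import random

N, FEED, STEPS = 200, 5, 20000   # humans, feed length, posting events

def avg_feed_quality(p_follow_bot, seed=3):
    """Average quality of items sitting in human feeds after STEPS posts."""
    rng = random.Random(seed)
    # which humans have been infiltrated by (i.e., follow) the bot cluster
    follows_bots = [rng.random() < p_follow_bot for _ in range(N)]
    feeds = [[] for _ in range(N)]
    for _ in range(STEPS):
        h = rng.randrange(N)
        if follows_bots[h] and rng.random() < 0.5:
            q = 0.0                  # bot meme: zero quality, endlessly reposted
        elif feeds[h]:
            q = max(feeds[h])        # humans reshare the best item they see
        else:
            q = rng.random()         # original human meme, random quality
        for f in rng.sample(range(N), 5):   # the post reaches a few followers
            feeds[f].insert(0, q)
            del feeds[f][FEED:]
    return sum(sum(f) for f in feeds) / sum(len(f) for f in feeds)

print(f"no bots: {avg_feed_quality(0.0):.2f}   "
      f"bots followed by 30% of humans: {avg_feed_quality(0.3):.2f}")
```

    Even though only a minority of humans follow the bots, the zero-quality memes they inject keep displacing better items from everyone's finite feeds, lowering the quality of the whole ecosystem.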

    Some manipulators play both sides of a divide through separate fake news sites and bots, driving political polarization or monetization by ads. At OSoMe, we recently uncovered a network of inauthentic accounts on Twitter that were all coordinated by the same entity. Some pretended to be pro-Trump supporters of the Make America Great Again campaign, whereas others posed as Trump “resisters”; all asked for political donations. Such operations amplify content that preys on confirmation biases and accelerate the formation of polarized echo chambers.

    Curbing Online Manipulation

    Understanding our cognitive biases and how algorithms and bots exploit them allows us to better guard against manipulation. OSoMe has produced a number of tools to help people understand their own vulnerabilities, as well as the weaknesses of social media platforms. One is a mobile app called Fakey that helps users learn how to spot misinformation. The game simulates a social media news feed, showing actual articles from low- and high-credibility sources. Users must decide what to share, what not to share, and what to fact-check. Analysis of data from Fakey confirms the prevalence of online social herding: users are more likely to share low-credibility articles when they believe that many other people have shared them.

    Another program available to the public, called Hoaxy, shows how any extant meme spreads through Twitter. In this visualization, nodes represent actual Twitter accounts, and links depict how retweets, quotes, mentions and replies propagate the meme from account to account. Each node has a color representing its score from Botometer, which allows users to see the scale at which bots amplify misinformation. These tools have been used by investigative journalists to uncover the roots of misinformation campaigns, such as one pushing the “pizzagate” conspiracy in the U.S. They also helped to detect bot-driven voter-suppression efforts during the 2018 U.S. midterm election. Manipulation is getting harder to spot, however, as machine-learning algorithms become better at emulating human behavior.

    Apart from spreading fake news, misinformation campaigns can also divert attention from other, more serious problems. To combat such manipulation, we have recently developed a software tool called BotSlayer. It extracts hashtags, links, accounts and other features that co-occur in tweets about topics a user wishes to study. For each entity, BotSlayer tracks the tweets, the accounts posting them and their bot scores to flag entities that are trending and probably being amplified by bots or coordinated accounts. The goal is to enable reporters, civil-society organizations and political candidates to spot and track inauthentic influence campaigns in real time.

    These programmatic tools are important aids, but institutional changes are also necessary to curb the proliferation of fake news. Education can help, although it is unlikely to encompass all the topics on which people are misled. Some governments and social media platforms are also trying to clamp down on online manipulation and fake news. But who decides what is fake or manipulative and what is not? Information can come with warning labels such as the ones Facebook and Twitter have started providing, but can the people who apply those labels be trusted? The risk that such measures could deliberately or inadvertently suppress free speech, which is vital for robust democracies, is real. The dominance of social media platforms with global reach and close ties with governments further complicates the possibilities.

    One of the best ideas may be to make it more difficult to create and share low-quality information. This could involve adding friction by forcing people to pay to share or receive information. Payment could be in the form of time, mental work such as puzzles, or microscopic fees for subscriptions or usage. Automated posting should be treated like advertising. Some platforms are already using friction in the form of CAPTCHAs and phone confirmation to access accounts. Twitter has placed limits on automated posting. These efforts could be expanded to gradually shift online sharing incentives toward information that is valuable to consumers.

    Free communication is not free. By decreasing the cost of information, we have decreased its value and invited its adulteration. To restore the health of our information ecosystem, we must understand the vulnerabilities of our overwhelmed minds and how the economics of information can be leveraged to protect us from being misled.

    Author: Filippo Menczer

    Source: Scientific American

  • How artificial intelligence will shape the future of business

    How artificial intelligence will shape the future of business

    From the boardroom at the office to your living room at home, artificial intelligence (AI) is nearly everywhere nowadays. Tipped as the most disruptive technology of all time, it has already transformed industries across the globe. And companies are racing to understand how to integrate it into their own business processes.

    AI is not a new concept. The technology has been with us for a long time, but in the past, there were too many barriers to its use and applicability in our everyday lives. Now improvements in computing power and storage, increased data volumes and more advanced algorithms mean that AI is going mainstream. Businesses are harnessing its power to reinvent themselves and stay relevant in the digital age.

    The technology makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks. It does this by processing large amounts of data and recognising patterns. AI can analyse far more data than a human can, at a deeper level and at far greater speed.

    Most organisations can’t cope with the data they already have, let alone the data that is around the corner. So there’s a huge opportunity for organisations to use AI to turn all that data into knowledge to make faster and more accurate decisions.

    Customer experience

    Customer experience is becoming the new competitive battleground for all organisations. Over the next decade, businesses that dominate in this area will be the ones that survive and thrive. Analysing and interpreting the mountains of customer data within the organisation in real time and turning it into valuable insights and actions will be crucial.

    Today most organisations are using data only to report on what their customers did in the past. SAS research reveals that 93% of businesses currently cannot use analytics to predict individual customer needs.

    Over the next decade, we will see more organisations using machine learning to predict future customer behaviours and needs. Just as an AI machine can teach itself chess, organizations can use their existing massive volumes of customer data to teach AI what the next-best action for an individual customer should be. This could include what product to recommend next or which marketing activity is most likely to result in a positive response.

    Automating decisions

    In addition to improving insights and making accurate predictions, AI offers the potential to go one step further and automate business decision making entirely.

    Front-line workers or dependent applications make thousands of operational decisions every day that AI can make faster, more accurately and more consistently. Ultimately this automation means improving KPIs for customer satisfaction, revenue growth, return on assets, production uptime, operational costs, meeting targets and more.

    Take Shop Direct, for example, which owns the Littlewoods and Very brands. It uses AI from SAS to analyse customer data in real time and automate decisions, driving groundbreaking personalisation at an individual customer level. This approach saw Shop Direct’s profits surge by 40%, driven by a 15.9% increase in sales from Very.co.uk.

    AI is here. It’s already being adopted faster than the arrival of the internet. And it’s delivering business results across almost every industry today. In the next decade, every successful company will have AI. And the effects on skills, culture and structure will deliver superior customer experiences.

    Author: Tiffany Carpenter

    Source: SAS

  • How Machine Learning is Taking Over Wall Street

    How Machine Learning is Taking Over Wall Street

    Well-funded financial institutions are in a perpetual tech arms race, so it’s no surprise that machine learning is shaking up the industry. Investment banking, hedge funds, and similar entities are employing the latest machine learning techniques to gain an edge on the competition, and on the markets. While the reality today is that machine learning is mostly employed in the back office–for tasks such as credit scoring, risk management, and fraud detection–this is about to change dramatically.

    Machine learning is migrating to where the action is: financial market trading. Once-leading-edge Wall Street platforms, built at a cost of many millions, will soon be rendered obsolete by machine learning. Understanding how this disruption will reshape Wall Street, and why it matters, is key to navigating the opportunities ahead.

    Algorithmic Trading

    Algorithmic trading now dominates the derivative, equity, and foreign exchange trading markets. These trading strategies can be complex, but the essentials are straightforward: program a set of rules that takes market data as input and applies basic models (say, a 10-day moving average) to generate an automated trade workflow. Over the years, these strategies have moved beyond simple time-series momentum and mean reversion models to more exotically named strategies like snipes, slicers, and boxers. Evolved over decades, algorithmic trading has replaced much of the manual trade order flow with faster static rules-based strategies. What was once cutting edge is now an inherent disadvantage. Static rules, no matter how complex, may work well in relatively stable markets but can’t adapt to rapidly changing market conditions.
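    A static rules-based strategy of this kind, here a simple 10-day moving-average rule run on synthetic prices, can be sketched as follows (all numbers are illustrative):

```python
import random

random.seed(7)
prices = [100.0]
for _ in range(250):                      # one synthetic trading year
    prices.append(prices[-1] * (1 + random.gauss(0.0005, 0.01)))

WINDOW = 10
in_market, cash, shares = False, 10_000.0, 0.0
for t in range(WINDOW, len(prices)):
    ma = sum(prices[t - WINDOW:t]) / WINDOW      # 10-day moving average
    if prices[t] > ma and not in_market:         # price above MA: buy
        shares, cash, in_market = cash / prices[t], 0.0, True
    elif prices[t] < ma and in_market:           # price below MA: sell
        cash, shares, in_market = shares * prices[t], 0.0, False

final = cash + shares * prices[-1]
print(f"final equity: {final:,.2f}")
```

    The buy and sell rules here never change at runtime, which is precisely the static-rules limitation: the same thresholds apply whether the market regime is calm or turbulent.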

    A machine learning algorithm’s clear advantage is it learns from experience and is not static. Employing massive datasets and pattern recognition, these algorithms produce models that learn from experience and are orders of magnitude more powerful than old-school algorithmic trading models. Decisions on how and when to trade will be made in some cases by using multi-agent systems that can act autonomously. At some point, these static algorithms will be no match for more nimble machine learning algorithms.

    Why it Matters: Reskill and Upskill

    Companies that make use of algorithmic trading need to reskill or risk getting left behind. In a winner-take-all market, companies employing even slightly more advanced techniques, such as machine learning, will continuously win a bigger share of the market. In addition to machine learning talent, businesses should expect an increased demand for data engineers, data scientists, MLOps specialists, and others who can handle this sophisticated workflow.

    High-Frequency Trading Agents

    High-frequency trading (HFT) is the flashy cousin of algorithmic trading. Employing similar rules-based models, or even predictive analytics, these strategies operate at a much more rapid pace, completing hundreds of stock trades in fractions of a second versus the longer time horizons of algorithmic trading strategies. High-frequency trading also relies on massive hardware and bandwidth infrastructure investment that often requires system colocation next to major exchanges. Given its sophistication, only 2% of financial trading firms employ high-frequency trading, yet at its peak it accounted for 10 to 43% of stock trading volume on any given day.

    The ingredients for HFT–massive computing power, high-frequency streaming big data, and ultrafast connections–are all areas where deep learning and machine learning workflows excel.

    Pre-trained models can prevent machine learning and deep learning algorithms from becoming speed-limiting factors. Coupled with techniques such as deep reinforcement learning, HFT is primed for another technological leap. However, given its increased complexity, it will remain the domain of relatively few, but highly profitable, firms.

    Why it’s Important: Trouble Ahead

    The Flash Crash of May 6, 2010 wiped out nearly $1 trillion of market equity in an instant (36 minutes, to be precise). Regulators have struggled to keep up with algorithmic trading and high-frequency trading, and will doubtless be hard-pressed to stay ahead of the next generation. HFT AI agents will require much more sophisticated risk monitoring and compliance systems, which in turn will need to employ machine learning themselves.

    Risk Assessment Platforms

    Despite the vast sums invested in technology by financial institutions, the humble Excel spreadsheet remains the number one application on Wall Street. Risk departments, charged with ensuring traders don’t make calamitous errors, are no exception. Even the better-equipped firms employ software that relies on rule sets and analytics that are adept at catching known risks but poorly equipped to identify evolving market risk.

    A robust risk assessment platform is by nature a catch-all. Risk scales from individual trades to companies, industries, countries, and global risk profiles. Risk can be quantifiable, but a risk assessment may often need to rely on alternative data. Machine learning’s adaptability and flexibility make it a natural successor to current risk assessment software. Both supervised and unsupervised machine learning techniques can be employed to layer on more sophisticated risk strategies. Anomaly detection, used to identify outliers, is one such technique that can be readily employed to identify the rare events that are characteristic of risk modeling.
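    As a minimal illustration of anomaly detection on trades, a robust modified z-score check over trade sizes could look like this (the data and threshold are invented for the example):

```python
import statistics

def flag_outliers(trade_sizes, threshold=3.5):
    """Flag trades whose size deviates strongly from the typical trade,
    using the robust modified z-score (median absolute deviation)."""
    med = statistics.median(trade_sizes)
    mad = statistics.median(abs(x - med) for x in trade_sizes)
    if mad == 0:
        return []          # no spread at all: nothing stands out
    # 0.6745 rescales MAD to be comparable with a standard deviation
    return [i for i, x in enumerate(trade_sizes)
            if abs(0.6745 * (x - med) / mad) > threshold]

trades = [98, 102, 101, 99, 100, 97, 103, 100, 5_000, 102]  # one fat-finger trade
print(flag_outliers(trades))   # → [8]: the 5,000-share trade
```

    Median-based statistics are preferred here over the mean precisely because the outlier being hunted would otherwise distort the baseline it is measured against.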

    Why it Matters: Risk and Repeat

    The implosion of Archegos Capital in March 2021 caused some of the world’s most sophisticated banks to lose up to $10 billion, highlighting the poor systems and oversight many financial institutions apply to trade risk exposure. Similar risk failures, albeit of smaller magnitude, continue to abound despite the lessons learned and the trillions lost in the risk failures that gave rise to the 2007 financial crisis. Risk departments are finally waking up to the inherent advantages of pattern-recognition machine learning over manual and backward-looking analytics tools. Add to this the increased complexity due to, you guessed it, machine learning trading strategies.

    OMS Trading Platforms

    Retail traders have flocked to online trading platforms like Robinhood, Fidelity, and E*Trade. Institutional professionals use more advanced order management systems (OMS) from companies like B2Broker, Charles River, Interactive Brokers, and others. These institutional trading platforms all execute the same basic workflow: financial market data is fed in; a set of static trading, risk, and compliance rules is applied; buy and sell orders are generated; the order book is updated; and trade analytics reports are generated.
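    That basic workflow can be sketched in a few lines. The rule thresholds, symbols, and class names below are purely illustrative assumptions, not any vendor's API:

```python
# Toy sketch of the OMS workflow described above: market data in,
# static trading and risk rules applied, orders generated, order book
# updated. All names and thresholds are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Order:
    symbol: str
    side: str   # "buy" or "sell"
    qty: int

@dataclass
class OrderBook:
    orders: list = field(default_factory=list)

    def add(self, order: Order):
        self.orders.append(order)

MAX_QTY = 1_000  # static risk rule: position size cap

def run_workflow(ticks, book):
    """Apply a trivial static rule set to incoming market data ticks."""
    for symbol, price in ticks:
        # Static trading rule: buy below 100, sell above 200.
        if price < 100:
            order = Order(symbol, "buy", 100)
        elif price > 200:
            order = Order(symbol, "sell", 100)
        else:
            continue
        # Static compliance rule: reject oversized orders.
        if order.qty <= MAX_QTY:
            book.add(order)
    return book

book = run_workflow([("AAPL", 95.0), ("MSFT", 150.0), ("TSLA", 250.0)],
                    OrderBook())
print([(o.symbol, o.side) for o in book.orders])
```

    The point of the sketch is the rigidity: every decision sits in a fixed `if` branch, which is exactly the structure the machine-learning approaches discussed below replace.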

    Traditionally, these platforms were closed systems. Many provide limited APIs that allow customization of various aspects such as data feeds, order flow, and algorithms, but most work only within the confines of their particular platform. Advanced hedge fund traders are employing sophisticated machine learning and deep learning techniques built on frameworks and libraries like TensorFlow, Keras, and PyTorch. Deep learning techniques such as deep reinforcement learning, natural language understanding (NLU), and transfer learning require these platforms. These models often require alternative data whose unstructured format does not fit the structured time-series format many of today's trading platforms expect.

    Why it’s Important: From Closed to Open

    At some point, this equation will flip. Trading platforms are very good at order workflow and trade analytics. However, data profiling, data transformation, and machine learning algorithms need something much more flexible, adaptive, and open. The existing dominant market players will need to adopt a more open API approach that gives full access to every stage of the order workflow. Over the next five years, this in turn will lead to adoption by retail brokers and bring machine learning trading to the masses.

    From Leader to Laggard

    For the last few decades, Wall Street has been a clear leader in rolling out complex platforms such as algorithmic trading, high-frequency trading (HFT), and other innovative trading strategies. However, many of these systems rely on static rules-based systems or, at best, predictive analytics. Companies in other sectors that fully embraced machine learning and deep learning early have come to dominate their industries. Expect a similar shakeout among financial institutions as some companies go all-in on artificial intelligence and become the next generation of technology leaders.

    Author: Sheamus McGovern

    Source: Open Data Science

  • How serverless machine learning and choosing the right FaaS benefit AI development

    How serverless machine learning and choosing the right FaaS benefit AI development

    Getting started with machine learning throws multiple hurdles at enterprises. But the serverless computing trend, when applied to machine learning, can help remove some barriers.

    IT infrastructure that enables rapid scaling, integration, and automation is a greatly valued commodity in an ever faster-moving marketplace, and fast, serverless machine learning is a prime example.

    Serverless computing is a cloud-based model wherein a service provider accepts code from a customer, dynamically allocates resources to the job and executes it. This model can be more cost-effective than conventional pay-or-rent server models. Elasticity replaces scalability, relieving the customer of deployment grief. Code development can be far more modular. And headaches from processes like HTTP request processing and multithreading vanish altogether.
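    As a minimal sketch of the model, a serverless function is just a handler the provider invokes per request. The signature below mirrors AWS Lambda's Python convention (`event`, `context`), while the scoring logic and field names are illustrative assumptions:

```python
# Minimal sketch of a serverless handler: the FaaS runtime invokes this
# function per request and bills per execution; there is no server code
# to manage. The "model" below is a stand-in for real inference.
import json

def handler(event, context=None):
    """Entry point the FaaS platform calls with the request payload."""
    body = json.loads(event["body"])
    # Toy fixed linear score standing in for a deployed model.
    score = 0.3 * body["feature_a"] + 0.7 * body["feature_b"]
    return {"statusCode": 200, "body": json.dumps({"score": score})}

# Locally we can invoke the handler directly, just as the platform would.
resp = handler({"body": json.dumps({"feature_a": 1.0, "feature_b": 2.0})})
print(resp["statusCode"], resp["body"])
```

    Because the unit of deployment is a single function, the HTTP plumbing, threading, and scaling concerns mentioned above stay on the provider's side of the line.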

    It's as efficient as development could possibly be from the standpoint of time and money: The enterprise pays the provider job by job, billed only for the resources consumed in any one job execution. This simple pay-as-you-go model frees up enterprise resources for more rapid app and service development and levels the playing field for development companies not at the enterprise level.

    Attractive as it is, how can this paradigm accommodate machine learning, which is becoming a mission-critical competitive advantage in many industries?

    A common problem in working with machine learning is moving training models into production at scale. It's a matter of getting the model to perform for a great many users, often in different places, as fast as the users need it to do so. Nested in this broad problem is the more granular headache of concept drift, as the model's performance degrades over time with increasing variations in data, which causes such models to need frequent retraining. And that, in turn, creates a versioning issue and so on.

    Function as a service

    Function as a service (FaaS) is an implementation of serverless computing that works well for many application deployment scenarios, serverless machine learning included. The idea is to create a pipeline by which code is moved, in series, from testing to versioning to deployment, using FaaS throughout as the processing resource. When the pipeline is well-conceived and implemented, most of the housekeeping difficulties of development and deployment are minimized, if not removed.

    A machine learning model deployment adds two steps to this pipeline:

    • training, upon which the model's quality depends; and
    • publishing: timing the go-live of the code in production, once it's deployed.

    FaaS is a great platform for this kind of process, given its versatile and flexible nature.

    All the major public clouds provide FaaS. The list begins with AWS Lambda, Microsoft Azure Functions, Google Cloud Functions and IBM Cloud Functions, and it includes many others.

    Easier AI development

    The major FaaS function platforms accommodate JavaScript, Python and a broad range of other languages. For example, Azure Functions is Python- and JavaScript-friendly.

    Beyond the languages themselves, there are many machine learning libraries available through serverless machine learning offerings: TensorFlow, PyTorch, Keras, MLpack, Spark ML, Apache MXNet and a great many more.

    A key point about AI development in the FaaS domain is that it vastly simplifies the developer's investment in architecture: autoscaling is built in; multithreading goes away, as mentioned above; and fault tolerance and high availability are provided by default.

    Moreover, if machine learning models are essentially functions handled by FaaS, then they are abstracted and autonomous in a way that relieves timeline pressure when different teams are working with different microservices in an application system. The lives of product managers get much easier.

    Turnkey FaaS machine learning

    Vendors are doubling down on the concept of serverless computing and continuing to refine their options. Amazon, Google and others have services set aside to do your model training for you, on demand: Amazon SageMaker and the Google Cloud ML Engine are two of many such services, which also include IBM Watson Machine Learning, Salesforce Einstein and Seldon Core, which is open source.

    These services do more than just train machine learning models. Many serverless machine learning offerings handle the construction of data sets to be used in model training, provide libraries of machine learning algorithms, and configure and optimize the machine learning libraries mentioned above.

    Some offer model tuning: automated adjustment of algorithm parameters to push the model to its highest predictive capacity.

    The FaaS you choose, then, could lead to one-stop shopping for your machine learning application.

    Author: Scott Robinson

    Source: TechTarget

  • How the skillset of data scientists will change over the next decade

    How the skillset of data scientists will change over the next decade

    AutoML is poised to turn developers into data scientists — and vice versa. Here’s how AutoML will radically change data science for the better.

    In the coming decade, the data scientist role as we know it will look very different than it does today. But don’t worry, no one is predicting lost jobs, just changed jobs.

    Data scientists will be fine — according to the Bureau of Labor Statistics, the role is still projected to grow at a higher than average clip through 2029. But advancements in technology will be the impetus for a huge shift in a data scientist’s responsibilities and in the way businesses approach analytics as a whole. And AutoML tools, which help automate the machine learning pipeline from raw data to a usable model, will lead this revolution.

    In 10 years, data scientists will have entirely different sets of skills and tools, but their function will remain the same: to serve as confident and competent technology guides that can make sense of complex data to solve business problems.

    AutoML democratizes data science

    Until recently, machine learning algorithms and processes were almost exclusively the domain of more traditional data science roles—those with formal education and advanced degrees, or working for large technology corporations. Data scientists have played an invaluable role in every part of the machine learning development spectrum. But in time, their role will become more collaborative and strategic. With tools like AutoML to automate some of their more academic skills, data scientists can focus on guiding organizations toward solutions to business problems via data.

    In many ways, this is because AutoML democratizes the effort of putting machine learning into practice. Vendors from startups to cloud hyperscalers have launched solutions easy enough for developers to use and experiment on without a large educational or experiential barrier to entry. Similarly, some AutoML applications are intuitive and simple enough that non-technical workers can try their hands at creating solutions to problems in their own departments—creating a “citizen data scientist” of sorts within organizations.

    In order to explore the possibilities these types of tools unlock for both developers and data scientists, we first have to understand the current state of data science as it relates to machine learning development. It’s easiest to understand when placed on a maturity scale.

    Smaller organizations and businesses with more traditional roles in charge of digital transformation (i.e., not classically trained data scientists) typically fall at the lower end of this scale. Right now, they are the biggest customers for out-of-the-box machine learning applications, which are geared toward an audience unfamiliar with the intricacies of machine learning.

    • Pros: These turnkey applications tend to be easy to implement, and relatively cheap and easy to deploy. For smaller companies with a very specific process to automate or improve, there are likely several viable options on the market. The low barrier to entry makes these applications perfect for data scientists wading into machine learning for the first time. Because some of the applications are so intuitive, they even allow non-technical employees a chance to experiment with automation and advanced data capabilities—potentially introducing a valuable sandbox into an organization.
    • Cons: This class of machine learning applications is notoriously inflexible. While they can be easy to implement, they aren’t easily customized. As such, certain levels of accuracy may be impossible for certain applications. Additionally, these applications can be severely limited by their reliance on pretrained models and data. 

    Examples of these applications include Amazon Comprehend, Amazon Lex, and Amazon Forecast from Amazon Web Services and Azure Speech Services and Azure Language Understanding (LUIS) from Microsoft Azure. These tools are often sufficient for burgeoning data scientists to take their first steps in machine learning and usher their organizations further down the maturity spectrum.

    Customizable solutions with AutoML

    Organizations with large yet relatively common data sets—think customer transaction data or marketing email metrics—need more flexibility when using machine learning to solve problems. Enter AutoML. AutoML takes the steps of a manual machine learning workflow (data discovery, exploratory data analysis, hyperparameter tuning, etc.) and condenses them into a configurable stack.

    • Pros: AutoML applications allow more experiments to be run on data in a larger space. But the real superpower of AutoML is the accessibility — custom configurations can be built and inputs can be refined relatively easily. What’s more, AutoML isn’t made exclusively with data scientists as an audience. Developers can also easily tinker within the sandbox to bring machine learning elements into their own products or projects.
    • Cons: While it comes close, AutoML's limitations mean accuracy in outputs will be difficult to perfect. Because of this, degree-holding, card-carrying data scientists often look down on applications built with the help of AutoML, even if the result is accurate enough to solve the problem at hand.

    Examples of these applications include Amazon SageMaker AutoPilot or Google Cloud AutoML. Data scientists a decade from now will undoubtedly need to be familiar with tools like these. Like a developer who is proficient in multiple programming languages, data scientists will need to have proficiency with multiple AutoML environments in order to be considered top talent.
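    The core idea these tools automate, searching hyperparameters against a metric instead of hand-tuning them, can be sketched with scikit-learn's GridSearchCV as a stand-in; commercial AutoML services additionally automate data preparation and model selection, which this sketch does not attempt:

```python
# Stand-in sketch of what AutoML automates: a declared search space is
# explored against a cross-validated metric instead of tuning by hand.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The "configurable stack": preprocessing and model chained together.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The search space replaces manual hyperparameter tuning.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["clf__C"])
print("cv accuracy: %.3f" % search.best_score_)
```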

    “Hand-rolled” and homegrown machine learning solutions 

    The largest enterprise-scale businesses and Fortune 500 companies are where most of the advanced and proprietary machine learning applications are currently being developed. Data scientists at these organizations are part of large teams perfecting machine learning algorithms using troves of historical company data, and building these applications from the ground up. Custom applications like these are only possible with considerable resources and talent, which is why the payoff and risks are so great.

    • Pros: Like any application built from scratch, custom machine learning is “state-of-the-art” and is built based on a deep understanding of the problem at hand. It’s also more accurate — if only by small margins — than AutoML and out-of-the-box machine learning solutions.
    • Cons: Getting a custom machine learning application to reach certain accuracy thresholds can be extremely difficult, and often requires heavy lifting by teams of data scientists. Additionally, custom machine learning options are the most time-consuming and most expensive to develop.

    An example of a hand-rolled machine learning solution is starting with a blank Jupyter notebook, manually importing data, and then conducting each step from exploratory data analysis through model tuning by hand. This is often achieved by writing custom code using open source machine learning frameworks such as Scikit-learn, TensorFlow, PyTorch, and many others. This approach requires a high degree of both experience and intuition, but can produce results that often outperform both turnkey machine learning services and AutoML.
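    A compressed sketch of that hand-rolled path might look like the following: manual data loading, a train/test split, and a hand-written tuning loop, exactly the work AutoML would otherwise automate. The dataset and the parameter values searched are illustrative choices:

```python
# Compressed sketch of the "hand-rolled" workflow: every step, from
# loading data to tuning a hyperparameter, is written out by hand.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_depth, best_acc = None, 0.0
for depth in (2, 4, 8):  # manual tuning loop over one hyperparameter
    model = RandomForestClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

print(f"best max_depth={best_depth}, accuracy={best_acc:.3f}")
```

    Experience and intuition enter in choosing what to search and how to validate it, which is why this approach can outperform automated tooling at much higher cost in time.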

    Tools like AutoML will shift data science roles and responsibilities over the next 10 years. AutoML takes the burden of developing machine learning from scratch off of data scientists, and instead puts the possibilities of machine learning technology directly in the hands of other problem solvers. With time freed up to focus on what they know—the data and the inputs themselves — data scientists a decade from now will serve as even more valuable guides for their organizations.

    Author: Eric Miller

    Source: InfoWorld

  • How to create a trusted data environment in 3 essential steps

    How to create a trusted data environment in 3 essential steps

    We are in the era of the information economy. Nowadays, more than ever, companies have the capabilities to optimize their processes through the use of data and analytics. While there are endless possibilities when it comes to data analysis, there are still challenges in maintaining, integrating, and cleaning data to ensure that it empowers people to make decisions.

    Bottom up, top down? What is the best?

    As IT teams begin to tackle the data deluge, a question often asked is: should this problem be approached bottom-up or top-down? There is no one-size-fits-all answer here, but every data team needs a high-level view of its data subject areas. Think of this high-level view as a map you create to define priorities and identify problem areas for your business within the modern data-based economy. This map allows you to set up a phased approach to optimizing the data assets that contribute the most value.

    The high-level view unfortunately is not enough to turn your data into valuable assets. You also need to know the details of your data.

    Getting the details from your data is where a data profile comes into play. This profile tells you what your data is from the technical perspective, while the high-level view (the enterprise information model) gives you the view from the business perspective. Real business value comes from the combination of both: a transversal, holistic view of your data assets that allows you to zoom in or out. The high-level view enriched with technical details (even without the profiling) lets you start with the most important phase of the digital transformation: discovery of your data assets.

    Not only data integration, but data integrity

    With all the data travelling around in different types and sizes, integrating the data streams across various partners, apps, and sources has become critical. But it's more complex than ever.

    Due to the size and variety of the data being generated, not to mention ever-faster go-to-market scenarios, companies should look for technology partners that can help them achieve this integration and integrity, either on premises or in the cloud.

    Your 3 step plan to trusted data

    Step 1: Discover and cleanse your data

    A recent IDC study found that only 19% of a data professional’s time is spent analyzing information and delivering valuable business outcomes. They spend 37% of their time preparing data and 24% of their time goes to protecting data. The challenge is to overcome these obstacles by bringing clarity, transparency, and accessibility to your data assets.

    This discovery platform takes the form of an auto-profiling data catalog: it profiles your data, helps you understand its quality, and builds a confidence score that fosters trust among the business users of the data assets.
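    The profiling behind such a catalog can be sketched in a few lines of pandas. The completeness and uniqueness checks are standard profiling measures, but the confidence-score formula below is an illustrative assumption, not any product's metric:

```python
# Toy sketch of auto-profiling: per-column completeness and uniqueness
# rolled up into a simple 0-100 confidence score. The scoring formula
# is an illustrative assumption for demonstration only.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "email": ["a@x.com", None, "c@x.com", None, "e@x.com"],
    "country": ["BE", "BE", "NL", "NL", "BE"],
})

profile = pd.DataFrame({
    "completeness": df.notna().mean(),     # share of non-null values
    "uniqueness": df.nunique() / len(df),  # share of distinct values
})
# Toy confidence score: average of the two checks, scaled to 0-100.
profile["confidence"] = (profile.mean(axis=1) * 100).round(1)

print(profile)
```

    A real catalog would add many more checks (type conformance, pattern matches, referential integrity), but the shape of the output, one trust score per data asset, is the same.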

    Thanks to the application of artificial intelligence (AI) and machine learning (ML) in data catalogs, data profiling can be provided as a self-service capability to power users.

    Bringing transparency, understanding, and trust to the business brings out the value of the data assets.

    Step 2: Organize data you can trust and empower people

    According to the Gartner Magic Quadrant for Business Intelligence and Analytics Platforms, 2017: “By 2020, organizations that offer users access to a curated catalog of internal and external data will realize twice the business value from analytics investments than those that do not.”

    An important phase in a successful data governance framework is establishing a single point of trust. From the technical perspective, this translates to bringing all the data sets together in a single point of control. The governance aspect is the capability to assign roles and responsibilities directly in that central point of control, which allows you to operationalize your governance instantly, from the place the data originates.

    The organization of your data assets goes hand in hand with business understanding of the data, transparency, and provenance. An end-to-end view of your data lineage ensures compliance and risk mitigation.

    With the central compass in place and roles and responsibilities assigned, it's time to empower people for data curation and remediation, in which ongoing communication is of vital importance for the adoption of a data-driven strategy.

    Step 3: Automate your data pipelines & enable data access

    Different layers and technologies make our lives more complex. It is important to keep our data flows and streams aligned and to adapt swiftly to changing business needs.

    The needed transformations, data quality profiling, and reporting can be extensively automated.

    Start small and scale big. Much of this intelligence can be achieved by applying AI and ML: these algorithms take the cumbersome work out of analysts' hands and scale far more easily. This automation helps analysts understand the data faster and produce better insights in less time.

    Putting data at the center of everything, implementing automation, and provisioning it through one single platform is one of the key success factors in your digital transformation and in becoming a truly data-driven organization.

    Source: Talend

  • How to improve your business processes with Artificial Intelligence?

    How to improve your business processes with Artificial Intelligence?

    In the age of digital disruption, even the world's largest companies aren't impervious to agile competitors that move quickly, iterate fast, and have the capacity to build products faster than their peers. That's why many legacy organizations are taking a closer look at business process management.

    Simply speaking, business process management is the practice of reengineering a firm's existing systems for better productivity and efficiency. It takes a proactive approach to identifying business problems and the steps needed to rectify them. And while business process management has traditionally been the forte of management consultants and other functional experts, rapid advancements in artificial intelligence and big data mean this sector is also undergoing a fundamental transformation.

    So it begs the question: how do you start “plugging AI” into your company’s existing data and systems?

    Where to begin?

    Artificial intelligence is exciting because it promises a totally new approach to business operations. However, most traditional organizations don't have the necessary infrastructure and/or computing power to deploy these technologies.

    Moving your data and applications to the cloud is a very popular solution to unlocking the necessary computing resources, but there's a catch. You can’t just copy-paste your files to the cloud and start using AI. Older systems weren’t built with a cloud deployment in mind, so leveraging the cloud usually requires rebuilding your existing software using a common cloud-ready platform like Kubernetes, Pivotal Cloud, and Docker Swarm.

    The point is that once you make a decision towards digital transformation, you need complete buy-in from all areas of the business and a commitment to process and technology changes. Getting that commitment typically involves showcasing the real benefits that AI can unlock. Let’s take a closer look at how artificial intelligence is actively impacting the way companies do their business.

    1. Analyzing sales calls

    When it comes to improving business processes and operations, one crucial area is sales calls. That's because sales, and the revenue that ensues from them, are the bread and butter of your business. Top-tier sales representatives will ensure your firm keeps chugging along and breaking new ground.

    In the past, analyzing sales calls was a manual process. There might have been a standard sales playbook with generic questions that each individual would be expected to ask. But now, AI conversational tools like Gong are automating this process entirely.

    Gong is able to record each outbound sales call that your team makes and pick up on cues that help it determine how the call went. So, a successful sales call will probably see the prospect talking more than the sales rep, for example.

    2. Converting voicemail into text

    Have you ever heard the phrase: “Your unhappiest customers are your greatest source of learning?” These famous words were said by none other than Bill Gates. But how can you even accurately quantify customer sentiment if you don’t take the requisite steps to track it?

    It’s certainly possible that a large chunk of your customers don’t want to remain on hold while waiting for a customer support agent and prefer to leave a voicemail instead. Intelligent automation tools like Workato are making it possible to automate voicemail follow-ups, thereby ensuring that no customer falls through the cracks and each one is given an appropriate response to their concerns.

    For example, Workato helped automate voicemail follow-ups for a large chain of cafes. Whenever a new voicemail came into its system, the intelligent tool would use speech-to-text conversion to create a transcript of the voicemail. It would then add that text to the service ticket, giving customer support agents a much better idea of the nature of the complaint and allowing them to resolve it more quickly.

    3. Detecting fraud

    Occupational fraud causes organizations to lose about 5% of their total revenue every year with a potential total loss of $3.5 trillion. Machine learning algorithms are actively quelling this trend by spotting discrepancies and anomalies in everyday processes.

    For example, banks and financial institutions use intelligent algorithms to detect suspicious money transfers and payments. This process is also applicable in cybersecurity, tax evasion, customs clearing processes, insurance, and other fields. Large-scale organizations that are able to leverage AI are potentially looking at cost savings in the millions of dollars each year. These resources can then be spent in other critical areas of business such as research and development so companies can stay competitive and ahead of the curve.
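    As a hedged sketch of the supervised variant of this idea, the snippet below trains a classifier on labeled historical transactions and scores new ones. The features and synthetic data are illustrative assumptions; real systems use far richer signals and far more data:

```python
# Illustrative sketch of supervised fraud detection: fit a classifier
# on labeled historical transactions, then score incoming ones.
# Features ([amount in thousands, transfers in the last hour]) and the
# simulated data are assumptions for demonstration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

legit = rng.normal(loc=[1.0, 1.0], scale=[0.5, 0.5], size=(300, 2))
fraud = rng.normal(loc=[9.0, 8.0], scale=[1.0, 1.0], size=(30, 2))

X = np.vstack([legit, fraud])
y = np.array([0] * 300 + [1] * 30)  # 0 = legitimate, 1 = fraudulent

clf = LogisticRegression(max_iter=1000).fit(X, y)

suspicious = clf.predict([[10.0, 9.0]])[0]  # large, rapid transfer
routine = clf.predict([[0.8, 1.0]])[0]      # everyday payment
print("suspicious flagged:", bool(suspicious),
      "| routine flagged:", bool(routine))
```

    In practice the interesting part is the imbalance (fraud is rare) and the feedback loop of confirmed cases, both of which this sketch deliberately glosses over.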


    Artificial intelligence isn’t just a fancy buzzword that people are tossing around with willful abandon. In fact, every time you take advantage of Google’s typo detection feature (when you see ‘did you mean’ in the search engine) you’re actually plugging into its DeepMind platform, an example of AI in everyday use.

    AI has the potential to deliver greater efficiency, higher output, fewer interruptions, and, ultimately, higher revenue across businesses of all shapes and sizes.

    Author: Santana Wilson

    Source: Oracle

  • Human actions more important than ever with historically high volumes of data

    Human actions more important than ever with historically high volumes of data

    IDC predicts that our global datasphere: the digital data we create, capture, replicate and consume, will grow from approximately 40 zettabytes of data in 2019 to 175 zettabytes in 2025, and that 60% of this data will be managed by enterprises.

    To both manage and make use of this near-future data deluge, enterprise organizations will increasingly rely on machine learning and AI. But IDC Research Director Chandana Gopal says this doesn’t mean that the importance of humans in deriving insights and decision making will decrease. In fact, the opposite is true.

    'As volumes of data increase, it becomes vitally important to ensure that decision makers understand the context and trust the data and the insights that are being generated by AI/ML, sometimes referred to as thick data', says Gopal in 10 Enterprise Analytics Trends to Watch in 2020.

    In an AI automation framework published by IDC, we state that it is important to evaluate the interaction of humans and machines by asking the following three questions:

    1. Who analyzes the data?
    2. Who decides based on the results of the analysis?
    3. Who acts based on the decision?

    'The answers to the three questions above will guide businesses towards their goal of maximizing the use of data and augmenting the capabilities of humans in effective decision making. There is no doubt that machines are better suited to finding patterns and correlations in vast quantities of data. However, as it is famously said, correlation does not imply causation, and it is up to the human (augmented with ML) to determine why a certain pattern might occur'.

    Training employees to become data literate and conversant with data ethnography should be part of every enterprise organization’s data strategy in 2020 and beyond, advises Gopal. As more and more decisions are informed and made by machines, it’s vital that humans understand the how and why.

    Author: Tricia Morris

    Source: Microstrategy

  • In an intelligent organization, there is always a place for a chatbot in HR

    In an intelligent organization, there is always a place for a chatbot in HR

    People are the heart of a company, and the Human Resources department exists to take care of those people. HR is the guardian of the culture and ensures that employees get opportunities to grow. It keeps the company vibrant and healthy. HR, in other words, revolves around people. So does a virtual assistant, i.e. a chatbot, really belong among all these people?

    Although HR revolves around the people within an organization, HR employees spend roughly a quarter of their time on administrative tasks. Answering employees' questions, for instance, is a daily recurring task. Questions like 'how many vacation days do I have left?' or 'what are the rules around sick leave?' come up almost every day. A chatbot can answer all of those employee questions. This not only relieves the HR manager, but also gives immediate clarity to the employees asking the questions. No more frustration of waiting a long time for an answer to a simple question. Sounds good, right?

    A chatbot can also accurately track the questions asked, in order to spot bottlenecks in HR policy. Moreover, with the help of artificial intelligence, a chatbot becomes smarter the more questions it receives. The answers it gives will become better and more accurate every day. This is known as machine learning.

    Personal answers for specific situations

    Requesting leave in particular is an administrative task that often takes a lot of time. Think of requesting maternity leave, for example. A bot can give personal answers and solutions for this specific request.

    The chatbot can also play a role during illness. One of HR's most important tasks is maintaining a motivated workforce. To contribute to this, a chatbot can, for example, send a 'get well soon' message when someone calls in sick. The virtual assistant can also ask and track how that person is doing, in order to keep an eye on their recovery.

    Smoothing out application procedures with a chatbot

    Given the current labor market, finding new staff is often difficult. It is therefore essential that the application process runs flawlessly. A chatbot can optimize this by answering an applicant's questions immediately. After answering a question, the chatbot can itself collect valuable data about the applicant. The bot stores the answers, making it easier to screen candidates. This makes life easier not only for the recruiter, but also for the applicant.

    The vast majority, roughly 80%, of job applicants consider going elsewhere if they do not receive regular updates on their application during the process. They stay on board if they are kept regularly informed of where things stand. A bot can keep an applicant up to date and thus start the recruitment process on a positive note. Once the applicant has completed the selection procedure and begins his or her trial period, onboarding starts. Onboarding is a crucial period for ensuring that a new employee can contribute in the organization as quickly as possible. Instead of working through a checklist, the chatbot can take over a large part of the onboarding from HR, and the employee can quickly get started on their own. Because all documents and information are made available in the chatbot, HR can focus more on the personal aspect of onboarding.

    A chatbot for HR: more room for people

    Despite the rise of new technology, the world of HR is one that revolves around people. People who need time to be there for each other, instead of constantly having to deal with administrative tasks. HR must be able to focus on employee development and act as a mentor. It must be able to find the perfect new colleague and pursue the organization's goals. Deploying a chatbot takes exactly the work off HR's hands that stands in the way of this. A company can then not only focus on what matters, but also give its employees the room to do what they are good at, with the bot always standing by with the right information and the right advice. That is why an intelligent organization always has room for a chatbot in HR.

    Author: Joris Jonkman

    Source: Emerce

  • Integrating security, compliance, and session management when deploying AI systems

    Integrating security, compliance, and session management when deploying AI systems

    As enterprises adopt AI (artificial intelligence), they'll need a sound deployment framework that enables security, compliance, and session management.

    As accessible as the various dimensions of AI are to today's enterprise, one simple fact remains: embedding scalable AI systems into core business processes in production depends on a coherent deployment framework. Without it, AI's potential automation and acceleration benefits almost certainly become liabilities, or will never be fully realized.

    This framework functions as a guardrail for protecting and managing AI systems, enabling their interoperability with existing IT resources. It's the means by which AI implementations with intelligent bots interact with one another for mission-critical processes.

    With this method, bots are analogous to railway cars transporting data between sources and systems. The framework is akin to the tracks the cars operate on, helping the bots to function consistently and dependably. It delivers three core functions:

    • Security
    • Compliance and data governance
    • Session management

    With this framework, AI becomes as dependable as any other well-managed IT resource. The three core functions each need to be supported as follows.


    Security

    A coherent AI framework primarily solidifies a secure environment for applied AI. AI is a collection of various cognitive computing technologies: machine learning, natural language processing (NLP), etc. Applied AI is the application of those technologies to fundamental business processes and organizational data. It is therefore imperative for organizations to tailor their AI frameworks to their particular security needs, using measures such as encryption or tokenization.

    When AI is subjected to these security protocols the same way employees or other systems are, there can be secure communication between the framework and external resources. For example, organizations can access optical character recognition (OCR) algorithms through AWS or cognitive computing options from IBM's Watson while safeguarding their AI systems.

    Compliance and data governance

    In much the same way organizations personalize their AI frameworks for security, they can also customize them for the various dimensions of regulatory compliance and data governance. Of cardinal importance is the treatment of confidential, personally identifiable information (PII), particularly with the passage of GDPR and other privacy regulations.

    For example, when leveraging NLP it may be necessary to communicate with external NLP engines. The inclusion of PII in such exchanges is inevitable, especially when dealing with customer data. However, the AI framework can be adjusted so that when PII is detected, it's automatically compressed, mapped, and rendered anonymous so bots deliver this information only according to compliance policies. It also ensures users can access external resources in accordance with governance and security policies.
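    As an illustration of that kind of policy, the sketch below shows one hypothetical way to mask PII before text leaves the trusted boundary: each match is replaced with a stable anonymous token, so bots only ever exchange the masked form. The patterns, token format, and function names here are invented for the example, not taken from any real framework.

```python
import hashlib
import re

# Hypothetical PII patterns; a real framework would use far more robust detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def anonymize(text: str) -> str:
    """Replace PII with deterministic pseudonyms (same input -> same token)."""
    for label, pattern in PII_PATTERNS.items():
        def repl(match, label=label):
            digest = hashlib.sha256(match.group().encode()).hexdigest()[:8]
            return f"<{label}:{digest}>"
        text = pattern.sub(repl, text)
    return text

masked = anonymize("Contact jane.doe@example.com or +31 6 1234 5678.")
```

    Because the tokens are deterministic, downstream bots can still correlate records about the same person without ever seeing the raw value.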

    Session management

    The session management capabilities of coherent AI frameworks are invaluable for preserving the context between bots for stateful relevance of underlying AI systems. The framework ensures communication between bots is pertinent to their specific functions in workflows.

    Similar to how DNA is passed along, bots can contextualize the data they disseminate to each other. For example, a general-inquiry bot may answer users' questions about various aspects of a job. However, once someone applies for the position, that bot must understand the context of the application data and pass it along to an HR bot. The framework provides this session management for the duration of the data's journey within the AI systems.
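    To make that hand-off concrete, here is a deliberately simplified sketch of the pattern; the class and field names are invented for illustration, not part of any real product.

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Carries context for the lifetime of one user's journey across bots."""
    session_id: str
    context: dict = field(default_factory=dict)

class InquiryBot:
    def handle(self, session: Session, question: str) -> str:
        session.context["last_question"] = question  # record context for later bots
        return "Here is what the job involves..."

class HRBot:
    def handle(self, session: Session, application: dict) -> str:
        # The HR bot sees the context the inquiry bot recorded earlier.
        session.context["application"] = application
        return f"Application received; earlier question was: {session.context['last_question']}"

session = Session("s-001")
InquiryBot().handle(session, "What does the role pay?")
reply = HRBot().handle(session, {"name": "candidate"})
```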

    Key benefits

    The outputs of the security, compliance, and session management functions respectively enable three valuable benefits:

    No rogue bots: AI systems won't go rogue thanks to the framework's security. The framework ingrains security within AI systems, extending the same benefits for data privacy. This can help you comply with today's strict regulations in countries such as Germany and India about where data is stored, particularly data accessed through the cloud. The framework prevents data from being stored or used in ways contrary to security and governance policies, so AI can safely use the most crucial system resources.

    New services: The compliance function makes it easy to add new services external to the enterprise. Revisiting the train analogy, a new service is like a new car on the track. The framework incorporates it within the existing infrastructure without untimely delays so firms can quickly access the cloud for any necessary services to assist AI systems.

    Critical analytics: Finally, the session management function issues real-time information about system performance, which is important when leveraging multiple AI systems. It enables organizations to define metrics relevant to their use cases, identify anomalies, and increase efficiency via a machine-learning feedback loop with predictions for optimizing workflows.

    Necessary advancements

    Organizations that develop and deploy AI-driven business applications that can think, act, and complete processes autonomously without human intervention will need a sound deployment framework. Delivering a road map for what data is processed as well as how, where, and why, the framework aligns AI with an organization's core values and is vital to scaling these technologies for mission-critical applications. It's the foundation for AI's transformative potential and, more important, its enduring value to the enterprise.

    Author: Ramesh Mahalingam

    Source: TDWI

  • Is Artificial Intelligence shaping the future of Market Intelligence?

    Is Artificial Intelligence shaping the future of market intelligence?

    Global developments, increasing competition, shifting consumer demands... These are only a few of the countless external forces that will shape the exciting world of tomorrow.

    As a company, how can you be prepared for rapid changes in the environment?

    That's where market intelligence proves its value.

    Companies require proactive and forward-thinking market intelligence in order to detect and react to critical market signals. This kind of intelligence is critical to guarantee sustainable profits and ensure survival in today’s highly competitive environment.

    The market intelligence field over the years

    Just like the world itself, the market intelligence field has seen some major changes over the past couple of decades. For example, the rise and popularity of social media has made it notably easier to track data about consumers and competitors. It is widely accepted that this field will undergo changes at an even higher pace in the future, due to significant technical, social and organizational developments.

    But what are the developments and trends that will impact market intelligence most over the next few years? According to the research paper State of the Art and Trends of Market Intelligence, the most impactful developments are Artificial Intelligence, Data Visualization, and the GDPR legislation. The focus of this article is on the role of Artificial Intelligence (AI).

    Artificial Intelligence

    Artificial Intelligence is the intelligence displayed by machines, often characterized by learning and the ability to adapt to changes. According to Qualtrics, 93% of market researchers see AI as an opportunity for the research business.

    Where can AI add value?

    AI can add value in the processing of large and unstructured datasets. Open-ended data can be processed with ease thanks to AI technologies such as Natural Language Processing (NLP). NLP enables computers to understand, interpret and manipulate natural human language, and can thus assist in tracking the sentiment of different sentences. In business, this can be applied to the assessment of reviews, for example, which is usually a slow task. With NLP, however, this process can be streamlined efficiently.
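    As a toy illustration of the idea (a real NLP engine relies on trained language models rather than a hand-made word list), review sentiment can be scored like this:

```python
# Invented mini-lexicon; real sentiment models are learned from data.
POSITIVE = {"great", "excellent", "good", "love", "fast"}
NEGATIVE = {"bad", "poor", "slow", "broken", "hate"}

def sentiment(review: str) -> str:
    """Label a review by counting positive vs. negative words."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

labels = [sentiment(r) for r in [
    "Excellent product and fast delivery",
    "Broken on arrival and poor support",
]]
```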

    NLP can also be used as an add-on for language translation programs. It allows for the rudimentary translation of a text, before it is translated by a human. This method also makes it possible to quickly translate reports and documents written in another language, which can be very beneficial during the collection of raw data.

    Additionally, NLP can assist with practices like Topic Modeling, which consists of the automatic generation of keywords for different articles and blogs. This tool makes the process of building a huge set of labeled data more convenient. Another method, which also utilizes NLP, is Text Classification: an algorithm that can automatically suggest a related category for a specific article or news item.
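    A minimal sketch of that text-classification idea, with invented category keywords standing in for a trained model, could look like this:

```python
from collections import Counter

# Hypothetical keyword profiles per category (a trained classifier would learn these).
CATEGORY_KEYWORDS = {
    "finance": {"market", "fraud", "bank", "trading"},
    "healthcare": {"patient", "diagnosis", "treatment", "clinical"},
}

def suggest_category(text: str) -> str:
    """Suggest the category whose keyword profile best overlaps the text."""
    words = set(text.lower().split())
    overlap = Counter({cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()})
    return overlap.most_common(1)[0][0]

category = suggest_category("New model improves patient diagnosis accuracy")
```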

    Desk research is extremely valuable in the process of gathering relevant market intelligence. However, it is very time-consuming. This is problematic, because important insights may not arrive at the desk of the specific decision maker in time, which can be detrimental to a company’s ability to react quickly in a fast-changing business environment. AI can speed up this process, as it can rapidly read all kinds of sources and identify trends significantly faster than traditional desk research ever could.

    The future of market intelligence

    Clearly, the applications mentioned in this article are just a selection of the wide range of possibilities AI provides within the field of market research and intelligence. The popularity of this technology is increasing rapidly, and it can unlock stunning amounts of relevant and rich information for all kinds of fields.

    Does this imply that traditional methods and analysis are now redundant?

    Of course not! AI also has its own limitations.

    In the next few years, the true value of AI and other technological developments will be shown. The real power lies in the combination of AI with more traditional research methods. The results will allow businesses to arrive at actionable insights faster, and in turn, improve solid and data-driven decision-making. This way market intelligence can help companies take the steps that lead to tomorrow’s success.

    Author: Kees Kuiper

    Source: Hammer Intel

  • Is the multicloud a reason for the slow adoption of AI at organizations?

    Is the multicloud a reason for the slow adoption of AI at organizations?

    Despite its enormous potential, the adoption of AI is relatively slow. According to Efrym Willems, Business Development IBM Watson, Analytics, IoT & IBM Cloud at Tech Data, the multicloud is an often-heard reason to ignore the technology. In his view, wrongly so: 'AI is now a realistic option in multicloud environments as well.'

    In early 2019, low-code developer Mendix announced a far-reaching integration of its platform with IBM Cloud Services. It gives application developers easy access to the functions of the artificial intelligence (AI) platform IBM Watson. Moreover, applications developed with Mendix run directly in the IBM Cloud. At first glance that may seem a detail, but it is an important step toward broader adoption of AI in multicloud environments. And a welcome development it is: according to analysts and AI vendors, AI adoption is lagging, and the multicloud stands in the way. Organizations do not know how to bring their fragmented data landscape together. 'Yet IBM proves that multicloud does not have to be a barrier at all,' says Willems.

    AI compiles summaries and makes diagnoses

    Good news, because AI is proven to be effective. A recent example is the highlights reel of the Wimbledon final, which was compiled entirely by the AI system IBM Watson. 'The battle between tennis legends Roger Federer and Novak Djokovic in the 2019 Wimbledon final lasted almost five hours. Yet a summary was ready two minutes after the match. The system selected the highlights based on the sound and facial expressions of the crowd. Twenty minutes after the final, even personalized summaries were available,' says the Tech Data expert. AI has also proven its value in areas such as optimizing production processes and improving healthcare. Dutch hospitals, for example, are experimenting extensively with AI applications, including in diagnostics. 'Watson made the correct diagnosis within 10 minutes for a woman with a rare form of leukemia.'


    Yet organizations have not jumped on the AI train en masse. According to research, only a quarter of organizations have a company-wide AI strategy. Concerns about data integration often get in the way. 'The structured and unstructured data needed for analysis are often spread across multiple locations, both in the cloud and on-premises,' Willems explains. 'But that no longer has to be a problem. Effective use of AI is perfectly possible in the multicloud.' A number of conditions do matter, however:

    1. AI on all platforms

    For AI to work well, the technology must be present on all platforms where the data and applications are in use. 'That is exactly why IBM has made its Watson solution available on various platforms, via microservices,' Willems explains. 'These run in a Kubernetes container. The microservices run on-premises or in the IBM Cloud, but also function fine in the clouds of Microsoft, Amazon, and Google, for example. So the AI comes to the data, instead of all the data having to come to the AI. This approach offers another advantage as well: it prevents organizations from being locked into a specific environment.'

    2. Data connectors

    The above is not sufficient for every organization. Data can be fragmented even further, for example across environments such as Dropbox, Salesforce, Tableau, and Looker. In those cases, it is important that data connectors are available for these environments, so the AI solution can still use the data stored there. In addition, IBM last year enriched Watson Studio, its platform for data science and machine learning, with improved integration with Hadoop distributions (CDH and HDP). According to Willems, this likewise makes it possible to run analytics where the data resides and to use the compute power available there.

    3. Alternative: bring the data to one place

    An alternative is to bring datasets together on a central platform. 'IBM Cloud, the new name for SoftLayer since 2018, offers that option. For example with IaaS or PaaS services, or simply by offering cloud storage.' It is also possible to integrate IaaS and PaaS services into a multicloud environment, Willems adds.

    4. Broad support for development tools

    In the scenario outlined above, the integration of Mendix with IBM Cloud is an important development for AI adoption. 'Once the data has been consolidated, purpose-built apps can unlock and analyze it,' says Willems. 'Developing those apps is fast and relatively easy with low-code and no-code platforms from providers such as Mendix or OutSystems.' In addition, IBM Bluemix, IBM's developer toolset, is of course now available under the IBM Cloud flag.

    No obstacle

    With the points above taken care of, AI can add value regardless of the chosen cloud deployment. 'Whether an organization brings the AI to the data or the other way around: in either case, a multicloud environment is no longer an obstacle,' Willems concludes.

    Source: BI-Platform

  • Artificial intelligence learns to drive with GTA


    Anyone who has ever played Grand Theft Auto (GTA) knows the game is not made for following the rules. Yet, according to researchers at the Technical University of Darmstadt, GTA can help an artificial intelligence learn to drive through traffic. So writes MIT's university magazine, Technology Review.

    Researchers therefore also use the game to teach algorithms how to behave in traffic. According to the university, the realistic world of computer games such as GTA is very well suited to understanding the real world better. Virtual worlds are already being used to feed data to algorithms, but by using games those worlds do not have to be created from scratch.

    Learning to drive in Grand Theft Auto works much the same as in the real world. For self-driving cars, objects and people, such as pedestrians, are labeled. Those labels can be fed to the algorithm, enabling it to distinguish between different objects or fellow road users in both the real world and the video game.

    It is not the first time artificial intelligence has been put to work on computer games. Researchers have already worked on a smart Mario, and Minecraft is used for a purpose similar to GTA's: Microsoft uses that virtual world to teach characters how to maneuver through their environment. The knowledge gained can later be used to let robots overcome similar obstacles in the real world.

    Source: numrush.nl, September 12, 2016


  • Machine learning is changing the data center

    Every year, the technology industry seems to come up with new products that have the capability to manage themselves. From cars that tell us if we’re backing up too fast to AC units that turn on when they realize the residents are on their way home, we’re seeing technology continue to advance in its ability to self-manage.

    The next logical step we are seeing is self-managing data centers, where automation and machine learning handle administrative storage tasks.

    Even for those who don’t believe machines can execute the tasks of an IT manager more effectively than their human counterparts, the efficiency gains from offloading repetitive functions, or from making connections between dissimilar, often unrecognized events, should give businesses the ability to focus on strategic objectives that will help the company flourish.

    Like self-driving cars, the self-managed data center that rarely needs human intervention could be coming sooner than we think. Data centers are increasingly utilizing full self-managed capabilities, which wouldn’t be possible without automation and machine learning technology. 

    Below are the three main trends that are helping to make self-managed data centers a reality.

    Promising performance without intervention

    Automation and machine learning offer multiple capabilities that aid in developing the self-managed data center. 

    One is that organizations can guarantee performance without intervention. With traditional storage, applications compete for resources from a fixed number of buckets or IOPS. Guaranteeing a set number of IOPS for a particular application prevents organizations from accessing those IOPS for other apps. 

    Automation enables organizations to access IOPS resources and allows virtual machines (VMs) to employ them for other necessary purposes. So, although it ensures a clear lane for every VM, it also enables the VMs to access IOPS as necessary. 

    This approach avoids the danger of saving and wasting unused IOPS, instead making them available when needed.
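    A simplified sketch of that idea (the policy and numbers are invented for illustration): every VM keeps a guaranteed floor of IOPS, and whatever headroom the pool has left is lent out on demand instead of sitting idle.

```python
def allocate_iops(pool_total: int, demands: dict, floor: int) -> dict:
    """Grant each VM up to its demand: a guaranteed floor first, then spare capacity."""
    granted = {vm: min(d, floor) for vm, d in demands.items()}
    spare = pool_total - sum(granted.values())
    for vm, d in sorted(demands.items(), key=lambda kv: kv[1]):  # smallest demands first
        extra = min(d - granted[vm], spare)
        granted[vm] += extra
        spare -= extra
    return granted

grants = allocate_iops(1000, {"vm1": 300, "vm2": 800, "vm3": 100}, floor=200)
```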

    Ensuring a clear lane for every virtual machine

    In the future, machine learning and automation promise to optimize the performance of storage arrays and predict future usage trends. Machine learning can analyze past performance to predict trends for the next two months, for example, giving organizations insight into what’s necessary to optimize performance and capacity for storage-array pools.

    Machine learning should enable organizations to move VMs from a particular array to somewhere else in the pool through its ability to analyze performance trends. Furthermore, it would allow organizations to predict and address poor performance on an array.

    Machine learning can also help businesses plan for their future. Analytics would enable organizations to improve predictions and make savvier decisions about infrastructure requirements to avoid downtime. It’s like building another wing on an apartment complex to address growing resident occupancy in the future.
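    The kind of trend prediction described above can be sketched as a least-squares line fitted to monthly usage history and extrapolated forward (the figures are made up):

```python
def linear_trend(values):
    """Fit y = slope * x + intercept to equally spaced observations."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    return slope, mean_y - slope * mean_x

usage_tb = [40, 44, 48, 52, 56, 60]                 # six months of capacity use
slope, intercept = linear_trend(usage_tb)
forecast = [slope * x + intercept for x in (6, 7)]  # the next two months
```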

    Optimizing the performance of storage arrays and predicting future usage trends

    Additionally, by giving each VM its own lane, organizations could make optimal use of all their performance all the time. On those rare occasions when VMs ask for more than the storage can deliver, performance could be assigned dynamically to applications that require it rather than on a first-in, first-out basis.

    As apps and devices that use machine learning develop further, companies will look for new and exciting ways to incorporate AI.

    The controversial debates about AI will continue, but there are ways to utilize it without going overboard and giving over too much control. 

    Automated, self-managed data centers are becoming a reality, promising real-time, predictable performance without IT intervention. Even dense IT infrastructure that’s typically difficult and time-consuming to upgrade and control is becoming automated and divided into elements managed through software instead of hardware. These data centers are increasingly utilizing full self-managing capabilities. 

    Ultimately, with the combination of AI and machine learning, IT teams should finally have the ability to focus their time on more important tasks that add real value to the company rather than being stuck in the back end of the data center. 

    The data center that manages itself and rarely needs assistance has the potential to arrive sooner than expected. In the coming months, you’ll start to see how machine-learning-based intelligent automation becomes a critical component of modern data centers.

    Author: Chris Colotti

    Source: Information Management

  • Machine learning, AI, and the increasing attention for data quality

    Machine learning, AI, and the increasing attention for data quality

    Data quality has been going through a renaissance recently.

    As a growing number of organizations increase efforts to transition computing infrastructure to the cloud and invest in cutting-edge machine learning and AI initiatives, they are finding that the main barrier to success is the quality of their data.

    The old saying “garbage in, garbage out” has never been more relevant. With the speed and scale of today’s analytics workloads and the businesses that they support, the costs associated with poor data quality are also higher than ever.

    This is reflected in a massive uptick in media coverage on the topic. Over the past few months, data quality has been the focus of feature articles in The Wall Street Journal, Forbes, Harvard Business Review, MIT Sloan Management Review and others. The common theme is that the success of machine learning and AI is completely dependent on data quality. A quote by Thomas Redman summarizes this dependency very well: "If your data is bad, your machine learning tools are useless."

    The development of new approaches towards data quality

    The need to accelerate data quality assessment, remediation and monitoring has never been more critical for organizations and they are finding that the traditional approaches to data quality don’t provide the speed, scale and agility required by today’s businesses.

    For this reason, highly rated data preparation business Trifacta recently announced an expansion into data quality and unveiled two major new platform capabilities: active profiling and smart cleaning. This is the first time Trifacta has expanded its focus beyond data preparation. By adding new data quality functionality, the business aims to handle a wider set of data management tasks as part of a modern DataOps platform.

    Legacy approaches to data quality involve many manual, disparate activities as part of a broader process. Dedicated data quality teams, often disconnected from the business context of the data they are working with, manage the process of profiling, fixing and continually monitoring data quality in operational workflows. Each step must be managed in a completely separate interface. It’s hard to iteratively move back-and-forth between steps such as profiling and remediation. Worst of all, the individuals doing the work of managing data quality often don’t have the appropriate context for the data to make informed decisions when business rules change or new situations arise.

    Trifacta uses interactive visualizations and machine intelligence to guide users, highlighting data quality issues and providing intelligent suggestions on how to address them. Profiling, user interaction, intelligent suggestions, and guided decision-making are all interconnected, each driving the others. Users can seamlessly transition back and forth between steps to ensure their work is correct. This guided approach lowers the barrier to entry and helps democratize the work beyond siloed data quality teams, allowing those with the business context to own and deliver quality outputs more efficiently to downstream analytics initiatives.

    New data platform capabilities like this are only a first (albeit significant) step into data quality. Keep your eyes open and expect more developments towards data quality in the near future!

    Author: Will Davis

    Source: Trifacta

  • Machine learning: definition and opportunities

    Machine learning: definition and opportunities

    What is machine learning?

    Machine learning is an application of artificial intelligence (AI) that gives computers the ability to continually learn from data, identify patterns, make decisions and improve from experience in an autonomous fashion over time without being explicitly programmed.
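    A minimal illustration of that definition: a one-nearest-neighbour classifier whose behaviour improves simply by accumulating labelled examples, with no rule ever being explicitly programmed (the data points and labels below are invented).

```python
def nearest_label(examples, point):
    """Return the label of the training example closest to `point`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(examples, key=lambda ex: sq_dist(ex[0], point))[1]

# "Experience" is just a growing list of (features, label) pairs.
examples = [((1.0, 1.0), "low"), ((9.0, 9.0), "high")]
label = nearest_label(examples, (8.0, 8.5))  # nearest to (9.0, 9.0)
```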

    How big will it be?

    According to the International Data Corporation (IDC), spending on AI and machine learning will grow from $US37.5 billion in 2019 to $US97.9 billion by 2023.

    What’s the opportunity?

    Machine learning is having a big impact on the healthcare industry by using data from wearables and sensors to assess a patient’s health in real-time. Unexpected and unpredictable patterns can be identified from masses of independent variables leading to improved diagnosis, treatment and prevention.

    Machine learning is also being used in the financial services industry as a way of preventing fraud. Systems can analyze millions of bits of data relating to online buyer and seller behavior, which by themselves wouldn’t be conclusive, but together can form a strong indication of fraud.
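    The point that individual signals are inconclusive while their combination is telling can be sketched as a weighted score (the signals, weights, and threshold here are invented for illustration):

```python
# Hypothetical weights for weak fraud indicators.
SIGNAL_WEIGHTS = {
    "new_account": 0.2,
    "mismatched_country": 0.3,
    "rapid_purchases": 0.3,
    "unusual_amount": 0.4,
}

def fraud_score(signals: set) -> float:
    """Sum the weights of the signals present in a transaction."""
    return sum(SIGNAL_WEIGHTS[s] for s in signals)

ALERT_THRESHOLD = 0.7
# One weak signal alone stays below the threshold...
single = fraud_score({"new_account"})
# ...but several together cross it and trigger a review.
flagged = fraud_score({"new_account", "mismatched_country", "rapid_purchases"}) >= ALERT_THRESHOLD
```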

    Oil and gas is another industry starting to benefit from machine learning technology. For example, miles of pipeline footage shot by drones can be analyzed pixel by pixel, identifying potential structural weaknesses that humans would not be able to see.


    As more and more industries rely on huge volumes of data, and with computational processing becoming cheaper and more powerful, machine learning will lead to rapid and significant improvements to business processes and transform the way companies make decisions. Models can be quickly and automatically produced that can analyze bigger, more complex data sets, uncover previously unknown patterns and identify key insights, opportunities and risks.

    Source: B2B International

  • Machines are shaping the future of customer experiences

    Machines are shaping the future of customer experiences

    Global research by Futurum Research, commissioned by SAS, shows that by 2030, 67% of all interactions between customers and companies will be handled by smart machines. The research shows that technology will be the biggest driving force behind the customer experience. This means brands must re-examine their customer ecosystems in order to respond to the empowered consumer and to new technologies.

    In recent years, technology has completely upended the way companies and consumers interact. Consumer behavior and preferences are constantly changing. What will the customer experience look like in 2030? And what must brands do to meet the expectations of the future consumer? These are a few of the questions addressed in the study 'Experience 2030: The Future of Customer Experience', conducted by Futurum Research and sponsored by SAS.

    Flexibility and far-reaching automation

    The companies surveyed in this study foresee a large-scale shift toward automated customer interactions by 2030. The study predicts that smart machines will replace humans and handle roughly two thirds of all customer communication, decisions during real-time interactions, and decisions about marketing campaigns. According to the study, by 2030, 67% of interactions between companies and consumers that use digital technology (online, mobile, assistants, etc.) will be handled by smart machines instead of human employees. And by 2030, 69% of decisions made during a customer interaction will be made by smart machines.

    Consumers embrace new technology

    According to the study, 78% of all companies believe consumers currently feel uncomfortable dealing with technology in stores. However, the research shows that this applies to only 35% of consumers. This gap in perception between companies and consumers could become a limiting factor for these companies' growth.

    For companies, this level of customer acceptance and expectation creates new opportunities to deepen customer engagement. To meet the ever-rising expectations on both sides, however, companies need new solutions that close the gap between consumer technology and marketing technology.

    Investments in AI and AR/VR

    The future of customer experience will largely be determined by new technologies. In this study, companies were asked which future technologies they are currently investing in to support new customer experiences and improve customer satisfaction in 2030. According to the study, 62% of all companies are investing in voice-driven AI assistants to improve customer interaction and customer support. Another 58% are investing in voice-driven AI as a tool for marketing and sales departments. 54% of all companies are investing in augmented reality (AR) and virtual reality (VR) to help consumers visualize the form or use of a product or service remotely, and 53% plan to deploy AR/VR tools to optimize product use and consumer self-service.

    All these emerging and more complex customer-engagement technologies mean that brands must rethink their capabilities in data management, analytical optimization, and automated decision-making. They must be able to put this new technology to work for tangible business results. These new applications will be able to collect, process, and analyze data to feed the 'multimedia marketing' that will determine future success.

    Sleutel tot succes

    Misschien wel de grootste uitdaging voor bedrijven op dit moment is het vermogen om de vertrouwenskloof tussen bedrijven en consumenten te dichten. Consumenten zijn terughoudend in de manier waarop bedrijven met hun persoonlijke gegevens omgaan en voelen zich niet bij machte om hier verandering in te brengen. Slechts 54% van alle consumenten is er gerust op dat bedrijven hun data op vertrouwelijke wijze zullen behandelen. Slechts 54% van de consumenten vertrouwt erop dat bedrijven hun gegevens geheimhouden. Dit is een uitdaging voor bedrijven bij het optimaliseren van de customer experience, om de juiste balans te vinden tussen de hoeveelheid informatie die ze opvragen en het vertrouwen van klanten. Uit de onderzoeksresultaten blijkt echter dat bedrijven zich wel degelijk bewust zijn van de risico’s die ze lopen. 59% van hen is het sterk eens met de stelling dat het beveiligen van klantgegevens de allerbelangrijkste factor is voor een goede klantervaring. De vraag is echter of bedrijven daar klaar voor zijn. Het onderzoek doet vermoeden dat zij hierbij de nodige problemen ondervinden. 84% van alle bedrijven maakt zich namelijk zorgen over wijzigingen in overheidsrichtlijnen ten aanzien van privacy en de mate waarin ze daaraan kunnen voldoen.

    Over de onderzoeksmethodiek

    Futurum Research heeft in mei 2019 ruim 4.000 respondenten in 36 landen ondervraagd in verschillende sectoren en overheidsinstellingen. De onderzoeksresultaten zijn bekend gemaakt tijdens Analytics Experience in Milaan. Hier komen meer dan 1800 data scientists en zakelijke professionals samen om te leren over nieuwe toepassingen en best practices op het gebied van analytics.  

    Bron: BI platform

  • MicroStrategy: Take your business to the next level with machine learning

    MicroStrategy: Take your business to the next level with machine learning

    It’s been nearly 22 years since history was made across a chess board. The place was New York City, and the event was Game 6 of a series of games between IBM’s “Deep Blue” and the renowned world champion Garry Kasparov. It was the first time ever a computer had defeated a player of that caliber in a multi-game scenario, and it kicked off a wave of innovation that’s been methodically working its way into the modern enterprise.

    Deep Blue was a formidable opponent because of its brute-force approach to chess. In a game where luck is entirely removed from the equation, it could run a search algorithm on a massive scale to evaluate moves, discarding candidate moves once they proved to be less valuable than a previously examined and still available option. This giant decision tree powered the computer to a winning position in just 19 moves, with Kasparov resigning.
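
    The pruning strategy described here, abandoning a candidate move as soon as it proves worse than an option already in hand, is the classic alpha-beta refinement of minimax search. A minimal sketch over an abstract game tree, where the leaf values are hypothetical evaluation scores:

```python
def alphabeta(node, depth, alpha, beta, maximizing):
    """Minimax search with alpha-beta pruning over a nested-list game tree."""
    if depth == 0 or not isinstance(node, list):
        return node  # leaf: a static evaluation score
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # prune: the opponent will never allow this branch
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break  # prune symmetrically for the minimizing player
        return value

# A tiny two-ply tree: the maximizer picks the branch whose worst case is best.
tree = [[3, 5], [2, 9], [0, 7]]
best = alphabeta(tree, 2, float("-inf"), float("inf"), True)  # → 3
```

    In the second branch, once the minimizer finds the value 2 (worse for the maximizer than the 3 already secured), the remaining moves in that branch are never examined, which is exactly how Deep Blue kept its massive search tractable.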

    As impressive as Deep Blue was back then, present-day computing capabilities are stronger by orders of magnitude, and modern approaches are inspired by the neural networks of the human brain. Data scientists create inputs and define outputs to detect previously indecipherable patterns, important variables that influence games, and, ultimately, the next move to take.

    Models can also continue to ‘learn’ from playing different scenarios and then update the model through a process called ‘reinforcement learning’ (as the Go-playing AlphaZero program does). The result of this? The ability to process millions of scenarios in a fraction of a second to determine the best possible action, with implications far beyond the gameboard.

    Integrating machine learning models into your business workflows comes with its challenges: business analysts are typically unfamiliar with machine learning methods and/or lack the coding skills necessary to create viable models; integration issues with third-party BI software may be a nonstarter; and the need for governed data to avoid incorrectly trained models is a barrier to success.

    As a possible solution, one could use MicroStrategy as a unified platform for creating and deploying data science and machine learning models. With APIs and connectors to hundreds of data sources, analysts and data scientists can pull in trusted data. And when using the R integration pack, business analysts can produce predictive analytics without coding knowledge and disseminate those results throughout their organization.

    The use cases are already coming in as industry leaders put this technology to work. As one example, a large governmental organization reduced employee attrition by 10% using machine learning, R, and MicroStrategy.

    Author: Neil Routman

    Source: MicroStrategy

  • Pattern matching: The fuel that makes AI work

    Pattern matching: The fuel that makes AI work

    Much of the power of machine learning rests in its ability to detect patterns. That power comes from training machine learning algorithms on example data so that, when future data is presented, the trained model can recognize the pattern for a particular application. If you can train a system on a pattern, you can detect that pattern in the future. Indeed, pattern matching in machine learning (and its counterpart, anomaly detection) is what makes many applications of artificial intelligence (AI) work, from image recognition to conversational applications.

    As you can imagine, there are a wide range of use cases for AI-enabled pattern and anomaly detection systems. Pattern recognition, one of the seven core patterns of AI applications, is being applied to fraud detection and analysis, finding outliers and anomalies in big stacks of data; recommendation systems, providing deep insight into large pools of data; and other applications that depend on identification of patterns through training.

    Fraud detection and risk analysis

    One of the challenges with existing fraud detection systems is that they are primarily rules-based, using predefined notions of what constitutes fraudulent or suspicious behavior. The problem is that humans are particularly creative at skirting rules and finding ways to fool systems. Companies looking to reduce fraud, suspicious behavior or other risk are finding solutions in machine learning systems that can either be trained to recognize patterns of fraudulent behavior or, conversely, find outliers and anomalies to learned acceptable behavior.

    Financial systems, especially banking and credit card processing institutions, are early adopters in using machine learning to enable real-time identification of potentially fraudulent transactions. AI-based systems are able to handle millions of transactions per minute and use trained models to make millisecond decisions as to whether a particular transaction is legitimate. These models can identify which purchases don't fit usual spending patterns or look at interactions between paying parties to decide if something should be flagged for further inspection.
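
    As a small illustrative sketch of this outlier-flagging approach (not any bank's actual system), scikit-learn's IsolationForest can be trained on normal transaction amounts and then used to flag transactions that don't fit the learned pattern; all the data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic "normal" spending: amounts clustered around typical values.
normal_amounts = rng.normal(loc=50, scale=15, size=(1000, 1))

# Train on normal behavior only; contamination sets the expected anomaly rate.
model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal_amounts)

# Score new transactions: +1 = fits the learned pattern, -1 = flag for review.
new_transactions = np.array([[48.0], [52.0], [5000.0]])
labels = model.predict(new_transactions)
```

    The two ordinary amounts pass through, while the wildly out-of-pattern one is flagged, mirroring the millisecond accept/inspect decisions described above.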

    Cybersecurity firms are also finding significant value in the application of machine learning-based pattern and anomaly systems to bolster their capabilities. Rather than depending on signature-based systems, which are primarily oriented toward responding to attacks that have already been reported and analyzed, machine learning-based systems are able to detect anomalous system behavior and block those behaviors from causing problems to the systems or networks.

    These AI-based systems are able to adapt to continuously changing threats and can more easily handle new and unseen attacks. The pattern and anomaly systems can also help to improve overall security by categorizing attacks and improving spam and phishing detection. Rather than requiring users to manually flag suspicious messages, these systems can automatically detect messages that don't fit the usual pattern and quarantine them for future inspection or automatic deletion. These intelligent systems can also autonomously monitor software systems and automatically apply software patches when certain patterns are discovered.

    Uncovering insights in data

    Machine learning-based pattern recognition systems are also being applied to extract greater value from existing data. Machines can look at data to find insights, patterns and groupings and use the power of AI systems to find patterns and anomalies humans aren't always able to see. This has broad applicability to both back-office and front-office operations and systems. Whereas, before, data visualization was the primary way in which users could extract value from large data sets, machine learning is now being used to find the groupings, clusters and outliers that might indicate some deeper connection or insight.

    In one interesting example, through machine learning pattern analysis, Walmart discovered consumers buy strawberry pop-tarts before hurricanes. Using unsupervised learning approaches, Walmart identified the pattern of products that customers usually buy when stocking up ahead of time for hurricanes. In addition to the usual batteries, tarps and bottled water, it discovered that the rate of purchase of strawberry pop-tarts also increased. No doubt, Walmart and other retailers are using the power of machine learning to find equally unexpected, high-value insights from their data.
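
    As a toy sketch of this kind of unsupervised grouping (not Walmart's actual pipeline), scikit-learn's KMeans can find clusters in synthetic "customer behavior" data, and points far from every centroid surface as the candidate outliers worth a closer look:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two synthetic clusters in a 2-D feature space, standing in for
# distinct groups of purchasing behavior.
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))
X = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance to the nearest centroid: large values indicate atypical points.
distances = np.min(kmeans.transform(X), axis=1)
outliers = np.argsort(distances)[-5:]  # the five least typical observations
```

    In practice the features would be purchase histories rather than two columns of noise, but the workflow, cluster first, then inspect what sits inside (and outside) each cluster, is the same.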

    Automatically correcting errors

    Pattern matching in machine learning can also be used to automatically detect and correct errors. Data is rarely clean and often incomplete. AI systems can spot routine mistakes or errors and make adjustments as needed, fixing data, typos and process issues. Machines can learn what normal patterns and behavior look like, quickly spot and identify errors, automatically fix issues on their own and provide feedback if needed.

    For example, algorithms can detect outliers in medical prescription behavior, flag these records in real time and send a notification to healthcare providers when the prescription contains mistakes. Other automated error correction systems are assisting with document-oriented processes, fixing mistakes made by users when entering data into forms by detecting when data such as names are placed into the wrong fields or when other information is incomplete or inappropriately entered.

    Similarly, AI-based systems are able to automatically augment data by using patterns learned from previous data collection and integration activities. Using unsupervised learning, these systems can find and group information that might be relevant, connecting all the data sources together. In this way, a request for some piece of data might also retrieve additional, related information, even if not explicitly requested by the query. This enables the system to fill in the gaps when information is missing from the original source, correct errors and resolve inconsistencies.

    Industry applications of pattern matching systems

    In addition to the applications above, there are many use cases for AI systems that implement pattern matching in machine learning capabilities. One use case gaining steam is the application of AI for HR and staffing. AI systems are being tasked to find the best match between job candidates and open positions. While traditional HR systems are dependent on humans to make the connection or use rules-based matching systems, increasingly, HR applications are making use of machine learning to learn what characteristics of employees make the best hires. The systems learn from these patterns of good hires to identify which candidates should float to the surface of the resume pile, resulting in more optimal matches.

    Since the human is eliminated in this situation, AI systems can be used to screen candidates and select the best person, while reducing the risk of bias and discrimination. Machine learning systems can sort through thousands of potential candidates and reach out in a personalized way to start a conversation. The systems can even augment the data in the job applicant's resume with information it gleans from additional online sources, providing additional value.

    In the back office, companies are applying pattern recognition systems to detect transactions that run afoul of company rules and regulations. AI startup AppZen uses machine learning to automatically check all invoices and receipts against expense reports and purchase orders. Any items that don't match acceptable transactional patterns are sent for human review, while the rest are expedited through the process. Occupational fraud, on average, costs a company 5% of its revenues each year, with the annual median loss at $140,000, and over 20% of companies reporting losses of $1 million or more.

    The key to solving this problem is to put processes and controls in place that automatically audit, monitor, and accept or reject transactions that don't fit a recognized pattern. AI-based systems are definitely helping in this way, and we'll increasingly see them being used by more organizations as a result.

    Author: Ronald Schmelzer

    Source: TechTarget

  • Preserving privacy within a population: differential privacy

    Preserving privacy within a population: differential privacy

    In this article, I will present the definition of differential privacy and discuss how to preserve the privacy and personal data of users while using their data to train machine learning models or to derive insights with data science techniques.

    What is differential privacy?

    Differential privacy describes a promise, made by a data holder, or curator, to a data subject:

    ''You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.''

    At their best, differentially private database mechanisms can make confidential data widely available for accurate data analysis, without resorting to data clean rooms, data usage agreements, data protection plans, or restricted views.

    Nonetheless, data utility will eventually be consumed: the Fundamental Law of Information Recovery states that overly accurate answers to too many questions will destroy privacy in a spectacular way.

    Differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population.

    A medical database may teach us that smoking causes cancer, affecting an insurance company’s view of a smoker’s long-term medical costs. 

    Has the smoker been harmed by the analysis?

    Perhaps — his insurance premiums may rise, if the insurer knows he smokes. He may also be helped — learning of his health risks, he enters a smoking cessation program.

    Has the smoker’s privacy been compromised?

    It is certainly the case that more is known about him after the study than was known before, but was his information “leaked”?

    Differential privacy will take the view that it was not, with the rationale that the impact on the smoker is the same independent of whether or not he was in the study. It is the conclusions reached in the study that affect the smoker, not his presence or absence in the data set.

    Differential privacy ensures that the same conclusions, for example, smoking causes cancer, will be reached, independent of whether any individual opts into or opts out of the data set.
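
    The canonical way to realize this guarantee for numeric queries is the Laplace mechanism: add noise calibrated to the query's sensitivity (how much one person's presence can change the answer) divided by the privacy budget ε. A minimal sketch, where the count and the choice of ε are purely illustrative:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return a differentially private answer to a numeric query.

    Noise is drawn from Laplace(0, sensitivity / epsilon): the smaller
    epsilon (stronger privacy), the noisier the released answer.
    """
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
# Counting query: "how many smokers are in the data set?"
# Adding or removing one person changes the count by at most 1, so sensitivity = 1.
true_count = 342
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

    Any single individual opting in or out shifts the true count by at most one, which the noise drowns out, yet population-level conclusions (like the smoking-cancer link) survive because the noisy answer stays close to the truth.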

    Artificial Intelligence and the privacy paradox

    Consider an institution, e.g. the National Institutes of Health, the Census Bureau, or a social networking company, in possession of a dataset containing sensitive information about individuals. For example, the dataset may consist of medical records, socioeconomic attributes, or geolocation data. The institution faces an important tradeoff when deciding how to make this dataset available for statistical analysis.

    On one hand, if the institution releases the dataset (or at least statistical information about it), it can enable important research and eventually inform policy decisions.

    On the other hand, for a number of ethical and legal reasons it is important to protect the individual-level privacy of the data subjects. The field of privacy-preserving data analysis aims to reconcile these two objectives. That is, it seeks to enable rich statistical analyses on sensitive datasets while protecting the privacy of the individuals who contributed to them.

    Differential privacy and Machine Learning

    One of the most useful tasks in data analysis is machine learning: the problem of automatically finding a simple rule to accurately predict certain unknown characteristics of never before seen data.

    Many machine learning tasks can be performed under the constraint of differential privacy. In fact, the constraint of privacy is not necessarily at odds with the goals of machine learning, both of which aim to extract information from the distribution from which the data was drawn, rather than from individual data points.

    The goal in machine learning is very often similar to the goal in private data analysis. The learner typically wishes to learn some simple rule that explains a data set. However, she wishes this rule to generalize: the rule she learns should not only correctly describe the data she has on hand, but also correctly describe new data drawn from the same distribution.

    Generally, this means that she wants to learn a rule that captures distributional information about the data set on hand, in a way that does not depend too specifically on any single data point.

    Of course, this is exactly the goal of private data analysis: to reveal distributional information about the private data set without revealing too much about any single individual in it (remember the overfitting phenomenon?).

    It should come as no surprise then that machine learning and private data analysis are closely linked. In fact, as we will see, we are often able to perform private machine learning nearly as accurately, with nearly the same number of examples as we can perform non-private machine learning.

    Cryptography and privacy

    Some recent work has focused on machine learning or general computation over encrypted data.

    Recently, Google deployed a new system for assembling a deep learning model from thousands of locally learned models while preserving privacy, which they call Federated Learning.
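
    Google's production system is far more involved, but the core federated averaging idea can be sketched in a few lines: each client trains locally, and only its model weights, never the raw data, are sent to the server, which combines them in proportion to each client's data volume. The weight vectors and data counts below are hypothetical:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Combine locally trained model weights into one global model.

    Each client's weight vector is averaged in proportion to how much
    data it trained on; the raw training data never leaves the client.
    """
    coeffs = np.array(client_sizes) / sum(client_sizes)
    stacked = np.stack(client_weights)
    return (coeffs[:, None] * stacked).sum(axis=0)

# Three clients with hypothetical locally learned weights and data counts.
weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 100, 200]
global_weights = federated_average(weights, sizes)  # → array([3.5, 4.5])
```

    The server only ever sees weight vectors, and the third client's larger data set gives it proportionally more influence on the global model.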


    Differential privacy should not be seen as a limitation in any context. Rather, we should look at it as a watchdog for our compliance with the standards that govern sensitive data. We generate more data than we think and leave a digital footprint everywhere; thus, as researchers in machine learning and data science, we should focus more on this topic and find a fair trade-off between privacy and accurate models.

  • Pyramid Analytics' 5 main takeaways from the Insurance AI and Analytics USA conference in Chicago

    Pyramid Analytics' 5 main takeaways from the Insurance AI and Analytics USA conference in Chicago

    Pyramid Analytics was thrilled to participate in the Insurance AI and Analytics USA conference in beautiful Chicago, May 2-3. The goal of the conference was to provide education to insurance leaders looking for ways to use AI and ML to extract more value out of their data. In all of their conversations, the eagerness to do more with data was palpable, but a tinge of frustration could be detected beneath the surface.

    Curious to understand this contradiction, they started most of their conversations with the same basic question: 'What brings you to the show?', followed by a slightly deeper one: 'Where are you with your AI and ML initiatives?'

    The responses varied. However, a common thread emerged: despite the desire to incorporate AI and ML capabilities into routine business practices, roadblocks remain, regardless of carrier type. Chief among the attendees' concerns was the ability to access data; it appears that data silos are alive and well. We also heard many express frustrations with the tools used to derive AI and ML insights.

    Here are the most common reasons for attending the show, organized into five groups by persona:

    1. Data scientists looking for deeper access to data 

    The data scientists seemed to struggle with data access, which is often trapped within departments throughout the organization. To do their jobs effectively, data scientists need to access data so they can unlock trapped business value. They were seeking solutions that would help them bridge the gap between data and analytics.

    2. Executives from traditional organizations trying to understand the way forward

    To varying degrees, the insurance executives had AI and ML programs in place but weren’t satisfied with the results. They attended the conference to learn how they could extract more value from their AI and ML initiatives.

    3. Sophisticated insurers seeking technology to gain an edge on the competition

    This was a general takeaway from individuals from newer insurance companies who fit squarely in the “early technology adopter” category. Lacking the constraints of typical insurers (legacy processes and systems), these individuals were seeking information on new technologies and hoping to build partnerships with vendors to achieve further differentiation.

    4. Data and technology vendors looking to build meaningful partnerships

    There were many representatives from data and technology companies seeking out insurance partners looking to advance their businesses at the margins, either by enriching existing data stores or by finding new or unique data streams.

    5. Consultants promoting their unique approach to AI and ML initiatives

    It’s clear that AI and ML initiatives require more than just tools, people, and processes. They require strategic direction and a roadmap that builds consistency and accountability. There were a number of consultants making themselves available to insurers.

    Author: Michael Hollenbeck

    Source: Pyramid Analytics

  • Reusing data for ML? Hash your data before you create the train-test split

    Reusing data for ML? Hash your data before you create the train-test split

    The best way to make sure the training and test sets are never mixed while updating the data set.

    Recently, I was reading Aurélien Géron’s Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd edition) and it made me realize that there might be an issue with the way we approach the train-test split while preparing data for machine learning models. In this article, I quickly demonstrate what the issue is and show an example of how to fix it.

    Illustrating the issue

    I want to say upfront that the issue I mentioned is not always a problem per se and it all depends on the use case. While preparing the data for training and evaluation, we normally split the data using a function such as Scikit-Learn’s train_test_split . To make sure that the results are reproducible, we use the random_state argument, so however many times we split the same data set, we will always get the very same train-test split. And in this sentence lies the potential issue I mentioned before, particularly in the part about the same data set.

    Imagine a case in which you build a model predicting customer churn. You received satisfactory results, your model is already in production and generating value-added for a company. Great work! However, after some time, there might be new patterns among the customers (for example, global pandemic changed the user behavior) or you simply gathered much more data, as more customers joined the company. For any reason, you might want to retrain the model and use the new data for both training and validation.

    And this is exactly when the issue appears. When you use the good old train_test_split on the new data set (all of the old observations + the new ones you gathered since training), there is no guarantee that the observations you trained on in the past will still be used for training, and the same would be true for the test set. I will illustrate this with an example in Python:

    # import the libraries 
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from zlib import crc32
    # generate the first DataFrame
    X_1 = pd.DataFrame(data={"variable": np.random.normal(size=1000)})
    # apply the train-test split
    X_1_train, X_1_test = train_test_split(X_1, test_size=0.2, random_state=42)
    # add new observations to the DataFrame
    X_2 = pd.concat([X_1, pd.DataFrame(data={"variable": np.random.normal(size=500)})]).reset_index(drop=True)
    # again, apply the train-test split to the updated DataFrame
    X_2_train, X_2_test = train_test_split(X_2, test_size=0.2, random_state=42)
    # see what is the overlap of indices
    print(f"Train set: {len(set(X_1_train.index).intersection(set(X_2_train.index)))}")
    print(f"Test set: {len(set(X_1_test.index).intersection(set(X_2_test.index)))}")
    # Train set: 669
    # Test set: 59

    First, I generated a DataFrame with 1000 random observations. I applied the 80–20 train-test split using a random_state to ensure the results are reproducible. Then, I created a new DataFrame by adding 500 observations to the end of the initial DataFrame (resetting the index is important to keep track of the observations in this case!). Once again, I applied the train-test split and then investigated how many observations from the initial sets actually appear in the second ones. For that, I used the handy intersection method of Python's set. The answer is 669 out of 800 and 59 out of 200. This clearly shows that the data was reshuffled.

    What are the potential dangers of such an issue? It all depends on the volume of data, but it can happen that in an unfortunate random draw all the new observations will end up in one of the sets, and not help that much with proper model fitting. Even though such a case is unlikely, the more likely cases of uneven distribution among the sets are not that desirable either. Hence, it would be better to evenly distribute the new data to both sets, while keeping the original observations assigned to their respective sets.

    Solving the issue

    So how can we solve this issue? One possibility would be to allocate the observations to the training and test sets based on a certain unique identifier. We can calculate the hash of observations’ identifier using some kind of a hashing function and if the value is smaller than x% of the maximum value, we put that observation into the test set. Otherwise, it belongs to the training set.

    You can see an example solution (based on the one presented by Aurélien Géron in his book) in the following function, which uses the CRC32 algorithm. I will not go into the details of the algorithm; you can read about CRC here. Alternatively, here you can find a good explanation of why CRC32 can serve very well as a hashing function and what drawbacks it has (mostly in terms of security, but that is not a problem for us). The function follows the logic described in the paragraph above, where 2³² is the maximum value of this hashing function:

    def hashed_train_test_split(df, index_col, test_size=0.2):
        """Train-test split based on the hash of the unique identifier."""
        test_index = df[index_col].apply(lambda x: crc32(np.int64(x)))
        test_index = test_index < test_size * 2**32
        return df.loc[~test_index], df.loc[test_index]

    Note: The function above will work for Python 3. To adjust it for Python 2, we should follow crc32’s documentation and use it as follows: crc32(data) & 0xffffffff.

    Before testing the function in practice, it is really important to mention that you should use a unique and immutable identifier for the hashing function. And for this particular implementation, also a numeric one (though this can be relatively easily extended to include strings as well).

    In our toy example, we can safely use the row ID as a unique identifier, as we only append the new observations at the very end of the initial DataFrame and never delete any rows. However, this is something to be aware of while using this approach for more complex cases. So a good identifier might be the customer’s unique number, as by design those should only increase and there should be no duplicates.

    To confirm that the function is doing what we want it to do, we once again run the test scenario as shown above. This time, for both DataFrames we use the hashed_train_test_split function:

    # create an index column (should be immutable and unique)
    X_1 = X_1.reset_index(drop=False)
    X_2 = X_2.reset_index(drop=False)
    # apply the improved train-test split
    X_1_train_hashed, X_1_test_hashed = hashed_train_test_split(X_1, "index")
    X_2_train_hashed, X_2_test_hashed = hashed_train_test_split(X_2, "index")
    # see what is the overlap of indices
    print(f"Train set: {len(set(X_1_train_hashed.index).intersection(set(X_2_train_hashed.index)))}")
    print(f"Test set: {len(set(X_1_test_hashed.index).intersection(set(X_2_test_hashed.index)))}")
    # Train set: 800
    # Test set: 200

    While using the hashed unique identifier for the allocation, we achieved perfect overlap for both training and test sets.


    In this article, I showed how to use hashing functions to improve the default behavior of the train-test split. The described issue is not very apparent to many data scientists, as it mostly occurs when retraining ML models on new and updated data sets. So it is not often mentioned in textbooks, and one does not come across it while playing with example data sets, even the ones from Kaggle competitions. As I mentioned before, this might not even be an issue for us, as it really depends on the use case. However, I do believe that one should be aware of it and know how to fix it if the need arises.

    Author: Eryk Lewinson

    Source: Towards Data Science

  • SAS: 4 real-world artificial intelligence applications

    SAS: 4 real-world artificial intelligence applications

    Everyone is talking about AI (artificial intelligence). Unfortunately, a lot of what you hear about AI in movies and on the TV is sensationalized for entertainment.

    Indeed, AI is overhyped. But AI is also real and powerful.

    Consider this: engineers worked for years on hand-crafted models for object detection, facial recognition and natural language translation. Despite being honed by the best of our species, those algorithms' performance does not come close to what data-driven approaches can accomplish today. When we let algorithms discover patterns from data, they outperform human-coded logic for many tasks that involve sensing the natural world.

    The powerful message of AI is not that machines are taking over the world. It is that we can guide machines to generate tremendous value by unlocking the information, patterns and behaviors that are captured in data.

    Today I want to share four real-world applications of SAS AI and introduce you to five SAS employees who are working to put this technology into the hands of decision makers, from caseworkers and clinicians to police officers and college administrators.

    Augmenting health care with medical image analysis

    Fijoy Vadakkumpadan, a Senior Staff Scientist on the SAS Computer Vision team, is no stranger to the importance of medical image analysis. He credits ultrasound technology with helping to ensure a safe delivery of his twin daughters four years ago. Today, he is excited that his work at SAS could make a similar impact on someone else’s life.

    Recently, Fijoy’s team has extended the SAS Platform to analyze medical images. The technology uses an artificial neural network to recognize objects on medical images and thus improve healthcare.

    Designing AI algorithms you can trust

    Xin Hunt, a Senior Machine Learning Developer at SAS, hopes to have a big impact on the future of machine learning. She is focused on interpretability and explainability of machine learning models, saying, 'In order for society to accept it, they have to understand it'.

    Interpretability provides a mathematical understanding of the outputs of a machine learning model. You can use interpretability methods to show how the model reacts to changes in its inputs, for example.

    Explainability goes further than that. It offers full verbal explanations of how a model functions, what parts of the model logic were derived automatically, what parts were modified in post-processing, how the model meets regulations, and so forth.
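
    The idea of probing how a model reacts to input changes can be sketched as a simple one-feature sensitivity check (a generic illustration, not a SAS tool; the toy model and its weights are invented):

    ```python
    def sensitivity(model, x, feature_index, delta=1.0):
        """Measure how much the model's output moves when one input feature
        is nudged by `delta`, holding the other features fixed."""
        x_up = list(x)
        x_up[feature_index] += delta
        return (model(x_up) - model(x)) / delta

    # A toy "model": output depends strongly on feature 0, weakly on feature 1.
    toy_model = lambda x: 3.0 * x[0] + 0.1 * x[1]

    print(sensitivity(toy_model, [1.0, 2.0], 0))  # ~3.0: strong influence
    print(sensitivity(toy_model, [1.0, 2.0], 1))  # ~0.1: weak influence
    ```

    Comparing sensitivities across features gives a first, model-agnostic answer to "which inputs does this model actually react to?"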

    Making machine learning accessible to everyone

    From exploring and transforming data to selecting features and comparing algorithms, there are multiple steps to building a machine learning model. What if you could apply all those steps with the click of a button?

    That’s what the development teams of Susan Haller and Dragos Coles have done. Susan is the Director of Advanced Analytics R&D and Dragos is a Senior Machine Learning Developer at SAS. They are showing a powerful tool that offers an API for dynamic, automated model building. The resulting model is completely transparent, so you can examine and modify it after it is built.

    Deploying AI models in the field

    You can do everything right when building and refining a machine learning model, but if you do not deploy it where decisions are made, it will not do any good.

    Seb Charrot, a Senior Manager in the Scottish R&D Team, enjoys deploying analytics to solve real problems for real people. He and his team build SAS Mobile Investigator, an application that allows caseworkers, investigators and officers in the field to receive tasks, be notified of risks and concerns regarding their caseload or coverage area, and raise reports on the go.

    Moving AI into the real world

    When you move past the science-project phase of analytics and build solutions for the real world, you will find that you can enable everyone, not just those with data science degrees, to make decisions based on data. As a result, everyone’s jobs become easier and more productive. Plus, increased access to analytics leads to faster and more reliable decisions. Technology is unstoppable; it is who we are and what we do. Not just at SAS, but as a species.

    Author: Oliver Schabenberger

    Source: SAS

  • Technology advancements: a blessing and a curse for cybersecurity

    Technology advancements: a blessing and a curse for cybersecurity

    With the ever-growing impact of big data, hackers have access to more and more terrifying options. Here's what we can do about it.

    Big data is the lynchpin of new advances in cybersecurity. Unfortunately, predictive analytics and machine learning technology is a double-edged sword for cybersecurity. Hackers are also exploiting this technology, which means that there is a virtual arms race between cybersecurity companies and cybercriminals.

    Datanami has talked about the ways that hackers use big data to coordinate attacks. This should be a wake-up call for anybody who is not adequately prepared.

    Hackers exploit machine learning to avoid detection

    Jathan Sadowski wrote an article in The Guardian a couple years ago on the intersection between big data and cybersecurity. Sadowski said big data is to blame for a growing number of cyberattacks.

    In the evolution of cybercrime, phishing and other email-borne menaces represent increasingly prevalent threats. FireEye claims that email is the launchpad for more than 90% of cyber attacks, while a multitude of other statistics confirm that email is the preferred vector for criminals.

    This is largely because of criminals’ knowledge of machine learning. They use it to gain a better understanding of their targets, choose them more carefully, and penetrate defenses more effectively.

    That being said, people are increasingly aware of things like phishing attacks, and most people know that email links and attachments could pose a risk. Many are even on the lookout for suspicious PDFs, compressed archives, camouflaged executables, and Microsoft Office files with dodgy macros inside. Plus, modern anti-malware solutions are quite effective in identifying and stopping these hoaxes in their tracks. The trouble is that big data technology helps these criminals orchestrate more believable social engineering attacks.

    Credit card fraud represents another prominent segment of cybercrime, causing bank customers to lose millions of dollars every year. As financial institutions have become familiar with the mechanisms of these stratagems over time, they have refined their procedures to fend off card skimming and other commonplace exploitation vectors. They are developing predictive analytics tools with big data to prepare for threats before they surface.

    The fact that individuals and companies are often prepared for classic phishing and banking fraud schemes has incentivized fraudsters to add extra layers of evasion to their campaigns. The sections below highlight some of the methods used by crooks to hide their misdemeanors from potential victims and automated detection systems.

    Phishing-as-a-Service on the rise, due to big data

    Although phishing campaigns are not new, the way in which many of them are run is changing. Malicious actors used to undertake a lot of tedious work to orchestrate such an attack. In particular, they needed to create complex phishing kits from scratch, launch spam hoaxes that looked trustworthy, and set up or hack websites to host deceptive landing pages. Big data helps hackers understand what factors work best in a phishing attack and replicate it better.

    Such activity required a great deal of technical expertise and resources, which raised the bar for wannabe scammers who were willing to enter this shady business. As a result, in the not-so-distant past, phishing was mostly a prerogative of high-profile attackers.

    However, things have changed, most notably with the popularity of a cybercrime trend known as Phishing-as-a-Service (PHaaS). This refers to a malicious framework providing malefactors with the means to conduct effective fraudulent campaigns with very little effort and at an amazingly low cost.

    In early July 2019, researchers unearthed a new PHaaS platform that delivers a variety of offensive tools and allows users to conduct full-fledged campaigns while paying inexpensive subscription fees. The monthly prices for this service range from $50 to $80. For an extra fee, a PHaaS service might also include lists of email addresses belonging to people in a certain geographic region. For example, the France package contains about 1.5 million French 'leads' that are 'genuine and verified'.

    The PHaaS product in question lives up to its turnkey promise as it also provides a range of landing page templates. These scam pages mimic the authentic style of popular services such as OneDrive, Adobe, Google, Dropbox, Sharepoint, DocuSign, LinkedIn, and Office 365, to name a few. Moreover, the felonious network saves its 'customers' the trouble of looking for reliable hosting for the landing sites. This feature is already included in the service.

    To top it all off, the platform accommodates sophisticated techniques to make sure the phishing campaigns slip under the radar of machine learning systems and other automated defenses. In this context, it reflects the evasive characteristics of many present-day phishing waves. The common anti-detection quirks are as follows:

    • Content encryption: As a substitute to regular character encoding, this method encrypts content and then applies JavaScript to decrypt the information on the fly when a would-be victim views it in a web browser.
    • HTML character encoding: This trick prevents automated security systems from reading fraudulent data while ensuring that it is rendered properly in an email client or web browser.
    • Inspection blocking: Phishing kits prevent known security bots, AV engines, and various user agents from accessing and crawling the landing pages for analysis purposes.
    • Content injection: In the upshot of this stratagem, a fragment of a legitimate site’s content is substituted with rogue information that lures a visitor to navigate outside of the genuine resource.
    • The use of URLs in email attachments: To obfuscate malicious links, fraudsters embed them within attachments rather than in the email body.
    • Legitimate cloud hosting: Phishing sites can evade the blacklisting trap if they are hosted on reputable cloud services, such as Microsoft Azure. In this case, an additional benefit for the con artists is that their pages use a valid SSL certificate.

    The above evasion tricks enable scammers to perpetrate highly effective, large-scale attacks against both individuals and businesses. The utilization and success of these techniques could help explain a 17% spike in this area of cybercrime during the first quarter of 2019.

    The scourge of card enrollment

    Banking fraud and identity theft go hand in hand. This combination is becoming more harmful and evasive than ever before, with malicious payment card enrollment services gaining momentum in the cybercrime underground. The idea is that the fraudster impersonates a legitimate cardholder in order to access the target’s bank account with virtually no limitations.

    According to security researchers’ latest findings, this particular subject is trending on Russian hacking forums. Threat actors are even providing comprehensive tutorials on card enrollment 'best practices'.

    The scheme starts with the harvesting of Personally Identifiable Information (PII) related to the victim’s payment card, such as the card number, expiration date, CVV code, and cardholder’s full name and address. A common technique used to uncover this data is to inject a card-skimming script into a legitimate ecommerce site. Credit card details can also be found for sale on the dark web, making things even easier.

    The next stage involves some extra reconnaissance by means of OSINT (Open Source Intelligence) or shady checking services that may provide additional details about the victim for a small fee. Once the crooks obtain enough data about the individual, they attempt to create an online bank account in the victim’s name (or perform account takeover fraud if the person is already using the bank’s services). Finally, the account access is usually sold to an interested party.

    To stay undetected, criminals leverage remote desktop services and SSH tunnels that cloak the fraud and make it appear that it’s always the same person initiating an e-banking session. This way, the bank isn’t likely to identify an anomaly even when the account is created and used by different people.

    To make fraudulent purchases without being exposed, the hackers also change the billing address within the account settings so that it matches the shipping address they enter on ecommerce sites.

    This cybercrime model is potent enough to wreak havoc in the online banking sector, and security gurus have yet to find an effective way to address it.

    These increasingly sophisticated evasion techniques allow malefactors to mastermind long-running fraud schemes and rake in sizeable profits. Moreover, new dark web services have made it amazingly easy for inexperienced crooks to engage in phishing, e-banking account takeover, and other cybercrimes. Under the circumstances, regular users and organizations should keep hardening their defenses and stay leery of the emerging perils.

    Big data makes hackers a horrifying threat

    Hackers are using big data to perform more terrifying attacks every day. We need to understand the growing threat and continue fortifying our defenses to protect against them.

    Author: Diana Hope

    Source: SmartDataCollective

  • The 8 most important industrial IoT developments in 2019

    The 8 most important industrial IoT developments in 2019

    From manufacturing to the retail sector, the infinite applications of the industrial internet of things (IIoT) are disrupting business processes, thereby improving operational efficiency and business competitiveness. The trend of employing IoT-powered systems for supply chain management, smart monitoring, remote diagnosis, production integration, inventory management, and predictive maintenance is catching up as companies take bold steps to address a myriad of business problems.

    No wonder the global technology spend on IoT is expected to reach USD 1.2 trillion by 2022. The growth of this segment will be driven by firms deploying IIoT solutions and by the giant tech organizations developing them.

    To help you stay ahead of the curve, we have enlisted a few developments that will dominate the industrial IoT sphere.

    1. Cobots are gaining popularity

    Digitization is having a major impact on the industrial robotics segment, as connected cobots, or collaborative robots, make their place in the smart manufacturing ecosystem. This trend is improving the efficiency of operations and the reliability of the production cycle.

    IIoT is making robots mobile and collaborative, offering technologies such as self-driving vehicles (mobile collaborative robots), machine vision (part identification), and additive manufacturing that can boost production efficiency and business growth with an excellent ROI. No wonder the global cobot market crossed USD 649 million in 2018 and is expected to expand at a CAGR of 44.5% between 2019 and 2025.

    2. Digital twins are on the rise

    A growing number of firms are deploying IoT solutions to develop a digital replica of their business assets. Thus, instead of sending data to each physical receiver separately, all the information is sent to the digital twin, enabling business units to access the data with ease.

    Digital twins are growing in popularity as they decrease the complexity of the IoT ecosystem while boosting its efficiency. Gartner shares that 24% of enterprises are already using digital twins and an additional 42% plan to ride on this wave in the coming three years.

    Smart businesses are already using digital twin software to incorporate process data, enabling them to reach accurate insights and address operational inefficiencies.

    3. Augmented reality is disrupting the manufacturing domain

    AR is benefiting the manufacturing domain in more ways than one. The technology has disrupted manufacturing areas like product design and development, maintenance and field service, quality assurance, logistics, and hands-on training of new employees.

    For instance, in the assembling operations, AR is replacing the traditional paper instruction manual with IoT-enabled systems that have voice-controlled instructions along with a video from the previous assembly operation.  

    AR is also allowing manufacturing technicians to have access to instant intelligence and problem insights related to maintenance, thereby improving their efficiency and reducing equipment downtime.

    4. IoT-enabled predictive maintenance is becoming a part of the overall maintenance workflow

    With the advent of Industry 4.0, several enterprises are investing in IoT-enabled predictive maintenance of their assets to fix automated systems before they break down. In today’s competitive business environment, it is extremely important for firms to keep machines running seamlessly. Connected sensors and machine learning are helping companies anticipate component failures in advance, thereby reducing equipment downtime and the time machines must be taken offline for preventative maintenance checks.

    As a result, many organizations are running predictive analytics and machine learning to monitor systems and gather data, allowing them to estimate when components are likely to fail.
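
    One simple way to estimate when a component is likely to fail, as the paragraph describes, is to fit a trend to a degradation signal and extrapolate it to a failure threshold (a minimal sketch; the vibration readings, sampling interval, and threshold are made up):

    ```python
    def estimate_hours_to_failure(readings, threshold):
        """Fit a straight line to hourly sensor readings and extrapolate
        when the fitted line will cross the failure threshold."""
        n = len(readings)
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(readings) / n
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, readings)) / \
                sum((x - mean_x) ** 2 for x in xs)
        intercept = mean_y - slope * mean_x
        if slope <= 0:
            return None  # no upward degradation trend detected
        # Hours from the latest reading until the line reaches the threshold.
        return (threshold - intercept) / slope - (n - 1)

    # Vibration amplitude creeping upward; alarm threshold at 1.0.
    vibration = [0.50, 0.52, 0.55, 0.56, 0.60, 0.61, 0.65]
    print(estimate_hours_to_failure(vibration, threshold=1.0))
    ```

    Real predictive maintenance systems use far richer models, but the principle is the same: turn a monitored signal into an estimated time window for intervention.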

    5. 5G will drive real-time IIoT applications

    5G deployments are digitizing the industrial domain and changing the way enterprises manage their business operations. Industries, namely transportation, manufacturing, healthcare, energy and utilities, agriculture, retail, media, and financial services will benefit from the low latency and high data transfer speed of 5G mobile networks.

    For instance, in the manufacturing domain, 5G will power factory automation, ensuring that the processes happen within the time frame, thereby reducing the risk of downtime. Further, 5G will help manufacturers in real-time production inspection and assembly line maintenance.

    6. Firms are shifting from centralized cloud to edge computing

    Until now, the centralized cloud was a popular choice among firms for controlling connected devices and data. However, with IoT devices and sensors expected to generate an ocean of data, more and more enterprises want IoT to monitor and report data and events remotely.

    Though most firms are using centralized cloud-based solutions to collect data, they are facing issues, such as high network load, poor response time, and security risks. Edge computing is helping businesses collect, analyze, and store data close to its source, thereby reducing the costs and security risks and improving system efficiency. That explains the growing demand for edge computing.

    A research report from Business Insider Intelligence forecasts that by 2020, there will be over 5,635 million smart sensors and other IoT devices globally, generating over 507.5 zettabytes of data. The need to collect and process this data at local collection points is what’s triggering the shift from centralized cloud to edge computing.

    7. Firms will continue to invest in cybersecurity

    Cybersecurity threats continue to evolve each day. Connected systems pose a serious threat to data and cause massive system disruption and loss to the firm. A 2018 Data Breach study by IBM revealed that the cost of an average data breach to companies globally is USD 3.86 million.

    As a result, an increasing number of firms are investing in innovative services such as virtual private networks (VPNs) to access the internet safely. Such security solutions are becoming increasingly popular with enterprises across domains.

    8. IoT analytics is gaining significance

    While sectors such as manufacturing, aerospace, and energy and utilities are deploying IoT-powered sensors and wireless technologies, the true value of industrial IoT lies in analytics. The connected systems generate a large amount of data that needs to be effectively employed to optimize operations. Thus, the demand for IoT analytics will rise in the coming years. As a result, firms will have to depend on AI and ML technologies to find effective ways to manage the data overload.

    Companies like SAS, SAP, and Teradata are already offering advanced analytics software to help enterprises evaluate real-time data streaming from connected systems on the shop floor.

    Going forward

    IIoT is all set to fuel the fourth industrial revolution. Firms across various industries are adopting innovative IoT devices and technologies to accelerate business growth. These IIoT deployments will help enterprises improve operational efficiency, reduce downtime, and get a serious competitive advantage in their respective domains.

    The IIoT developments shared in this post will set the stage for innovative enterprise platforms and tech advancements. Organizations wanting to remain competitive should not only be aware of these trends but also take adequate measures to embrace them.

    Source: Datafloq

  • The massive impact of data science on the web development business

    The massive impact of data science on the web development business

    “A billion hours ago, modern Homo sapiens emerged.
    A billion minutes ago, Christianity began.
    A billion seconds ago, the IBM personal computer was released.
    A billion Google searches ago… was this morning."

    - Hal Varian, Google’s Chief Economist, December 2013 (from the book Work Rules! by Laszlo Bock)

    The last line of the above quote characterizes the world’s hunger for information. Information plays a huge role in our lives. Information consumed by our senses helps our minds make decisions. But what happens when the mind is flooded with information? You get confused, annoyed, and scared of decision-making. This is where computers and processors come to the rescue, and this is when the term 'information' is replaced by 'data'.

    Every minute, more than a hundred hours of video content is uploaded to YouTube. Over 50 billion apps have been downloaded from application stores since 2008. More than 2 billion people are signed up on social media websites. These numbers give you just a glimpse of the amount of data flowing through optical fibers around the world every second. And now the question: how do we make this massive amount of data useful? The answer is analytics. If you know how to play with numbers and extract the nectar of useful insights from this huge amount of data using appropriate analytical tools, then you, my friend, are a real data scientist.

    Data science is helping many businesses, whether they are B2B or B2C. But in this article, we are going to talk more about its role in one of the biggest B2B industries: custom web development. If you are a web developer, you must not ignore the rise of data science in your profession, and if you are thinking about hiring one, then you should know about the latest trends so you can supervise the development process better. So, let’s discuss the impact of data science on the transformation of web development:

    1. Re(de)fining the software solutions

    Not very long ago, web developers used to be creative with page layouts and menu details. It was generally guesswork, but now data science tells web developers about the layouts and details of competitor websites. Hence, they can propose a unique design after carefully evaluating the competition.

    Also, with the help of the latest analytical tools, web developers can learn what end users require. They can suggest particular functions or features that are popular among customers, based on the analysis of consumer data. In this way, data science is helping developers provide better and faster software solutions to their clients.

    2. Automatic updates

    Gone are the days when updates had to be manually administered by the developers. This is the era of automation. Machine learning has enabled tools to analyze consumer behavior and data available on social media platforms to come up with required updates. The websites are made self-learning so that they can improve themselves with the changing demands of the customers. It is possible only because data science is doing its job perfectly.

    Although this part still faces some challenges in creating customized solutions for different clients, custom web development services will soon make it a piece of cake with the help of data science.

    3. Customizing for end users

    So far we have discussed how web development can be customized for clients using data science, but the real goal should be the satisfaction of end users. And satisfaction is a function of personalization. To create a personalized product for users, you need to know them, and in this regard data science is helping web developers.

    Spending habits, interest areas, preferred websites, geographical location, age, gender: all this information about end users is used to create algorithmic models that can predict a consumer’s affinity for your web apps. Using these models, you can not only give the user a personalized experience on the website but also strategically place ads targeting specific customer segments, creating a win-win situation for both buyer and seller.
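
    A toy version of such an algorithmic model might combine a few user features into a logistic score (purely illustrative; the features, weights, and bias are invented, not learned from real data):

    ```python
    import math

    def engagement_score(user, weights, bias=0.0):
        """Logistic scoring model: combine user features into a 0-1
        probability-like score of engaging with a product or ad."""
        z = bias + sum(weights[k] * v for k, v in user.items())
        return 1 / (1 + math.exp(-z))

    # Hypothetical learned weights for a "sports gear" campaign.
    weights = {"visits_per_week": 0.8, "avg_spend": 0.01, "is_sports_fan": 1.5}

    casual = {"visits_per_week": 1, "avg_spend": 10, "is_sports_fan": 0}
    fan = {"visits_per_week": 5, "avg_spend": 80, "is_sports_fan": 1}

    print(engagement_score(casual, weights, bias=-4.0))  # low score
    print(engagement_score(fan, weights, bias=-4.0))     # high score
    ```

    In practice, the weights come from training on historical behavior data; the point is that a per-user score like this is what drives both personalization and ad targeting.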

    4. Changing hot-skills

    Apart from changing the way the web is designed by developers, data science is influencing the transformation of web development in one more way: by revolutionizing the job market. With the ever-changing needs of the industry, web development companies want employees equipped with the skills to use the latest data and analytics tools.

    Developers looking for jobs today are expected to know tools like Python and Google Analytics. In interviews, they are asked about their proficiency in creating AI and ML programs. Therefore, one has to stay updated to stay relevant.

    5. Customer’s expectations

    Do you get irritated when an Uber driver calls to ask about your pick-up location when it can easily be tracked by GPS and is clearly displayed on his device’s screen? Won’t you feel uncomfortable if you misspell something while typing on your messenger and autocorrect stops helping? And don’t you feel nice when you buy a phone online and the web app suggests the latest phone covers for it?

    Well, if the answer is yes, then you are becoming dependent on data science too. Don’t worry, you're not the only one. Customers worldwide like the extra help provided by businesses. And this dependency on data will soon make the use of data science a hygiene factor in web development.


    Although it’s called Data Science, using it is nothing less than an art. It requires expertise and dedication to develop a web app which completely harnesses the potential of data science.

    Data science is a vast field. It is responsible for AI, machine learning, big data, analytics, etc. It also drives technologies such as the Internet of Things and AR/VR. Hence, when all the modern buzzwords of business are somehow related to data science, it requires absolute ignorance to neglect the role of data science in the development of websites and web apps.

    Source: Datafloq

  • The most wanted skills for organizations migrating to the cloud

    The most wanted skills for organizations migrating to the cloud

    Given the widespread move to cloud services underway today, it’s not surprising that there’s growing demand for a variety of cloud-related skills.

    Earlier this year, IT consulting and talent services firm Akraya Inc. compiled a list of the most in-demand cloud skills for 2019. Let's take a look at them:

    Cloud security

    Cloud security is a shared responsibility between cloud providers and their customers. That creates a need for professionals with specialization in cloud security skills, including those who can leverage cloud security tools.

    Machine learning (ML) and artificial intelligence (AI)

    In recent years cloud vendors have developed and expanded their set of tools and services that allow organizations to reap the benefits of machine learning and artificial intelligence in the cloud. Companies need people who can leverage these new capabilities of the cloud.

    Cloud migration and deployment within multi-cloud environments

    Many organizations are looking to adopt multiple cloud services and are looking for professionals who can contribute to their cloud migration efforts. Cloud migration has its risks and is not an easy process; improper migration often leads to business downtime and data vulnerability. This means that employees with the appropriate skill set are key.

    Serverless architecture

    Underlying cloud server infrastructure needs to be managed by cloud developers within a server-based architecture. But today’s cloud consists of industry standard technologies and programming languages that help move serverless applications from one cloud vendor to another, Akraya said. Companies need expertise in serverless application development.

    Author: Bob Violino

    Source: Information-management

  • The reinforcing relationship between AI and predictive analytics

    The reinforcing relationship between AI and predictive analytics

    Enterprises have long seen the value of predictive analytics, but now that AI (artificial intelligence) is starting to influence forecasting tools, the benefits may start to go even deeper.

    Through machine learning models, companies in retail, insurance, energy, meteorology, marketing, healthcare and other industries are seeing the benefits of predictive analytics tools. With these tools, companies can predict customer behavior, foresee equipment failure, improve forecasting, identify and select the best product fit for customers, and improve data matching, among other things.

    Enterprises of all sizes are now finding that the combination of predictive analytics and AI can help them stay ahead of their competitors.

    Forecasting gets a boost with AI

    Retail brands are constantly looking to stay relevant by associating themselves with the latest trends. Before each season, designers are continuously working on creating new styles and designs they think will be successful. However, these predictions can be faulty based on a number of factors, such as changes in customer buying patterns, changing tastes in particular colors or styles, and other factors that are difficult to predict.

    AI-based approaches to demand projection can reduce forecasting errors by up to 50%, according to Business of Fashion. This improvement can mean big savings for a retail brand's bottom line and positive ROI for organizations that are inventory-sensitive.

    Another industry that has seen tremendous improvements recently is meteorology and weather forecasting. Traditionally, weather forecasting has been prone to error. However, that is changing, as the accuracy of 5-day forecasts and hurricane tracking forecasts has improved dramatically in recent years.

    According to the Weather Channel, hurricane track forecasts are now more accurate five days in advance than two-day forecasts were in 1992. These extra few days can give people in a hurricane's path extra time to prepare and evacuate, potentially saving lives.

    Another example is the use of predictive analytics by utility companies to help spot trends in energy usage. Smart meters monitor activity and notify customers of consumption spikes at certain times of the day, helping them cut back on power usage. Utility companies are also helping customers predict when they might get a high bill based on a variety of data points, and can send out alerts to warn customers if they are running up a large bill that month.

    Reducing downtime and disturbance

    For industries that heavily rely on equipment, such as manufacturing, agriculture, energy, mining etc., unexpected downtime can be costly. Companies are increasingly using predictive analytics and AI systems to help detect and prevent failures.

    AI-enabled predictive maintenance systems can self-monitor and report equipment issues in real time. IoT sensors attached to critical equipment can gather real-time data, spotting issues or potential problems as they arise and notifying teams so they can respond to them right away. The systems can also formulate predictions of upcoming issues, reducing costly unplanned downtime for instance.

    Power plants need to be monitored constantly to make sure they are functioning properly and safely, and that they are providing energy to all the customers that rely on them for electricity. Predictive analytics is being used to run early warning systems that can identify anomalies and notify managers of issues weeks to months earlier than traditional warning systems. This can lead to improved maintenance planning and more efficient prioritization of maintenance activities.
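
    An early warning system like the one described can be sketched as a rolling-statistics anomaly check (a minimal illustration; the sensor readings, window size, and threshold are made up):

    ```python
    from statistics import mean, stdev

    def find_anomalies(values, window=5, n_sigmas=3.0):
        """Flag readings that deviate from the recent rolling mean by more
        than `n_sigmas` standard deviations -- a simple early-warning check."""
        anomalies = []
        for i in range(window, len(values)):
            recent = values[i - window:i]
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(values[i] - mu) > n_sigmas * sigma:
                anomalies.append(i)
        return anomalies

    # Turbine temperature readings with one sudden spike at index 8.
    temps = [70.1, 70.3, 69.9, 70.2, 70.0, 70.1, 70.2, 70.0, 85.0, 70.1]
    print(find_anomalies(temps))  # [8]
    ```

    Production systems layer much more sophistication on top (seasonality, multivariate sensors, learned baselines), but flagging deviations from a learned "normal" is the core of anomaly-based early warning.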

    Additionally, AI can help predict when a component or piece of equipment might fail, reducing unexpected equipment failure and unplanned downtime while also lowering maintenance costs.

    In industries that rely heavily on location data, such as mining, making sure you're operating in the correct area is paramount. Goldcorp, one of the largest gold mining companies in the world, partnered with IBM Watson to improve its targeting of new gold deposits.

    By analyzing previously collected data, IBM Watson was able to improve geologists' accuracy of finding new gold deposits. Through the use of predictive analytics, the company was able to gather new information from existing data, better determine specific areas to explore next, and reach high-value exploration targets faster.

    Increased situational awareness

    Predictive analytics and AI are also great at anticipating situational events by collecting data from the environment and making decisions based on that data. Such systems help predict future events from data rather than merely reacting to current conditions.

    Brands need to stay on top of their online presence, as well as what's being said about them on social media. Tracking social media to get real-time feedback from customers is important, especially for retail brands and restaurants. Bad reviews and negative comments can be detrimental, particularly for smaller brands.

    With this awareness and by tracking comments on social media in (near) real-time, companies can gather immediate feedback and respond to situations quickly. Situational awareness can also help with competition tracking, market awareness, market trend predictions and anticipated geopolitical problems.

    With companies of all sizes in every industry trying to stay ahead of their competitors and predict market trends, this forward-looking approach of predictive analytics is proving valuable. Predictive analytics is such a core part of AI application development that it is one of the seven core patterns of AI identified by AI market research and analysis firm Cognilytica.

    The use of machine learning to help give humans more data to make better decisions is compelling, and it's one of the most beneficial uses of machine learning technology.

    Author: Kathleen Walch

    Source: TechTarget

  • The role of Machine Learning in the development of text to speech technology

    The role of Machine Learning in the development of text to speech technology

    Machine learning is drastically advancing the development of text to speech technology. Here's how, and why it's so important.

    Machine learning has played a very important role in the development of technology that has a large impact on our everyday lives. However, machine learning is also influencing the direction of technology that is not as commonplace. Text to speech technology is a prime example.

    Text to speech technology predates machine learning by over a century. However, machine learning has made the technology more reliable than ever.

    The progression of text to speech technology in the Machine Learning era

    We live in an era where audiobooks are gaining more appreciation than traditional print literature. Thus, it comes as no surprise that Text-to-Speech (TTS) technology is also rapidly becoming popular. It caters to those who need it most, including children who struggle with reading and people with disabilities, and big data is proving very useful in assisting them.

    There are other elements of speech synthesis technology that rely on machine learning. It is now so sophisticated that it can even mimic someone else’s voice.

    Text to Speech (commonly known as TTS) is a piece of assistive technology (that is, any technology that helps individuals overcome their challenges) that reads text out loud, and it is available on almost every gadget in our hands today. It has taken years for the technology to develop to the point it is at today. Machine learning is changing the direction of this radical technology. However, its journey is one that started in the late eighteenth century.

    The early days of text to speech 

    TTS is a complicated technology that has developed over a long period of time. It all began with the construction of acoustic resonators, which could produce only vowel sounds. These resonators were developed in 1779, thanks to the dedicated work of Christian Kratzenstein. With the advent of semiconductor technology and improvements in signal processing, computer-based TTS devices started hitting the shelves in the 20th century. There was a lot of fascination surrounding the technology during its infancy. This was primarily why Bell Labs’ speech synthesis demonstration found its way into the climactic scene of one of the greatest sci-fi films of all time: 2001: A Space Odyssey.

    The Machine Learning technology that drives TTS

    A couple of years ago, Medium contributor Utkarsh Saxena penned a great article on speech synthesis technology with machine learning, discussing two very important machine learning approaches: Parametric TTS and Concatenative TTS. Both help with the development of new speech synthesizing techniques.

    At its heart, a TTS engine has a front-end and a back-end component, and modern TTS engines are heavily dependent on machine learning algorithms. The front-end converts the text into phonetics and meaningful sentences. The back-end uses this information to convert the symbolic linguistic representation into sound. Good synthesizer technology, which increasingly requires sophisticated deep learning tools, is key to a good TTS system. The audio should be both intelligible and natural, so it can mimic everyday conversation. Researchers are trying out various techniques to achieve this.

    Concatenative synthesis relies on piecing together multiple segments of recorded speech to form coherent sentences. This approach usually yields the most natural-sounding speech. However, it can lose intelligibility, with audible glitches resulting from poor segmentation. Formant synthesis is used when intelligibility takes precedence over naturalness. This technology does not use human speech samples, and hence sounds evidently ‘robotic’. The lack of a speech-sample database means that it is relatively lightweight and best suited for embedded applications, where power and memory resources are scarce. Various other technologies also exist, but the most recent and notable is machine learning, in which recorded speech data is used to train deep neural networks. Today’s digital assistants use these extensively.

    The challenges

    Contextual understanding of the text on the screen is one of the main challenges for TTS systems. More often than not, human readers understand certain abbreviations without a second thought. However, these are very confusing to computer models. A simple example is to consider the two phrases “Henry VIII” and “Chapter VIII”. Clearly, the former should be read as Henry the Eighth and the latter as Chapter eight. What seems trivial to us is anything but for front-end developers working at TTS companies like Notevibes.
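
    To make the front-end problem concrete, here is a toy normalization rule in Python. It is a sketch of the general idea only, not how Notevibes or any real TTS front end is implemented: expand a Roman numeral as a cardinal after a title word like "Chapter", and as an ordinal after a name.

```python
import re

# Toy lookup tables (illustrative, deliberately limited to II-VIII).
ROMAN = {"II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6, "VII": 7, "VIII": 8}
CARDINAL = {2: "two", 3: "three", 4: "four", 5: "five",
            6: "six", 7: "seven", 8: "eight"}
ORDINAL = {2: "the Second", 3: "the Third", 4: "the Fourth", 5: "the Fifth",
           6: "the Sixth", 7: "the Seventh", 8: "the Eighth"}
TITLE_WORDS = {"Chapter", "Part", "Section", "Act"}

def normalize(text: str) -> str:
    # Expand "<word> <numeral>": cardinal after a title word, else ordinal.
    def expand(match):
        word, value = match.group(1), ROMAN[match.group(2)]
        table = CARDINAL if word in TITLE_WORDS else ORDINAL
        return f"{word} {table[value]}"
    # Longest numerals first so "VIII" is not matched as "V".
    alts = "|".join(sorted(ROMAN, key=len, reverse=True))
    return re.sub(rf"\b(\w+) ({alts})\b", expand, text)

print(normalize("Henry VIII"))    # Henry the Eighth
print(normalize("Chapter VIII"))  # Chapter eight
```

    Real front ends lean on statistical models rather than hand-written rules, precisely because lists like TITLE_WORDS can never be exhaustive.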

    They use various predictive models to enhance the user experience. But there is a lack of standard evaluation criteria to judge the accuracy of a TTS system. A lot of variables go into the quality of a particular recording, and these variables are hard to control. This is due to the involvement of both analog and digital processing. However, an increasing number of researchers have begun to evaluate a TTS system based on a fixed set of speech samples.

    That, in a nutshell (a rather big one at that), is an overview of Text to Speech systems. With increased emphasis on AI, ML, DL, etc., it is only a matter of time before we are able to synthesize true-to-life speech for use in our ever-evolving network of things.

    Machine Learning is the core of text to speech technology

    Machine learning is integral to the development of text to speech technology. New speech synthesis tools rely on deep neural networks to provide the highest-quality output as this technology evolves.

    Author: Matt James

    Source: Smart Data Collective

  • The Short-Term Future of Natural Language Processing

    The Short-Term Future of Natural Language Processing

    Natural language processing research and applications are moving forward rapidly. Several trends have emerged on this progress, and point to a future of more exciting possibilities and interesting opportunities in the field.

    As both the written and spoken-language corpora available to us explode in ubiquity, natural language processing (NLP) has become an invaluable tool to researchers, organizations, and even hobbyists. It lets us summarize documents, analyze sentiment, categorize content, translate languages – and one day potentially even converse at a human level.

    Like any AI and ML field, NLP is a fast-moving discipline undergoing rapid change as both practitioners and researchers dive into the prospects it presents. While the NLP landscape is rapidly changing, here are some of the trends and opportunities I see on the horizon.

    Prompting: priming NLP with a few choice words

    “Prompting” is a technique that involves adding a piece of text to your input examples to encourage a language model to perform a task you’re interested in. Say your input text is: “We had a great waitress, but the food was undercooked.” Perhaps you’re interested in comparing food quality across restaurant locations. Appending “the food was” to the review and seeing whether “terrible” or “great” is the more likely continuation suddenly provides you with a topical sentiment model. Given the negative sentiment associated with “undercooked,” our missing word would almost certainly be “terrible.”

    Training such a model from scratch would require extensive annotations, but with no or just a small number of examples, a workable solution can be found, making prompting a viable choice for smaller-scale projects and budgets. Because the prompt language can easily be changed, you can explore many possible taxonomies and functionality across your dataset without having to commit to a final set of guidelines for your annotators.
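
    The prompting idea can be made concrete with a deliberately tiny stand-in for a language model. In practice you would score the continuations with a pretrained model; the lookup table below is invented purely for illustration.

```python
# Invented continuation "probabilities" standing in for a real LM's scores.
TOY_LM = {
    ("undercooked", "terrible"): 0.9, ("undercooked", "great"): 0.1,
    ("delicious", "terrible"): 0.05, ("delicious", "great"): 0.95,
}

def food_sentiment(review: str) -> str:
    # Find the food cue in the review, append the prompt "the food was",
    # and return whichever continuation the "model" scores higher.
    cue = next(w for w in ("undercooked", "delicious") if w in review)
    scores = {label: TOY_LM[(cue, label)] for label in ("terrible", "great")}
    return max(scores, key=scores.get)

review = "We had a great waitress, but the food was undercooked."
print(food_sentiment(review))  # terrible
```

    The point of the sketch is the workflow, not the model: no annotation or training happened, only the choice of a prompt and two candidate continuations.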

    Standing on the shoulders of giants: a convergence across modalities

    As the field becomes more mature, we’re starting to see cross-pollination between different AI and ML disciplines. This has become possible now that less background knowledge is required to get your feet wet in the field – and instead of highly niche specialists, we’re starting to see solid generalists. This has fostered a convergence of modalities, and we’re now seeing traditional text-based approaches brought to the numerical world and traditionally NLP-oriented things like transformer networks being applied to video and even physical simulations.

    To the creative thinker, the opportunities and applications are vast: for example, Samsung is combining NLP with video imagery to help self-driving cars interpret street signs in foreign countries. NLP and computer vision are natural bedfellows, and I also expect to see the two being used to help translate video to text for accessibility purposes, to improve descriptions of medical imagery, and even to translate verbal design requests into a written or visual description.

    Sharing is caring: open-source models driving knowledge forward

    An open-source AI culture is good for innovation, providing valuable feedback loops, opportunities to improve and develop technologies, and giving technologists room to grow. The big guns like DeepMind have paved the way with the research papers and libraries left in the wake of AlphaGo and AlphaZero, and now smaller contenders such as HuggingFace are doing the same with a commercial/open-source hybrid developed in tandem with language researchers. Partnerships like these, and our own with UMass, mean that the NLP community has increased access to datasets, tokenizers, and transformers, giving us more insight into what’s going on under the hood and fostering opportunities for the community to iterate, advance upon, and broaden access to the technology – and strengthening our collective skill sets and knowledge.

    Fair go, algo: algorithms and people working together

    Algorithms and humans each have their strengths, and by working together, they can deliver exceptional results. One area that’s seen some interest is generative language, but the problem there is that while algorithms can create output that sounds human-like, they’re not concerned with veracity. But having a human at hand to monitor for accuracy, relevance, and the rest of Grice’s Maxims can improve outcomes. The same partnership works well in summarization, which is an area I’m interested in. Quickly condensing a long article into the most salient points is surprisingly tricky for humans, and a task machines have shown a reasonable capacity for, within some constraints. On the other hand, when we ask machines to turn those salient points into a coherent summary, they have an unfortunate tendency to change the meaning to something non-factual. But having a machine highlight the key ideas in a document, and a human turn that into a short snippet, outperforms either working alone. I think we’ll be seeing more and more of this as AI and NLP become embedded in everyday work processes.

    Transforming transformers: less resource-intensive solutions

    BERT and other comparable, well-known technologies are built using transformers, a type of model that identifies dependencies (words in a sentence that relate back to a target word) across long blocks of text. Transformers are highly effective – but also incredibly resource-intensive as they require a huge amount of pretraining and data. Although the world has been taken by storm by these transformer-based technologies, we’re starting to see alternatives spring up as smaller companies and teams seek out more viable solutions for smaller-scale (and budget) problems. HuggingFace’s transformer variant, SRU++ and related work, the Reformer (an efficient transformer model), and models like ETC/BigBird are potential alternatives that I expect will see more interest as the computational cost of transformer-based projects becomes untenable.

    A technology truism: The rest is yet to come

    AI and NLP are always in a stage of advancement and improvement, and we see ebbs and flows as resource-rich industry races ahead with the next big thing while research takes some time to catch up and build out our knowledge. This cycle, now enriched by the increasingly open-source nature of NLP along with technical cross-pollination and new industry applications, will continue to surface new opportunities and advancements for us to explore and take advantage of.

    Author: Paul Barba

    Source: KDnuggets

  • The status of AI in European businesses

    The status of AI in European businesses

    What is the future of AI (artificial intelligence) in Europe and what does it take to build an AI solution that is attractive to investors and customers at the same time? How do we reimagine the battle of 'AI vs Human Creativity' in Europe? 

    Is there any company that is not using AI or isn’t AI-enabled in some way? Whether it is startups or corporates, it is no news that AI is boosting digital transformation across industries at a global level; hence it has traction not only with investors but is also the focus of government initiatives across countries. But where does Europe stand relative to the US and China in terms of digitization, and how could a collective effort push AI as an important pan-European strategic topic?

    First things first: According to McKinsey, Europe has large potential to deliver on AI and catch up with the most AI-ready countries, such as the United States, and emerging leaders like China. If Europe on average develops and diffuses AI according to its current assets and digital position relative to the world, it could add some €2.7 trillion, or 20%, to its combined economic output by 2030. If Europe were to catch up with the US AI frontier, a total of €3.6 trillion could be added to collective GDP in this period.

    What comprises the AI landscape and is it too crowded?

    I recently attended a dedicated panel on 'AI vs Human Creativity' on the first day of the Noah conference 2019 in Berlin. Moderated by Pamela Spence, Partner of Global Life Sciences and Industry leader at EY, the discussion started with an open question: is the AI landscape too crowded? According to a report by EY, there are currently about 14,000 startups globally that can be associated with the AI landscape. But what does this mean when it comes to the nature of these startups?

    Minoo Zarbafi, VP of Bertelsmann Investments Digital Partnerships, added perspective to these numbers: 'There are companies that are AI-enabled and then there are so-called AI-first companies. I differentiate because there are almost no companies today that are not using AI in their processes. From an investor perspective, we at Bertelsmann like AI-first companies which are offering a B2B (business-to-business) platform solution to an unsolved problem. For instance, we invested in China in two pioneer companies in the domain of computer vision that are offering a B2B solution for autonomous driving'. Minoo added that from a partnership perspective Bertelsmann looks at AI companies that can help on the digital transformation journey of the company. 'The challenge is to find the right partner with the right approach for our use cases. And we actively seek the support of European and particularly German companies from the startup ecosystem when selecting our partners', she pointed out.

    The McKinsey report too states that one positive point to note is that Europe may not need to compete head to head but rather in areas where it has an edge (such as in B2B and advanced robotics) and continue to scale up one of the world’s largest bases of technology developers into a more connected Europe-wide web of AI-based innovation hubs.

    A growing share of funding from Series A and beyond reflects the increased maturity of the AI ecosystem in Europe. Pamela Spence from EY noted: 'One in 12 startups uses AI as a part of their products or services, up from one in 50 about six years ago. Startups labelled as being in AI attract up to 50% more funding than other technology firms. 40% of European startups that are claimed as AI companies actually don’t use AI in a way that is material to their business'.

    AI and human creativity go hand-in-hand

    Another interesting and important question is how far are we from the paradigm of clever thinking machines? Why should we be afraid of machines? Hans-Christian Boos, CEO & Founder of Arago, compares how machines were earlier supposed to do tasks which are too tedious or expensive and complex for humans. 'The principle of machine changes with AI. It used to earlier just automate tasks or standardise them. Now, all you need is to describe what you want as an outcome and the machine will find that outcome for you, that is a different ballgame altogether. Everything is result-oriented', he says.

    Minoo Zarbafi adds that as human beings, we have a limited capacity for processing information. 'With the help of AI, you can now digest much more information which may, combined with human creativity, cause you to find innovative solutions that you could not see before. One could say, the more complexity, the better the execution with AI. At Bertelsmann, our organisation is decentralised and it will be interesting to see how AI leverages operational execution'.  

    AI and the political landscape

    Why discuss AI when we talk about the digital revolution in Europe? According to the tech.eu report titled ‘Seed the Future: A Deep Dive into European Early-Stage Tech Startup Activity’, it would be safe to say that Artificial Intelligence, Machine Learning and Blockchain lead the way in Europe. The European Commission has identified Artificial Intelligence as an area of strategic importance for the digital economy, citing its cross-cutting applications to robotics, cognitive systems and big data analytics. In an effort to support this, the Commission’s Horizon 2020 programme includes considerable funding for AI, with €700M of EU funding allocated specifically.

    Chiara Sommer, Investment Director of Intel Capital, reflected on this by saying: 'In the present scenario, the implementation of AI starts with workforce automation, with a focus on how companies could reduce cost and become more efficient. The second generation of AI companies focuses on how products can offer solutions and solve problems like never before. There are entire departments that can be replaced by AI. Having said that, the IT industry adopts AI fastest, followed by industries like healthcare, retail and the financial sector'.

    Why are some companies absorbing AI technologies while most others are not? Among the factors that stand out are their existing digital tools and capabilities and whether their workforce has the right skills to interact with AI and machines. Only 23% of European firms report that AI diffusion is independent of both previous digital technologies and the capabilities required to operate with those digital technologies; 64% report that AI adoption must be tied to digital capabilities, and 58% to digital tools. McKinsey reports that the two biggest barriers to AI adoption in European companies are linked to having the right workforce in place.

    Making effective and impactful use of AI is certainly a collective effort of industries, governments, policy makers and corporates. Instead of asking how AI will change society, Hans-Christian Boos rightly concludes: 'We should change the society to change AI'.

    Author: Diksha Dutta

    Source: Dataconomy

  • Top 10 big data predictions for 2019

    The amount of data created nowadays is incredible. The amount and importance of data are ever-growing, and with that, the need to analyze and identify patterns and trends in data becomes critical for businesses. Therefore, the need for big data analytics is higher than ever. That raises questions about the future of big data. 'In which direction will the big data industry evolve?' 'What are the dominant trends for big data in the future?' While there are several predictions doing the rounds, these are the top 10 big data predictions that will most likely dominate the (near) future of the big data industry:

    1. An increased demand for data scientists

    It is clear that with the growth of data, the demand for people capable of managing big data is also growing. Demand for data scientists, analysts and data management experts is on the rise. The gap between the demand and availability of people who are skilled in analyzing big data trends is big and keeps getting bigger. It is up to you to decide whether you wish to hire offshore data scientists/data managers or hire an in-house team for your business.

    2. Businesses will prefer algorithms over software

    Businesses will prefer purchasing existing algorithms over buying software, as algorithms give them more customization options. Off-the-shelf software cannot be modified to user requirements; rather, businesses have to adjust to the software.

    3. Businesses increase investments in big data

    IDC analysts predict that investment in big data and analytics will reach $187 billion in 2019. Even though big data investment will vary from one industry to another, spending as a whole will increase. It is predicted that the manufacturing industry will see the highest investment in big data, followed by healthcare and the financial industry.

    4. Data security and privacy will be a growing concern

    Data security and privacy have been the biggest challenges in the big data and internet of things (IoT) industries. Since the volume of data started increasing exponentially, keeping data private and secure has become more complex, and the need to maintain high security standards is becoming extremely important. If there is anything that could impede the growth of big data, it is data security and privacy concerns.

    5. Machine learning will be of more importance for big data

    Machine learning will be of paramount importance regarding big data. One of the most important reasons why machine learning will be important for big data is that it can be of huge help in predictive analysis and addressing future challenges.

    6. The rise of predictive analytics

    Simply put, predictive analytics can predict the future more reliably with the help of big data analytics. It is a highly sophisticated and effective way to gather market and customer information and determine the next actions of both consumers and businesses. Analytics provide a deeper understanding of future behaviour.

    7. Chief Data Officers will have a more important role

    As big data becomes important, the role of Chief Data Officers will increase. Chief Data Officers will be able to direct functional departments with the power of deeply analysed data and in-depth studies of trends.

    8. Artificial Intelligence will become more accessible

    Without going into detail about how Artificial Intelligence is becoming significantly important for every industry, it is safe to say that big data is a major enabler of AI. With cloud-based data storage infrastructure, large amounts of data can be processed in parallel to derive trends for AI and machine learning. Big data will make AI more productive and more efficient.

    9. A surge in IoT networks

    Smart devices are dominating our lives like never before. There will be an increase in the use of IoT by businesses and that will only increase the amount of data that is being generated. In fact, the focus will be on introducing new devices that are capable of collecting and processing data as quickly as possible.

    10. Chatbots will get smarter

    Needless to say, chatbots account for a large part of daily online interaction. But chatbots are becoming more and more intelligent and capable of personalized interactions. With the rise of AI, big data will enable tons of data to be processed and conversations to be analysed, allowing chatbots to follow a more streamlined, customer-focused strategy and get smarter.

    Is your business ready for the future of big data analytics? Keep the above predictions in mind when preparing your business for emerging technologies and think about how big data can play a role.

    Source: Datafloq

  • Topic modeling and the need for humanity in data science

    Topic modeling and the need for humanity in data science

    In fields involved with knowledge production, unsupervised machine-learning algorithms are becoming the standard. These algorithms allow us to statistically analyze data sets that exceed traditional analytic capabilities. Topic modeling, for example, is gradually emerging as the strategy of choice in marketing research, social sciences, cultural analytics, and in historical, scientific, and textual scholarship.

    An algorithmic text-mining practice, topic modeling is used to discover recurring subjects and issues in large collections of documents. Parsing such large-scale data sets, classifying genomic sequences, mapping forms of advertisement, observing online discussions, etc. is a matter of organization: how do you make sense of, and classify, these clusters of information?

    The answer, often, is to configure them into abstract but coherent topics. As a consequence, the software-based output of your chosen topic-modeling practice will inevitably confront you with the task of interpreting computer-generated data as texts. (Texts, in this context, being assemblages of elements of signification, or what semioticians call signs.)

    Any process of interpretation of textual data relates, from this point of view, to the interplay between observable features and a specific perspective, or that which causes us to see what we see during an observational process.

    In developing a methodology, then, we must consider both the observed and the observer. And that starts by analyzing the problem of what we see. 

    The initial consideration might be that what you see in, say, the word bubbles of the DFR Browser, the graphic interface developed by Andrew Goldstone that visually renders the Mallet-processed analysis of your source texts, are not empirical or directly observable data. They are rather estimated data resulting from a probabilistic processing of the actual words in the source material. Based on what we call posterior probability, these estimates reflect further conditions assigned after the primary physical evidence, counting word occurrences in our case, is gathered.

    Such lists of words (topics) collectively represent just one possible picture of the object of analysis. Additionally, we need to remind ourselves that Bayesian statistics itself (on which the Latent Dirichlet Allocation algorithm typically used in topic modeling is built) does not work with physical probabilities but with evidential probabilities.

    As a result, when it comes to topic modeling, the computer is in charge of making an initial wild 'guess'. This initial guess, which is based on large-scale computations, is then followed by iterations of probabilistic hypotheses. Each hypothesis is based on occurrences and frequency: what happens and how often. Conceptually, this means that we are tasking a computer to use data to make subjective or speculative judgments. 

    To this gigantic algorithmic speculation, we then add the one connected with the human-reading practice (how we identify or label a topic as part of a specific area of meaning). 

    We can gradually begin to understand how the process of 'topic labeling', far from being the result of any possible automated or standardized procedure, represents the final stage of speculative layers at both the machinic and human level. 
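
    A small sketch makes the labeling problem tangible. What a topic model hands the human is a topic-word probability matrix, and 'a topic' is simply the top-ranked words of one row; the labels are never in the output. (The numbers below are invented for illustration, not the result of a real model run.)

```python
# Invented topic-word probabilities of the shape an LDA run would produce.
topic_word_probs = {
    "topic_0": {"ship": 0.12, "sea": 0.10, "captain": 0.08,
                "whale": 0.07, "market": 0.01},
    "topic_1": {"market": 0.11, "price": 0.09, "trade": 0.08,
                "ship": 0.05, "sea": 0.01},
}

def top_words(topic, n=3):
    # A "topic", as presented to the labeler: the n most probable words.
    dist = topic_word_probs[topic]
    return sorted(dist, key=dist.get, reverse=True)[:n]

for name in topic_word_probs:
    print(name, top_words(name))
# Is topic_0 "seafaring" and topic_1 "commerce"? The model does not say,
# and note that "ship" appears in both rows: the label is a human,
# interpretive act layered on top of the machine's estimates.
```

    Nothing in the matrix privileges one label over another, which is exactly why the humanities' tools for handling ambiguity are relevant here.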

    It might therefore be useful to assume that a topic is always within the realm of a 'possibility of signification', a textual status that the humanities, and literary theory in particular, have extensively addressed over the past fifty years. 

    As a collection of words, a topic radiates meaning in different ways to different people. The same is true across different settings and purposes. We can understand, then, how the remarkable amount of scholarship on meaning ambiguity and language polysemy typical of the humanities can serve as extremely valuable help and an actual operational toolkit for contemporary data science.

    As the use of topic modeling techniques is likely to become more and more widespread, the problem of data-based identification of topics will increasingly become a central one. This will require the implementation of theories of interpretation that literary studies and humanistic scholarship have refined across their centuries-long traditions of studies.

    Author: Sarah Rubenoff

    Source: Insidebigdata

  • Trendsetting Applications of AI in Healthcare

    Trendsetting Applications of AI in Healthcare

    We live in a digital age, so it’s not surprising that the healthcare industry follows suit. From the data-driven insights of wearables to mobile apps that help manage chronic conditions, it’s clear that technology is changing healthcare forever.

    But what exactly are people getting their hands on? As a long-time public health physician and leader on digital transformation in healthcare, here are some of the most exciting trends in digital health that I see shaping the industry right now.

    Consumer AI

    Many health systems have already chosen to incorporate artificial intelligence (AI) into their operations — and it’s not just for cost-saving purposes. With the help of AI, healthcare organizations can better tackle complex challenges related to population health management, such as declines in patient satisfaction, rising readmission rates and the rising costs of care.

    Apart from this, consumer AI also has a significant role in improving the current state of healthcare. At home, for instance, AI can help patients better understand their symptoms and treatments by making personalized recommendations generated from an individual’s own unique biological data.

    After all, personalization is key to providing better healthcare outcomes. What is also helping push consumer AI further into healthcare is its use as a popular application for wearables. Apps such as Google Fit and Apple HealthKit provide individuals with an in-depth picture of their health and wellness through tracking data, which can potentially help them better understand their conditions.

    Challenges in improving the regulatory framework will persist. In what some experts describe as “digital trust,” these issues of privacy in health data will need to be addressed as innovations rapidly emerge in response to the need for patient-centered care.

    Big Health Data

    The proliferation of wearable technology also helps contribute to the rise of big health data, an ever-increasing trove of information that can offer businesses and healthcare providers useful insights about patient care.

    For instance, big health data can be used to better predict the onset of chronic conditions among patients who are predisposed to them. It can also create more efficient and effective clinical pathways and improve hospital management operations. This aligns with recent calls to shift the emphasis from people’s medical records to their overall health plans, and to use data to deliver information rather than simply support transactions.

    There is also now a greater focus on developing big health data initiatives that enable secure exchange between healthcare providers and their patients, especially when it comes to cloud-based technologies that offer real-time tracking of patient health data.

    Cloud Data

    The adoption of cloud computing technologies is also helping to usher in a new era of digital transformation in healthcare. Cloud computing enables rapid data access and processing, which can help healthcare providers make more informed, real-time decisions.

    Healthcare organizations are also starting to use cloud-based technologies for better information management. This includes adopting solutions such as electronic health records (EHRs) since they allow healthcare providers to store, manage and share data more easily.

    Cloud networks are also paving the way for better telehealth solutions like remote medical monitoring and mobile health services. In the future, I see virtual healthcare services as becoming an increasingly viable option for patients who want to stay at home.

    Drug Discovery With Machine Learning

    With the rise of big health data comes an increased emphasis on machine learning (ML) in healthcare. ML applies predictive analytics to sift through huge amounts of medical data and identify patterns that can be used to improve patient outcomes.

    In the coming years, I predict we’ll see a bigger focus on applying ML technologies to drug discovery, drug development and pharmaceutical industry processes. An example of this is the use of ML to predict patient drug responses that can help identify which patients will benefit the most from a certain treatment.

    This type of predictive analytics works well with genetic data and offers clues about how an individual will react in specific situations. Using predictive analytics in this way enables healthcare providers to deliver targeted care plans that are based on individual patient needs.
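    A minimal sketch of this kind of predictive model, with entirely hypothetical genetic-marker data and a simple decision tree standing in for whatever model a real pipeline would use:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: each row marks presence (1) or absence (0)
# of three illustrative genetic variants; labels record whether the
# patient responded to the drug. All values are invented for this sketch.
X = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Flag whether a new patient is likely to benefit from the treatment
print(model.predict([[1, 1, 0]]))
```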

    Personalized Genetic Testing

    Genetic testing is another area where predictive analytics will play a major role in the future of consumer AI and healthcare. With genetic testing, healthcare providers can analyze an individual’s DNA to create a model that predicts how they are likely to respond to certain drugs or treatments.

    Using this type of advanced predictive analytics enables drug developers to develop personalized treatment plans that could potentially improve the lives of patients with certain conditions. For example, pharmacogenetic testing has recently been used to treat chronic pain in children, and the process can potentially save billions of dollars that go to ineffective drug therapies.

    With the proliferation of this technology and service, consumers will have more options for genetic testing, and these tests can give healthcare providers access to even greater amounts of data in order to improve patient outcomes and treat patients more effectively. However, more research is needed to determine the benefit of this testing broadly in diverse populations.

    Bottom Line

    Digital health transformation will continue to gain momentum over the next few years, and healthcare providers are increasingly looking toward digital technologies for ways to improve patient care while determining the best practices in a regulatory framework.

    As more people worldwide get access to smart technology in their homes, health-related apps and services, including telehealth solutions, will continue to become an increasingly viable option for patients who want to stay home but still get quality healthcare. And as innovation continues to move ahead, global policies and regulations will have to determine how best to use these technologies to ensure safety and efficacy.

    Author: Anita Gupta

    Source: Forbes

  • Using Artificial Intelligence to see future virus threats coming

    Using Artificial Intelligence to see future virus threats coming

    Researchers use machine learning algorithms in novel approach to finding future zoonotic virus threats.

    Most of the emerging infectious diseases that threaten humans – including coronaviruses – are zoonotic, meaning they originate in another animal species. And as population sizes soar and urbanisation expands, encounters with creatures harbouring potentially dangerous diseases are becoming ever more likely.

    Identifying these viruses early, then, is becoming vitally important. A new study out today in PLOS Biology from a team of researchers at the University of Glasgow, UK, has identified a novel way to do this kind of viral detective work, using machine learning to predict the likelihood of a virus jumping to humans.

    According to the researchers, a major stumbling block for understanding zoonotic disease has been that scientists tend to prioritise well-known zoonotic virus families based on their common features. This means that there are potentially myriad viruses, unrelated to known zoonotic diseases, that have not been discovered or are not well known, yet may hold zoonotic potential – the ability to make the species leap.

    In order to circumvent this problem, the team developed a machine learning algorithm that could infer the zoonotic potential of a virus from its genome sequence alone, by identifying characteristics that link it to humans, rather than looking at taxonomic relationships between the virus being studied and existing zoonotic viruses.

    The team found that viral genomes may have generalisable features that enable them to infect humans, but which are not necessarily taxonomically closely related to other human-infecting viruses. They say this approach may present a novel opportunity for viral sleuthing.
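    The idea of inferring zoonotic potential from genome sequence alone can be sketched, in heavily simplified form, as a classifier over k-mer frequency features; the sequences, labels, and model choice below are invented for illustration and are not the study’s actual method:

```python
from collections import Counter
from itertools import product
from sklearn.linear_model import LogisticRegression

# Toy genome fragments, labelled 1 if (hypothetically) human-infecting
seqs = ["ATGCGATACGCTT", "ATGCGTTACGATT", "GGCCGGTTCCGGA", "GGTTGGCCGGAAC"]
labels = [1, 1, 0, 0]

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]  # all 16 2-mers

def kmer_features(seq, k=2):
    # Represent a sequence by its k-mer frequency profile
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts[km] / total for km in KMERS]

model = LogisticRegression().fit([kmer_features(s) for s in seqs], labels)

# Rank a new sequence by predicted zoonotic potential (a probability)
print(model.predict_proba([kmer_features("ATGCGATACGATT")])[0][1])
```

    The point is the feature choice: the model sees only properties of the genome itself, not the virus’s taxonomic neighbours.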

    “By highlighting viruses with the greatest potential to become zoonotic, genome-based ranking allows further ecological and virological characterisation to be targeted more effectively,” the authors write.

    “These findings add a crucial piece to the already surprising amount of information that we can extract from the genetic sequence of viruses using AI techniques,” says co-author Simon Babayan.

    “A genomic sequence is typically the first, and often only, information we have on newly discovered viruses, and the more information we can extract from it, the sooner we might identify the virus’s origins and the zoonotic risk it may pose.

    “As more viruses are characterised, the more effective our machine learning models will become at identifying the rare viruses that ought to be closely monitored and prioritised for pre-emptive vaccine development.”

    Author: Amalyah Hart

    Source: Cosmos

  • Using the right workforce options to develop AI with the help of data

    Using the right workforce options to develop AI with the help of data

    While it may seem like artificial intelligence (AI) has hit the jackpot, a lot of work needs to be done before its potential can really come to life. In our modern take on the 20th century space race, AI developers are hard at work on the next big breakthrough that will solve a problem and establish their expertise in the market. It takes a lot of hard work for innovators to deliver on their vision for AI, and it’s the data that serves as the lifeblood for advancement.  

    One of the biggest challenges AI developers face today is processing all the data that feeds into machine learning systems, a task that requires a reliable workforce with relevant domain expertise and high standards for quality. To address these obstacles and get ahead, many innovators are taking a page from the enterprise playbook, where alternative workforce models can provide a competitive edge in a crowded market.

    Alternative workforce options

    Deloitte’s 2018 Global Human Capital Trends study found that only 42% of organizations surveyed said their workforce is made up of traditional salaried employees. Employers expect their dependence on contract, freelance and gig workers to dramatically increase over the next few years. Accelerating this trend is the pressure business leaders face to improve their workforce ecosystem, as alternative workforce options give companies the possibility to advance services, move faster and leverage new skills.

    While AI developers might be tempted to tap into new workforce solutions, identifying the right approach for their unique needs demands careful consideration. Here’s an overview of common workforce options and considerations for companies to select the right strategy for cleaning and structuring the messy, raw data that holds the potential to add rocket fuel to your AI efforts:

    • In-house employees: The first line of defense for most companies, internal teams can typically manage data needs with reasonably good quality. However, these processes often grow more difficult and costlier to manage as things progress, calling for a change of plans when it’s time to scale. That’s when companies are likely to turn to alternative workforce options to help structure data for AI development.
    • Contractors and freelancers: This is a common alternative to in-house teams, but business leaders will want to factor in extra time it will take to source and manage their freelance team. One-third of Deloitte’s survey respondents said their human resources (HR) departments are not involved in sourcing (39%) or hiring (35%) decisions for contract employees, which 'suggests that these workers are not subject to the cultural, skills, and other forms of assessments used for full-time employees'. That can be a problem when it comes to ensuring quality work, so companies should allocate additional time for sourcing, training and management.
    • Crowdsourcing: Crowdsourcing leverages the cloud to send data tasks to a large number of people at once. Quality is established using consensus, which means several people complete the same task. The answer provided by the majority of the workers is chosen as correct. Crowd workers are paid based on the number of tasks they complete on the platform provided by the workforce vendor, so it can take more time to process data outputs than it would with an in-house team. This can make crowdsourcing a less viable option for companies that are looking to scale quickly, particularly if their work requires a high level of quality, as with data that provides the intelligence for a self-driving car, for example.
    • Managed cloud workers: A solution that has emerged over the last decade, combining the quality of a trained, in-house team with the scalability of the crowd. It is ideally suited for data work because dedicated teams stick with projects for longer periods and develop expertise in a company’s business rules over time. That means they can deepen their context and domain knowledge while providing consistently high data quality. However, teams need to be managed in ways that optimize productivity and engagement, and that takes deliberate effort. Companies should look for partners with tested procedures for communication and process.
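    The consensus mechanism described for crowdsourcing boils down to a majority vote over redundant answers, something like:

```python
from collections import Counter

def consensus_label(answers):
    """Pick the answer given by the majority of crowd workers."""
    (winner, _count), = Counter(answers).most_common(1)
    return winner

# Three workers complete the same labeling task
print(consensus_label(["car", "car", "truck"]))
```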

    Getting down to business

    From founders and data scientists to product owners and engineers, AI developers are fighting an uphill battle. They need all the support they can get, and that includes a dedicated team to process the data that serves as the lifeblood of AI and machine learning systems. When you combine the training and management challenges that AI developers face, workforce choices might just be the factor that determines success. With the right workforce strategy, companies will have the flexibility to respond to changes in market conditions, product development and business requirements.

    As with the space race, the pursuit of AI in the real world holds untold promise, but victory won’t come easy. Progress is hard-won, and innovators who identify strong workforce partners will have the tools and talent they need to test their models, fail faster and ultimately get it right sooner. Companies that make this process a priority now can ensure they’re in the best position to break away from the competition as the AI race continues.

    Author: Mark Sears

    Source: Dataconomy

  • Visualization, analytics and machine learning - Are they fads, or fashions?

    I was recently a presenter in the financial planning and analysis (FP&A) track at an analytics conference, where a speaker in one of the customer marketing tracks said something that stimulated my thinking. He said, “Just because something is shiny and new or is now the ‘in’ thing, it doesn’t mean it works for everyone.”

    That got me to thinking about some of the new ideas and innovations that organizations are being exposed to and experimenting with. Are they fads and new fashions or something that will more permanently stick? Let’s discuss a few of them:


    Visualization software

    Visualization software is all the rage. Your mother told you when you were a child, “Looks are not everything.” Well, she was wrong. Viewing table data visually, as in a bar histogram, enables people to quickly grasp information with perspective. But be cautious. Yes, it might be nice to import your table data from your spreadsheets and display it in a dashboard! Won’t that be fun? It may be fun, but what are the unintended consequences of reporting performance measures as a dial or barometer?

    A concern I have is that measures reported in isolation of other measures provides little to no context as to why the measure is being reported and what “drives” the measure. Ideally dashboard measures should have some cause-and-effect relationship with key performance indicators (KPIs) that should be derived from a strategy map and reported in a balanced scorecard. 

    KPIs monitor progress toward accomplishing the 15-25 strategic objective boxes in a strategy map defined by the executive team. The strategy map provides the context from which the dashboard performance indicators (PIs) can be tested and validated for their alignment with the executive team’s strategy.

    Business analytics

    Talk about something that is “hot.” Who has not heard the terms Big Data and business analytics? If you raised your hand, then I am honored to apparently be the first blogger you have ever read. Business analytics is definitely the next managerial wave. I am biased toward it because my 1971 university degree was in industrial engineering and operations research. I love looking at statistics. So do television sports fans, who are now provided “stats” for teams and players in football, baseball, golf and every kind of televised sport. But the peril of business analytics is that it needs to serve a purpose: problem solving or seeking opportunities.

    The analytics thought leader James Taylor advises, “Work backwards with the end in mind.” That is, know why you are applying analytics. Experienced analysts typically start with a hypothesis to prove or disprove. They don’t apply analytics as if searching for a diamond in a coal mine, and they don’t flog the data until it confesses the truth. Instead, they first speculate that two or more things are related, or that some underlying behavior is driving a pattern seen in various data.

    Machine learning and cognitive software

    There are an increasing number of articles and blogs with this theme related to artificial intelligence – the robots are coming and they will replace jobs. Here is my take. Many executives, managers, and organizations underestimate how soon they will be affected and the severity of the impact. This means that many organizations are unprepared for the effects of digital disruption and may pay the price through lower competitive performance and lost business. Thus it is important to recognize not only the speed of digital disruption, but also the opportunities and risks that it brings, so that the organization can adjust and re-skill its employees to add value.

    Organizations that embrace a “digital disruptor” way of thinking will gain a competitive edge. Digitization will create new products and services for new markets providing potentially substantial returns for investors in these new business models. Organizations must either “disrupt” or “be disrupted”. Companies often fail to recognize disruptive threats until it is too late. And even if they do, they fail to act boldly and quickly enough. Embracing “digital transformation” is their recourse for protection.

    Fads or fashions?

    Are these fads and fashions or the real deal? Are managers attracted to them as the shiny new toys they must have on their resume for their next bigger job and employer? My belief is that these three “hot” managerial methods and tools are essential. But they need to be thought through, properly designed and customized; not slapped in willy-nilly just to have them as shiny new toys.

    Source: Gary Cokins (Information Management)

  • What about the relation between AI and machine learning?

    Artificial intelligence is one of the most compelling areas of computer science research. AI technologies have gone through periods of innovation and growth, but never has AI research and development seemed as promising as it does now. This is due in part to amazing developments within machine learning, deep learning, and neural networks.

    Machine learning, a cutting-edge branch of artificial intelligence, is propelling the AI field further than ever before. While AI assistants like Siri, Cortana, and Bixby are useful, if not amusing, applications of AI, they lack the ability to learn, self-correct, and self-improve. 

    They are unable to operate outside of their code, learn independently, or apply past experiences to new problems. Machine learning is changing that. Machines are able to grow beyond their original code, which allows them to mimic the cognitive processes of the human mind.

    Why is machine learning important for AI? As you have most likely already gathered, machine learning is the branch of AI dedicated to endowing machines with the ability to learn. While there are programs that help sort your email, provide you with personalized recommendations based on your online shopping behavior, and make playlists based on music you like, these programs lack the ability to truly think for themselves. 

    While these “weak AI” programs are able to analyze data well and conjure up impressive responses, they are a far cry from true artificial intelligence. Arriving at anything close to true artificial intelligence would require a machine to learn. A machine with true artificial intelligence, also known as artificial general intelligence, would be aware of its environment and would manipulate that environment to achieve its goals. A machine with artificial general intelligence would be no different from a human, who is aware of his or her surroundings and uses that awareness to arrive at solutions to problems occurring within those surroundings.

    You may be familiar with the infamous AlphaGo program that beat a professional Go player in 2016, to the chagrin of many professional Go players. While AI had beaten chess players in the past, the win came as an incredible shock to Go players and AI researchers alike. Surpassing Go players was previously thought to be impossible, given that each move in the ancient game has an almost infinite number of permutations. Decisions in Go are so intricate and complex that the game was thought to require human intuition. As it happens, Go does not require human intuition; it only requires general-purpose learning algorithms.

    How were these general-purpose learning algorithms crafted? The AlphaGo program was created by DeepMind Technologies, an AI company acquired by Google in 2014. Its researchers and its C++, Lua, and Python developers built a neural network, along with a model that allows machines to mimic short-term memory. The neural network and the short-term memory model are applications of deep learning, a cutting-edge branch of machine learning.

    Deep learning is an approach to machine learning in which software emulates the human brain. Current machine learning applications allow a machine to train on a certain task by analyzing examples of that task. Deep learning allows machines to learn in a more general way. So, instead of simply mimicking cognitive functioning in a predefined task, machines are endowed with what can be thought of as a sort of artificial brain. This artificial brain is called an artificial neural network, or neural net for short.

    There are several neural net models in use today, and all use mathematics to copy the structure of the human brain. Neural nets are divided into layers and consist of thousands, sometimes millions, of interconnected processing nodes. Each connection between nodes is given a weight. If the weight is over a predefined threshold, the node’s data is sent on to the next layer. These nodes act as artificial neurons, sharing clusters of data, storing experience and knowledge based on that data, and firing off new bits of information. The nodes interact dynamically and change thresholds and weights as they learn from experience.
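    A single layer of that weight-and-threshold behavior can be sketched in a few lines; the numbers are arbitrary illustrations:

```python
import numpy as np

inputs = np.array([0.9, 0.2, 0.7])       # signals from the previous layer
weights = np.array([[0.4, 0.1, 0.6],     # node A's connection weights
                    [0.3, 0.8, 0.2]])    # node B's connection weights
threshold = 0.5

activations = weights @ inputs           # weighted sum at each node
fired = activations > threshold          # only strong signals pass on
print(activations, fired)
```

    In a real network the thresholds and weights are not fixed: training adjusts them as the net learns from experience.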

    Machine learning and deep learning are exciting and alarming areas of research within AI. Endowing machines with the ability to learn certain tasks could be extremely useful, could increase productivity, and could help expedite all sorts of activities, from search algorithms to data mining. Deep learning provides even more opportunities for AI’s growth. As researchers delve deeper into deep learning, we could see machines that understand the mechanics behind learning itself, rather than simply mimicking intellectual tasks.

    Author: Greg Robinson

    Source: Information Management

  • What is edge intelligence and how to apply it?

    What is edge intelligence and how to apply it?

    The term “edge intelligence,” also referred to as “intelligence on the edge,” describes a new phase in edge computing. Organizations are using edge intelligence to develop smarter factory floors, retail experiences, workspaces, buildings, and cities. The edge has become “intelligent” by way of analytics that were formerly limited to the cloud or in-house data centers. In this process, an intelligent remote sensor node may make a decision on the spot or send the data to a gateway for further screening before sending it to the cloud or another storage system.

    Mining big data for useful insights can be a major challenge. Searching through data is very much like panning for gold, a time consuming task with occasional rewards. Organizations are aware of the strategic importance of big data and analytics, but there are still hurdles to overcome.

    While data can give a business a competitive edge, there is also the potential to swamp their storage systems with worthless information. There is simply an overwhelming amount of data being created on a daily basis, much of which is useless. Asha Keddy, Corporate Vice President and Manager of Next Generation and Standards at Intel, stated, “We’re generating too much data.”

    Prior to edge computing, streams of data were sent straight from the internet of things (IoT) to a central data storage system. Early edge computing was an effort to provide a data screening process using micro-data stations (preferably within 100 square feet of the sensor nodes) to eliminate unnecessary or redundant data before sending it on. In simpler terms, early edge computing attempted to send leaner, more efficient data streams, with less data to store and process on the primary system.
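    That screening step can be pictured as a simple change-detection filter running at the edge; the tolerance value here is an arbitrary assumption:

```python
def screen(readings, tolerance=0.1):
    """Forward only readings that differ meaningfully from the last one sent."""
    forwarded, last = [], None
    for r in readings:
        if last is None or abs(r - last) > tolerance:
            forwarded.append(r)
            last = r
    return forwarded

# Redundant temperature samples are dropped before transmission
print(screen([20.0, 20.01, 20.02, 21.5, 21.5, 19.0]))
```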

    Cities, buildings, and industrial systems start with an edge sensor node, which senses and measures a specific range of information that is then used in making key decisions. The edge nodes can process data intelligently, and can bundle, refine, or encrypt the data for transmission to a data storage system. Ideally, an edge node is small, unobtrusive, and can fit in environments with minimal amounts of space.

    The intelligence aspect

    There is a wide variety of sensing devices available for use at the edge, providing all kinds of data on such things as vibrations, sound, temperature, humidity, motion, pressure, pollutants, audio, and video. The screened data is then transmitted through a gateway to the cloud for storage and further analysis. These gateways are essentially small servers that sit between an organization’s cloud or data center and the sensors being used.

    Edge gateways have developed into architectural components that improve the performance of IoT networks. These gateways are available as off-the-shelf devices that are adaptable enough to mix and match with differing clouds and sensors. Different gateways are used for different tasks: a gateway performing real-time analysis of data from a factory floor will need to be more powerful than one that simply tracks the location data of an automated fulfillment center.

    Connected sensors provide a broad range of information that should be used in making key decisions. The edge node is the data source, and if recorded information is faulty and of poor quality, use of the data can do more damage than good.

    Machine learning

    Machine learning (ML) is an important aspect of edge intelligence, and chips designed for running ML models are commercially available. ML can detect patterns and anomalies in the data stream and initiate the appropriate response.

    Machine learning provides support for factories, smart cities, smart grids, augmented and virtual reality, connected vehicles, and healthcare systems. ML models are trained in the cloud and then used to make the edge intelligent.

    Machine learning is an effective way of creating a functional AI. Many ML techniques, such as decision trees, Bayesian networks, and K-means clustering, have been developed to train the AI entity to make both classifications and predictions. Deep learning (a subdivision of the ML field) is one such technique, and uses an artificial neural network. Deep learning has resulted in impressive abilities to perform multiple tasks, classify images, and recognize faces.
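    As one concrete example of these techniques, K-means clustering can separate sensor readings into operating regimes without any labelled data; the readings below are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative (temperature, vibration) readings from a piece of equipment
readings = np.array([[20.1, 0.2], [19.8, 0.3], [20.5, 0.1],
                     [35.0, 2.1], [34.2, 2.4]])

# Two clusters: roughly "normal" vs "anomalous" operating conditions
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(readings)
print(km.labels_)
```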

    Artificial intelligence

    While machine learning is becoming quite popular with sensor nodes in the manufacturing industry, artificial intelligence (AI) is being applied to the big data being gathered from such things as social media contents, business informatics, and online shopping records.

    This data was generally sent to and stored in massive data centers. However, with the expansion of mobile computing and the internet of things, that trend is starting to reverse itself. Cisco has estimated that by 2021, nearly 850 ZB of data will be produced by all the people, machines, and things on the network edge.

    Transporting bulk data from the IoT devices (smart phones and iPads) to the cloud for analytics can be expensive and inefficient. A recent solution uses on-device analytics that run AI applications to process IoT data locally. This situation, however, is not ideal. These AI applications require significant computational power (the kind not available on a smart phone), and often suffer from low performance and energy efficiency issues.

    One proposal suggests dealing with these challenges by pushing cloud services from the network’s core, out to the network’s edges. An edge node sensor can be the smart phone or other mobile device. The sensor communicates with a network gateway, or a micro-data center. Physical proximity to data source devices is the most important characteristic in this situation. (Let’s say you have a smart phone. Its GPS would send a signal to a nearby 5G sensor on a telephone pole, which then sends it to a gateway that would determine your location, and then send the refined, finalized data to the cloud for storage or further analysis).

    Since 2009, Microsoft has been conducting continuous research on what applications should be shifted to the edge from the cloud. Their research ranges from voice command recognition to interactive cloud gaming to real-time video analytics.

    Real-time video analytics is predicted to become a very popular application for edge computing. As an application built atop computer vision, real-time video analytics will continuously gather high-definition videos taken from surveillance cameras. These applications require high computation, high bandwidth, and low latency to analyze the videos. This is made possible by extending the cloud’s AI to gateways covering the edge.

    The smart factory

    One type of sensor that is fast gaining popularity measures the vibrations of equipment with mechanical components (rotating shafts or gears). These multi-axis sensors measure the vibrational displacement of the equipment in real time. The vibrational displacement can then be processed and compared with the acceptable range of displacement. In a factory, analyzing this information can increase efficiency, reduce down-time, and predict mechanical failures before they happen. In some cases, a piece of equipment with a disintegrating mechanical component, which will cause further damage, can be shut down immediately.

    The time needed for sensor nodes to react can be dramatically reduced by including edge node analytics. A MEMS sensor, for example, will provide a warning when threshold limits are exceeded, and will immediately send out an alert. If data suggests the event is bad enough, the sensor may disable the equipment automatically, preventing a catastrophic breakdown.
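    The MEMS example amounts to threshold logic running on the node itself; the limits below are illustrative, not real specifications:

```python
WARN_LIMIT = 2.0      # acceptable vibrational displacement (illustrative units)
SHUTDOWN_LIMIT = 5.0  # beyond this, further damage is likely

def evaluate(displacement):
    """Edge-node decision: react locally instead of waiting on the cloud."""
    if displacement > SHUTDOWN_LIMIT:
        return "shutdown"  # disable the equipment before catastrophic failure
    if displacement > WARN_LIMIT:
        return "alert"     # send a warning but keep running
    return "ok"

print(evaluate(1.2), evaluate(3.4), evaluate(6.0))
```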

    Smart city

    In smart cities, some industrial IoT edge node sensors can be used, such as an industrial camera with embedded video analytics. The mission statement of a smart city typically includes the desire to integrate and communicate useful information to its citizens and employees. A common application provides parking space availability. Cameras can be used to identify a wide variety of objects (such as parked cars) and to detect motion. This can also be used to analyze movement historically.

    Other sensors are designed specifically for smart cities, such as pollution sensors that warn city officials when a business has exceeded its allowable standards. A sound-level sensor can be installed in some areas, or a sensor might monitor vehicle and pedestrian traffic to optimize walking and driving routes. Citizens can have their energy and water consumption monitored to get advice on reducing their usage.

    The future of edge computing

    The intelligent edge continues to gain in popularity, connecting devices and systems to gather and analyze data. The number of IoT devices being used worldwide has exploded, and cloud computing is becoming overwhelmed with the volume of data being produced. The intelligent edge not only provides real-time insights on operational efficiency, such as improving maintenance for vital equipment before it breaks down, but also screens out useless data.

    A seamless, synchronized user experience is a basic goal of many internet organizations. For technology vendors, the intelligent edge and its connected devices provide opportunities for developing smarter, more integrated systems. These connected devices reduce the cloud’s burden by screening out useless data, and businesses ignoring the concept of edge computing risk losing any competitive advantage they might have had in manufacturing or customer service.

    Author: Keith D. Foote

    Source: Dataversity

  • What is Machine Learning? And which are the best practices?

    What is Machine Learning? And which are the best practices?

    As machine learning continues to address common use cases, it is crucial to consider what it takes to operationalize your data into a practical, maintainable solution. This is particularly important when you want to predict customer behavior more accurately, make more relevant product recommendations, personalize a treatment, or improve the accuracy of research. In this blog, we will attempt to understand what machine learning means, what it takes to make it work, and what the best machine learning practices are.

    What is ML?

    Machine learning is a computer programming technique that uses statistical probabilities to give computers the ability to 'learn' without being explicitly programmed. Simply put, a machine learning system 'learns' from its exposure to external information: it makes decisions according to the data it interacts with and uses statistical probabilities to determine each outcome. These probabilities are computed by various algorithms, some of them loosely modeled on the human brain. In this way, every prediction is backed by mathematical evidence derived from previous experience.

    A good example of machine learning is the sunrise example. A computer, for instance, cannot learn that the sun will rise every day if it does not already know the inner workings of the solar system and our planets, and so on. Alternatively, a computer can learn that the sun rises daily by observing and recording relevant events over a period of time.

    After the computer has witnessed the sunrise at the same time for 365 consecutive days, it will calculate, with high probability, that the sun will rise again on the three hundred and sixty-sixth day. Of course, there will still be an infinitesimal chance that the sun won't rise the day after, as the statistical data collected thus far will never allow for a 100% probability.
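    One classical way to make the sunrise arithmetic concrete is Laplace's rule of succession, which turns n consecutive successes (and no failures) into the estimate (n + 1) / (n + 2): a probability that grows with evidence but never reaches 1. The sketch below is an illustrative aside, not a formula from the original article.

```python
from fractions import Fraction

def prob_sunrise_tomorrow(days_observed):
    """Laplace's rule of succession: after n successes in a row (and no
    failures), estimate the chance of one more success as (n+1)/(n+2)."""
    n = days_observed
    return Fraction(n + 1, n + 2)

p = prob_sunrise_tomorrow(365)  # 366/367: very likely, but never 100%
```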

    There are three types of machine learning:

    1. Supervised Machine Learning

    In supervised machine learning, the computer learns the general rule that maps inputs to desired target outputs. Also known as predictive modeling, supervised machine learning can be used to make predictions about unseen or future data such as predicting the market value of a car (output) from the make (input) and other inputs (age, mileage, etc).
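    As a minimal sketch of the idea, here is a toy supervised model that learns the input-to-output rule 'age predicts price' from labeled examples via least-squares. The figures are invented, and a real car-value model would of course use many more inputs than age alone.

```python
# Toy supervised learning: fit a line mapping a car's age (input) to its
# market value (output). All numbers are made up for illustration.
ages = [1, 2, 3, 4, 5]                        # years
prices = [18000, 16000, 14000, 12000, 10000]  # dollars

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(prices) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices)) \
        / sum((x - mean_x) ** 2 for x in ages)
intercept = mean_y - slope * mean_x

def predict_price(age):
    """Apply the learned input -> output rule to unseen data."""
    return intercept + slope * age

predict_price(6)  # a prediction for a car the model has never seen
```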

    2. Unsupervised Machine Learning

    In unsupervised machine learning, the algorithm is left on its own to find structure in its input and discover hidden patterns in the data. This is also known as 'feature learning'.

    For example, a marketing automation program can target audiences based on the demographics and purchasing habits it learns.
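    A toy sketch of the idea: a tiny one-dimensional k-means with k = 2, grouping customers by annual spend without any labels. The spend figures are invented, and a production system would use a library implementation rather than this hand-rolled loop.

```python
# Unsupervised learning sketch: discover two customer segments from
# unlabeled spend data with a minimal 1-D k-means (k = 2).
spend = [120, 150, 130, 900, 950, 980]  # made-up annual spend per customer

centroids = [min(spend), max(spend)]    # crude initialization
for _ in range(10):                     # a few refinement passes suffice here
    groups = {0: [], 1: []}
    for s in spend:
        nearest = min((0, 1), key=lambda i: abs(s - centroids[i]))
        groups[nearest].append(s)
    centroids = [sum(g) / len(g) for g in groups.values()]

# centroids now sit near the two natural spending segments
```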

    3. Reinforcement Machine Learning

    In reinforcement machine learning, a computer program interacts with a dynamic environment in which it must perform a certain goal, such as driving a vehicle or playing a game against an opponent. The program is given feedback in terms of rewards and punishments as it navigates the problem space, and it learns to determine the best behavior in that context.
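    The reward-and-punishment loop can be sketched with tabular Q-learning on a toy corridor world where only the rightmost cell pays a reward. Everything here (the environment, rewards, and hyperparameters) is invented for illustration; it is not a driving or game-playing agent.

```python
import random

# Toy reinforcement learning: an agent in a 5-cell corridor learns,
# purely from reward feedback, that stepping right is the best behavior.
random.seed(0)
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                      # step left / step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for _ in range(500):                    # training episodes
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0  # reward only at the goal
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# After training, "move right" dominates in every non-goal state.
```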

    Making ML work with data quality

    Machine Learning depends on data. Good quality data is needed for successful ML. The more reliable, accurate, up-to-date and comprehensive that data is, the better the results will be. However, typical issues including missing data, inconsistent data values, autocorrelation and so forth will affect the statistical properties of the datasets and interfere with the assumptions made by algorithms. It is vital to implement data quality standards with your team throughout the beginning stages of the machine learning initiative.
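    In practice, such standards often start as a simple automated gate run before training. The sketch below assumes records arrive as dictionaries; the field names and rules are invented, not any standard.

```python
# A minimal data-quality gate for incoming records. Field names and
# validation rules are illustrative assumptions.
REQUIRED = ("customer_id", "country", "signup_date")

def quality_issues(record):
    """Return a list of data-quality problems found in one record."""
    issues = []
    for field in REQUIRED:
        if not record.get(field):
            issues.append(f"missing {field}")
    country = record.get("country")
    if country and country != country.upper():
        issues.append("inconsistent country code casing")
    return issues

quality_issues({"customer_id": "C1", "country": "nl", "signup_date": ""})
```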

    Democratizing and operationalizing

    Machine Learning can appear complex and hard to deliver. But if you have the right people involved from the beginning, with the right skills and knowledge, there will be less to worry about.

    Get the right people involved on your team, people who:

    • can identify the data task, choose the right model, and apply the appropriate algorithms to address the specific business case
    • have skills in data engineering, as machine learning is all about data
    • will choose the right programming language or framework for your needs
    • have a background in general logic and basic programming
    • have a good understanding of core mathematics, especially linear algebra, calculus, probability, and statistics, to manage most standard machine learning algorithms effectively

    Most importantly, share the wealth. What good is a well-designed machine learning strategy if the rest of your organization cannot join in on the fun? Provide a comprehensive ecosystem of user-friendly, self-service tools that incorporates machine learning into your data transformation for equal access and quicker insights. A single platform that brings all your data together from public and private cloud as well as on-premise environments will enable your IT and business teams to work more closely and constructively while remaining at the forefront of innovation.

    Machine Learning best practices

    Now that you are prepared to take a data integration project that involves machine learning head-on, it is worth following these best practices below to ensure the best outcome:

    1. Understand the use case – Assessing the problem you are trying to solve will help determine whether machine learning is necessary or not.
    2. Explore data and scope – It is essential to assess the scope, type, variety and velocity of data required to solve the problem.
    3. Research model or algorithm – Finding the best-fit model or algorithm is about balancing speed, accuracy and complexity.
    4. Pre-process – Data must be collated into a format or shape which is suitable for the chosen algorithm.
    5. Train – Teach your model with existing data and known outcomes.
    6. Test – Test against held-out data that was not used in training to verify accuracy.
    7. Operationalize – After training and validating, start calculating and predicting outcomes with new data.
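    Steps 4 through 7 above can be sketched end to end with a toy nearest-centroid classifier; the data, labels, and train/test split are all invented for illustration.

```python
# Invented measurements with known outcomes ("small"/"large").
data = [(1.0, "small"), (1.2, "small"), (0.9, "small"),
        (5.1, "large"), (4.8, "large"), (5.3, "large")]

# 4. Pre-process: collate the data into the shape the algorithm expects.
train, test = data[:2] + data[3:5], [data[2], data[5]]

# 5. Train: learn one centroid per known outcome.
centroids = {}
for label in ("small", "large"):
    values = [x for x, y in train if y == label]
    centroids[label] = sum(values) / len(values)

def predict(x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# 6. Test: measure accuracy on held-out examples.
accuracy = sum(predict(x) == y for x, y in test) / len(test)

# 7. Operationalize: score genuinely new data.
predict(4.9)
```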

    As data increases, more observations are made. This results in more accurate predictions. Thus, a key part of a successful data integration project is creating a scalable machine learning strategy that starts with good quality data preparation and ends with valuable and intelligible data. 

    Author: Javier Hernandez

    Source: Talend

  • Why communication on algorithms matters

    Why communication on algorithms matters

    The models you create have real-world applications that affect how your colleagues do their jobs. That means they need to understand what you’ve created, how it works, and what its limitations are. They can’t do any of these things if it’s all one big mystery they don’t understand.

    'I’m afraid I can’t let you do that, Dave… This mission is too important for me to let you jeopardize it'

    Ever since the spectacular 2001: A Space Odyssey became the most-watched movie of 1968, humans have been both fascinated and frightened by the idea of giving AI and machine learning algorithms free rein.

    In Kubrick’s classic, a logically infallible, sentient supercomputer called HAL is tasked with guiding a mission to Jupiter. When it deems the humans on board to be detrimental to the mission, HAL starts to kill them.

    This is an extreme example, but the caution is far from misplaced. As we’ll explore in this article, time and again, we see situations where algorithms 'just doing their job' overlook needs or red flags they weren’t programmed to recognize. 

    This is bad news for people and companies affected by AI and ML gone wrong. But it’s also bad news for the organizations that shun the transformative potential of machine learning algorithms out of fear and distrust. 

    Getting to grips with the issue is vital for any CEO or department head that wants to succeed in the marketplace. As a data scientist, it’s your job to enlighten them.

    Algorithms aren't just for data scientists

    To start with, it’s important to remember, always, what you’re actually using AI and ML-backed models for. Presumably, it’s to help extract insights and establish patterns in order to answer critical questions about the health of your organization. To create better ways of predicting where things are headed and to make your business’ operations, processes, and budget allocations more efficient, no matter the industry.

    In other words, you aren’t creating clever algorithms because it’s a fun scientific challenge. You’re creating things with real-world applications that affect how your colleagues do their jobs. That means they need to understand what you’ve created, how it works, and what its limitations are. They need to be able to ask you nuanced questions and raise concerns.

    They can’t do any of these things if the whole thing is one big mystery they don’t understand. 

    When machine learning algorithms get it wrong

    Sometimes, algorithms contain inherent biases that distort predictions and lead to unfair and unhelpful decisions. Just take the racist sentencing scandal in the U.S., where petty criminals were rated more likely to re-offend based on the color of their skin rather than the severity or frequency of their crimes. 

    In a corporate context, the negative fallout of biases in your AI and ML models may be less dramatic, but it can still harm your business or even your customers. Your marketing efforts might exclude certain demographics, to your detriment and theirs. Or you might deny credit plans to customers who deserve them, simply because they share irrelevant characteristics with people who don’t. To stop these kinds of things from happening, your non-technical colleagues need to understand, in simple terms, how the algorithm is constructed, enough to challenge your rationale. Otherwise, they may end up with misleading results.

    Applying constraints to AI and ML models

    One important way forward is for data scientists to collaborate with business teams when deciding what constraints to apply to algorithms.

    Take the 2001: A Space Odyssey example. The problem here wasn’t that the ship used a powerful, deep learning AI program to solve logistical problems, predict outcomes, and counter human errors in order to get the ship to Jupiter. The problem was that the machine learning algorithm created with this single mission in mind had no constraints. It was designed to achieve the mission in the most effective way using any means necessary; preserving human life was not wired in as a priority.

    Now imagine how a similar approach might pan out in a more mundane business context. 

    Let’s say you build an algorithm in a data science platform to help you source the most cost-effective supplies of a particular material used in one of your best-loved products. The resulting system scours the web and orders the cheapest available option that meets the description. Suspiciously cheap, in fact, which you would discover if you were to ask someone from the procurement or R&D team. But without these conversations, you don’t know to enter constraints on the lower limit or source of the product. The material turns out to be counterfeit, and an entire production run is ruined.
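    Once that conversation with procurement has happened, the missing constraint is easy to express. In this hypothetical sketch, the floor price, offer fields, and certification flag are invented stand-ins for whatever the business teams would actually specify.

```python
# Hypothetical sourcing constraint: "cheapest offer wins" only after the
# business-agreed limits are applied. All data here is invented.
offers = [
    {"supplier": "A", "price": 4.10, "certified": True},
    {"supplier": "B", "price": 0.90, "certified": False},  # suspiciously cheap
    {"supplier": "C", "price": 3.80, "certified": True},
]

FLOOR_PRICE = 2.50  # below this, procurement says the material can't be genuine

def pick_offer(offers):
    """Cheapest offer that satisfies the agreed constraints, or None."""
    viable = [o for o in offers
              if o["price"] >= FLOOR_PRICE and o["certified"]]
    return min(viable, key=lambda o: o["price"]) if viable else None

pick_offer(offers)  # supplier "C", not the counterfeit-risk "B"
```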

    How data scientists can communicate better on algorithms

    Most people who aren’t data scientists find talking about the mechanisms of AI and ML very daunting. After all, it’s a complex discipline; that’s why you’re in such high demand. But just because something is tricky at a granular level doesn’t mean you can’t talk about it in simple terms.

    The key is to engage everyone who will use the model as early as possible in its development. Talk to your colleagues about how they’ll use the model and what they need from it. Discuss other priorities and concerns that affect the construction of the algorithm and the constraints you implement. Explain exactly how the results can be used to inform their decision-making, but also where they may want to intervene with human judgment. Make it clear that your door is always open and that the project will evolve over time: you can keep tweaking if it’s not perfect.

    Bear in mind that people will be far more confident about using the results of your algorithms if they can tweak the outcome and adjust parameters themselves. Try to find solutions that give individual people that kind of autonomy. That way, if their instincts tell them something’s wrong, they can explore this further instead of either disregarding the algorithm or ignoring potentially valid concerns.

    Final thoughts: shaping the future of AI

    As Professor Hannah Fry, author of Hello World: How to be human in the age of the machine, explained in an interview with the Economist:

    'If you design an algorithm to tell you the answer but expect the human to double-check it, question it, and know when to override it, you’re essentially creating a recipe for disaster. It’s just not something we’re going to be very good at.

    But if you design your algorithms to wear their uncertainty proudly front and center, to be open and honest with their users about how they came to their decision and all of the messiness and ambiguity it had to cut through to get there, then it’s much easier to know when we should trust our own instincts instead'.

    In other words, if data scientists encourage colleagues to trust implicitly in the HAL-like, infallible wisdom of their algorithms, not only will this lead to problems, it will also undermine trust in AI and ML in the future. 

    Instead, you need to have clear, frank, honest conversations with your colleagues about the potential and limitations of the technology and the responsibilities of those that use it, and you need to do that in a language they understand.

    Author: Shelby Blitz

    Source: Dataconomy

  • Why data is key in driving Robotic Process Automation

    Why data is key in driving Robotic Process Automation

    Enterprise technology innovation and investments are typically driven by compelling events in an organization, especially in the areas of computing and machinery, as companies seek to do business faster and remain competitive. For the past several decades, enterprise innovation has largely revolved around transferring mundane, repetitive, error-prone, and painful tasks over to machines. Businesses will typically install and configure these new technologies, then engage a team of experts to monitor the outputs and make appropriate tweaks to react to changes in the business. The next phase of this continuum is Robotic Process Automation (RPA), which removes the continuous improvement responsibility from experts and trained technicians and transfers it into the machines themselves. To realize the promise of RPA, a solid data foundation built on high-quality, relevant data is absolutely necessary.

    Data driving RPA

    As with any machine learning-based solution, the quality of results is directly related to the training sets and processes used to train and tune the algorithms. The 'garbage-in-garbage-out' principle is certainly at play, but in reality, the data input strategy for RPA solutions is much more nuanced. Depending on the types of problems that RPA is attempting to solve, data sets that contain 'bad' data may be needed as a core or large part of the training sets for the models and inputs. If a company wanted to implement RPA to help automate testing of changes to its ERP landscape, for instance, training sets and tuning processes would need to be established that contained both traditionally 'good' data as well as data that is specifically used to drive exception and failure testing. Real care and curation are needed to ensure the right training paths and data sets are established to drive RPA in the right direction.
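    One simple way to build such failure-path training sets is to deliberately corrupt copies of clean records. The sketch below is purely illustrative: the field names, labels, and corruption rule are invented.

```python
# Seed known-bad records into a training set so the exception and
# failure paths get exercised, not just the happy path.
def corrupt(record):
    """Produce a deliberately bad variant of a clean record."""
    bad = dict(record)
    bad["amount"] = -abs(bad["amount"])   # impossible negative amount
    bad["label"] = "exception"            # so the model learns to flag it
    return bad

clean = [{"amount": 100.0, "label": "ok"}, {"amount": 250.0, "label": "ok"}]
training_set = clean + [corrupt(r) for r in clean]

# Half the set now drives exception and failure testing.
```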

    There is a common misconception around RPA solutions that they can operate in a 'set it and forget it' mode. In reality, the implementation of RPA solutions mirrors the implementation of any modern technology system. An RPA solution is defined by its specification and requirements, which the implementation team takes forward to configure the solution to meet implementation needs. A nuanced difference in RPA implementations is the importance of establishing fit-for-purpose data early in the implementation phases. For example, when a new business process system such as CRM is first put into place, sample data can be used to help drive the implementation and test different features and functional flows. With RPA solutions however, sample data could run the risk of driving bias or misdirected learnings into the solution that would take more time and effort to correct. The cost of an issue increases the further away the issue is located from its source. When an issue with an RPA solution is found late during implementation, it can mean revising many algorithms with large data sets is required, which can translate to increased costs and decreased satisfaction with the solution. Confirming fit-for-purpose data upfront is key for many RPA implementation projects to be successful.

    Data generation through RPA

    Beyond the initial data that goes into RPA solutions, the very nature of RPA is that more data will be generated faster by organizations that take advantage of RPA. With machines running regularly over many iterations and permutations, and with each iteration generating more data, the speed and volumes of data from business processes executed via RPA will inevitably be higher than processes executed by a combination of people and technology today. This increase in data can be a boon both to an organization’s analytical planning and teams as well as for the data packaging and potential valuation of the data. Data scientists and others performing analytical roles will have more inputs to determine areas of investment and innovation based on analysis of the output of RPA data. A general rule is: the larger the data sets, the more statistically trusted the analytical insights, and RPA will generate large data sets for analysis.

    Any uptick in the flow, volume, velocity, or overall size of data will also increase the need to properly manage and understand that data. The most common location for analytical data coming from RPA implementations to be stored is in a data lake. By including the necessary data governance cycles into RPA implementations, especially around data lake setup and Data Science access, the downstream benefits of RPA can be accelerated. Working through a clear understanding of what the RPA data means, who owns it, where it lives, and how it can be used in an appropriate way will help get both buy-in from those across the organization as well as the backing of what is today a typically well-funded and influential part of the business: the data science and analytics teams.

    RPA solutions offer the promise of innovation, acceleration of results, and lessening the mundane for all involved. By starting off with fit-for-purpose data sets to train RPA implementations and a strategically aligned plan for managing and leveraging the data being produced by the RPA solution, organizations will be able to take RPA from hype to helpful by using it as a meaningful part of their digital transformation journey.

    Author: Tyler Warden

    Source: Dataversity

  • Why human-guided machine learning is the way to go

    Why human-guided machine learning is the way to go

    Every data consumer wants to believe that the data they are using to make decisions is complete and up-to-date. However, this is rarely the case. Customers change their address, products experience revisions, new suppliers get on-boarded, etc. All of these common events introduce data variety that can be a nightmare to address if your organization isn’t equipped with the right solutions.

    It’s unrealistic to expect humans to monitor every single piece of data throughout the enterprise for changes. A few changes require human expertise to understand their implications; e.g., selecting an up-to-date billing address is difficult when a person purchases a second home, but not when a person relocates. The majority simply require a consistent set of guidelines and an ability to handle massive volumes of information, two tasks that machines do well. Combining human expertise with the scalability of machine learning provides the best of both worlds, enabling data consumers to finally make decisions based on reliable data.

    The undeniable complexity of enterprise data

    It’s no secret that enterprise data is a messy web of siloed sources and applications. It’s easy to point the finger at internal dynamics, such as 'shadow IT' teams or a broader lack of governance. In reality, much of the complexity of enterprise data is due to external factors: customers begin to prefer to engage through new channels, or the market demands a different category of products. The dynamic nature of customers and markets means that data sources will constantly be evolving, and data complexity will only continue to grow.

    This complexity can come in many forms. A customer may go by different names (e.g., 'Kate O. Woods' and 'K. Woods'), a product may have varying levels of detail across channels (e.g., the weight of a product may be listed on the company’s website but not distributors’), or a supplier may issue invoices under the parent company’s name while conducting all other business under the name of each subsidiary. Enterprises need a reliable way to overcome these complexities to gain a complete view of each of these entities and empower business stakeholders to make good, data-driven decisions.

    Finding a formula that works: human-guided machine learning

    Historically, IT organizations have tried applying rules alone to overcome this data variety. These approaches have massive upfront costs, requiring months or years to start generating meaningful results. Their siloed nature means that outcomes are disconnected from business requirements, making it almost impossible to generate accurate results that are trusted by the people who will be consuming the data. Once in production, very few people understand how the rules work or can modify them, leading to exorbitant maintenance costs.

    A new approach has emerged that overcomes many of the limitations of a rules-only approach. This approach combines human expertise with the modern technology of machine learning to uncover patterns in the data that may be difficult for a human to codify. Humans provide feedback, in the form of simple examples such as ‘yes’ these two records are a match or ‘no’ these two records are not, that the machine uses as input into a model that can be applied across all data sources.

    The model that is developed considers all of the attributes available across the data sources, learning how to compare each attribute individually and as part of the whole record. For example, the model may learn that even if two supplier records have completely different business names but have similar addresses, similar sales contacts, and share a website, they are likely the same supplier.

    Unlike a rules-only approach, the model can compute a confidence level for each set of recommendations. When it is unconfident about a recommendation, it can proactively reach out for feedback, further training its model and giving confidence to end users that the data is being accurately mastered. Rules may get introduced during this process, but by leaving most of the heavy lifting to machines, results can be generated at unprecedented levels of speed and accuracy, so you can finally keep up with changes to your data.
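    A stripped-down sketch of this mastering flow: compare records attribute by attribute, combine the similarities into a confidence score, and route low-confidence pairs to a human for feedback. The similarity measure, weights, thresholds, and supplier records are all invented for illustration; real systems would learn the weights from human-labeled pairs.

```python
# Hypothetical human-guided matching: per-attribute similarity, a
# weighted confidence score, and a "ask a human" band in the middle.
def similarity(a, b):
    """Crude token overlap (Jaccard) between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

WEIGHTS = {"name": 0.3, "address": 0.4, "website": 0.3}  # invented

def match_confidence(rec1, rec2):
    return sum(w * similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

def decide(rec1, rec2, threshold=0.7, review_band=0.3):
    c = match_confidence(rec1, rec2)
    if c >= threshold:
        return "match", c
    if c >= threshold - review_band:
        return "ask a human", c   # the feedback further trains the model
    return "no match", c

supplier_a = {"name": "Acme Industrial", "address": "12 Canal Street Amsterdam",
              "website": "acme.example"}
supplier_b = {"name": "AI Holdings", "address": "12 Canal Street Amsterdam",
              "website": "acme.example"}
decide(supplier_a, supplier_b)  # different names, same address and website
```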

    Getting to a 'single version of the truth'

    The mastering process is important, but often data consumers just want a 'golden record', a record for each entity that contains the best and most up-to-date information available about it. Once data is accurately mastered, business logic can be applied that selects the right information for each attribute and creates a 'single version of the truth'.

    Business logic needs to be applied flexibly across attributes to be effective: an external data source may be the authoritative source for a billing address, while an internal CRM may be the better source for a phone number. Regardless of the specific logic, data consumers should be able to weigh in on it, so that they are bought in. This is much easier to do when their input has also been considered throughout the mastering process and they feel ownership over the results.
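    Attribute-level survivorship logic of this kind can be sketched as a ranked source list per attribute. The source names, rankings, and records below are invented for illustration.

```python
# Which source wins, per attribute: a made-up, business-agreed ranking.
PREFERRED_SOURCE = {
    "billing_address": ["external_registry", "crm"],  # registry is authoritative
    "phone":           ["crm", "external_registry"],  # CRM is fresher for phones
}

def golden_record(records):
    """records maps source name -> {attribute: value} for one mastered entity."""
    golden = {}
    for attr, ranking in PREFERRED_SOURCE.items():
        for source in ranking:
            value = records.get(source, {}).get(attr)
            if value:                      # first non-empty value by rank wins
                golden[attr] = value
                break
    return golden

records = {
    "crm":               {"billing_address": "Old Rd 1", "phone": "+31 20 555 0100"},
    "external_registry": {"billing_address": "New Ave 9", "phone": ""},
}
golden_record(records)  # address from the registry, phone from the CRM
```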

    Data consumers understand that data is becoming increasingly complex, but can’t use that as an excuse for poor decision making. Chief Data Officers need to equip them with trusted, up-to-date data, which can only be achieved by bringing human insight together with machine learning.

    Author: Daniel Gutierrez

    Source: Insidebigdata

  • Why machine learning has a major impact on all industries

    Why machine learning has a major impact on all industries

    Machine learning is having a major impact on the global marketplace. It will have a profound effect on companies of all sizes over the next few years.

    Artificial Intelligence surrounds us everywhere. We cannot get through our day without encountering a solution involving AI. Machine learning is a field of Artificial Intelligence that uses algorithms to enable machines to learn certain things by themselves.

    Machine learning has a vast number of applications. We encounter machine learning systems when we go out shopping, use our bank account, or even take public transport.

    How much is machine learning changing things up? What is the demand for this new technology? One estimate pegged the global market for machine learning at $2.5 billion in 2017 and projected that it would reach $12.3 billion less than a decade later. These estimates have been raised even higher by a newer study by Deloitte. This suggests machine learning is in high demand and is making a huge splash in the global marketplace.

    Why is machine learning everywhere now?

    Machine learning can be beneficial for your company in many ways. Of course, these applications depend on the needs of your organization.

    It can be used in various ways. For example, if we have a problem with managing our customer service, we should consider implementing a machine learning application in this part of our company. In 2013, a company named DigitalGenius was founded to use machine learning to solve a number of customer service issues.

    But AI can do so much more!

    With an AI application (based on machine learning algorithms) that is built specially to fit our needs, we can automate any repetitive tasks like doing our company’s monthly paperwork. Our employees could then focus on more creative tasks that cannot be accomplished by an algorithm. Deloitte points out that machine learning is invaluable for boosting efficiency in many organizations. This is one of the reasons they estimate 2021 spending on machine learning will exceed $57 billion.

    One of the most sought-after features of machine learning is the capability to predict certain things. AI can analyze the market or the data we provide to make predictions that turn out to be mostly accurate. Thanks to that feature, AI can, among other things, target products to customers based on their shopping habits and online actions.

    In which company can machine learning be most beneficial?

    Nowadays, AI is implemented in nearly every field of business. The most inspiring examples are in the medical industry, where Artificial Intelligence can improve the performance of various tests, which in turn can help save more lives. A quicker diagnosis means a quicker recovery.

    Because of that, we cannot single out one field of business that benefits the most from implementing machine learning, nor does it depend on the size of the company. Of course, small enterprises have less money to invest, but with AI, the more time and effort we put into building and implementing the application, the more time and money we are likely to get back in the future. It is a long-term investment, but without a doubt a smart one.

    What type of AI is best suited for us?

    It is really important, though, to choose wisely among different types of AI; the choice should suit our company’s needs. Machine learning works best when we have big data to manage. If the available information is small in amount, a machine learning solution may not be the best option for us. To function properly, machine learning should be provided with a vast amount of 'good' data that can be systematized into patterns. So the best thing to do is to hire a data scientist who can first manage our big data and then present it to our algorithm.

    If we are still not certain whether we should invest in an AI solution, the best thing we can do is contact a professional machine learning expert, for example by reaching out to a company providing these types of services.

    Machine learning is driving countless changes in every industry

    Machine learning is having a major impact on the global marketplace. Companies in every industry are using machine learning technology to increase efficiency and boost output. It will have a profound effect on companies of all sizes over the next few years.

    Author: Ryan Kh

    Source: SmartDataCollective

  • Why the right data input is key: A Machine Learning example

    Why the right data input is key: A Machine Learning example

    Finding the ‘sweet spot’ of data needs and consumption is critical to a business. Without enough data, the business model underperforms; with too much, you run the risk of compromised security and protection. Measuring what data intake is needed, like a balanced diet, is key to optimum performance and output. A healthy diet of data will set a company on the road to maximum results without drifting into the red zones on either side.

    Machine learning is not black magic. A simple definition is the application of learning algorithms to data to uncover useful aspects of the input. There are clearly two parts to this process, though: the algorithms themselves and the data being processed and fed in.

    The algorithms are vital, and continually tuning and improving them makes a significant difference to the success of the solutions. But they are only mathematical operations on the data; the pivotal part is the data itself. Quite simply, the algorithms cannot work well on too little data: a deficit leaves the system undernourished and hungering for more. With more data to consume, the system can be trained more fully and the outcomes are stronger.

    Without question, there is a big need for an ample amount of data to offer the system a healthy helping to configure the best outcomes. What is crucial, though, is that the data collected is representative of the tasks you intend to perform.

    Within speech recognition, for example, this means that you might be interested in any or all of the following attributes:


    • formal speech/informal speech
    • prepared speech/unprepared speech
    • trained speakers/untrained speakers
    • presenter/conversational
    • general speech/specific speech
    • accents/dialects
    • noisy/quiet
    • professional recording/amateur recording
    • broadcast/telephony
    • controlled/uncontrolled
    In reality, all of these attributes impact how accurately speech recognition can perform the tasks required of it. The data needed to tick each box is therefore different, and involves varying degrees of difficulty to obtain. Bear in mind that it is not just the audio that is needed; accurate transcripts are required to perform training. That probably means most data will need to be listened to by humans to transcribe or validate it, and that can create a security issue.
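    Because the training set should mirror the target task, a common first step is to tag every sample with these attributes and then filter to a representative subset. A minimal sketch (the record fields and attribute names here are hypothetical, not from any particular corpus format):

```python
# Hypothetical speech-corpus records tagged with task-relevant attributes.
corpus = [
    {"id": "a1", "style": "formal", "recording": "professional", "noisy": False},
    {"id": "a2", "style": "informal", "recording": "amateur", "noisy": True},
    {"id": "a3", "style": "informal", "recording": "professional", "noisy": True},
]

def select(corpus, **required):
    """Keep only the samples whose tags match the task we intend to train for."""
    return [s for s in corpus if all(s.get(k) == v for k, v in required.items())]

# Build a training set representative of noisy, informal speech
# (e.g. conversational telephony rather than broadcast news).
training_set = select(corpus, style="informal", noisy=True)
print([s["id"] for s in training_set])  # → ['a2', 'a3']
```

    The same filter, run with different attribute combinations, yields the per-scenario subsets discussed above.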

    An automatic speech recognition (ASR) system operates in two modes: training and operating.


    Training is most likely managed by the AI/ML company providing the service, which means the company needs access to large amounts of relevant data. In some cases this is readily available in the public domain anyway: for example, content that has already been broadcast on television or radio, and therefore has no associated privacy issues. But this sort of content cannot help with many of the other scenarios in which ASR technology can be used, such as phone call transcription, which has very different acoustic characteristics. Obtaining this sort of data can be tied up with contracts for data ownership, privacy, and usage restrictions.


    In operational use, there is no need to collect audio: you simply use the models that have already been trained. The obvious temptation, however, is to capture the operational data and use it. And, as mentioned, this is where the challenge begins: ownership of the data. Many cloud solution providers want to use the data openly, as it enables continuous improvement for the required use cases. Data ownership becomes the linchpin.

    The challenge is to be able to build great models that work really well in any scenario without capturing privately-owned data. A balance between quality and security must be struck. This trade-off happens in many computer systems but somehow data involving people’s voices often, understandably, generates a great deal of concern.

    Finding a solution

    To ultimately satiate an ASR system, just enough data needs to be provided to execute the training so that good systems can be built. Companies also have the option to train their own models, which enables them to maintain ownership of the data. This often requires a complex professional services agreement and a good investment of time, but it can deliver a solution at a reasonable cost fairly quickly.

    ML algorithms are in a constant state of evolution, and techniques now exist that allow smaller data sets to bias systems already trained on big data. In some cases, smaller amounts of data can achieve ‘good enough’ accuracy. The overall issue of data acquisition is not removed, but sometimes less data can still provide a solution.
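    One simple way to picture biasing a big-data model with a small in-domain sample is interpolation: mix the general model's estimates with estimates from the small set. The sketch below does this for unigram word probabilities; the counts, vocabulary, and the interpolation weight are all hypothetical, chosen only to illustrate the idea:

```python
from collections import Counter

# Hypothetical unigram counts from a large general corpus ("big data").
general = Counter({"the": 900, "meeting": 40, "cloud": 10, "kubernetes": 1})

# A small in-domain sample used to bias the model toward our use case.
in_domain = Counter({"kubernetes": 30, "cloud": 25, "meeting": 5})

def biased_prob(word, lam=0.7):
    """Interpolate general and in-domain unigram probabilities.

    lam=1.0 recovers the pure big-data model; lower values pull the
    estimates toward the small domain-specific sample.
    """
    p_gen = general[word] / sum(general.values())
    p_dom = in_domain[word] / sum(in_domain.values())
    return lam * p_gen + (1 - lam) * p_dom

# "kubernetes" becomes far more likely once the small domain sample is mixed in.
print(biased_prob("kubernetes") > general["kubernetes"] / sum(general.values()))
```

    Production systems use far richer adaptation techniques (fine-tuning, language-model interpolation over n-grams), but the trade-off is the same: a small, well-chosen data set shifts a model trained on big data toward the target domain.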

    Finding a balanced data diet, through better algorithm tuning and through the filtering and selection of data, can get the best results without collecting everything that has ever been said. More effort may be needed to achieve the best equilibrium. And, without doubt, the industry must keep searching for ways to make the technology work better without compromising people’s privacy.

    Author: Ian Firth

    Source: Insidebigdata

  • Wie domineert straks: de mens of de machine?

    Developments in information technology are moving fast, and perhaps ever faster. We hear and see more and more about business intelligence, self-service BI, artificial intelligence, and machine learning. We see it in employees who increasingly have management information at their fingertips through tools, in self-driving cars, in robots that assist dementia patients, and in computers that beat humans at games.

    What does this mean?

    • Companies' revenue models will change
    • Innovations may no longer come primarily from humans
    • Much of what is still human labor today will be taken over by machines.

    This article highlights a few of these developments to show how important business intelligence is today.

    A revenue model based on data

    We read daily that information technology is turning existing revenue models upside down; we need only look at V&D. The number of companies whose business model has external data collection and analysis as a crucial part of the revenue model is growing rapidly, even in sectors until now strongly dominated by government, such as education and healthcare. Well-known companies such as Google and Facebook actually started without a concrete revenue model, but could no longer do anything without that data and its analysis.


    Take, for example, a company like Amazon, which runs entirely on data. The data it collects largely concerns who we are, how we behave, and what our preferences are. Amazon gives this data ever more meaning by applying the latest technologies. One example is how Amazon even develops films and books based on our purchasing, viewing, and reading behavior, and it will certainly not stop there. According to Gartner, Amazon is one of the most leading and visionary players in the market for Infrastructure as a Service (IaaS). Gartner also praises Amazon for how quickly it anticipates the market's technological needs.


    According to the United Nations, the newest innovations will arise from artificial intelligence. This assumes that the machine will overtake humans when it comes to devising innovations. IBM's Watson computer, for example, has already beaten humans on the quiz show Jeopardy. For difficult mathematical calculations we can no longer do without computers, but that does not mean the computer surpasses humans in everything. The development of self-driving cars recently showed that, with machine learning, humans can still take the lead, and on balance far less development time was needed.

    Human or machine?

    The fact is that machines will take over more and more human tasks and will sometimes even surpass humans in thinking power. In the coming years, humans and machines will increasingly live side by side, and computers will understand and master human behavior ever better. As a result, existing business models will change and many jobs in existing sectors will be lost. But whether the computer will overtake humans, and whether future innovation will even come from artificial intelligence alone, remains to be seen. Granted, the industrial revolution had a very large impact on humanity, and looking back it brought many benefits, even though it will not always have been easy for many people at the time. Let us see how we can turn these developments to our advantage.

    Ruud Koopmans, RK-Intelligentie.nl, 29 February 2016


  • Zooming In On The Data Science Pipeline

    Zooming In On The Data Science Pipeline

    Finding the right data science tools is paramount if your team is to discover business insight. Here are five things to look for when you search for your next data science platform.

    If you are old enough to have grown up with the Looney Tunes cartoons, you probably remember watching clips of Wile E. Coyote chasing the Road Runner hoping to one day catch him. In each episode, the coyote would use increasingly outrageous tools to try to outwit his nemesis, only to fail disastrously each time. Without the right tools, he was forever doomed to failure.

    As a data scientist, do you constantly feel like you are bringing the wrong tool to the job as you strive to find and capture one of the most valuable, yet elusive, targets around -- business insight?

    As data science tools and platforms mature, organizations are constantly looking to find what their analysts need to be most effective in their jobs. The right tool could mean the difference between success and failure when put in the hands of capable data scientists.

    As you are trying to find the right data science tools for your team, here are five areas to consider in your evaluation.


    Algorithm Support

    The first thing to evaluate in a potential data science platform is which algorithms it supports. In your assessment of algorithms, you must understand what your business problems are and which algorithms your data science organization will actually use.

    There are many algorithms available. Some are generic in nature and can be used in a broad set of scenarios. Others are very specific to unique problem sets. In the hands of the right data scientist, both types of algorithms can be extremely advantageous and valuable. The challenge is that the more algorithms available, the harder it is for the team to select the correct one to meet the current business problem. In your evaluation, ensure that the algorithms known to your team are available and are not crowded out by algorithms they will not use.

    In addition to the algorithms that are already pre-packaged as part of the data science platform, one area to look at is the extensibility of the platform. Can new algorithms be added? Are there marketplaces of new algorithms available for the platform? Can the team evolve the algorithms to meet their needs? Such extensibility will provide your team access to new and valuable algorithms as they become available and can become a critical success factor for your data science team.

    Data Preprocessing

    One of the main tasks your team will be performing is preparing the data. This involves cleaning the data, transforming it, breaking the conglomerate data into its parts, and normalizing it. Different types of algorithms have limitations on what data they can consume and use. Your data science platform must be able to take available data and prepare it for input into your process.
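    The preparation steps named above (cleaning, transforming, normalizing) can be sketched with plain Python; the raw records and field names here are hypothetical, standing in for whatever a platform's preprocessing stage would handle:

```python
# Hypothetical raw records: numeric strings, missing values, unscaled features.
raw = [
    {"age": "34", "income": 52000},
    {"age": None, "income": 61000},
    {"age": "29", "income": 48000},
]

def clean(records):
    """Cleaning: drop rows with missing values and cast numeric strings."""
    return [
        {"age": int(r["age"]), "income": float(r["income"])}
        for r in records
        if r["age"] is not None
    ]

def normalize(records, key):
    """Normalizing: min-max scale one feature into [0, 1]."""
    values = [r[key] for r in records]
    lo, hi = min(values), max(values)
    for r in records:
        r[key] = (r[key] - lo) / (hi - lo)
    return records

prepared = normalize(clean(raw), "income")
print([round(r["income"], 2) for r in prepared])  # → [1.0, 0.0]
```

    A platform typically offers these operations as configurable building blocks rather than hand-written code, but the evaluation question is the same: can it take your available data and produce input your chosen algorithms accept?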

    If you have text data in your environment, text processing can be a vital component to your data science platform. This can be as simple as parsing the text into individual words or it can involve more complex data, such as the meaning of these words, the topics associated with the text, or the sentiment of the text. If this is important to your data science program, make sure your platform has the right support for your use cases.
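    At its simplest, the text processing described above is tokenization plus some derived signal. A minimal sketch using a toy sentiment lexicon (the word lists are hypothetical; real platforms ship trained sentiment models, not hand-picked sets):

```python
import re

# Hypothetical toy lexicon; real sentiment analysis uses trained models.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "poor", "hate"}

def tokenize(text):
    """Parse the text into individual lowercase words."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment(text):
    """Crude lexicon score: positive word count minus negative word count."""
    words = tokenize(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("The support was great but the docs are poor"))  # → 0
```

    Extracting topics or word meanings requires far more machinery, which is exactly why platform support for these use cases is worth evaluating up front.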

    Model Training and Testing

    Once you have the right data in the right format and you have chosen the right algorithm or set of algorithms, the next step is to use these to define a model. When evaluating data science tools, understand what this process of model training and testing looks like and how it functions.

    In your evaluation, understand if this process is accomplished through a graphical user interface or through coding. With the training process, understand what parameters are available to measure the progress on the model creation and how to define stopping points. As an automated iterative process, you will want your team to define when that process is completed and when the results are good enough to move to the next step.
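    A common way platforms define such a stopping point is early stopping on validation loss. The sketch below shows the idea in isolation; the loss values are simulated and the `patience` parameter is a typical knob, not any specific platform's setting:

```python
def train_with_early_stopping(val_losses, patience=2):
    """Stop when validation loss fails to improve for `patience` rounds.

    Returns the iteration at which training stopped and the best loss seen.
    """
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch, best  # stopping point reached
    return len(val_losses) - 1, best

# Simulated per-epoch validation losses: improving, then degrading.
stop_epoch, best_loss = train_with_early_stopping([0.9, 0.6, 0.5, 0.55, 0.56])
print(stop_epoch, best_loss)  # → 4 0.5
```

    Whether this logic is exposed through a GUI slider or a code parameter, your team needs to see and control it.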

    Look at the documentation output of the model development process. Does it give you enough traceability about what the resulting model is, how it works, and why it chose that model over other variations? These can be critical in selling your results to the business and are becoming a requirement from governments if the model has an impact on decisions where bias could be detrimental to people.


    Collaboration

    You might have a small team of data scientists or a large team with many different roles. Either way, it is important that your team members have an effective ecosystem in which they can collaborate. This can involve collaboration on the cleaning of data, on the development and testing of models, or on the deployment of those models into production.

    With the shortage of data science resources in the market, some companies are starting to look outside the walls of their organizations for citizen data scientists -- individuals outside of the organization who can collaborate with your teams to perform analysis of data and create models. As the extent of your team boundaries grows, your requirements for a platform that enables that collaboration increase as well. Ensure that the platform you select can be used across those boundaries.

    MLOps and Operationalization

    Data science in the laboratory is important, but for the results of that work to benefit your business in a sustainable and repeatable way, the data preprocessing and model deployment have to be operationalized. Creating models and deploying them to a production environment require different skills. Sometimes you will have resources who span both disciplines, but as your team grows and becomes more complex, these roles will often be filled by very different people.

    It is important that you assess the platform’s capabilities to facilitate collaboration among the data scientists, as well as between the data scientists and the MLOps engineers who are responsible for the deployment and ongoing sustainability of these models.

    Evaluate what mechanisms your platform provides to promote models from the development stage to the production stage, and what gates exist along the way to maintain system integrity.
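    At their core, such promotion gates are threshold checks on a candidate model's measured properties. A minimal sketch (the specific metrics and threshold values are hypothetical examples, not any platform's defaults):

```python
# Hypothetical quality gates a candidate model must pass before promotion.
GATES = {"min_accuracy": 0.90, "max_latency_ms": 150}

def can_promote(metrics):
    """Return True only if every gate holds for the candidate model."""
    return (
        metrics["accuracy"] >= GATES["min_accuracy"]
        and metrics["latency_ms"] <= GATES["max_latency_ms"]
    )

print(can_promote({"accuracy": 0.93, "latency_ms": 120}))  # → True
print(can_promote({"accuracy": 0.88, "latency_ms": 120}))  # → False
```

    Real platforms layer approvals, audit logs, and rollback on top of this, but the evaluation question is whether those gates are explicit, enforced, and visible to both the data science and MLOps sides.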

    Evaluate Your Platform

    As you meet with potential vendors, make sure you know what your team needs to be successful and then use those criteria to evaluate the fit of the tool to the situation at hand. Using these five key areas of evaluation will provide you the basis for an effective set of conversations with your vendor. If you have the right tools on hand for your data scientists, hopefully you won’t find yourself like Wile E. Coyote -- getting burned in the end -- but rather capturing that elusive target: business value.

    Author: Troy Hiltbrand

    Source: TDWI
