3 items tagged "data scientist"

  • 9 Tips to become a better data scientist

    9 Tips to become a better data scientist

    Over the years I have worked on many Data Science projects. I remember how easy it was to get lost and waste a lot of energy in the wrong direction. In time, I learned what works for me to be more effective. This list is my best attempt to sum it up:

    1. Build a working pipeline first

    While it’s tempting to start with the cool stuff, you want to make sure that you don’t sink too much time into small technical things like loading the data, feature extraction and so on. I like to start with a very basic pipeline, but one that works, i.e., I can run it end to end and get results. Later I expand every part while keeping the pipeline working.
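
    To make this concrete, here is a minimal sketch of what such a first end-to-end pipeline could look like in Python with scikit-learn. The dataset and model are placeholders chosen only so the example runs; they are not from the article.

```python
# Minimal end-to-end pipeline: load data, process features, train a simple
# model, evaluate. Every stage is deliberately basic so the whole thing runs
# before any single part gets optimised.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load data (placeholder dataset; swap in your own loader later).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Feature processing + model, kept trivially simple for now.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 3. Train and get a first end-to-end result to build on.
pipeline.fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, pipeline.predict(X_test)))
```

    Each numbered stage can later be swapped out or expanded on its own while the rest of the pipeline keeps working.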

    2. Start simple and complicate one thing at a time

    Once you have a working pipeline, start expanding and improving it. You have to take it step by step. It is very important to understand what caused what. If you introduce too many changes at once, it will be hard to tell how each change affected the whole model. Keep the updates as simple and clean as possible. Not only will it be easier to understand their effect, it will also be easier to refactor them once you come up with another idea.

    3. Question everything

    Now you have a lot on your hands, you have a working pipeline and you already did some changes that improved your results. It’s important to understand why. If you added a new feature and it helped the model to generalize better, why? If it didn't, why not? Maybe your model is slower than before, why’s that? Are you sure each of your features/modules does what you think it does? If not, what happened?

    These kinds of questions should pop into your head while you’re working. To end up with a really great result, you must understand everything that happens in your model.

    4. Experiment a lot and experiment fast

    After you questioned everything, you’re stuck with… well, a lot of questions. The best way to answer them is to experiment. If you’ve followed this far, you already have a working pipeline and nicely written code, so conducting an experiment shouldn’t waste much of your time. Ideally, you’ll be able to run more than one experiment at a time; this will help you answer your questions and improve your intuition about what works and what doesn’t.

    Things to experiment with: adding/removing features, changing hyperparameters, changing architectures, adding/removing data and so on.
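
    A lightweight sketch of what a batch of such experiments might look like, assuming a scikit-learn setup: each run changes exactly one thing and records one score, so every difference in the results can be traced back to a single change. The estimator and parameter values are illustrative choices, not recommendations from the article.

```python
# Sketch: run a small batch of experiments, varying one hyperparameter at a
# time and recording the score, so each result maps to exactly one change.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

results = {}
for n_estimators in [10, 50, 100]:  # one knob varied per experiment
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    results[f"n_estimators={n_estimators}"] = round(float(score), 4)

for name, score in results.items():
    print(name, score)
```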

    5. Prioritize and Focus

    At this point, you did a lot of work, you have a lot of questions, some answers, some other tasks and probably some new ideas to improve your model (or even working on something entirely different).

    But not all of these are equally important. You have to understand what the most beneficial direction is for you. Maybe you came up with a brilliant idea that slightly improved your model but also made it much more complicated and slow; should you continue in this direction? It depends on your goal. If your goal is to publish a state-of-the-art solution, maybe it is. But if your goal is to deploy a fast and decent model to production, then you can probably invest your time in something else. Remember your final goal when working, and try to understand what tasks/experiments will get you closer to it.

    6. Believe in your metrics

    As discussed, understanding what’s working and what is not is very important. But how do you know when something works? You evaluate your results against some validation/test data and get some metric! You have to believe that metric! There may be reasons not to believe in your metric. It could be the wrong one, for example. Your data may be unbalanced, so accuracy can be the wrong metric for you. Maybe your final solution must be very precise, so you’re more interested in precision than recall. Your metric must reflect the goal you’re trying to achieve. Another reason not to believe in your metric is when your test data is dirty or noisy. Maybe you got your data from somewhere on the web and don’t know exactly what’s in it?

    A reliable metric is important for advancing fast, but it’s just as important that the metric reflects your goals. In data science, it can be easy to convince ourselves that our model is good while, in reality, it does very little.
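
    As a toy illustration (with synthetic labels, not real data) of why accuracy can be the wrong metric on unbalanced data: a model that never predicts the rare class can still post an impressive accuracy while being useless.

```python
# Toy example: with ~95% negatives, a model that always predicts "negative"
# scores ~95% accuracy while recalling none of the positive class.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% positives
y_pred = np.zeros_like(y_true)                  # model predicts all negative

print("accuracy :", accuracy_score(y_true, y_pred))                    # ~0.95
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred))                      # 0.0
```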

    7. Work to publish/deploy

    Feedback is an essential part of any work, and data science is no exception. When you work knowing that your code will be reviewed by someone else, you’ll write much better code. When you work knowing that you’ll need to explain it to someone else, you’ll understand it much better. It doesn’t have to be a fancy journal, a conference or company production code. If you’re working on a personal project, make it open source, write a post about it, send it to your friends, show it to the world!

    Not all feedback will be positive, but you’ll be able to learn from it and improve over time.

    8. Read a lot and keep updated

    I’m probably not the first to suggest keeping up with recent advancements to stay effective, so instead of talking about it, I’ll just tell you how I do it: good old newsletters! I find them very useful, as each one is essentially someone who keeps up with the most recent literature, picks the best stuff and sends it to you!

    9. Be curious

    While reading about the newest and coolest, don’t limit yourself to the one area you’re interested in; try to explore other (related) areas as well. This can be beneficial in a few ways: you may find that a technique from another domain is very useful in yours, you’ll improve your ability to understand complex ideas, and you may discover another domain that interests you, which will let you expand your data skills and knowledge.

    Conclusion

    You’ll get much better results and enjoy the process more if you’re effective. While all of the topics above are important, if I had to choose one, it would be 'Prioritize and Focus'. For me, all the other topics eventually lead to this one. The key to success is to work on the right thing.

    Author: Dima Shulga

    Source: Towards Data Science

  • Moving Towards Data Science: Hiring Your First Data Scientist

    Moving Towards Data Science: Hiring Your First Data Scientist

    In October 2020 I joined accuRx as the company’s first data scientist. At the time of joining, accuRx was a team of 60-odd employees who had done an incredible job relying on intuition and a stellar team of user researchers to create products that GPs needed and loved. This, combined with the increased need for good tech solutions in healthcare in 2020, resulted in our reach expanding (literally) exponentially. Suddenly, we were in almost every GP practice in the UK.

    We found ourselves in an interesting position: we now had several products that were being used very widely by GPs each day, and another set of nascent product ideas that we were only just bringing to life. We knew that at this point we’d need to start relying more on insight from quantitative data to test out our hypotheses and move our product suite in the right direction.

    At this point, we didn’t need advanced ML solutions or the latest big data processing tools. What we really needed was the ability to verify our assumptions at scale, to understand the needs of a very large and diverse group of users and to foster a culture of decision-making in which relying on quantitative data was second nature. This was why I was brought in, and it’s not been without its challenges. Here are a few things I’ve learnt so far: 

    1. New roles create new conversations

    Adding new members to teams presents a series of inevitable challenges: team dynamics change, the initial cost of onboarding is high and there’s now one more voice in the room when making decisions. The effect of this is substantially amplified when you’re adding not just a new member but a new role to a team.

    Before I joined, data science had not been a core part of the product development process. Suddenly, the team were introduced to a host of new concerns, processes and technical requests that they’d not needed to consider before, and addressing these often required a sizeable shift in the entire team’s ways of working.

    A few examples of this are:

    • Software engineers had to spend more time adding analytics to feature releases and making sure that the pipelines producing those analytics were reliable.
    • Sometimes, AB test results take a while to trickle in. Given that those results (hopefully) inform the direction a product will move in next, product managers, designers and engineers often found themselves facing a fair degree of ambiguity over how best — and how quickly — to iterate on features and ideas.
    • Having an additional set of information to consider often meant that it took us longer to reach a decision about which direction to move a product in. We now had to reconcile our intuitions with what the data was telling us — and also make a call as to how reliable we thought both of those were!

    It’ll take a bit of trial and error, but it’s important to find a way of working that gives product managers, designers and engineers the freedom to ship and iterate quickly without sacrificing your commitment to analytical rigour. In our case, this looked like figuring out which product changes were worth testing, what level of detail was worth tracking and what kinds of analyses were most useful at different stages of the product development process.

    2. Effective communication is more than half the battle

    It doesn’t matter how useful you think your analyses are — if people don’t know about or understand them, they’re not likely to have much long-term impact. In addition, the way in which you communicate your findings will determine how much impact your analysis ultimately has.

    Communicate widely and frequently.

    Importantly, it’s not enough to relay your findings to team leads only — the whole team has invested a lot of time and effort adjusting to new ways of working that support analytics, and they expect to be able to see what impact those adjustments have had. Communicating how those changes have positively impacted decision making will go a long way to creating the kind of positive feedback loop needed to motivate your team to keep relying on the processes and techniques that you’ve introduced.

    Once you’ve got your team on board, the really tough part is in ensuring that the initial excitement around using data to make decisions persists. A mistake I’ve made (more than once!) is assuming that communication around analytics is a ticket that you can mark as done. If you’re looking to drive a culture change, you’ll need to continually remind people why they should care about the thing as much as you do. As people hear more and more about the positive inroads teams have made off the back of insight from data, relying on data to back up product decisions should start to become expected and more automatic.

    Always present data with insight.

    Wherever possible, try to communicate your findings in terms of how this will affect decision-making and what people should do as a result. The less abstract you can make the results of an analysis, the better. One simple way to make your results less abstract is to clearly quantify how much impact you think the change will have.

    For example, if you’ve run an AB test to determine if a new feature increases your conversion rate, instead of saying ‘The change was statistically significant’, rather try ‘If we rolled out this new change to all our users, it’s likely that our conversion rate would increase from 5% to 7%, which translates to an additional 200 active users per week’.
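
    The arithmetic behind a statement like that is simple enough to sketch. In the example below, the 5% and 7% conversion rates come from the sentence above, while the weekly audience size is an assumed figure chosen so that the 200-users-per-week number works out.

```python
# Back-of-the-envelope translation of an A/B lift into a concrete impact
# statement. The 5% -> 7% rates are from the example above; weekly_visitors
# is an assumption, not a figure from the article.
baseline_rate = 0.05
treatment_rate = 0.07
weekly_visitors = 10_000  # assumed weekly audience exposed to the change

extra_users = (treatment_rate - baseline_rate) * weekly_visitors
print(f"Estimated additional active users per week: {extra_users:.0f}")  # 200
```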

    Similarly, when sharing data visualisations with a team, try to be explicit about what the graph is and isn’t showing. Remember that you’ve spent a lot of time thinking about this visualisation, but someone seeing it with fresh eyes likely doesn’t have as much context as you do. Simple ways to make visualisations clear are to make sure that the exact data you’ve used to define a metric is understood, and that you offer an interpretation of the trend or finding you’ve visualised alongside the graph. If you can, try to explain the implications of the trend you’ve visualised for your team’s goals so that they can take action off the back of the insight you’ve shared.

    Speed is good, but accuracy is better.

    There’s no surer way to ensure that your work has low impact than by making a habit of communicating incorrect or partially-correct results. If you’re the first or only data scientist in your team, you are the authority on what constitutes good or sufficient evidence and so, ironically, you have very little margin for error.

    You’ll often find yourself having to trade off getting results out to teams quickly against making sure that the analyses producing those results are robust, particularly if you’re working with new, suboptimal or unfamiliar tools. In most cases, I’ve found there’s usually a compromise you can reach — but this requires that you’re very clear about the limitations of the data you’ve used to reach a particular conclusion. When in doubt, caveat!

    People will quickly learn whether they can trust you, and once broken, trust is a tricky thing to get back. This is not to say that you won’t make mistakes — but it’s really important that when these happen they’re caught early, acknowledged widely and that robust processes are put in place to avoid similar mistakes in future.

    3. Good data infrastructure is a prerequisite for good data science

    When it comes to accurate and useful analyses, it goes without saying that they’re enabled by accessible and reliable data. No matter how good your infrastructure, it’s reasonable to expect to have to spend a significant chunk of your time cleaning data before running your analyses. As such, if your data infrastructure is not optimised for analytics, the additional time spent cleaning and wrangling data into a usable format will quickly become a major barrier. Up until this point, we hadn’t prioritised securing best-in-class analytics tools — getting this right is hard work, and it’s something we’re still working towards.

    Death by a thousand cuts…

    The effect of this is twofold. First, it adds enough friction to your workflow that you are likely to forgo using information that could be valuable, because you’re having to weigh the usefulness of the information against the cost of getting it. When an organisation moves fairly quickly, the time and effort this requires is often prohibitive.

    Secondly, the probability of making mistakes compounds each time you shift and transform data across different platforms. Each relocation or adjustment of your data carries some chance of making a mistake — naturally, the more of this you do, the higher the likelihood that your data is less reliable by the time you actually run your analysis. Together, these two barriers strongly disincentivise people in analytics roles from solving problems creatively, and add enough friction that your approach to analysis might become a fair bit more rigid and instrumental — and where’s the fun in that?

    You become the bottleneck.

    Related to this is the issue of accessibility for the wider team. If data scientists are struggling to access data reliably, you can bet your bottom dollar that everyone else is probably worse off! The result of this is that queries for simple information are outsourced to you — and as people become aware that you are able and willing to wade through that particular quagmire, you, ironically, start to become the bottleneck to data-driven decision-making.

    At this point, your role starts to become a lot more reactive — you’ll spend a majority of your time attending to high effort, marginal value tasks and find that you’ve got a lot less time and headspace to devote to thinking about problems proactively.

    To avoid these pitfalls, you’ll need to make sure that you make the case for the tools you need early on, that you automate as much of your own workflow as possible, and that you provide enough value for people to see that they’d get a lot more from you if you were able to work more efficiently.

     
    Author: Tamsyn Naylor

    Source: Towards Data Science

  • Zooming In On The Data Science Pipeline

    Zooming In On The Data Science Pipeline

    Finding the right data science tools is paramount if your team is to discover business insight. Here are five things to look for when you search for your next data science platform.

    If you are old enough to have grown up with the Looney Tunes cartoons, you probably remember watching clips of Wile E. Coyote chasing the Road Runner hoping to one day catch him. In each episode, the coyote would use increasingly outrageous tools to try to outwit his nemesis, only to fail disastrously each time. Without the right tools, he was forever doomed to failure.

    As a data scientist, do you constantly feel like you are bringing the wrong tool to the job as you strive to find and capture one of the most valuable, yet elusive, targets around -- business insight?

    As data science tools and platforms mature, organizations are constantly looking to find what their analysts need to be most effective in their jobs. The right tool could mean the difference between success and failure when put in the hands of capable data scientists.

    As you are trying to find the right data science tools for your team, here are five areas to consider in your evaluation.

    Algorithms

    The first thing you need to evaluate when looking at a potential data science platform is which algorithms it supports. In your assessment of algorithms, you must understand what your business needs and what your data science organization will actually use.

    There are many algorithms available. Some are generic in nature and can be used in a broad set of scenarios. Others are very specific to unique problem sets. In the hands of the right data scientist, both types of algorithms can be extremely advantageous and valuable. The challenge is that the more algorithms available, the harder it is for the team to select the correct one to meet the current business problem. In your evaluation, ensure that the algorithms known to your team are available and are not crowded out by algorithms they will not use.

    In addition to the algorithms that are already pre-packaged as part of the data science platform, one area to look at is the extensibility of the platform. Can new algorithms be added? Are there marketplaces of new algorithms available for the platform? Can the team evolve the algorithms to meet their needs? Such extensibility will provide your team access to new and valuable algorithms as they become available and can become a critical success factor for your data science team.

    Data Preprocessing

    One of the main tasks your team will be performing is preparing the data. This involves cleaning the data, transforming it, breaking the conglomerate data into its parts, and normalizing it. Different types of algorithms have limitations on what data they can consume and use. Your data science platform must be able to take available data and prepare it for input into your process.
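
    As a rough sketch of the kind of preparation described here, assuming a pandas/scikit-learn stack; the column names and cleaning steps below are invented purely for illustration.

```python
# Sketch of basic data preparation: clean missing values, break a compound
# field into its parts, encode categoricals and normalise a numeric column.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing", None],  # hypothetical columns
    "age": [36, 41, None],
    "segment": ["a", "b", "a"],
})

df = df.dropna(subset=["full_name"])                      # cleaning
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)
df["age"] = df["age"].fillna(df["age"].median())          # impute missing values
df = pd.get_dummies(df, columns=["segment"])              # encode categoricals
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]])  # normalise

print(df)
```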

    If you have text data in your environment, text processing can be a vital component of your data science platform. This can be as simple as parsing the text into individual words, or it can involve more complex processing, such as extracting the meaning of those words, the topics associated with the text, or its sentiment. If this is important to your data science program, make sure your platform has the right support for your use cases.

    Model Training and Testing

    Once you have the right data in the right format and you have chosen the right algorithm or set of algorithms, the next step is to use these to define a model. When evaluating data science tools, understand what this process of model training and testing looks like and how it functions.

    In your evaluation, understand whether this process is accomplished through a graphical user interface or through coding. For the training process, understand what parameters are available to measure progress on model creation and how to define stopping points. Because training is an automated, iterative process, your team will want to define when it is complete and when the results are good enough to move to the next step.
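
    In a code-first platform, that training setup might look roughly like the sketch below, where a held-out validation fraction measures progress and an early-stopping rule defines the stopping point. The estimator, parameters and thresholds are illustrative assumptions, not recommendations from the article.

```python
# Sketch of a coded training run with an explicit stopping rule: training
# halts once the validation score stops improving for a set number of rounds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting iterations
    validation_fraction=0.2,   # data held out to measure training progress
    n_iter_no_change=10,       # stopping point: 10 rounds with no improvement
    random_state=0,
)
model.fit(X_train, y_train)

print("iterations actually used:", model.n_estimators_)
print("test accuracy:", model.score(X_test, y_test))
```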

    Look at the documentation output of the model development process. Does it give you enough traceability about what the resulting model is, how it works, and why it chose that model over other variations? These can be critical in selling your results to the business and are becoming a requirement from governments if the model has an impact on decisions where bias could be detrimental to people.

    Collaboration

    You might have a small team of data scientists or a large team with many different roles. Either way, it is important that your team members have an effective ecosystem where they can collaborate. This can involve collaboration on the cleaning of data, the development and testing of models, or on the deployment of these models into production.

    With the shortage of data science resources in the market, some companies are starting to look outside the walls of their organizations for citizen data scientists -- individuals outside of the organization who can collaborate with your teams to perform analysis of data and create models. As the extent of your team boundaries grows, your requirements for a platform that enables that collaboration increase as well. Ensure that the platform you select can be used across those boundaries.

    MLOps and Operationalization

    Data science in the laboratory is important, but for the results of that work to benefit your business in a sustainable and repeatable way, the data preprocessing and model deployment have to be operationalized. Creating models and deploying models to a production environment require different skills. Sometimes you will have people who span both disciplines, but as your team grows and becomes more complex, these roles will often be filled by very different people.

    It is important that you assess the platform’s capabilities to facilitate collaboration among the data scientists as well as between the data scientists and the MLOps engineers who have responsibility for the deployment and ongoing sustainability of these models.

    Evaluate what mechanisms are in place in your platform to enable models to be promoted from the development stage to the production stage, and what gates exist along the way to maintain system integrity.
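
    A deliberately simplified sketch of one such gate: a candidate model is promoted only if it beats the current production model on a held-out set by a minimum margin. All model choices and thresholds here are hypothetical.

```python
# Minimal promotion gate: a candidate model replaces the production model only
# if it clears the production score on a held-out evaluation set by a margin.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, random_state=0)

production = LogisticRegression(max_iter=1000).fit(X_train, y_train)
candidate = RandomForestClassifier(random_state=0).fit(X_train, y_train)

MARGIN = 0.01  # hypothetical minimum improvement required to promote
prod_score = production.score(X_holdout, y_holdout)
cand_score = candidate.score(X_holdout, y_holdout)

if cand_score >= prod_score + MARGIN:
    print(f"Promote candidate: {cand_score:.3f} vs {prod_score:.3f}")
else:
    print(f"Keep production model: {cand_score:.3f} vs {prod_score:.3f}")
```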

    Evaluate Your Platform

    As you meet with potential vendors, make sure you know what your team needs to be successful and then use those criteria to evaluate the fit of the tool to the situation at hand. Using these five key areas of evaluation will provide you the basis for an effective set of conversations with your vendor. If you have the right tools on hand for your data scientists, hopefully you won’t find yourself like Wile E. Coyote -- getting burned in the end -- but rather capturing that elusive target: business value.

    Author: Troy Hiltbrand

    Source: TDWI
