2 items tagged "data integrity"

  • Staying on the right track in the era of big data


    Volume dominates the multidimensional big data world. The challenge many organizations face today is harnessing the potential of their data and applying the usual methods and technologies at scale. After all, data is currently being produced at a rate of 2.5 quintillion bytes per day, and that volume is only growing. Unfortunately, a large portion of this data is unstructured, making it even harder to categorize.

    Compounding the problem, most businesses expect that decisions made based on data will be more effective and successful in the long run. However, with big data often comes big noise. After all, the more information you have, the more chance that some of that information might be incorrect, duplicated, outdated, or otherwise flawed. This is a challenge that most data analysts are prepared for, but one that IT teams need to consider and factor into their downstream processing and decision making to ensure that any bad data does not skew the resulting insights.

    This is why overarching big data analytics solutions alone are not enough to ensure data integrity in the era of big data. While new technologies like AI and machine learning can help make sense of the data en masse, they often rely on a certain amount of cleaning and condensing behind the scenes to be effective and to run at scale. Accounting for some errors in the data is fine, but unchecked mistakes can derail effective analysis and delay the time to value, particularly when a configuration error or a problem with a single data source creates a stream of bad data. Being able to find and eliminate those mistakes where possible is therefore a valuable capability. Without the right tools, these kinds of errors can produce unexpected results and leave data professionals with an unwieldy mass of data to sort through to find the culprit, as illustrated by the sketch below.
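    As an illustration, a lightweight validation step at ingestion time can catch a misconfigured source before its records pollute downstream analysis. The following is a minimal sketch, not a reference to any particular tool; the field names, the `source` key, and the alert threshold are assumptions made for the example.

    ```python
    # Hypothetical ingestion-time validation sketch: quarantine malformed records
    # and flag any source whose rejection rate suggests a configuration error,
    # rather than letting a stream of bad data flow downstream and skew analysis.
    from dataclasses import dataclass

    @dataclass
    class SourceStats:
        seen: int = 0
        rejected: int = 0

    def is_valid(record: dict) -> bool:
        """Assumed minimal schema check: required keys present and non-empty."""
        required = ("id", "timestamp", "value")   # assumed field names
        return all(record.get(k) not in (None, "") for k in required)

    def ingest(records, stats_by_source, reject_rate_alert=0.05):
        """Split a batch into clean and quarantined records, tracking per-source stats."""
        clean, quarantined = [], []
        for rec in records:
            source = rec.get("source", "unknown")
            stats = stats_by_source.setdefault(source, SourceStats())
            stats.seen += 1
            if is_valid(rec):
                clean.append(rec)
            else:
                stats.rejected += 1
                quarantined.append(rec)
        # After the batch, flag sources whose rejection rate looks like a
        # configuration problem rather than ordinary noise.
        for source, stats in stats_by_source.items():
            if stats.seen and stats.rejected / stats.seen > reject_rate_alert:
                print(f"ALERT: source '{source}' rejected "
                      f"{stats.rejected}/{stats.seen} records this run")
        return clean, quarantined
    ```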

    This problem is compounded when data is ingested from multiple different sources and systems, each of which may have treated the data in a different way. The sheer complexity of big data architecture can turn the challenge from finding a single needle in a haystack to one more akin to finding a single needle in a whole barn.

    Meanwhile, this problem no longer affects only the IT function and business decision making; overcoming it is becoming a legal requirement. Legislation like the European Union’s General Data Protection Regulation (GDPR) mandates that businesses find a way to manage and track all of their personal data, no matter how complicated the infrastructure or how unstructured the information. In addition, upon receiving a valid request, organizations need to be able to delete information pertaining to an individual, or to collect and share it as part of an individual’s right to data portability.

    So, what’s the solution? One of the best approaches to managing the beast of big data is also one that builds in data integrity: automating data ingestion so that a full data lineage is captured. This creates a clear record of where data originated and how it has been used over time, and because the process happens automatically, it is both easier and more reliable than manual tracking. However, it is important that lineage is captured at a fine-grained level, as sketched below.
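    One way to picture fine-grained lineage capture is to attach a provenance entry to every record as it is ingested. The sketch below is a simplified illustration under assumed structures (the lineage entry schema, the append-only log, and the content hash are all assumptions for the example), not a description of any specific lineage product.

    ```python
    # Minimal sketch of fine-grained lineage capture at ingestion time: every
    # record gets a provenance entry recording where it came from, when it
    # arrived, and which transformation produced it.
    import hashlib
    import json
    import time
    import uuid

    def lineage_entry(source_system: str, transformation: str, payload: dict) -> dict:
        """Build a provenance record for one ingested payload (assumed schema)."""
        return {
            "record_id": str(uuid.uuid4()),
            "source_system": source_system,
            "transformation": transformation,
            "ingested_at": time.time(),
            # A content hash makes it possible to show later that the payload
            # was not altered after ingestion.
            "payload_hash": hashlib.sha256(
                json.dumps(payload, sort_keys=True).encode()
            ).hexdigest(),
        }

    def ingest_with_lineage(payload: dict, source_system: str,
                            transformation: str, lineage_log: list) -> dict:
        """Attach lineage to a record and append the entry to an append-only log."""
        entry = lineage_entry(source_system, transformation, payload)
        lineage_log.append(entry)
        return {**payload, "_lineage_id": entry["record_id"]}
    ```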

    With the right data lineage tools, ensuring data integrity in a big data environment becomes far easier. Fine-grained tracking means that data scientists can trace data back through the pipeline to explain what data was used, where it came from, and why. Meanwhile, businesses can locate the data of a single individual, sorting through all the noise to fulfill subject access requests without disrupting the big data pipeline as a whole or diverting significant business resources. As a result, analysis of big data can deliver more insight, and thus more value, faster, despite its multidimensional complexity.
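    Building on the hypothetical lineage log above, tracing an individual’s data or answering a subject access request reduces to a query over records and their provenance entries. Again, the `subject_id` field and the matching logic are assumptions for illustration only.

    ```python
    # Sketch: answering a subject access request (and a right-to-erasure request)
    # against the hypothetical record store and lineage log from the previous example.
    def records_for_subject(records: list, lineage_log: list, subject_id: str) -> list:
        """Return each matching record together with its provenance entry."""
        lineage_by_id = {e["record_id"]: e for e in lineage_log}
        results = []
        for rec in records:
            if rec.get("subject_id") == subject_id:   # assumed identifier field
                results.append({
                    "record": rec,
                    "provenance": lineage_by_id.get(rec.get("_lineage_id")),
                })
        return results

    def erase_subject(records: list, subject_id: str) -> list:
        """Right-to-erasure sketch: drop the subject's records from the store."""
        return [r for r in records if r.get("subject_id") != subject_id]
    ```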

    Author: Neil Barton

    Source: Dataversity

  • The persuasive power of data and the importance of data integrity


    Data is like statistics: a matter of interpretation. The process may look scientific, but that does not mean the result is credible or reliable.

    • How can we trust what a person says if we deny the legitimacy of what he believes?
    • How can we know a theory is right if its rationale is wrong?
    • How can we prove an assertion is sound if its basis is not only unsound but unjust?

    To ask questions like these is to remember that data is neutral; it is an abstraction whose application is more vulnerable to nefarious ends than noble deeds; that human nature is replete with examples of discrimination, tribalism, bias, and groupthink; that it is not unnatural for confirmation bias to prevail at the expense of logic; that all humanity is subject to instances of pride, envy, fear, and illogic.

    What we should fear is not data, but ourselves. We should fear the misuse of data to damn a person or ruin a group of people. We should fear our failure to heed Richard Feynman’s first principle about not fooling ourselves. We should fear, in short, the corruption of data; the contemptible abuse of data by all manner of people, who give pseudoscience the veneer of respectability.

    Nowhere is the possibility of abuse more destructive, nowhere is the potential for abuse more deadly, nowhere is the possible, deliberate misreading of data more probable than in our judicial system.

    I write these words from experience, as both a scientist by training and an expert witness by way of my testimony in civil trials.

    What I know is this: Data has the power to persuade.

    People who use data, namely lawyers, have the power to persuade; they have the power to enter data into the record, arguing that what is on the record, that what a stenographer records in a transcript, that what jurors read from the record is dispositive.

    According to Wayne R. Cohen, a professor at The George Washington University School of Law and a Washington, DC injury claims attorney, data depends on context.

    Which is to say data is the product of the way people gather, interpret, and apply it.

    Unless a witness volunteers information, or divulges it during cross-examination, a jury may not know what that witness’s data excludes: exculpatory evidence and acts of omission that reveal the accused is not guilty, that the case against the accused lacks sufficient proof, that the case sows doubt instead of stamping it out.

    That scenario should compel us to be more scrupulous about data.

    That scenario should compel us to check (and double-check) data, not because we should refuse to accept data, but because we must not accept what we refuse to check.

    That scenario summons us to learn more about data, so we may not have to risk everything, so we may not have to jeopardize our judgment, by speculating about what may be in lieu of what is.

    That scenario is why we must be vigilant about the integrity of data, making it unimpeachable and unassailable.

    May that scenario influence our actions.

    Author: Michael Shaw

    Source: Dataversity
