How Data Science is Changing the Entertainment Industry
Beyond how much and when, to what we think and how we feel
Like countless other industries, the entertainment industry is being transformed by data. There’s no doubt data has always played a role in guiding show-biz decision-making, for example, in the form of movie tracking surveys and Nielsen data. But with the ever-rising prominence of streaming and the seamless consumption measurement it enables, data has never been more central to understanding, predicting, and influencing TV and movie consumption.
With experience as both a data scientist in the entertainment space and a researcher of media preferences, I’ve had the good fortune of being in the trenches of industry analyzing TV/movie consumption data while also keeping up with media preferences research from institutions around the world. As the various citations to come make evident, the component concepts presented here aren’t anything new in themselves, but I wanted to apply my background to bring these ideas together into a structured roadmap for what I believe to be the next frontiers in enhancing our ability to understand, predict, and influence video content consumption around the world. Data can play a role at many earlier phases of the content lifecycle (e.g. in greenlighting processes or production), and what I am about to say can be relevant in various phases, but I write mainly from a more downstream perspective, nearer to and after release as content is consumed, as cultivated during my industry and academic work.
Beyond Viewing and Metadata
When you work in the entertainment space, you end up working a lot with title consumption data and metadata. To a large extent, this is unavoidable (all “metadata” and “viewing data” really mean is data on what’s being watched and how much), but it’s hard not to start sensing that models based on such data, as commonly seen in content similarity analyses, output results that fall into familiar patterns. For example, these days when I see “similar shows/movies” recommendations, a voice in my head goes, “That’s probably a metadata-based recommendation,” or, “Those look like viewership-based recommendations,” based on what I’ve seen during my work with such models. I can’t be 100% sure, of course, and the voice is more confident with smaller services that likely use more off-the-shelf approaches; on larger platforms, recommendations are often seamless enough that I’m not thinking about flaws, but who knows what secret sauce is going into them?
I’m not saying viewing data and metadata will ever stop being important, nor that models using such data fail to explain ample variance in consumption. What I am saying is that there is a limit to how far these elements alone will get us when it comes to analyzing and predicting viewership; we need new ways to enhance our understanding of viewers and their relationship with content. We want to understand and foresee title X’s popularity at time point A beyond, “It was popular at A-1, so it will be popular at A,” or, “Title Y, which is similar to X, was popular, so X will be popular,” especially since data at A-1 or on the similarity between X and Y often may not be available. Let’s talk about one type of data that I think will prove critical in enhancing our understanding of, and predictive capacity concerning, viewership moving forward.
Psychometrics: Who is Watching and Why
People love to talk demographics when it comes to media consumption. Indeed, anyone who’s taken a movie business class is likely to be familiar with the “four quadrant movie”, or a movie that can appeal to men and women over and under the ages of 25. But demographics are limited in their explanatory and predictive utility in that they generally go as far as telling us the who but not necessarily the why.
That’s where psychometrics (a.k.a. psychographics) can provide a boost. Individuals in the same demographic can easily have different tendencies, values, preferences; an example would be the ability to divide men or women into people who tend to be DIYers, early adopters, environmentalists, etc. based on their measured characteristics across various dimensions. Similarly, people of different demographics can easily have similar characteristics, such as being high in thrill-seeking, being open to new experiences, or identifying as left/right politically. Such psychometric variables have indeed been shown to influence media preference — for example, agreeable people like talk shows and soaps more, higher sensation seeking individuals like violent content more — and improve the capacity of recommendation models. My own research has shown that even abbreviated psychometric measures can produce an improvement in model fit to genre preference data compared to demographic data alone. Consumer data companies have already begun to recognize the importance of psychometric data, with many of them incorporating them in some form into their services.
Psychometric data can be useful at the individual level at which they are often collected, or aggregated to provide group-level (audience, userbase, country, and so on) psychometric features of various kinds. Some such data might come ‘pre-aggregated’ at the source, as is the case with, for example, Hofstede’s cultural dimensions. In terms of collection, when direct collection for all viewers in an audience isn’t feasible (e.g. when you can’t survey millions of users), a “seed” set of self-report survey data from responding viewers could be used to impute values for similar non-respondents using nearest neighbor methods. Psychometric data can also be beneficial in cold-start scenarios: if you don’t have direct data about what a particular audience watches or how much they would watch particular titles, wouldn’t data about their characteristics that point to the types of content they are likely to want be useful?
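To make the “seed” imputation idea concrete, here is a minimal sketch under stated assumptions. It assumes each viewer can be described by a couple of behavioral features (here, hypothetical genre viewing shares) and that a small set of respondents has a self-reported score on some trait. The function name, the features, and all the numbers are made up for illustration; a real pipeline would likely use more features, scaling, and a tuned value of k.

```python
import math

def impute_traits(seed, targets, k=2):
    """Impute a psychometric score for each non-respondent as the mean score
    of their k nearest survey respondents in behavioral-feature space."""
    imputed = {}
    for user_id, features in targets.items():
        # Rank respondents by Euclidean distance to this non-respondent.
        nearest = sorted(seed, key=lambda r: math.dist(features, r["features"]))[:k]
        imputed[user_id] = sum(r["score"] for r in nearest) / len(nearest)
    return imputed

# Hypothetical seed respondents. Features are (share of viewing that is horror,
# share that is documentary); score is a self-reported sensation-seeking score.
seed = [
    {"features": (0.8, 0.1), "score": 4.5},
    {"features": (0.7, 0.2), "score": 4.0},
    {"features": (0.1, 0.9), "score": 2.0},
    {"features": (0.2, 0.7), "score": 2.5},
]

# Hypothetical non-respondents for whom only viewing behavior is known.
targets = {"u1": (0.75, 0.15), "u2": (0.15, 0.8)}

imputed = impute_traits(seed, targets, k=2)
print(imputed)
```

Averaging the neighbors’ scores is only one possible choice; distance-weighted averages or a model trained on the seed set would be reasonable alternatives.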
Consumption as Viewer-Content Trait Interaction
The above section discusses psychometrics in particular, but zooming out a bit, what it is more broadly pushing for is an expansion of the viewer/audience feature space beyond the demographic and behavioral. This is because all consumption is inherently an interaction between the traits of a viewer and the traits of a piece of content. This concept is simpler and more well-trodden than it may sound; all it really means is that some element of a viewer (viewer trait) makes them more (or less) drawn to some element of a piece of content (content trait). Even familiar stereotypes about genre preferences (children are more into animation, men are more into action, etc.) inherently concern viewer-content trait interactions (viewer age by content genre and viewer sex by content genre in the above examples), and the aforementioned research on the effects of viewer psychometrics on content preferences also falls under this paradigm.
The larger the array of viewer traits we have, the more things we can consider that might interact with some kind of content trait to impact viewers’ interest in consuming a title. Conversely, this also means that it is beneficial to have new forms of data title-side as well. It can seem like people more readily ‘get deep’ with title-side data, in the form of metadata (genre, cast, crew, studio, awards, average reviews, etc.), than they do with viewer-side data, but there’s still room for expansion title-side, especially if one is expanding viewer-side data as suggested above through collection of psychometrics and the like. Tags and tagging are a good place to start in this regard. Human tagging can be particularly beneficial in capturing latent information that is still difficult for machines to detect on their own (e.g. humor, irony, sarcasm), while automated processes can provide useful baseline content tags of a consistent nature. These days, however, tags are just the start when it comes to generating additional title-side data. It’s possible to engineer all sorts of features from the audio and video of titles, as well as to extract the emotional arc of a story from text.
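One simple way to picture a viewer-content trait interaction is as a product of a viewer trait and a content trait, scaled by a weight for that pair. The sketch below is illustrative only: the trait names, the weights, and the 0-to-1 scales are hypothetical, and in practice such weights would be estimated from data rather than set by hand.

```python
def interaction_score(viewer, title, weights):
    """Sum weight * viewer_trait * content_trait over the listed trait pairs."""
    return sum(
        w * viewer.get(v_trait, 0.0) * title.get(c_trait, 0.0)
        for (v_trait, c_trait), w in weights.items()
    )

# Hypothetical weights for (viewer trait, content trait) pairs.
weights = {
    ("sensation_seeking", "violence"): 0.6,
    ("agreeableness", "talk_show"): 0.4,
}

# A hypothetical viewer who is high in sensation seeking, low in agreeableness.
viewer = {"sensation_seeking": 0.9, "agreeableness": 0.2}

# Two hypothetical titles described by content traits.
action_title = {"violence": 0.8, "talk_show": 0.0}
talk_title = {"violence": 0.0, "talk_show": 1.0}

action_score = interaction_score(viewer, action_title, weights)
talk_score = interaction_score(viewer, talk_title, weights)
print(action_score, talk_score)
```

Under these made-up numbers, the violent title scores higher for this viewer than the talk show. Adding a new viewer trait or content trait just means adding new pairs to the weight table.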
Once you consider consumption from the viewer-content interaction lens and expand data collection on both the viewer and title sides, the possibilities really open up. You could, for example, code the race/ethnicity and gender of characters in a title and see how demographic similarity between the title cast/crew and the typical users of a streaming platform can impact the title’s success. Or maybe you want to code titles for their message sensation value to see how that’s associated with the title’s appeal to a particular high sensation-seeking group. Or perhaps you want to use data from OpenSubtitles or the like to determine the narrative arc type of all the titles in your system and see if any patterns arise as to the appeal of certain arcs to individuals of certain psychographics.
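As a rough illustration of the narrative arc idea, the sketch below scores each subtitle line against a tiny sentiment lexicon and smooths the result with a moving average. Everything here is an assumption made for illustration: the lexicon, the sample lines, and the window size are made up, and published emotional arc work uses far larger lexicons and longer windows.

```python
def emotional_arc(lines, lexicon, window=3):
    """Return a moving average of per-line sentiment scores."""
    scores = []
    for line in lines:
        words = line.lower().split()
        # Strip trailing punctuation so "dead," still matches "dead".
        scores.append(sum(lexicon.get(w.strip(".,!?"), 0) for w in words))
    arc = []
    for i in range(len(scores) - window + 1):
        arc.append(sum(scores[i:i + window]) / window)
    return arc

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
lexicon = {"happy": 1, "love": 1, "win": 1, "lost": -1, "dead": -1, "afraid": -1}

# Hypothetical subtitle lines in story order.
lines = [
    "We are so happy here.",
    "I love this town.",
    "Everything is lost.",
    "He is dead, and I am afraid.",
    "We can still win!",
    "I love you.",
]

arc = emotional_arc(lines, lexicon, window=3)
print(arc)
```

The resulting sequence starts positive, dips, and recovers, which is the kind of shape that could then be grouped into arc types and compared against audience psychographics.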
Parsing the Pipeline: Perception, Interest, Response
Lastly, there needs to be a more granular consideration of the consumption pipeline, from interest to response. Though easily lumped together as “good” signals of a consumer’s feelings about a title, being interested in, watching, and liking a piece of content are entirely different things. Here’s how the full viewing process should be parsed out when possible, separated broadly into pre-consumption and post-consumption phases.
Perception (Pre-consumption): Individuals of different demographics, and presumably of different psychographics, can perceive the same media product differently. These perceptions can be shaped by the elements of a product’s brand design, font, colors, and advertisements. Perception arguably has important effects on the next phase in the pipeline.
Interest and Selection (Pre-consumption): First off, it is important to note that interest (a.k.a. preference) is not the same as selection (a.k.a. choice), though the two are related and the former certainly increases the likelihood of the latter. Though analyses regarding one may often be relevant to the other, we cannot assume that an individual who expresses interest in something, or has a high likelihood of being interested in something, will always choose to consume it. This is well exemplified by models like the Reasoned Action Model, within which framework an individual who feels favorably about watching a movie may not watch it due to perceived unfavorable norms about watching said movie. Examining factors driving interest-selection conversion may be beneficial.
Response (Post-consumption): Lastly, there is how individuals feel after watching a piece of content. This could be as simple as whether they liked it or not. Though it can be tempting to equate high viewership with “wow, people really like that movie” when looking at a dataset, it’s critical to remember that how much people watch something and whether they like it are related but ultimately separate things, as anyone who was stoked for a movie then crushed by its mediocrity can attest. My own research has shown that the effects at play with interest in unseen content can differ from, and even be the opposite of, the effects at play with liking of seen content. Beyond liking, responses can also include elements such as how viewers felt about the content emotionally, how much they related to the characters, to what degree they were immersed in the storyline, and more.
Media preference and consumption need not be considered a singular, stationary process. Separated out this way, it is instead a fluid, modular process in which strategic management of upstream processes can impact the likelihood of desired outcomes, whatever they may be, down the line. How can we selectively optimize perception of a media product across different demographic and psychographic groups to get maximum interest in a title, or perhaps to optimize the desired downstream outcome? How can we optimally convert interest into selection? Can certain upstream perceptions or overly high levels of interest interact adversely with the content of a certain title such that the ultimate response to the title is more negative than it would have been had perceptions been different or interest less extreme? In addition, though I provide potential key mechanisms of relevance to each step of the pipeline, certain mechanisms may be of relevance at multiple phases or across different phases of the pipeline. For example, (potential) viewer-character similarity may impact perception of and interest in a title after exposure to advertising, while social network effects may mean the post-consumption responses of certain individuals heavily influence pre-consumption interest among other individuals.
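Keeping the phases separate in the data makes questions like interest-selection conversion easy to ask. The sketch below assumes a hypothetical record per viewer and title, with a group label, a flag for expressed interest, and a flag for whether the title was actually watched. The field names and the records are made up for illustration.

```python
from collections import defaultdict

def conversion_by_group(records):
    """Share of interested viewers who went on to watch, per group."""
    interested = defaultdict(int)
    converted = defaultdict(int)
    for r in records:
        if r["interested"]:
            interested[r["group"]] += 1
            if r["watched"]:
                converted[r["group"]] += 1
    # Groups with no interested viewers are omitted (no rate to report).
    return {g: converted[g] / n for g, n in interested.items()}

# Hypothetical viewer-title records.
records = [
    {"group": "A", "interested": True, "watched": True},
    {"group": "A", "interested": True, "watched": True},
    {"group": "A", "interested": True, "watched": True},
    {"group": "A", "interested": True, "watched": False},
    {"group": "B", "interested": True, "watched": True},
    {"group": "B", "interested": True, "watched": False},
    # Watched without prior expressed interest, so excluded from conversion.
    {"group": "B", "interested": False, "watched": True},
]

rates = conversion_by_group(records)
print(rates)
```

The same structure extends naturally to a post-consumption flag such as liking, so that viewership and liking can be tracked as separate outcomes rather than folded into one signal.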
Conclusion
As an industry, we’ve only begun to scratch the surface of how data can help us understand, predict, and influence content consumption, and these are just some of my thoughts on what I believe will be important considerations as data science becomes ever more prevalent and critical in the entertainment space. Audience psychometrics will help enhance understanding of audiences beyond what demographics can do alone; considering interactions between new audience and content features will provide superior strategic insights and predictive capacity; and a nuanced consideration of the full consumption pipeline from interest to response will help optimize desired outcomes.
Author: Danny Kim
Source: Towards Data Science