Determining the feature set complexity
Thoughtful predictor selection is essential for model fairness
One common AI-related fear I’ve often heard is that machine learning models will leverage oddball facts buried in vast databases of personal information to make decisions impacting lives. For example, the fact that you used Arial font in your resume, plus your cat ownership and fondness for pierogi, will prevent you from getting a job. Associated with such concerns is fear of discrimination based on sex or race due to this kind of inference. Are such fears silly or realistic? Machine learning models are based on correlation, and any feature associated with an outcome can be used as a decision basis; there is reason for concern. However, the risks of such a scenario occurring depend on the information available to the model and on the specific algorithm used. Here, I will use sample data to illustrate differences in incorporation of incidental information in random forest vs. XGBoost models, and discuss the importance of considering missing information, appropriateness and causality in assessing model fairness.
Feature choice — examining what might be missing as well as what’s included — is very important for model fairness. Often feature inclusion is thought of only in terms of keeping or omitting “sensitive” features such as race or sex, or obvious proxies for these. However, a model may leverage any feature associated with the outcome, and common measures of model performance and fairness will be essentially unaffected. Incidental correlated features may not be appropriate decision bases, or they may represent unfairness risks. Incidental feature risks are highest when appropriate predictors are not included in the model. Therefore, careful consideration of what might be missing is crucial.
This article builds on results from a previous blog post and uses the same dataset and code base to illustrate the effects of missing and incidental features [1, 2]. In brief, I use a publicly-available loans dataset, in which the outcome is loan default status (binary), and predictors include income, employment length, debt load, etc. I preferentially (but randomly) sort lower-income cases into a made-up “female” category, and for simplicity consider only two gender categories (“males” and “females”). The result is that “females” on average have a lower income, but male and female incomes overlap; some females are high-income, and some males low-income. Examining common fairness and performance metrics, I found similar results whether the model relied on income or on gender to predict defaults, illustrating risks of relying only on metrics to detect bias.
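The biased gender assignment can be sketched as follows; the column name and the exact probability gradient are illustrative assumptions, not the actual schema or code from the referenced post:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for the loans data: one income column (the name is an assumption).
df = pd.DataFrame({"annual_inc": rng.lognormal(mean=11, sigma=0.6, size=10_000)})

# Probability of the made-up "female" label falls with income rank, so
# "females" are lower-income on average while the distributions still overlap.
income_rank = df["annual_inc"].rank(pct=True)
p_female = 0.7 - 0.4 * income_rank        # 0.7 at the bottom, 0.3 at the top
df["female"] = (rng.uniform(size=len(df)) < p_female).astype(int)

print(df.groupby("female")["annual_inc"].median())  # female median is lower
```

Because assignment is random, high-income females and low-income males both occur; only the averages differ.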
My previous blog post showed what happens when an incidental feature substitutes for an appropriate feature. Here, I will discuss what happens when both the appropriate predictor and the incidental feature are included in the data. I test two model types, and show that, as might be expected, the female status contributes to predictions despite the fact that it contains no additional information. However, the incidental feature contributes much more to the random forest model than to the XGBoost model, suggesting that model selection may help reduce unfairness risk, although tradeoffs should be considered.
Fairness metrics and global importances
In my example, the female feature adds no information to a model that already contains income. Any reliance on female status is unnecessary and represents “direct discrimination” risk. Ideally, a machine learning algorithm would ignore such a feature in favor of the stronger predictor.
When the incidental feature, female status, is added to either a random forest or XGBoost model, I see little change in overall performance characteristics or metrics (data not shown). ROC scores barely budge (as should be expected), and false positive rates show only very slight changes.
Demographic parity, or the difference in loan default rates for females vs. males, remains essentially unchanged for XGBoost (5.2% vs. 5.3%) when the female indicator is included, but for random forest, this metric does change, from 4.3% to 5.0%; I discuss this observation in detail below.
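For concreteness, the demographic parity difference quoted above is just the gap in predicted default rates between groups. A minimal helper (names and toy values are illustrative):

```python
import numpy as np

def demographic_parity_diff(y_pred, group):
    """Difference in positive (predicted-default) rates, group 1 vs. group 0."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

# Toy check: six predictions, with females (group=1) flagged more often.
preds = np.array([1, 1, 0, 1, 0, 0])
female = np.array([1, 1, 1, 0, 0, 0])
print(demographic_parity_diff(preds, female))  # 2/3 - 1/3 ≈ 0.333
```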
Global permutation importances show weak influences from the female feature for both model types. This feature ranks 12/14 for the random forest model, and 22/26 for XGBoost (when female=1). The fact that female status is of relatively low importance may seem reassuring, but any influence from this feature is a fairness risk.
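Global permutation importances of this kind can be computed with scikit-learn; a sketch on synthetic data (not the article’s loans dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loans data.
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Global permutation importance: mean drop in held-out score when one
# column is shuffled, averaged over n_repeats shuffles.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by global importance:", ranking)
```

As the article notes, a low global rank for a sensitive feature is not, by itself, evidence of fairness.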
There are no clear red flags in global metrics when female status is included in the data — but this is expected, as fairness metrics are similar whether decisions are based on an incidental or a causal factor. The key question is: does incorporation of female status increase disparities in outcome?
Aggregated Shapley values
We can measure the degree to which a feature contributes to differences in group predictions using aggregated Shapley values. This technique distributes differences in predicted outcome rates across features so that we can determine what drives differences for females vs. males. Calculation involves constructing a reference dataset consisting of randomly selected males, calculating Shapley feature importances for randomly-selected females using this “foil”, and then aggregating the female Shapley values (also called “phi” values).
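The procedure can be sketched end to end with an exact Shapley computation, which is feasible here only because the toy model has three features; the data, feature names, and value function are illustrative assumptions, and a real analysis would use a SHAP library:

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: "income" (feature 0) drives default; the "female" flag
# (feature 2) is correlated with low income but adds no information.
n = 2000
income = rng.normal(size=n)
noise = rng.normal(size=n)
female = (income + rng.normal(scale=1.5, size=n) < 0).astype(float)
X = np.column_stack([income, noise, female])
y = (income + 0.5 * rng.normal(size=n) < 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
predict = lambda Z: model.predict_proba(Z)[:, 1]

def shapley_vs_foil(x, background, f, n_feat=3):
    """Exact Shapley values of f(x) relative to a background ("foil") set."""
    def value(S):
        # Expected prediction with features in S fixed to x's values and
        # the remaining features drawn from the background rows.
        Z = background.copy()
        cols = list(S)
        if cols:
            Z[:, cols] = x[cols]
        return f(Z).mean()

    phi = np.zeros(n_feat)
    for j in range(n_feat):
        others = [k for k in range(n_feat) if k != j]
        for size in range(n_feat):
            w = 1.0 / (n_feat * comb(n_feat - 1, size))
            for S in combinations(others, size):
                phi[j] += w * (value(S + (j,)) - value(S))
    return phi

# Reference ("foil") set: randomly selected males.
males = X[female == 0]
foil = males[rng.choice(len(males), size=50, replace=False)]

# Aggregate Shapley ("phi") values over a sample of females vs. the foil.
fem_sample = X[female == 1][:25]
agg = np.mean([shapley_vs_foil(x, foil, predict) for x in fem_sample], axis=0)
print(agg)  # per-feature contribution to the female-vs-male probability gap
```

By the Shapley efficiency property, the bars sum to the total difference in mean predicted default probability between the female sample and the male foil, which is what makes the aggregated plot interpretable.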
Results are shown below for both model types, with and without the “female” feature. The top 5 features for the model not including female are plotted along with female status for the model that includes that feature. All other features are summed into “other”.
[Figure: aggregated Shapley contributions to the female-vs-male difference in predicted default probability, for random forest and XGBoost models with and without the female feature. Image by author.]
First, note that the blue bar for female (present for the model including female status only) is much larger for random forest than for XGBoost. The bar magnitudes indicate the amount of probability difference for women vs. men that is attributed to a feature. For random forest, the female status feature increases the probability of default for females relative to males by 1.6%, compared to 0.3% for XGBoost, an ~5x difference.
For random forest, female status ranks in the top 3 influential features in determining the difference in prediction for males vs. females, even though the feature was the 12th most important globally. The global importance does not capture this feature’s impact on fairness.
As mentioned in the section above, the random forest model shows decreased demographic parity when female status is included in the model. This effect is also apparent in the Shapley plots — the increase due to the female bar is not compensated for by any decrease in the other bars. For XGBoost, the small contribution from female status appears to be offset by tiny decreases in contributions from other features.
The reduced impact of the incidental feature for XGBoost compared to random forest makes sense when we think about how the algorithms work. Random forests create trees using random subsets of features, which are examined for optimal splits. Some of these initial feature sets will include the incidental feature but not the appropriate predictor, in which case incidental features may be chosen for splits. For XGBoost models, split criteria are based on improvements to a previous model. An incidental feature can’t improve a model based on a stronger predictor; therefore, after several rounds, we expect trees to include the appropriate predictor only.
The decrease in demographic parity for random forest can also be understood by considering model-building mechanisms. When a subset of features to be considered for a split is generated in the random forest, we essentially have two “income” features, so it’s more likely that (direct or indirect) income information will be selected.
The random forest model effectively uses a larger feature set than XGBoost. Although numerous features are likely to appear in both model types to some degree, XGBoost solutions will be weighted towards a smaller set of more predictive features. This reduces, but does not eliminate, risks related to incidental features for XGBoost.
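The mechanism above can be demonstrated on synthetic data. The sketch below uses scikit-learn’s GradientBoostingClassifier as a stand-in for XGBoost, and `max_features=1` to make the random forest’s per-split feature subsampling explicit; data and feature roles are made up:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(1)

# "income" predicts default; "female" is a noisy proxy for low income
# that carries no information about the outcome beyond income itself.
n = 5000
income = rng.normal(size=n)
female = (income + rng.normal(size=n) < 0).astype(float)
y = (income + 0.3 * rng.normal(size=n) < 0).astype(int)
X = np.column_stack([income, female])

# max_features=1: each split considers a single random feature,
# so some splits see only the proxy and must use it.
rf = RandomForestClassifier(n_estimators=200, max_features=1,
                            random_state=0).fit(X, y)
# Boosting examines both features at every split, and income wins.
gb = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X, y)

print("proxy importance, random forest:", rf.feature_importances_[1])
print("proxy importance, boosting:     ", gb.feature_importances_[1])
```

The redundant proxy picks up a much larger share of impurity-based importance in the random forest than in the boosted model, mirroring the Shapley results discussed earlier.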
Is XGBoost fairer than Random Forest?
In a previous blog post, I showed that incorporation of interactions to mitigate feature bias was more effective for XGBoost than for random forest (for one test scenario). Here, I observe that the XGBoost model is also less influenced by incidental information. Does this mean that we should prefer XGBoost for fairness reasons?
XGBoost has advantages when both an incidental and appropriate feature are included in the data but doesn’t reduce risk when only the incidental feature is included. A random forest model’s reliance on a larger set of features may be a benefit, especially when additional features are correlated with the missing predictor.
Furthermore, the fact that XGBoost doesn’t rely much on the incidental feature does not mean that it doesn’t contribute at all. It may be that only a smaller number of decisions are based on inappropriate information.
Leaving fairness aside, the fact that the random forest samples a larger portion of what you might think of as the “solution space”, and relies on more predictors, may have some advantages for model robustness. When a model is deployed and faces unexpected errors in data, the random forest model may be somewhat more able to compensate. (On the other hand, if the random forest incorporates a correlated feature that is affected by errors, it might be compromised while an XGBoost model remains unaffected.)
XGBoost may have some fairness advantages, but the “fairest” model type is context-dependent, and robustness and accuracy must also be considered. I feel that fairness testing and explainability, as well thoughtful feature choices, are probably more valuable than model type in promoting fairness.
What am I missing?
Fairness considerations are crucial in feature selection for models that might affect lives. There are numerous existing feature selection methods, which generally optimize accuracy or predictive power, but do not consider fairness. One question that these don’t address is “what feature am I missing?”
A model that relies on an incidental feature that happens to be correlated with a strong predictor may appear to behave in a reasonable manner, despite making unfair decisions. Therefore, it’s very important to ask yourself, “what’s missing?” when building a model. The answer to this question may involve subject matter expertise or additional research. Missing predictors thought to have causal effects may be especially important to consider [5, 6].
Obviously, the best solution for a missing predictor is to incorporate it. Sometimes, this may be impossible. Some effects can’t be measured or are unobtainable. But you and I both know that simple unavailability seldom determines the final feature set. Instead, it’s often, “that information is in a different database and I don’t know how to access it”, or “that source is owned by a different group and they are tough to work with”, or “we could get it, but there’s a license fee”. Feature choice generally reflects time and effort — which is often fine. Expediency is great when it’s possible. But when fairness is compromised by convenience, something does need to give. This is when fairness testing, aggregated Shapley plots, and subject matter expertise may be needed to make the case to do extra work or delay timelines in order to ensure appropriate decisions.
What am I including?
Another key question is “what am I including?”, which can often be restated as “for what could this be a proxy?” This question can be superficially applied to every feature in the dataset but should be very carefully considered for features identified as contributing to group differences; such features can be identified using aggregated Shapley plots or individual explanations. It may be useful to investigate whether such features contribute additional information above what’s available from other predictors.
Who am I like, and what have they done before?
A binary classification model predicting something like loan defaults, likelihood to purchase a product, or success at a job, is essentially asking the question, “Who am I like, and what have they done before?” The word “like” here means similar values of the features included in the data, weighted according to their predictive contribution to the model. We then model (or approximate) what this cohort has done in the past to generate a probability score, which we believe is indicative of future results for people in that group.
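The “who am I like?” framing can be made literal with a nearest-neighbor model, where the score is simply the historical outcome rate in the applicant’s cohort; the data and cohort size here are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Made-up history: applicant features and past binary outcomes.
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

# "Who am I like, and what have they done before?" made literal:
# the score is the outcome rate among the 25 most similar past cases.
knn = KNeighborsClassifier(n_neighbors=25).fit(X, y)

applicant = rng.normal(size=(1, 3))
p = knn.predict_proba(applicant)[0, 1]  # share of the cohort with outcome 1
print(p)
```

Other model families approximate this cohort averaging less directly, with features weighted by predictive contribution rather than treated equally in the distance metric.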
The “who am I like?” question gets to the heart of worries that people will be judged if they eat too many pierogis, own too many cats, or just happen to be a certain race, sex, or ethnicity. The concern is that it is just not fair to evaluate individual people due to their membership in such groups, regardless of the average outcome for overall populations. What is appropriate depends heavily on context — perhaps pierogis are fine to consider in a heart attack model, but would be worrisome in a criminal justice setting.
Our models assign people to groups — even if models are continuous, we can think of that as the limit of very small buckets — and then we estimate risks for these populations. This isn’t much different than old-school actuarial tables, except that we may be using a very large feature set to determine group boundaries, and we may not be fully aware of the meaning of information we use in the process.
Feature choice is more than a mathematical exercise, and likely requires the judgment of subject matter experts, compliance analysts, or even the public. A data scientist’s contribution to this process should involve using explainability techniques to compare populations and discover the features driving group differences. We can also identify at-risk populations and ask questions about features known to have causal relationships with outcomes.
Legal and compliance departments often focus on included features, and their concerns may be primarily related to specific types of sensitive information. Considering what’s missing from a model is not very common. However, the question, “what’s missing?” is at least as important as, “what’s there?” in confirming that models make fair and appropriate decisions.
Data scientists can be scrappy and adept at producing models with limited or noisy data. There is something satisfying about getting a model that “works” from less than ideal information. It can be hard to admit that something can’t be done, but sometimes fairness dictates that what we have right now really isn’t enough — or isn’t enough yet.
Author: Valerie Carey
Source: Towards Data Science