Classification is the foundation of targeting and tailoring information and experiences to individuals. Big data promises—or threatens—to bring classification to an increasing range of human activity. While many companies and government agencies foster an illusion that classification is (or should be) an area of absolute algorithmic rule—that decisions are neutral, organic, and even automatically rendered without human intervention—reality is a far messier mix of technical and human curating. Both the datasets and the algorithms reflect choices, among others, about data, connections, inferences, interpretation, and thresholds for inclusion that advance a specific purpose. Like maps that represent the physical environment in varied ways to serve different needs—mountaineering, sightseeing, or shopping—classification systems are neither neutral nor objective, but are biased toward their purposes. They reflect the explicit and implicit values of their designers. Few designers “see them as artifacts embodying moral and aesthetic choices” or recognize the powerful role they play in crafting “people’s identities, aspirations, and dignity.”[1] But increasingly, the subjects of classification, as well as regulators, do.

Today, the creation and consequences of some classification systems, from determination of tax-exempt status to predictive analytics in health insurance, from targeting for surveillance to systems for online behavioral advertising (OBA), are under scrutiny by consumer and data protection regulators, advocacy organizations and even Congress. Every step in the big data pipeline is raising concerns: the privacy implications of amassing, connecting, and using personal information, the implicit and explicit biases embedded in both datasets and algorithms, and the individual and societal consequences of the resulting classifications and segmentation. Although the concerns are wide ranging and complex, the discussion and proposed solutions often loop back to privacy and transparency—specifically, establishing individual control over personal information, and requiring entities to provide some transparency into personal profiles and algorithms.[2]

The computer science community, while acknowledging concerns about discrimination, tends to position privacy as the dominant concern.[3] Privacy-preserving advertising schemes support the view that tracking, auctioning, and optimizing done by the many parties in the advertising ecosystem are acceptable, as long as these parties don’t “know” the identity of the target.[4]

Policy proposals are similarly narrow. They include regulations requiring consent prior to tracking individuals or prior to the collection of “sensitive information,” and context-specific codes respecting privacy expectations.[5] Bridging the technical and policy arenas, the World Wide Web Consortium’s draft “do-not-track” specification will allow users to signal a desire to avoid OBA.[6] These approaches involve greater transparency.

Regrettably, privacy controls and increased transparency fail to address concerns with the classifications and segmentation produced by big data analysis.

At best, solutions that vest individuals with control over personal data only indirectly affect the fairness of classifications and outcomes, whether the resulting harm is discrimination in the narrow legal sense or “cumulative disadvantage” fed by the narrowing of possibilities.[7] Whether the information used for classification is obtained with or without permission is unrelated to the production of disadvantage or discrimination. Control-based solutions are a similarly poor response to concerns about the social fragmentation of “filter bubbles”[8] that create feedback loops reaffirming and narrowing individuals’ worldviews, as these concerns exist regardless of whether such bubbles are freely chosen, imposed through classification, or, as is often the case, some mix of the two.

At worst, privacy solutions can hinder efforts to identify classifications that unintentionally produce objectionable outcomes—for example, differential treatment that tracks race or gender—by limiting the availability of data about such attributes. For example, a system that determined whether to offer individuals a discount on a purchase based on a seemingly innocuous array of variables being positive (“shops for free weights and men’s shirts”) would in fact routinely offer discounts to men but not women. To avoid unintentionally encoding such an outcome, one would need to know that men and women arrayed differently along this set of dimensions. Protecting against this sort of discriminatory impact is advanced by data about legally protected statuses, since the ability to both build systems to avoid it and detect systems that encode it turns on statistics.[9] While automated decisionmaking systems “may reduce the impact of biased individuals, they may also normalize the far more massive impacts of system-level biases and blind spots.”[10] Rooting out biases and blind spots in big data depends on our ability to constrain, understand, and test the systems that use such data to shape information, experiences, and opportunities. This requires more data.
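The discount example can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the shopper records, the field names, and the rule itself are invented, and the numbers carry no empirical weight. The point it tries to show is narrow. The rule never reads the gender field, yet the disparity only becomes visible to an auditor who has that field available.

```python
from collections import defaultdict

# Hypothetical shopper records, invented purely for illustration.
SHOPPERS = [
    {"gender": "M", "free_weights": True, "mens_shirts": True},
    {"gender": "M", "free_weights": True, "mens_shirts": True},
    {"gender": "M", "free_weights": True, "mens_shirts": True},
    {"gender": "M", "free_weights": False, "mens_shirts": True},
    {"gender": "F", "free_weights": False, "mens_shirts": False},
    {"gender": "F", "free_weights": True, "mens_shirts": False},
    {"gender": "F", "free_weights": False, "mens_shirts": True},
    {"gender": "F", "free_weights": False, "mens_shirts": False},
]


def offers_discount(shopper):
    """Facially neutral rule: it never looks at the 'gender' field."""
    return shopper["free_weights"] and shopper["mens_shirts"]


def discount_rate_by_group(shoppers, attribute):
    """Share of each group (as defined by `attribute`) that is offered the discount."""
    offered = defaultdict(int)
    total = defaultdict(int)
    for shopper in shoppers:
        group = shopper[attribute]
        total[group] += 1
        offered[group] += int(offers_discount(shopper))
    return {group: offered[group] / total[group] for group in total}


# The audit is only possible because the protected attribute was retained.
rates = discount_rate_by_group(SHOPPERS, "gender")
print(rates)
```

If the gender column were stripped from the records in the name of privacy, `offers_discount` would behave exactly as before, but `discount_rate_by_group` could no longer be computed and the skew would go unmeasured.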

Exposing the datasets and algorithms of big data analysis to scrutiny—transparency solutions—may improve individual comprehension, but given the independent (sometimes intended) complexity of algorithms, it is unreasonable to expect transparency alone to root out bias.

The decreased exposure to differing perspectives, reduced individual autonomy, and loss of serendipity that all result from classifications that shackle users to profiles used to frame their “relevant” experience are not privacy problems. While targeting, narrowcasting, and segmentation of media and advertising, including political advertising, are fueled by personal data, they don’t depend on it. Individuals often create their own bubbles. Merely allowing individuals to peel back their bubbles (to view the Web from someone else’s perspective, devoid of personalization) does not guarantee that they will.[11]

Solutions to these problems are among the hardest to conceptualize, in part because perfecting individual choice may impair other socially desirable outcomes. Fragmentation, regardless of whether its impact can be viewed as disadvantageous from any individual’s or group’s perspective, and whether it is chosen or imposed, corrodes the public debate considered essential to a functioning democracy.

If privacy and transparency are not the panacea for the risks posed by big data, what is?

First, we must carefully unpack and model the problems attributed to big data.[12] The ease with which policy and technical proposals revert to solutions focused on individual control over personal information reflects a failure to accurately conceptualize other concerns. While proposed solutions are responsive to a subset of privacy concerns—we discuss other concepts of privacy at risk in big data in a separate paper—they offer a mixed bag with respect to discrimination, and are not responsive to concerns about the ills that segmentation portends for the public sphere.

Second, we must approach big data as a sociotechnical system. The law’s view of automated decisionmaking systems is schizophrenic, at times viewing automated decisionmaking with suspicion and distrust and at others exalting it as the antidote to the discriminatory urges and intuitions of people.[13] Viewing the problem as one of machine versus man misses the point. The key lies in thinking about how best to manage the risks to the values at stake in a sociotechnical system.[14] Questions of oversight and accountability should inform the decision of where to locate values. Code presents challenges to oversight, but policies amenable to formal description can be built in and tested for. The same cannot be said of the brain. Our point is simply that big data debates are ultimately about values first, and about math and machines only second.

Third, lawyers and technologists must focus their attention on the risks of segmentation inherent in classification. There is a broad literature on fairness in social choice theory, game theory, economics, and law that can guide such work.[15] Policy solutions found in other areas include the creation of “standard offers”; the use of test files to identify biased outputs based on ostensibly unbiased inputs; required disclosures of systems’ categories, classes, inputs, and algorithms; and public participation in the design and review of systems used by governments.

In computer science and statistics, the literature addressing bias in classification comprises: testing for statistical evidence of bias; training unbiased classifiers using biased historical data; a statistical approach to situation testing in historical data; a method for maximizing utility subject to any context-specific notion of fairness; an approach to fair affirmative action; and work on learning fair representations with the goal of enabling fair classification of future, not yet seen, individuals.

Drawing from existing approaches, a system could place the task of constructing a metric (defining who must be treated similarly) outside the system, creating a path for external stakeholders, such as policymakers, to have greater influence over, and comfort with, the fairness of classifications. Test files could be used to ensure outcomes comport with this predetermined similarity metric. While incomplete, this suggests that there are opportunities to address concerns about discrimination and disadvantage. Combined with greater transparency and individual access rights to data profiles, thoughtful policy and technical design could address a more complete set of objections.
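One way to picture the combination of an external similarity metric and test files is sketched below. It is an illustrative formalization, not a method the text specifies: it assumes the metric returns a distance between 0 and 1, assumes the system outputs a probability of a favorable outcome, and flags any pair of test individuals whose outcomes differ by more than their distance. The metric, the classifier, the field names, and the test records are all hypothetical.

```python
from itertools import combinations


def similarity_metric(a, b):
    """Hypothetical metric supplied from outside the system (e.g., by a regulator).

    Returns a distance in [0, 1]; here only credit score is deemed relevant.
    """
    return min(1.0, abs(a["credit_score"] - b["credit_score"]) / 100.0)


def classifier(person):
    """Hypothetical system under test: probability of receiving a favorable offer."""
    base = (person["credit_score"] - 500) / 300
    # A neighborhood penalty the external metric does not treat as relevant.
    penalty = 0.3 if person["neighborhood"] == "north" else 0.0
    return max(0.0, min(1.0, base - penalty))


def audit(test_file, system, metric, tolerance=1e-9):
    """Return ID pairs whose outcomes differ by more than the metric allows."""
    violations = []
    for a, b in combinations(test_file, 2):
        outcome_gap = abs(system(a) - system(b))
        if outcome_gap > metric(a, b) + tolerance:
            violations.append((a["id"], b["id"]))
    return violations


# Hypothetical test file: t1 and t2 differ only in neighborhood.
TEST_FILE = [
    {"id": "t1", "credit_score": 700, "neighborhood": "north"},
    {"id": "t2", "credit_score": 700, "neighborhood": "south"},
    {"id": "t3", "credit_score": 650, "neighborhood": "south"},
]

violations = audit(TEST_FILE, classifier, similarity_metric)
print(violations)
```

In this sketch the value judgment about who counts as similar lives entirely in `similarity_metric`, which can be written and revised without touching the classifier. The audit is only as complete as the test file: pairs that are never constructed are never checked.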

Finally, the concerns related to fragmentation of the public sphere and “filter bubbles” are a conceptual muddle and an open technical design problem. Issues of selective exposure to media, the absence of serendipity, and yearning for the glue of civic engagement are all relevant. While these objections to classification may seem at odds with “relevance” and personalization, they are not a desire for irrelevance or under-specificity. Rather they reflect a desire for the tumult of traditional public forums—sidewalks, public parks, and street corners—where a measure of randomness and unpredictability yields a mix of discoveries and encounters that contribute to a more informed populace. These objections resonate with calls for “public” or “civic” journalism that seeks to engage “citizens in deliberation and problem-solving, as members of larger, politically involved publics,”[16] rather than catering to consumers narrowly focused on private lives, consumption, and infotainment. Equally important, they reflect the hopes and aspirations we ascribe to algorithms: despite our cynicism and reservations, “we want them to be neutral, we want them to be reliable, we want them to be the effective ways in which we come to know what is most important.”[17] We want to harness the power of the hive brain to expand our horizons, not trap us in patterns that perpetuate the basest or narrowest versions of ourselves.

The urge to classify is human. The lever of big data, however, brings ubiquitous classification, demanding greater attention to the values embedded and reflected in classifications, and the roles they play in shaping public and private life.