Last Week in Tech Law & Policy, Vol. 29: The Dangers of “Innocuous” Data

(by Parker Ragland, Colorado Law 2L)

People often hold one of two views on privacy: either it is important to them, or they say, “I have nothing to hide.” While the latter response expresses a legitimate fear that privacy laws may be used by wrongdoers to shield themselves from justice, it also reveals a common misconception about privacy: that only mistakes in your past can harm your future. Problems associated with data science, and specifically the data-broker industry, sit at the core of this misconception.

What is a data broker? For many, the term calls to mind the Shadow Broker from the Mass Effect video game series. In Mass Effect’s science-fiction universe, the Broker seemed all-knowing, weaving in and out of the story at opportune moments: sometimes important, mostly forgotten, sometimes helping, other times harming. Hidden amid the clouds of a nebula, the Broker perched in front of myriad screens displaying almost every current event.

Mass Effect’s Broker is not so different from real-life data brokers. They aren’t in spaceships (yet), but they are as close as it gets to all-knowing. And they weave into our lives on a daily basis.

Put simply, data brokers are firms that aggregate data from many different sources and then sell it, facilitating a cross-pollination of information. Have you ever wondered why Facebook advertised garden gnomes to you after you purchased Your Backyard Herb Garden on Amazon?
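To make that cross-pollination concrete, here is a minimal sketch in Python, assuming two hypothetical datasets keyed on a hashed email address. Everything below is invented for illustration; real broker pipelines are far more elaborate, but the principle is the same:

```python
import hashlib

def join_key(email: str) -> str:
    """Derive a stable join key from an email address (hypothetical scheme)."""
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

# Hypothetical records from two unrelated sources.
retail_purchases = {
    join_key("alice@example.com"): {"last_purchase": "Your Backyard Herb Garden"},
}
social_profiles = {
    join_key("alice@example.com"): {"interests": ["gardening", "cooking"]},
}

# "Cross-pollination": merge whatever both sources know about the same key.
combined = {}
for k in retail_purchases.keys() | social_profiles.keys():
    combined[k] = {**retail_purchases.get(k, {}), **social_profiles.get(k, {})}

print(combined)  # one profile, richer than either source held alone
```

Neither source knew much on its own; joined on a shared identifier, they produce a profile worth advertising against.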

Of course, that example isn’t scary at all; it could even be helpful. In fact, data brokers may have much to offer society. Their ability to aggregate and sell data can facilitate better recommendations on Netflix, improved fuel efficiency, reduced employee turnover, and more.

But data brokers can also lump you into categories, often ones carrying concerning connotations. Former FTC Commissioner Julie Brill has highlighted categories such as “single mom struggling in an urban setting” and “people who did not speak English and felt more comfortable speaking Spanish.” Journalists have reported highly invasive stories, such as a father receiving advertisements for baby products after his teenage daughter purchased a pregnancy test. And jarring categorizations, such as brokers “sell[ing] lists of rape victims and AIDS patients,” have reached the Senate floor.

Many believe there is a basic threshold at which personally identifiable information (“PII”—any information that can be traced to your person) should be protected. Some agencies have set the threshold at sensitive personally identifiable information (“SPII”), which includes Social Security numbers and health information. But FTC Chairwoman Edith Ramirez announced last week that protecting SPII alone isn’t enough: “persistent identifiers”—including MAC addresses, static IP addresses, and even retail loyalty cards—can now qualify as PII.

Why is Chairwoman Ramirez’s definition so consequential? Users voluntarily post large amounts of data about themselves on the Internet. Data brokers then collect, clean, and correlate that data. Although a single correlation may reveal little meaningful information, aggregated correlations can form the basis for models, which can be compared to other models to determine whether you fit within a certain category of people. Some models can accurately predict behavior using only seemingly innocuous information posted on social media (e.g., frequency of posting, number of words used in posts, and breadth of vocabulary).
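As a rough illustration of that modeling step, the sketch below fits a simple classifier to exactly those kinds of features. All of the data are synthetic and the category label is invented; it shows only the shape of the technique, not any broker’s actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000

# Seemingly innocuous features: posts per week, mean words per post,
# and vocabulary breadth (distinct words / total words).
X = np.column_stack([
    rng.poisson(10, n),      # posting frequency
    rng.normal(25, 8, n),    # words per post
    rng.beta(2, 5, n),       # vocabulary breadth
])

# Synthetic label correlated with the features, standing in for a broker's
# category (e.g., "likely new parent") -- invented for this sketch.
logits = 0.3 * X[:, 0] - 0.05 * X[:, 1] + 4 * X[:, 2] - 3
y = rng.random(n) < 1 / (1 + np.exp(-logits))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Nothing about posting frequency or vocabulary breadth feels sensitive in isolation; the predictive power comes entirely from the aggregation.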

Even without the help of data brokers, these models can be quite powerful. For example, psychologist John Gottman can predict with about 90% accuracy whether a couple will divorce in the next four to six years, often by observing just the first three minutes of an argument.

With the proliferation of data online, what else can be predicted about your life? With enough information, a firm can know more about you than your significant other or best friend does. And data brokers allow almost any organization to acquire information sufficient to develop highly predictive models of its customers’, clients’, and employees’ behaviors.

It doesn’t stop at predictive models. Once a firm has developed an accurate model of your behavior, it can use that knowledge to influence you, either overtly or otherwise. For example, the researchers who ran Facebook’s mood-manipulation experiment altered the news feeds of more than 700,000 users, showing some positive stories and some negative ones. In response, “participants” in the experiment posted happy or sad keywords of their own, corresponding with the valence of the stories they were exposed to. Not every user was affected this way or to the same degree, but enough users were influenced to achieve statistically significant results.
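A quick simulation helps explain that last point: at this scale, even a minuscule average shift clears conventional significance thresholds. The numbers below are invented, not taken from the actual study:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n = 350_000  # two groups of this size: roughly 700,000 users total

# Simulated "positive-word rate" per user. The treated group's mean is
# shifted by a tiny amount relative to natural person-to-person variation.
control = rng.normal(loc=5.0, scale=2.0, size=n)
treated = rng.normal(loc=5.02, scale=2.0, size=n)  # a ~1% shift

t, p = ttest_ind(treated, control)
print(f"mean difference = {treated.mean() - control.mean():.3f}, p = {p:.1e}")
# With samples this large, even this negligible per-user effect is
# overwhelmingly "significant."
```

The per-user effect is trivial; the sample size is what makes the result statistically unambiguous, and what makes manipulation at scale measurable.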

Facebook’s experiment revealed not only that people can be subconsciously manipulated through an online medium, but also that a targeted outcome can be achieved by strategically altering such a medium. Facebook’s researchers achieved their goal by understanding how people think, and in that case by applying the theory of emotional contagion. Similarly, researchers have identified how people’s thoughts and feelings about certain topics, or toward certain groups, can be manipulated under experimental conditions (i.e., in environments where we feel like no one is watching). The Internet is just such an environment, and these psychological “life hacks” could soon be exploited there at scale.

Traditionally, researchers have struggled to acquire good data. They have needed to ensure that their findings are generalizable to the actual population and to conform their methods to standards set by institutional review boards. Now, however, these barriers are easily overcome: with large amounts of data readily available through brokers, researchers can acquire good data easily, and far less scrutiny is applied to how users’ information was obtained. Often, brokers have no way of knowing what kinds of information will be valuable to different companies, so they simply collect as much as possible. And in general, people have no way of knowing what kinds of data are being collected about them, though some brokers allow consumers to see the information already collected.

In addition, users have some protections under current law. Apart from firms’ promises to protect certain aspects of users’ privacy (e.g., in privacy policies), PII is the term used to denote what information should receive privacy protections, and agencies such as the FTC and FCC are grappling with its definition right now. If certain data are defined as PII, companies may be unable to use those data to improve users’ experiences with various products and services. But data that fall outside the scope of PII receive little or no privacy protection, and even innocuous data can become harmful once aggregated.

What do you think? How should we approach this problem? Does protecting PII sufficiently shield users’ privacy? If so, where do we draw the line for what counts as protectable PII? Is there a better way to balance firms’ desire to increase functionality against the need to ensure that they don’t use data to manipulate consumers? And how much of the burden should rest on data brokers, who currently serve as gatekeepers to that information, to ensure that data are not used in harmful ways?