Afterthoughts: Big Data and Us Little People

The original series was written in a frenzy. Aggregating is the inverse of broadcasting. Aggregation is biased towards anonymity. Being the subject of a data point is a matter of immediate experience. Cohorts choose people more than people choose cohorts. And so on.

Frenzies have a welcome feature: the simultaneity of both focus and chaos. Which means that while I noted some thoughts along the way in writing the series, I had neither the time nor the appropriate peripheral attention to get to them.

Still, when the focus is broken and the chaos finally gives way to some type of order, what was left on the periphery can get mentioned. As follows.

Artifacts are either intentional (the Pyramids at Teotihuacán for example), accidental (millions of unearthed pot shards) or inferred (the details of the social organization that must have existed to build monuments like the Pyramids). Aggregation is only possible in the presence of large datasets.

The majority of technology awe expressed by writers today focuses on advances that expand the manageable size of these datasets (big data), add new sources of them (the Internet of Things), or improve the more technical side of manipulating the data (beyond SQL).

The precursor to all these advances is the actual cultural shift towards the acceptance of transaction level logging for software applications. The inferred artifact is the transition to this acceptance and the way it has become close to an expectation. Why do we think it is ok to collect all that data?

In fact, some privacy debates still focus on preventing the actual recording of events as a way of protecting the privacy of an individual. But that debate has increasingly shifted to protecting the records, which, it is taken as a given, are ubiquitous.

When the US Supreme Court ruled that the contents of a cell phone could not be searched routinely without probable cause, just as when the EU Court of Justice ruled on the right to be forgotten, the existence of the records was never questioned. Nor was a future in which the volume of such records would continue to increase, or the “argument” for aggregating those data.

II.

Cases that have been brought to prevent the NSA from collecting metadata on telephone calls usually do not demand that the metadata not exist. Companies will insist that such transaction level logging is required for accurate accounting and management of their business, to assist them in resolving customer disputes and preventing fraud.

However, the assumption that transactions must be recorded and persisted as data is not accurate. Companies can and do store images of bills to be referred to by their customers. This is usually an option on on-line self-service portals. The implication is that such images are a transition from the time of paper bills to the truly on-line “data dump” of a bill, i.e. a table of records displayed on a screen.

Those images, however, could contain every bit as much of the detail needed for customer service as the large datasets currently stored. Because they are still being stored digitally, the company could always find an individual’s records in the same way the consumer using the portal does.

A customer service representative would not have to read the image to resolve the dispute. If the images were stored in such a way that they could be converted (the technical term is “parsed”) into data when a service rep needed to address a customer concern, then all the same computer screens that the rep now uses could be populated.

Once the customer’s complaint was addressed, the transactions that had been converted into searchable data could be wiped from the application’s working storage. At that point, the customer is left with the company once again just storing images of their bills.

And when subpoenaed, the company could produce the images, which law enforcement would then have to parse. We are used to thinking of digital images as types of documents, and so the idea of a subpoena would be easier for law enforcement to adapt to than the current mess of trying to determine what constitutes reasonable search and seizure of millions of records at once.

Advances in the efficiency of processors and storage capacity make this less farfetched a technical feat than ever before. The only thing that would be lost would be ease of creating large aggregated datasets for analysis.
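
The parse-on-demand arrangement described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the archive, the record fields, and the function names are all hypothetical, and plain text bytes stand in for the stored bill images that a real system would have to run through OCR or a document parser.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Transaction:
    date: str
    description: str
    amount_cents: int


# Hypothetical archive: one opaque document per (customer, billing period).
# Assumption: text bytes stand in for a scanned image or PDF of the bill.
BILL_ARCHIVE = {
    ("cust-001", "2015-06"): (
        b"2015-06-02|Monthly service|4500\n"
        b"2015-06-14|Overage fee|1250\n"
    ),
}


def parse_bill(document: bytes) -> List[Transaction]:
    """Convert one stored bill document into searchable records."""
    records = []
    for line in document.decode("utf-8").splitlines():
        date, description, amount = line.split("|")
        records.append(Transaction(date, description, int(amount)))
    return records


@contextmanager
def working_records(customer_id: str, period: str) -> Iterator[List[Transaction]]:
    """Parse one bill into working storage and wipe it when the case closes."""
    working = parse_bill(BILL_ARCHIVE[(customer_id, period)])
    try:
        yield working
    finally:
        # Only the archived document survives; the parsed rows do not.
        working.clear()


if __name__ == "__main__":
    with working_records("cust-001", "2015-06") as records:
        disputed = [r for r in records if "Overage" in r.description]
        print("Disputed charge (cents):", disputed[0].amount_cents)
    print("Rows left in working storage:", len(records))
```

The design choice worth noticing is that searchable rows exist only inside the `with` block, for one customer and one bill at a time. Nothing in this sketch ever holds all customers' transactions as queryable data at once, which is exactly the ease of aggregation that would be given up.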

III.

When Andy Warhol predicted that each person would get 15 minutes of fame, he never said they would not have to share those 15 minutes. There are trains of thought in philosophy, economics and political science that discuss concepts like the alienation of workers, individualization, and consumerism. Some of these concepts go back over 100 years. We do not need to pass moral judgements on these concepts to recognize that the technologies of a networked world, a data-centric environment, rely on identifying individuals.

We may have reached the point where at least local celebrity and alienation have converged. I would argue that this is the result of identity itself becoming a data point.

Sometimes those identities are at the level of impersonal data points like MAC and IP addresses. Sometimes they are specific to the individual but still somehow removed as in the case of email addresses and mobile phone numbers.

Authentication is what makes identity personal in a data-centric world. So long as identity theft, impersonation, creates nothing more than acceptable noise in data models, then the individual’s interest in protecting their own identity will be greater than any data collection enterprise’s interest in it.

I am distinguishing authentication here as a way of identifying the subject of a data point from the more common use of it: as a way of controlling access and being tied to authorization. As a security mechanism, coupled with authorization, authentication is invaluable and hence we see efforts to increase its strength.

But as a way of identifying events and data points, authentication takes on a different dimension. In everything from Social Media to financial transactions, from cloud based storage to instant messaging, people are being increasingly confronted with the idea that they must identify themselves as unique individuals and they must protect that identity from all other individuals.

And it is a confrontation. Consider the jargon: the security questions that some systems employ to validate your identity in password reset scenarios are referred to as “challenges.” They call to mind the fairy tale scene where the stranger approaches the heavily guarded gate and answers a series of questions to be permitted to pass. Hence the most common authentication element is still called a “password.”

So the individual devotes increasing amounts of attention to ensuring they can reliably assert their “individualness.” They do this because the data point, the transaction as well as their defined privacy rights all insist on the individual as the subject.

Again, I am not trying to place a moral judgement on these phenomena, merely to observe that this very emphasis on the individual as a person who is regularly asserting that individualness is reinforced by the way the technologies have evolved.

This causes us to need to distinguish between “individualness” and “individuality.” The former can be described as the establishment of yourself as an individual. The latter retains its common meaning of describing the fact that there are unique things about you. To be blunt: the difference between having a unique identity and being a unique person.

IV.

The emphasis on establishing and protecting on-line identity is urgent. There is a good deal at stake for the individual. Victims of identity theft suffer harm that ranges from the inconvenience of changing account numbers and passwords to the well documented cases of finding themselves owing money for goods and services they had nothing to do with acquiring.

This emphasis is, therefore, necessary and appropriate. As an Information Security professional, I would never argue against it. I actively advocate for it, in fact.

I believe that scholars, like Ulrich Beck, would recognize this emphasis as falling into what they describe as “institutionalized individualization” which they see as breaking down the potential for collective action if not collective consciousness itself among people[1].

However, just because people may not be recognizing themselves as members of a community, does not mean that views of individuals aggregated into groups do not exist.

Just as the argument can be made that individuals can become so pre-occupied with the role of being an individual that they cannot associate themselves with collective action, so I would argue that this pre-occupation, when carried into the data-centric world, contributes to the fact that people are divorced from the cohorts they are put into by collectors of the data about them.

I have argued elsewhere that there is at least one new role in a data-centric world: the searcher. Both the EU Court of Justice decision on the right to be forgotten and the Federal Data Mining Reporting Act of 2007 carve out a privileged position for searchers that gives them rights separate from the privacy rights of the subjects of the data (even though each data point undoubtedly has a subject).

The U.S. law defines “data mining” as an activity…

“…involving pattern-based queries, searches, or other analyses of 1 or more electronic databases, where—

(A) a department or agency of the Federal Government, or a non-Federal entity acting on behalf of the Federal Government, is conducting the queries, searches, or other analyses to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity on the part of any individual or individuals;

(B) the queries, searches, or other analyses are not subject-based and do not use personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals, to retrieve information from the database or databases;”

(Data Mining Reporting Act of 2007, Section 804(b)(1), emphasis mine)
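
The distinction the statute draws between subject-based and pattern-based queries can be made concrete with a small sketch. It is an illustration only, not a legal reading of the Act: the call-log rows, field names, and function names below are invented for the example.

```python
from typing import Dict, List, Set

# Hypothetical call-log rows; the field names are invented for illustration.
CALLS = [
    {"caller": "555-0101", "callee": "555-0199", "minutes": 3},
    {"caller": "555-0102", "callee": "555-0199", "minutes": 1},
    {"caller": "555-0101", "callee": "555-0150", "minutes": 12},
    {"caller": "555-0103", "callee": "555-0199", "minutes": 2},
]


def subject_based(rows: List[dict], identifier: str) -> List[dict]:
    """Subject-based: the searcher supplies one individual's identifier."""
    return [r for r in rows if r["caller"] == identifier]


def pattern_based(rows: List[dict], min_callers: int) -> List[str]:
    """Pattern-based: no identifier goes in, only a pattern to look for.

    Returns every number called by at least `min_callers` distinct callers.
    """
    callers_per_callee: Dict[str, Set[str]] = {}
    for r in rows:
        callers_per_callee.setdefault(r["callee"], set()).add(r["caller"])
    return sorted(
        callee
        for callee, callers in callers_per_callee.items()
        if len(callers) >= min_callers
    )


if __name__ == "__main__":
    print(subject_based(CALLS, "555-0101"))
    print(pattern_based(CALLS, 3))
```

Note that the pattern-based query takes no personal identifier as input, yet what it returns is still a number that belongs to someone. The searcher's query is not subject-based, but every row it touches has a subject.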

In addition, it is well documented that the buying and selling of aggregated datasets constitutes a market and that there is a specific role in this market for data brokers.

Describing the characteristics of the Data Broker Industry in the United States, the Federal Trade Commission noted:

The Data Broker Industry is Complex, with Multiple Layers of Data Brokers Providing Data to Each Other: Data brokers provide data not only to end-users, but also to other data brokers. The nine data brokers studied obtain most of their data from other data brokers rather than directly from an original source. Some of those data brokers may in turn have obtained the information from other data brokers. Seven of the nine data brokers in the Commission’s study provide data to each other. Accordingly, it would be virtually impossible for a consumer to determine how a data broker obtained his or her data; the consumer would have to retrace the path of data through a series of data brokers.
Data Brokers Collect and Store Billions of Data Elements Covering Nearly Every U.S. Consumer: Data brokers collect and store a vast amount of data on almost every U.S. household and commercial transaction. Of the nine data brokers, one data broker’s database has information on 1.4 billion consumer transactions and over 700 billion aggregated data elements; another data broker’s database covers one trillion dollars in consumer transactions; and yet another data broker adds three billion new records each month to its databases. Most importantly, data brokers hold a vast array of information on individual consumers. For example, one of the nine data brokers has 3000 data segments for nearly every U.S. consumer.

Data Brokers: A Call for Transparency and Accountability. Federal Trade Commission. May 2014.

In testifying before Congress regarding this industry, Paul Kurtz, at that time the Executive Director of the Cyber Security Industry Alliance (CSIA), described the importance of data brokers as follows:

“We need, simply, to come to terms with our reliance on information systems and the vast amount of personal information in storage and in transit in such systems…the representatives of the information-broker industry will testify this morning that the American economy and even our national security are becoming increasingly dependent on this industry.”[2]

Mr. Kurtz’s testimony was delivered on May 10th, 2005, and the FTC report quoted above was, of course, published nine years later. It is perhaps fair to say that in those nine years, we did not come to terms with our reliance on information systems in the way that Kurtz meant. I believe that is because we have not come to terms with the unique roles that go beyond individual data record transactions.

While the traditional roles in data privacy and security are both functional and transactional—subject, user, collector, discloser—it seems important to accept that there are two equally important roles for us to consider, roles that are functional only when focusing on the aggregation of data. The roles are searcher and broker.

VI.

We tend to look at technologies by what they do, how they were developed and what they replaced. While there are tremendous differences between primitive Paleolithic technology, a stone axe for example, and our current ones, we gravitate towards thinking of them both as enabling. The rhetorical “what did we do before we had x” applies equally to the water wheel as the jet plane. We call them “advances” and discuss what we can now do that we could not do before. Or what we can now do more of or do more efficiently.

Those discussions are necessary, often accurate, but not sufficient. We need to look at technological advancement in the same way we look at the development of pharmaceuticals. The discussion of what a technology does and how it works is what, in the world of products subject to Food and Drug Administration (FDA) approval, would be called intended or approved use. But we should, taking their lead, also discuss side effects and even what is referred to as off-label use.

This has happened with some technologies. The impact of technology on the environment is perhaps the best example of discussing side effects. Whether or not Bluetooth headsets are safe for one’s health is another. The early studies of the effect of watching television on children are also examples of looking at safety and side effects with respect to a technology.

Anyone who has ever driven an infant around the block over and over again just to get them to go to sleep is familiar with one off-label use of the automobile. Data collected to record purchases at a supermarket and then aggregated and used to model buying behavior is another classic example of an off-label use.

In these articles on aggregation, I’ve attempted to look at the broader characteristics of the technology and how it is changing us as well as to suggest where to look to explore those two apt criteria used by the FDA when evaluating new treatments for approval: safety and efficacy.

[1] “If institutionalized individualization means that there is a growing pressure towards reflexive life styles and individualized biographies and that meaning and identity need to be discovered individually, can there still be a collective identity of class?” p. 685. from Beck, Ulrich, Beyond class and nation: reframing social inequalities in a globalizing world. The British Journal of Sociology 2007 Volume 58 Issue 4

[2] Identity Theft and Data Broker Services. Hearing before the Committee on Commerce, Science, and Transportation. May 10, 2005, p. 53. S. Hrg. 109-1087. U.S. GPO. Washington. 2010

By: David Sheidlower

CISO

Turner Construction

July 7, 2015
