Guest blog post by Don Philip Faithful
The idea of environmental determinism once made a lot of sense. Hostile climates and habitats prevented the expansion of human populations. The conceptual opposite of determinism is called possibilism. These days, human populations can be found living in many inhospitable habitats. This isn't because humans have physically evolved; rather, we normally occupy built-environments. We exist through our technologies and advanced forms of social interaction: a person might not be able to build a house, but he or she can arrange for financing to have a house constructed. "Social possibilism" has enabled our survival in inhospitable conditions. Because humans today almost always live within or in close proximity to built-environments, among the most important factors affecting human life today is data. The systems that support human society make use of data in all of its multifarious forms; this being the case, data science is important to our continuation and development as a species. This blog represents a discussion highlighting the need for a universal data model. I find that the idea of "need" is highly subjective; and perhaps the tendency is to focus on organizational needs specifically. I don't dispute the importance of such a perspective. But I hope that readers consider the role of data on a more abstract level in relation to social possibilism. It is this role that the universal data model is meant to support. Consider some barriers or obstacles that underline the need for a model, listed below.
Barriers to Confront
I certainly don't suggest that in this blog I am introducing the authoritative data model to end all models. Quite the contrary, I feel that my role is to help promote discussion. I imagine that even over the list of barriers, there might be some disagreement among data scientists.
(1) Proxy reductionism triggered by instrumental needs: I believe some areas of business management have attempted to address highly complex phenomena through the simplification of proxies (i.e. data). The nominal representation of reality facilitates production, but also insulates an organization from its environment. Thus production can occur disassociated from surrounding phenomena. I feel that this nominalism is due to lack of a coherent model to connect the use of data to theory. We gain the illusion of progress through greater disassociation, exercising masterful control over data while failing to take into account and influence real-life conditions.
(2) Impairment from structurally inadequate proxies: Upon reducing a problem through the use of primitive proxies, an organization might find development less accessible. I believe that a data model can help in the process of diagnosis and correction. I offer some remedial actions likely applicable to a number of organizations: i) collection of adequate amounts of data; ii) collection of data of greater scope; and iii) ability to retain the contextual relevance of data.
Social Disablement Perspective
My graduate degree is in critical disability studies - a program that probably seems out-of-place in relation to data science. Those studying traditional aspects of disability might argue that this discipline doesn't seem to involve big data, algorithms, or analytics. Nonetheless, I believe that disablement is highly relevant in the context of data science, albeit perhaps in a conceptual sense. While there might not be people with apparent physical or mental disabilities, there are still disabling environments. Organizations suffering from an inability to extract useful insights from their data might not be any more disabled than the data scientist surrounded by tools and technologies disassociated from their underlying needs. Conversely, those in the field of disability might discuss the structural entrenchment of disablement without ever targeting something as pervasive as data systems. However, for those open to different perspectives, I certainly discuss aspects of social disablement in my blogs all the time. Here, I will be arguing that at its core, data is the product of two forces in a perpetual tug-of-war: disablement and participation. So there you go. I offer some cartoon levity as segue.
I recently learned that the term "stormtroopers" has been used to describe various military forces. For the parable, assume that I mean Nazi shock troops. I'm uncertain how many of my peers have the ability to write computer programs. I create applications from scratch using a text editor. Another peculiarity of mine is the tendency to construct and incorporate elaborate models into my programming. It is never enough for a program to do something. I search for a supporting framework. Programming for me is as much about research through framework-development as it is about creating and running code. In the process of trying to communicate models to the general public, I sometimes come up with examples that I admit are a bit offbeat. Above in the "Parable of the Stormtrooper and the Airstrip," I decided to create personifications to explain my structural conceptualization of data. The stormtrooper on the left would normally be found telling people what to do. Physical presence or presence by physical proxy is rather important. (I will be using the term "proxy" quite frequently in this blog.) He creates rules or participates in structures to impose those rules. He hollers at planes to land on his airstrip. I chose this peculiar behaviour deliberately. Command for the soldier is paramount, effectiveness perhaps less so. In relation to the stormtrooper, think social disablement; this is expressed on the drawing as "projection."
On the other side of the equation is this person who sort of resembles me and whom I have identified as me, although this is a personification of an aspect of data. He is not necessarily near or part of the enforcement regime. His objective, rather than to compel compliance, is to make sense of phenomena: he aims to characterize and convey it, especially those aspects of reality that might be associated with but not necessarily resulting from the activities of the stormtrooper. There are no rules for this individual to impose. Nor does he create structures to assert his presence over the underlying phenomena. In his need to give voice to phenomena, he seeks out "ghosts" through technology. If this seems a bit far-fetched, at least think of him as a person with all sorts of tools designed to detect events that are highly evasive. Perhaps his objective is to monitor trends, consumer sentiment, heart palpitations, or patterns leading to earthquakes. Participation is indicated on the drawing as "articulation."
So how is a model extracted from this curious scene? I added to the drawing what I will refer to as the "eye": data appears in the middle surrounded by projection and articulation. Through this depiction, I am saying that data is rarely just plain data. It is a perpetual struggle between the perceiver and perceived. I think that many people hardly give "data" much thought: e.g. here is a lot of data; here is my analysis; and here are the results. But let us consider the idea that data is actually quite complicated from a theoretical standpoint. I will carry on this discussion using an experiment. The purpose of this experiment is not to arrive at a conclusion but rather to perceive data in abstract terms.
An Experiment with No Conclusion
A problem when discussing data on an abstract level is the domain expertise of individuals. I realize this is an ironic position to take given so many calls for greater domain expertise in data science. The perspective of a developer is worth considering: he or she often lacks domain expertise, and yet this person is responsible for how different computer applications make use of data. Consequently, in order to serve the needs of the market, it is necessary for the developer to consider how "people" regard the data. Moreover, the process of coding imposes distance or abstraction since human mental processes and application processes are not necessarily similar. A human does not generate strings from bytes and store information at particular memory addresses. But a computer must operate within its design parameters. The data serves human needs due to the developer's transpositional interpretation of the data. The developer shapes the manner of conveyance, defines the structural characteristics of the data, and deploys it to reconstruct reality.
I have chosen an electrical experiment. There is just a single tool, a commercial grade voltmeter designed to detect low voltages. The voltage readings on this meter often jump erratically when I move it around a facility full of electrical devices; this behaviour occurs when the probes aren't touching anything. Now, the intent in this blog is not to determine the cause of the readings. I just want readers to consider the broader setting. Here is the experiment: with the probes sitting idle on a table, I took a series of readings at two different times of the day. The meter detected voltage - at first registering negative then becoming positive after about a minute. As indicated below on the illustration, these don't appear to be random readings. Given that there is data, what does it all mean? The meter is reading electrical potential, and this is indeed the data. What is the data in more abstract terms regardless of the cause?
Being a proxy is one aspect of data. Data is a constructed representation of its underlying phenomena: the electrical potential is only a part of the reality captured in this case by the meter. The readings from the meter define and constrain the meaning of the data such that it only relates to output of the device. In other words, what is the output of the device? It is the data indicated on the meter. It is a proxy stream; this is what we might recognize in the phenomena; for this is what we obtain from the phenomena using the meter. From the experiment itself, we actually gain little understanding of the phenomena. We only know its electrical readings. So although the data is indeed some aspect of the articulated reality, this data is more than anything a projection of how this reality is perceived. It is not my intention to dismiss the importance of the meter readings. However, we would have to collect far more data to better understand the phenomena. Our search cannot be inspired by the meter readings alone; it has to be driven by the phenomena itself.
Another problem relates to how the meter readings are interpreted. Clearly the readings indicate electrical potential; so one might suggest that the data provides us with an understanding of how much potential is available. The meter provides data not just relating to electrical potential alone but also dynamic aspects of the phenomena: its outcomes, impacts, and consequences. This is not to say that electrical potential is itself the outcome or antecedent of an outcome; but it is part of the reality of which the device is designed to provide only readings of potential. We therefore should distinguish between data as a proxy and the underlying phenomena, of which the data merely provides a thin connection or conduit. There is a structure or organizational environment inherent in data that affects the nature and extent to which the phenomena is expressed. The disablement aspect confines phenomena to contexts that ensure the structure fulfills instrumental requirements. Participation releases the contextual relevance of data.
I have met people over the years that refuse to question pervasive things. I am particularly troubled by the expression "no brainer." If something is a no-brainer, it hardly makes sense to discuss it further; so I imagine these people sometimes avoid deliberating over the nature of things. This strategy is problematic from a programming standpoint where it is difficult to hide fundamental lack of knowledge. It then becomes apparent that the "no brainer" might be the person perceiving the situation as such. Keeping this interpretation of haters and naysayers in mind, let's consider the possibility that it actually takes all sorts of brains to characterize data - that in fact the task can incapacitate both people and supercomputers. If somebody says, "Hey, that's a no brainer" to me or anybody else, my response will be, "You probably mean that space in your head!" (Shakes fist at air.)
I provide model labels on the parable image: projection, data, and articulation. I generally invoke proper terms for aspects of an actual model. "Disablement" can be associated with "projection" on the model; and "participation" with the term "articulation." The conceptual opposition is indicated on the image below as point #1. Although the parable makes use of personifications, there can sometimes be entities in real-life doing the projection: e.g. the oppressors. There can also be real people being oppressed. In an organizational context, the issue of oppression is probably much less relevant, but the dynamics still persist between the definers and those being defined: e.g. between corporate strategists and consumers. Within my own graduate research, I considered the objectification of labourers and workers. As production environments have developed over the centuries, labour has become commodified. In the proxy representation, workers have been "defined" using the most reified metrics; but there is a counterforce also, for self-definition or some level of autonomy. Data exists within a context of competing interests as indicated on point #2.
From the experiment I indicated how data is like a continuum formed by phenomena and its radiating consequences: I said that readings can be taken of dynamic processes. This is a bit like throwing stones in a lake and having sensors detect ripples and counter-ripples. An example would be equity prices in the stock market where a type of algorithmic lattice can bring to light the dynamic movement of capital. Within this context, it is difficult to say whether what we are measuring is more consequence or antecedent; but really it is both. I believe it is healthy to assume that the data-logger or reading device offers but the smallest pinhole to view the universe on the other side. Point #3 shows these additional dynamics. There is a problem here in terms of graphical portrayal - how to bring together all three points into a coherent model. I therefore now introduce the universal data model. I also call this the Exclamation Model or the Model! The reasons will be apparent shortly.
The Exclamation Model visually resembles an exclamation mark, as shown on the image below. For the purpose of helping readers navigate, I describe the upper portion of the model as "V" and the lower part as "O," or "the eye" as I mentioned previously since it resembles a human eye. The model attempts to convey all of the different things that people tend to bundle up in data, perhaps at times subconsciously. An example I often use in my blogs is sales data, which doesn't actually tell us much about why consumers buy products. There might be high demand one year followed by practically no demand the next; yet analysts try to plot sales figures as if the numbers follow some sort of intrinsic force or built-in natural pattern. Sales figures do not represent an articulation of the underlying phenomena; rather, they cause externally desired aspects of the phenomena to conform to an interpretive framework. Within any organizational context, there is a battle to dictate the meaning of data. If an organization commits itself to the collection of sales data and nothing beyond this to understand its market, it would be difficult at a later time to find a suitable escape route leading away from the absence of information. The eye is inherent in the structure of data, extending in part from the authority and control of those initiating its collection.
As one goes up the V, both projection and articulation transform to accommodate the increasing complexity of the phenomena; but also while going up, there is greater likelihood of separation between the articulated reality (e.g. employee stress) and the instrumental projection (e.g. performance benchmarks) resulting in different levels of alienation. As one travels down the V, there is less detachment amid declining complexity, which improves the likelihood of inclusion. In this explanation, I am not suggesting that alienation or inclusion is directly affected by the level of sophistication in the data. The V can become narrower or wider depending on design. Complexity itself does not cause alienation between data and its phenomena; but there is greater need for design to take complexity into account due to the risk of alienation. It might be tempting to apply this model to social phenomena directly, but actually this is all meant for the data that is conveying phenomena. Data can be alienated from complex phenomena.
Rooted in Systems Theory
I realize that the universal data model doesn't resemble a standard input-process-output depiction of a system; but actually it is systemic. Projection provides the arrow for productive processes sometimes portrayed in a linear fashion: input, process, and output. Articulation represents what has often been described as "feedback." Consequently, the eye suggests that the entire system is a source of data. In another blog, I support this argument by covering three major data types that emerge in organizations: data related to projection resulting from metrics of criteria; data from routine operations as part of production processes; and data from articulation from the metrics of phenomena. The eye is rather like a conventional system viewed from a panoramic lens. The V provides an explanation of the association between proxies and phenomena under different circumstances.
Arguments Regarding Evidence
The simplification movement has mostly been about simplification of proxies and not the underlying phenomena. Data as a proxy is made simpler in relation to what it is meant to represent. Consider a common example: although employees might have many attributes and capabilities, in terms of data they are frequently reduced to hours worked. The number of hours worked is a metric intended to indicate the cost of labour. A data system might retain data focused on the most instrumental aspects of production thereby giving the illusion that an organization is only responsible for production. I feel that as society becomes more complex and the costs associated with data start to decline in relation to the amount of data collected, the obligation that an organization has to society will likely increase. This obligation will manifest itself in upgrades to data systems and not only this but improved methodologies surrounding the collection and handling of data. The model provides a framework to examine the extent to which facts could and should have been collected. Consider a highly complex problem such as infection rates in a hospital. The hospital might address this issue by collecting data on the number of hours lost through illness and sick days used. But this alone does not provide increased understanding of infections; some might argue therefore that such inadequate efforts represent a deliberate form of negligence apparent in the alienation of proxies.
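The reduction of employees to hours worked can be made concrete with a minimal sketch in Python. Everything here is hypothetical: the field names, the sample record, and the helper functions are assumptions chosen only for illustration, since the discussion above says no more than that employees have "many attributes and capabilities." The sketch simply lists what a proxy stops carrying once it keeps only the instrumental metric.

```python
from dataclasses import dataclass, asdict


@dataclass
class EmployeeRecord:
    # Hypothetical attributes; only hours_worked is named in the text above.
    name: str
    hours_worked: float
    skills: tuple
    sick_days_used: int


def reduce_to_hours(record: EmployeeRecord) -> dict:
    """Instrumental proxy: keep only the metric indicating the cost of labour."""
    return {"hours_worked": record.hours_worked}


def dropped_fields(record: EmployeeRecord) -> list:
    """Names of the attributes that the reduced proxy no longer carries."""
    full = asdict(record)
    kept = reduce_to_hours(record)
    return sorted(set(full) - set(kept))


# Illustrative values only, not real data.
record = EmployeeRecord(
    name="A. Worker",
    hours_worked=37.5,
    skills=("welding", "first aid"),
    sick_days_used=2,
)

print(reduce_to_hours(record))
print(dropped_fields(record))
```

The point of the design is that the loss is invisible from inside the reduced proxy: nothing in the hours-only record indicates that skills or sick days were ever available to collect.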
Relation to Computer Coding
I have a habit of inventing terms to describe things particularly in relation to application development. Experience tells me that if I fail to invent a term and dwell on its meaning, the thing that I am attempting to describe or understand will fade away. I am about to make use of a number of terms that have meaning to me in my own projects; and I just want to explain that I claim no exclusive rights or authority over these terms. In this blog, I have described data as "proxy" for "phenomena." I make use of a functional prototype called Tendril to examine many different types of data. Using Tendril, there are special terms to describe particular types of proxies: events, contexts, systems, and domains. These proxies all represent types of data or more specifically the organization of aspects of phenomena that we customarily refer to as data.
The most basic type of proxy is an event. I believe that when most people use the term "data," they mean a representation quite close to a tangible aspect of phenomena. I make no attempt to confine the meaning of phenomena. There can be hundreds of events describing different aspects of the same underlying reality. I consider the search for events a fluid process that occurs mostly on a day-to-day level rather than during design. Another type of proxy - i.e. a different level of data - is called a context. Phenomena can "participate" in events. The "relevance" of events to different contexts is established on Tendril using something called a relevancy algorithm. I placed a little person on the illustration to show what I consider to be the comfort zone for most people in relation to data. I would say that people tend to focus on the relevance of events to different contexts.
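To give the relationship between events and contexts a concrete shape, here is a minimal sketch in Python. It is not Tendril's relevancy algorithm, which is not described in this blog. The class names, the tag sets, and the choice of a simple set-overlap (Jaccard) score are all assumptions made for illustration, and the sample events are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Most basic proxy: a representation close to a tangible aspect of phenomena."""
    name: str
    tags: frozenset


@dataclass(frozen=True)
class Context:
    """A different level of proxy, to which events can be more or less relevant."""
    name: str
    tags: frozenset


def relevance(event: Event, context: Context) -> float:
    """Hypothetical relevancy score: shared tags over all tags, from 0.0 to 1.0."""
    union = event.tags | context.tags
    if not union:
        return 0.0  # nothing to compare, so treat as not relevant
    return len(event.tags & context.tags) / len(union)


# Illustrative values only, not real data.
context = Context("delivery performance", frozenset({"logistics", "delay", "service"}))
events = [
    Event("late shipment", frozenset({"logistics", "delay"})),
    Event("invoice dispute", frozenset({"service", "billing"})),
]

for event in events:
    print(event.name, round(relevance(event, context), 2))
```

Because the same event can be scored against any number of contexts, a sketch like this also reflects the idea that many events can describe different aspects of the same underlying reality.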
The idea of "causality" takes on special meaning in relation to the above conceptualization. Consider the argument that poverty is associated with diabetes. Two apparently different domains are invoked: social sciences and medicine. Thus, the events pertaining to social phenomena are being associated with a medical context. The social phenomena might relate to unemployment, stress, poor nutrition, inaccessible education, violence, homelessness, inadequate medical care: any reasonable person, even without doing research, could logically infer adverse physiological and psychological consequences. Yet the connection might not be made, I believe, because the proxy seems illegitimate. How can a doctor prescribe treatment? If human tolerance for social conditions has eroded, one approach is to treat the problem as if it were internal to the human body. Yet the whole point of the assertion is to identify the importance of certain external determinants. Society has come to interpret diabetes purely as a medical condition internal to the body. This is an example of how data as a proxy can become alienated from complex underlying phenomena. We say that people are diseased, failing to take into account the destructive and unsustainable environment that people have learned to tolerate.
Since there is no ceiling or floor on the distribution of proxies in real life, the focus (on contexts and events) does not necessarily limit the data that people use but rather the way that they interpret it, not being machines. I feel that due to its abundance, people habitually choose their place in relation to data; and they train themselves to ignore data that falls outside their preferred scope. Moreover, the data that enters their scope becomes contextually predisposed. Consequently, it might seem unnecessary to make use of massive amounts of data and many different contexts (e.g. in relation to other interests). But this predisposition is like choice of attire. The fact that data might fall outside of scope does not negate its broader relevance; nor does its presence within scope mean that it is relevant only in a single way.
The Phantom Zone
It is not through personal strength or resources that a person can get a road fixed. One calls city hall. There is no need to build shelter. One rents an apartment or buys a house. In human society, there are systems in place to handle different forms of data. These systems operate in the background at times without our knowledge enabling our existence in human society and offering comfort. Our lack of awareness does not mean that the systems do not exist. Nor does our lack of appreciation for the data mean that the structure of the data is unimportant. In fact, I suggest that the data can enable or disable the extent to which these systems serve the public good. Similarly, the way in which organizations objectify and proxy phenomena can lead to survivorship outcomes. An organization can bring about its own deterministic conditions.
The universal data model - really just "introduced" in this blog - is meant to bring to light the power dynamics inherent in data: the tug-of-war between disablement and participation. I have discussed how an elaborate use of proxies can help to reduce alienation (of the data from its underlying phenomena) and accommodate greater levels of complexity to support future development. This blog was inspired to some extent by my own development projects where I actually make creative use of proxies to examine phenomena. However, this is research-by-proxy - to understand through the manipulation of data structures the existence of ghosts - entities that are not necessarily material in nature. I attempt to determine the embodiment of things that have no bodies - the material impacts of the non-material - the ubiquity of the imperceptible. It might seem that humans have overcome many hostile environments. While we have certainly learned to conquer the material world, there are many more hazards lurking in the chasms of our data awaiting discovery. However, before we can effectively detect passersby in the invisible and intangible world, we need to accept how our present use of data is optimized for quite the opposite. Our evolution as a species will depend on our ability to combat things beyond our natural senses.