Big data may provide us with new knowledge and insights. With the gigantic collections of big data, we can access large amounts of data from a variety of sources and at high velocity – even real-time data – in ways we have never experienced before. The driving force is the algorithm.
The data used in big data analyses encompass personal data about individuals: publicly available data, data disclosed by the data subjects themselves, and data registered by a company or authority – including private and sensitive data.
The private sphere under pressure
The collection of different types of data represents a major challenge. Non-sensitive personal data combined with similar data stemming from a variety of sources may accumulate into sensitive data, revealing information about our private life and providing insight into a private sphere that is protected by law and closely linked to our need for a private zone – a place to be let alone.
Many companies already follow the practice of combining personal data from loyalty cards, cookies and internal customer registers with information obtained from the customer, other databases, authorities, social media and apps. The combined data may be used to classify and describe segments and groups of customers, but also to create detailed profiles of individual persons.
All parts of our private lives are potentially unveiled: our patterns of consumption and spending, our preferences and habits, our family, friends and other personal or professional relations, our sex life or sexual orientation, affiliation with communities or associations, education, work and income, locations and mobility, our use of mobiles and digital gadgets. The list is endless.
For insurance companies and banks, the business opportunities in big data analyses as a tool to assess whether a customer satisfies the criteria for obtaining insurance, a loan or credit are evident. The risk assessment of the client and the subsequent price setting are now based on an individual risk profile embracing more information on the specific behavior of the client than ever before. German insurance companies have thus introduced a fitness app, connected to their health insurance, that helps the insured person keep track of his or her daily training and share the data with the insurance company. The benefit of a good training record is a lower price.
The digitalization of the public sector and the transformation of decision-making into automated systems follow a similar trend: we trust the algorithms to provide us with objective and fair decisions, and we perceive it as a step forward in our efforts to enhance effectiveness and quality in decision-making procedures. We decide fast and in a way that ensures predictability and transparency. That is at least what we think.
From causality to correlation
But big data analytics differ fundamentally from human judgement in decision-making procedures whose purpose is to provide a customer or a citizen with access to services. This represents another major challenge.
Big data analytics are based on algorithms. The algorithm allows us to look for specific features in an indefinite amount of data. If we want to know more about the number and character of incidents reported to the insurance company, or about the use of shelters by homeless people, we combine these pieces of information with other variables such as age, social security number, address or zip code, education and income. The algorithm will reveal correlations among the listed variables that we did not know about, and thus provide us with more insight into the behavior of the insurance customer or the homeless citizen. This is indeed smart. Standing on this new knowledge base, we may avoid high-risk customers – those with frequent insurance claims or from high-risk neighborhoods. Similarly, the maintenance of shelters and the allocation of resources become more efficient when we know which persons and groups are most vulnerable to homelessness, and their whereabouts.
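To make the mechanism concrete, here is a minimal sketch in Python of the kind of correlation analysis described above. All data, column names and effect sizes are synthetic assumptions invented for illustration; nothing here reflects a real insurer’s variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical customer records: age, a coarse geographic bucket, income.
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "zip_group": rng.integers(0, 10, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})
# Synthetic assumption: claim frequency loosely depends on age and zip bucket.
df["claims"] = rng.poisson(0.5 + 0.01 * (80 - df["age"]) + 0.05 * df["zip_group"])

# The "algorithm" here is nothing more exotic than a correlation matrix:
# it surfaces statistical associations, not causal explanations.
print(df.corr()["claims"].sort_values(ascending=False))
```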
We may also predict their future behavior. If our data show a correlation between a specific behavior – towards insurance claims or the use of shelters – and specific social security numbers, zip codes, ages etc., we can start planning future capacity and initiate preventive activities to reduce the number of claims and the extent of homelessness. Smart, isn’t it? And to the benefit of all of us.
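The predictive step can be sketched the same way. Continuing with the synthetic `df` from the sketch above, a simple classifier flags “high-risk” customers; the model choice, features and risk cutoff are all illustrative assumptions, not a description of any real system.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Reusing the synthetic `df` built in the previous sketch.
X = df[["age", "zip_group", "income"]]
y = (df["claims"] >= 2).astype(int)          # arbitrary "high-risk" cutoff

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The model "predicts future behavior" purely from correlations;
# it says nothing about *why* a customer files claims.
print("test accuracy:", model.score(X_test, y_test))
```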
Lack of transparency and justice
Unfortunately, we are not yet fully aware of the disadvantages of big data analyses, especially their negative impact on economic equality, social mobility and inclusion in our societies. Many decisions are still made by companies and authorities on the basis of knowledge, experience and explanations of causality between human behavior and its consequences. With algorithms, human judgement and arguments of causality disappear. Algorithms merely show correlations between a number of variables; they do not explain causality – nor do they have an eye for subjective elements in the real world. That creates a risk of wrong or inaccurate assessments and decisions.
Yet another risk is linked to the design and choice of variables. If a prejudice is inherent in the variables – for instance, one connecting a zip code, and thereby the frequency of insurance claims or the use of shelters, with the ethnic background of a person or a group – this prejudice will be repeated in the data analyses and in the subsequent classifications and decisions. This may lead to exclusion from insurance or shelter services, and to discrimination.
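How such a prejudice survives even when the sensitive attribute is removed can be shown with a small sketch. It constructs a deliberately biased synthetic world in which zip code correlates with group membership; every number and variable name is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Protected attribute (e.g. ethnic background) – NOT given to the model.
group = rng.integers(0, 2, size=n)
# Assumed residential segregation: group largely determines the zip bucket.
zip_group = np.where(group == 1, rng.integers(7, 10, n), rng.integers(0, 3, n))
# Historically biased label: claims are flagged more often in those buckets.
flagged = (rng.random(n) < 0.10 + 0.20 * (zip_group >= 7)).astype(int)

# Train on zip code alone; the protected attribute never enters the model.
model = LogisticRegression().fit(zip_group.reshape(-1, 1), flagged)
risk = model.predict_proba(zip_group.reshape(-1, 1))[:, 1]

# Yet the risk scores differ by group, because zip code acts as a proxy:
# the prejudice baked into the variables is faithfully repeated.
print("mean predicted risk, group 0:", risk[group == 0].mean())
print("mean predicted risk, group 1:", risk[group == 1].mean())
```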
In other words, algorithms are opaque in relation to variables and classification schemes, and they hide the (legal) basis for decision-making. That is neither objective nor fair, and it does not ensure correct and just results.
A need for data ethics
What can we do about that? One solution is to introduce audit schemes for algorithms designed to fulfill tasks in companies and authorities, whenever the applied variables risk infringing privacy, causing discrimination or social exclusion, or leading to unauthorized access to data by persons for whom the data are not relevant or necessary. Such an audit requirement complies nicely with the demands on data processing in the upcoming EU Regulation on data protection (the GDPR), which introduces a legal obligation to conduct data protection impact assessments when using technologies to process high-risk data.
Another – or supplementary – solution is to make data ethics an imperative in the process of developing algorithms. This very important task can only be accomplished through the active engagement of the managers and civil servants responsible for commissioning big data analyses, the software suppliers and data scientists developing the algorithms, and the caseworkers who apply the results to decide on the provision of services to customers and citizens.
A new and strong legal framework to ensure privacy in the public and private sector will be in place very soon. But to be effective, the protection of privacy must also be ensured in daily practice. Systematic and continuous data-ethical considerations are an important element in that respect. So let us introduce a mandatory principle: no algorithms for big data analyses of personal data without integrating data ethics.