Over the last decade, the government has linked together vast amounts of personal data it holds on nearly every New Zealand resident. That data is driving policy and is regularly shared with external researchers. Some argue it’s a necessary feature of a modern democracy, but critics say it increases the power of the state over its citizens.
What if you could predict if a child would succeed in life?
From birth, you could estimate how likely they were to finish school, and list the precise factors most likely to cause them to fail. You could anticipate the social services they would use throughout their life, how much those services would cost, and the benefits of intervening early.
This idea underpinned the Equity Index, a method of deciding school funding that will replace the decile system next year.
The decile system categorised schools based on how many of their students came from poor neighbourhoods, a clumsy method that equates students with their environment. The Equity Index doesn’t care what neighbourhood a student comes from; it cares about who they are, as individuals.
This requires incredible amounts of data: The ability to distil a child’s life, in all its breadth and richness, into categories for statistical analysis. That data must sit atop a model that projects how a child’s life is likely to play out, based on the families they are born into, and other essential characteristics of their being. It must be applied to nearly every child in the country.
Where will all this data come from? The government already has it.
The Integrated Data Infrastructure (IDI) is a colossal trove of information. It holds personal data about virtually every New Zealander, and millions more who have lived, and died, here.
Most of its data is administrative, meaning it comes from government agencies that collect personal information in the course of their day-to-day work. If you receive a benefit, it is recorded; if you go to a doctor or visit a hospital, it is recorded; where and when you went to school, and for how long, is recorded.
This information used to be tightly held by the organisation that collected it. It was rarely shared with other agencies, and virtually never with outside parties.
The point of collecting this information wasn’t to analyse it; it was to deliver a government service.
One reason for this was the law. Under the Privacy Act, no two agencies can use the same identifier for an individual, which is meant to prevent the existence of a population register. It’s why you might have an NHI number, an IRD number, an ACC number, a passport number and an MSD client number that are all different – one cannot be directly connected to the other.
But there’s a way around this. You can link different data sets together probabilistically – meaning on the basis they are probably the same person – based on other information, such as their name, sex, date of birth, and address. In a small country like New Zealand, that’s usually enough to identify someone.
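To make the idea concrete, here is a minimal sketch of how two records with no shared identifier might be scored as "probably the same person". It is not Stats NZ's method: the field weights, the 0.85 threshold and the example records are all invented for illustration, and the score is a simple weighted string similarity rather than a full probabilistic linkage model.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    name: str
    sex: str
    date_of_birth: str  # ISO format, e.g. "1990-04-12"
    address: str


# Hypothetical weights: how much each field counts towards a match.
WEIGHTS = {"name": 0.4, "sex": 0.1, "date_of_birth": 0.3, "address": 0.2}


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(a: Record, b: Record) -> float:
    """Weighted average of the per-field similarities."""
    return sum(
        weight * similarity(getattr(a, field), getattr(b, field))
        for field, weight in WEIGHTS.items()
    )


def probably_same_person(a: Record, b: Record, threshold: float = 0.85) -> bool:
    """Treat two records as one person if the score clears the (assumed) threshold."""
    return match_score(a, b) >= threshold


# Invented example records: a tax record, a health record with a typo and an
# abbreviated address, and an unrelated person.
tax = Record("Aroha Ngata", "F", "1990-04-12", "12 Kowhai Street, Hamilton")
health = Record("Aroha Ngatta", "F", "1990-04-12", "12 Kowhai St, Hamilton")
other = Record("Tom Smith", "M", "1975-11-02", "4 Rimu Road, Dunedin")

print(probably_same_person(tax, health))  # the near-identical pair
print(probably_same_person(tax, other))   # the unrelated pair
```

The threshold is the key design choice: set it too low and different people are merged into one, set it too high and one person's records are left unlinked.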
These details – which are taken from birth, tax, and visa records – form the “spine” of the IDI, which contains information for around 9m living people. Other data sets, called nodes, are linked to individuals in the spine, but not each other; think of planets orbiting a star.
Data in the IDI is “de-identified”, meaning some information is encrypted (replaced by a number) or removed. Identifying information is still held by Stats NZ, but is only used to link the data together. To any researcher using the IDI, the people they’re studying are lines on a spreadsheet.
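A rough sketch of what de-identification could look like in practice follows. The field names and the random nine-digit surrogate are assumptions made for illustration, not a description of Stats NZ's actual process. The point is that the custodian keeps a private table connecting real identities to surrogate numbers, while researchers see only the surrogate.

```python
import secrets

# Assumed names for the identifying fields that are stripped out.
IDENTIFYING_FIELDS = ("name", "date_of_birth", "address")


def deidentify(rows):
    """Split rows into a private key table and a research-safe copy."""
    key_table = {}      # identity -> surrogate number; stays with the custodian
    used = set()        # surrogate numbers already handed out
    research_rows = []
    for row in rows:
        identity = tuple(row[field] for field in IDENTIFYING_FIELDS)
        if identity not in key_table:
            surrogate = secrets.randbelow(10**9)
            while surrogate in used:  # retry on the rare collision
                surrogate = secrets.randbelow(10**9)
            used.add(surrogate)
            key_table[identity] = surrogate
        # Copy everything except the identifying fields, then attach the surrogate.
        safe = {k: v for k, v in row.items() if k not in IDENTIFYING_FIELDS}
        safe["person_id"] = key_table[identity]
        research_rows.append(safe)
    return key_table, research_rows


# Invented example rows: two events for one person, one for another.
rows = [
    {"name": "Aroha Ngata", "date_of_birth": "1990-04-12",
     "address": "12 Kowhai Street", "event": "hospital visit"},
    {"name": "Aroha Ngata", "date_of_birth": "1990-04-12",
     "address": "12 Kowhai Street", "event": "benefit payment"},
    {"name": "Tom Smith", "date_of_birth": "1975-11-02",
     "address": "4 Rimu Road", "event": "school enrolment"},
]
key_table, research_rows = deidentify(rows)
```

Because the same person always receives the same surrogate, their records can still be followed over time without a name ever appearing in the research copy.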
Data linking started in the mid-1990s, but the IDI only became a fully-fledged product within Stats NZ in 2013. It quickly became the engine of “social investment”, the data-driven, highly-targeted approach to social issues favoured by the then National Government.
It has become a valuable piece of kit for many government agencies, used to guide policy and better target how their services are delivered. It has become easier to use. One of its most useful functions is the ability to track (de-identified) people over time, through their interactions with the state.
Let’s say you wanted to know what happened to children after they were subject to an abuse notification from Oranga Tamariki (or its precursor, CYF). Using the IDI, you could follow each child’s government interactions in the years afterwards. How many went to prison, or went on a benefit? At what age did they have children of their own? How much did each of those interactions cost the state? With the click of a button, you can reconstruct someone’s life, through the lens of their interactions with the government.
In 2015, a team within Treasury was set up to use the IDI for analysis.
One of its first projects looked at the factors among children that were linked to poor outcomes later in life.
From the vast amount of information in the IDI, it found four characteristics most closely linked to poor long-term outcomes:
* Having a finding of abuse or neglect, or having spent time in the care of child protection services.
* Having spent most of their lifetime supported by benefits.
* Having a parent who has received a community or custodial sentence.
* Having a mother with no formal qualifications.
In another paper, the researchers found that some five-year-olds – around 600 each year – had three of those risk factors. Those five-year-olds would, on average, cost the state $320,000 each before they turned 35.
This test case was heralded by the government, which believed this information could be used to target social services towards those most at-risk, potentially changing the trajectory of their lives whilst getting better returns for taxpayers.
Treasury itself highlighted how this type of analysis could have enormous implications. In one paper, it argued that the financial benefits of data-driven changes to social services were potentially equivalent to raising the superannuation age to 67.
With data-driven, evidence-based policy ruling the roost, a prime target sat exposed in the education sector.
The decile system was unpopular, and many had pointed out its flaws. It was nevertheless true that higher decile schools produced better test scores than lower decile schools, producing the common belief that higher decile schools were superior.
But this isn’t true: There is virtually no difference in school quality across decile levels.
How do we know that? The IDI.
One group that studied this question was the New Zealand Initiative (NZI). The think-tank wanted to know what happened if you separated a student’s family background – their parents’ education history, their household income, any notifications to CYF, and so on – from the school they went to.
“What you want to be able to do is statistically adjust for all of these things to get a fairer picture of how each school is performing,” says Dr Eric Crampton, chief economist of the NZI.
“The IDI provides a perfect way of doing that because you can link the students in each school to all of their NCEA records, and you can even cast forward to trace how they do after school.”
For about a year, one of the group’s researchers trawled through the IDI, cleaning the data and devising a method to separate a school’s performance from its students’ family backgrounds.
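The statistical adjustment Crampton describes can be sketched in a few lines. This is a deliberately simplified stand-in, not the NZ Initiative's method: it assumes family background can be collapsed into a single hypothetical "background index", fits one straight line across all students, and treats each school's average leftover (the residual) as its contribution. The numbers are invented toy values.

```python
from collections import defaultdict
from statistics import mean


def school_value_added(students):
    """students: list of (school, background_index, score) tuples.

    Fits score = intercept + slope * background_index across all students,
    then returns each school's mean residual from that line.
    """
    xs = [background for _, background, _ in students]
    ys = [score for _, _, score in students]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar

    residuals = defaultdict(list)
    for school, background, score in students:
        predicted = intercept + slope * background
        residuals[school].append(score - predicted)
    return {school: mean(values) for school, values in residuals.items()}


# Toy data: school A draws from advantaged families, school B does not.
students = [
    ("A", 8, 80), ("A", 9, 90), ("A", 10, 100),
    ("B", 1, 10), ("B", 2, 20), ("B", 3, 30),
]
raw_means = {
    school: mean(score for s, _, score in students if s == school)
    for school in ("A", "B")
}
value_added = school_value_added(students)
```

In this toy case the raw averages are far apart, yet once background is accounted for neither school adds or subtracts anything, which is the pattern the article describes.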
The resulting report, released in 2019, showed that a child’s family circumstances almost entirely predicted their school performance.
Government officials came to a similar conclusion. Kernels of this thinking had, by 2017, led to the Risk Index, the government’s planned alternative to the decile system. It drew heavily on the work on poor social outcomes done by Treasury; one 2017 paper underpinning its development, released to Stuff under the Official Information Act, looked at numerous factors that could be included in the new index. It shared some of the same authors as Treasury’s earlier work.
In that paper, the authors examined around 50,000 children who were born in 1998, and tracked them through the IDI to the point where they either passed or failed NCEA Level 2. What factors in their early lives best predicted whether they would fail?
There were five that stood out. The proportion of time the child was supported by a benefit; being male; being Māori; mother’s age at first child; and mother’s education.
The Risk Index was dropped after a change in government, and replaced with the slightly modified Equity Index.
By 2019, it included 26 variables, a number that has since expanded to 37. They mostly concern personal information about the child’s parents: their benefits, their criminal history, their age when the child was born, and their qualifications. Some variables have yet to be announced, and more are likely to be added over time.
Documents reveal that more variables were considered – and could technically have been added, because they were in the IDI – but were rejected due to the “high risk of potential stigma”. Among them were whether the child had suffered an accident at home (such as poisoning), and the number of times a parent had been hospitalised for mental health reasons, including self-harm.
The IDI is not a secret, but it’s not well publicised, either.
It doesn’t exist in legislation, so it has never been subject to widespread public consultation: it was developed under the 1975 Statistics Act, which gives the Government Statistician oversight of data within the government, if that data has been voluntarily provided by other agencies.
Several public surveys have found low public awareness of the IDI, and moderate levels of concern about the government sharing personal data.
About 1000 people are accredited to use the IDI, and it’s used for hundreds of research projects each year. It includes valuable research into the needs of children, people with disabilities, and the rainbow community. It has better targeted social spending to those who need it most. It was even used for pandemic spread modelling that guided the government’s Covid-19 response.
Thanks to the IDI, we can estimate the disease burden for families that live in unsafe or substandard housing. We know that women who take time off work to have children are penalised with lower wages when they re-enter the labour market, and that commuting patterns have changed significantly over time. We even know workplace accidents disproportionately happen on Mondays.
It has been regularly described as “world-leading”; a way to undertake high-level analyses of trends, without compromising individual privacy.
To its advocates, the potential of this accelerating data usage is dazzling. Those in charge of public policy are no longer fumbling in the dark; they have cold, hard numbers, extracted from real people and rendered on a spreadsheet.
“In government, we use it to target service delivery, to inform policy, to evaluate whether policies are working or not… it’s the way that we direct funding into schools and hospitals and roads,” says Dr Craig Jones, the deputy Government Statistician.
“Without research and statistics, we’d be flying blind. All we’d really have is guesswork and good intentions.”
Journalist Keith Ng has described this data-driven style of government as a “Datatopia”: A country awash with data, a resource to be tapped for evidence-based policies that use the language of science to describe the actions of the government.
The reality, as Ng wrote, is more complicated. Data has limitations; like any powerful tool, it can be misused.
While the development of the Equity Index reveals the sophistication of the government’s data operation, it also shows the breadth of its access to personal information.
The IDI includes standard, non-controversial administrative data about individuals, but it may also know if they’ve ever tried to commit suicide, or been prescribed medication for a mood disorder. If someone is elderly and in residential care, it may know if they’re incontinent, or showing early signs of dementia. If a child has been assessed under a public service such as B4 School Check, it may know how hyperactive they are, and the state of their teeth.
The amount of data in the IDI has become so vast that a search tool was developed to help researchers narrow down what they can find. It lists more than 50,000 variables – types of information – that can potentially be linked to individuals.
This data can be, and has been, used to make specific predictions about future behaviour. The Ministry of Justice developed a tool allowing it to “estimate the number of offences and victimisations likely to be committed or experienced by each resident of New Zealand between now and the end of their life”.
CYF (now Oranga Tamariki) explored using administrative data to predict which newborn infants were most likely to experience a substantiated claim of abuse.
This can bring an uncomfortable dimension to how this data is collected and shared.
Administrative data – which is most of the IDI’s information – is by definition more likely to cover people who interact with the state. Statistically, this is likely to include Māori, poorer people, and people with ongoing health conditions.
Some researchers have framed the IDI in the context of colonialism; a continuation of the state’s surveillance and control over Māori.
“Although data are de-identified before being made available to researchers, the linking of multiple data sources enables new forms of surveillance that exist outside of ethics and other privacy or consent mechanisms,” researchers Donna Cormack and Professor Tahu Kukutai wrote this year.
“Linkage is generally not based on individuals’ informed consent for their data to be included in the IDI, shared between agencies, or linked to other datasets, and there are few mechanisms for opting out.
“Most data are collected as part of other routine or survey collections, so people may be unaware that the data they provide will be able to be linked to multiple other data sources in a way that allows for them to be tracked over time and across social services.”
Māui Hudson, an Associate Professor at the University of Waikato, says this data can have a “deficit-focus”.
“That sort of information can only tell certain kinds of stories,” he says.
“That’s certainly one of the criticisms about that over-reliance on administrative data… It doesn’t tell the strength-based stories that Te Ao Māori would like to be told.”
It stands in contrast with what the Government doesn’t know. In 2019, Parliamentary Commissioner for the Environment Simon Upton released a report criticising the lack of data about the environment.
“We hear a lot about living in an information age … But when we try to find out what’s happening on our land or what’s happening to our water, there are huge gaps,” he wrote.
Earlier this year, Revenue Minister David Parker observed that authorities “have virtually no idea what rate of tax is paid by the very wealthy”. It’s because the data does not exist – government surveys have never collected data on people worth more than $20m.
There are no such problems for those who use social services, whose personal information forms the foundation of a growing data apparatus – one that knows exactly how much they’ve cost the state.
It’s understood there has been concern within the government about how the public would react to the IDI, if its existence became widely known.
One person familiar with its usage within government, who requested anonymity to speak openly, said it was generally acknowledged that wider public awareness of its existence could lead to blowback.
Partly for that reason, Stats NZ has been measuring its ‘social licence’ to look after all this data.
The term is typically associated with extractive industries such as mining, oil and gas drilling, and forestry, where it describes public acceptance of practices outside the norm.
In 2018, Stats NZ commissioned research into what social licence entailed.
“What the research found is that social licence to operate is based on whether that information is going to be used for purposes that are positive and beneficial, that robust security processes are in place – we can’t have people’s information being unintentionally or illegally disclosed – and that the processes for stewarding information are transparent,” says Dr Craig Jones, the deputy Government Statistician.
“I’m really confident that we have robust processes in place to meet each of those aspects.”
Concerns about these practices have already led to changes.
Some of it has been in response to the Māori Data Sovereignty movement, which argues that data about Māori should be governed by Māori.
Among its principles is that Māori data should be used for the collective benefit of Māori, and should be accessible, accurate, and used with informed consent.
“One of the reasons we started talking about Māori Data Sovereignty was because we were concerned there was this increasing push towards the collection and reuse of data – taking data out of one context, and using it for a whole lot of others,” says Māui Hudson, who is a member of Te Mana Raraunga, the Māori Data Sovereignty Network.
“A good example is the collection of different administrative data across a whole range of agencies, and then repurposing it for researchers to use – which itself may not be inherently bad, but we didn’t feel like there was as much public discussion as might have been required.”
A core component of that work is the purpose of data: What is it used for, and who does it benefit?
“What we really want to see is what sorts of changes can be made to the way in which data is collected, so that it can be refocused towards information sources that Māori think are important or relevant to a Te Ao Māori view of the world,” Hudson says.
“And that’s not necessarily going to be the administrative data, it’s going to be something else.”
In response to those concerns, Stats NZ developed the Ngā Tikanga Paihere framework, which is meant to guide ethical and culturally appropriate data usage.
It lays out further considerations for how data in the IDI is used – researchers must demonstrate how their research will benefit the communities being researched, and that risks are clearly identified in advance.
It has already changed how access to the IDI is granted.
“I think the notion of engaging with people and understanding how they feel is really key to that social licence, and the Ngā Tikanga Paihere framework really gets us into that space,” Jones, from Stats NZ, says.
“Every researcher who conducts research in the IDI about a population or a group of people needs to demonstrate they’ve engaged with that group of people, to understand the community needs and to do the research in a way that provides value back to that community.”
A database containing personal information about nearly every New Zealand resident presents an obvious privacy risk.
Stats NZ is fiercely protective of the IDI, and highly sensitive to privacy threats. Access to it is carefully managed. External researchers are vetted, and have to sign a lifelong declaration of secrecy, promising not to reveal any private information they may see.
They are only given data that pertains to their research, and it can only be accessed from a Data Lab, a specified room with no internet access – the IDI itself is on a separate server. Anything the researchers publish is checked by Stats NZ officials, who can veto any information deemed to be at risk of breaching someone’s privacy.
When a new data set is approved, it is either transferred to Stats NZ over a secure filing system or, in some cases, handed over in person on an encrypted hard drive to a Stats NZ official, who must walk (without stopping) to a Stats NZ office where it can be securely uploaded.
These steps are part of the “Five Safes” framework, an internationally recognised system for managing confidential data.
For some IDI users, the rules have proved frustrating. Rather than increasing trust in how data is collected and used, they could have the opposite effect, turning the IDI into a walled-off garden, accessible only to those with advanced degrees.
The NZ Initiative’s Eric Crampton points to examples in the US where de-identified data can be accessed and analysed by anyone.
“What that does is let communities themselves figure out what’s going on,” he says.
“You don’t have to have a PhD to get in to analyse the data. Anyone with even high school statistics would be able to get something meaningful out of it, using just the web interface to find out what’s going on in their community and what’s working well, and what isn’t.
“Right now, it’s very difficult for people to do that. And that makes it hard to keep social licence for the maintenance of that very confidential data.”
His concern is that adding extra barriers to using data not only makes it harder for researchers to study New Zealanders, but also risks undermining trust in the system as a whole.
“You end up in a spot where data winds up being something that is done to communities, rather than by them or with them … It’s something that’s kept somewhere by people we don’t understand, eggheads go and do something with it, and then policy gets thrown on our community, and we’ve had no part of it,” he says.
“That’s not a healthy relationship with data.”
The IDI is likely to become even more integral to the government.
In the lead-up to the 2018 census, much was made of the fact it would be the first to be primarily filled out online.
The results were catastrophic. Response rates were much lower than expected, and for some groups, so low they were statistically unusable.
As officials scrambled to paper over the gaps, they turned to a handy tool: the IDI.
By feeding the census returns they had from 2018 into the IDI – using names, ages, and home addresses to link the data – they could see who was unaccounted for, and fill in the gaps.
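In spirit, this gap-finding step resembles the sketch below. It is illustrative only: the field names and records are invented, and it matches people on an exact, normalised key of name, age and address, whereas real linking would need to tolerate typos and changes of address.

```python
def link_key(person):
    """Build a normalised (name, age, address) key for matching."""
    def normalise(text):
        return " ".join(text.lower().split())
    return (normalise(person["name"]), person["age"], normalise(person["address"]))


def unaccounted_for(admin_records, census_returns):
    """Return people present in administrative data but missing from the census."""
    counted = {link_key(person) for person in census_returns}
    return [person for person in admin_records if link_key(person) not in counted]


# Invented example records.
admin_records = [
    {"name": "Aroha Ngata", "age": 28, "address": "12 Kowhai Street"},
    {"name": "Tom Smith", "age": 43, "address": "4 Rimu Road"},
    {"name": "Mele Tupou", "age": 35, "address": "9 Totara Avenue"},
]
census_returns = [
    {"name": "aroha  ngata", "age": 28, "address": "12 KOWHAI STREET"},
    {"name": "Tom Smith", "age": 43, "address": "4 Rimu Road"},
]
missing = unaccounted_for(admin_records, census_returns)
print([person["name"] for person in missing])
```

Anyone left in the `missing` list is a person the administrative data knows about but the census never heard from, the group for whom a form could then be filled in.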
For all intents and purposes, the government’s data colossus had come to hold more information on New Zealanders than could be gathered from the Census.
This was particularly true for Māori and Pacific people, who had Census night coverage rates of just 68% and 65% respectively. Administrative data were used for 11% of people within each population (compared to 3% for the NZ European group) – if someone did not fill out a form, whatever their reason, one was filled out for them.
While the experience as a whole was damaging for Stats NZ, the use of administrative data was an unexpected upside.
“Although the extensive use of administrative data in the 2018 Census was not planned, Stats NZ has gained invaluable insight into how administrative data can be used, and its strengths and limitations,” an independent review of the 2018 census found.
“It undoubtedly advanced the research related to the long-term feasibility of delivering a census using administrative data.”
Depending on your view, this is either encouraging progress towards a data-led future, or a chilling glimpse at the rise of the administrative state.
© 2022 Stuff Limited