An earlier post highlighted several of the difficulties that Australian researchers face. One of the most prominent is access to health and medical research (HMR) data. The problem is not a lack of data, or of associated metadata directories such as AIHW’s Meteor, but a lack of transparency around the custodianship of the many health datasets that exist and the conditions under which they can be used for research. Much of this stems from the fact that a large number of agencies and entities play different roles in the management of HMR data.
Stewards and Custodians
In their quest to access data, researchers often come into contact with a variety of entities. Two of the most common of these are data custodians and data stewards. Data Stewards are usually responsible for data content, context and associated business rules. For Data Custodians, let’s turn to the definition provided by the Australian Government’s Statistical Agency:
"Data custodians are agencies responsible for managing the use, disclosure and protection of source data used in a statistical data integration project. Data custodians collect and hold information on behalf of a data provider (defined as an individual, household, business or other organisation which supplies data either for statistical or administrative purposes). The role of data custodians may also extend to producing source data, in addition to their role as a holder of datasets.
For any given data integration project (or family of projects) involving Commonwealth data, there may be one or more data custodians. These may be from the same organisation or from separate institutions, will include at least one Commonwealth agency, and may include state/territory agencies and non-government organisations such as universities and private sector businesses."
As per this definition, a data custodian has to:
- Hold the data on behalf of a data provider
- Protect source data
- Produce source data when required
- Maintain data in a usable form for statistical data integration projects
- Maintain information related to the provenance of data, where it came from, its original purpose, and how it was collected
- Publish metadata suitable for discovery of the data by other interested parties
While these definitions seem simple, the reality is less so. For instance, when more than one dataset is involved, the term data custodian takes on a distinctly elastic set of meanings, which makes the HMR data landscape rather murky. Data linkage units such as CHeReL or Data Linkage (Western Australia) often act on behalf of custodians in ensuring that the privacy of unit records is protected. At other times these linkage units work as brokers, negotiating with other custodians and stewards to generate appropriate linkage keys for specific research projects. However, their roles do not extend to storing linked records in secure infrastructures: other agencies such as the Sax Institute step in to provide researchers with secure access.
Where is the data?
Lots of Data, Lots of Processes, Not enough Information
The sheer number of different entities that researchers have to interact with would be complicated enough if researchers simply wanted to access a single dataset. However, many researchers working on longitudinal studies and multi-disciplinary research need to link multiple datasets or need access to linked datasets from a variety of sources. As a result, a number of different data custodians enter the arena. Each has their own distinct set of processes and protocols for researchers to satisfy, as well as their own fees for data use – leading to a significant amount of time wasted preparing applications and negotiating for access to data.
When researchers want to link datasets sourced from multiple states and jurisdictions, as is usually the case, the process becomes vastly more time-consuming: it often takes several months or even years to gain access to all the datasets containing the variables specified in the original research protocol. More importantly, the wide variety of custodians and data holdings means that researchers are often unaware of where to look for the data they need, or which datasets might be available to inform their research. Even though agencies like ANDS and the Australian Institute of Health and Welfare (AIHW) list their data holdings, and the Population Health Research Network (PHRN) lists several linkage units as partners, the information is still fragmented, as none of them provides a comprehensive view of HMR data resources to a researcher.
As such, researchers are left wondering: where do we go?
Australia has good-quality digital health data at both the individual and population levels. The importance of access to data, and the problems associated with gaining it, were highlighted by the many submissions to the Productivity Commission's recent inquiry into Data Availability and Use. In order to truly harness these data assets, the Productivity Commission’s Inquiry report on Data Availability and Use, released on 8 May 2017, proposes a new "Data Sharing and Release Act, and a National Data Custodian to guide and monitor new access and use arrangements, including proactively managing risks and broader ethical considerations around data use."
To implement these recommendations successfully, we need a single central agency or resource that can provide researchers with a comprehensive directory of every data resource in the country. The directory should include:
- The health data source and its associated metadata directory (AIHW’s Meteor is one such example)
- The custodians who can grant access to the data
- Data linkage units (as listed by PHRN), the linkage keys they provide, and their jurisdiction
- Ethics clearances required for the use of specific datasets, including linked data
- Conditions governing the use of specific datasets, including the retention policy required to reproduce the research
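To make the shape of such a directory more concrete, here is a minimal sketch of what a single listing might look like as a record. Everything in it is an assumption made for illustration: the class and field names (`DirectoryEntry`, `LinkageUnit`, `find_by_jurisdiction`) are hypothetical, and the sample entries are placeholders rather than real datasets, custodians or linkage units.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageUnit:
    """A data linkage unit, the keys it provides, and where it operates."""
    name: str
    jurisdiction: str
    linkage_keys: tuple[str, ...] = ()


@dataclass(frozen=True)
class DirectoryEntry:
    """One listing in the hypothetical directory, one field per item above."""
    dataset: str
    metadata_directory: str
    custodians: tuple[str, ...]
    linkage_units: tuple[LinkageUnit, ...] = ()
    ethics_clearances: tuple[str, ...] = ()
    conditions_of_use: tuple[str, ...] = ()
    retention_policy: str = ""


def find_by_jurisdiction(entries, jurisdiction):
    """Return entries with at least one linkage unit in the given jurisdiction."""
    return [
        entry
        for entry in entries
        if any(unit.jurisdiction == jurisdiction for unit in entry.linkage_units)
    ]


# Placeholder listings only; none of these names refer to real holdings.
entries = [
    DirectoryEntry(
        dataset="Example admissions dataset",
        metadata_directory="Example metadata directory",
        custodians=("Example custodian agency",),
        linkage_units=(
            LinkageUnit("Example linkage unit A", "State A", ("example-key",)),
        ),
        ethics_clearances=("Example ethics committee approval",),
        conditions_of_use=("De-identified data only",),
        retention_policy="Retain for the period set by the custodian",
    ),
    DirectoryEntry(
        dataset="Example registry dataset",
        metadata_directory="Example metadata directory",
        custodians=("Example custodian agency", "Example university"),
        linkage_units=(LinkageUnit("Example linkage unit B", "State B"),),
    ),
]

for entry in find_by_jurisdiction(entries, "State A"):
    print(entry.dataset, "->", ", ".join(entry.custodians))
```

With every listing in one consistent structure, a researcher could ask a single question, such as which datasets can be linked in a given jurisdiction and who grants access to them, instead of piecing the answer together from several separate sources.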
Ideally, such an agency should be empowered with appropriate regulatory policies and tasked with the responsibility to enable access to de-identified data that was created for research purposes. With appropriate ethics clearances in place, enabling re-use of research data can not only save valuable resources, but also assist other researchers to both validate and advance earlier research findings.