Linking and merging of data

Under the Commonwealth arrangements an integrating authority is responsible for managing the linking and merging of different source datasets.

In this guide, linking refers to the process of finding and determining links between individual records across different datasets. In its simplest form, linking uses a unique identifier (such as an Australian Business Number or a social security number) to match records. However, where a unique identifier is not available or is not of sufficient quality, a proxy identifier, often referred to as a linkage key, may be created. Linkage keys are created from a combination of information common to those records (such as name, address and date of birth).
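As an illustration only, the following minimal Python sketch shows one way a linkage key could be built from common fields and used to pair records. The field names (given_name, family_name, dob), the normalisation rules and the use of a SHA-256 hash are assumptions made for this example; they are not a method prescribed by this guide, and real projects often need probabilistic matching to cope with spelling and data entry variation.

```python
import hashlib
import unicodedata


def normalise(text):
    """Strip accents, case differences and non-alphanumeric characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.upper() if ch.isalnum())


def make_linkage_key(given_name, family_name, date_of_birth):
    """Build a hypothetical linkage key from name and date of birth."""
    parts = [normalise(family_name), normalise(given_name), normalise(date_of_birth)]
    # The separator keeps ("AB", "C") distinct from ("A", "BC").
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def link(dataset_a, dataset_b):
    """Pair records from two datasets that share the same linkage key."""
    index = {}
    for record in dataset_b:
        key = make_linkage_key(record["given_name"], record["family_name"], record["dob"])
        index.setdefault(key, []).append(record)

    pairs = []
    for record in dataset_a:
        key = make_linkage_key(record["given_name"], record["family_name"], record["dob"])
        for match in index.get(key, []):
            pairs.append((record, match))
    return pairs
```

Exact matching on a key like this only links records whose normalised fields agree completely, so the choice of fields and normalisation rules directly affects how many true links are missed.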

For records that can be linked, data merging refers to the process of combining individual records (or information in those records) into an integrated dataset. It is recommended that the integrated dataset be de-identified, unless the use of identified data is required and approved for the purpose of the project and in accordance with relevant legislation. If the unique identifier or the linkage key is retained on the integrated dataset, it is recommended that it be encrypted as an additional safety measure to reduce the risk of identification or re-identification.
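The sketch below illustrates merging a linked pair of records into a single de-identified record. It is an example under stated assumptions rather than a recommended implementation: the list of identifying fields is hypothetical, and keyed hashing (HMAC-SHA-256 from the Python standard library) is used as a stand-in for encryption. Unlike encryption, a keyed hash cannot be reversed, so a project that needs to recover the original key would use a proper encryption library instead.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as identifying in this example.
IDENTIFYING_FIELDS = {"given_name", "family_name", "dob", "address"}


def protect_key(linkage_key, secret):
    """Replace a linkage key with a keyed hash (a stand-in for encryption)."""
    return hmac.new(secret, linkage_key.encode("utf-8"), hashlib.sha256).hexdigest()


def merge_records(record_a, record_b, linkage_key, secret):
    """Combine two linked records, dropping identifying fields."""
    merged = {}
    for record in (record_a, record_b):
        for field, value in record.items():
            if field not in IDENTIFYING_FIELDS:
                merged[field] = value
    # Retain only a protected form of the linkage key on the integrated record.
    merged["protected_key"] = protect_key(linkage_key, secret)
    return merged
```

In this sketch the secret would be held only by the integrating authority, so that a person with access to the integrated dataset cannot recompute the protected key from names and dates of birth.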

During this stage, as part of the linking and merging function, the integrating authority should perform quality checks of the integrated data to ensure that it meets the required standard and is fit for the purpose specified in the project proposal. See data quality.
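As one possible illustration of such checks, the following sketch computes two simple indicators from the linkage keys of two source datasets: the proportion of distinct keys in each dataset that found a link, and any keys that occur more than once within a dataset. These particular measures, and the function name linkage_quality, are assumptions chosen for the example; the checks actually required depend on the standard set for the project.

```python
from collections import Counter


def linkage_quality(keys_a, keys_b):
    """Summarise link rates and duplicate keys for two lists of linkage keys."""
    counts_a = Counter(keys_a)
    counts_b = Counter(keys_b)
    linked = set(counts_a) & set(counts_b)
    return {
        # Share of distinct keys in each dataset that also appear in the other.
        "link_rate_a": len(linked) / len(counts_a) if counts_a else 0.0,
        "link_rate_b": len(linked) / len(counts_b) if counts_b else 0.0,
        # Keys occurring more than once may indicate duplicates or a weak key.
        "duplicate_keys_a": sorted(k for k, n in counts_a.items() if n > 1),
        "duplicate_keys_b": sorted(k for k, n in counts_b.items() if n > 1),
    }
```

A low link rate or a large number of duplicate keys would be a prompt to review the linkage key design or the quality of the source fields before the integrated dataset is released for analysis.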

Data linking protocols

The integrating authority has an important role in managing the increased risk of identification that exists when two or more datasets are integrated. Generally, an integrating authority is chosen because it has the expertise to manage the confidentiality of the integrated dataset and the capability to maintain security. For this reason, while the integrating authority may choose to outsource the creation of a linkage key (see Outsourcing and working in partnership), it is recommended that the integrating authority responsible for the project perform the merging of the analytical data (such as health, education or employment details) to create a new integrated dataset for a defined research purpose.

An important protocol to observe when linking and merging data is the need-to-know principle, a fundamental rule of protective security required by the Australian Government Protective Security Policy Framework. To ensure that only those people who need to access the data are given that access, access controls, audit trails and other measures need to be in place. For more information see the data security section.

In the context of a data integration project, the need-to-know principle can be applied to ensure that people working with or using the data only have access to the information needed to perform their role (be it linking the data, merging the data or analysing the data). This is referred to as the separation principle. To protect privacy, applying the separation principle is strongly recommended for all data integration projects involving Commonwealth data, unless access to identified data is needed and approved for the purpose of the project and consistent with applicable legislation.
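The separation principle can be pictured as giving each role its own restricted view of a record. The sketch below is a simplified illustration: the role names follow the three functions mentioned above, but the field names and the mapping of fields to roles are hypothetical, and a real system would enforce this through access controls and audit trails rather than a single function.

```python
# Hypothetical mapping of roles to the fields each role may see.
ROLE_FIELDS = {
    "linker": {"record_id", "given_name", "family_name", "dob"},
    "merger": {"record_id", "diagnosis", "occupation"},
    "analyst": {"diagnosis", "occupation"},
}


def view_for_role(record, role):
    """Return only the fields of a record that the given role needs."""
    if role not in ROLE_FIELDS:
        raise PermissionError("No access defined for role: " + role)
    allowed = ROLE_FIELDS[role]
    return {field: value for field, value in record.items() if field in allowed}
```

In this arrangement the person linking records sees identifying information but no analytical content, while the analyst sees analytical content with no identifying information and no record identifier.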
