Understanding Data De-Identification: A Practical Guide
This document from Ontario’s privacy watchdog explains how organizations can protect people’s privacy while still using valuable data. Here’s what you need to know:
What is De-identification?
Think of de-identification as removing the “name tags” from data so you can’t easily figure out who someone is. It’s like publishing survey results without being able to trace answers back to specific people.
Two key processes:
- Pseudonymization: Replacing obvious identifiers (names, addresses, phone numbers) with codes or removing them entirely
- De-identification: Going further to also disguise subtle details (like birthdates, postal codes) that could still identify someone when combined
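To make the difference between the two processes concrete, here is a minimal Python sketch of pseudonymization. The field names, the sample record, and the key are hypothetical, and the keyed hash (HMAC) is just one possible way to generate codes, not a method prescribed by the guidance.

```python
import hashlib
import hmac

# Hypothetical field names; a real dataset would have its own schema.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}

def pseudonymize(record, secret_key, id_field="name"):
    """Return a copy of record with direct identifiers removed and a code added."""
    # A keyed hash gives the same code for the same person without exposing the name.
    digest = hmac.new(secret_key, record[id_field].encode("utf-8"), hashlib.sha256)
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["pseudonym"] = digest.hexdigest()[:12]
    return cleaned

record = {"name": "Jane Doe", "address": "12 Main St", "phone": "555-0100",
          "birth_year": 1985, "postal_code": "M5V 2T6"}
result = pseudonymize(record, secret_key=b"example-key-not-for-real-use")
```

Note that `birth_year` and `postal_code` are still present in `result`. That is the gap full de-identification is meant to close.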
The Core Principle: Risk-Based Approach
De-identification doesn’t make re-identification impossible; it makes it very unlikely. The goal is to reduce the risk to a “very low” level based on what’s reasonably foreseeable, not to achieve zero risk.
Public vs. Private Data Sharing
Public release (like open government data):
- Assumes anyone might try to identify people
- Requires heavy data transformation
- No practical way to enforce rules on users
Private sharing (with specific partners):
- Can assess who’s receiving the data
- Uses contracts and security measures
- Requires less data distortion because controls provide protection
Key Concepts Explained
Direct identifiers: Obvious personal details like names, addresses, health card numbers
Indirect identifiers: Details that seem harmless alone but can identify someone when combined, like birth year, gender, postal code, profession, or education level
The privacy-utility tradeoff: The more you protect privacy by changing data, the less useful it becomes. The art is finding the right balance.
The 12-Step Process
Organizations should:
- Get expert help – This is technical work requiring specialized knowledge
- Define clear purposes – Know why you’re sharing data and with whom
- Determine release type – Public or controlled sharing?
- Classify your data – Identify which fields could reveal identities
- Remove obvious identifiers – Pseudonymize first
- Set risk thresholds – More sensitive data requires stricter protection (typically keeping re-identification risk below 5-9%)
- Measure vulnerability – Calculate how identifiable the data is
- Assess attack likelihood – For private sharing, evaluate the recipient’s security and privacy controls
- Transform the data – Generalize, suppress, or add noise to reduce risk
- Check usefulness – Ensure data still serves its purpose
- Document everything – Create records of your decisions and methods
- Monitor continuously – Re-assess every 2-3 years as new data sources emerge and technology changes
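The threshold, vulnerability, and attack-likelihood steps can be sketched roughly in Python. This is an illustration under stated assumptions, not the official calculation: the field names and records are invented, the 0.09 threshold is simply the upper end of the range quoted above, the 0.3 attack probability is a hypothetical figure, and multiplying the two risks together is an assumed simplification.

```python
from collections import Counter

def max_record_risk(rows, quasi_identifiers):
    """Highest per-record risk: 1 / size of the smallest group sharing the same values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return 1 / min(groups.values())

def overall_risk(data_risk, attack_probability=1.0):
    """Assumed simplification: scale data risk by the chance an attack is attempted."""
    return data_risk * attack_probability

# Invented dataset: 12 people in one group, 20 in another.
rows = (
    [{"birth_decade": "1980-1989", "region": "M5"}] * 12
    + [{"birth_decade": "1990-1999", "region": "M5"}] * 20
)
THRESHOLD = 0.09  # illustrative value only

data_risk = max_record_risk(rows, ["birth_decade", "region"])  # 1/12
public = overall_risk(data_risk)  # public release: assume an attempt is certain
private = overall_risk(data_risk, attack_probability=0.3)  # hypothetical vetted recipient
meets_threshold = public <= THRESHOLD
```

If the smallest group were too small to meet the threshold, the transformation step would be repeated with coarser categories until it did, subject to the usefulness check.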
Common Protection Techniques
- Generalization: Changing “born in 1985” to “born 1980-1989”
- Suppression: Removing unusual values that make someone stand out
- Adding noise: Slightly randomizing numbers like dates or amounts
- Synthetic data: Using AI to create fake but realistic data that maintains patterns without matching real people
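The first three techniques are simple enough to sketch in a few lines of Python. The function names, the decade-wide bands, the minimum count of 2, and the uniform noise are all illustrative choices rather than recommended settings. Synthetic data generation is left out because it needs a trained model.

```python
import random
from collections import Counter

def generalize_year(year):
    """Map a year to its decade range, e.g. 1985 -> '1980-1989'."""
    start = (year // 10) * 10
    return f"{start}-{start + 9}"

def suppress_rare(values, min_count=2, placeholder=None):
    """Replace values that occur fewer than min_count times with a placeholder."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]

def add_noise(amount, spread, rng):
    """Shift a number by a random offset drawn uniformly from [-spread, +spread]."""
    return amount + rng.uniform(-spread, spread)

rng = random.Random(0)  # seeded only so the sketch is repeatable
decade = generalize_year(1985)
jobs = suppress_rare(["nurse", "nurse", "teacher", "teacher", "astronaut"])
noisy = add_noise(1000.0, spread=50.0, rng=rng)
```

In this sketch `decade` is "1980-1989", the lone "astronaut" in `jobs` is replaced by `None`, and `noisy` lands somewhere within 50 of the original amount.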
For Private Data Sharing
Organizations must implement strong controls including: limiting access to authorized staff only, requiring confidentiality agreements, securing data storage, training employees on privacy, monitoring access through audit logs, and having breach response protocols.
Important Warnings
- Simply removing names doesn’t make data safe, and even aggregated or summarized data can still be identifiable
- Linking multiple de-identified datasets together can dramatically increase re-identification risk
- If someone does successfully re-identify data, organizations must verify the claim, notify affected individuals, retrieve datasets where possible, and review their methods
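The linking warning can be illustrated with a toy Python example. Both releases and their shared pseudonymous keys (`p1` to `p4`) are hypothetical. Each release on its own keeps everyone in a group of two, but joining them leaves every person in a group of one.

```python
from collections import Counter

# Two hypothetical releases that share a pseudonymous key.
release_a = {"p1": {"birth_decade": "1980-1989"}, "p2": {"birth_decade": "1980-1989"},
             "p3": {"birth_decade": "1990-1999"}, "p4": {"birth_decade": "1990-1999"}}
release_b = {"p1": {"profession": "nurse"}, "p2": {"profession": "teacher"},
             "p3": {"profession": "nurse"}, "p4": {"profession": "teacher"}}

def smallest_group(rows):
    """Size of the smallest set of rows that share identical values."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return min(counts.values())

# Join the two releases on the shared key.
linked = [{**release_a[key], **release_b[key]} for key in release_a]

size_a = smallest_group(release_a.values())
size_b = smallest_group(release_b.values())
size_linked = smallest_group(linked)
```

Here `size_a` and `size_b` are both 2, while `size_linked` drops to 1.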
The Bottom Line
De-identification is a specialized process that balances privacy protection with data utility. It requires technical expertise, careful documentation, ongoing monitoring, and, for private sharing, strong contractual and security controls. When done properly, it allows valuable data to be used for research, innovation, and public good while keeping people’s personal information protected.

