
The Importance of PII/PHI Protection in Healthcare

Background Summary

This article explains how a healthcare data team secured PII/PHI in an Azure Databricks Lakehouse using Medallion Architecture. It covers encryption at rest and in transit, column-level encryption, data masking, Unity Catalog policies, 3NF normalization for RTBF, and compliance anchors for HIPAA and CCPA.

In healthcare, trust starts with how you protect patient data. Every lab result, claim, and encounter adds to a record that links back to a person. If that link leaks, the cost goes beyond regulatory penalties: it erodes patient confidence and disrupts care coordination.
In 2024, U.S. healthcare reported 725 large breaches, and PHI for more than 276 million people was exposed. That is an average of over 758,000 healthcare records breached per day, which shows how urgent this problem has become.
With cloud analytics and healthcare data lakes now standard, teams must protect Personally Identifiable Information (PII) and Protected Health Information (PHI) through the entire pipeline while meeting HIPAA, CCPA, and other rules.
This article shows how we secured PII/PHI on Azure Databricks using column-level encryption, data masking, Fernet with Azure Key Vault, and Medallion Architecture across Bronze, Silver, and Gold layers. The goal is simple: keep data useful for analytics, safe for patients, and compliant for auditors. Microsoft and Databricks outline the technical controls for HIPAA workloads, including encryption at rest, encryption in transit, and governance.

The Challenge: Securing PII/PHI in a Cloud Data Lake

Healthcare data draws attackers because it combines identity with clinical context. The largest U.S. healthcare breach to date affected about 192.7 million people through a single vendor incident and disrupted claims processing nationwide. The lesson for data leaders is clear: plan for data loss, lateral movement, and recovery, not only for perimeter events.

Our needs were twofold:

  • Data security
    Protect PII/PHI as it moves from ingestion to analytics and machine learning.
  • Compliance
    Meet HIPAA, CCPA, and internal standards without slowing down reporting.

We adopted end-to-end encryption and column-level security and enforced them per layer using Medallion Architecture:

Bronze

Raw, encrypted data with rich lineage and tags.

Silver

Cleaned, standardized, 3NF-normalized data with PII columns clearly marked.

Gold

Aggregated, masked datasets for BI and data science, with policy-driven access and role-based access control.

For scale, we added Unity Catalog controls and policy objects that apply at schema, table, column, and function levels. This helps enforce row filters and column masks without custom code in every job.
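
As a rough illustration of what such policy objects look like, the sketch below builds the Unity Catalog SQL for one column mask and one row filter. It is a minimal example under stated assumptions, not the policies used in this project: every table, function, group, and value name passed in is hypothetical, and the generated statements would be run with `spark.sql(...)` on a Unity Catalog-enabled workspace.

```python
# Sketch only: all table, function, and group names are hypothetical.
from typing import List


def column_mask_statements(table: str, column: str, mask_fn: str, allowed_group: str) -> List[str]:
    """Build SQL that defines a masking function and binds it to one column."""
    create_fn = (
        f"CREATE OR REPLACE FUNCTION {mask_fn}(val STRING) "
        f"RETURN CASE WHEN is_account_group_member('{allowed_group}') "
        f"THEN val ELSE '****' END"
    )
    bind = f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {mask_fn}"
    return [create_fn, bind]


def row_filter_statements(table: str, column: str, filter_fn: str,
                          allowed_group: str, visible_value: str) -> List[str]:
    """Build SQL that defines a row filter function and binds it to a table."""
    create_fn = (
        f"CREATE OR REPLACE FUNCTION {filter_fn}(val STRING) "
        f"RETURN is_account_group_member('{allowed_group}') OR val = '{visible_value}'"
    )
    bind = f"ALTER TABLE {table} SET ROW FILTER {filter_fn} ON ({column})"
    return [create_fn, bind]


mask_sql = column_mask_statements(
    "main.gold.patients", "ssn", "main.gold.mask_ssn", "phi_readers"
)
filter_sql = row_filter_statements(
    "main.gold.claims", "region", "main.gold.region_filter", "claims_admins", "US"
)
```

Because the mask and filter are attached to the table itself, every job and dashboard that reads the table gets the same behavior without carrying its own masking logic.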

Protecting PII/PHI: Encryption at Every Stage

We used three layers of protection so PII/PHI stays safe and still usable.

Encryption in Transit

Data travels over TLS from sources to Azure Databricks. Traffic between worker nodes in a cluster is not encrypted by default; where a workload requires it, Databricks supports enabling AES-256 encryption over TLS 1.3 through a cluster init script. This reduces exposure during shuffle or broadcast.

Encryption at Rest

Raw data in Bronze and refined data in Silver/Gold stay encrypted at rest with AES-256 using Azure Storage Service Encryption. Azure’s model follows envelope encryption and supports FIPS 140-2 validated algorithms. This helps satisfy the encryption-at-rest controls commonly required for HIPAA workloads.

Column-Level Encryption

This is the last mile. We encrypted specific fields that contain PII/PHI.

  • Identify sensitive columns. With data owners and compliance teams, we tagged names, contact details, SSNs, MRNs, and any content that can re-identify a person.
  • Fernet UDFs on Azure Databricks. We used Fernet in a User-Defined Function so encryption is non-deterministic. Each Fernet token embeds a random initialization vector, so the same input encrypts to different outputs, which reduces linking risk across tables.
  • Azure Key Vault for key management. We stored encryption keys in Azure Key Vault and used Databricks secrets for retrieval. We set rotation, separation of duties, and least privilege to keep access tight. Microsoft documents customer-managed key options for the control plane and data plane.
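
The column-level steps above can be sketched as follows. This is a minimal example using the `cryptography` package, not the exact UDF from this project. The helper names (`make_cipher`, `encrypt_value`, `decrypt_value`) are illustrative, and a locally generated key stands in for the key that would come from a Key Vault-backed secret scope.

```python
from typing import Optional, Sequence

from cryptography.fernet import Fernet, MultiFernet


def make_cipher(keys: Sequence[bytes]) -> MultiFernet:
    """The first key encrypts; every key is tried on decrypt, which supports rotation."""
    return MultiFernet([Fernet(k) for k in keys])


def encrypt_value(cipher: MultiFernet, value: Optional[str]) -> Optional[str]:
    """Encrypt one column value, passing NULLs through unchanged."""
    if value is None:
        return None
    return cipher.encrypt(value.encode("utf-8")).decode("ascii")


def decrypt_value(cipher: MultiFernet, token: Optional[str]) -> Optional[str]:
    """Decrypt one token, passing NULLs through unchanged."""
    if token is None:
        return None
    return cipher.decrypt(token.encode("ascii")).decode("utf-8")


# Assumption: on Databricks the key would be read from a secret scope, e.g.
#   key = dbutils.secrets.get(scope="phi-scope", key="fernet-key").encode()
# The scope and key names are hypothetical. A generated key stands in here.
key = Fernet.generate_key()
cipher = make_cipher([key])

token_a = encrypt_value(cipher, "123-45-6789")
token_b = encrypt_value(cipher, "123-45-6789")

# Rotation: a new key goes first, the old key stays so existing tokens still decrypt.
rotated_cipher = make_cipher([Fernet.generate_key(), key])
recovered = decrypt_value(rotated_cipher, token_a)
```

In a Spark job, `encrypt_value` could be wrapped with `pyspark.sql.functions.udf` and a `StringType()` return type. One design consequence is worth noting: because the tokens differ on every call, an encrypted column cannot serve as a join key, so joins need a separate surrogate key.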

Together, these patterns form our Azure Databricks PII encryption approach and support HIPAA control mapping.

Identifying PII in Healthcare Data: A Collaborative and Automated Approach

PII Storage

  • Collaboration with business teams
    Subject-matter experts show which fields matter most for care and billing. They confirm what counts as PII/PHI by dataset and by jurisdiction, since a payer file and an EHR table carry different fields and retention rules. We document these rules in a data catalog entry and bind them to Unity Catalog policies.
  • Automated Python scripts for data profiling
    Our scripts look for regex patterns, outliers, and value density that point to contact info or identifiers. We score each column for PII likelihood and tag it at ingestion. We also write the score and the supporting evidence to the catalog. That way, audits can see when we marked a column and why.
  • Analyzing nested data for sensitive information
    Clinical feeds often arrive as JSON or XML with nested groups. We flatten with stable keys, then scan inner nodes. We also search free-text fields for names or IDs. The same rules apply: detect, tag, then protect.
  • What we do with tags
    Tags flow into policies for masking, access control, and key selection. This reduces manual steps and keeps rules consistent as teams add new feeds.
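
The profiling step can be sketched roughly as below. It covers only the regex part: each column is scored by the share of non-empty values that match a pattern, and columns above a threshold are tagged. The patterns, the 0.8 threshold, and the function names are assumptions for illustration, and outlier and value-density checks are left out.

```python
import re

# Illustrative patterns only; real feeds need source- and locale-specific rules.
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
}


def score_column(values):
    """Return (label, score): the best-matching pattern and its match rate."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return None, 0.0
    best_label, best_score = None, 0.0
    for label, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        score = hits / len(non_empty)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


def tag_columns(sample, threshold=0.8):
    """Tag every column whose best match rate reaches the threshold."""
    tags = {}
    for column, values in sample.items():
        label, score = score_column(values)
        if label is not None and score >= threshold:
            # The score is kept as evidence so an audit can see why the tag was set.
            tags[column] = {"pii_type": label, "score": round(score, 2)}
    return tags


sample = {
    "member_ssn": ["123-45-6789", "987-65-4321", None],
    "contact": ["a@example.com", "b@example.org"],
    "visit_count": [1, 2, 3],
}
tags = tag_columns(sample)
```

Here `tags` holds entries for `member_ssn` and `contact` only; `visit_count` matches no pattern and stays untagged.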

This practice underpins data governance in healthcare and makes PII/PHI classification repeatable.
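
For the nested-data step described above, a minimal sketch looks like this. It flattens a parsed JSON record into stable path keys, then searches every string leaf, including free text, for an identifier pattern. The record layout and the single SSN pattern are hypothetical, and XML feeds would need to be parsed into a comparable structure first.

```python
import re

# Illustrative pattern: finds an SSN-shaped value anywhere inside a string.
SSN_ANYWHERE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def flatten(node, prefix=""):
    """Yield (path, value) for every leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, node


def find_sensitive_paths(record):
    """Return the paths of string leaves that contain an SSN-shaped value."""
    return [
        path
        for path, value in flatten(record)
        if isinstance(value, str) and SSN_ANYWHERE.search(value)
    ]


# Hypothetical clinical record with a structured identifier and a free-text note.
record = {
    "patient": {
        "id": "p1",
        "identifiers": [{"type": "SSN", "value": "123-45-6789"}],
    },
    "note": "Pt called, SSN 987-65-4321 confirmed.",
}
paths = find_sensitive_paths(record)
```

The returned paths can then be tagged and protected in the same way as flat columns.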