The Payer Data Classification Gap That's Going to Show Up in Your NPRM Asset Inventory

When you read the HIPAA Security Rule NPRM and realize your Purview environment needs months' worth of work

In the previous post I argued the HIPAA Security Rule NPRM is essentially an autopsy report, with each major mandate traceable to a specific named failure mode from 2024. The one mandate that doesn't trace to a single named breach, but to the entire enforcement pattern OCR has built since October 2024, is proposed § 164.308(a)(1)(i): the written technology asset inventory and the network map illustrating "the movement of ePHI through, into, and out of" your information systems.

The assumption I keep seeing across payer security discourse on NPRM readiness (vendor pitches, webinar Q&As, LinkedIn threads, partner channel briefings, etc.) is that a Microsoft Purview tenant gets you most of the way to that artifact. That if you've turned on the U.S. Health Insurance Act (HIPAA) Enhanced template, applied the Medical and Health Sensitive Information Types (SITs), and let auto-labeling run for a quarter, you've at least built a defensible starting position for the inventory.

I don't think that's true. The reason has almost nothing to do with the quality of Microsoft Purview.

Payer ePHI doesn't look like what DLP engines were built to find. The data that defines a U.S. healthcare payer's daily operating reality lives in structures the 300+ built-in SITs either don't address, address by accident, or address generically enough to produce noise fatigue.

The artifact OCR is going to ask you to produce

Here's the proposed § 164.308(a)(1)(i) text this post is going to keep coming back to. From the preamble in Federal Register doc 2024-30983:

"We propose a standard at 45 CFR 164.308(a)(1)(i) that would require a regulated entity to conduct and maintain an accurate and thorough written technology asset inventory and a network map of its electronic information systems and all technology assets that may affect the confidentiality, integrity, or availability of ePHI."

And from the proposed implementation specification at § 164.308(a)(1)(ii)(B), the network map must illustrate "the movement of ePHI throughout its electronic information systems, including but not limited to how ePHI enters and exits such information systems, and is accessed from outside of such information systems."

The rub is in the elevated written risk analysis at proposed § 164.308(a)(2)(ii)(A), which requires the regulated entity, at minimum, to "review the technology asset inventory and the network map to identify where ePHI may be created, received, maintained, or transmitted within its information systems."

→You cannot review what you cannot inventory

→You cannot inventory what your classification engine cannot identify

Payer ePHI is not a 9-digit number with the word "patient" nearby

If you handed a default DLP engine a sample of the data that flows through a typical payer environment in a typical day, this is roughly what it would see.

EDI X12 transactions

The 837 (Healthcare Claim), 835 (Remittance Advice), 270/271 (Eligibility), and 278 (Prior Authorization) are the four transaction families every U.S. payer touches every day, in volume. They're governed by ASC X12 005010 implementation guides under the HIPAA Administrative Simplification rules. An EDI 837 file looks like this in plain text:

ISA*00*          *00*          *ZZ*SUBMITTER123   *ZZ*RECEIVER456    *250515*1234*^*00501*000000001*0*P*:~
GS*HC*SUBMITTER*RECEIVER*20250515*1234*1*X*005010X222A1~
ST*837*0001*005010X222A1~
BHT*0019*00*0001*20250515*1234*CH~
NM1*85*2*BILLING PROVIDER ORG*****XX*1234567893~
...
NM1*IL*1*DOE*JOHN****MI*W12345678901~
...
CLM*PATIENT123*150***11:B:1*Y*A*Y*Y~
HI*ABK:Z00121~
NM1*82*1*SMITH*JANE****XX*1245319599~
SV1*HC:99213*100*UN*1***1~

A single 837 file routinely contains the rendering provider NPI (NM1*82, qualifier XX), the billing provider NPI (NM1*85), the subscriber's payer member ID (NM1*IL, qualifier MI), the patient control number (CLM01), ICD-10 diagnosis codes (HI segment), and CPT/HCPCS procedure codes (SV1). To a DLP engine looking at the file as plain text, that's an asterisk-delimited stream where a few strings match ICD-10 codes from the built-in dictionary and most strings match nothing in particular.
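To make the point concrete, here is a minimal Python sketch of what "understanding the structure" means for the sample above. It is not how Purview or any DLP engine works internally. It assumes the delimiters shown in the sample (asterisk, colon, tilde), and the function name and output keys are hypothetical. A production parser would read the delimiters from the ISA segment instead of hard-coding them.

```python
# Hypothetical sketch: pull identifier-bearing elements out of an X12 837.
# Assumes the delimiters used in the sample: * element, : component, ~ segment.

SAMPLE_837 = (
    "ST*837*0001*005010X222A1~"
    "NM1*85*2*BILLING PROVIDER ORG*****XX*1234567893~"
    "NM1*IL*1*DOE*JOHN****MI*W12345678901~"
    "CLM*PATIENT123*150***11:B:1*Y*A*Y*Y~"
    "HI*ABK:Z00121~"
    "NM1*82*1*SMITH*JANE****XX*1245319599~"
    "SV1*HC:99213*100*UN*1***1~"
)


def extract_837_identifiers(raw, seg_term="~", elem_sep="*", comp_sep=":"):
    found = {"npis": [], "member_ids": [], "claim_ids": [],
             "diagnoses": [], "procedures": []}
    for segment in raw.split(seg_term):
        elements = segment.strip().split(elem_sep)
        tag = elements[0]
        if tag == "NM1" and len(elements) > 9:
            # NM101 = entity code, NM108 = ID qualifier, NM109 = the ID itself
            entity, qualifier, value = elements[1], elements[8], elements[9]
            if qualifier == "XX":      # XX marks an NPI
                found["npis"].append((entity, value))
            elif qualifier == "MI":    # MI marks a member identifier
                found["member_ids"].append(value)
        elif tag == "CLM" and len(elements) > 1:
            found["claim_ids"].append(elements[1])   # CLM01
        elif tag == "HI":
            for composite in elements[1:]:
                parts = composite.split(comp_sep)    # qualifier:code
                if len(parts) > 1:
                    found["diagnoses"].append(parts[1])
        elif tag == "SV1" and len(elements) > 1:
            parts = elements[1].split(comp_sep)      # HC:procedure code
            if len(parts) > 1:
                found["procedures"].append(parts[1])
    return found


result = extract_837_identifiers(SAMPLE_837)
```

The values only become identifiable once position and qualifier are taken into account. The string 1245319599 is just ten digits until you know it sits in NM109 behind an XX qualifier on an NM1*82 segment.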

HL7 v2 messages

A typical ADT^A04 patient registration message looks like this:

MSH|^~\&|EPIC|EPICADT|iFW|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|
PID|||0493575^^^2^MR||DOE^JOHN^||19480203|M|||...||(216)555-1234|...|123-45-6789|
PV1||O|168^...
IN1|1|ABC123|HFM12345|Blue Cross Blue Shield||||GROUP01|...

PID-3 is the medical record number with its assigning authority. PID-19 carries the SSN in most U.S. implementations. IN1 carries the member's coverage, including the insurance plan ID, the group number, and the member identifier. The pipe is the field separator, the caret-tilde-backslash-ampersand sequence in MSH-2 declares the component, repetition, escape, and sub-component characters, and the order of fields in each segment is positional and rigid.

An ICD-10 code in an OBX result segment will match Purview's built-in ICD-10-CM dictionary. An SSN in PID-19 will match the built-in U.S. SSN SIT. But the structural understanding that this is a registration message containing demographic, financial, and clinical data, and therefore the file as a whole is ePHI to which a sensitivity label should be auto-applied, is not natively encoded anywhere in the default classifier library.
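As a rough illustration of the structural read a pattern engine doesn't do, here is a Python sketch that summarizes an ADT message by position. The sample is a hypothetical, fully populated variant of the message above (the elided fields in the original are simply left empty here), and the function and key names are my own. It assumes segments are separated by carriage returns and does not handle repetitions or escape sequences.

```python
# Hypothetical sample: same shape as the message above, elided fields left empty.
SAMPLE_ADT = "\r".join([
    r"MSH|^~\&|EPIC|EPICADT|iFW|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|",
    "PID|1||0493575^^^2^MR||DOE^JOHN||19480203|M|||||(216)555-1234||||||123-45-6789",
    "IN1|1|ABC123|HFM12345|Blue Cross Blue Shield||||GROUP01",
])


def field(fields, number):
    """Return a field by split index, or '' when the segment is too short."""
    return fields[number] if number < len(fields) else ""


def summarize_hl7_v2(message):
    segments = [s for s in message.replace("\n", "\r").split("\r") if s]
    if not segments or not segments[0].startswith("MSH") or len(segments[0]) < 8:
        raise ValueError("not an HL7 v2 message")
    field_sep = segments[0][3]        # MSH-1 is the separator character itself
    component_sep = segments[0][4]    # first character of MSH-2
    by_tag = {}
    for seg in segments:
        fields = seg.split(field_sep)
        by_tag.setdefault(fields[0], fields)   # keep the first segment per tag
    msh = by_tag["MSH"]
    pid = by_tag.get("PID", [])
    in1 = by_tag.get("IN1", [])
    return {
        # MSH-9 sits at split index 8 because MSH-1 is the separator itself
        "message_type": field(msh, 8),
        "mrn": field(pid, 3).split(component_sep)[0],   # PID-3, first component
        "ssn": field(pid, 19),                          # PID-19
        "plan_id": field(in1, 2),                       # IN1-2
        "group_number": field(in1, 8),                  # IN1-8
    }


summary = summarize_hl7_v2(SAMPLE_ADT)
```

The message type, the MRN, and the coverage fields all come from position, not from anything a keyword or regex match would recognize.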

FHIR Coverage and Patient resources

A FHIR Coverage resource in CARIN Consumer-Directed Payer Data Exchange (C4BB) format or Da Vinci Payer Data Exchange (PDex) format identifies a member like this:

{
  "resourceType": "Coverage",
  "id": "Coverage1",
  "identifier": [{
    "type": {
      "coding": [{
        "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
        "code": "MB",
        "display": "Member Number"
      }]
    },
    "value": "W12345678901"
  }],
  "subscriberId": "W12345678901",
  "dependent": "01"
}

The MB code in Coverage.identifier.type.coding.code is the FHIR semantic anchor for "this is the member number." A general-purpose classification engine looking at this JSON sees a 12-character alphanumeric string somewhere near the word "Member" and somewhere near the word "subscriber." Whether it fires depends entirely on whether you've built a custom SIT to recognize the MB code pattern. Out of the box, you haven't.
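For comparison, this is roughly what a structure-aware read of that resource looks like in Python. It is a sketch, not a Purview feature: the function name is hypothetical, and it assumes the member number is always typed with the v2-0203 MB coding as in the example.

```python
import json

COVERAGE_JSON = """
{
  "resourceType": "Coverage",
  "id": "Coverage1",
  "identifier": [{
    "type": {
      "coding": [{
        "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
        "code": "MB",
        "display": "Member Number"
      }]
    },
    "value": "W12345678901"
  }],
  "subscriberId": "W12345678901",
  "dependent": "01"
}
"""

V2_0203 = "http://terminology.hl7.org/CodeSystem/v2-0203"


def member_numbers(resource):
    """Return identifier values typed as MB (member number) on a Coverage."""
    if resource.get("resourceType") != "Coverage":
        return []
    values = []
    for identifier in resource.get("identifier", []):
        codings = identifier.get("type", {}).get("coding", [])
        is_member_number = any(
            c.get("system") == V2_0203 and c.get("code") == "MB" for c in codings
        )
        if is_member_number and "value" in identifier:
            values.append(identifier["value"])
    return values


members = member_numbers(json.loads(COVERAGE_JSON))
```

The lookup keys on the coding system and code, not on the word "Member" appearing nearby, which is the distinction a proximity-keyword SIT can only approximate.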

National Provider Identifiers

NPIs are 10-digit intelligence-free identifiers issued by CMS through the NPPES system. The first digit is currently 1 or 2. The 10th digit is a Luhn checksum, calculated by prepending the constant 80840 (where 80 indicates health applications and 840 indicates the United States per the NCITS.284 standard) and running the standard modulus 10 double-add-double algorithm.

A real validatable NPI looks like 1245319599. The CMS NPI check digit specification publishes the validation procedure as follows:

"The National Provider Identifier check digit is calculated using the Luhn formula for computing the modulus 10 'double-add-double' check digit... When an NPI is used as a card issuer identifier on a standard health identification card, it is preceded by the prefix 80840."

Microsoft Purview's built-in SIT library as of this writing does not include an NPI sensitive information type. There is no built-in regex pattern, no built-in checksum function, no built-in proximity-keyword dictionary for "rendering provider," "billing provider," or NM1*82.
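The missing check is small. Here is a minimal Python sketch of the validation described above, assuming the candidate has already been isolated as a 10-digit string. The function names are mine, and this is the generic Luhn algorithm applied to the 80840-prefixed number, not code taken from CMS or Microsoft.

```python
import re


def luhn_valid(digits):
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9      # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


def is_valid_npi(candidate):
    """True when the string is 10 digits, starts with 1 or 2, and passes
    the Luhn check once the 80840 prefix is prepended."""
    if not re.fullmatch(r"[12]\d{9}", candidate):
        return False
    return luhn_valid("80840" + candidate)


checks = {
    "1245319599": is_valid_npi("1245319599"),   # rendering NPI from the 837 sample
    "1234567893": is_valid_npi("1234567893"),   # billing NPI from the 837 sample
    "2165551234": is_valid_npi("2165551234"),   # phone number without separators
}
```

A checksum alone is not enough: any Luhn check passes roughly one in ten random digit strings, so the proximity keywords still carry much of the precision.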

Every 10-digit number in your tenant looks the same to your DLP engine. Phone numbers without separators, account numbers, claim numbers, internal reference IDs, and the actual NPIs of every provider in your network sit at the same confidence level when you try to write a policy.

State Medicaid identifiers

Ohio Medicaid uses a 12-digit recipient number. Texas Medicaid uses a 9-digit individual number. Connecticut Medicaid uses a 9-digit format with category-specific leading digits. California Medi-Cal uses a 14-character alphanumeric BIC number with a check digit and date suffix. Most state Medicaid IDs are purely numeric and vary in length, with California as the alphanumeric outlier. There isn't a single regex that recognizes "state Medicaid ID." There is no built-in SIT for any state's Medicaid ID format. For a regional payer or any organization processing Medicaid managed care across multiple states, every state contract creates a separate identifier format that needs to be engineered separately.
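A short Python sketch shows why length-and-character-class patterns can't carry this on their own. The patterns below encode only the lengths and character sets stated above. They are deliberately naive assumptions, not the states' published format rules, and they ignore Connecticut's leading digits and California's check digit and date suffix.

```python
import re

# Naive assumptions: length and character class only, per the descriptions above.
STATE_MEDICAID_PATTERNS = {
    "OH": re.compile(r"\d{12}"),
    "TX": re.compile(r"\d{9}"),
    "CT": re.compile(r"\d{9}"),
    "CA": re.compile(r"[0-9A-Z]{14}"),
}


def candidate_states(value):
    """Return every state whose naive pattern the value fully matches."""
    return sorted(
        state
        for state, pattern in STATE_MEDICAID_PATTERNS.items()
        if pattern.fullmatch(value)
    )


nine_digits = candidate_states("123456789")        # also the shape of an unformatted SSN
twelve_digits = candidate_states("123456789012")
fourteen_chars = candidate_states("9A1234567890B1")
```

A 9-digit value matches Texas and Connecticut equally and is indistinguishable from an unformatted SSN, so each state's pattern needs its own supporting keywords or an exact-match roster behind it.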

Member identifiers and document control numbers

Member IDs are alphanumeric strings with payer-specific structures, and the canonical example is the BCBS BlueCard system. Every BCBS member ID begins with a three-character prefix that identifies which of the 33+ independent Blue Plans administers the policy. The prefix was alpha-only until April 2018, when BCBSA started issuing alphanumeric prefixes because the alpha combinations were running out. The full member ID runs up to 17 characters. Federal Employee Program IDs start with "R" and don't follow the prefix pattern at all. Other major payers each use their own formats across their commercial, Medicare Advantage, and Medicaid managed care product lines, and member IDs frequently include alpha characters that distinguish product family or line of business. To a regex, they all look like opaque strings.
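Here is a hedged Python sketch of the prefix split described above. Both patterns are assumptions made for illustration: the prefixed form is taken as three alphanumeric characters plus 1 to 14 more (17 total), and the FEP form is taken as "R" followed only by digits, with the digit count left open because it isn't stated here. The function name is hypothetical.

```python
import re

# Assumption: 3-character alphanumeric prefix + 1 to 14 more characters (17 max).
PREFIXED_ID = re.compile(r"([A-Z0-9]{3})([A-Z0-9]{1,14})")
# Assumption: FEP-style IDs are "R" followed only by digits (count left open).
FEP_ID = re.compile(r"R\d+")


def describe_member_id(value):
    candidate = value.strip().upper()
    if FEP_ID.fullmatch(candidate):
        return {"kind": "fep", "prefix": None, "remainder": candidate}
    match = PREFIXED_ID.fullmatch(candidate)
    if match:
        return {"kind": "prefixed", "prefix": match.group(1),
                "remainder": match.group(2)}
    return {"kind": "unknown", "prefix": None, "remainder": candidate}


member = describe_member_id("W12345678901")
order_number = describe_member_id("ORD20250515001")   # not a member ID at all
```

The order number parses as a "prefixed" member ID just as cleanly as the real one. The pattern can describe the shape, but only a lookup against the actual prefix list or the member roster can confirm it.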

Claim numbers (variously called DCNs, ICNs, or PCNs depending on the payer) are similarly payer-internal. Lengths vary from 8 to 15 digits across payer adjudication systems. A generic numeric regex fires on every order number, ticket ID, internal reference, and same-length portion of a longer identifier in your environment.
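The noise problem is easy to reproduce. In the Python sketch below, the first pattern is the generic 8-to-15-digit regex, and the second adds a keyword anchor and a trailing-digit guard. The keyword list (DCN, ICN, "claim number") and the sample strings are illustrative assumptions, not a recommended production pattern.

```python
import re

# Generic pattern: any run of 8 to 15 digits.
GENERIC_CLAIM_NUMBER = re.compile(r"\d{8,15}")

# Assumed keyword anchor, plus a guard so a slice of a longer number can't match.
ANCHORED_CLAIM_NUMBER = re.compile(
    r"\b(?:DCN|ICN|claim\s+number)\W{0,5}(\d{8,15})(?!\d)",
    re.IGNORECASE,
)

noise = "Order 20250515001 shipped; ticket 88123456; card 4111111111111111"
claim_text = "DCN 250515123456 adjudicated"

generic_hits = GENERIC_CLAIM_NUMBER.findall(noise)
anchored_hits_on_noise = ANCHORED_CLAIM_NUMBER.findall(noise)
anchored_hits_on_claim = ANCHORED_CLAIM_NUMBER.findall(claim_text)
```

The generic pattern fires three times on text that contains no claim number, including on the first 15 digits of a 16-digit card number. The anchored pattern stays silent on the noise and still finds the DCN.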

ICD-10 codes with disclosure sensitivity

The built-in ICD-10-CM SIT in Microsoft Purview doesn't work the way most people assume. Per the Microsoft Learn entity definition, it uses two dictionaries: one of disease terms ("depression," "asthma") and one of the actual codes. High-confidence detection requires both a term AND a code within 300 characters. Term matches alone produce a medium-confidence hit. Codes alone, which is what you actually have in an EDI 837 HI segment or a FHIR Condition resource, produce no match at all because the SIT's primary match requirement is against the terms dictionary.
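To make that matching logic tangible, here is a toy Python model of the behavior as described above. It is not Microsoft's implementation. The two small sets stand in for the real dictionaries, and measuring the 300-character window between match start positions is a simplification I'm assuming.

```python
import re

# Stand-ins for the two dictionaries; the real ones are far larger.
DISEASE_TERMS = {"depression", "asthma"}
ICD10_CODES = {"F32.9", "J45.909", "Z00121"}


def positions(needles, haystack):
    """Start offsets of every case-insensitive occurrence of any needle."""
    lowered = haystack.lower()
    return [
        match.start()
        for needle in needles
        for match in re.finditer(re.escape(needle.lower()), lowered)
    ]


def modeled_confidence(text, window=300):
    terms = positions(DISEASE_TERMS, text)
    codes = positions(ICD10_CODES, text)
    if not terms:
        return "none"        # codes alone never satisfy the primary match
    if any(abs(t - c) <= window for t in terms for c in codes):
        return "high"        # term plus code inside the window
    return "medium"          # term without a nearby code


edi_segment = modeled_confidence("HI*ABK:Z00121~")
clinical_note = modeled_confidence("Dx: depression (F32.9)")
term_only = modeled_confidence("Patient reports asthma, well controlled.")
```

Under this model the HI segment, which is pure code, scores nothing, while a free-text note with the same kind of code next to a disease term scores high.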

Even when the SIT does fire, it returns one match against the entire ICD-10-CM category. Chapter 5 of ICD-10-CM (codes F01-F99) covers mental, behavioral, and neurodevelopmental disorders. Within that chapter, the F10-F19 block covers substance use disorders, which under 42 CFR Part 2 carry separate confidentiality protections that attach to records originating in federally assisted SUD programs. B20 is the ICD-10 code for HIV disease. Z21 is asymptomatic HIV infection status. These tokens appear in EDI 837 HI segments, FHIR Condition resources, and clinical notes. The built-in SIT doesn't distinguish between "ICD-10 code present" and "ICD-10 code that triggers additional confidentiality requirements under 42 CFR Part 2 or state HIV reporting law." The categorical distinctions that drive your actual disclosure decisions are not encoded in the default SIT library.
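Here is a minimal Python sketch of the categorical distinction the default SIT doesn't make. The category labels and function name are hypothetical. The code ranges come straight from the paragraph above. Note the limit: Part 2 protection depends on where the record originated, so a hit here only flags a candidate for review, not a legal determination.

```python
def sensitivity_category(code):
    """Bucket an ICD-10-CM code, with or without its decimal point."""
    normalized = code.strip().upper().replace(".", "")
    if normalized == "B20":
        return "hiv_disease"
    if normalized == "Z21":
        return "hiv_asymptomatic"
    if len(normalized) >= 3 and normalized[0] == "F" and normalized[1:3].isdigit():
        block = int(normalized[1:3])
        if 10 <= block <= 19:
            return "substance_use_F10_F19"       # candidate for 42 CFR Part 2 review
        if 1 <= block <= 99:
            return "mental_behavioral_F01_F99"   # rest of Chapter 5
    return "general"


categories = {
    code: sensitivity_category(code)
    for code in ["F10.20", "F329", "B20", "Z21", "Z00121"]
}
```

Stripping the decimal point matters because EDI 837 HI segments carry codes without it, while clinical notes and FHIR resources usually include it.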

Microsoft Purview ships over 300 built-in SITs. Here are the ones a payer most needs that are not in that 300

Per the Microsoft Learn "Sensitive information type entity definitions" page, the U.S. healthcare-relevant built-in set is roughly:

U.S. Social Security Number
Drug Enforcement Agency (DEA) Number
Medicare Beneficiary Identifier (MBI) card
International Classification of Diseases (ICD-9-CM)
International Classification of Diseases (ICD-10-CM)
U.K. Health Service Number
U.S. Bank Account Number, ABA Routing Number
U.S. Driver's License, U.S. ITIN, U.S./U.K. Passport Number
Named Entity Recognition for medical terms, full names, and U.S. physical addresses

The Sensitive Information Types not in the built-in library, as of the published Microsoft Learn entity definitions list, include:

National Provider Identifier (NPI) with 80840 Luhn validation
EDI X12 837 / 835 / 270 / 271 / 278 envelope or segment recognition
HL7 v2 message-type recognition (ADT, ORM, ORU, DFT)
FHIR Coverage, Patient, Claim, ExplanationOfBenefit resource recognition
State Medicaid identifier formats (50 states, 50 variants)
Payer-specific member identifier patterns
Payer-specific Document Control Number patterns
Prior authorization reference number patterns
42 CFR Part 2 sensitive ICD-10 code subset

The fix is custom engineering across all of them:

→regex SITs with proximity keywords

→custom checksum functions for NPI Luhn

→function-based SITs to recognize EDI segment structures

→EDM SITs for member rosters and provider directories

→trainable classifiers for document types like EOBs and prior authorization request forms

None of this is hard for someone who has done it before. None of it ships in the box.

A payer that enables Microsoft Purview against the default Medical and Health template, runs auto-labeling across SharePoint and Exchange, and considers the asset inventory and ePHI flow map a Purview deliverable is going to find, the way Anthem found in 2018 and the way every Risk Analysis Initiative entity has found since October 2024, that "we have DLP" is not the artifact OCR is going to ask for.

What this means under audit

OCR's Risk Analysis Initiative, launched October 31, 2024, with the Bryan County Ambulance Authority settlement, has produced more than a dozen publicly announced enforcement actions through early 2026. The 11th, Top of the World Ranch Treatment Center at $103,000, was announced February 19, 2026. Every one of those actions cites § 164.308(a)(1)(ii)(A), the current risk analysis requirement. Director Paula Stannard's statement on the August 2025 BST & Co. CPAs settlement maps that requirement directly to the artifact:

"A HIPAA risk analysis is essential for identifying where ePHI is stored and what security measures are needed to protect it. Completing an accurate and thorough risk analysis that informs a risk management plan is a foundational step to mitigate or prevent cyberattacks and breaches."

Read that quote alongside the proposed § 164.308(a)(2)(ii)(A) text requiring the written risk analysis to "review the technology asset inventory and the network map to identify where ePHI may be created, received, maintained, or transmitted." The regulatory direction is one-to-one. What OCR has been asking for informally is what OCR will be asking for explicitly.

The Anthem precedent matters because it's the only payer-side action at scale that establishes the cost of an inadequate enterprise-wide risk analysis. The 2018 HHS resolution agreement states that:

"Anthem failed to conduct an enterprise-wide risk analysis, had insufficient procedures to regularly review information system activity, failed to identify and respond to suspected or known security incidents, and failed to implement adequate minimum access controls."

The settlement was $16 million for a breach that exposed 78.8 million records. The corrective action plan required Anthem to conduct a risk analysis and submit it to OCR for review and approval before the CAP could close.

If you're a payer CISO running a Purview tenant that classifies SharePoint and Exchange ePHI with built-in SITs, and your EDI gateway, FHIR API server, HL7 v2 interface engine, claim platform data lake, encounter staging area, and enrollment system are either out of scope or scanned with classifiers that can't see the data structures, you don't have an artifact you can hand to an OCR investigator. You have a Microsoft 365 classification report and a CMDB export.

Purview isn't your asset inventory tool, and it isn't your network mapping tool either. Your CMDB holds the asset register and Visio or Lucidchart holds the network diagram. What Purview is supposed to provide is the classification evidence behind those documents: the scan-output proof that each system on the inventory actually contains the ePHI categories your map says flows through it. When the classification engine can't see EDI envelope structure, HL7 v2 segment context, or FHIR Coverage identifiers, the evidence layer behind the inventory is going to be the layer that fails.

Closing this gap is an engineering task

You don't need a different DLP platform. Microsoft Purview can do what your environment needs. What it cannot do, on the day you "turn it on," is recognize the specific data structures that constitute the majority of a payer's ePHI volume by record count.

The work to close the gap is straightforward to describe but slow to execute. The components in dependency order:

→A custom NPI SIT with the 80840-prefix Luhn function, proximity-bound to NM1*82, NM1*85, "rendering provider," "billing provider," "NPI," and the EDI qualifier "XX."

→A set of custom SITs for the EDI X12 envelope: GS*HC for 837 healthcare claims, GS*HP for 835 remittance, GS*HS and GS*HB for 270/271 eligibility, GS*HI for 278 prior authorization.

→A set of custom SITs for HL7 v2 message types triggered on the literal byte signature MSH|^~\&| followed by the message type field. ADT for admissions and registrations, ORU for clinical results, DFT for financial transactions, ORM for orders.

→Custom SITs for FHIR resources triggered on the JSON patterns "resourceType":"Coverage", "resourceType":"Claim", "resourceType":"ExplanationOfBenefit" (tolerating optional whitespace around the colon, since serializers differ) with proximity to the v2-0203 MB code.

→Exact Data Match SITs for the member roster, the provider directory, and the claim master, configured against the Microsoft 10-schema-per-tenant limit documented in the EDM overview. For a multi-line-of-business payer running commercial, Medicare Advantage, Medicaid managed care, ASO, and Marketplace, that 10-schema limit is an architectural constraint to plan for.

→Custom trainable classifiers for document types that don't align with a pattern: EOB letters, prior authorization request forms, denial letters, and the scanned variants of each. Microsoft Learn documents a 50-sample minimum positive seed (up to 500 considered), an under-24-hour training cycle, and a multi-week iterative review and indexing window before a classifier is publication-ready.

→Sensitivity labels mapped to the result categories with auto-labeling policies scoped to the actual repositories that hold the data: the EDI landing zone (SFTP on-prem or Azure Blob, scanned through the Purview scanner or Data Map), the FHIR server (Azure Health Data Services data exported to ADLS Gen2 and then scanned by Data Map), the analytics environments (Synapse, Databricks, Snowflake, each with its own connector posture), and the Microsoft 365 workloads as a starting point.

→Document fingerprinting is the third built-in custom classification mechanism Microsoft ships, but unfortunately it doesn't help here. Fingerprints detect derivatives of a known blank template. EDI 837 files, HL7 messages, and FHIR bundles are emitted from claim and clinical systems with no preserved boilerplate to fingerprint against. Use fingerprinting on EOB letterhead, denial letters, and standardized PA forms. Don't use it on the transactional layer.
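The structural triggers in the list above (the GS functional identifier, the MSH byte signature, and the FHIR resourceType key) can be prototyped outside Purview before they are translated into SIT patterns. The Python sketch below is one way to do that. The function and constant names are hypothetical, it assumes default X12 and HL7 delimiters, and it is a test harness for sample files, not a Purview rule package.

```python
import re

# GS01 functional identifier codes mapped to transaction sets.
GS_FUNCTIONAL_IDS = {
    "HC": "X12 837", "HP": "X12 835",
    "HS": "X12 270", "HB": "X12 271", "HI": "X12 278",
}
HL7_SIGNATURE = "MSH|^~\\&|"   # the literal bytes MSH|^~\&|
GS_SEGMENT = re.compile(r"(?:^|~)\s*GS\*([A-Z]{2})\*")
FHIR_RESOURCE = re.compile(
    r'"resourceType"\s*:\s*"(Coverage|Claim|ExplanationOfBenefit)"'
)


def sniff_payload(text):
    """Return a coarse payload label, or None when nothing is recognized."""
    stripped = text.lstrip()
    if stripped.startswith(HL7_SIGNATURE):
        first_segment = re.split(r"[\r\n]", stripped, maxsplit=1)[0]
        fields = first_segment.split("|")
        if len(fields) > 8 and fields[8]:
            return "HL7 v2 " + fields[8].split("^")[0]   # MSH-9 message code
        return "HL7 v2"
    gs = GS_SEGMENT.search(stripped)
    if gs and gs.group(1) in GS_FUNCTIONAL_IDS:
        return GS_FUNCTIONAL_IDS[gs.group(1)]
    fhir = FHIR_RESOURCE.search(stripped)
    if fhir:
        return "FHIR " + fhir.group(1)
    return None


labels = [
    sniff_payload("ISA*00*~\nGS*HC*SUBMITTER*RECEIVER*20250515*1234*1*X*005010X222A1~"),
    sniff_payload("MSH|^~\\&|EPIC|EPICADT|iFW|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|"),
    sniff_payload('{"resourceType": "Coverage", "id": "Coverage1"}'),
]
```

Running a harness like this over a sample of files from the EDI landing zone and the FHIR export gives you a ground-truth count to compare against what the Purview scan reports.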

This is months' worth of work in a production payer environment. The benchmark for "done" is whether the classification scan output, along with the asset register and the network diagram, would survive a § 164.308(a)(1)(ii)(A) document request from an OCR investigator under the current rule today and would also satisfy the proposed § 164.308(a)(2)(ii)(A) language tomorrow.

Until you can produce that, the delta between "we have Purview" and "we can prove where ePHI lives" is what's going to show up in an audit.

In the next post, I'll walk through the seven SITs I'd engineer first if I had 90 days to harden a payer tenant's asset inventory evidence ahead of an OCR document request, and I'll cover the precision tradeoffs each one creates.

Matt Silcox is the founder of Severian Technology Group and one of three U.S.-based Microsoft MVPs in Purview Data Security. He works exclusively with healthcare payers on Purview implementation, data classification, and HIPAA Security Rule compliance. More at severiansecurity.com