The progression of data in the healthcare industry
The increasing adoption of electronic information systems by the healthcare industry is generating vast amounts of data from which data scientists can extract valuable information. This healthcare data provides huge opportunities to improve patient care, optimize delivery of healthcare services, save costs, and ultimately, improve billions of people’s lives.
However, healthcare has adopted electronic information systems more slowly than other industries because of its unique challenges. As a result, the volume of newly available electronic data, and the resulting opportunities for data science, are still growing rapidly.
Using the available data also poses industry-specific challenges that data science can help solve.
Structured vs. unstructured data
For data science in healthcare, let’s consider two categories of data: structured and unstructured. Most electronic healthcare data is unstructured. Structured data is recorded in a way that makes it easily interpretable by computers. Examples include spreadsheets, database fields, and image tags that follow a defined standard or schema.
In the healthcare industry, there are various standards, including:
- Health Level 7 (HL7)
- Digital Imaging and Communications in Medicine (DICOM)
- Integrating the Healthcare Enterprise (IHE)
- International Statistical Classification of Diseases and Related Health Problems (ICD)
- Systematized Nomenclature of Medicine Clinical Terms (SNOMED)
- Diagnostic and Statistical Manual of Mental Disorders (DSM)
These standards define how data is collected, exchanged, encoded, and stored. The result? Structured data.
On the other hand, unstructured data is data that may be easy for a human to interpret but lacks the type of structure typically required by computers to interpret it. Examples include audio, video or sensor recordings, images, and text data like narrative reports, scanned documents, and emails. To be easily queried, retrieved and processed by computers, these types of data require metadata that describes the content in a structured way.
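To illustrate how structured metadata makes unstructured files queryable, here is a minimal Python sketch. The `MediaRecord` schema, its field names, and the file names are hypothetical and chosen only for illustration; a real system would more likely follow one of the standards listed above.

```python
from dataclasses import dataclass, field


@dataclass
class MediaRecord:
    """Structured metadata describing one unstructured file (hypothetical schema)."""
    path: str        # where the recording, image, or scan is stored
    modality: str    # e.g. "ultrasound", "audio", "scanned_document"
    tags: set = field(default_factory=set)  # content descriptors, e.g. {"kidney"}


def find_records(records, modality=None, tag=None):
    """Return the records that match the given modality and/or tag."""
    return [
        r for r in records
        if (modality is None or r.modality == modality)
        and (tag is None or tag in r.tags)
    ]


# Made-up archive entries: the files themselves stay opaque to the computer,
# but the metadata attached to them can be filtered like any structured data.
archive = [
    MediaRecord("pocus_001.mp4", "ultrasound", {"kidney", "left"}),
    MediaRecord("pocus_002.mp4", "ultrasound", {"gallbladder"}),
    MediaRecord("dictation_17.wav", "audio", {"discharge_summary"}),
]

kidney_scans = find_records(archive, modality="ultrasound", tag="kidney")
print([r.path for r in kidney_scans])
```

Without the `modality` and `tags` fields, the only way to find the kidney examination would be for a person to open each file.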
There are two key reasons why most healthcare data is unstructured:
- The falling cost of digital storage has allowed healthcare providers to store data that would have been discarded in the past.
- The cost of adding metadata to unstructured data remains high.
For instance, a point-of-care ultrasound (POCUS) is an ultrasound examination performed and interpreted in real time during a patient’s consultation. Previously, these examinations were not archived. If a POCUS was recorded, it remained on the ultrasound machine until it was deleted to free up storage. Now, due to lower storage costs, hospitals can afford to save these images. But they can’t afford the additional labor needed to add the type of metadata generated in a radiology workflow, where orders specify procedures and image tags are validated or manually entered by technologists.
Applying structure to unstructured data
Data science can be used to help extract information from structured data. It can also be used to preprocess unstructured data and add structure by generating metadata.
Once a structure is applied, previously unstructured data can be queried, retrieved and processed alongside other structured data to extract information in a data science pipeline. For example, machine learning can be used to identify anatomical structures in POCUS images and generate tags for the images. To apply structure to unstructured textual information, Natural Language Processing (NLP) can be used to extract key data such as measurements and diagnostic findings from documents. For voice recordings or scanned documents, voice recognition or optical character recognition (OCR) can transcribe the content into machine-encoded text, which can then be processed by NLP.
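As a rough illustration of extracting measurements from narrative text, the sketch below uses a regular expression as a simple stand-in for a real NLP model. The pattern, the list of units, the `Measurement` type, and the sample report are all assumptions made for this example; clinical text is far more varied than this pattern can handle.

```python
import re
from dataclasses import dataclass


@dataclass
class Measurement:
    label: str    # what was measured, as written in the report
    value: float  # numeric value
    unit: str     # unit of measure


# Assumed phrasing: "<label> measures|is <number> <unit>", one per sentence.
MEASUREMENT_RE = re.compile(
    r"^(?:the\s+)?(?P<label>.+?)\s+(?:measures|is)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mm|cm)\b",
    re.IGNORECASE,
)


def extract_measurements(report):
    """Return a Measurement for each sentence that matches the assumed phrasing."""
    results = []
    # Split on sentence-ending punctuation followed by whitespace, so the
    # decimal point in "4.2" does not end a sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        match = MEASUREMENT_RE.search(sentence)
        if match:
            results.append(
                Measurement(
                    label=match.group("label").lower(),
                    value=float(match.group("value")),
                    unit=match.group("unit").lower(),
                )
            )
    return results


report = (
    "The gallbladder wall measures 4.2 mm. "
    "Common bile duct is 6 mm. "
    "No free fluid is seen."
)
for m in extract_measurements(report):
    print(m)
```

Each extracted `Measurement` is a structured record, so it can be stored in a database field and queried alongside other structured data.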
The growing intersection of healthcare and data science includes many applications. These may use structured data, previously unstructured data to which structure has been applied as described above, or both. Some of these applications are specific to healthcare and its unique challenges, while others are already common in other industries and are becoming increasingly relevant to healthcare.
For instance, privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and Freedom of Information and Protection of Privacy Act (FOIPPA) require the protection of patients’ Personally Identifiable Information (PII) and tend to be more stringent in healthcare than in other industries. So, when patient data is shared for a scientific study, to train machine learning models, or for teaching purposes, the PII must be redacted.
Redaction tools already common in many industries often rely on structured data. But in healthcare, PII text can occur in unstructured data such as scanned documents and medical images that include a patient’s information within their pixel data (e.g., from screen captures). A patient’s face may also be visible in photographs, or it may be recoverable by performing a 3D reconstruction on a set of MRI images that do not otherwise reveal the patient’s identity. In these cases, machine learning models can be trained to detect and redact PII, automating much of the workload that would otherwise need to be performed by humans. The role of humans can then shift towards validating that all required redactions were performed.
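The detect-redact-validate workflow can be sketched for the text case as follows. This example uses hand-written regular expressions in place of a trained model, and the patterns, labels, and sample note are hypothetical. It would apply to machine-encoded text, for example the output of OCR on a scanned document, and it is nowhere near sufficient for regulatory compliance on its own.

```python
import re

# Hypothetical patterns for illustration only; real PII takes many more forms.
PII_PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text):
    """Replace detected PII with placeholders.

    Returns the redacted text and a list of (kind, original_text) findings
    that a human reviewer can use to validate the redactions.
    """
    findings = []
    for kind, pattern in PII_PATTERNS.items():
        # Bind `kind` as a default argument so each callback keeps its own label.
        def _replace(match, kind=kind):
            findings.append((kind, match.group(0)))
            return f"[{kind}]"

        text = pattern.sub(_replace, text)
    return text, findings


note = "Patient MRN: 00123456, DOB 1980-02-29, phone 604-555-0199."
redacted, findings = redact(note)
print(redacted)
for kind, original in findings:
    print(kind, "->", original)
```

Returning the findings alongside the redacted text reflects the shift in the human role described above: the reviewer checks what was detected and looks for anything missed, rather than performing every redaction by hand.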