Buying and Selling Big Data: A Practical Solution for AI and Clinical Research

Every now and then someone asks me about, or I read an article about, someone selling massive amounts of data to one of the big companies out there.  When you have a lot of data the obvious thought is, I want some of that free money!  As a thought exercise, let's look at some of the realities of moving more than a petabyte of image data.  A petabyte is 1,024 terabytes, or 1,048,576 gigabytes.  Many, dare I say most, VNAs store data in a near-DICOM format, close to but often not a straight .dcm file.  This means that to get data out you can't simply copy the file; you have to do a DICOM transaction.  There are some that do store straight DCM, but even then there is still the issue of de-identification, so a DICOM store is not the end of the world.
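Since de-identification comes up either way, here is a minimal sketch of what that step might look like with pydicom. The tag list is illustrative only; a real project would implement the full DICOM PS3.15 confidentiality profile.

```python
# Minimal de-identification sketch using pydicom (illustrative only,
# not a complete PS3.15 de-identification profile).
import pydicom

def deidentify(in_path: str, out_path: str) -> None:
    ds = pydicom.dcmread(in_path)

    # Blank a few obvious identifying elements -- a real pipeline
    # would cover the full basic confidentiality profile tag list.
    for keyword in ("PatientName", "PatientID", "PatientBirthDate",
                    "InstitutionName", "ReferringPhysicianName"):
        if keyword in ds:
            setattr(ds, keyword, "")

    # Private tags are a common place for PHI to hide.
    ds.remove_private_tags()

    ds.save_as(out_path)
```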

In my experience a single server tops out at somewhere around 15,000 studies per day, or ~500 GB.  So, doing the simple math, 10 servers dedicated to nothing but copying this data, ignoring any penalty for de-identification or additional compression, will move 1 PB in about 210 days.  I submit that this is not practical and that there is a better way.
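The back-of-the-envelope math, in case you want to play with the assumptions:

```python
# Back-of-the-envelope transfer time, using the numbers above.
PB_IN_GB = 1024 * 1024          # 1 PB = 1,048,576 GB
gb_per_server_per_day = 500     # ~15,000 studies/day per server
servers = 10

days = PB_IN_GB / (servers * gb_per_server_per_day)
print(f"{days:.0f} days")       # -> 210 days, before any de-id penalty
```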

First, we are looking at the problem from the wrong end.  Whether for clinical research or for training an AI engine, the buyer likely doesn't want ALL the data; they are looking for very specific use cases.  In particular, which diagnoses are they trying to research or train on?  Instead of dumping billions of images on the buyer and letting them figure it out, a targeted approach is better, and it begins at the report, not the images.  Since I would want a long-term relationship and the chance to sell my data multiple times, I propose preparing a system that can answer any question, rather than answering a single question like "send me all chest x-rays with lung cancer."

So, to do this we would build a database that holds all reports (not images) for the enterprise.  Start by pulling an extract from the EMR for all existing reports, then add an HL7 or FHIR connection to capture all new reports.  With the reports parsed into the database, any future question or requirement can be answered.  The output of such a query would be accession number, patient ID, date of service, and procedure description.  Obviously, there SHOULD be a 1:1 relationship between the accession number on the report and the images in the VNA, but the other fields will help when Murphy strikes, which he often does.
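Here is a minimal sketch of what that report store and a query against it might look like, using SQLite and an invented schema (the table and column names are mine, not any particular EMR's):

```python
# Sketch of the report store: an invented schema, populated from an
# EMR extract and an HL7/FHIR feed, then queried per buyer request.
import sqlite3

conn = sqlite3.connect("reports.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS reports (
        accession_number    TEXT PRIMARY KEY,
        patient_id          TEXT,
        date_of_service     TEXT,
        procedure_desc      TEXT,
        report_text         TEXT
    )
""")

# A buyer request like "all chest x-rays with lung cancer" becomes a
# query whose output is exactly the export worklist described above.
rows = conn.execute("""
    SELECT accession_number, patient_id, date_of_service, procedure_desc
    FROM reports
    WHERE procedure_desc LIKE '%CHEST%'
      AND report_text LIKE '%lung cancer%'
""").fetchall()
```

In practice the free-text match would give way to NLP or coded diagnoses (ICD-10, SNOMED), but the shape of the output, a worklist keyed on accession number, stays the same.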

Armed with this export, a savvy VNA team can do a targeted export of just the data that is needed.  Instead of taking a dump truck and leaving all of the data in the parking lot, one can deliver a very specific set of data and set up a relationship that benefits both sides moving forward.  Using this method, one could even prepare a sample set of, say, 1,000 exams for the buyer, against which the queries can be revised and updated to produce a better and better targeted data set.
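On the VNA side, here is a sketch of driving that targeted export with pynetdicom, locating each accession number from the report query via a DICOM C-FIND (the host, port, and AE titles are placeholders; the actual retrieval of the matched studies would be a C-MOVE or C-GET):

```python
# Sketch: locate studies in the VNA by accession number using DICOM
# C-FIND (pynetdicom).  Host, port, and AE titles are placeholders.
from pydicom.dataset import Dataset
from pynetdicom import AE
from pynetdicom.sop_class import StudyRootQueryRetrieveInformationModelFind

def find_study_uids(accession_number: str) -> list:
    ae = AE(ae_title="EXPORT_SCU")
    ae.add_requested_context(StudyRootQueryRetrieveInformationModelFind)

    assoc = ae.associate("vna.example.org", 11112, ae_title="VNA_SCP")
    if not assoc.is_established:
        raise ConnectionError("could not associate with the VNA")

    query = Dataset()
    query.QueryRetrieveLevel = "STUDY"
    query.AccessionNumber = accession_number
    query.StudyInstanceUID = ""   # ask the VNA to return the UID

    uids = []
    for status, identifier in assoc.send_c_find(
            query, StudyRootQueryRetrieveInformationModelFind):
        # 0xFF00 / 0xFF01 are the "pending" statuses carrying matches.
        if status and status.Status in (0xFF00, 0xFF01) and identifier:
            uids.append(identifier.StudyInstanceUID)
    assoc.release()
    return uids
```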

Now, instead of providing all chest x-rays with lung cancer, we can provide Hispanic non-smoking males between the ages of 15 and 30 with a lung cancer diagnosis.  I am not a researcher, but I suspect this type of targeted approach would be more beneficial to them, as well as much easier to service from the VNA; in effect, a win-win.
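If the report store were extended with demographic columns (again, an invented schema; sex, ethnicity, and smoking status would have to come from the EMR extract), that refined request is just a tighter version of the same query:

```python
# The refined buyer request against the (hypothetical) extended report
# store -- same worklist output, narrower WHERE clause.  Reuses the
# `conn` connection from the earlier sketch.
rows = conn.execute("""
    SELECT accession_number, patient_id, date_of_service, procedure_desc
    FROM reports
    WHERE procedure_desc LIKE '%CHEST%'
      AND report_text LIKE '%lung cancer%'
      AND sex = 'M'
      AND ethnicity = 'Hispanic'
      AND smoking_status = 'never'
      AND age_at_service BETWEEN 15 AND 30
""").fetchall()
```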