Sharing raw data from clinical trials
I've recently written a guest blog post for PharmaPhorum, which is all about sharing raw data from clinical trials. You can read it here.
I'd be very interested in any comments on the article, but since PharmaPhorum requires registration to comment, if you can't be bothered with going through the registration process, you can leave your comment here.
"If we were all to make our data available using CDISC’s SDTM, then anyone else using the dataset would know exactly what the data mean"
Well, that's not really the case.
Clinical data standards experts at GSK working with CDISC recently concluded: "Current standards (company standards, SDTM standards, other standards) do not currently deliver the capability we require."
From: CDISC English Speaking User Group (ESUG) Committee webinar: "CDISC SHARE - How SHARE is developing as a project/standard" with Simon Bishop, Standards and Operations Director, GSK http://cdiscportal.digitalinfuzion.com/CDISC%20User%20Networks/Europe/English%20Language/Presentations/2012%2009%2012%20-%20TC%20;%20CDISC%20SHARE/ESUG%20TC%20on%20SHARE%2012Sep2012_V2.pdf
And as pointed out in the position paper behind the presentation by Charlie Mead, HL7 and W3C, at FDA's public meeting on data exchange standards: "Although SDTM enabled receipt by the FDA of study data from multiple distributed and/or diversified studies, several limitations with the standard became increasingly difficult to overcome."
From: Solutions for Study Data Exchange Standards: Technologies for Maximizing Cross-Study Analysis Potential http://www.w3.org/2012/Talks/1105-egp-CDER/FDA-v2.pdf
Many thanks for those thoughts, Kerstin; that's very interesting.
I think the validity of my assertion that "anyone else using the dataset would know exactly what the data mean" depends on the purpose for which someone else wants the raw data.
You make a good point that if it's to get raw data from several studies for cross-study analyses, then the "wiggle room" inherent in SDTM is going to make life tricky, particularly if studies have come from different companies. For those cross-study analyses, I think you are right to pick me up on my assertion that the problems are largely solved. Having read your links, I agree that I was being a bit over-optimistic there.
It will be interesting to see if the CDISC SHARE initiative can help standardise things. I have to say I'm skeptical about the prospect of having any standard that's sufficiently flexible to be useful in the huge variety of different study designs that are always going to exist, and yet eliminates the "wiggle room", but I'll watch developments with interest.
However, if data recipients are receiving raw data only for the purposes of re-analysing a single study, then I think I stand by my assertion. Certainly it's been my own experience that if I'm receiving SDTM data from someone else, then provided they have done a good job of completing their define.xml document (and of course that's not always true!), I have no problem figuring out what the data mean and doing whatever analyses are needed. I don't think I saw anything in the documents you provided that convinces me otherwise, unless you think I've missed something important?
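For anyone who hasn't worked with define.xml, it's just an XML metadata file (based on CDISC's ODM model), so pulling out a human-readable data dictionary is straightforward. Here's a rough Python sketch of the kind of lookup I mean; the file name and the ODM namespace version are assumptions, so check the declarations in your own define.xml:

    # Rough sketch: list each variable's name, data type and label
    # from a Define-XML file. "define.xml" and the ODM v1.3 namespace
    # are assumptions -- adjust to match your own file.
    import xml.etree.ElementTree as ET

    ODM_NS = "{http://www.cdisc.org/ns/odm/v1.3}"

    tree = ET.parse("define.xml")  # hypothetical file name
    root = tree.getroot()

    for item in root.iter(ODM_NS + "ItemDef"):
        name = item.get("Name")
        dtype = item.get("DataType")
        # The human-readable label sits in Description/TranslatedText
        label_el = item.find(ODM_NS + "Description/" + ODM_NS + "TranslatedText")
        label = (label_el.text or "") if label_el is not None else ""
        print(name, dtype, label, sep="  |  ")

That sort of listing, alongside the SDTM datasets themselves, is usually enough for me to work out what the data mean for a single-study re-analysis.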
A couple of thoughts:
Yes, as a smart human you can probably read a lot of underlying decisions, implicit meaning and relationships into the text strings you basically get out of a define.xml and SDTM datasets when re-analysing a single study.
However, today we must put more challenging requirements on our data standards, in the light of increased internal reuse and external transparency. I would say we need "smart" clinical data standards and metadata that machines can process.
I think you have a great point about variation across different study designs, disease areas, regions and over time. Hence, we need to express clinical data standards using a common data model and have strong version and configuration management. Such a common data model is what the basic semantic web standard, RDF, offers; see http://cdisc2rdf.com/
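To give a flavour of what I mean by machine-processable metadata, here is a minimal sketch using Python's rdflib; the namespace URI and resource names are invented for illustration and are not part of any published CDISC standard:

    # Minimal sketch: describe one SDTM variable as RDF triples.
    # The namespace and resource names below are hypothetical.
    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/study-metadata#")  # invented namespace

    g = Graph()
    g.bind("ex", EX)

    # The variable AGE in the DM domain, as a machine-readable resource
    g.add((EX.DM_AGE, RDF.type, EX.Variable))
    g.add((EX.DM_AGE, RDFS.label, Literal("Age")))
    g.add((EX.DM_AGE, EX.dataset, EX.DM))
    g.add((EX.DM_AGE, EX.dataType, Literal("integer")))

    print(g.serialize(format="turtle"))

Once the metadata are expressed as triples like these, the same query tools can be pointed at standards and studies from different sources, rather than a human reading text strings.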
I'd be curious to hear each of your thoughts on the utility of the results data posted on ClinicalTrials.gov. My work with that particular slice of the data has been cursory at best, but I found it nearly impossible to run any sort of cross-study analysis, for the reasons you mention regarding study design variability, etc. It would seem this is not so much the fault of the NLM as it is simply a product of the complexity of the data.
What sorts of questions do you think it is reasonable to expect to be able to answer from cross-study analyses (assuming some attainable level of data standards/metadata)?