Craig Klugman, Ph.D.
The 1000 Genomes Project has collected anonymous DNA samples from people all over the world. By looking at this massive data set, the project hopes to discover genetic components of diseases or traits. The project's data and sequences are publicly available on the Internet.
The project is massive and follows ethical guidelines, having sought input from institutional review boards, bioethicists, and ELSI working groups. This is similar to the process any university researcher would undergo for institutional review board review. Anonymizing specimens is standard practice when working with tissue samples, interviews, and even survey results. With the cost of sequencing genomes being high and the chances of having comparison samples to check against the anonymous database being low, most people presumed that the chance of re-identifying DNA was minute or, in research ethics language, a minimal risk.
That was until last week. Yaniv Erlich, a computer informatics and genetics researcher at the Whitehead Institute, and his colleagues published a paper in Science reporting that Erlich and his then graduate student Melissa Gymrek had re-identified 50 of the 1000 Genomes samples. Erlich and Gymrek developed a tool that allowed them to quickly identify repeated areas of DNA on the Y chromosome and applied that technique to 10 male samples from 1000 Genomes. Comparing that information to other web-based genetic databases, those from the Center for Study of Human Polymorphisms (CEPH) and the National Institute of General Medical Sciences Human Genetic Cell Repository at the Coriell Institute, provided ages, geographic locations, and even family trees that enabled Erlich and Gymrek to identify 50 of the 1000 Genomes donors. In short, they could re-identify the de-identified data.
This news has turned the bioinformatics world on its head. The assumption had been that the ability to identify a family or particular individual based on a DNA sample was many years (if not decades) in the future, given the current high cost of sequencing. It was also presumed that a second DNA sample of known origin would be needed to find a match. Not only can researchers no longer offer anonymity, they have to rewrite their risk statements to say that there is a strong likelihood that you can be identified based on a segment of your donated DNA. This means that current genetic researchers need to return to subjects and inform them of the change in the risk-benefit profile.
The alternative is to lock this information up tight so that no one can access it except those with a special key. Of course, such uber-security slows the creation of scientific knowledge. Is the public willing to accept the risk of identification to help further science? There may be genetically related diseases or conditions that individuals do not want made public. Stigma around health-related conditions is still strong, and it's possible that employment, insurance, and even social life could be affected by information being exposed. Even with federal laws like the Genetic Information Nondiscrimination Act of 2008 (GINA), genetic information can be used to harm individuals. GINA does not address issues of state law, personal control over genetic information, or life or long-term care insurance.
This comes at an interesting time. Federal agencies are reconsidering the rules and regulations regarding privacy and confidentiality in research on tissue samples. Again, there was a belief that a person could not be identified (yet) from a simple tissue sample or leftover tissue from a clinical test. But now that assumption has been called into doubt. With cheaper DNA sequencing, and with only small parts of the genome needing to be sequenced, re-identification of samples is not only possible, it is increasingly likely.
Perhaps the notion of privacy and confidentiality is quaint and outdated. After all, people willingly share the intimate details of their lives through social media platforms. And with the prevalence of cameras everywhere recording us and GPS on our cell phones knowing our every step, the idea of a private life may be a thing of the past. Younger generations have little expectation of any sort of privacy.
In the meantime, the NIH has removed people’s ages from its databases in the hopes of preventing further re-identification. The long-term solutions, however, remain to be developed.