UK Biobank health data keeps ending up on GitHub

UK Biobank has a leak problem. A tracker built by Luc Rocher at the Oxford Internet Institute has logged 110 DMCA takedown notices targeting 197 GitHub repositories, uploaded by 170 developers across at least 14 countries. Those repositories contained sensitive genetic and health data drawn from UK Biobank's half a million British volunteers, who never agreed to public exposure.

The problem is built into the system. UK Biobank requires researchers to decrypt and download data to their own machines using tools like ukbunpack and ukbconv. Once that data sits in a Jupyter notebook or R script on someone's laptop, it's one accidental git push away from public GitHub. Nearly half the targeted files are notebooks. A quarter are genetic data files (PLINK, BOLT-LMM, BGEN) that directly encode participant genotypes.
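
One partial safeguard on the researcher's side is a Git pre-commit hook that refuses to commit files that look like notebooks or genotype data. The following is a minimal sketch, not a UK Biobank tool: the pattern list is an assumption based on common extensions for the formats named above (PLINK's .bed/.bim/.fam, BGEN's .bgen, Jupyter's .ipynb), and the function names are hypothetical.

```python
import fnmatch
import subprocess

# Assumed deny-list: common extensions for notebooks and genotype files.
# Adjust to the project; .bed is also used for unrelated interval files.
RISKY_PATTERNS = ["*.ipynb", "*.bed", "*.bim", "*.fam", "*.bgen"]


def risky_files(paths, patterns=RISKY_PATTERNS):
    """Return the paths whose file name matches any risky pattern."""
    flagged = []
    for path in paths:
        # Compare on the lowercased file name so CHR1.BGEN is caught too.
        name = path.rsplit("/", 1)[-1].lower()
        if any(fnmatch.fnmatchcase(name, p) for p in patterns):
            flagged.append(path)
    return flagged


def staged_paths():
    """List files added, copied, or modified in the Git index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def main():
    """Return 1 (blocking the commit) if any staged file looks risky."""
    flagged = risky_files(staged_paths())
    for path in flagged:
        print(f"refusing to commit possible participant data: {path}")
    return 1 if flagged else 0
```

Saved as .git/hooks/pre-commit with a final line of `raise SystemExit(main())`, a script like this would stop the commit before anything reaches a remote. It only checks file names, so it cannot catch participant data pasted into an ordinary script, and blocking every notebook is a deliberately blunt choice.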

The Guardian proved the risk isn't theoretical. Using only an approximate date of birth and the date of a single major surgery, journalists re-identified a volunteer in an exposed dataset. UK Biobank CEO Sir Rory Collins responded by telling participants to avoid sharing specific personal details online. "Telling participants to change their behavior is not a privacy solution," Jess Morley told The BMJ. She and Rocher argued that UK Biobank needs to stop dismissing re-identification risks and start listening to privacy experts.
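
Why two dates can be enough comes down to how few people share the same combination of values. The sketch below is not the journalists' method. It is a toy illustration on made-up records, with hypothetical field names, that counts how many records become unique once a second attribute is added.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` values is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Made-up toy records, not drawn from any real dataset.
records = [
    {"birth": "1950-03", "surgery": "2011-06-14"},
    {"birth": "1950-03", "surgery": "2014-01-09"},
    {"birth": "1950-03", "surgery": "2011-06-14"},
    {"birth": "1962-11", "surgery": "2011-06-14"},
]

print(unique_fraction(records, ["birth"]))             # 0.25
print(unique_fraction(records, ["birth", "surgery"]))  # 0.5
```

Even in this four-row example, adding the surgery date doubles the share of records that can be singled out. In a real dataset with many more attributes per person, each extra detail narrows the field further.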

There's a better model. Platforms like Terra (built by the Broad Institute, Microsoft, and Verily) and Amazon SageMaker keep data inside a controlled cloud environment where analysis happens in place. No local downloads. No accidental pushes. UK Biobank has granted access to 20,000 researchers worldwide. At that scale, relying on strict agreements and after-the-fact DMCA notices isn't working, and the data already exposed can't be un-exposed.