Competing to Collaborate: How Federated Learning Is Quietly Reshaping Pharma's AI Race

There's an old pharmaceutical paradox: the data that would most accelerate drug discovery is precisely the data companies will never share. Every major player in the industry has a mountain of data on compounds tested, assay results, and clinical observations – all of which could be a treasure trove for machine learning models. However, sharing this data with a competitor? Out of the question.

Federated Learning (FL) is a technology that's silently eliminating this paradox. It allows drug discovery leaders to collaborate on machine learning model development without sharing data with their competitors – not even a peek. And 2024 and 2025 saw this technology go from proof-of-concept to production.


What Federated Learning Actually Does

In traditional ML, data flows to a central server where a model is trained. In federated learning, the model goes to the data instead. Each company trains on its own local dataset, then shares only the model's updates — changes to its numerical weights — with a central aggregator. The aggregator merges these updates into an improved global model and sends it back. Raw data never leaves anyone's firewall.

Think of it as ten chefs each privately perfecting a recipe, then sharing only the lessons learned — not the ingredients.
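
The loop is compact enough to sketch in a few lines of Python. The simulation below is a toy federated-averaging (FedAvg) round with made-up clients and a one-parameter model, purely for intuition, not any production FL stack:

```python
import random

def local_update(w, data, lr=0.1):
    # One step of local training: plain SGD on a toy one-parameter model
    # y = w * x. Each "company" runs this on its own private data; only
    # the resulting weight (never the data) is sent to the aggregator.
    for x, y in data:
        grad = 2 * (w * x - y) * x   # d/dw of squared error
        w -= lr * grad
    return w

def federated_round(global_w, client_datasets):
    # The aggregator averages the clients' locally updated weights (FedAvg).
    local_ws = [local_update(global_w, d) for d in client_datasets]
    return sum(local_ws) / len(local_ws)

# Three "companies", each holding private samples of the same law y = 3x.
random.seed(0)
clients = [[(x, 3 * x) for x in (random.uniform(0, 1) for _ in range(20))]
           for _ in range(3)]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(round(w, 2))   # converges toward 3.0
```

No client ever reveals its (x, y) pairs, yet the shared weight converges as if all the data had been pooled.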


The Landmark Case: MELLODDY

The clearest proof point is the MELLODDY Project (Machine Learning Ledger Orchestration for Drug Discovery), a three-year, EU-funded initiative involving ten pharmaceutical giants.

Published in ACS's Journal of Chemical Information and Modeling in 2024, the results were striking. The consortium trained federated QSAR (Quantitative Structure-Activity Relationship) models on a dataset of 2.6+ billion confidential experimental activity data points, covering 21+ million small molecules across 40,000+ assays — all without any proprietary data leaving company-controlled systems. Every participating company saw measurable improvements in its individual predictive models as a result of the collective federation.

The technical architecture was as important as the science. MELLODDY used a blockchain ledger to log every interaction between the model and each partner's data. No central authority controlled the process; every data access had to be approved and recorded on a tamper-proof distributed ledger, meaning no one — not even the platform operators — could reverse-engineer what data had been touched.

"We now have an operational platform, rigorously vetted by the consortium's 10 pharmaceutical partners, found to be secure to host their data — an enormous accomplishment." — Hugo Ceulemans, MELLODDY Project Leader, Janssen (via IHI Newsroom)

🔗 ACS JCIM Paper | IHI Project Factsheet


The New Frontier: Protein Structure & Foundation Models

MELLODDY proved the concept for small-molecule data. The industry is now applying the same playbook to protein structure prediction — arguably pharma's most competitively sensitive domain.

In early 2025, AbbVie, Johnson & Johnson, Bristol Myers Squibb, Takeda, and Astex Pharmaceuticals announced the Federated OpenFold3 Initiative, coordinated by Apheris. The group is collaboratively fine-tuning OpenFold3 — an open-source protein folding model — on thousands of experimentally determined protein-small molecule structures held across their private labs. Each company's structural data remains on-premises; only model gradients are shared. The goal: a protein structure model more powerful than any single company could build alone.

🔗 Apheris Blog: Federated OpenFold3

Separately, Lhasa Limited and eight member organizations have developed FLuID (Federated Learning Using Information Distillation) for drug safety and toxicology prediction — a domain where cross-company signal aggregation is especially valuable but data sharing has historically been impossible.

🔗 Lhasa FLuID Blog


The AI & ML Capabilities Being Unlocked

Federated learning isn't just a data governance workaround — it is actively expanding what pharma AI can do:

Drug-Target Interaction Prediction: Federation dramatically increases training set sizes for QSAR and ADMET (absorption, distribution, metabolism, excretion, toxicity) models. Models trained on aggregated cross-company data systematically outperform those trained on any single company's data alone, as demonstrated in MELLODDY.

Adverse Drug Reaction (ADR) Surveillance: A 2025 scoping review in JMIR analyzing 145 studies found that FL combined with large language models significantly improves ADR signal detection by enabling decentralized training across hospital and pharma datasets that could never be pooled under existing privacy law. 🔗 JMIR Scoping Review

Real-World Evidence & Clinical Trials: A December 2025 paper in npj Digital Medicine (Nature) explored how federated data networks — combined with synthetic data — enable cross-border pharmacovigilance and comparative effectiveness studies that GDPR and HIPAA would otherwise make impossible. 🔗 npj Digital Medicine

Regulatory Science: A 2025 paper in Frontiers in Drug Safety and Regulation proposed federated learning as a tool for regulatory agencies to collaborate on post-market safety monitoring across jurisdictions without exchanging patient-level records. 🔗 Frontiers Paper


The Real Risks: What Can Still Go Wrong

Federated learning does not eliminate privacy risk — it redistributes and reduces it. There are several well-documented attack vectors that keep pharma data security teams up at night.

1. Gradient Inversion Attacks

Even though raw data stays local, the model updates (gradients) that are shared contain information. In what's called a gradient inversion attack, an adversary can mathematically reverse-engineer these gradients to reconstruct approximations of the original training data. Research published in ScienceDirect in 2025 demonstrated that in medical imaging FL, an "honest-but-curious" central server could exploit shared gradients — particularly batch normalization statistics — to recover training images with worrying fidelity. 🔗 ScienceDirect: Gradient Inversion Defense
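
To see why shared gradients are not inert, consider a deliberately tiny case (my own illustration, not the cited paper's setup): for a linear layer trained on a batch of one sample, the weight gradient is the outer product of the output error and the input, so the private input can be read off up to a scale factor.

```python
# Linear layer y = W x, trained on a batch of ONE private sample.
# dL/dW = delta (outer product) x, so every nonzero row of the shared
# gradient equals the private input x multiplied by a scalar.
x = [0.7, -1.2, 0.4]        # the company's private feature vector
delta = [0.5, -2.0]         # output-error vector (from loss and labels)

grad_W = [[d * xi for xi in x] for d in delta]   # what FL would transmit

row = grad_W[0]                                  # attacker takes any row
ratios = [v / row[0] for v in row]               # x recovered up to scale
print(ratios == [xi / x[0] for xi in x])         # True: x is exposed
```

Real attacks against deeper networks need iterative optimization rather than a one-line read-off, but the underlying leak is the same.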

2. Model Inversion Attacks

Related but distinct: model inversion attacks exploit a trained model's outputs — confidence scores, predictions — to reconstruct sensitive attributes of the training population. In pharma, this could mean inferring which chemical scaffolds or biological targets a company has been screening, even without accessing their data directly. 🔗 WitnessAI: Model Inversion Explainer

3. Data Poisoning

A malicious or compromised federated participant could deliberately contribute corrupted model updates designed to degrade the global model's performance or introduce backdoors. In a competitive industry where rivals are also collaborators, this is not a purely theoretical risk.

4. Membership Inference

Attackers can query a model to determine whether a specific compound or patient record was part of any participant's training data — a form of competitive intelligence that borders on IP theft.

5. Regulatory & IP Ambiguity

Strict data protection laws like GDPR and HIPAA were not written with federated learning in mind. Questions remain about whether shared gradients constitute "personal data" under GDPR, and who owns model improvements derived from proprietary training data. A 2023 scoping review in JMIR found significant unresolved legal ambiguity in FL compliance with GDPR, particularly around data controller/processor responsibilities. 🔗 JMIR: FL & GDPR Scoping Review

A 2025 ITIF report added a macro-level warning: overly strict data regulations are already suppressing pharma R&D investment, with one study finding that four years after major data protection law implementation, R&D spending fell by approximately 39% in affected firms — with domestically focused companies hit hardest. 🔗 ITIF Report


The Guardrails: How Pharma Is Managing the Risks



The industry has converged on a layered set of technical and governance guardrails:

🔐 Differential Privacy (DP)

Mathematical noise is injected into model updates before they leave a company's servers, making it computationally infeasible to reverse-engineer individual data points from the gradients. The tradeoff: too much noise degrades model accuracy. Calibrating this balance is an active research area.
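
In code, the step looks roughly like this — a sketch of the standard clip-then-noise recipe popularized by DP-SGD, with arbitrary demo parameter values:

```python
import math, random

def dp_sanitize(grad, clip_norm=1.0, noise_mult=1.2, rng=random):
    # Clip the update's L2 norm to clip_norm, then add Gaussian noise
    # scaled to that bound (the clip-then-noise step of DP-SGD).
    # noise_mult sets the privacy/accuracy tradeoff: more noise means
    # more privacy and less accuracy. Values here are demo choices.
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    sigma = noise_mult * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]

random.seed(42)
noisy = dp_sanitize([3.0, -4.0])   # norm 5 -> clipped to norm 1, then noised
print(noisy)
```

The clipping bound matters: without it, a single outlier gradient could dominate the update, and the noise scale would have no fixed reference.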

🔒 Secure Multi-Party Computation (SMPC)

Cryptographic protocols allow model aggregation to occur without the central server ever seeing any individual participant's raw update. In MELLODDY, this was implemented alongside the blockchain audit layer so that even the platform operator had no access to partner-specific gradients.
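
One common construction is pairwise additive masking; the sketch below is illustrative only, not MELLODDY's actual protocol:

```python
import random

def pairwise_masks(n_parties, dim, seed=0):
    # Each pair (i, j), i < j, shares a random mask; party i adds it and
    # party j subtracts it. The masks cancel in the sum, so the server
    # learns only the aggregate, never any individual party's update.
    rng = random.Random(seed)
    masks = [[0.0] * dim for _ in range(n_parties)]
    for i in range(n_parties):
        for j in range(i + 1, n_parties):
            m = [rng.uniform(-100, 100) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] += m[k]
                masks[j][k] -= m[k]
    return masks

updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # private per-company updates
masks = pairwise_masks(3, 2)
sent = [[u + m for u, m in zip(upd, msk)] for upd, msk in zip(updates, masks)]

# The server sums what it receives; the masks cancel, leaving only the total.
total = [round(sum(col), 6) for col in zip(*sent)]
print(total)   # [9.0, 12.0]
```

Production secure-aggregation protocols add key agreement and dropout recovery on top of this basic cancellation idea.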

🧮 Homomorphic Encryption (HE)

Allows computations to be performed on encrypted data, so updates can be aggregated in encrypted form. Computationally expensive, but increasingly practical for smaller update batches.
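
For intuition, a textbook Paillier scheme demonstrates the additive property — this is a toy with keys far too small for real use; deployments rely on vetted crypto libraries:

```python
import math, random

# Toy Paillier cryptosystem: additively homomorphic, E(a) * E(b) = E(a + b).
p, q = 1000003, 1000033                  # small primes (demo only)
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                     # valid because we take g = n + 1

def encrypt(m):
    r = random.randrange(2, n)           # random blinding factor
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

random.seed(7)
c = (encrypt(41) * encrypt(1)) % n2      # addition performed on ciphertexts
print(decrypt(c))                        # 42
```

Multiplying ciphertexts adds the plaintexts, which is exactly the operation an aggregator needs to sum encrypted model updates it cannot read.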

⛓️ Blockchain Audit Trails

As deployed in MELLODDY, every interaction between the model algorithm and a partner's data is logged on an immutable, distributed ledger. This ensures full auditability: who accessed what, when, and what the output was — without revealing the underlying data.
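
The tamper-evidence property itself is simple to demonstrate. This hash-chained sketch captures the spirit, though not the implementation, of such a ledger:

```python
import hashlib, json

class AuditLedger:
    # Minimal hash-chained ledger in the spirit of an FL audit trail:
    # every record commits to the previous one, so editing any past
    # entry invalidates the chain from that point on.
    def __init__(self):
        self.chain = []

    def log(self, partner, action):
        prev = self.chain[-1]["hash"] if self.chain else "0" * 64
        record = {"partner": partner, "action": action, "prev": prev}
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.chain.append(record)

    def verify(self):
        for i, rec in enumerate(self.chain):
            body = {k: v for k, v in rec.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if digest != rec["hash"]:
                return False
            if i and rec["prev"] != self.chain[i - 1]["hash"]:
                return False
        return True

ledger = AuditLedger()
ledger.log("partner_a", "train_round_1")
ledger.log("partner_b", "train_round_1")
print(ledger.verify())                    # True
ledger.chain[0]["action"] = "tampered"    # any retroactive edit...
print(ledger.verify())                    # ...breaks verification: False
```

A distributed deployment replicates this chain across parties so no single operator can rewrite history unnoticed.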

🧪 Privacy Audits & Red-Teaming

Leading consortia commission independent security audits specifically targeting FL-specific attacks (gradient inversion, membership inference) before any live federated run. MELLODDY's platform was explicitly described as "audited for privacy and security" prior to production use.

📋 Legal Frameworks & Data Sharing Agreements

The European Federation of Pharmaceutical Industries and Associations (EFPIA) published a Data Sharing Playbook in 2024 providing cross-industry templates for governing FL collaborations, clarifying IP ownership of jointly trained models, and establishing breach liability protocols.

🔬 Gradient Sparsification & Clipping

Only the most significant gradient updates are shared (sparsification), and their magnitude is capped (clipping) — reducing both communication overhead and the amount of information an attacker could exploit.
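
Both techniques fit in a few lines; the values below are arbitrary demo numbers:

```python
def sparsify_and_clip(grad, k=2, clip=1.0):
    # Keep only the k largest-magnitude components (top-k sparsification)
    # and cap each surviving value at +/- clip (clipping): less bandwidth
    # on the wire, and less raw signal for an attacker to exploit.
    top = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    out = [0.0] * len(grad)
    for i in top:
        out[i] = max(-clip, min(clip, grad[i]))
    return out

print(sparsify_and_clip([0.1, -3.0, 0.4, 2.0]))   # [0.0, -1.0, 0.0, 1.0]
```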


The Bottom Line

Federated learning is not a silver bullet, but it is the most credible path the pharmaceutical industry has found toward AI models that reflect the full breadth of human biology and chemistry — not just what any one company happens to have in its lab. The MELLODDY results alone demonstrated that even fiercely competing companies can build collectively smarter models without compromising their competitive moats.

The risks are real and technically sophisticated. But the guardrails — differential privacy, SMPC, blockchain audit trails, homomorphic encryption — are maturing rapidly. The remaining gap is not primarily technical; it is legal and governance-related. As regulators in the US, EU, and Asia race to clarify how GDPR, HIPAA, and the EU AI Act apply to federated systems, the companies that invest now in FL infrastructure and governance frameworks will be best positioned when the rules solidify.

The data never moves. The intelligence does. And in pharma, that distinction may be worth billions.


Sources: ACS JCIM (2024), Apheris (2025), Lhasa Limited (2025), JMIR (2025), npj Digital Medicine (2025), Frontiers in Drug Safety and Regulation (2025), ScienceDirect (2025), ITIF (2025), IHI Europa, Norton Rose Fulbright, WitnessAI. 
