Why "Accuracy" Can Be Tricky in Drug Discovery
Headline: Is your AI model actually learning, or just memorising?
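I’ve been working on a new application to predict ADMET properties (absorption, distribution, metabolism, excretion and toxicity: roughly, how a drug moves through the body and how safe it is) using XGBoost. During this project, I realised that a "high accuracy score" isn't always what it seems.
In machine learning, we usually split our data randomly into a training set and a test set. But in drug discovery, this can create a "hidden" problem: closely related molecules end up on both sides of the split.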
The "Family" Problem:
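Imagine training a model on a family of similar molecules that share the same core structure (scaffold), like benzene derivatives.
The Easy Way (Random Split): The model sees some family members during training and others at test time. It gets a great score because it recognises the "family face." Much of that score can come from memorising the family rather than learning chemistry that generalises.
The Real Way (Scaffold Split): We group molecules by scaffold and keep each group on one side of the split. For example, we hold the entire benzene family out of training and test the model on it, or train on the benzene family and test on a completely different structure, such as molecules built around a pyridine ring.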
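To make the difference concrete, here is a minimal sketch of a scaffold split in plain Python. It is an illustration under a few assumptions, not the exact code from my project: the function name `scaffold_split` is made up for this post, the scaffold labels are assumed to be already computed for each molecule (in practice they could be derived from the structures, for example as Bemis-Murcko scaffolds with a cheminformatics toolkit such as RDKit), and the ten-molecule dataset is a toy example.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split molecule indices so that no scaffold appears in both sets.

    `scaffolds` holds one scaffold label per molecule (assumed to be
    precomputed). Returns (train_idx, test_idx).
    """
    # Group molecule indices by their scaffold label.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest families first; ties broken by label so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    # Fill the training set family by family until it would exceed its share;
    # any family that does not fit goes to the test set as a whole.
    train_cap = (1 - test_fraction) * len(scaffolds)
    train_idx, test_idx = [], []
    for _, members in ordered:
        if len(train_idx) + len(members) <= train_cap:
            train_idx.extend(members)
        else:
            test_idx.extend(members)
    return train_idx, test_idx


# Toy dataset: one scaffold label per molecule (illustrative only).
scaffolds = [
    "benzene", "benzene", "benzene", "benzene", "pyridine",
    "benzene", "indole", "indole", "benzene", "pyridine",
]

train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.3)
train_families = {scaffolds[i] for i in train_idx}
test_families = {scaffolds[i] for i in test_idx}
print("train families:", sorted(train_families))
print("test families:", sorted(test_families))
```

In this toy run the benzene family lands entirely in training, and the model would be tested only on families it has never seen. A random split of the same ten molecules would almost certainly put benzene derivatives on both sides, which is exactly the "family face" leak described above.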
What I'm Learning:
When I use a Scaffold Split, accuracy usually drops. At first, that feels like a failure. But I’ve learned that this "lower" score is actually more honest. It shows how the model will perform in the real world when it meets a brand-new type of medicine it has never seen before.
For me, the goal isn't just to get a 99% score; it's to build a model we can trust for real discovery.