Predictive validity in drug discovery: what it is, why it matters and how to improve it?

Jack Scannell, PhD


Jack Scannell: What I'm going to talk about today is actually also very close to my current professional interests. I do a few things at the moment: I'm the part-time chief executive of a very little biotech startup; I have a consulting business that does pretty much the stuff I'm going to be talking about in this talk; and I also do some academic work, some of which, though not all, overlaps with this talk. As for how I got interested in this area: I've got a slightly weird background. I trained in medicine a very long time ago. I then worked as an academic neuroscientist at Oxford and at Newcastle University, and at a relatively mature age I stopped being an academic and moved into consulting and then into finance, oscillating back and forth between consulting, finance, and also bits and pieces of real biotech, including some biotech drug discovery.

That rather peculiar background has given me an interest in R&D productivity. And I'm going to give you one of the big messages right up front. I think there's quite a good analogy for drug R&D or drug discovery, and it's that finding drugs is a bit like finding some tiny oases of safety and efficacy in vast biological deserts. Nearly every biological or chemical opportunity you look at probably isn't going to work, but a few do. The screening and disease models with which we decide whether a drug target is a good one, or whether a compound is a good one, are the things we use to try and point us towards the oases. But they only do that if you give them a bunch of candidates and they score them the same way they would score if a very sick person took the same things. If they don't, they're going to head you in the wrong direction. And in particular, I think, possibly less so in recent years, there was certainly a period, probably from the 80s until the early 2000s, when brute-force efficiency took over. And that was a bit like running around the desert at great speed with a very bad compass: it gave the impression of progress, but very often you were heading in the wrong direction, and that's because, arguably, the screening and disease models weren't good enough.

So, I'm going to start with a caricature, but this is the drug industry's caricature, not mine.
And it's how the drug industry, when talking to investment analysts like I used to be, or to politicians, says drugs are discovered. You have something called target identification. So you start, at least in the caricature process, by identifying a target, and targets are pieces of biological machinery that are presumed to have gone wrong in a particular disease, or, if it's a bacterial infection, it might be something that's gone right for the bacteria in a particular disease. A lot of this comes from, for example, looking at genetic associations with disease risk. So you might find that a particular protein in certain people is different to that in other people, and maybe that is related to the difference in disease risk those people get. Then there's a process called target validation: for example, in this case you might express the gene that you've identified in people in a rat and see if the rats get a disease that's a bit like the human disease. If they do, the target is presumed validated. What you then do is develop a method for testing therapeutic interventions, often small-molecule drugs but sometimes other things, to see the extent to which they bind to and modulate the behavior of the target. You then test lots and lots of things against the target to find things that seem to do a good job of modifying its behavior in an appropriate way. You then do something called lead optimization. Let's suppose you're trying to develop a headache drug, and you've got something that really binds your headache target, but it can only get into the skull via intracranial injection; well, that's probably not going to be commercially successful, so you'll do a bunch of chemistry so that people can take a pill, and other things as well.

You then see if it cures headaches in rats, and whether it passes a whole bunch of safety-related hurdles. If that happens, then you can start your clinical trials in people. Again, this is a caricature of the industrial process; lots of other things happen as well, and it's more diverse than this. But effectively, by building this kind of industrial process, you've industrialized a specific view and a specific aspiration. The aspiration is something like this: that there are lots of pieces of biological machinery which misbehave and cause a disease phenotype; that you can drug them, which is the screening and lead optimization; and that there still exist lots of targets and lots of drugs that may be tested against them. If those assumptions are true, this is a great process for delivering, at very low cost, a large number of successful drug candidates into the clinic. So that's the caricature of the industrial basis of drug discovery.

Now, it became clear to me, working in the investment and commercial world in the very late 90s and early 2000s, that there was a big problem, because the technologies that underlie the steps of the industrial process have got massively better over the last few decades. So, for example, DNA sequencing: that's terribly important for target identification. X-ray crystallography: that's really important for the mass screens and screening, and those sorts of activities. And there's a whole bunch of other ones that we won't talk about. This is a logarithmic axis and this is a logarithmic axis. This graph is now out of date, but between the late 60s and 2010, DNA sequencing got 10 billion times cheaper in terms of unit cost. X-ray crystallography, which allows you to look at the structure of a protein so you can guess what drugs might bind it before you do any expensive wet science, got roughly 10,000 times cheaper over a similar period. And I could point to a large number of the technologies that are believed to be important for these individual steps. But over that period, the amount of money that the drug industry spent per drug discovered, or per drug approved by the FDA, rose by roughly 100-fold, and this is in real, inflation-adjusted terms. And I became inordinately interested in how you can have a process where all of the inputs get hundreds, thousands, millions, or billions of times cheaper, and these are the inputs that most people doing the process think are important, but the process is a hundred times more expensive.

Now, things actually have improved a bit since 2010, and that's not going to be a primary part of my talk; I may allude to it, but if anyone's interested and I forget to say anything about it, prompt me at the end. For me, the overwhelmingly interesting bit is this: a hundred times more expensive, in terms of drugs approved per billion dollars spent by the global biopharmaceutical industry, in the face of massively improving input efficiencies. And I use this slide to remind me, and to remind the audience, that there are really two ways to explain something like this. I'll tell you what the two pictures are first. The picture on the left is a guy sitting with something that's called a Tasmanian tiger, or sometimes a Tasmanian wolf.
And on the right, it's not actually an alchemist, but it's a picture I found that looks like an alchemist, and let's imagine that this guy is trying to transmute lead into gold. If your inputs have got better and faster and cheaper, and your outputs have got 100 times more expensive, you could be a bit like the guy on the left, where it actually doesn't matter if your Tasmanian-tiger-catching technology gets a thousand times better, because the last Tasmanian tiger died in Hobart Zoo sometime in the 1930s, so there aren't any left. So there's a kind of resource-depletion set of explanations.
But then there's another logically possible set of explanations, which is that the qualitative nature of the things people did in the 50s, 60s, and 70s was different from the qualitative nature of the things people have done in the 90s and 00s: like the alchemist, you've set out to do something that, done that way, can't be done. And, as I gently remind folks in the drug industry whenever I show this slide, the more you disagree with one of these explanations, the more you're unfortunately forced to agree with the other. If you think there are actually lots of fantastic opportunities out there, then you're forced to conclude that we somehow must be doing R&D in a much less effective way. On the other hand, if you think we're really doing it very well and we're incredibly effective, then you're forced to conclude that there's been a real, potentially catastrophic, depletion of opportunities. And I'm going to introduce a slightly geeky decision-theoretic slide which formalizes that a bit, because I will come back and talk about some of these ideas. Going back to the previous slide: here is "okay, we've run out of things, so there isn't anything to discover."

Here is "we're doing it wrong." You can think about that a bit more formally in terms of finding drugs for diseases. We can imagine that the screening and disease models we have are effectively the tools for deciding whether therapeutic candidates are good or bad, while the chemical technologies or the antibody technologies are the ways of making therapeutic candidates. And if you think about the diseases which may exist, well, some of them are probably not pharmacologically tractable, but may be modellable. For example, take traumatic limb amputation. That is pretty modellable in an animal; it might be an unpleasant thing to do, but I'm sure the pathophysiology of traumatic limb amputation in animals is pretty similar to people. But no one would expect to be able to discover a drug that would suddenly make a limb grow back in a higher mammal. So that's something that may be modellable but not pharmacologically tractable.

And then there may be diseases that are pharmacologically tractable but not modellable. I would argue that history suggests human depression may be such a disease: arguably, all new classes of antidepressant drugs were discovered through observation in people. So clearly there's chemistry that helps with depression, but it's very hard to find that chemistry by testing it in models. So, you can imagine that in an R&D process you've got a bunch of notionally approvable compounds that might exist or might be able to be created, and you've got unapprovable candidates. What you do is go through a decision process where you use your models: you've got a true positive rate for selecting good candidates and a false positive rate at each step in the process, so in your initial screen you find lots of good candidates and you reject lots of bad candidates. As you go to each next step of the process, you progressively enrich Q, which is the ratio of good to bad candidates.
So, we have a sort of pharmacological tractability, i.e. what proportion of randomly selected compounds work in the disease. And then we've got a bunch of decisions that we make to try to enrich the compounds progressively through the R&D process, before we put them into expensive human trials.
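A minimal sketch of that enrichment idea (the numbers here are invented for illustration, not figures from the talk): each screening step with a given true positive rate and false positive rate multiplies the odds Q that a surviving candidate is good by TPR/FPR.

```python
# Hypothetical screening cascade: each step multiplies the odds Q that a
# surviving candidate is good by its true-positive/false-positive ratio.

def enrich(q, tpr, fpr):
    """Odds of a good candidate after a screen with the given TPR and FPR."""
    return q * tpr / fpr

q = 1 / 1000  # prior odds: 1 good candidate per 1000 (invented tractability)
for tpr, fpr in [(0.8, 0.2), (0.7, 0.1)]:  # two invented screening steps
    q = enrich(q, tpr, fpr)

ppv = q / (1 + q)  # probability that a survivor is genuinely good
print(round(ppv, 3))
```

This is the diagnostic-testing view of a pipeline: throughput controls how many candidates enter, but the TPR/FPR of each model controls how fast Q grows before the expensive human trials.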

So, I'm going to try and give you a short summary of what I think the main drivers of Eroom's law are: the fact that it's so much more difficult to discover drugs now, despite the technology having got better. I think the primary problem is something that I described in a paper I wrote back in 2012; we called it the "better than the Beatles" problem, which shows our age and lack of fashionability. But the analogy is this. Imagine that any new music that was launched had to be (a) better than the Beatles, that (b) you could download the whole Beatles catalogue for free, and that (c) you never got bored of listening to the Beatles. Under those three conditions, it would be very, very difficult to launch new music. Now, the drug industry is effectively an intellectual-property business, and it has exactly those better-than-the-Beatles characteristics, because what happens is that drugs are launched, some of them turn out to be very good, and most of those eventually become generic. That effectively makes it difficult for new drugs to compete in the same therapy area.

And what the graph on the left shows is the proportion of US prescriptions for generic drugs. In 1994, most drugs prescribed in the US were branded. Now 90% of drugs prescribed in the USA are generic. So you've got this ever-improving back catalogue of very cheap medicines.
And what that's done is push the residual R&D into those areas where, historically, the drug industry has been unsuccessful. And if the drug industry has been unsuccessful in those areas for the last hundred years, which is roughly how long the modern drug industry has existed, they're probably difficult for one reason or another. Now, this is where I'm going to turn a bit more to disease models. For reasons that I'll describe later, I think Eroom's law and the better-than-the-Beatles problem point to the importance of disease models in R&D productivity. Some of the disease models on this slide are good at distinguishing between good and bad therapeutic candidates. And what happens with the disease models that are good at distinguishing between good and bad therapeutic candidates? Well, they give us good drugs.

So what happens is that the better-than-the-Beatles problem applies to those diseases which are eminently modellable, where we have good preclinical models, and probably also to the most pharmacologically tractable diseases. We have this selective retirement of the diseases with the good preclinical models, and what we're left with is the diseases where the models don't identify the chemistry that cures people. And ironically, because we're left with the diseases where the models don't identify the chemistry that cures people, very often we keep using those same models year after year. I don't want to pretend that's all that's going on; there are a lot of other important factors as well. The regulatory environment has changed massively over the decades; that's very obvious. But I think the decision-theory stuff I'm going to share in a bit points towards disease models being an important part of the story, and also possibly an important part of the cure.

So now I'm going to talk about screening and disease models a little bit more formally. Imagine that we had infinite money and no ethics. We could in principle test a very large number of therapeutic candidates, and these could be drug compounds or they could be targets. We could test them all in an animal efficacy model, and we could test them all in people in phase three trials. Imagine this graph plots the correlation you get between the animal efficacy score and the clinical utility in patients. Or you can do the same thing with an in vitro test, some sort of test-tube test of how well a drug may work, and again think about how well it works in patients.
So, although these numbers are never actually measured in practice, if we didn't believe, in principle, that doing better in our animal model was more likely to identify the red dots that work in people, and that doing better in our in vitro model was more likely to identify the red dots that work in people, then even though we never, or only rarely, practically measure it, we wouldn't use the models in the first place.

So the screening and disease models we use in R&D are, in essence, decision tools that allow us to suspect that some candidates are better than others and move them along the R&D process. Those graphs were just made up; here are some real ones. And I think what they show is the difficult data that biologists have to work with. Again, I'm going to tease one of the views that I hear some people express, which is: why doesn't the drug industry just discover better drugs? How hard can it be? Well, what these graphs show is that actually it's pretty hard.

So, this is a graph where on the horizontal axis we have a measure of drug performance, in this case an in vitro toxicology model. You don't need to worry exactly what this means, but basically drugs over this way look more poisonous, and drugs up this way look safer.
This axis is a rough ranking of how dangerous the FDA thinks the drugs are in terms of liver injury. So 1 is actually the most dangerous: these are very risky drugs that have been withdrawn from the market or failed in late-stage trials because they cause very severe liver injury. These drugs are very safe, and the only way they would injure your liver is if someone dropped a crate full of them on top of you. Then you've got slightly more dangerous drugs in between. But what you can see is that with the best in vitro liver toxicology model you could find circa 2017, the correlation here really is pretty weak. So if we set an arbitrary threshold and decide that anything lower than zero is too dangerous to go ahead, what you actually do is throw away quite a lot of relatively safe drugs, and you still let through a lot of drugs that are very, very toxic.

And here's another example. This is a screen that a couple of people on this call, John for example, would be familiar with. This is a screen of a class of drugs called tyrosine kinase inhibitors, which are often used in cancer. And here are actually two different measures: there's this one, a very cheap binding-affinity measure that you can get at very high throughput, so you would typically do it first in the R&D process. And then this measure, how much you inhibit the enzyme that you want to inhibit, is more expensive and harder to do.

Typically you would screen the compounds with the first model before you put a subset of them into the second, and again what you can see is that the correlation here is not great. There is some correlation, but if you use one as a guide to the other, you're going to get the wrong answer a lot of the time.

In the work I've been doing recently, and I think there's probably at least one, perhaps two, of my collaborators on the call, you can start to formalize this a bit and put some numbers around it. What these graphs represent are idealized versions of what we saw on the previous slide. You can think of these as probability density functions, where we've got a universe of therapeutic candidates, we test them all with our decision tool, and then we test them all in clinical trials. Here we've got a good model. Here we've got a lousy model, where the correlation between the score on the model and human clinical utility is weak.

And what we can do is get more or less stringent. We can say, well, we're only going to take the really good things on the model, and we're looking for things that exceed some threshold of utility in people. Things that pass on the model and work in people are our true positives; things that pass on the model but don't work in people are our false positives. And there's an important measure in drug R&D called positive predictive value, which is the proportion of things that you think are going to work that really do work. The reason that's important is that the process gets more expensive and more ethically difficult as you go down the pipe, so by the time you get to human trials, you want high confidence that the things you are going to put into people work, and high confidence that you avoid false positives.

There is one little bit of mental maths, or mental gymnastics, one has to do here. Let's suppose we want to use a really high selection threshold to try and improve the ratio of true positives to false positives. Well, to do that you've got to test more candidates. If you want the best 10% of candidates, you have to test at least 10 candidates and find the best one; if you want the best 1% of candidates, you have to test a hundred candidates; if you want the best 0.1% of candidates, you have to test a thousand. Throughput, or brute-force efficiency, lets you search further out, for things that score higher and higher on your decision tool or model.
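A back-of-envelope version of that threshold/throughput trade-off (my sketch, not a figure from the talk), using the standard approximation that the best of n standard-normal model scores sits near sqrt(2 ln n):

```python
import math

def candidates_needed(top_fraction):
    """How many candidates you must screen to keep only the given top fraction."""
    return math.ceil(1 / top_fraction)

def expected_best_score(n):
    """Approximate expected maximum of n standard-normal model scores."""
    return math.sqrt(2 * math.log(n))

print(candidates_needed(0.001))  # keeping the top 0.1% means screening 1000
print(round(expected_best_score(200), 2))
print(round(expected_best_score(10**7), 2))
```

The point of the approximation is that the returns to brute force are logarithmic: screening 50,000 times more compounds only moves the expected best model score from roughly 3.3 to roughly 5.7 standard deviations.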
So let's see how positive predictive value, the ratio of true positives to true positives plus false positives, varies as you search further and further out on this axis, using a good model and then using a bad model.

With a good model, if you start testing enough candidates, your true-positive-to-false-positive ratio becomes very high. Your positive predictive value can become quite high, i.e. the things that you think will work in people really do work in people. But if you've got a poor model, where there isn't much correlation, the positive predictive value stays low even if you test lots and lots of candidates. So if we look at this graph over here, the shading is positive predictive value, and you can see how it varies with the predictive validity of your model, which is the degree of correlation between the score on the decision tool and clinical utility, and with throughput, which effectively is how far out on this axis you search. And what you find is that, for much of the parameter space that is relevant to drug R&D, a change of 0.1 in the correlation between your model and the human outcome of interest has a bigger effect on productivity than a 10-fold, or maybe even 100-fold, change in your brute-force efficiency. So, for example, when your model really doesn't correlate with human clinical utility, you do just as badly when you test 10^7 compounds as you do when you test 200.
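You can reproduce the flavor of that result with a small Monte-Carlo sketch (my toy parameters, not the talk's actual figures): draw each candidate's model score and clinical utility from a bivariate normal with correlation rho, pick the single best scorer out of n, and ask how often it clears a fixed clinical bar.

```python
import math
import random

def hit_rate(rho, n_candidates, threshold=2.0, trials=1000, seed=0):
    """Fraction of campaigns where the best-scoring candidate works in people."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        best_score, best_utility = -math.inf, 0.0
        for _ in range(n_candidates):
            score = rng.gauss(0.0, 1.0)
            # clinical utility shares variance with the model score via rho
            utility = rho * score + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
            if score > best_score:
                best_score, best_utility = score, utility
        if best_utility > threshold:
            hits += 1
    return hits / trials

# A fairly predictive model with modest throughput beats a weakly
# predictive model with 10x the throughput:
print(hit_rate(0.9, 200))   # high
print(hit_rate(0.1, 2000))  # low, despite testing 10x as many candidates
```

With rho near zero, pushing n_candidates up barely moves the hit rate, which is the "10^7 does no better than 200" behavior described above.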

So, the world's second most useful antimicrobial drug was a drug called Prontosil, or sulfanilamide, and it was discovered by a guy called Gerhard Domagk in 1932. He was working for Bayer at the time, and in 1932 there really were not large libraries of drug-like compounds that people could test; Domagk had access to a couple of hundred compounds. He tested those couple of hundred compounds and found this drug called Prontosil, which, until penicillin came along, was probably the most broadly useful antimicrobial drug there was. Between 1995 and 2005, the global drug industry, Glaxo but a bunch of other companies as well, decided to throw genomics at the problem of antimicrobial drug discovery. And they had a very convincing plan.

Their idea was to sequence the genomes of lots of bacteria and identify genes that are essential to bacterial survival, because if you could block the products of those genes, you might have a good antimicrobial drug. They then cleverly cross-referenced those against human genes, to make sure they didn't pick things with close homologues in man, so the drugs wouldn't be toxic. Across several companies, they ran well over 100 high-throughput screening campaigns against 70 or so targets, many involving more than half a million compounds tested, and they found precisely nothing that was worth putting into clinical trials. So there's a very obvious question: how could a guy testing 200 compounds in 1932 find something useful, whereas people testing 10^7, possibly 10^8, compound-target pairs between 1995 and 2005 discovered nothing useful? The answer potentially falls out of the decision-theoretic maths. If Domagk's model had a very high correlation with human clinical utility, and the way the R&D was conducted later had a very low correlation with human clinical utility, you'd expect the best 1 compound out of 200 in 1932 to be more likely to work in people than the best 1 out of 10^7 in 2005. So why might Domagk's model have been more correlated with clinical utility in man?
Well, I think we now know the answer. Domagk was testing his drugs in mice which he had infected with bacteria and which had sepsis, and some animal infection models are pretty good.
Between 1995 and 2005, the drug companies were testing their compound libraries against isolated bacterial gene products in little pots. And we now know that this decorrelated the results of that period from human clinical utility in a number of big ways. The first obvious source of the decorrelation was that the compound libraries the drug companies had were heavily enriched for compounds that don't get into whole bacterial cells, and they were screening against isolated bacterial proteins.
So that was one source of the decorrelation. And we also now know that the genes that are essential for bacteria in in vitro culture are not the same genes that are essential for survival in living things. So the compounds didn't get into the bugs, and the genes that had been found in vitro weren't important in people; those two things decorrelated the model from the human outcome of interest.

So the message from this part of the talk, and it gets a bit more general and applied next, is that the thing that nearly everybody already believes is important, is more important than nearly everyone already believes. 

So no one in the drug industry thinks bad models are a good thing. No academic sets out at the beginning saying, "I'm going to invent a really lousy model of disease X." But I think, unless you run the decision-theoretic maths and also take a closer look at the history, you underestimate the quantitative importance of very small changes in model validity.

And I'll give you one more anecdote before going on to the practicalities. I spent quite a lot of my professional life working as a biological scientist. And if I plotted two graphs, one with a correlation of 0.6 and one with a correlation of 0.7, I would assume there's really no difference between those graphs. But actually, the difference between a model that correlates 0.6 and one that correlates 0.7 with a human outcome of interest could have the same productivity impact as doing 10 or 100 times as many experiments.
So, the quantitative power of small changes in validity is very high.
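To put a rough number on that 0.6-versus-0.7 claim, here's a back-of-envelope calculation (an illustration under strong assumptions, not the talk's own arithmetic): if the expected clinical utility of the best of n candidates scales like rho times sqrt(2 ln n), the usual best-of-n normal approximation, you can solve for the throughput the weaker model needs to match the stronger one.

```python
import math

def matching_throughput(n, rho_good=0.7, rho_bad=0.6):
    """Candidates the weaker model must screen to match the stronger model's
    expected best clinical utility at n candidates (best-of-n normal approx.)."""
    # solve rho_bad * sqrt(2 ln m) = rho_good * sqrt(2 ln n) for m
    return math.exp((rho_good / rho_bad) ** 2 * math.log(n))

m = matching_throughput(10_000)
print(round(m / 10_000))  # roughly how many times more experiments are needed
```

Under these assumptions, at a throughput of 10,000 candidates the 0.6 model needs a few tens of times more experiments to match the 0.7 model, and the multiplier grows as the baseline throughput grows.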

So what are the practical implications? I think the first one is that the training of biological scientists is blind to this stuff: people aren't educated about how important model quality, or predictive validity, is. Another issue is that there isn't a common language to talk about these ideas, so if you get a psychopharmacologist in a room with a pain biologist and an infectious-disease drug discoverer, they will not have a common language in which to talk about how good or bad their models are.
Another issue is that the statistics that people are taught as biological scientists are generally around hypothesis testing, which is actually rather different from the sort of statistics it would be useful to educate people in if they want to know whether their models are giving them the right answer. Those are actually very close to the statistics that you apply to diagnostic testing: ROC curves, true positive rates, false positive rates, false discovery rates, and so forth, and that sort of stuff is largely absent from conventional biological-science education. Something I will talk a bit about is that we should throw more effort at evaluating models.
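Those diagnostic-testing statistics are straightforward to compute once you have paired model scores and outcomes; a minimal sketch on invented data:

```python
# Diagnostic-test statistics for a screening model (invented example data):
# each candidate has a model score and a known clinical outcome (True = worked).

def confusion_stats(scores, outcomes, threshold):
    """Return (TPR, FPR, PPV) for calling everything above `threshold` a hit."""
    tp = sum(1 for s, y in zip(scores, outcomes) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, outcomes) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, outcomes) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, outcomes) if s < threshold and not y)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # sensitivity: good drugs kept
    fpr = fp / (fp + tn) if fp + tn else 0.0  # bad drugs wrongly advanced
    ppv = tp / (tp + fp) if tp + fp else 0.0  # fraction of "hits" that work
    return tpr, fpr, ppv

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # invented model scores
outcomes = [True, True, False, True, False, False]
print(confusion_stats(scores, outcomes, threshold=0.5))
```

Sweeping the threshold and tracing (FPR, TPR) pairs gives the ROC curve for the model, which is exactly the evaluation machinery the talk is arguing biologists should be taught.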
Sometimes when I say this to people in the drug industry, they think I'm arguing for a counsel of perfection. I.e.: how on earth do we evaluate models if at the moment there are no cures for disease X, and if we test a bunch of compounds in the model and then put them into trials, it can be 20 years before we know if the model is any good or not? Well, that's not really what I'm suggesting. There's a good analogy here, which is Bayesian versus frequentist statistics. There's a flavor of statistics called Bayesian, which is about expectations of what will happen. And then you have so-called frequentist statistics, which is about counts of what actually did happen.
And my view is that model choices at the moment are already Bayesian in flavor: we use a model because we think it's going to tell us something useful. All I'm arguing here is that we do some practical evaluation to try and drag our guesstimate of how good the model is a little bit closer to reality.
One very good way of doing this is to develop the criteria against which you will evaluate models before you have your model. The thing to do isn't to invent a model and then assert that it's a great model of a human disease, which I think is what happens a lot at the moment; it's to prospectively define what a good model of disease X would look like.
And then as independently and rigorously as possible, evaluate your models against it.
And this will help you decide how much to believe the results picked by the model; it also lets you combine models, and I'll talk a little bit about that.

A couple of other things. I think at the moment most of the financial analysis of pipelines is blind to model quality and decision quality, and I'm doing some work to develop alternative financial frameworks so that people can actually value better models.
And then another thing, which I probably won't talk about much, is that there are also economic problems around investing in better models. Models have a lot of the characteristics of public goods, in that they make life better for lots of people who are trying to make money out of drug R&D, but in themselves they don't earn very much money. I think that means a lot of the investment in commercial drug R&D goes into novel chemistry, while investment in the models that would help us decide whether that chemistry was going to be useful or not doesn't get as much as it should. Something that I've been doing lately is trying to get to grips with the practicalities of model evaluation, and again there may be some of my collaborators on the call. To a certain extent it is about clever checklists, and for mnemonic and ease-of-remembering reasons, I structure my clever checklists under four big headings. One is biological recapitulation, which is essentially a serious effort to articulate the degree to which the model captures the relevant aspects of the disease state. There's something about tests and endpoints, which is: is the model testing and scoring candidates in a way that's relevant to the human clinical state? For the third, there's actually a very large literature around it already, which I've begged, borrowed and stolen from, and that's the extent to which the model minimizes bias and noise, maximizes reproducibility, etc.
And then there's a fourth area of evaluation, which I think is very under-discussed in biology but is actually quite useful to think about. I'll illustrate this in a minute or two with some oncology examples, but it's the idea of domains of validity. "Domains of validity" is a term that's very commonly used in many fields of science, but as a biologist I only came across it very late in the day, when I was working with a cosmologist who had become a bioinformatician.
And it's the idea that models are predictive within certain sets of criteria or parameters, but not within others. The example that was used to explain this to me is Newtonian mechanics, which is pretty good at explaining the motion of things that are bigger than an atom, smaller than a black hole and traveling a good bit less than the speed of light. If those conditions are satisfied, then Newtonian mechanics does a pretty good job; but if you're smaller than an atom, nearly as big as a black hole, or traveling around the speed of light, then only an idiot would use Newtonian mechanics. I think as biologists working with disease models we often need to be clearer, and to articulate more explicitly, what our models can predict and what they can't.
And if you're interested in this, I've highlighted on the right a paper which is a very rare example of a rigorously derived set of criteria against which one could evaluate models. This was used by a drug industry consortium to say: here is how we're going to decide whether in vitro models of liver toxicity are good or not. It divides the territory a bit differently from how I've done it on the left of the slide, but effectively it specifies a bunch of things: if you've got an in vitro liver model, it should produce a certain amount of albumin, and that should be the same amount of albumin that liver cells produce. It should produce the right amount of urea. It should express a set of genes that we know are expressed in the liver. It should resemble a liver in terms of histology. And also, if we test a bunch of drugs that we know are toxic, and close structural analogs of those drugs which are not toxic, it should give us the right answers. If it does all those things, then we will deem it to be a reasonable model for the purposes of in vitro liver tox. It's remarkable how rarely that happens, and here's a rare example where it has.
And here's an example of an efficacy model again from one of my collaborators.
So this is a slightly different framework, and a lot of the frameworks differ in that they put things under slightly different headings. But here is a set of criteria, a checklist, with which you can describe Duchenne muscular dystrophy in great detail. There would be sub-criteria under each of these criteria, so for example you look at the genetics, or the biochemistry, or the epidemiology of the human disease to derive criteria against which you evaluate your model, and then you can look at different models to understand their strengths and weaknesses. In this particular case, it turns out that this particular dog model of Duchenne muscular dystrophy in many respects looks a whole lot more like people than does the common mouse model, but the common mouse model has had a whole bunch more drugs tested in it, probably because it's cheaper.
But again, these sorts of plots give you a sense of where the strengths and weaknesses of different models lie, so they help you think about assembling useful sets of models to support the decision process.

I'm going to wrap up fairly soon, but before I do I'm going to give one or two concrete examples. On the left are some therapy areas where I know a little, though certainly less than some people on the call about many of them.
And on the right are some practical lessons which illustrate some of the things I've learned when thinking about the evaluation of these models. One thing I've learned is that in important decision processes, information about models is very often not merely poor, it's entirely absent. I and some coauthors, in a paper that is currently in press, for example, have had to review academic grant applications that aim to cure particular diseases.
I now know that probably the most important thing you want to know before giving the academics money or not is the extent to which the models they plan to use have predictive validity. I've had to review grant applications in the past where there was nothing at all, just no information whatsoever about the likely predictive validity of the models that were going to be used to cure the diseases the academics want to cure. Similarly with investigator brochures: when doctors are being asked to recruit patients into a clinical trial, an extremely important thing you would want to know, in my view, is the predictive validity of the models that were used to bring the drug into the clinic. That information is not simply inadequate in most investigator brochures; it's generally entirely lacking. Another thing that's very interesting is that very often models are used and asserted to be a model of disease X, when actually there's been remarkably little formal attempt to characterize the clinical state.
I've been doing some interesting work recently with Alzheimer's experts, and it's clear that the very first step in evaluating the models is actually characterizing, in some detail, what you think Alzheimer's is, because only when you've done that can you say to what extent the model recapitulates it.

And I would say sometimes it's something even worse than poor characterization of the clinical state. Sometimes, and again John and Ian may have views on this, you have what's called the reification fallacy, where you actually confuse the world with the model.
So it may be the case, and I'm only suggesting this, that in oncology, for example, the dominance of models associated with hyper-proliferation during the cytotoxic era of oncology drug development effectively encouraged people to think of disease in real patients as if it were the same as the models. So not only can you have poor characterization of the clinical state, which makes R&D inefficient; the models can sometimes force you to think about the disease in people in a way that's wrong.
And one of my favorites, around tests and endpoints, is ischemic stroke drugs. There is one ischemic stroke drug that showed efficacy in 17, possibly 19, different published animal studies, and that drug then went into clinical trials. In the animal studies, the median time between inducing the stroke and giving the drug, across those 17 or 19 studies, was five minutes. In the human trials, the median time was five hours. Surprise, surprise: the drug that worked in animals when dosed a few minutes after stroke didn't work in people dosed five hours after the stroke. So I think when it comes to model evaluation, very often we're actually starting from a pretty low base.

Jim Bosley: You mentioned that we don't often evaluate our models for what they're really supposed to do, and there's one aspect of this that I'd just like to touch on; you can come in and refine all of these, and I'll try to be quick. We always use models. Sometimes we use a mental model to integrate lots of complex data, and I've never seen any evaluation of mental models as compared to some of these other things.
That's point one. The second point is that models are often thought of as animal models, or in vitro models, or omics tests, or some other assay. They can also be an integration of all relevant scientific information using mechanistic modeling, which is what I do, as you know. I won't tout that, but I will note that, as I looked at your criteria list, I think we hit all the bullet points, just humbly. The third thing is really a comment on a very commonly used metric that I have come to realize is not a very good one, and that's probability of technical success. It's often used to relate how many drugs go into a phase to what comes out of it.
That's bad because it convolves, or conflates, false positives and true positives in a very bad way, in a way that changes from the early stages, when you actually want a low probability of success, to the late stages. And yet in the drug industry, promotions, managerial responsibility and decision-making power are often given to people who have had a high probability of technical success in their programs. Make any comments you want.

Jack Scannell: Okay, so seeing as we're co-authors, you won't be surprised to hear that I agree with a lot of what you said. When I write about this, I use the term decision tool, which is more general than model. I think models are an important set of decision tools, but there's a whole bunch of others, for example the experience one carries in one's head.
So I would include mental models, and although I don't know about drug R&D specifically, there is some very nice evidence in the expert-judgment literature that even simple quantitative models tend to outperform experts, provided they are suitably parameterized. So there's quite a big literature showing that human judgment generally underperforms quantitative algorithms, when you can properly parameterize those algorithms, even when they are quite simple.

I'm going to say something about probability of technical success. Apologies to those of you who haven't wasted so much of your life in drug and biotech investment, but a lot of the financial analysis and practical management of drug industry portfolios uses something called probability of technical success. This is a sort of attrition measure, so that people can say: in general we have to put eight things into phase one, and then half of them will get into phase two, so the probability of success is 0.5; and then roughly one and a half of those get through the next phase, so the probability of success there is roughly 0.3. Now, my view, for what it's worth, is that's a very dumb way of building your analysis of pipelines, because probability of technical success, as Jim says, conflates good and bad decisions. If we have attrition, it could be because we're throwing away good drug candidates, or it could be because we're throwing away bad ones, and there's a huge difference between the two.
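The conflation being described here can be illustrated with a toy simulation. All numbers and names below are invented for illustration: two screens with nearly the same probability of technical success can keep very different fractions of the genuinely good candidates.

```python
# Toy simulation (all numbers invented): two screens with nearly the SAME
# probability of technical success (PoS) can differ hugely in how many
# genuinely good candidates they advance.
import random

random.seed(0)

def run_screen(n_candidates, p_good, sensitivity, specificity):
    """Return (PoS, number of good candidates advanced, good candidates in total)."""
    advanced_good = advanced_bad = good_total = 0
    for _ in range(n_candidates):
        is_good = random.random() < p_good
        if is_good:
            good_total += 1
            if random.random() < sensitivity:    # true positive: good drug kept
                advanced_good += 1
        elif random.random() > specificity:      # false positive: bad drug kept
            advanced_bad += 1
    pos = (advanced_good + advanced_bad) / n_candidates
    return pos, advanced_good, good_total

# Screen A: a sharp model keeps most good candidates and rejects most bad ones.
pos_a, good_a, total_a = run_screen(10_000, p_good=0.1, sensitivity=0.9, specificity=0.91)
# Screen B: a poor model advances a similar overall fraction, but mostly bad ones.
pos_b, good_b, total_b = run_screen(10_000, p_good=0.1, sensitivity=0.3, specificity=0.84)

print(f"Screen A: PoS {pos_a:.2f}, good candidates kept {good_a}/{total_a}")
print(f"Screen B: PoS {pos_b:.2f}, good candidates kept {good_b}/{total_b}")
```

Both screens advance roughly the same fraction of candidates, so their measured PoS looks almost identical, yet screen B quietly discards around two-thirds of the good drugs that screen A keeps.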
And again, I won't go into the details here, but one of the areas we've been actively doing some work in is developing better methods for valuing pipelines, so that you can actually plug in assumptions about decision quality.
And if you do that, it suddenly says: if you get a slightly better model, it's worth, say, 100 million dollars. So it actually encourages people to invest in models.
So you've got this funny situation at the moment where the financial tools people use to evaluate pipelines are blind to model quality. If you made them unblinded to model quality, maybe a bit more capital would flow into better models, because people could easily see how much better models were worth.

Leeza Osipenko: In terms of the approach to drug discovery, whether it's in academia or in the commercial sector, is it all about models, or are other techniques being used? For example, now, with increased capacity, there is the ability to test millions and millions of samples across different ranges just to see what works out. There's a trial-and-error kind of approach rather than a modeling approach. So, what is the prevalence of modeling prior to the next step?

Jack Scannell: Okay, so again, I use models in a very broad sense, and perhaps decision tool is a better term. If you think about conventional bottom-up drug R&D, you start off with a large set of therapeutic possibilities. Those might be targets, i.e. bits of biology against which you intervene, or they might be chemicals: I know which bit of biology I want to intervene against, but I'm looking for chemicals that intervene effectively against it. I would regard anything that anyone uses to try to narrow that set of possibilities down to the one or two things that are more likely to work as a decision tool. Some of them are formal screening and disease models. Some of them are in vitro tests, like: does compound X bind protein Y in a high-throughput screen? And some of them come much later on, for example: does compound X improve Parkinson's symptoms in a primate model of Parkinson's? But across a very broad range of that spectrum, decision theory suggests that quality beats quantity: testing lots and lots of things in a model that's not highly correlated with the human outcome of interest is generally a bad way to go. And I'll give you another interesting example.

There are only about 1,500 approved drugs, or approved drug compounds. Not many.
So the whole pharmacopoeia that is used in people is actually very constrained: a very, very small set of chemistry, while chemical space is almost infinitely large.
So if you look at drugs after they've launched and ask how many subsequent indications or uses those drugs acquire, what you find is that an extraordinarily large proportion of the observed uses come from field discovery. Drugs launch, and then 20 years later you ask: well, what are these used for? You'll find that they are used for a bunch of things they weren't initially launched for, and most of those things, at least in the relatively small studies that have been done on this, were not planned.
The uses were discovered not by scientists, but by users. And my view is that's quite an interesting illustration, because if you're using real chemicals in real people, your throughput is very low; it's not as if you're testing lots and lots of compounds, because there are only 1,500 approved drugs. But you're discovering uses, and the reason is that real people with disease X are a very good model of real people with disease X. So I think the importance of field discovery is another way of showing that quality beats quantity: humans with a disease are a very high predictive-validity model of other humans who have the same disease, which means that with very small numbers of humans and very small numbers of drugs we can still discover quite a lot of useful things.

Andre Brown: I guess really it's only the predictive validity that matters, and these other things are basically, hopefully, proxies of predictive validity. But ultimately those criteria aren't all that matters: thinking of your liver tox example, if the model doesn't express as much albumin as you might like, but it turns out to be more predictive, then of course it's the better model, and I wonder if there's a way to deal with that. And then also, maybe you could give an example of a model that you think meets most of your criteria but turned out not to be good.

Jack Scannell: So I think you're absolutely right. This is easiest in diseases where a large number of compounds have gone through the models and into people, and in tox in particular, because tox actually offers quite rich data for this work: unlike many efficacy models, everything has to go through tox models, so you have much more data to play with. You can go and get compounds, test them all on your tox model, and at least know to an extent how toxic they are in people.
So tox models are arguably the easiest to do this in. Something I think would be really nice to do, which we haven't done, is to get a bunch of tox models that don't actually recapitulate the human biology and see if they're worse at predicting toxicity.

And again, I can talk a bit about the work I've done with Emulate, as they've published it in the public domain. They got into what they call microphysiological systems, or organs-on-chips, because their view is that the phenotype of lots of cell cultures just stops being like the phenotype of the tissue of interest after a while: sure, they're liver cells, but once you put them in a vat and grow them for a while, they stop behaving like themselves; they're not liver cells anymore.
But I think with other diseases, maybe type 2 diabetes models, you could also do this, as enough drugs have gone into people. The same for some oncology models, I think, but very often there won't be enough. Again, the thing that strikes me is what a low base we're starting from. It's not as if people have generally run really good checklists on these models; with a lot of models there's almost no formal evaluation at all, and people use things because of availability bias: it's what they've always used, it's what's easy to use, it's what's cheapest to use. And I come back to my Bayesian point: changing the correlation between your model and the outcome of interest by an extra 0.1 can have the same productivity impact as doing 10 or 100 times as much stuff. So the checklist doesn't have to nudge things very far in the right direction, or give you a particularly great insight into which models are better or worse, to be quite useful.
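The quality-versus-quantity trade-off can be sketched with a minimal simulation. All parameters here are invented: candidates have a latent "true clinical quality", the screen's score correlates with it at level rho, and we pick the top scorer.

```python
# Minimal sketch (parameters invented): a small improvement in the
# model-to-clinic correlation (rho) can be worth as much as a 10x
# increase in screening throughput.
import math
import random

random.seed(1)

def best_candidate_quality(n_candidates, rho, n_trials=2000):
    """Average true quality of the top-scoring candidate, when the screen's
    score correlates with true quality at level rho."""
    total = 0.0
    for _ in range(n_trials):
        best_score = -float("inf")
        best_quality = 0.0
        for _ in range(n_candidates):
            quality = random.gauss(0, 1)  # latent true clinical value
            # Screen score: correlated with quality at level rho.
            score = rho * quality + math.sqrt(1 - rho**2) * random.gauss(0, 1)
            if score > best_score:
                best_score, best_quality = score, quality
        total += best_quality
    return total / n_trials

q_base   = best_candidate_quality(100, 0.4)    # modest model, modest throughput
q_bigger = best_candidate_quality(1000, 0.4)   # same model, 10x throughput
q_better = best_candidate_quality(100, 0.5)    # slightly better model

print(f"rho=0.4, n=100   -> {q_base:.2f}")
print(f"rho=0.4, n=1000  -> {q_bigger:.2f}")
print(f"rho=0.5, n=100   -> {q_better:.2f}")
```

Under these assumptions, nudging rho from 0.4 to 0.5 at fixed throughput yields roughly the same quality of picked candidate as multiplying throughput tenfold with the weaker model.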
And something I've proposed, which no one has done yet, is that one could use historical studies to back-test stuff, a bit like they do in machine learning: you divide up the back catalogue of several large drug companies, and you have several hypotheses about how to characterize models.
You take half the data, assess the models in that half, and see whether they predict things better or worse; then you look at some other, untested models and see if you can still do it. That might be a way to build a better evaluation framework. So there are ways of doing it, but I think it'd be quite a lot of work.
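The proposed back-test can be sketched on entirely synthetic data. The model names and hit rates below are invented: we split "historical" program records in half, rank the disease models by hit rate on one half, then check the ranking on the held-out half.

```python
# Toy sketch (entirely synthetic data) of the proposed back-test: rank
# disease models by hit rate on a training half of the historical record,
# then verify the ranking on the held-out half.
import random

random.seed(3)

# Each record: (disease model the program relied on, clinical success?)
history = [("model_A", random.random() < 0.25) for _ in range(400)] + \
          [("model_B", random.random() < 0.10) for _ in range(400)]
random.shuffle(history)
train, test = history[:400], history[400:]

def hit_rate(records, model):
    outcomes = [ok for m, ok in records if m == model]
    return sum(outcomes) / len(outcomes)

# Rank the models on the training half...
ranked = sorted({m for m, _ in history},
                key=lambda m: hit_rate(train, m), reverse=True)
# ...then see whether the ranking still holds on the held-out half.
print("ranking from training half:", ranked)
print("held-out hit rates:", {m: round(hit_rate(test, m), 2) for m in ranked})
```

The real exercise would of course use actual program histories and far richer model characterizations than a single hit rate, but the train/test logic is the same.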
Your second question: things that have been evaluated, that look really good, but that turned out to be poor. Really interesting. I haven't looked for that; I've looked for things that have been poor and then retrospectively tried to understand why they were poor, and I've looked for things that have been good and then retrospectively tried to understand why they were good. I would love to do that, and if anyone's got bright ideas, let me know, because surprises are interesting, so it'd be good to find the surprises. Now, there are some models that no one has any particular reason to believe should work, but they do. And again, I may be showing my ignorance here, but there are really obscure things: for example, there are certain guinea pigs, certain rodents, that are very good at detecting the anxiolytic effect of benzodiazepines, but only benzodiazepines. Why that should be the case, I've got no idea. So I can think of good models that no one understands, but I can't think of models that should be good but are bad.

Jim Bosley: I guess I'll ask one question: this highlights the problem of false negatives, and of assessing the number of false negatives. Care to comment on that, Jack?

Jack Scannell: I'm going to plug the paper that you and I published in 2016, where we talked about this in more detail. There's a sort of assumption that we don't throw away good stuff; the drug industry thinks in terms of attrition, as in: we're trying to get rid of a lot of bad stuff. But actually, if models are just poorly correlated with reality, you're not just progressing bad stuff, you're also throwing away a lot of good stuff, which comes back to the point about field observation. Why is it that so many uses of drugs are observed in the field? It's because there you've got a high true positive rate.
And when you've got very high throughput and a poor model, you're throwing away the vast majority of good drugs. So if you've got a bad model and high throughput, you can be pretty sure that nearly all the good drugs are rejected.
It's not simply that you're rejecting bad stuff; you throw away lots and lots of good stuff with bad models, or, put another way, you have a very low discovery yield.