September 15, 2017

**My Farewell Lecture**

http://richardgill.weblog.leidenuniv.nl/2019/01/30/my-farewell-lecture/

**From killer nurses to quantum entanglement**

Being a statistician has been, for me, a tremendous privilege. It has been an opportunity to enjoy doing mathematics in the most varied fields of application you can imagine, combining the joy of discovering mathematical beauty and the satisfaction of contributing to real world problem solving, as well as the excitement of learning about new fields, perhaps far from mathematics. Over the years I’ve worked both on fundamental problems within mathematical statistics and on rather specific problems from various applied fields. Today I want to tell you some stories about my most recent experiences in two outside areas: criminal law, and quantum physics.

“Forensic statistics” has become an established field, with its own research community, its own journals and conferences. An apparently strong degree of agreement among practitioners has emerged as to what a statistician is supposed to do, when asked by a court to quantify the weight of evidence. Courts are becoming accustomed to having to evaluate statistical or probabilistic evidence, as epitomized by the probability of a chance DNA match.

My introduction to this field was the case of the Dutch hospital nurse Lucia de Berk, sentenced to life imprisonment for alleged murders and attempted murders of her patients. The case was triggered by the supposedly unexpected death of a 6-month-old baby at the Juliana children’s hospital in the Hague, a few days before 9-11. Lucia’s initial conviction was largely based on a statistical analysis of the distribution of incidents on her ward between shifts when she was on duty and shifts when she was not on duty. This generated data which could be summarized in a 2×2 contingency table. Later, data from two other wards at a hospital where she had earlier worked were added. The standard analysis (Fisher’s exact test, based on the hypergeometric distribution) suggested that many, many more incidents occurred in Lucia’s shifts than should be expected to occur by chance. Anyone could understand the argument. It was enough for the judges at the lower court in the Hague.
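For readers who want to see what such a calculation looks like, here is a minimal sketch of a one-sided Fisher exact test computed directly from the hypergeometric distribution. The function name and the counts in the example are hypothetical illustrations; they are not the data used in the case.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: shifts with / without the nurse on duty.
    Columns: shifts with / without an incident.
    Returns P(top-left cell >= a) with all margins held fixed.
    """
    row1 = a + b        # shifts with the nurse on duty
    col1 = a + c        # shifts with an incident
    n = a + b + c + d   # all shifts
    k_max = min(row1, col1)
    # Hypergeometric tail: sum over all tables at least as extreme as observed.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, k_max + 1))
    return tail / comb(n, col1)


# Purely hypothetical counts, NOT the data from the case:
# 100 shifts on duty (8 with an incident), 400 shifts off duty (2 with an incident).
p = fisher_exact_one_sided(8, 92, 2, 398)
print(f"one-sided p-value: {p:.2e}")
```

The arithmetic is only as good as the two lists that feed the table, which is exactly the point taken up further on.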

On appeal, the verdict was the same but the argument adopted by the court completely changed. Instead of statistical calculations there was now new, apparently hard medical evidence. Baby Amber had died of an overdose of digoxin; Lucia had opportunity and motive to administer it. This made her a murderer and a liar. Successive cases required successively less evidence to convince the court that here too, Lucia was murdering her patients: the famous “chain argument”.

At some point brother and sister Metta de Noo (a medical doctor) and Ton Derksen (emeritus professor of philosophy of science) started to reach the media with claims that the case was completely bungled. A book by Derksen exploded the reasoning of the court and cast doubt on the medical “facts” of the case. A popular movement calling for a re-trial began to grow and a number of statisticians were among the first to loudly voice their support. One of the important things to do was to neutralise the effect of the “one in 342 million” chance that was part of the initial conviction. It had had a profound effect on everyone’s thinking about the case, yet it turned out to have been a fantasy. Aside from a major technical error concerning the proper way to combine the results of three 2×2 tables (the product of three p-values is not a p-value), doubts arose as to the integrity of the data used in the analysis. It basically consisted of two lists: at which shifts was Lucia present; at which shifts was there an incident. How were those lists compiled? What is an incident?
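Why is the product of three p-values not a p-value? A minimal sketch, under the idealising assumption that the three tests are independent and that each p-value is exactly uniform on (0, 1) when nothing unusual is going on (Fisher's exact test is discrete, so this is only approximately true there). Under that assumption the chance that a product of k p-values falls below q is q multiplied by the sum of (-ln q)^j / j! for j from 0 to k-1, which is always larger than q itself. The function name and the three input values are hypothetical.

```python
from math import factorial, log


def product_to_pvalue(q, k):
    """P(product of k independent Uniform(0,1) p-values <= q)."""
    if q <= 0.0:
        return 0.0
    if q >= 1.0:
        return 1.0
    t = -log(q)
    return q * sum(t ** j / factorial(j) for j in range(k))


# Hypothetical p-values from three separate tables, NOT the figures from the case.
p_values = [0.01, 0.01, 0.01]
q = 1.0
for p_value in p_values:
    q *= p_value

combined = product_to_pvalue(q, len(p_values))
print(f"naive product: {q:.2e}")
print(f"probability of a product this small under the null: {combined:.2e}")
```

With these illustrative inputs the honest probability is roughly a hundred times larger than the naive product, so multiplying p-values and reading the result as a p-value badly overstates the evidence.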

A formal definition was never made. Most deaths and some but not all reanimations were included. It is hard to escape the conclusion that the main ingredient was (a) some element of medical surprise, and (b) Lucia’s presence. Since no-one has ever gone back to original hospital records, taken an objective criterion, and checked every single shift, we will never get to see objective data. What we do know now is that the list was compiled by medical doctors at the hospital who were highly suspicious of Lucia, who had been the subject of malicious gossip for half a year already at the time of Amber’s death.

After a long struggle the case was reopened, the court focussed on the medical evidence only, and Lucia was completely exonerated.

The president of the court at the Hague has several times called for some kind of post-mortem into the Lucia case. I believe that the court actually behaved quite wisely and carefully, but that it was fooled by what effectively was a subconscious conspiracy of hospital staff (doctors and administrators) who fed one another half truths concerning a number of cases where real medical errors had been made and who found it easy to connect at the time unexpected bad treatment outcomes to a notable nurse. As a consequence of the social mechanisms operating inside the hospital, the data (in a broad sense) passing from the hospital to police investigators and later to a judicial investigation was always strongly biased. How could an outsider have known?

It was this background that got me involved in two more “serial killer nurse” cases: Ben Geen in England, Daniela Poggiali in Italy. I was shocked to the core by the similarities I found in those cases with the Lucia case; but shocked even more that it seemed impossible to have any impact on the outcome. Which raises the question: did we actually have any impact on the outcome of Lucia’s case? How did it come about that her conviction was overturned? More constructively: what should the next statistician do who gets involved in a case like this?

I’ll talk today about the British case only. This case had played out during the middle years of the Lucia events, but as far as I know, nobody had ever made a link between the two cases. In 2004 Ben Geen was charged with causing grievous bodily harm, resulting in two deaths, to 18 patients at Horton General Hospital, Banbury, during a few winter months (December 2003 – February 2004). He was ultimately convicted in 17 of the 18 cases. In 7 of these cases, respiratory arrest was claimed to have been caused by administration of a muscle relaxant. There was essentially no direct evidence for any wrong doing by anyone, let alone by Geen.

The case seemed to turn on the question whether or not “respiratory arrest” is an unusual occurrence in a hospital’s accident and emergency unit. Lawyers for Geen, attempting to get the case reopened by the UK’s CCRC (Criminal Cases Review Commission), used FOI requests to obtain data from a large number of hospitals giving the monthly numbers of events of various kinds, over a period of nearly 10 years. I made some analysis of this data and concluded that though respiratory arrest is three times less common than cardiac arrest, it certainly cannot be called “rare”. However, this whole exercise turned out to be a wild goose chase. On the one hand, it was already known that the number of respiratory, cardiac and hypoglycaemic arrests causing sudden transfer from A&E to intensive care was only one larger than the previous year (7 versus 6; actually, the number of admittances was 10% larger as well). This was offered as an excuse for why the hospital had not been able to stop Geen earlier: the number of such events was close to what one would expect in the winter months. Cardiac arrests were strikingly down, respiratory up. “Publicity bias”? (aka “awareness bias”): once one case has been so diagnosed, we observe many more. The distinction is often not clear cut. The main reason used by the CCRC to turn down the application to reopen the case was that the key point was not “respiratory arrest” per se, but “unexplained respiratory arrest”. Indeed, a key prosecution witness, a professor of anaesthesiology, had stated that unexplained RA is very rare: in fact he had never ever experienced such a case.

So where did the 18 cases of the charge come from? They were selected by hospital doctors alarmed by a “trigger case”, going on a trawling expedition to find out “which other patients did Ben harm”. We are back with Lucia, and with medical doctors who are already suspicious of a nurse, themselves deciding case by case whether what they see can be medically explained or not … knowing case by case whether or not a particular nurse, already suspected of murder, was present or not. In fact, they are only investigating the patients of the nurse who is already under suspicion! Experts called by the defence did not find any of the 18 events suspicious. Some were difficult to understand. Even the prosecution experts couldn’t decide what Ben Geen had done to the patient!

In the Netherlands, a board of wise judges weighs all the evidence. In England, a jury is directed by a judge. In the Ben Geen case, the judge gave pretty clear directions to the jury, what they should think. In particular, the evidence from a medical statistician called by the defense, which addressed the multitude of ways that bias can enter in medical diagnosis, was discarded by the judge on the grounds that “it was barely more than common sense”.

Ben Geen, like Lucia, stood out in the crowd. He had been associated by his own colleagues with unexpected events: they gave him the nickname “Bev Allitt”, after an earlier UK Lucia case. His career aim was to join the army and he was enthusiastic to get tough experience. And (supposing he’s innocent) he made a terrible mistake. He arrived at the hospital with a half-full syringe of muscle relaxant in his pocket at exactly the moment when the police were waiting there to interrogate him. His claim: he had accidentally taken it home in his nurse’s scrubs after a particularly chaotic day in Emergency. His girl-friend, also a nurse, had found it and told him to return it to the hospital to be disposed of properly.

I have noticed that people’s judgement of the Ben Geen case depends almost entirely on whether or not they find his claim believable (I do). And indeed, this is the hardest piece of evidence which there is in the whole case: the medical evidence on all those 18 patients is truly wafer thin.

So, what did I learn from these experiences? What is a forensic statistician to do, the next time around?

The current dogma in forensic statistics is “compute the likelihood ratio”. That is the ratio of the probability of getting the data (for instance, a DNA match), in the case that the suspect is guilty, to the probability of getting the data, in the case that the suspect is innocent. There are many challenging issues with this paradigm. Is “the probability” as well defined as the words suggest? That depends on what we mean by probability and this is where a major conflict between concepts rises to the fore. In short: does probability mean “personal degree of belief”, or is it an objective property of a physical system? In either case, anyway, do we know it? If statisticians are going to report probabilities in court, they are going to have to explain to the court what they mean by probability. Unfortunately there are several meanings available. As a mathematician, one can be neutral: the *rules* of probability are the same. However as soon as one wants to apply this mathematics in practice, one has to make a choice. Presently, the community seems to be moving to an uneasy compromise position. Bayesian (subjective) probabilities are fine, but the prior probabilities should be chosen objectively. This allows both data dependent priors, and priors based on principles of “non informativeness”. In both cases however, the “pure” Bayesian position has to be abandoned. The court is not asking the statistician for their actual personal belief, but for the beliefs of a mythical independent, capable, well-informed scientist. These beliefs must be scientifically reproducible: from the same assumptions as to prior knowledge anyone gets the same result as to statistical conclusions, and everyone agrees that the assumptions are justified.

For serial killer nurse cases, we are far, far away from being able to write down well justified and mathematically tractable statistical models for nursing roster and patient admission and discharge (or death) data. Even if we could resolve the issues of Bayesian versus frequentist approaches, there is no way to follow the likelihood ratio dogma. I think that each case needs to be studied on its own merits.

What is really important is to understand the generation of the data. Data generated by a hospital in the course of its own investigations into a suspected killer nurse is suspect data. It is data prepared by a biased witness (and possibly even by a culprit). The statistician has to convince judges, defense lawyers, journalists to be suspicious. Do not take data at face value. Hospitals are not independent forensic research institutes. Hospital doctors are not forensic scientists.

Also in the Italian case I mentioned, Daniela Poggiali, we discovered extraordinary anomalies in the data. It became absolutely clear that the time of death written in hospital records is the result of a complex process influenced by administrative processes as much as medical truth. One cannot take hospital records at face value.

I’m convinced that Ben Geen is innocent. His case seems hopeless. There is no hard medical evidence. The statistical evidence is in his favour. But it has all been heard in court and successive juries have found him guilty. His only chance seems to me that a new medical analysis of some of the cases of his alleged victims turns up convincing medical evidence that Ben had nothing to do with the alleged medical incident. For this it will be necessary to get powerful medical supporters on Ben’s side. I don’t see it happening. The syringe is firmly in the way.

Did the efforts of so many outsiders, laypersons and scientists, save Lucia? I’m now inclined to believe that the success of the Lucia story was actually to a large extent a matter of extremely good luck; just as the initial case against her was based on what for her was largely bad luck. The surplus of deaths on her shifts was real but was just bad luck. It was part of a chain of events which can be understood as a social phenomenon: a witch hunt led to a witch trial. Everybody involved played their part and the wheels of justice turned and did their job. How did Lucia get freed again? Mainly through the extraordinary good luck that a whistle-blower from medicine, Metta de Noo became involved. Metta had inside knowledge through a personal connection at the hospital and strong and broad medical expertise. I’m afraid that Ben Geen will remain lost as long as no influential, authoritative, medical expert stands up on his behalf.

There is a book to be written about these and other cases. Not just about statistics but also about social psychology and the modern medical world.

Let me turn to a more cheerful topic. Over the years I have kept coming back to the famous Bell inequalities, and Bell theorem: a theoretical analysis made by physicist John S. Bell in 1964, which shows that quantum physics is dramatically different from classical physics: it predicts phenomena which could only occur in a classical physical world with the help of “action at a distance”, which roughly means: changes that you make to a physical system in one place can be felt (can be noticed) at other distant locations, instantaneously.

Just as shocking is the weaker deduction that information is being transmitted faster than the speed of light. But what does “changes can be felt” mean, exactly? The correct deduction is that *if* the quantum predictions actually have a classical physical explanation, *then* action at a distance takes place in a hidden world “behind the scenes” which is only partially visible to us. Action at a distance at a hidden level is needed to explain the observed facts in a mechanistic way. But there are no observable instantaneous changes at one place due to changes at some distant location. There is information flow in the hidden world behind the scenes which allows coordination of events at distant locations, but without information flow occurring in the outside world.

Bell’s 1964 analysis was theoretical. He described a thought experiment and contrasted the predictions of quantum mechanics with the restrictions which a classical picture of the experiment would entail. But can we actually do the experiment? It took more than 50 years before physicists were able to successfully perform a rigorous experiment. In recent years they were on the brink and several groups around the world were racing to be first. At last, in 2015, Delft won the race; Vienna, NIST, and Munich were close behind. The experiment needs a statistical analysis and one of the statistical ideas in that analysis was contributed by myself, a number of years earlier. I will try to explain it to you.

The actual experiment involves lasers, photons, mirrors and crystals; I’ll replace these with some persons playing a game called the Bell game.

Many times (rounds), the following is repeated. Alice and Bob, in separate rooms, each receive a key. Their key may be silver and it may be gold. What keys they get is completely random. They don’t know the other’s key, only their own. Now they may each use their key just once, turning it clockwise or anticlockwise. Their aim is to open a box and share the prize inside. The box only opens in the following circumstances: if one or both keys are silver then both keys have to be turned in the same direction; but if both keys are gold then they have to be turned in opposite directions. Trouble is, Alice does not know which key Bob has, and Bob does not know which key Alice has. In order to maximise the chance that they get the box open it would seem wise to coordinate their actions in advance, telling each other what they’ll do under either eventuality. It’s not difficult to see that any such plan results in failure at least one time out of four.

There are just 16 different plans. Because each plan is an assignment of clockwise or anti-clockwise to each of the four: Alice silver, Alice gold, Bob silver, Bob gold. We could just list them and check for each plan for which of the four eventualities it succeeds and which it fails. That’s tedious but easily doable. But I’ll try to give a shorter analysis.

One of the 16 plans is: Alice and Bob both turn clockwise, anyway. It fails for gold, gold, but wins otherwise.

To compensate for the gold-gold failure when both Alice and Bob always turn their key the same way, whatever it is, it’s natural to modify the plan by saying: the same (always clockwise), except that when Bob gets gold, he turns his key anti-clockwise. This plan now fails for the combination silver-gold, wins otherwise.

Trying out a few more combinations one will quickly make the following discovery. Every time we change just one assignment, we reverse the status (win/lose) in two of the four silver/gold combinations (Alice’s key and Bob’s key). This means that we can win three out of four times, or one out of four times, but nothing else. In particular, always winning is impossible; and the best winning chance is three out of four. There are eight ways to achieve this; the other eight of the sixteen all win only once in four times.
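The tedious check of all 16 plans is easy to hand over to a computer. Here is a minimal sketch; the function and variable names are my own choices for illustration, and the rule for opening the box is the one stated in the description of the game.

```python
from itertools import product

KEYS = ("silver", "gold")
TURNS = ("clockwise", "anticlockwise")


def box_opens(alice_key, bob_key, alice_turn, bob_turn):
    """Rule of the Bell game: opposite turns are needed only for gold-gold."""
    same = alice_turn == bob_turn
    if alice_key == "gold" and bob_key == "gold":
        return not same
    return same


def wins_per_plan():
    """For each of the 16 plans, count in how many of the 4 key pairs it wins."""
    counts = []
    for a_silver, a_gold, b_silver, b_gold in product(TURNS, repeat=4):
        alice = {"silver": a_silver, "gold": a_gold}
        bob = {"silver": b_silver, "gold": b_gold}
        wins = sum(box_opens(ak, bk, alice[ak], bob[bk])
                   for ak in KEYS for bk in KEYS)
        counts.append(wins)
    return counts


counts = wins_per_plan()
print("distinct win counts:", sorted(set(counts)))
print("plans winning 3 of 4:", counts.count(3))
print("plans winning 1 of 4:", counts.count(1))
```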

Alice and Bob might use those different plans with different probabilities. The average success rate will be something between a quarter and three quarters. It can’t be more.

Is there anything else which Alice and Bob could do? What if we replace Alice and Bob by computers running some computer programs? What if we replace them by arbitrary physical systems?

The argument that it’s not possible to better a 75% success rate goes through as long as we can argue that Alice and Bob might just as well decide, in advance, what each is going to do in each eventuality. For instance, a random choice of direction, depending on which key Alice is given, might just as well be implemented by making a random choice for each possible key, in advance. So using computers or physical systems to implement random choices (even with probabilities depending on the key) is just the same as choosing one of the 16 fixed plans at random (perhaps according to some very elaborate probability distribution).

At least, this is true if we use physical systems whose physics satisfies a property called realism; some people prefer the term “counterfactual definiteness”. It means: if a physical system can be measured in two different ways, then it’s possible to imagine the outcomes of both measurements, even if only one can actually be performed. The potential outcome of the not performed measurement still exists (in a mathematical sense), even if not revealed. The choice of measurement merely selects which of two preexisting values gets to be seen.

This would certainly be true of computers set up to simulate our experiment. And it is an obvious property of all pre-quantum physical theories. The final state of a physical system depends deterministically on its initial state. We may not be able to do the computation, or we may not know the initial state, and tiny variations in initial state might have enormous impact on the outcome, but even if the outcomes are effectively random we can still understand their statistics from deterministic reasoning. The physics is essentially deterministic.

Quantum mechanics was a revolutionary step in physics since it never even attempts to explain why what happens does happen. It merely tells us what are the probabilities of what can happen. And since the birth of quantum mechanics, physicists have dreamt of eventually coming up with theories which explained how those probabilities arose, as the reflection of uncontrolled and unknowable initial conditions of a richer and essentially deterministic system. At least, that was the dream, one might say, till Bell came along.

Bell not only showed that classical physics did not allow a bigger success rate than 75%, but also showed that using randomisation devices involving a phenomenon from quantum mechanics called entanglement it was theoretically possible to break the 75% success rate barrier. This means that if the randomness in this experiment is merely the reflection of random initial conditions in hidden layers of a richer, deterministic, description of what is going on, then that deterministic description exhibits action at a distance, in short “non locality”.
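To make the quantum side concrete, here is a minimal sketch of one standard textbook strategy; it is not taken from the lecture. It assumes the usual idealised rule that, for a maximally entangled pair measured at angles a and b, the two outcomes agree with probability cos²(a − b). The particular angles are the conventional choice and all names are hypothetical.

```python
from math import cos, pi, sqrt

# Assumed measurement angles (radians), one per key, for each player.
ALICE_ANGLE = {"silver": 0.0, "gold": pi / 4}
BOB_ANGLE = {"silver": pi / 8, "gold": -pi / 8}


def prob_same_turn(a, b):
    """Assumed textbook rule: outcomes agree with probability cos^2(a - b)."""
    return cos(a - b) ** 2


def quantum_win_probability(alice_key, bob_key):
    """Chance of opening the box for a given pair of keys."""
    p_same = prob_same_turn(ALICE_ANGLE[alice_key], BOB_ANGLE[bob_key])
    if alice_key == "gold" and bob_key == "gold":
        return 1.0 - p_same   # gold-gold needs opposite turns
    return p_same             # otherwise the turns must agree


average = sum(quantum_win_probability(a, b)
              for a in ALICE_ANGLE for b in BOB_ANGLE) / 4
print(f"average success rate: {average:.4f}")
print(f"1/2 + sqrt(2)/4     : {0.5 + sqrt(2) / 4:.4f}")
```

Under these assumptions every one of the four key combinations is won with the same probability, a little over 85%, which is why no single combination has to be sacrificed as in the classical plans.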

Bell described a schematic experiment which since 1964 has actually been performed many times, confirming quantum mechanics and achieving a higher than 75% success rate; however, till 2015, all the experiments suffered from major defects. A higher success rate could mean that Alice’s key is somehow known to Bob, and vice-versa. To rule out any mundane explanations it is necessary to choose Alice’s key (silver or gold) at random, at Alice’s location, just before Alice turns it clockwise or anti-clockwise, and within such a short time interval, that there is no way that what is happening at Bob’s place can be known before Alice is done. So Alice and Bob should be far apart in space, while the time interval between creating inputs and obtaining outputs must be small.

Satisfying these constraints makes the experiment “loophole free”. And performing a successful, loophole free experiment has been a holy grail in quantum physics for many decades … till 2015 and the Delft experiment.

In Delft, at each of the two locations there is a diamond with a Nitrogen-vacancy defect, home to an electron spin which can be manipulated and interrogated with lasers. Before each round, the two spins, in laboratories one at each end of the campus, are brought into their entangled state by a process called entanglement swapping which involves tricks with lasers, mirrors and crystals at a central location as well as at the two laboratories. Call this phase “preparation”: it’s the stage at which a strategy for Alice and Bob can be thought of as being established. Then the inputs are fed into the spins and the decisions are read out. All within a tiny time-interval preventing any kind of communication by any subluminal means.

In Delft, the game was played 245 times and the success rate was 80%. That is just statistically significantly bigger than 75%. The result has been confirmed by other research groups. I think we can be quite sure that the Delft result is not a mere chance fluctuation. By the way, according to quantum mechanics the highest possible success rate is 85% (more precisely: probability a half plus a quarter of the square root of two). There has been significant progress in recent years in showing that any physical theory which would allow higher success rates still would have counter-intuitive features even more weird than quantum mechanics.
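A rough back-of-envelope check of “just statistically significant” can be sketched as follows. It makes the simplifying assumption that the 245 rounds are independent and that the classical success probability is exactly 75% in every round; the published analysis is more careful than this, and the function name is hypothetical.

```python
from math import comb, sqrt


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


rounds = 245
successes = round(0.80 * rounds)   # an 80% success rate over 245 rounds
classical_limit = 0.75

p_value = binomial_upper_tail(rounds, successes, classical_limit)
quantum_limit = 0.5 + sqrt(2) / 4  # a half plus a quarter of the square root of two

print(f"successes: {successes} out of {rounds}")
print(f"chance of doing at least this well at 75%: {p_value:.3f}")
print(f"quantum upper limit on the success rate: {quantum_limit:.4f}")
```

The tail probability comes out at a few percent: small enough to be called significant, but not by a wide margin, which is why independent confirmation by other groups matters.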

Since Delft achieved an 80% success rate, statistically significantly larger than the theoretical long term optimum 75%, must we conclude that those two quantum spins do actually communicate with one another, at superluminal speed? (That is what the newspapers would have you believe). Following Bell I’ve shown you that any classical explanation of the process whereby query is converted into response would need superluminal communication. That is because we can imagine that after the joint system is prepared, each component is cloned, and each copy gets a different one of the two inputs. After that we simply select which copy is needed in the specific case at hand. This thought experiment has shown that the whole system is acting as if it had just chosen one of those sixteen decision rules and then followed the rule blindly. But then it would only have had a 75% success rate. The only way to do better, classically, is to have prior information as to what the inputs will be, and this requires communication between the two locations.

But there is an alternative, which is to suppose that classical explanations of the underlying process do not exist. Instead, the quantum mechanical description is the bottom line. The measurements of those quantum spins are truly random. Quantum systems can’t be cloned. The two responses to the two possible inputs are not predetermined. Instead, the one which is asked for is created afresh. It is irreducibly random: unlike the outcome of the toss of a coin or a dice, which is a deterministic function of the initial conditions of the coin or dice throw.

Quantum randomness is actually a creative phenomenon which makes things possible in a world run according to the laws of quantum mechanics, which would be impossible in a classical physical world. Indeed, the up to 85% success rate possible in the Bell game can be harnessed to give applications in cryptography, for instance, which are impossible in a classical world.

Quantum physics is a far more radical departure from classical physics than most physicists imagine. Randomness has become a “ground-level” part of the description of reality, it is not an emergent feature. Irreducible randomness is necessary in order to reconcile the statistics of the Bell game with no-action-at-a-distance. Quantum randomness is real and needs to be included in the axiomatic ground-floor of physics, not added as an after-thought. Irreducible randomness creates in some cases barriers (there are a number of famous “impossibility” theorems in quantum mechanics, such as the no-cloning theorem) but it also creates opportunities, new possibilities.

I still need to explain to you the contribution which I made to this history. Quite a few years ago I was engaged in discussions with a well-known mathematician who believed that the Bell experiment could be explained in classical physical terms. In fact, he believed that a successful experiment could even be simulated on a network of ordinary computers. I was sure that he was wrong, and came up with the idea of making a bet. The idea was to play the Bell game with a network of computers. My opponent was to have complete freedom in programming the three computers. I insisted on the inputs being supplied completely at random, by a trusted third party. And I proposed a number of rounds and a win/lose criterion. For instance: at least 80% success rate, 10 thousand rounds. I wanted to be pretty certain of winning, and therefore I had some worries. As the game progresses, my opponent might be learning from the earlier rounds, and might be adapting his strategy. There is simply no guarantee that the rounds are independent and the probabilities of winning each round constant. Everything might be changing as time goes on. Moreover, my opponent is just running programs on classical computers: the results are deterministic. How can I do any probability calculations to find out if my bet is safe or not?

The answer was to use the probabilities which are under my control: the repeated, random choices of inputs. I noticed how to express the win/lose criterion from a traditional statistical analysis in terms of a simple count of success/fail over the rounds and how to use martingale theory to get probability bounds on the final total number of successes. The point here is that the argument that the success probability is at most 75% also applies to each round separately, given all information gathered in preceding rounds; and the probability comes from the deliberate randomisation, not from the physics. To my delight the physicists have taken up (and both simplified and refined) this idea so that finally the statistical conclusions of the Delft experiment, as well as of the others, cannot be fought on the grounds that time-dependence and time-variation invalidate the statistical analysis, as long as one is prepared to trust the randomness of the inputs in the experiment: the keys.
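A small simulation can illustrate (though certainly not prove) the point. The opponent below is a hypothetical one of my own invention: it learns from all earlier rounds and always plays the plan with the best track record so far. The only probabilistic ingredient is the trusted random choice of keys, drawn after the plan for the round has been fixed. This is a sketch of the setting, not the martingale argument itself.

```python
import random
from itertools import product

# A plan lists the turn (0 or 1) for: Alice silver, Alice gold, Bob silver, Bob gold.
PLANS = list(product((0, 1), repeat=4))


def box_opens(alice_gold, bob_gold, alice_turn, bob_turn):
    """Rule of the Bell game: opposite turns are needed only for gold-gold."""
    same = alice_turn == bob_turn
    return (not same) if (alice_gold and bob_gold) else same


def play(rounds, seed):
    """Adaptive classical opponent against trusted random keys; returns the wins."""
    rng = random.Random(seed)      # stands in for the trusted third party
    wins = [0] * len(PLANS)
    uses = [0] * len(PLANS)
    total = 0
    for _ in range(rounds):
        # The opponent adapts: pick the plan with the best record so far.
        idx = max(range(len(PLANS)),
                  key=lambda i: (wins[i] + 1) / (uses[i] + 1))
        plan = PLANS[idx]
        # Only now are the keys drawn, independently of everything before.
        alice_gold = rng.random() < 0.5
        bob_gold = rng.random() < 0.5
        alice_turn = plan[1] if alice_gold else plan[0]
        bob_turn = plan[3] if bob_gold else plan[2]
        won = box_opens(alice_gold, bob_gold, alice_turn, bob_turn)
        uses[idx] += 1
        wins[idx] += won
        total += won
    return total


rounds = 10_000
rate = play(rounds, seed=1) / rounds
print(f"success rate of the adaptive opponent: {rate:.3f}")
```

However cleverly the plan is chosen from past rounds, each new round is still won with probability at most 75% given the past, so an 80% rate over ten thousand rounds stays out of reach.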

Coming to the end of my lecture I would like to briefly connect serial-killer nurses and quantum entanglement. There is an argument that nurse Lucia de Berk was actually saved by quantum statistics, and here it is. At some point I set up an internet petition asking for a re-trial in the case of Lucia de B. I canvassed for support and was able to use my quantum connection to Gerard ’t Hooft, Nobel prize winner, to have a conversation with him at which I asked him if he too would sign my petition. He said he would think about it. And after a weekend in which I believe he had consulted his family, some of whom are medical doctors, he signed the petition, leaving a remark there which went to the heart of the problem. This was noticed in the media and, I suspect, at the top of the Dutch legal system. It meant that in the subsequent legal proceedings, some of the best people were involved and the case was handled with care and respect.

I want to conclude my lecture with some words of thanks. Leiden University has been the perfect environment in which I could follow my scientific instinct wherever it led. The faculty of science and especially the mathematical institute has been a warm home where teaching and research stimulate one another. Working together with students and PhD students is motivating and rewarding. Official retirement will not put an end to it, I am sure.

In particular during the last years the creation of the master programme “Statistical Science” has been a dream coming true, the driving force behind it not me, but Jacqueline Meulman. I think statistics has a brilliant future in Leiden, cutting across traditional department and faculty divisions.

My family (especially wife and children) and friends have had to put up with my obsessions whether with nurses or the quantum or whatever. I hope that in the future I’ll be a little more attentive to you.

I’ve just been treated to a wonderful symposium organised by Aad van der Vaart, Peter Grunwald and Giulia Cereda at which there have been beautiful contributions by many colleagues around the world, both former students and former teachers, with whom I have so enjoyed working over the years.