
In case you want to binge read the (Strong) Likelihood Principle in 2025


I took a side trip to David Cox’s famous “weighing machine” example a month ago, an example thought to have caused “a subtle earthquake” in the foundations of statistics, because I knew we’d be coming back to it at the end of December when we revisit the (strong) Likelihood Principle [SLP]. It’s been a decade since I published my Statistical Science article on this, Mayo (2014), which appeared with several commentaries, but the issue is still mired in controversy. It’s generally dismissed as an annoying, mind-bending puzzle on which those in statistical foundations tend to hold absurdly strong opinions. Mostly it has been ignored. Yet I sense that 2025 is the year people will return to it, given some recent and soon-to-be-published items. This post gives some background and collects the essential links you would need if you want to delve into it. Many readers know that each year I return to the issue on New Year’s Eve…. But that’s tomorrow.

By the way, this is not part of our leisurely tour of SIST. In fact, the argument is not even in SIST, although the SLP (or LP) arises a lot. But if you want to go off the beaten track with me to the SLP conundrum, here’s your opportunity.

What’s it all about? An essential component of inference based on familiar frequentist notions (p-values, significance and confidence levels) is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, since we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). Stated roughly, the SLP asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

SLP (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments E1 and E2 with different probability models f1, f2, but with the same unknown parameter θ, if outcomes x* and y* (from E1 and E2 respectively) determine the same (i.e., proportional) likelihood function (f1(x*; θ) = cf2(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)

Violation of SLP:

A violation of the SLP occurs whenever outcomes x* and y* from experiments E1 and E2, with different probability models f1, f2 but with the same unknown parameter θ, satisfy f1(x*; θ) = cf2(y*; θ) for all θ, and yet x* and y* have different implications for an inference about θ.

For an example of a SLP violation, E1 might be sampling from a Normal distribution with a fixed sample size n, and E2 the corresponding experiment that uses an optional stopping rule: keep sampling until you obtain a result 2 standard deviations away from a null hypothesis that θ = 0 (and for simplicity, a known standard deviation). When you do, stop and reject the point null (in 2-sided testing).

The SLP tells us  (in relation to the optional stopping rule) that once you have observed a 2-standard deviation result, there should be no evidential difference between its having arisen from experiment E1, where n was fixed, say, at 100, and experiment E2 where the stopping rule happens to stop at n = 100. For the error statistician, by contrast, there is a difference, and this constitutes a violation of the SLP.
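To see the difference concretely, here is a minimal simulation sketch; the particulars are illustrative choices (known σ = 1, n fixed at 100 for E1, and the stopping rule in E2 arbitrarily capped at n = 1000). Under the null, the fixed-n test rejects at roughly the nominal 5% rate, while the try-and-try-again rule rejects far more often; that gap is precisely the error-probability difference the SLP would have us ignore.

```python
# Minimal simulation contrasting the fixed-n experiment E1 with the
# optional-stopping experiment E2 under the null theta = 0 (illustrative
# choices: known sigma = 1, n fixed at 100, stopping rule capped at 1000).
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0          # known standard deviation (assumed)
n_fixed = 100        # E1: fixed sample size
n_max = 1000         # E2: cap on the "try and try again" stopping rule
n_sims = 10_000

reject_fixed = 0
reject_stopping = 0
for _ in range(n_sims):
    x = rng.normal(0.0, sigma, size=n_max)   # data generated under the null
    cum_mean = np.cumsum(x) / np.arange(1, n_max + 1)
    se = sigma / np.sqrt(np.arange(1, n_max + 1))
    # E1: two-sided test of theta = 0 at n = 100
    if abs(cum_mean[n_fixed - 1]) >= 2 * se[n_fixed - 1]:
        reject_fixed += 1
    # E2: stop (and reject) the first time the running mean is 2 SEs from 0
    if np.any(np.abs(cum_mean) >= 2 * se):
        reject_stopping += 1

print(f"Type I error, fixed n = {n_fixed}: {reject_fixed / n_sims:.3f}")
print(f"Type I error, optional stopping (up to n = {n_max}): {reject_stopping / n_sims:.3f}")
```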

———————-

Now for the surprising part: In Cox’s weighing machine example, recall, a coin is flipped to decide which of two experiments to perform.  David Cox (1958) proposes something called the Weak Conditionality Principle (WCP) to restrict the space of relevant repetitions for frequentist inference. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of the particular Ei. Nothing could be more obvious.     

The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP to mixture experiments together with so uncontroversial a principle as sufficiency (SP)–although even sufficiency has been shown to be dispensable to the argument, strictly speaking. Were this true, it would preclude the use of sampling distributions. L. J. Savage calls Birnbaum’s argument “a landmark in statistics” (see [i]).

Although his argument purports to show that [(WCP and SP) entails SLP], in fact data may violate the SLP while both the WCP and SP hold. Such cases also directly refute [WCP entails SLP].

Binge reading the Likelihood Principle.

If you’re keen to binge read the SLP–a way to break holiday/winter break doldrums–or if it comes up during 2025, I’ve pasted most of the early historical sources below. The argument is simple; showing what’s wrong with it took a long time.

My earliest treatment, via counterexample, is in Mayo (2010)–in an appendix to a paper I wrote with David Cox on objectivity and conditionality in frequentist inference.  But the treatment in the appendix doesn’t go far enough, so if you’re interested, it’s best to just check out Mayo (2014) in Statistical Science.[ii] An intermediate paper Mayo (2013) corresponds to a talk I presented at the JSM in 2013.

Interested readers may search this blog for quite a lot of discussion of the SLP, including “U-Phils” (discussions by readers) (e.g., here, and here), and amusing notes (e.g., “Don’t Birnbaumize that experiment my friend”).

This conundrum is relevant to the very notion of “evidence”, blithely taken for granted in both statistics and philosophy. [iii] There’s no statistics involved, just logic and language. My 2014 paper shows the logical problem, but I still think it will take an astute philosopher of language to adequately classify the linguistic fallacy being committed.

To have a list for binging, I’ve grouped some key readings below.

Classic Birnbaum Papers:

  • Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57(298), 269-306.
  • Savage, L. J., Barnard, G., Cornfield, J., Bross, I., Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O., and Birnbaum, A. (1962), “Discussion on Birnbaum’s ‘On the Foundations of Statistical Inference’”, Journal of the American Statistical Association 57(298), 307-326.
  • Birnbaum, A. (1969), “Concepts of Statistical Evidence”, in Ernest Nagel, Sidney Morgenbesser, Patrick Suppes, and Morton Gabriel White (eds.), Philosophy, Science, and Method, New York: St. Martin’s Press, pp. 112-143.
  • Birnbaum, A. (1970), “Statistical Methods in Scientific Inference” (letter to the editor), Nature 225, 1033.
  • Birnbaum, A. (1972), “More on Concepts of Statistical Evidence”, Journal of the American Statistical Association 67(340), 858-861.

Note to Reader: If you look at the (1962) “discussion”, you can already see Birnbaum backtracking a bit, in response to Pratt’s comments.

Some additional early discussion papers:

Durbin:

There’s also a good discussion in Cox and Hinkley 1974.

Evans, Fraser, and Monette:

Kalbfleisch:

My discussions (also noted above):

[i] Savage on Birnbaum: “This paper is a landmark in statistics. . . . I, myself, like other Bayesian statisticians, have been convinced of the truth of the likelihood principle for a long time. Its consequences for statistics are very great. . . . [T]his paper is really momentous in the history of statistics. It would be hard to point to even a handful of comparable events. …once the likelihood principle is widely recognized, people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability for statistics” (Savage 1962, 307-308).

[ii] The link Mayo (2014) includes comments on my paper by Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu, and my rejoinder.

[iii] In Birnbaum’s argument, he introduces an informal, and rather vague, notion of the “evidence (or evidential meaning) of an outcome z from experiment E”. He writes it: Ev(E,z).

In my formulation of the argument, I introduce a new symbol ⇒ to represent a function from a given experiment-outcome pair (E,z) to a generic inference implication. It (hopefully) lets us be clearer than Ev does.

(E,z) ⇒ InfrE(z) is to be read “the inference implication from outcome z in experiment E” (according to whatever inference type/school is being discussed).
If E is within error statistics, for example, it is necessary to know the relevant sampling distribution associated with a statistic. If it is within a Bayesian account, a relevant prior would be needed.
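As a toy illustration of the notation (the numbers below are the textbook binomial versus negative binomial pair, not an example from the paper): take E1 a binomial experiment with n fixed at 12 and E2 a negative binomial experiment that samples until the 3rd failure. If both yield 9 successes and 3 failures, the likelihoods are proportional (both ∝ p^9(1 − p)^3), yet the error-statistical inference implications, here one-sided p-values for H0: p = 0.5, differ:

```python
# Sketch of (E, z) => Infr_E(z) for two experiments with proportional
# likelihoods whose error-statistical "inference implications" (one-sided
# p-values) differ, i.e. an SLP violation for the error statistician.
from scipy.stats import binom, nbinom

# Data: 9 successes and 3 failures; H0: p = 0.5 vs H1: p > 0.5.

# E1: binomial experiment, n = 12 trials fixed in advance.
p_value_E1 = binom.sf(8, 12, 0.5)     # P(X >= 9 successes | p = 0.5)

# E2: negative binomial experiment, sample until the 3rd failure.
# Under H0 the failure probability is 0.5, so (relabeling a failure as a
# scipy "success") the number of successes observed before the 3rd failure
# is distributed nbinom(3, 0.5).
p_value_E2 = nbinom.sf(8, 3, 0.5)     # P(Y >= 9 successes | p = 0.5)

# Both likelihoods are proportional to p^9 (1 - p)^3, yet:
print(f"Infr_E1(z): p-value = {p_value_E1:.4f}")   # ~ 0.073
print(f"Infr_E2(z): p-value = {p_value_E2:.4f}")   # ~ 0.033
```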

[iv] I’ve blogged these links in the past; please let me know if any links are broken.




On some publications of Sir David Cox

Abstract: Sir David Cox published four papers in the Scandinavian Journal of Statistics and two in the Scandinavian Actuarial Journal. This note provides some brief summaries of these papers.

“My basic question is do we really need data to be analysed by both methods?”


Ram Bajpai writes:

I’m an early career researcher in medical statistics with a keen interest in meta-analysis (including Bayesian meta-analysis) and prognostic modeling. I’m conducting a methodological systematic review of Bayesian meta-analysis in biomedical research. In reading these studies, I found that many authors present both Bayesian and classical results together, compare them, and usually say both methods provide similar results (trying to validate). However, being a statistician, I don’t see any point in analysing data with both techniques, as these are two different philosophies and either one is sufficient if well planned and executed. Consider me no Bayesian expert; I seek your guidance on this issue. My basic question is: do we really need data to be analysed by both methods?

My quick answer is that I think more in terms of methods than philosophies. Often a classical method is interpretable as a Bayesian method with a certain prior. This sort of insight can be useful. From the other direction, the frequency properties of a Bayesian method can be evaluated as if it were a classical procedure.
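For instance, here is a minimal sketch of that second direction (the setup is invented: normal data with known σ and a weak conjugate normal prior), checking the repeated-sampling coverage of a 95% posterior interval as if it were a classical procedure:

```python
# Evaluate the frequency properties of a Bayesian interval by simulation.
# Assumptions (mine, for illustration): normal data with known sigma,
# conjugate N(0, 10^2) prior on the mean.
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n = 0.3, 1.0, 20
prior_mu, prior_sd = 0.0, 10.0      # weak conjugate prior (assumed)
n_sims = 5_000

covered = 0
for _ in range(n_sims):
    y = rng.normal(true_mu, sigma, size=n)
    # Conjugate update: the posterior for mu is normal.
    post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
    post_mean = post_var * (prior_mu / prior_sd**2 + y.sum() / sigma**2)
    lo, hi = post_mean - 1.96 * post_var**0.5, post_mean + 1.96 * post_var**0.5
    covered += (lo <= true_mu <= hi)

print(f"Coverage of the 95% posterior interval: {covered / n_sims:.3f}")
# With a weak prior this lands close to 0.95; a strong, badly centered prior
# would not, and the same simulation would reveal it.
```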

This also reminds me of a discussion I had yesterday with Aaditya Ramdas at CMU. Ramdas has done a lot of theoretical work on null hypothesis significance testing; I’ve done lots of applied and methodological work using Bayesian inference. Ramdas expressed the view that it is a bad thing that there are deep philosophical divisions in statistics regarding how to approach even the simplest problems. I replied that I didn’t see a deep philosophical divide between Bayesian inference and classical significance testing. To me, the differences are in what assumptions we are willing to swallow.

My take on statistical philosophy is that all statistical methods require assumptions that are almost always clearly false. Hypothesis testing is all about assumptions that something is exactly zero, which does not make sense in any problem I’ve studied. If you bring this up with people who work on or use hypothesis testing, they’ll say something along the lines of, Yeah, yeah, sure, I know, but it’s a reasonable approximation and we can alter the assumption when we need to. Bayesian inference relies on assumptions such as normality and logistic curves. If you bring this up with people who work on or use Bayesian inference, they’ll say something along the lines of, Yeah, yeah, sure, I know, but it’s a reasonable approximation and we can alter the assumption when we need to. To me, what appear to be different philosophies are more like different sorts of assumptions that people are comfortable with. It’s not just a “matter of taste”—different methods work better for different problems, and, as Rob Kass says, the methods that you use will, and should, be influenced by the problems you work on—I just think it makes more sense to focus on differences in methods and assumptions rather than to frame them as incommensurable philosophies. I do think philosophical understanding, and misunderstanding, can make a difference in applied work—see section 7 of my paper with Shalizi.


An apparent paradox regarding hypothesis tests and rejection regions


Ron Bloom wrote in with a question:

The following pseudo-conundrum is “classical” and “frequentist” — no priors involved; only two PDFs (completely specified) and a “likelihood” inference. The conundrum, however, may be interesting to you in its simple scope, and perhaps you can see the resolution. I cannot, and it is causing me to experience something along the lines of what Kendall says somewhere (about something else entirely): “… the problem has that aspect of certain optical illusions; giving different appearances depending upon how one looks at it…”

Suppose I have p(x|mu0) and p(x|mu1), both weighted Gaussian sums with stipulated standard deviations and stipulated weights; for definiteness say both are three-term sums; moreover, all three constituent Gaussians have the common mean named in the expression p(x|mu), so they look like “heavy tailed” Gaussians, at least from a distance.

Suppose mu0 < mu1 are both stipulated too; in fact everything is stipulated, so this is *not* an estimation problem; nothing to do with “EM” or maximum likelihood. Just a classical test between two simple alternatives. A single datum is acquired: x. The classical procedure for deciding between “H0” and “H1” is to choose the test “size”. Put down the threshold cut T on the right tail of p(x|mu0) so the area above that cut is the test size; the power of that test against the stipulated alternative H1 is of course the area above T under p(x|mu1). When the PDFs are Gaussian or in an exponential family, or when “a sufficient statistic is available”, this procedure is identical to what one does using the Neyman-Pearson likelihood criterion, which amounts to putting a cut of the same “size” on the more complicated random variable L(x) = p(x|mu0)/p(x|mu1). When the PDFs are nicely behaved, or more generally when the likelihood ratio is *monotonic*, the probability statement about a rejection test on the variate L(x) translates into a statement about a rejection test on the variate x simpliciter. But in the case of this “nice” Gaussian mixture I discover that for mu1 sufficiently close to mu0 (and certain combinations of weights and standard deviations) the likelihood ratio L(x) is *not* monotonic, and so I am suddenly faced with an unexpected perplexity: it seems (to the eye anyway) that there’s only one way to set up a right-tailed rejection test for such a pair of simple hypotheses, and yet the Neyman-Pearson argument seems to say that making that cut using the PDF of L(x) and making that cut using p(x|mu0) itself will not yield the same “x” --- for the same test size. Can you see the resolution of this (pseudo-)conundrum?

I replied: Yes, I can see how this would happen. Whether Neyman-Pearson or Bayes, if you believe the model, the relevant information is the likelihood ratio, which I can well believe in this example is not a monotonically increasing then decreasing function of x. That’s just the way it is! It doesn’t seem like a paradox to me, as there’s no theoretical result that would imply that the ratio of two unimodal functions is itself unimodal.

Bloom responded:

I finally was able to see what is obvious: that indeed there are many alternative “rejection regions of the same size,” and if the PDF of the “alternative” is bumpy (as in this example), or more generally if the likelihood ratio is not monotone (and this is *not* “easy to see” for ratios of “simple” Gaussian mixtures all of whose kernels have a common mean), then indeed the best (most powerful) test is not necessarily the upper-tail rejection test. See my badly drawn diagram. This, by the way, can be filed under your topic of how the Gaussianity ansatz, sufficiently well learned, can really impede insights that would otherwise be patently obvious (to the unlearned).
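For readers who want to poke at this numerically, here is a sketch along the lines of Bloom’s setup (the weights, standard deviations, and means below are my own choices, picked so the effect shows up). It checks that the likelihood ratio of two mean-matched three-component Gaussian mixtures is not monotone, and that the Neyman-Pearson region of size 0.05 has at least as much power as the upper-tail cut of the same size.

```python
# Non-monotone likelihood ratio for two three-component Gaussian mixtures
# whose components share a common mean, plus a numerical power comparison
# of the Neyman-Pearson region vs. the upper-tail cut of the same size.
# Parameters below are illustrative assumptions, not Bloom's actual numbers.
import numpy as np
from scipy.stats import norm

weights = np.array([0.5, 0.3, 0.2])   # assumed mixture weights
sds     = np.array([0.5, 1.0, 3.0])   # assumed component standard deviations
mu0, mu1 = 0.0, 0.5                    # two simple hypotheses, close together

def mix_pdf(x, mu):
    x = np.asarray(x)[..., None]
    return (weights * norm.pdf(x, loc=mu, scale=sds)).sum(axis=-1)

# Fine grid standing in for the sample space of the single datum x.
x = np.linspace(-15, 15, 300_001)
dx = x[1] - x[0]
p0, p1 = mix_pdf(x, mu0), mix_pdf(x, mu1)
lr = p1 / p0

# 1) The likelihood ratio is not monotone in x:
print("LR monotone increasing for x > 0?", bool(np.all(np.diff(lr[x > 0]) >= 0)))

alpha = 0.05

# 2) Upper-tail test: cut T chosen so that P0(X > T) is (approximately) alpha.
tail_mass = np.cumsum((p0 * dx)[::-1])[::-1]
upper_tail = tail_mass <= alpha
power_tail = (p1[upper_tail] * dx).sum()

# 3) Neyman-Pearson test of the same size: reject where the LR is largest,
#    adding grid cells in decreasing LR order until size alpha is reached.
order = np.argsort(-lr)
cum_size = np.cumsum(p0[order] * dx)
np_region = np.zeros_like(lr, dtype=bool)
np_region[order[cum_size <= alpha]] = True
power_np = (p1[np_region] * dx).sum()

print(f"size-{alpha} upper-tail power:     {power_tail:.4f}")
print(f"size-{alpha} Neyman-Pearson power: {power_np:.4f}")  # at least as large
```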


Teaching general problem-solving skills is not a substitute for teaching math [pdf] (2010)


Article URL: https://www.ams.org/notices/201010/rtx101001303p.pdf

Comments URL: https://news.ycombinator.com/item?id=40890847

Points: 179

# Comments: 143


Is there a balance to be struck between simple hierarchical models and more complex hierarchical models that augment the simple frameworks with more modeled interactions when analyzing real data?


Kiran Gauthier writes:

After attending your talk at the University of Minnesota, I wanted to ask a follow up regarding the structure of hierarchical / multilevel models but we ran out of time. Do you have any insight on the thought that probabilistic programming languages are so flexible, and the Bayesian inference algorithms so fast, that there is a balance to be struck between “simple” hierarchical models and more “complex” hierarchical models that augment the simple frameworks with more modeled interactions when analyzing real data?

I think that a real benefit of the Bayesian paradigm is that (in theory) if the data don’t constrain my uncertainty in a parameter, then the inference engine should return my prior (or something close to it). Does this happen in reality? I know you’ve written about canary variables before as an indication of model misspecification, which I think is an awesome idea. I’m just wondering how to strike that balance between a simple / approximate model and a more complicated model, given that the true generative process is unknown, and noisy data with bad models can lead good inference engines astray.

My reply: I think complex models are better. As Radford Neal put it so memorably, nearly thirty years ago,

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

That said, I don’t recommend fitting the complex model on its own. Rather, I recommend building up to it from something simpler. This building-up occurs on two time scales:

1. When working on your particular problem, start with simple comparisons and then fit more and more complicated models until you have what you want.

2. Taking the long view, as our understanding of statistics progresses, we can understand more complicated models and fit them routinely. This is kind of the converse of the idea that statistical analysis recapitulates the development of statistical methods.
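As a tiny illustration of point 1 above (the group estimates and standard errors below are invented), here is the build-up from complete pooling to no pooling to a simple partial-pooling estimate, a crude stand-in for a full Bayesian hierarchical fit:

```python
# Toy sketch of building up gradually, with invented data:
# complete pooling -> no pooling -> a simple partial-pooling (hierarchical)
# estimate for group means, with known within-group standard errors.
import numpy as np

y  = np.array([28., 8., -3., 7., -1., 1., 18., 12.])   # group estimates (made up)
se = np.array([5., 4., 6., 5., 4., 5., 4., 6.])         # their standard errors

# Step 1: complete pooling -- one common mean for every group.
w = 1.0 / se**2
mu_pooled = np.sum(w * y) / np.sum(w)

# Step 2: no pooling -- each group keeps its own raw estimate.
theta_no_pool = y.copy()

# Step 3: partial pooling -- hierarchical model theta_j ~ N(mu, tau^2),
# with tau^2 set by a crude method-of-moments estimate.
tau2 = max(np.var(y, ddof=1) - np.mean(se**2), 0.0)
shrink = se**2 / (se**2 + tau2) if tau2 > 0 else np.ones_like(se)
theta_partial = shrink * mu_pooled + (1 - shrink) * y

print("complete pooling:", np.round(mu_pooled, 1))
print("no pooling:      ", np.round(theta_no_pool, 1))
print("partial pooling: ", np.round(theta_partial, 1))
```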
