This editorial follows on from a previously published editorial that explained the role of inferential statistics in the scientific method.1 The aim of this second editorial is to highlight the most common errors in the interpretation of the p-value and of statistical significance, in line with recent articles and comments in high-impact journals such as Nature, which echo initiatives such as the call by more than 800 prestigious scientists for an end to the use of significance thresholds and the dichotomous notion of statistical significance.2–5

To understand the above, we must remember that the aim of so-called inferential statistics is to evaluate the role of chance in our results. This can be quantified or estimated by obtaining the standard error and calculating the probability that the results can be explained by chance under the null hypothesis (H0), which yields the p-value in statistical significance tests. This approach, known as null hypothesis significance testing (NHST), was developed in the 1920s and 1930s by Ronald Aylmer Fisher (recognized as the father of inferential statistics) in order to determine which fertilizer increased maize production the most. NHST involves a dichotomous decision: if the p-value is less than a statistical significance threshold (0.05, based on the conventional alpha risk of 5%), the null hypothesis is rejected and the alternative hypothesis is therefore accepted.

This has resulted in a reductionist interpretation, in which if p<0.05, a result is considered significant (e.g. a 120 ml difference in FEV1 between groups in favor of a new inhaled therapy molecule versus another standard treatment) and “there are differences between the two treatments”, whereas if the same treatment with the same 120 ml difference has a p of, for instance, 0.06, it is considered non-significant.

The main objective of this editorial is to make clear that non-statistically significant differences are not synonymous with equivalence: the fact that a result is not statistically significant does not necessarily imply that the interventions are equivalent. Yet the authors of a published study were alarmed to find that in more than 50% of articles, when p is non-significant, it is erroneously concluded that “there are no differences between the 2 treatments” or, worse still, that both drugs or interventions are “equal or equivalent”.2,6–9

This editorial does not aim to provide a comprehensive explanation of statistics, but we should remember that when we accept the null hypothesis (H0), we run the risk of a beta error: the probability of not having found differences when they actually exist, that is, the probability of not rejecting the null hypothesis when it is false. Its complement is statistical power (1 - beta), the probability of finding statistically significant differences if they really do exist.

There is an example in English where a researcher is compared to Michael Jordan (the basketball player)10 and another, adapted to Spanish, where the ability of a researcher and Leo Messi (the soccer player) to shoot penalties is compared.11

In the latter example, both shoot 8 penalties from the same positions against a defensive wall of 5 players. Messi scores all 8, every one into the back of the net, while the researcher scores 4 and misses the other 4. At home that night, the researcher enters the data into the computer to check whether there is a statistically significant difference between their scores and Messi’s, calculating the p-value with Fisher’s exact test (2-tailed). The p-value is 0.077; in other words, the difference is not statistically significant.
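The arithmetic of this example can be reproduced with a standard statistics library. A minimal sketch, assuming scipy is available, with the 2 × 2 table pitting Messi’s 8/8 against the researcher’s 4/8:

```python
# Fisher's exact test (two-tailed) on the penalty-shootout data.
# Rows = shooter, columns = (goals, misses).
from scipy.stats import fisher_exact

table = [[8, 0],   # Messi: 8 goals, 0 misses
         [4, 4]]   # researcher: 4 goals, 4 misses

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(p_value, 3))  # ≈ 0.077: not significant at alpha = 0.05
```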

If the researcher goes to bed happy in the knowledge that there are no differences between their penalty shootout results and Messi’s, they are being easily fooled, because in reality there clearly are differences between the two. If we accept the null hypothesis here, we commit a beta-type error, which in this case is large because the power of the study to detect differences is low, owing to the small sample size (the number of penalties taken).
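The low power of this 8-penalty design can be quantified exactly. A sketch, assuming scipy, in which Messi is modeled as scoring every penalty and the researcher as a true 50% scorer (both modeling choices are ours, for illustration): the power is the total probability of the researcher’s outcomes for which Fisher’s test reaches p < 0.05.

```python
# Exact power of the 8-penalty comparison: Messi always scores (8/8),
# the researcher scores each penalty independently with probability 0.5.
from scipy.stats import binom, fisher_exact

n, p_score, alpha = 8, 0.5, 0.05
power = 0.0
for goals in range(n + 1):
    table = [[n, 0], [goals, n - goals]]       # (goals, misses) per shooter
    _, p_value = fisher_exact(table, alternative="two-sided")
    if p_value < alpha:                        # this outcome would reject H0
        power += binom.pmf(goals, n, p_score)  # weight by its probability

print(round(power, 2))  # ≈ 0.36: well under a 50% chance of detecting the difference
```

With only 8 penalties, a true and obvious difference goes undetected almost two times out of three, which is exactly the beta error discussed above.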

Let’s not forget that the standard error is used both in the p-value approach to significance and in the construction of 95% confidence intervals (95% CI). The latter also support rejection of the null hypothesis, but in addition the interval conveys the plausible range of the “effect size”, and its width, whether narrow or wide, the precision of the study.

Logically, in the Messi example, the 95% CI of the difference in the percentage of goals will be very wide, that is, very imprecise. If we increase the number of penalty shots to, for example, 80 each, the standard error decreases as the sample size increases, and the same difference in the percentage of goals (100% for Messi and 50% for the researcher) becomes statistically significant (p < 0.001), with a much more precise 95% CI.
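The shrinking interval can be sketched numerically. This uses the Wald (normal-approximation) formula for the CI of a difference in proportions; note that an observed proportion of 100% contributes zero variance under this formula, so the widths are illustrative only, and the margins and z-value are our assumptions:

```python
# How the standard error, and hence the 95% CI width, shrinks with sample size.
# Wald normal-approximation sketch; an observed 100% contributes zero variance
# under this formula, so the absolute widths are illustrative only.
from math import sqrt
from scipy.stats import fisher_exact

def wald_ci_width(p1, n1, p2, n2, z=1.96):
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # SE of the difference
    return 2 * z * se

width_8 = wald_ci_width(1.0, 8, 0.5, 8)     # 8 penalties each
width_80 = wald_ci_width(1.0, 80, 0.5, 80)  # 80 penalties each
print(round(width_8, 2), round(width_80, 2))  # the interval narrows ~3-fold

_, p80 = fisher_exact([[80, 0], [40, 40]], alternative="two-sided")
print(p80 < 0.001)  # the same 50-point difference is now clearly significant
```

Tenfold more penalties shrinks the standard error, and the CI width, by a factor of the square root of 10, roughly 3.2.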

Finally, the International Conference on Harmonization (ICH) defines an equivalence trial as a clinical trial in which the main objective is to show that the response to the 2 treatments differs by an amount that is not clinically important.12 Thus, in order to truly test a hypothesis of equivalence between Messi and the researcher, you would need to: (a) have set non-inferiority and non-superiority limits in advance (establishing the percentage differences in goals scored that would be considered equivalent); (b) have determined the 95% CI of the percentage difference instead of the p-value of significance; and (c) have verified that the 95% CI lay entirely within these limits.
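The three steps above reduce to a simple check. A hypothetical sketch: the ±15 percentage-point margin and the example intervals are arbitrary illustrations of ours, not values from the ICH guideline.

```python
# Equivalence check sketch: the 95% CI of the difference must lie
# entirely within the pre-specified equivalence margins.
def is_equivalent(ci_low, ci_high, margin_low, margin_high):
    """(a) margins set in advance; (b) CI computed; (c) CI within margins."""
    return margin_low <= ci_low and ci_high <= margin_high

# Hypothetical margins of +/-15 percentage points for the goal-rate difference:
margins = (-0.15, 0.15)
# A wide CI such as (0.15, 0.85) for Messi minus the researcher fails the check:
print(is_equivalent(0.15, 0.85, *margins))   # False: equivalence not shown
# Only a narrow CI inside the margins, e.g. (-0.05, 0.10), would demonstrate it:
print(is_equivalent(-0.05, 0.10, *margins))  # True
```

Note that failing this check does not prove a difference either; it simply means equivalence has not been demonstrated, which is the asymmetry this editorial warns about.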

Please cite this article as: Santibáñez M, García-Rivero JL, Barreiro E. p de significación: ¿mejor no usarla si se interpreta mal? Arch Bronconeumol. 2020;56:613–614.