For many years, randomized controlled trials failed to demonstrate the efficacy of prone ventilation in treating ARDS. These failed attempts informed the design of the successful PROSEVA trial, published in 2013. However, the evidence provided by meta-analyses in support of prone ventilation for ARDS was too weak to be conclusive. The present study shows that meta-analysis is not the best approach for assessing the evidence on the efficacy of prone ventilation.

Methods

We performed a cumulative meta-analysis to show that only the PROSEVA trial, owing to its strong protective effect, substantially influenced the pooled result.

We also replicated nine published meta-analyses including the PROSEVA trial. We performed leave-one-out analyses, removing one trial at a time from each meta-analysis, measuring p values for effect size, and also the Cochran's Q test for heterogeneity assessment. We represented these analyses in a scatter plot to identify outlier studies influencing heterogeneity or overall effect size. We used interaction tests to formally identify and evaluate differences with the PROSEVA trial.

Results

The positive effect of the PROSEVA trial accounted for most of the heterogeneity and for the reduction of the overall effect size in the meta-analyses. The interaction tests we conducted on the nine meta-analyses formally confirmed the difference in the effectiveness of prone ventilation between the PROSEVA trial and the other studies.

Conclusions

The lack of clinical homogeneity between the PROSEVA trial design and the other studies should have discouraged the use of meta-analysis. Statistical considerations support this hypothesis, suggesting that the PROSEVA trial is an independent source of evidence.

The first report on ventilation in the prone position in ARDS was published in 1976.1 The approach was appreciated by intensivists because improved oxygenation was observed in most cases. However, no evidence was given to show that the method resulted in a higher rate of survival.2 Moreover, pronation of critically ill patients exposed them to several risks and required skilled specialists to be performed safely. For these reasons, risk and benefit assessments were called for.

The efficacy of prone ventilation in ARDS was investigated in several trials published between 2001 and 2009.3–6 Although all were unsuccessful, they helped researchers to generate hypotheses about which patients would benefit the most from the treatment. They also helped to greatly improve the treatment itself.7 Guerin and his collaborators designed the PROSEVA trial taking case-mix selection as well as pronation and ventilatory protocols into account, capitalizing on the experience gained through these previous trials.8 These specific features were not sufficiently brought out by the standard meta-analyses used to investigate prone ventilation in ARDS after 2013. Nine such meta-analyses were published between 2014 and 2021,9–17 each of which used between six and eleven randomized controlled trials (RCTs) selected from a pool of thirteen.3–6,8,18–25 The efficacy of the treatment, however, was only weakly supported by the meta-analyses carried out after the PROSEVA study was published. The findings of eight of them were non-significant, although a significant reduction in mortality was observed in various subgroups (e.g., patients with more severe hypoxemia, patients subjected to prone ventilation for a longer time, or patients treated with protective ventilation). In particular, the Cochrane meta-analysis provided only a rather feeble indication for prone ventilation in the subset of patients with severe hypoxemia. This is no surprise, given the exploratory nature of subgroup analyses in meta-analysis, where studies are assigned to a subgroup according to a specific feature, without accounting for the distribution of other confounders.

Using cumulative meta-analysis and an analysis of heterogeneity, we tested the hypothesis that the PROSEVA trial is an independent source of evidence for the use of prone positioning in ARDS.

Methods

Standard and cumulative meta-analyses in ATS meta-analysis trials

We began by using cumulative meta-analysis to investigate how the findings of the trials evolved over time.26 Cumulative meta-analysis generates chronological meta-analytical summary effects by adding one study at a time to the analysis. This approach allows us to evaluate the influence of treatment on the outcome as studies accumulate, with a stabilizing effect due to the increase in sample size.
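The chronological pooling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event counts in `TRIALS` are hypothetical (they are not the data of the trials analysed here), the function names are ours, and the DerSimonian-Laird estimator is used for simplicity, whereas the between-study variance estimator used in the original analysis is not specified in the text.

```python
import math

def log_rr(e1, n1, e0, n0):
    """Log relative risk and its variance from events/total (treated, control)."""
    y = math.log((e1 / n1) / (e0 / n0))
    v = 1 / e1 - 1 / n1 + 1 / e0 - 1 / n0
    return y, v

def dl_pool(y, v):
    """Random-effects pooled estimate and standard error (DerSimonian-Laird)."""
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return mu, math.sqrt(1.0 / sum(w_re))

def cumulative(trials):
    """Pool the first k trials (by year) for k = 1..K; return RR and 95% CI."""
    ordered = sorted(trials, key=lambda t: t[1])
    out = []
    for k in range(1, len(ordered) + 1):
        effects = [log_rr(*t[2:]) for t in ordered[:k]]
        mu, se = dl_pool([e[0] for e in effects], [e[1] for e in effects])
        out.append((ordered[k - 1][0], math.exp(mu),
                    math.exp(mu - 1.96 * se), math.exp(mu + 1.96 * se)))
    return out

# Hypothetical data: (label, year, deaths prone, n prone, deaths supine, n supine)
TRIALS = [
    ("Trial A", 2001, 30, 100, 32, 100),
    ("Trial B", 2004, 45, 150, 44, 150),
    ("Trial C", 2006, 20, 60, 24, 60),
    ("Trial D", 2013, 25, 120, 50, 120),
]

for label, rr, lo, hi in cumulative(TRIALS):
    print(f"after adding {label}: RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Each printed row is the summary effect obtained once the named trial has been added, so a late trial with a markedly different effect shows up as a shift in the last row.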

We based this analysis on the eight trials included in the meta-analysis published in the Annals of the American Thoracic Society in 2017.9 We performed two different cumulative meta-analyses, the first using the studies that reported short-term outcomes (28-day or ICU mortality), the second including those that reported medium- to long-term outcomes (from 2 to 6 months).

We also performed standard meta-analyses of the eight trials, and sensitivity meta-analyses, both standard and cumulative. The sensitivity analyses excluded trials that randomized fewer than one hundred patients, to avoid the small-study bias also known as the “winner's curse” (i.e., an exaggerated effect in favour of the intervention studied in the trial),27,28 which is difficult to detect with Egger's regression, as this test is underpowered when performed on fewer than ten studies.29

Standard and leave-one-out meta-analyses

The second step was to spot outliers among the nine meta-analyses that investigated prone ventilation for ARDS after the PROSEVA trial was published.9–17 We first replicated each meta-analysis using the same original datasets. To improve the homogeneity and comparability of the nine meta-analyses (in terms of effect size, effect precision, and statistical heterogeneity), we used relative risks as the outcome measure and random-effects models regardless of the choices made in the original papers. We then performed a leave-one-out analysis by fitting the model repeatedly while excluding one RCT at a time, to evaluate the influence of each RCT on the summary effect and on the statistical heterogeneity of each meta-analysis.30 Thus, for example, for a meta-analysis including eight trials, eight different meta-analyses were performed, each including seven trials.
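As a rough sketch of the leave-one-out procedure, the Python snippet below refits a random-effects model once per omitted trial. The log relative risks and variances are hypothetical, the function names are ours, and the DerSimonian-Laird estimator is an assumption made for brevity; it is not necessarily the estimator used in the original analysis.

```python
import math

def dl_pool(y, v):
    """Random-effects pooled log effect (DerSimonian-Laird tau^2)."""
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]
    return sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)

def leave_one_out(labels, y, v):
    """Pooled RR of the remaining trials, keyed by the label of the omitted one."""
    pooled = {}
    for i, label in enumerate(labels):
        mu = dl_pool(y[:i] + y[i + 1:], v[:i] + v[i + 1:])
        pooled[label] = math.exp(mu)
    return pooled

# Hypothetical inputs: three near-null trials and one strongly protective trial.
LABELS = ["A", "B", "C", "D"]
LOG_RR = [math.log(0.95), math.log(1.02), math.log(0.90), math.log(0.50)]
VAR = [0.04, 0.03, 0.05, 0.04]

for omitted, rr in leave_one_out(LABELS, LOG_RR, VAR).items():
    print(f"without {omitted}: pooled RR {rr:.2f}")
```

With four trials, four models are fitted, each on three trials; the pooled estimate moves closest to 1 when the strongly protective trial is the one omitted.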

We then plotted the paired p values in a scatter plot for effect size and heterogeneity provided by each leave-one-out analysis. This plot is a modification of the Baujat graphical method,31 which we call the “double-p plot”. We chose to use it because the axis scales are probabilities that are more familiar to the average reader than the standardized square differences used in the Baujat plot. Although the visual information they provide is similar, the two plots are different. In the Baujat plot, each point indicates the contribution of each study to heterogeneity and overall effect, whereas in the double-p plot each point indicates what the p values for heterogeneity and overall effect would be if the meta-analysis was performed on all the trials but one. It is thus a means to assess the impact of the excluded trial on heterogeneity and overall effect. The double-p plot visually represents the correlation between p values for heterogeneity and p values for the overall effect, bringing out particular combination patterns that may occur when one or the other trial is left out of the meta-analysis.
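The coordinates of a double-p plot can be computed as sketched below: for each omitted trial, one p value for the pooled effect and one for Cochran's Q. The inputs are hypothetical, the names `chi2_sf`, `dl_fit` and `double_p_points` are ours, and the DerSimonian-Laird estimator is an assumption; plotting the resulting pairs is left to any charting tool.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the series expansion of the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def dl_fit(y, v):
    """Return pooled log effect, its SE, Cochran's Q and its degrees of freedom."""
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return mu, math.sqrt(1.0 / sum(w_re)), q, df

def double_p_points(labels, y, v):
    """(omitted trial, p for overall effect, p for heterogeneity) per refit."""
    points = []
    for i, label in enumerate(labels):
        mu, se, q, df = dl_fit(y[:i] + y[i + 1:], v[:i] + v[i + 1:])
        p_effect = math.erfc(abs(mu / se) / math.sqrt(2))  # two-sided z test
        points.append((label, p_effect, chi2_sf(q, df)))
    return points

# Hypothetical inputs: three near-null trials and one strongly protective trial.
LABELS = ["A", "B", "C", "D"]
LOG_RR = [math.log(0.95), math.log(1.02), math.log(0.90), math.log(0.50)]
VAR = [0.04, 0.03, 0.05, 0.04]

for label, p_eff, p_het in double_p_points(LABELS, LOG_RR, VAR):
    print(f"omit {label}: p effect = {p_eff:.3f}, p heterogeneity = {p_het:.3f}")
```

In this toy example the point obtained by omitting the protective trial sits apart from the others, with both p values much higher, which is the kind of pattern the plot is meant to expose.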

Finally, we performed a subgroup analysis for each of the nine meta-analyses, comparing the PROSEVA trial with the other trials and formally assessing the differences in the effect estimates by means of an interaction test.32
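One common form of interaction test compares two independent estimates on the log scale with a z statistic. The sketch below applies it to the relative risks and confidence intervals reported for the Annals of the ATS row of Table 2; recovering the standard errors from the published confidence limits is an approximation we introduce here, and the function names are ours. The original analysis may have used a different but related procedure.

```python
import math

def se_from_ci(lo, hi, z_crit=1.96):
    """Approximate SE of a log relative risk from its 95% confidence limits."""
    return (math.log(hi) - math.log(lo)) / (2 * z_crit)

def interaction_test(rr_a, ci_a, rr_b, ci_b):
    """Ratio of relative risks, z statistic and two-sided p value."""
    diff = math.log(rr_a) - math.log(rr_b)
    se = math.sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return math.exp(diff), z, p

# PROSEVA trial vs pooled remaining trials (Annals of the ATS row of Table 2).
ratio, z, p = interaction_test(0.49, (0.35, 0.69), 0.96, (0.85, 1.09))
print(f"ratio of RRs = {ratio:.2f}, z = {z:.2f}, p = {p:.4f}")
```

The resulting p value is below 0.01, in line with the interaction p value reported for that row.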

We assessed heterogeneity in all the meta-analyses by performing Cochran's Q test under a null hypothesis of homogeneity,33 and by measuring the fraction of total variability attributable to between-study heterogeneity with I².34,35 We calculated 95% confidence intervals for I² to assess the degree of imprecision of this measurement.
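For reference, the two heterogeneity measures can be written in their usual textbook form. The notation is ours and is not taken from the original text: $y_i$ is the log relative risk of trial $i$, $v_i$ its within-trial variance, and $k$ the number of trials.

```latex
\[
w_i = \frac{1}{v_i}, \qquad
\hat{\mu} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}, \qquad
Q = \sum_{i=1}^{k} w_i \left( y_i - \hat{\mu} \right)^2
\]
\[
I^2 = \max\!\left\{ 0,\; \frac{Q - (k - 1)}{Q} \right\} \times 100\%
\]
```

Under the null hypothesis of homogeneity, $Q$ approximately follows a chi-square distribution with $k - 1$ degrees of freedom, which is where the heterogeneity p value comes from.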

All statistical analyses were performed with the metafor package for R (version 4.0.5).36,30

Results

Cumulative meta-analysis

Seven of the eight trials included in the ATS meta-analysis reported short-term mortality and seven the medium/long-term outcome. Thus, we performed two different cumulative meta-analyses.

Both cumulative meta-analyses showed similar trends over time (Figs. 1 and 2). As the sample size increased, there was a substantial stabilization of the results, which indicated a lack of protective effect for prone ventilation. Before the PROSEVA trial was added, the relative risk was 0.96 (95% confidence interval 0.85-1.08) and 0.98 (95% confidence interval 0.89-1.08) for early and medium/long-term mortality, respectively. In contrast, the pooled relative risk dropped markedly when the PROSEVA trial was included in the analysis, although the confidence intervals still included 1: the relative risks for the two outcomes were 0.83 (95% confidence interval 0.66-1.05) and 0.85 (95% confidence interval 0.70-1.04), respectively.

No small-study effect was observed after excluding trials with fewer than one hundred patients from the cumulative meta-analyses (ESM Figs. 1 and 2). However, the standard meta-analyses showed an increase in I², with increased precision, when small studies with strong protective effects (thus closer to the PROSEVA trial) were removed (ESM Figs. 3 to 6).

The plot depicts the degree of correlation between p values for effect size and p values for heterogeneity. Points within the dashed square are the meta-analyses performed leaving out the PROSEVA trial by Guerin et al. In all the other cases (points within the dashed triangle), the PROSEVA trial is included and the other RCTs are left out one at a time. When the PROSEVA trial is left out of the meta-analysis, both p values increase markedly compared to when the other trials are left out. This means that the p values for both the overall effect and heterogeneity are strongly influenced by the presence of the PROSEVA trial, which should be regarded as the only outlier. CMAJ = Canadian Medical Association Journal, ATS = American Thoracic Society, Crit Care = Critical Care, Crit Care Med = Critical Care Medicine, J Thorac Dis = Journal of Thoracic Diseases, Cochrane = Cochrane Database of Systematic Reviews, Med Int = Medicina Intensiva, Int Care Med = Intensive Care Medicine.

Each of the nine meta-analyses included between six and eleven RCTs, selected from a pool of thirteen RCTs, and yielded consistent findings (Table 1). Relative risks ranged between 0.84 and 0.86, with no significant drop in mortality. In seven of the nine meta-analyses, the Cochran's Q test for homogeneity was statistically significant (p < 0.05), indicating that between-study variability could not be ascribed to chance. Although I² values were never lower than 50% (high heterogeneity according to conventional thresholds34), the results were inconclusive because the confidence intervals ranged between low and high heterogeneity.

Results of the nine meta-analyses performed using relative risks to measure outcome and random-effects models for variance assessment. RR = Relative risk, Heterogeneity p value = p value for the Cochran's Q statistics for assessment of heterogeneity, 95%-CI = 95% confidence interval, CMAJ = Canadian Medical Association Journal, ATS = American Thoracic Society, Crit Care = Critical Care, Crit Care Med = Critical Care Medicine, J Thorac Dis = Journal of Thoracic Diseases, Cochrane = Cochrane Database of Systematic Reviews, Med Int = Medicina Intensiva, Int Care Med = Intensive Care Medicine.

Journal-Year | N of RCTs | Outcome - Mortality | RR (95%-CI) | Estimate p value | I², % (95%-CI) | Heterogeneity p value |
---|---|---|---|---|---|---|
Annals of the ATS-2017 | 8 | at 28 days, if not available, at 30 days, hospital or ICU discharge | 0.84 (0.67-1.05) | 0.123 | 63 (4-92) | 0.018 |
Crit Care-2014 | 7 | at 28-30 days | 0.86 (0.68-1.09) | 0.203 | 68 (12-92) | 0.010 |
Crit Care Med-2014 | 11 | at the longest available follow-up | 0.86 (0.73-1.01) | 0.063 | 50 (0-80) | 0.087 |
CMAJ-2014 | 10 | at hospital discharge, if not available, at the longest duration of follow-up | 0.85 (0.72-1.01) | 0.072 | 50 (0-83) | 0.085 |
Cochrane | 8 | at 10 to 30 days, or ICU discharge | 0.84 (0.67-1.06) | 0.137 | 65 (7-91) | 0.014 |
Int Care Med-2014 | 8 | at 60 days | 0.84 (0.68-1.02) | 0.084 | 65 (14-92) | 0.006 |
J Int Care Med-2021 | 6 | at 28-30 days, or ICU discharge | 0.84 (0.65-1.09) | 0.193 | 73 (20-95) | 0.006 |
J Thorac Dis-2015 | 8 | at the longest available follow-up | 0.86 (0.71-1.04) | 0.124 | 66 (8-93) | 0.013 |
Med Int-2015 | 7 | at the longest available follow-up | 0.86 (0.70-1.04) | 0.123 | 70 (17-96) | 0.007 |

In the double-p plot, each colour designates one of the nine meta-analyses. The number of points per colour is equal to the number of leave-one-out procedures (equal, in its turn, to the number of RCTs included in each meta-analysis).

The plot (Fig. 3) shows that every time the PROSEVA study is excluded from a meta-analysis the p values for both effect and heterogeneity increase strikingly, whereas when other trials are omitted, they remain low with small variations.

As shown in Table 2, the interaction test, which formally investigates the heterogeneity of effects across subgroups, was always significant when the PROSEVA trial was compared with the rest of the studies.

Subgroup analysis comparing the findings of the PROSEVA trial with those of the remaining trials for each meta-analysis. RR = Relative risk, 95%-CI = 95% confidence interval, CMAJ = Canadian Medical Association Journal, ATS = American Thoracic Society, Crit Care = Critical Care, Crit Care Med = Critical Care Medicine, J Thorac Dis = Journal of Thoracic Diseases, Cochrane = Cochrane Database of Systematic Reviews, Med Int = Medicina Intensiva, Int Care Med = Intensive Care Medicine.

Journal-year | PROSEVA trial: RR (95%-CI) | PROSEVA trial: p value | Meta-analysis of the other trials: RR (95%-CI) | Meta-analysis of the other trials: p value | p value for interaction |
---|---|---|---|---|---|
Annals of the ATS-2017 | 0.49 (0.35-0.69) | < 0.01 | 0.96 (0.85-1.09) | 0.559 | < 0.01 |
Crit Care-2014 | 0.49 (0.35-0.69) | < 0.01 | 0.98 (0.87-1.11) | 0.778 | < 0.01 |
Crit Care Med-2014 | 0.62 (0.47-0.81) | < 0.01 | 0.96 (0.86-1.06) | 0.408 | < 0.01 |
CMAJ-2014 | 0.58 (0.44-0.77) | < 0.01 | 0.95 (0.85-1.05) | 0.326 | < 0.01 |
Cochrane | 0.49 (0.35-0.69) | < 0.01 | 0.96 (0.85-1.08) | 0.512 | < 0.01 |
Int Care Med-2014 | 0.54 (0.40-0.73) | < 0.01 | 0.95 (0.83-1.09) | 0.452 | < 0.01 |
J Int Care Med-2021 | 0.49 (0.35-0.69) | < 0.01 | 0.98 (0.86-1.11) | 0.720 | < 0.01 |
J Thorac Dis-2015 | 0.58 (0.44-0.76) | < 0.01 | 0.98 (0.89-1.08) | 0.639 | < 0.01 |
Med Int-2015 | 0.58 (0.44-0.76) | < 0.01 | 0.98 (0.89-1.08) | 0.638 | < 0.01 |

Prone ventilation is considered to be standard care for severe ARDS, and has been strongly recommended by recent guidelines based on subgroup analyses of negative meta-analyses and negative trials.37,38 However, the Cochrane handbook regards subgroup analyses as “observational by nature” (9.6.2 What are subgroup analyses?). Indeed, subgroup analyses in meta-analysis are exploratory in nature, since studies are assigned to a subgroup according to a specific feature, without accounting for the distribution of other confounders. Thus, when more than one confounder is present in the subgroup, we may be unable to recognize which ones affect the outcome.

Meta-regression was also used in three studies to try to identify the subset of patients who responded to prone ventilation.13,14,17 However, it has the same limitations as subgroup analysis (Cochrane handbook 10.11.6 Interpretation of subgroup analyses and meta-regressions). Only one variable at a time can be tested as a moderator when the number of studies is too small to adopt a multivariable approach (as a rule of thumb, ten studies are required for each moderator). Therefore, the information provided by these studies should not be regarded as evidence but rather as a possible basis for a hypothesis.

Thus, evidence in support of prone ventilation in ARDS is inherently weak if we rely on meta-analytical approaches, implicitly prioritizing them with respect to the PROSEVA trial.

On the other hand, the standard assessment of heterogeneity by means of I² is of limited usefulness when dealing with only a few studies, as in our case. If we only looked at the I² estimates, which were high in the nine meta-analyses (Table 1), we could wrongly conclude that they showed high heterogeneity. Nevertheless, we considered heterogeneity assessment using I² to be inconclusive because the confidence intervals ranged between low and high heterogeneity. In our case, lack of power diminished the usefulness of I². However, when numerous large studies are meta-analysed, the opposite problem may come up, with clinically homogeneous studies (i.e., with similar case-mixes, treatments and controls) turning out to be statistically heterogeneous. Hence, clinical reasoning should always be used to support and integrate statistical analyses.

Given these premises, the use of a leave-one-out approach to investigate outliers turned out to be a winning strategy for testing our hypothesis that the PROSEVA trial was clinically too different from the previous trials to be meta-analytically combined with them. However, none of the nine meta-analyses adopted this strategy, although Cochrane's handbook does recommend meta-analyses both with and without clearly outlying studies (9.5.3 Strategies for addressing heterogeneity).

Only one study investigated outliers,17 using studentized residuals.39 However, the finding that the PROSEVA trial was an outlier did not substantially affect its conclusions.

The use of a leave-one-out strategy allowed us to investigate heterogeneity in greater detail, providing evidence in support of our hypothesis by showing that the higher heterogeneity was mainly attributable to the PROSEVA trial.

There were good clinical reasons for not combining the PROSEVA trial with the other studies. The four main RCTs that were published earlier had been unsuccessful in demonstrating improved survival (ESM Figs. 1 and 2).3–6 However, progress in physiological knowledge regarding ventilation in ARDS led to improvements in how the trials were designed.40 Thus, it was possible to hypothesize that patients with the most severe hypoxia would benefit the most from pronation,41 that the duration of pronation needed to be prolonged, and that protective ventilation needed to be associated with prone ventilation. The PROSEVA trial was designed on the basis of these hypotheses, which make it inherently different from previous RCTs, as Gattinoni et al. have shown.7

However, when trials bearing differences in case-mix, study treatment and other associated treatments that may modify the outcome are combined in a meta-analysis, the risk of generating spurious findings is high.42

Our study supports the hypothesis that the PROSEVA trial is an outlier compared to the RCTs that preceded it. Our cumulative meta-analyses show that the PROSEVA trial changes the estimate of effect strikingly. Moreover, given the strong protective effect of the PROSEVA trial compared to the other trials, the exclusion of the PROSEVA trial from any meta-analysis markedly increases the Cochran's Q p value and the p value for effectiveness. Finally, the protective effect of the PROSEVA trial was formally confirmed by interaction tests (Table 2).

Several topics in medicine have been investigated over long periods of time as knowledge of these topics evolved. Thus, the case of the PROSEVA trial should not be considered an exception. When trial designs evolve in this way, attention should be paid to clinical heterogeneity before studies are selected for meta-analysis. The problem of heterogeneity across studies, typical of aggregated data, could be offset by sharing trial datasets, making granular individual-patient information public. Individual-patient data could be analysed with powerful multivariable and machine-learning tools, as has recently been done for ARDS phenotypes.43

Limitations

The protocol for our meta-analysis was neither registered nor published.

Cochran's Q test has been criticized for its sensitivity to the number of studies included in meta-analysis and for its lack of power when performed on a low number of studies.35 We argue, however, that neither the number of studies nor power issues could have affected the results of our study. At each round in the leave-one-out process, the number of studies remains the same (the total number of RCTs minus one) for each meta-analysis. As to power, it can be an issue if a threshold has to be reached; in our study, however, we only focus on Cochran's Q p-value variations on a continuous scale, regardless of any threshold.

The double-p plot is not a formal approach; it only provides visual information. Combining it with an interaction test gives a more precise measurement of the difference in statistical terms, but statistically significant differences do not necessarily have clinical relevance. Moreover, the double-p plot may yield very different patterns, which may not be readily interpretable. However, in the specific case of prone ventilation for ARDS, the clinical premises of heterogeneity and the very clear patterns leave little uncertainty about the advisability of considering evidence from the PROSEVA trial independently of other trials.

Some authors have questioned the quality of evidence provided by the PROSEVA trial, arguing that a higher prevalence of important prognostic factors may have favoured survival in the prone ventilation group.44 However, assessment of quality lies outside the purview of our study, which is to evaluate the appropriateness of the use of meta-analysis in a particular case.

Conclusions

Weak evidence provided by meta-analyses may have fostered the low rate of prone ventilation prescriptions in severe ARDS reported in the literature.45 Our study brings out the limitations of the meta-analytical approach in assessing prone ventilation for ARDS and strongly reinforces Cochrane's recommendation that a one-by-one study exclusion analysis be performed when potential outlier studies are identified.

For clinical recommendations, the findings of the PROSEVA trial should be regarded as an independent source of evidence.

Funding statement

This research has not received any grant from funding agencies in the public, commercial or not-for-profit sectors.

Authorship

Daniele Poole had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Daniele Poole, Roberto Fumagalli and Andrea Pisa contributed to the interpretation of studies and the writing of the manuscript.

CRediT authorship contribution statement

D. Poole: Conceptualization, Writing – review & editing, Data curation, Formal analysis, Methodology, Writing – original draft. A. Pisa: Conceptualization, Writing – review & editing. R. Fumagalli: Conceptualization, Writing – review & editing.
