Brief Clarifications, Open Questions: Commentary on Liu et al. (2018)

Liu et al. (2018) present a study that implements the conceptual modality switch (CMS) paradigm, which has been used to investigate the modality-specific nature of conceptual representations (Pecher et al., 2003). Liu et al.‘s experiment uses event-related potentials (ERPs; similarly, see Bernabeu et al., 2017; Collins et al., 2011; Hald et al., 2011, 2013). In the design of the switch conditions, the experiment implements a corpus analysis to distinguish between purely-embodied modality switches and switches that are more liable to linguistic bootstrapping (also see Bernabeu et al., 2017; Louwerse & Connell, 2011). The procedure for stimulus selection was novel as well as novel; thus, it could prove useful in future studies, too. In addition, the application of bayesian statistics is an interesting and promising novelty in the present research area.

In reviewing the literature, Liu (2018) and Liu et al. (2018) contend that previous studies may be strongly biased due to methodological decisions in the analysis of ERPs. These decisions particularly regard the latency—i.e., time windows—and the topographic regions of interest—i.e., subsets of electrodes. Thus, Liu et al. identify a ‘highly inconsistent’ (p. 6) landscape in the ERP components that have been ascribed to the CMS effect in previous studies. Similarly, Liu (p. 47) writes:

Several studies have looked for the ERP manifestations of modality switching costs (Bernabeu, Willems, & Louwerse, 2017; Collins, Pecher, Zeelenberg, & Coulson, 2011; Hald, Hocking, Vernon, Marshall, & Garnham, 2013; Hald, Marshall, Janssen, & Garnham, 2011). However, what they found was not a clear picture. Not only was a significant effect found in the time window for the N400 component, but also a so-called early N400-like effect around 300ms (Bernabeu et al., 2017; Hald et al., 2011), the N1-P2 complex around 200ms (Bernabeu et al., 2017; Hald et al., 2013, 2011), as well as the late positivity component (LPC) after 600ms (Bernabeu et al., 2017; Collins et al., 2011; Hald et al., 2011).

Drastic conclusions

Liu et al. (2018) conclude that a confirmatory research approach is not warranted, and adopt a semi-exploratory approach, creating time windows of 50 ms each, rather than windows linked to known ERP components (Swaab et al., 2012), and using bayesian statistics. Such a drastic conclusion appears to stem from the assumption that, if the CMS were robust enough, it would present in the same guise across studies. Such an assumption, however, may merit further examination, considering the multiplicity of known and unknown variables that may differ across experiments (Barsalou, 2019). This variability influences the replication praxis, as it were. Indeed, Liu et al. themselves allude to one such variable (p. 7).

These previous studies did not only examine the effect of modality switching but also other linguistic factors such as negated sentences, which could easily distort observed waveforms (Luck, 2005).

Liu et al.‘s (2018) conclusions about the ‘inconsistent’ results do not seem to duly weigh the differences across studies. None of the existing studies is a direct replication of another. The example quoted from Liu et al. above, regarding the presence of negated sentences, is one of the less important differences because one of the corresponding studies implemented the negation as a controlled, experimental condition, contrasting with affirmative sentences (Hald et al., 2013). Yet, other differences exist in virtually every aspect, from the types of modality switch used to the time-locking of ERPs. For instance, the studies vary in the implementation of the modality switches. Whereas Hald et al. (2011) distinguish between a switch and a non-switch condition, Bernabeu et al. (2017) analysed each type of switch separately—i.e., auditory-to-visual, haptic-to-visual, visual-to-visual. Another difference across the studies is the onset point for ERPs. For instance, Bernabeu et al. time-locked ERPs to the point at which the modality switch is actually elicited in the CMS paradigm—namely, the first word in target trials. In contrast, the other studies time-locked ERPs to the second word. In addition, these studies vary in their timelines—i.e., presentation of words and inter-stimulus intervals—, as well as in the words that were used as stimuli, in the preprocessing of ERPs, in the statistical analysis, and even in the language of testing in one case (all studies using English except Bernabeu et al., 2017, which used Dutch). Moreover, the studies differ in the sample size, ranging from ten finally-analysed participants (Hald et al, 2011) to 46 finally-analysed participants (Bernabeu et al., 2017). Last, the studies differ in the number of items per modality switch condition, ranging from 17 (in one of Liu et al.‘s, 2018 conditions) to 40 per condition (Hald et al., 2011, 2013). Undoubtedly, seeing larger sample sizes and more stimulus items used in ERP studies is something to promote and celebrate. Last, it may be noted that Liu et al.‘s stance on the inconsistency of previous results starkly contrasts with their use of a single study—Hald et al. (2011)—, with N = 10, as the motivation for their sample size (Albers & Lakens, 2018).


Liu et al. (2018) apply bayesian statistics reportedly to help reduce the bias that may exist in the present literature. However, the reviews by Liu (2018) and Liu et al. appear to gloss over some important aspects regarding previous studies. For instance, Liu writes (p. 53):

The inflation of the probability of Type I error leads to an over-confidence in the interpretation of the results. For example, in the studies on modality switching costs, different time windows were chosen to test the early effect of modality switching costs. While Bernabeu et al. (2017), Hald et al. (2011) and Hald et al. (2013) examined the segment of ERP waveform between 190ms and 300ms or 160ms and 215ms based on visual inspection and found significant effects, Collins et al. (2011) chose a prescribed time window between 100ms and 200ms before the analysis and did not find the effect.

Liu also refers to the issue of multiple tests related to the multiple time windows and electrodes we find in ERP studies (p. 59):

It is typical for ERP studies to conduct multiple comparisons (e.g., running the same ANOVA repeatedly on different subsets of data like different time windows and different groups of electrodes). This would massively increase Type I error if no post hoc correction is conducted. However, if Bonferroni or other correction is conducted, it will render the study over-conservative, thus increasing the chance of Type II error. In the present thesis, 90 electrodes will be analysed individually, with 20 time slices in each trial. That results in 1800 NHSTs for each critical variable. A correction of multiple comparison will require a critical level of 2.78 x 10ˆ-5 for each test for a family-wise critical level of .05 (and an uncorrected test will almost definitely lead to false positive results). This stringent criterion could conceivably render it meaningless any p-values we can obtain from a statistical package.

Arguably, the scenario presented by Liu (2018), in which a researcher could conduct a purely data-driven analysis of ERP data, is extreme. The field of psycholinguistics, in general, does not have a tradition of purely data-driven analysis. Instead, it blends a humanistic background with a scientific methodology. As a result, the hypotheses and methods tend to be largely driven by the available literature. For instance, Bernabeu et al. (2017, quoted below from p. 1632) based their time windows and regions of interest on the most relevant of the preceding studies (also see Bernabeu, 2017).

Electrodes were divided into an anterior and a posterior area (also done in Hald et al., 2011). Albeit a superficial division, we found it sufficient for the research question. Time windows were selected as in Hald et al., except for the last window, which was extended up to 750 ms post word onset, instead of 700 ms, because the characteristic component of that latency tends to extend until then, as we confirmed by visual inspection of these results.

The literature-based approach follows the advice from Luck and Gaspelin (2017, p. 149), who wrote: ‘a researcher who wants to avoid significant but bogus effects would be advised to focus on testing a priori predictions without using the observed data to guide the selection of time windows or electrode sites’ (also see Armstrong, 2014; for a more conservative stance, see Luck & Gaspelin, 2017). In addition, notice that the extension of the last window by 50 ms was informed by Swaab et al. (2012), who report results by which the P600 component (the main component occurring after the N400 in word reading) extended up to 800 ms.

Next, Liu et al. (2018, pp. 6–7) write:

However, the findings of these components have been highly inconsistent. The N400 effect alone was found in the posterior region in some cases (Bernabeu et al., 2017; Hald et al., 2013), while in anterior region in others (Collins et al., 2011, Hald et al. (2011)).

Yet, Bernabeu et al. (2017, p. 1632) had written:

In certain parts over the time course, the effect appeared in both anterior and posterior areas

Liu et al. (2018) continue (p. 7):

In some cases, it was found in the typical window around 400ms (Collins et al., 2011), while in others an earlier window from 270ms to 370ms (Bernabeu et al., 2017; Hald et al., 2011).

However, Bernabeu et al. (2017, p. 1632) had written:

The ERP results revealed a CMS effect from Time Window 1 on, larger after 350 ms.

Regarding the time-locking of ERPs, Liu (2018, p. 43) writes:

Because the properties were usually salient for the concepts, the switching costs might have already been incurred when participants were processing the concept word. Bernabeu et al. (2017), in their recent replication of previous ERP studies, reversed the order of concept and property and did not find an immediate effect from the property onset. In future studies, it is recommended to adopt the reverse order, control the concept words so that they do not automatically activate the properties before the words are shown, or analyse epochs after both the concept and property words.

The above excerpt seems to reveal a misunderstanding of Bernabeu et al.‘s (2017) method and results (we may assume that, by ‘immediate’, Liu (2018) is referring to the 200 ms point or afterwards, since that is about as immediate as it gets; see Amsel et al., 2014; Swaab et al., 2012; Van Dam et al., 2014). Bernabeu et al.‘s abstract mentioned (p. 1629):

Event-Related Potentials (ERPs) were time-locked to the onset of the first word (property) in the target trials so as to measure the effect online and to avoid a within-trial confound. A switch effect was found, characterized by more negative ERP amplitudes for modality switches than no-switches. It proved significant in four typical time windows from 160 to 750 milliseconds post word onset, with greater strength in posterior brain regions, and after 350 milliseconds.

Indeed, time-locking ERPs to the beginning of the trial was one of the principal features of Bernabeu et al.‘s (2017) experiment. Complementing that, the property words were placed in the first trial because they are more perceptually loaded than concepts (Lynott & Connell, 2013), thus better suiting the main basis of the CMS paradigm. We know that the semantic processing of a word often commences within the first 200 ms (Amsel et al., 2014; Van Dam et al., 2014). Considering the importance of the time course in the grounding of conceptual representations (Hauk, 2016), it seems important to measure the CMS from the moment that it is elicited—namely, in all experiments, from the first word of the target trial (Bernabeu, 2017; Bernabeu et al., 2017), rather than letting several hundreds of milliseconds elapse. Nonetheless, from a methodological perspective, it would be interesting to compare the two approaches in a dedicated study. This would precisely reveal the speed at which modality-specific meaning becomes activated during conceptual processing.

Outstanding issues: Random effects and correction for multiple tests

A methodological issue affecting the statistical analysis of all the studies hereby considered (Bernabeu et al., 2017; Collins et al., 2011; Hald et al., 2011, 2013; Liu et al., 2018) is the absence of some applicable random effects. Some of the studies did not apply any random effects (Collins et al., 2011; Hald et al., 2011, 2013). Even in psycholinguistics, use of linear mixed-effects models is still increasing (Meteyard & Davies, 2020; Yarkoni, 2020). Yet, in those studies that did apply random effects, the corresponding structure was not as exhaustive as it should have been, as they lacked random slopes. In Bernabeu et al. (2017), a model selection approach was applied (Matuschek et al., 2017), whereby each random effect was tested and only kept in the model if it significantly improved the fit. In Liu et al. (2018), random slopes were deemed unfeasible due to computational constraints (for background, see Brauer & Curtin, 2017). Applying a complete random effects structure is important for a robust statistical analysis (Barr et al., 2013; Yarkoni, 2020).

Another issue is that of multiple tests. Where a small number of levels is used (e.g., time windows, topographic regions of interest) and these are informed by the literature, the advice has often been ambiguous as to whether a correction should be applied (e.g., Armstrong, 2014; for a more conservative stance, see Luck & Gaspelin, 2017). Indeed, no correction was applied in any of the four studies that used frequentist statistics (Bernabeu et al., 2017; Collins et al., 2011; Hald et al., 2011, 2013).

Reanalysis of Bernabeu et al. (2017)

The results of Bernabeu et al. (2017) were reanalysed after publication using a more complete random effects structure that incorporated by-participant random slopes for the modality-switch factor—i.e., (condition | participant). The results were also corrected for multiple tests using the Holm-Bonferroni correction (Holm, 1979). For this purpose, the lowest p-value in the four time windows was multiplied by 4, the next p-value by 3, the next by 2, and the highest p-value was left as it unmodified. In this stepwise correction, if a nonsignificant p-value was reached, all the subsequent p-values became nonsignificant (see analysis script). The results differed from the original, slopes-free models (available script and results) in that the main effect of the modality switch factor became nonsignificant in the second time window (160–216 ms) and in the fourth one (500–750 ms), while remaining significant in the third time window (350–550 ms). Incidentally, note that main effects may not be directly interpretable as the same same variables are present interactions (Kam & Franzese, 2007). Yet, the interactions of modality switch with participant group (quick/slow) and scalp location (anterior/posterior) retained the same significance.

Open questions

Liu (2018) and Liu et al. (2018) raise interesting and important questions. Firstly, future research may be conducted to investigate what determines the variability of ERP results—in terms of ERP components, time windows and topographic regions of interest. This research could include a comparison with other measurements, such as response times, to test whether ERPs are less reliable—i.e., more variable across studies—than response times. Similarly, future research may investigate whether the ERP literature is more biased than literature employing other measures, such as response times. In addition, future research could investigate whether moving to exploratory, bayesian research designs is a necessary or sufficient condition to reduce bias in research and improve the precision of experimental measurements. Current alternatives to such an approach include direct (or conceptual) replications designed to achieve a higher power than previous studies (e.g., Chen et al., 2019). Arguably, policies determining funding decisions would need to change if we are to fully acknowledge the importance of direct replication (Howe & Perfors, 2018; Kunert, 2016; Simons, 2014; Zwaan et al., 2018). Last, future research may investigate whether ‘clear picture’ results are realistic, desirable or necessary, and whether unclear-picture results should be eschewed; or whether, on the contrary, clear-picture results may largely be the product of publication bias—that is, the pressure to hide or misreport those aspects of a study that could challenge its acceptance by peer-reviewers or any other academics.


Albers, C., & Lakens, D. (2018). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of Experimental Social Psychology, 74, 187–195.

Amsel, B. D., Urbach, T. P., & Kutas, M. (2014). Empirically grounding grounded cognition: the case of color. Neuroimage, 99, 149-157.

Armstrong, R. A. (2014). When to use the Bonferroni correction. Ophthalmic and Physiological Optics, 34(5), 502-508.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.

Barsalou, L. W. (2019). Establishing generalizable mechanisms. Psychological Inquiry, 30(4), 220-230.

Bernabeu, P. (2017). Modality switches occur early and extend late in conceptual processing: evidence from ERPs [Master's thesis]. School of Humanities, Tilburg University.

Bernabeu, P., Willems, R. M., & Louwerse, M. M. (2017). Modality switch effects emerge early and increase throughout conceptual processing: evidence from ERPs. In G. Gunzelmann, A. Howes, T. Tenbrink, & E. J. Davelaar (Eds.), Proceedings of the 39th Annual Conference of the Cognitive Science Society (pp. 1629-1634). Austin, TX: Cognitive Science Society.

Brauer, M., & Curtin, J. J. (2018). Linear mixed-effects models and the analysis of nonindependent data: A unified framework to analyze categorical and continuous independent variables that vary within-subjects and/or within-items. Psychological Methods, 23(3), 389–411.

Chen, S., Szabelska, A., Chartier, C. R., Kekecs, Z., Lynott, D., Bernabeu, P., … Schmidt, K. (2018). Investigating object orientation effects across 14 languages. PsyArXiv.

Collins, J., Pecher, D., Zeelenberg, R., & Coulson, S. (2011). Modality switching in a property verification task: an ERP study of what happens when candles flicker after high heels click. Frontiers in Psychology, 2.

Hald, L. A., Hocking, I., Vernon, D., Marshall, J.-A., & Garnham, A. (2013). Exploring modality switching effects in negated sentences: further evidence for grounded representations. Frontiers in Psychology, 4, 93.

Hald, L. A., Marshall, J.-A., Janssen, D. P., & Garnham, A. (2011). Switching modalities in a sentence verification task: ERP evidence for embodied language processing. Frontiers in Psychology, 2.

Hauk, O. (2016). Only time will tell–why temporal information is essential for our neuroscientific understanding of semantics. Psychonomic Bulletin & Review, 23(4), 1072-1079.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65-70.

Howe, P. D., & Perfors, A. (2018). An argument for how (and why) to incentivise replication. Behavioral and Brain Sciences, 41, e135-e135.

Kam, C. D., & Franzese, R. J. (2007). Modeling and interpreting interactive hypotheses in regression analysis. Ann Arbor, MI: University of Michigan Press.

Kunert, R. (2016). Internal conceptual replications do not increase independent replication success. Psychonomic Bulletin & Review, 23(5), 1631-1638.

Liu, P. (2018). Embodied-linguistic conceptual representations during metaphor processing. Doctoral thesis, Lancaster University, UK.

Liu, P., Lynott, D., & Connell, L. (2018). Continuous neural activations of simulation-linguistic representations in modality switching costs. In P. Liu, Embodied-linguistic conceptual representations during metaphor processing. Doctoral thesis, Lancaster University, UK.

Luck, S. J. (2005). Ten simple rules for designing ERP experiments. In T. C. Handy (Ed.), Event-related potentials: A methods handbook. MIT Press.

Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn't). Psychophysiology, 54(1), 146-157.

Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing type 1 error and power in linear mixed models. Journal of Memory and Language, 94, 305–315.

Meteyard, L., & Davies, R. A. (2020). Best practice guidance for linear mixed-effects models in psychological science. Journal of Memory and Language, 112, 104092.

Pecher, D., Zeelenberg, R., & Barsalou, L. W. (2003). Verifying different-modality properties for concepts produces switching costs. Psychological Science, 14, 2, 119-24.

Simons, D. J. (2014). The value of direct replication. Perspectives on Psychological Science, 9(1), 76–80.

Swaab, T. Y., Ledoux, K., Camblin, C. C., & Boudewyn, M. A. (2012). Language-related ERP components. In S. J. Luck & E. S. Kappenman (Eds.), Oxford handbook of event-related potential components (pp. 397–440). Oxford University Press.

Van Dam, W. O., Brazil, I. A., Bekkering, H., & Rueschemeyer, S.-A. (2014). Flexibility in embodied language processing: context effects in lexical access. Topics in Cognitive Science, 6(3), 407–424.

Yarkoni, T. (2020). The generalizability crisis. Behavioral and Brain Sciences, 1-37.

Zwaan, R., Etz, A., Lucas, R., & Donnellan, M. (2018). Making replication mainstream. Behavioral and Brain Sciences, 41, E120.

comments powered by Disqus